2021.02.21(pm): Kaggle Loan data binary classification

I’m currently learning R, and as practice I’m trying to classify the customers in the Kaggle loan data.

The first analysis that can be done on the raw data as it stands is to predict (classify) one dependent variable with three categories from 10 input variables (independent variables, X).

Here, the dependent variable (target variable, Y) is loan_status, and the three categories are as follows.

PAIDOFF: The loan was repaid in full within the deadline

COLLECTION: The loan was still unpaid at the time the data was collected

COLLECTION_PAIDOFF: The loan was repaid in full, but only after the deadline had passed

Strictly speaking, this is a multiclass classification problem over the three categories above. With some modifications, though, I will treat it as a binary classification: was the loan repaid within the deadline (success) or not (failure)?
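One way the three categories could be collapsed into two is sketched below. This is only a minimal illustration under my own assumptions: the toy data frame `loan_toy` and the new column name `on_time` are hypothetical, and I am assuming that only PAIDOFF counts as success, since it is the only category repaid within the deadline.

```r
library(dplyr)

# Made-up toy rows, NOT the Kaggle data; only loan_status matters here.
loan_toy <- data.frame(
  loan_status = c("PAIDOFF", "COLLECTION", "COLLECTION_PAIDOFF", "PAIDOFF"),
  stringsAsFactors = FALSE
)

# Hypothetical binary target `on_time`:
# "success" for PAIDOFF, "failure" for the two late/unpaid categories.
loan_toy <- loan_toy %>%
  mutate(on_time = factor(if_else(loan_status == "PAIDOFF", "success", "failure"),
                          levels = c("failure", "success")))

# Cross-tabulate the old and new labels to check the mapping.
table(loan_toy$loan_status, loan_toy$on_time)
```

Keeping the original loan_status column next to the new one makes it easy to verify the mapping before any model is fitted.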

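library(dplyr)  # provides %>% and mutate()

# Assumes `loan` already holds the Kaggle loan data as a data frame.
# Convert the ID, status, date and categorical columns to factors
# so that summary() reports a count per level for each of them.
loan <- loan %>% 
  mutate(Loan_ID        = factor(Loan_ID),
         loan_status    = factor(loan_status),
         effective_date = factor(effective_date),
         due_date       = factor(due_date),
         paid_off_time  = factor(paid_off_time),
         education      = factor(education),
         Gender         = factor(Gender))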

summary(loan)

Visualization

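library(ggplot2)  # provides ggplot(), geom_bar() and labs()

# Bar plot of how many loans fall into each loan_status category
loan %>% 
  ggplot(aes(loan_status)) +
  geom_bar() + 
  labs(title = "Bar plot",
       subtitle = "Success People",
       caption = "Source: Kaggle Loan data")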
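The class balance that the bar plot shows can also be read off as numbers. The sketch below is just an illustration on made-up rows: `loan_toy`, `status_counts` and the `share` column are hypothetical names of mine, and the values are not taken from the Kaggle data.

```r
library(dplyr)

# Made-up toy rows, NOT the Kaggle data; they only illustrate the calls.
loan_toy <- data.frame(
  loan_status = c("PAIDOFF", "PAIDOFF", "COLLECTION", "COLLECTION_PAIDOFF"),
  stringsAsFactors = FALSE
)

# Count the rows per category and add each category's share of the total.
status_counts <- loan_toy %>%
  count(loan_status) %>%
  mutate(share = n / sum(n))

status_counts
```

Running the same two steps on the real `loan` data frame should give the bar heights from the plot, plus the proportion of each category.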