I’m currently learning R, and I’m trying to classify the customers in the Kaggle loan dataset.
The first analysis that can be done on the raw data as it stands is to predict (classify) one dependent variable with three categories from 10 input variables (independent variables, X).
Here, the dependent variable (target variable, Y) is loan_status, and the three categories are as follows.
PAIDOFF: The loan was fully repaid by the deadline
COLLECTION: The loan was still unpaid at the time the data was collected
COLLECTION_PAIDOFF: The loan was fully repaid, but only after the deadline had passed
As given, this is a multiclass classification problem over the three categories above, but with some modifications I will turn it into a binary classification: whether the loan was repaid within the deadline (success) or not (failure).
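One way the three categories could be collapsed into a binary target is sketched below. This is only an illustration under assumptions: the toy data frame `loan_toy` and the new column name `repaid_on_time` are hypothetical and not part of the original data, and the sketch treats only PAIDOFF as success, which matches the "within the deadline" definition above.

```r
library(dplyr)

# Toy stand-in for the Kaggle data (hypothetical); only loan_status is needed.
loan_toy <- data.frame(
  loan_status = c("PAIDOFF", "COLLECTION", "COLLECTION_PAIDOFF", "PAIDOFF")
)

# Hypothetical binary target: "success" only when the loan was repaid
# within the deadline (PAIDOFF); both collection states count as "failure".
loan_toy <- loan_toy %>%
  mutate(repaid_on_time = factor(
    if_else(loan_status == "PAIDOFF", "success", "failure"),
    levels = c("failure", "success")
  ))

# Cross-tabulate the original and the recoded labels to verify the mapping.
table(loan_toy$loan_status, loan_toy$repaid_on_time)
```

Fixing the factor levels explicitly keeps "failure" as the reference level, which makes the output of later models easier to read. The same `mutate()` call should apply to the real `loan` data frame once it is loaded.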
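library(dplyr)  # provides %>% and mutate()

# `loan` is assumed to already hold the Kaggle loan data as a data frame.
# Convert the ID, status, date, and categorical columns to factors.
loan <- loan %>%
  mutate(Loan_ID = factor(Loan_ID),
         loan_status = factor(loan_status),
         effective_date = factor(effective_date),
         due_date = factor(due_date),
         paid_off_time = factor(paid_off_time),
         education = factor(education),
         Gender = factor(Gender))

# Inspect the level counts and value ranges of every column.
summary(loan)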
Visualization
library(ggplot2)

# Bar plot of the number of loans in each loan_status category.
loan %>%
  ggplot(aes(loan_status)) +
  geom_bar() +
  labs(title = "Bar plot",
       subtitle = "Success People",
       caption = "Source: Kaggle Loan data")
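The bar heights can also be checked numerically. The following is a minimal sketch using a hypothetical toy data frame (`loan_toy`) rather than the real Kaggle data; `status_counts` and `prop` are illustrative names.

```r
library(dplyr)

# Toy stand-in for the Kaggle data (hypothetical values).
loan_toy <- data.frame(
  loan_status = factor(c("PAIDOFF", "PAIDOFF", "COLLECTION", "COLLECTION_PAIDOFF"))
)

# Count the rows per category and add each category's share of the total.
status_counts <- loan_toy %>%
  count(loan_status) %>%
  mutate(prop = n / sum(n))

status_counts
```

Looking at the shares alongside the plot helps show how imbalanced the classes are before any model is fitted.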

