Home Credit Default Risk

MOHSEN DEHHAGHI
7 min read · Aug 19, 2021

In this project, I leveraged the power of machine learning to predict how capable each applicant is of repaying their loan.

Problem Statement

In order to predict how capable each applicant is of repaying a loan, I make use of a variety of alternative data — including telco and transactional information — to predict Home Credit clients’ repayment abilities. While Home Credit is currently using various statistical and machine learning methods to make these predictions, they are challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.


The objective of this project is to use historical loan application data to predict whether or not an applicant will be able to repay a loan. This is a standard supervised classification model:

Supervised: The labels are included in the training data and the goal is to train a model to learn to predict the labels from the features
Classification: The label is a binary variable, 0 (will repay loan on time), 1 (will have difficulty repaying loan)

The following procedures are followed in this project:

  • Downloading a real-world dataset from Kaggle
  • Exploratory data analysis and visualization
  • Splitting a dataset into training, validation & test sets
  • Filling/imputing missing values in numeric columns
  • Scaling numeric features to the (0, 1) range
  • Encoding categorical columns as one-hot vectors
  • Training different machine learning models using Scikit-learn
  • Evaluating a model using a validation set and test set
  • Saving a model to disk and loading it back

Training Data

The file application_train.csv contains the training data. Let's load it into a Pandas dataframe.

The training dataset contains 307,511 rows and 122 columns. The dataset contains numeric and categorical columns. My objective is to create a model to predict the value in the column TARGET.
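A minimal loading step looks like the sketch below. The real file comes from the Kaggle competition download; the tiny frame here is a hypothetical stand-in with a few of the same columns, purely for illustration.

```python
import pandas as pd

# In the real project the 307,511-row file is loaded from disk:
# app_train_df = pd.read_csv("application_train.csv")

# Hypothetical stand-in frame with the same key columns, for illustration:
app_train_df = pd.DataFrame({
    "SK_ID_CURR": [100001, 100002, 100003],
    "TARGET": [0, 0, 1],
    "CODE_GENDER": ["F", "M", "F"],
    "AMT_INCOME_TOTAL": [202500.0, 270000.0, 67500.0],
})

print(app_train_df.shape)   # (n_rows, n_columns)
print(app_train_df.dtypes)  # mix of numeric and object (categorical) columns
```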

This is the main training dataset with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature SK_ID_CURR. The training application data comes with the TARGET label having binary values of:

0: Loan was repaid
1: Loan was not repaid

Each row shows a loan application at Home Credit. The TARGET column contains the value to be predicted.

Here are the descriptive statistics, including those that summarize the central tendency, dispersion, and shape of the training dataset’s distribution, excluding NaN values.
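In pandas this is a one-liner via `describe()`; note that it excludes NaN values when computing the statistics. The column below is a small hypothetical example, not real Home Credit data.

```python
import numpy as np
import pandas as pd

# Hypothetical income column containing a missing value
incomes = pd.Series([202500.0, 270000.0, np.nan, 67500.0],
                    name="AMT_INCOME_TOTAL")

stats = incomes.describe()  # count, mean, std, min, quartiles, max
print(stats["count"])       # 3.0 -- the NaN is excluded from all statistics
print(stats["mean"])        # 180000.0
```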

As is evident from the foregoing column descriptions, there is a plethora of interrelated information one can use to build an inclusive predictive model for the default risk of a loan application.

While we are able to fill in missing values for most columns, it might be a good idea to discard the rows where the value of TARGET is missing to make the analysis and modeling simpler.

Test Data

The statistics above demonstrate that the test dataset is considerably smaller than its training counterpart (~15% of the training data size) and, as expected, lacks the TARGET column.

Exploratory Data Analysis and Visualization

EDA steps

TARGET Column Distribution

TARGET is the binary variable that we are trying to predict, with two values: 0 (loan repaid) and 1 (loan not repaid).

I examine the number of loans falling into each category followed by the distribution of the TARGET variable.

We can observe from the distribution above that the TARGET variable is skewed, which leads to an imbalanced class problem: there are far more loans that were repaid on time than loans that were not. Once I get into more sophisticated machine learning models, I can weight the classes by their representation in the data to account for this imbalance.
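The imbalance check and the class-weighting idea can be sketched as follows. The labels here are hypothetical (92% repaid vs. 8% not repaid); scikit-learn's `compute_class_weight` with `"balanced"` assigns weights inversely proportional to class frequencies.

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical TARGET labels: far more on-time (0) than defaulted (1) loans
target = pd.Series([0] * 92 + [1] * 8)

print(target.value_counts(normalize=True))  # 0.92 vs. 0.08

# "balanced" weights: n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=target)
print(dict(zip([0, 1], weights)))  # minority class gets the larger weight
```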

Identifying Missing Values

Machine learning models cannot work with missing numerical data. The process of filling in missing values is called imputation. Therefore, to clean up the data for more accurate modeling, I examine the number and percentage of missing values in each column prior to imputation.
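A small helper can tabulate the per-column missing counts and percentages; this is a sketch, and the frame below is a hypothetical example with a few gaps.

```python
import numpy as np
import pandas as pd

def missing_values_table(df):
    """Count and percentage of missing values per column, worst first."""
    miss = df.isnull().sum()
    pct = 100 * miss / len(df)
    table = pd.DataFrame({"Missing Values": miss, "% of Total": pct})
    # Keep only columns that actually have gaps, sorted by severity
    return table[table["Missing Values"] > 0].sort_values(
        "% of Total", ascending=False)

# Hypothetical frame with some missing entries:
df = pd.DataFrame({"AMT_ANNUITY": [24700.5, np.nan, 6750.0, np.nan],
                   "CODE_GENDER": ["F", "M", None, "F"],
                   "SK_ID_CURR": [1, 2, 3, 4]})
print(missing_values_table(df))  # AMT_ANNUITY 50%, CODE_GENDER 25%
```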

When it comes time to build machine learning models, I need to fill in these missing values (imputation). I may also use gradient-boosting models such as XGBoost that can handle missing values with no need for imputation. Another option would be dropping columns with a high percentage of missing values, though it is impossible to know ahead of time whether those columns will be helpful to my model. Therefore, I keep all of the columns for now.

Categorical Column Statistics

Now, let’s look at the number of unique entries in each of the categorical columns (object data type).
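In pandas, selecting the object-typed columns and counting uniques is straightforward; the frame below is a hypothetical slice of the application data.

```python
import pandas as pd

# Hypothetical slice of the application data
df = pd.DataFrame({
    "CODE_GENDER": ["F", "M", "F", "F"],
    "NAME_INCOME_TYPE": ["Working", "Pensioner", "Working", "State servant"],
    "AMT_CREDIT": [406597.5, 1293502.5, 135000.0, 312682.5],
})

# Unique entries per categorical (object) column, smallest first
cat_counts = df.select_dtypes("object").nunique().sort_values()
print(cat_counts)  # CODE_GENDER: 2, NAME_INCOME_TYPE: 3
```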

As the count distribution of categorical features above demonstrates, most of the categorical variables have a relatively small number of unique entries. Later, I also need an efficient way to encode these categorical variables.

We can see from the histogram above that while there are twice as many female applicants as male applicants, the majority of applicants were able to repay the loan regardless of gender, which suggests that the client’s gender (CODE_GENDER) might not be a powerful predictor of repayment ability.

As is evident from the histogram above, the majority of loan applications came from clients with a Working income status, and in every income-type category the percentage of clients who repaid the loan on time is much higher than the percentage with payment difficulties or late payments.

Plotting the clients’ highest education level against their repayment ability shows that the vast majority of applications came from people with the lowest education level. This makes logical sense, as they constitute the majority of the unbanked population without sufficient credit histories, even though they were mostly successful in repaying their loans on time.

Preparing Dataset for Training

During data preparation, I perform the following steps to prepare the dataset for training:

  1. Create a train/test/validation split
  2. Identify input and target columns
  3. Identify numeric and categorical columns
  4. Encode categorical columns through label encoding and one-hot encoding for binary and non-binary categorical features, respectively
  5. Impute (fill) missing numeric values
  6. Scale numeric values to the (0,1) range
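Steps 4–6 above can be sketched with scikit-learn's standard transformers. This is a minimal illustration on a hypothetical two-column frame, not the full preparation pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical training inputs: one numeric, one categorical column
X = pd.DataFrame({"AMT_INCOME_TOTAL": [202500.0, np.nan, 67500.0],
                  "NAME_INCOME_TYPE": ["Working", "Pensioner", "Working"]})

# Step 5: impute missing numeric values with the column mean
num = SimpleImputer(strategy="mean").fit_transform(X[["AMT_INCOME_TOTAL"]])

# Step 6: scale numeric values to the (0, 1) range
scaled = MinMaxScaler().fit_transform(num)

# Step 4: one-hot encode the (non-binary) categorical column
encoder = OneHotEncoder(handle_unknown="ignore")
onehot = encoder.fit_transform(X[["NAME_INCOME_TYPE"]]).toarray()

# Combine into one numeric feature matrix
X_prepared = np.hstack([scaled, onehot])
print(X_prepared.shape)  # (3, 3): 1 scaled numeric + 2 one-hot columns
```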

Creating Training, Validation and Test Datasets

While building real-world machine learning models, it is quite common to split the dataset into three parts:

  1. Training set: Used to train the model, i.e., compute the loss and adjust the model’s weights using an optimization technique.
  2. Validation set: Used to evaluate the model during training, tune model hyper-parameters (optimization technique, regularization etc.), and pick the best version of the model. Picking a good validation set is essential for training models that generalize well.
  3. Test set: Used to compare different models or approaches and report the model’s final accuracy. For many datasets, test sets are provided separately. The test set should reflect the kind of data the model will encounter in the real-world, as closely as feasible.

As a general rule of thumb we can use around 60% of the data for the training set, 20% for the validation set and 20% for the test set. If a separate test set is already provided, you can use a 75%-25% training-validation split.

When rows in the dataset have no inherent order, it is common practice to pick random subsets of rows for creating test and validation sets. This can be done using the train_test_split utility from scikit-learn. Learn more about it here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Since a separate test set (application_test.csv) is already provided out of which I made a test DataFrame (app_test_df), I use a 75%-25% training-validation split to construct the Validation Set to evaluate the model during training, tune model hyperparameters (optimization technique, regularization etc.), and pick the best version of the model.
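A 75%-25% split with `train_test_split` might look like this. The frame below is a hypothetical stand-in for `app_train_df`; stratifying on TARGET keeps the class ratio the same in both parts, which matters for an imbalanced label.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for app_train_df: 100 rows, 10% positive class
app_train_df = pd.DataFrame({"SK_ID_CURR": range(100),
                             "AMT_CREDIT": [50000.0 + i for i in range(100)],
                             "TARGET": [0] * 90 + [1] * 10})

# 75%-25% training-validation split, preserving the TARGET class ratio
train_df, val_df = train_test_split(app_train_df, test_size=0.25,
                                    random_state=42,
                                    stratify=app_train_df["TARGET"])
print(len(train_df), len(val_df))  # 75 25
```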

Identifying Input and Target Columns

Often, not all the columns in a dataset are useful for training a model. The current training application dataset, app_train_df, includes the TARGET variable, with binary values {0: loan was repaid, 1: loan was not repaid}, which should be separated out as its own column.

Let’s create a list of input columns, and also identify the target column.

Let’s also identify which columns are numerical and which ones are categorical. This will be useful later, as I will need to convert the categorical data to numbers for training a machine learning model such as logistic regression.
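Both steps can be sketched with `select_dtypes`; the frame below is a hypothetical slice of the application data, and the identifier/label column names follow the dataset's own conventions.

```python
import pandas as pd

# Hypothetical slice of app_train_df
app_train_df = pd.DataFrame({
    "SK_ID_CURR": [100001, 100002],
    "TARGET": [0, 1],
    "CODE_GENDER": ["F", "M"],
    "AMT_CREDIT": [406597.5, 1293502.5],
})

# Input columns: everything except the row identifier and the label
input_cols = [c for c in app_train_df.columns
              if c not in ("SK_ID_CURR", "TARGET")]
target_col = "TARGET"

inputs_df = app_train_df[input_cols].copy()
numeric_cols = inputs_df.select_dtypes(include="number").columns.tolist()
categorical_cols = inputs_df.select_dtypes(include="object").columns.tolist()
print(numeric_cols, categorical_cols)  # ['AMT_CREDIT'] ['CODE_GENDER']
```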

Let’s view some statistics for the numeric columns.

I may have to do some data cleaning as well in case the ranges of the numeric columns do not seem reasonable.

Let’s also check the number of categories in each of the categorical columns.

Encoding Categorical Variables
