LoanTap Credit Risk Modelling


Logistic regression model to predict loan default risk — helping LoanTap's credit team decide who to approve, who to reject, and at what threshold.

Type: Logistic Regression · ROC-AUC

Domain: FinTech / Lending

Dataset: Loan applicant records

Tools: Python · Scikit-learn · Imbalanced-learn

Course: Scaler Academic Case Study

Who will default on their loan?

LoanTap is an online platform that evaluates personal loan applications. The business challenge: build a model that identifies loan defaulters before the loan is issued. A missed defaulter costs money. An over-cautious model rejects creditworthy customers. The right balance is a business decision, not just a technical one.

💳

The real-world complication

80% of loans in the dataset are Fully Paid and 20% default. With this class imbalance, a naive model can reach 80% accuracy by predicting every applicant as Fully Paid while completely failing at its actual job: catching defaulters.
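The accuracy trap is easy to reproduce. The sketch below uses synthetic labels with the same 80/20 ratio (not the LoanTap data) and scikit-learn's DummyClassifier as the "predict everyone as Fully Paid" model.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels matching the 80/20 ratio: 0 = Fully Paid, 1 = defaulted.
y = np.array([0] * 800 + [1] * 200)
X = np.zeros((len(y), 1))  # features are irrelevant to a majority-class baseline

# Always predicts the most frequent class (Fully Paid).
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, y_pred)
baseline_recall = recall_score(y, y_pred, pos_label=1)
print(f"accuracy={baseline_accuracy:.2f}, defaulter recall={baseline_recall:.2f}")
```

Any real model has to beat this baseline on defaulter recall, not just on accuracy.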

What was cleaned and dropped

Column	Action	Reason
emp_title	Dropped	High cardinality — too many unique values
title	Dropped	High cardinality — free text, not useful
address	Dropped	Not a predictive feature for credit risk
emp_length	Imputed	4.6% missing — filled with mode
mort_acc	Imputed	9.5% missing — filled with median by grade
pub_rec_bankruptcies	Imputed	Small % missing — filled with 0
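A minimal pandas sketch of the cleaning steps in the table above. The column names come from the table; the six-row frame and its values are made up for illustration, and the real notebook may order or implement these steps differently.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame; only the column names are taken from the table.
df = pd.DataFrame({
    "grade": ["A", "A", "A", "B", "B", "B"],
    "emp_length": ["10+ years", None, "10+ years", "2 years", "10+ years", None],
    "mort_acc": [1.0, np.nan, 3.0, 2.0, np.nan, 6.0],
    "pub_rec_bankruptcies": [0.0, np.nan, 1.0, 0.0, 0.0, np.nan],
    "emp_title": ["Teacher", "Nurse", "Driver", "Clerk", "Chef", "Analyst"],
    "title": ["Debt", "Car", "Home", "Debt", "Medical", "Other"],
    "address": ["a", "b", "c", "d", "e", "f"],
})

# Drop the high-cardinality / non-predictive columns.
df = df.drop(columns=["emp_title", "title", "address"])

# emp_length: fill with the most frequent value (mode).
df["emp_length"] = df["emp_length"].fillna(df["emp_length"].mode()[0])

# mort_acc: fill with the median of the row's own grade.
grade_median = df.groupby("grade")["mort_acc"].transform("median")
df["mort_acc"] = df["mort_acc"].fillna(grade_median)

# pub_rec_bankruptcies: treat missing as no recorded bankruptcies.
df["pub_rec_bankruptcies"] = df["pub_rec_bankruptcies"].fillna(0)

print(df)
```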

Model pipeline

EDA — Defaulter Patterns

Analysed loan_status distribution (80/20 imbalance). Identified grade, interest rate, and annual income as visually strong predictors. Fully Paid customers have lower interest rates and better grades.

countplot · heatmap · boxplot

Feature Engineering

One-hot encoded categorical variables (grade, sub_grade, home_ownership, purpose, initial_list_status, application_type). Converted term to numeric. Created zip_code feature.

get_dummies() · label encoding
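A hedged sketch of the encoding step. It assumes term arrives as a string such as " 36 months" (the raw format is not stated here), and loan_amnt is a hypothetical numeric column included only to show that non-categorical columns pass through untouched.

```python
import pandas as pd

df = pd.DataFrame({
    "term": [" 36 months", " 60 months", " 36 months"],  # assumed raw format
    "grade": ["A", "C", "B"],
    "home_ownership": ["RENT", "MORTGAGE", "OWN"],
    "loan_amnt": [10000, 24000, 8000],  # hypothetical numeric column
})

# Convert term to numeric by pulling out the leading integer.
df["term"] = df["term"].str.extract(r"(\d+)", expand=False).astype(int)

# One-hot encode the categorical columns; numeric columns are left as-is.
encoded = pd.get_dummies(df, columns=["grade", "home_ownership"], dtype=int)
print(encoded.columns.tolist())
```

The same get_dummies call extends to the other categorical columns listed above by adding them to the columns argument.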

Train-Test Split + Scaling

80/20 split with stratification to preserve class ratio. StandardScaler applied to numerical features.

train_test_split(stratify=y) · StandardScaler
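A sketch of the split and scaling step on synthetic data. The random_state value and the feature matrix are assumptions for illustration. The scaler is fitted on the training split only, so no test-set statistics leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))  # synthetic numeric features
y = np.array([0] * 800 + [1] * 200)                   # 80/20 class ratio

# stratify=y keeps the 80/20 ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean/std learned from train only
X_test_scaled = scaler.transform(X_test)        # reuse train statistics, no refit

print("train default rate:", y_train.mean())
print("test default rate:", y_test.mean())
```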

Logistic Regression + Evaluation

Fitted LogisticRegression. Evaluated with classification report, ROC-AUC, ROC curve, and Precision-Recall curve. Attempted threshold tuning to improve defaulter recall.

LogisticRegression · roc_auc_score · roc_curve

Class Imbalance Analysis

Model predicts almost all as Fully Paid despite threshold tuning. Root cause: severe class imbalance. Identified SMOTE and class_weight='balanced' as next steps.

roc_auc_score · precision_recall_curve

Python — model_and_evaluation.py

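from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve, classification_report

# X_train_scaled, X_test_scaled, y_train, y_test come from the split + scaling step
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# Probability of class 1 (default) and hard labels at the default 0.5 threshold
y_prob = model.predict_proba(X_test_scaled)[:, 1]
y_pred = model.predict(X_test_scaled)

roc_auc = roc_auc_score(y_test, y_prob)
print("ROC-AUC:", roc_auc)

# Per-class precision and recall; the class 1 (defaulter) row is the one to watch
print(classification_report(y_test, y_pred))

# False/true positive rates at each threshold, used to plot the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

# Problem: model predicts almost all as class 0 (Fully Paid)
# Defaulter recall ≈ 0 despite threshold tuning
# Root cause: 80:20 class imbalance
# Fix: class_weight='balanced' or SMOTE oversampling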

What the model revealed

📊

ROC-AUC is the right metric here

With class imbalance, accuracy is misleading. ROC-AUC measures how well the model ranks defaulters above non-defaulters across all thresholds, which makes it a far better fit for credit scoring than accuracy.
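A small synthetic check of why ROC-AUC tells a different story from accuracy. The scores below are hypothetical: defaulters score higher on average, but every score stays below 0.5, so the default threshold flags nobody. ROC-AUC is also computed by hand as the share of (defaulter, non-defaulter) pairs ranked correctly, which is what the metric measures.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.array([0] * 800 + [1] * 200)  # 0 = Fully Paid, 1 = defaulted

# Hypothetical model scores, capped below 0.5 on purpose.
scores = np.clip(rng.normal(0.15, 0.08, size=1000) + 0.12 * y_true, 0.0, 0.49)

# At threshold 0.5 nobody is flagged, so accuracy equals the majority share.
accuracy_at_half = accuracy_score(y_true, (scores >= 0.5).astype(int))
auc = roc_auc_score(y_true, scores)

# Pairwise view: fraction of pairs where the defaulter outranks the
# non-defaulter, counting ties as half.
pos = scores[y_true == 1]
neg = scores[y_true == 0]
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(f"accuracy@0.5={accuracy_at_half:.2f}, auc={auc:.3f}, pairwise={pairwise_auc:.3f}")
```

Accuracy cannot tell this model apart from the majority-class baseline, while ROC-AUC shows the scores still carry ranking signal.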

⚠️

Threshold tuning alone can't fix imbalance

Lowering the decision threshold didn't improve defaulter recall. The model is too biased toward the majority class — structural imbalance requires structural fixes.
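A threshold sweep of this kind can be sketched as follows. It runs on synthetic data from make_classification, so the numbers say nothing about the LoanTap results; the threshold grid is also an assumption. Recall can never fall as the threshold is lowered, so if it stays flat in a sweep like this, the predicted probabilities are likely concentrated below the thresholds tried.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: roughly 80% class 0, 20% class 1.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of class 1

recalls = []
for threshold in (0.5, 0.4, 0.3, 0.2, 0.1):
    y_pred = (y_prob >= threshold).astype(int)
    recall = recall_score(y_test, y_pred, pos_label=1)
    precision = precision_score(y_test, y_pred, pos_label=1, zero_division=0)
    recalls.append(recall)
    print(f"threshold={threshold:.1f}  recall={recall:.2f}  precision={precision:.2f}")
```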

💡

Grade and interest rate are strongest signals

Better loan grades (A, B) and lower interest rates correlate strongly with repayment. These are the most informative features for the model.

🔧

Next step: SMOTE + balanced weights

Applying SMOTE to oversample the minority class or setting class_weight='balanced' in LogisticRegression are the two fastest paths to improving defaulter recall.
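A minimal sketch of the class_weight='balanced' option, again on synthetic data rather than the LoanTap set. SMOTE is not shown because it needs imbalanced-learn; if used, it should be applied to the training split only so the test set keeps the real class ratio.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: roughly 80% class 0, 20% class 1.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Same model twice; the second reweights each class inversely to its frequency.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

recall_plain = recall_score(y_test, plain.predict(X_test), pos_label=1)
recall_balanced = recall_score(y_test, balanced.predict(X_test), pos_label=1)
print(f"minority recall: plain={recall_plain:.2f}, balanced={recall_balanced:.2f}")
```

Balanced weights typically raise minority-class recall at the cost of more false positives, which is the trade-off the business has to price.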

🏦

Business interpretation

The current model is profit-focused — it approves too many loans. A risk-controlled model needs higher defaulter recall, even at the cost of some false positives (rejecting good applicants). The right threshold depends on LoanTap's cost-benefit ratio for each type of error.
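One way to turn that cost-benefit view into a threshold is to price each error type and pick the threshold with the lowest total cost. The sketch below is illustrative only: the ten applicants, their scores, and the 5:1 cost ratio (a missed defaulter costing five times a wrongly rejected applicant) are all hypothetical, and expected_cost is a helper written for this example.

```python
import numpy as np

def expected_cost(y_true, y_prob, threshold, cost_fn, cost_fp):
    """Total cost of errors when applicants scoring >= threshold are rejected."""
    flagged = y_prob >= threshold
    false_negatives = np.sum((y_true == 1) & ~flagged)  # approved defaulters
    false_positives = np.sum((y_true == 0) & flagged)   # rejected good applicants
    return cost_fn * false_negatives + cost_fp * false_positives

# Hypothetical applicants: 1 = defaulted, with made-up default probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.45, 0.40, 0.60])

thresholds = np.array([0.2, 0.3, 0.4, 0.5])
costs = [expected_cost(y_true, y_prob, t, cost_fn=5.0, cost_fp=1.0) for t in thresholds]
best_threshold = thresholds[int(np.argmin(costs))]

for t, c in zip(thresholds, costs):
    print(f"threshold={t:.1f}  cost={c:.1f}")
print("lowest-cost threshold:", best_threshold)
```

With these made-up numbers the lowest-cost threshold is 0.4, below the default 0.5; a different cost ratio would move it.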
