Logistic Regression · Credit Risk · Completed

LoanTap
Credit Risk Modelling

Logistic regression model to predict loan default risk — helping LoanTap's credit team decide who to approve, who to reject, and at what threshold.

Type: Logistic Regression · ROC-AUC
Domain: FinTech / Lending
Dataset: Loan applicant records
Tools: Python · Scikit-learn · Imbalanced-learn
Course: Scaler Academic Case Study
Class Imbalance Ratio: 80:20
Primary Metric Used: ROC-AUC
Defaulter Recall (Key Problem): ~0
Test Set Size: 20%
01 — Business Problem

Who will default on their loan?

LoanTap is an online platform that evaluates personal loan applications. The business challenge: build a model that identifies loan defaulters before the loan is issued. A missed defaulter costs money. An over-cautious model rejects creditworthy customers. The right balance is a business decision, not just a technical one.

💳
The real-world complication
80% of applicants are Fully Paid. 20% default. This class imbalance means a naive model can achieve 80% accuracy by predicting everyone as Fully Paid — while completely failing at its actual job.
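The arithmetic behind this can be checked in a few lines. The sketch below uses a made-up test set of 1,000 loans with the same 80:20 split (the counts are illustrative, not taken from the LoanTap data) and a "model" that labels everyone as Fully Paid.

```python
# Hypothetical test set mirroring the 80:20 split described above:
# 0 = Fully Paid, 1 = defaulter.
y_true = [0] * 800 + [1] * 200

# A "model" that labels every applicant as Fully Paid.
y_pred = [0] * len(y_true)

# Accuracy: share of labels that match.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on defaulters: caught defaulters / all actual defaulters.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
defaulter_recall = caught / sum(y_true)

print(f"accuracy = {accuracy:.2f}")                  # accuracy = 0.80
print(f"defaulter recall = {defaulter_recall:.2f}")  # defaulter recall = 0.00
```

Accuracy looks respectable while not a single defaulter is caught, which is why the rest of the project leans on ROC-AUC and defaulter recall instead.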
02 — Data Cleaning

What was cleaned and dropped

Column | Action | Reason
emp_title | Dropped | High cardinality: too many unique values
title | Dropped | High cardinality: free text, not useful
address | Dropped | Not a predictive feature for credit risk
emp_length | Imputed | 4.6% missing, filled with mode
mort_acc | Imputed | 9.5% missing, filled with median by grade
pub_rec_bankruptcies | Imputed | Small % missing, filled with 0
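The cleaning steps in the table could be sketched in pandas roughly as follows. The toy DataFrame and its values are invented for illustration; only the column names and the imputation rules come from the table above, and the project's actual code may differ.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan records; all values are made up.
df = pd.DataFrame({
    "emp_title": ["Teacher", "Nurse", "Driver", "Clerk", "Chef"],
    "title": ["Debt consolidation", "Car", "Home", "Medical", "Other"],
    "address": ["addr 1", "addr 2", "addr 3", "addr 4", "addr 5"],
    "grade": ["A", "A", "B", "B", "B"],
    "emp_length": ["10+ years", None, "10+ years", "2 years", "10+ years"],
    "mort_acc": [1.0, np.nan, 2.0, 4.0, np.nan],
    "pub_rec_bankruptcies": [0.0, np.nan, 1.0, 0.0, 0.0],
})

# Drop the high-cardinality / non-predictive text columns.
df = df.drop(columns=["emp_title", "title", "address"])

# emp_length: fill with the most frequent value (mode).
df["emp_length"] = df["emp_length"].fillna(df["emp_length"].mode()[0])

# mort_acc: fill each gap with the median mort_acc of the same grade.
df["mort_acc"] = df["mort_acc"].fillna(
    df.groupby("grade")["mort_acc"].transform("median")
)

# pub_rec_bankruptcies: treat missing as no recorded bankruptcy.
df["pub_rec_bankruptcies"] = df["pub_rec_bankruptcies"].fillna(0)

print(df)
```

Filling mort_acc per grade rather than with one global median keeps the imputed value closer to what similar applicants look like.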
03 — Methodology

Model pipeline

01
EDA — Defaulter Patterns
Analysed loan_status distribution (80/20 imbalance). Identified grade, interest rate, and annual income as visually strong predictors. Fully Paid customers have lower interest rates and better grades.
countplot · heatmap · boxplot
02
Feature Engineering
One-hot encoded categorical variables (grade, sub_grade, home_ownership, purpose, initial_list_status, application_type). Converted term to numeric. Created zip_code feature.
get_dummies() · label encoding
03
Train-Test Split + Scaling
80/20 split with stratification to preserve class ratio. StandardScaler applied to numerical features.
train_test_split(stratify=y) · StandardScaler
04
Logistic Regression + Evaluation
Fitted LogisticRegression. Evaluated with classification report, ROC-AUC, ROC curve, and Precision-Recall curve. Attempted threshold tuning to improve defaulter recall.
LogisticRegression · roc_auc_score · roc_curve
05
Class Imbalance Analysis
Model predicts almost all as Fully Paid despite threshold tuning. Root cause: severe class imbalance. Identified SMOTE and class_weight='balanced' as next steps.
roc_auc_score · precision_recall_curve
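Steps 02 and 03 of the pipeline can be sketched on synthetic data as below. The column names int_rate and annual_inc, the random values, and the choice of drop_first=True are assumptions made for this example; they are not confirmed by the project description.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in for the cleaned loan data (values are random).
df = pd.DataFrame({
    "term": rng.choice([" 36 months", " 60 months"], size=n),
    "grade": rng.choice(list("ABCDE"), size=n),
    "home_ownership": rng.choice(["RENT", "OWN", "MORTGAGE"], size=n),
    "int_rate": rng.uniform(5, 30, size=n),
    "annual_inc": rng.lognormal(11, 0.5, size=n),
    "loan_status": np.repeat([0, 1], [800, 200]),  # 80:20 imbalance
})

# Convert term (" 36 months") to a numeric number of months.
df["term"] = df["term"].str.extract(r"(\d+)", expand=False).astype(float)

# One-hot encode categoricals; drop_first avoids a redundant dummy column.
df = pd.get_dummies(df, columns=["grade", "home_ownership"], drop_first=True)

X = df.drop(columns="loan_status")
y = df["loan_status"]

# Stratified 80/20 split keeps the defaulter share equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training rows only, then reuse it on the test rows.
num_cols = ["term", "int_rate", "annual_inc"]
scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test_scaled[num_cols] = scaler.transform(X_test[num_cols])

print("defaulter share, train:", y_train.mean())
print("defaulter share, test: ", y_test.mean())
```

Fitting StandardScaler on the training split alone keeps test-set statistics out of the model, which matters for an honest ROC-AUC estimate.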
Python — model_and_evaluation.py
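from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve, classification_report

# X_train_scaled / X_test_scaled come from the split + scaling step
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# Probability of class 1 (defaulter) and labels at the default 0.5 cutoff
y_prob = model.predict_proba(X_test_scaled)[:, 1]
y_pred = model.predict(X_test_scaled)

roc_auc = roc_auc_score(y_test, y_prob)
print("ROC-AUC:", roc_auc)

# Per-class precision and recall expose the defaulter recall problem
print(classification_report(y_test, y_pred))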

# Problem: model predicts almost all as class 0 (Fully Paid)
# Defaulter recall ≈ 0 despite threshold tuning
# Root cause: 80:20 class imbalance
# Fix: class_weight='balanced' or SMOTE oversampling
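The mechanics of threshold tuning can be illustrated with a small sweep. This is a sketch on synthetic data from make_classification, not a reproduction of the LoanTap results. It only shows how labels are recomputed from y_prob at different cutoffs, so recall and precision can be inspected at each one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic 80:20 data; class 1 plays the role of "defaulter".
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]

# model.predict() effectively thresholds y_prob at 0.5.
# Sweep lower cutoffs and recompute the labels by hand.
results = {}
for threshold in (0.5, 0.4, 0.3, 0.2, 0.1):
    y_pred = (y_prob >= threshold).astype(int)
    recall = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, zero_division=0)
    results[threshold] = (recall, precision)
    print(f"threshold={threshold:.1f}  recall={recall:.3f}  precision={precision:.3f}")
```

Lowering the cutoff can only flag more applicants as defaulters, so recall never decreases along the sweep. If recall stays near zero on real data, printing the distribution of y_prob for actual defaulters is a reasonable next check.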
04 — Key Findings

What the model revealed

📊
ROC-AUC is the right metric here
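With class imbalance, accuracy is misleading. ROC-AUC measures how well the model ranks defaulters above non-defaulters across all thresholds, which makes it a standard metric for credit scoring. Because it says nothing about performance at one specific cutoff, it is read here alongside the Precision-Recall curve and defaulter recall.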
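One way to see why ROC-AUC is threshold-independent: it equals the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The helper pairwise_auc below is a hypothetical name written for this illustration, and the four scores are toy values.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """Share of (defaulter, non-defaulter) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]  # scores of defaulters
    neg = y_score[y_true == 0]  # scores of non-defaulters
    diff = pos[:, None] - neg[None, :]  # every defaulter vs every non-defaulter
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, y_score)
print(auc)  # 0.75 (3 of the 4 pairs are ranked correctly)

# Any order-preserving rescaling of the scores leaves the ranking,
# and therefore the AUC, unchanged.
auc_squared = pairwise_auc(y_true, np.square(y_score))
print(auc_squared)  # 0.75
```

Only the ordering of scores matters, so a model can have a decent ROC-AUC and still show near-zero recall at the default 0.5 cutoff.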
⚠️
Threshold tuning alone can't fix imbalance
In this project, lowering the decision threshold did not meaningfully improve defaulter recall. The model is too biased toward the majority class, so the imbalance needs to be addressed at training time, not only at the cutoff.
💡
Grade and interest rate are strongest signals
Better loan grades (A, B) and lower interest rates correlate strongly with repayment. These are the most informative features for the model.
🔧
Next step: SMOTE + balanced weights
Applying SMOTE to oversample the minority class or setting class_weight='balanced' in LogisticRegression are the two fastest paths to improving defaulter recall.
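The effect of class_weight='balanced' can be sketched by fitting the same model twice on synthetic 80:20 data. The dataset and the class_sep value are assumptions chosen for illustration. This does not show what the setting would achieve on the LoanTap data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic 80:20 data with overlapping classes; class 1 = "defaulter".
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.8, 0.2],
    class_sep=0.8, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Same model, with and without reweighting the minority class.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(
    X_train, y_train
)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_balanced = recall_score(y_test, balanced.predict(X_test))

print(f"defaulter recall, default weights:  {recall_plain:.3f}")
print(f"defaulter recall, balanced weights: {recall_balanced:.3f}")
```

SMOTE from imbalanced-learn is the oversampling alternative. It would be applied to the training split only, so that synthetic rows never leak into the test set.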
🏦
Business interpretation
The current model is profit-focused — it approves too many loans. A risk-controlled model needs higher defaulter recall, even at the cost of some false positives (rejecting good applicants). The right threshold depends on LoanTap's cost-benefit ratio for each type of error.
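Choosing a threshold from error costs can be sketched as below. The helper pick_threshold, the toy scores, and the 5:1 cost ratios are all hypothetical. LoanTap's real costs per missed defaulter and per rejected good applicant are not given in this write-up.

```python
import numpy as np

def pick_threshold(y_true, y_prob, cost_fn, cost_fp, thresholds):
    """Return (threshold, total_cost) with the lowest total misclassification cost."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    best = None
    for t in thresholds:
        y_pred = (y_prob >= t).astype(int)  # 1 = reject as likely defaulter
        fn = int(((y_true == 1) & (y_pred == 0)).sum())  # approved defaulters
        fp = int(((y_true == 0) & (y_pred == 1)).sum())  # rejected good applicants
        cost = fn * cost_fn + fp * cost_fp
        if best is None or cost < best[1]:
            best = (t, cost)
    return best

# Toy labels and predicted default probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.35, 0.30, 0.60, 0.55, 0.90]
thresholds = [0.25, 0.5, 0.75]

# A missed defaulter costs 5x a wrongly rejected applicant: low cutoff wins.
risk_averse = pick_threshold(y_true, y_prob, cost_fn=5, cost_fp=1, thresholds=thresholds)
print(risk_averse)  # (0.25, 2)

# Rejecting a good applicant costs 5x a missed defaulter: high cutoff wins.
growth_focused = pick_threshold(y_true, y_prob, cost_fn=1, cost_fp=5, thresholds=thresholds)
print(growth_focused)  # (0.75, 2)
```

The same scores lead to opposite cutoffs once the cost ratio flips, which is the sense in which the threshold is a business decision.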
05 — Tech Stack
Python 3 · Pandas · Scikit-learn · Logistic Regression · ROC-AUC · Precision-Recall Curve · StandardScaler