Ensemble Learning · Random Forest · Completed

OLA Driver Churn Prediction

Random Forest and Gradient Boosting models to predict which OLA drivers will leave the platform — helping OLA retain drivers before they churn.

Type: Ensemble Learning · Classification
Domain: Ride-hailing / Gig Economy
Dataset: OLA driver records
Tools: Python · Scikit-learn · Random Forest · GBM
Course: Scaler Academic Case Study
Gradient Boosting Accuracy: 84%
ROC-AUC Score: 0.856
Churn Recall (Class 1): 0.93
Two Models Compared: RF vs GB
01 — Business Problem

Which drivers are about to leave OLA?

OLA's gig economy model depends on driver supply. Driver churn is expensive — acquiring and onboarding new drivers costs far more than retaining existing ones. The goal: build a model that identifies at-risk drivers early, so OLA's retention team can intervene before the driver leaves.

🚗
Why ensemble methods
Logistic regression assumes linear relationships. Driver churn is driven by complex interactions between income, tenure, ratings, and business value — non-linear patterns that tree-based ensembles handle naturally.
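The point about interactions can be illustrated with a small sketch. The data below is synthetic and is not the OLA dataset: two standardized features (hypothetically "income" and "rating") determine the label only through their interaction, so no single straight line separates the classes. The feature meanings, sample size, and seeds are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic, illustrative data: two standardized features.
# The label is 1 when exactly one feature is above average (an XOR-like
# interaction), so neither feature is informative on its own.
X = rng.normal(size=(2000, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A linear decision boundary cannot separate an XOR pattern.
lr = LogisticRegression().fit(X_train, y_train)

# Trees split on one feature at a time and can nest those splits,
# which lets them capture the interaction.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

lr_acc = lr.score(X_test, y_test)
rf_acc = rf.score(X_test, y_test)
print(f"Logistic regression accuracy: {lr_acc:.2f}")
print(f"Random forest accuracy:       {rf_acc:.2f}")
```

On data like this, logistic regression stays close to chance while the forest recovers the pattern. Real churn data is rarely this extreme, so the gap on the OLA records would likely be smaller.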
02 — Data Preparation

Building the driver features

01
Date Parsing & Feature Engineering
Converted MMM-YY, Dateofjoining, and LastWorkingDate to datetime. Created tenure feature (months between joining and last working date). Created churn target: 1 if LastWorkingDate exists, 0 if still active.
pd.to_datetime · relativedelta
02
Aggregation by Driver
Dataset had multiple rows per driver (monthly records). Aggregated to one row per driver using mean for numerical features and max for binary indicators.
groupby('Driver_ID').agg()
03
Missing Value Treatment
Imputed missing income and rating values with median. Dropped rows with excessive missing data.
fillna(median)
04
Train-Test + Scaling
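80/20 split. StandardScaler applied. class_weight='balanced' set on the Random Forest to handle class imbalance; scikit-learn's GradientBoostingClassifier has no class_weight parameter, so it was trained without class weighting.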
train_test_split · StandardScaler · class_weight
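The four preparation steps above can be sketched end to end on a toy frame. This is a minimal sketch, not the project's actual code. The column names `MMM-YY`, `Dateofjoining`, `LastWorkingDate`, and `Driver_ID` come from the write-up; `Income`, `Quarterly Rating`, and `Total Business Value`, all values, and the stratified split are assumptions. For active drivers, who have no last working date, the sketch assumes tenure is measured up to the last reporting month.

```python
import pandas as pd
from dateutil.relativedelta import relativedelta
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy monthly records: 10 drivers, 2 months each (all values invented).
rows = []
for d in range(1, 11):
    churned = d % 2 == 0  # toy rule: even IDs leave
    for month in ["2019-01-01", "2019-02-01"]:
        rows.append({
            "MMM-YY": month,
            "Driver_ID": d,
            "Dateofjoining": "2018-06-15",
            "LastWorkingDate": "2019-02-20" if churned and month == "2019-02-01" else None,
            "Income": 50000 + 1000 * d if d != 3 else float("nan"),
            "Quarterly Rating": 1 if churned else 3,
            "Total Business Value": 100000 * d,
        })
raw = pd.DataFrame(rows)

# Step 01: parse dates.
for col in ["MMM-YY", "Dateofjoining", "LastWorkingDate"]:
    raw[col] = pd.to_datetime(raw[col])

# Step 02: one row per driver.
per_driver = raw.groupby("Driver_ID").agg(
    joined=("Dateofjoining", "min"),
    last_day=("LastWorkingDate", "max"),
    last_report=("MMM-YY", "max"),
    income=("Income", "mean"),
    rating=("Quarterly Rating", "mean"),
    business_value=("Total Business Value", "mean"),
)

# Churn target: 1 if a last working date exists, else 0.
per_driver["churn"] = per_driver["last_day"].notna().astype(int)

# Tenure in whole months. Assumption: active drivers are measured
# up to their last reporting month.
def months_between(start, stop):
    delta = relativedelta(stop, start)
    return delta.years * 12 + delta.months

end = per_driver["last_day"].fillna(per_driver["last_report"])
per_driver["tenure_months"] = [
    months_between(s, e) for s, e in zip(per_driver["joined"], end)
]

# Step 03: median imputation for income and rating.
for col in ["income", "rating"]:
    per_driver[col] = per_driver[col].fillna(per_driver[col].median())

# Step 04: 80/20 split, then scale using training statistics only.
features = ["income", "rating", "business_value", "tenure_months"]
X, y = per_driver[features], per_driver["churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```

Fitting the scaler on the training split alone keeps test-set statistics out of training. Tree-based models are insensitive to feature scaling, so this step is harmless here but matters mainly if a linear baseline is trained on the same features.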
03 — Model Comparison

Random Forest vs Gradient Boosting

Metric | Random Forest | Gradient Boosting | Winner
Accuracy | 81% | 84% | GBM ✓
ROC-AUC | 0.844 | 0.856 | GBM ✓
Churn Recall (Class 1) | 0.91 | 0.93 | GBM ✓
Non-Churn Recall (Class 0) | 0.61 | 0.64 | GBM ✓
🏆
Final model: Gradient Boosting
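GBM outperforms Random Forest on every metric in the comparison. Critically, churn recall of 0.93 means the model correctly flags 93% of the test-set drivers who left, which is exactly what OLA needs for proactive retention. The trade-off is non-churn recall of 0.64: roughly a third of drivers who stay are flagged as well.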
Python — ensemble_models.py
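from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, roc_auc_score

# X_train_s / X_test_s: scaled feature matrices from the data preparation step
# y_train / y_test: churn labels (1 = left, 0 = still active)

# Random Forest, with class weights to offset the class imbalance
rf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
rf.fit(X_train_s, y_train)
print("RF ROC-AUC:", roc_auc_score(y_test, rf.predict_proba(X_test_s)[:, 1]))
# reported: 0.844
print(classification_report(y_test, rf.predict(X_test_s)))

# Gradient Boosting (GradientBoostingClassifier has no class_weight parameter)
gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_train_s, y_train)
print("GBM ROC-AUC:", roc_auc_score(y_test, gb.predict_proba(X_test_s)[:, 1]))
# reported: 0.856 (GBM wins)
print(classification_report(y_test, gb.predict(X_test_s)))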
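The four metrics in the comparison table can all be derived from one set of predictions. The sketch below uses synthetic data from `make_classification` as a stand-in for the driver features, so its numbers will not match the reported ones; the class balance and sample size are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the per-driver feature matrix (not the OLA data).
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.35, 0.65], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

gb = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
y_pred = gb.predict(X_test)               # hard labels at the default 0.5 threshold
y_prob = gb.predict_proba(X_test)[:, 1]   # probability of class 1 (churn)

# Confusion matrix layout for binary labels: [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_prob),              # uses probabilities
    "churn_recall": recall_score(y_test, y_pred, pos_label=1),      # tp / (tp + fn)
    "non_churn_recall": recall_score(y_test, y_pred, pos_label=0),  # tn / (tn + fp)
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy and both recalls depend on the decision threshold, while ROC-AUC is computed from the predicted probabilities and does not. Lowering the threshold would raise churn recall at the cost of non-churn recall.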
04 — Key Findings

Business implications

🎯
93% of churners identified
Gradient Boosting correctly flags 93 out of every 100 drivers who will leave. That's the number that matters for OLA's retention programme.
💰
Business value drives churn most
Drivers with lower total business value are more likely to leave. OLA should focus retention incentives on drivers whose earnings are declining.
📅
Tenure is a risk signal
Shorter-tenured drivers churn more. A structured onboarding programme for the first 3–6 months could significantly reduce early churn.
Rating decline precedes churn
Falling driver ratings often precede departure. Rating trajectory is an early warning signal the model can flag weeks in advance.
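The write-up does not show how the driver-level findings were derived. One common approach, sketched here as an assumption, is to read feature importances from the fitted Gradient Boosting model. The data is synthetic and the feature names (`business_value`, `tenure_months`, `rating_change`, `noise`) are hypothetical; the toy labelling rule deliberately makes low business value the strongest churn driver.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
n = 1000

# Hypothetical standardized per-driver features (synthetic values).
X = pd.DataFrame({
    "business_value": rng.normal(size=n),
    "tenure_months": rng.normal(size=n),
    "rating_change": rng.normal(size=n),
    "noise": rng.normal(size=n),  # unrelated to the label by construction
})

# Toy rule: lower business value, shorter tenure, and falling ratings
# all push a driver towards churn, in decreasing order of strength.
score = (
    -2.0 * X["business_value"]
    - 1.0 * X["tenure_months"]
    - 0.5 * X["rating_change"]
    + rng.normal(scale=0.5, size=n)
)
y = (score > 0).astype(int)

gb = GradientBoostingClassifier(random_state=42).fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(
    gb.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Importances rank how much each feature contributes to the model's splits, but they do not give the direction of the effect. Statements such as "lower business value means more churn" need a follow-up check, for example comparing churn rates across value bands.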
05 — Tech Stack
Python 3 · Pandas · Scikit-learn · Random Forest · Gradient Boosting · ROC-AUC · StandardScaler