OLA Driver Churn Prediction

Random Forest and Gradient Boosting models that predict which OLA drivers will leave the platform, so that OLA can retain them before they churn.

Type: Ensemble Learning · Classification
Domain: Ride-hailing / Gig Economy
Dataset: OLA driver records
Tools: Python · Scikit-learn · Random Forest · GBM
Course: Scaler Academic Case Study

Which drivers are about to leave OLA?

OLA's gig economy model depends on driver supply. Driver churn is expensive — acquiring and onboarding new drivers costs far more than retaining existing ones. The goal: build a model that identifies at-risk drivers early, so OLA's retention team can intervene before the driver leaves.

🚗

Why ensemble methods

Logistic regression models the log-odds of churn as a linear function of the features, so it captures interactions only when they are engineered by hand. Driver churn depends on interactions between income, tenure, ratings, and business value; tree-based ensembles learn such non-linear patterns directly from the data.
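To make the point concrete, here is a minimal sketch on synthetic data (not the OLA dataset). The two features and the XOR-like label are invented for illustration: the class depends only on the interaction between the features, which a linear decision boundary cannot represent.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic, made-up data: two standardized features (think "income" and "rating").
X = rng.normal(size=(4000, 2))
# The label depends only on the interaction: class 1 when the signs differ (XOR-like).
y = (X[:, 0] * X[:, 1] < 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Same data, two model families: a linear classifier and a tree ensemble.
logit_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
forest_acc = (
    RandomForestClassifier(n_estimators=100, random_state=0)
    .fit(X_tr, y_tr)
    .score(X_te, y_te)
)

print(f"logistic regression accuracy: {logit_acc:.2f}")
print(f"random forest accuracy:       {forest_acc:.2f}")
```

On data like this the linear model stays close to chance while the forest recovers the pattern. Real churn data is far less extreme, so the gap there is typically smaller.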

Building the driver features

Date Parsing & Feature Engineering

Converted MMM-YY, Dateofjoining, and LastWorkingDate to datetime. Created tenure feature (months between joining and last working date). Created churn target: 1 if LastWorkingDate exists, 0 if still active.

pd.to_datetime · relativedelta
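The steps above can be sketched roughly as follows. The column names come from the description; the sample rows, the ISO date format, and the fallback to the latest reporting month for drivers who are still active are assumptions made for this example, not details from the case study.

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

# Made-up monthly records; only the column names are taken from the write-up.
raw = pd.DataFrame({
    "Driver_ID": [1, 1, 2, 2],
    "MMM-YY": ["2019-01-01", "2019-02-01", "2019-01-01", "2019-02-01"],
    "Dateofjoining": ["2018-10-15", "2018-10-15", "2018-06-11", "2018-06-11"],
    "LastWorkingDate": [None, "2019-02-20", None, None],
})

# Parse the three date columns; missing values become NaT.
for col in ["MMM-YY", "Dateofjoining", "LastWorkingDate"]:
    raw[col] = pd.to_datetime(raw[col])

# One row per driver with the dates needed for the target and tenure.
drivers = raw.groupby("Driver_ID").agg(
    joined=("Dateofjoining", "first"),
    last_working=("LastWorkingDate", "max"),  # stays NaT if the driver never left
    last_report=("MMM-YY", "max"),
)

# Churn target: 1 if a LastWorkingDate exists, 0 if the driver is still active.
drivers["churn"] = drivers["last_working"].notna().astype(int)

# Assumption: for active drivers, measure tenure up to the latest reporting month.
end_date = drivers["last_working"].fillna(drivers["last_report"])


def months_between(start, stop):
    """Whole calendar months from start to stop."""
    delta = relativedelta(stop, start)
    return delta.years * 12 + delta.months


drivers["tenure_months"] = [
    months_between(start, stop) for start, stop in zip(drivers["joined"], end_date)
]
print(drivers[["churn", "tenure_months"]])
```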

Aggregation by Driver

Dataset had multiple rows per driver (monthly records). Aggregated to one row per driver using mean for numerical features and max for binary indicators.

groupby('Driver_ID').agg()
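A minimal sketch of this aggregation, using pandas named aggregation. Apart from Driver_ID, the column names (income, rating, business_value, rating_increased) and the values are hypothetical stand-ins for the real fields.

```python
import pandas as pd

# Made-up monthly records: several rows per driver.
monthly = pd.DataFrame({
    "Driver_ID": [1, 1, 1, 2, 2],
    "income": [50000, 50000, 56000, 30000, 30000],
    "rating": [2, 2, 3, 1, 1],
    "business_value": [100000, 0, 250000, 0, 40000],
    "rating_increased": [0, 0, 1, 0, 0],  # hypothetical binary indicator
})

# Collapse to one row per driver:
# mean for numerical features, max for binary indicators ("did it ever happen?").
per_driver = monthly.groupby("Driver_ID").agg(
    income_mean=("income", "mean"),
    rating_mean=("rating", "mean"),
    business_value_mean=("business_value", "mean"),
    rating_increased=("rating_increased", "max"),
    months_observed=("income", "size"),
)
print(per_driver)
```

Taking the max of a 0/1 column marks a driver as 1 if the event occurred in any month, which is usually the intended meaning for such flags.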

Missing Value Treatment

Imputed missing income and rating values with median. Dropped rows with excessive missing data.

fillna(median)
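As an illustration, the treatment could look like the sketch below. The column names, the sample values, and the cutoff for "excessive" missing data (fewer than two non-missing values in a row) are assumptions; the case study does not state its threshold.

```python
import numpy as np
import pandas as pd

# Made-up per-driver table with gaps.
drivers = pd.DataFrame({
    "income": [52000.0, np.nan, 30000.0, 41000.0, np.nan],
    "rating": [2.5, 1.0, np.nan, 3.0, np.nan],
    "business_value": [120000.0, 20000.0, 5000.0, 90000.0, np.nan],
})

# Drop rows with too little information (assumed cutoff: keep rows with >= 2 values).
drivers = drivers.dropna(thresh=2).copy()

# Fill the remaining gaps in income and rating with each column's median.
for col in ["income", "rating"]:
    drivers[col] = drivers[col].fillna(drivers[col].median())

print(drivers)
```

The median is less sensitive than the mean to a few very high earners. A stricter variant would compute the medians on the training split only and reuse them on the test split.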

Train-Test + Scaling

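80/20 split. StandardScaler applied. class_weight='balanced' used in the Random Forest to handle imbalance; scikit-learn's GradientBoostingClassifier has no class_weight parameter, so the GBM in ensemble_models.py is trained unweighted.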

train_test_split · StandardScaler · class_weight
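A sketch of the split and scaling step on placeholder data. The feature matrix and labels are random stand-ins, and stratify=y is an assumption added here to keep the class ratio equal across splits; the write-up does not say whether the original split was stratified.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(500, 4))  # placeholder feature matrix
y = (rng.random(500) < 0.3).astype(int)              # placeholder labels, arbitrary ratio

# 80/20 split; stratify=y is an assumption, not confirmed by the case study.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split,
# so no test-set statistics leak into training.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.shape, X_test_s.shape)
```

Tree-based models split on feature thresholds, so standardization does not change what they can learn; it is harmless here and keeps the same inputs usable for scale-sensitive baselines.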

Random Forest vs Gradient Boosting

Metric	Random Forest	Gradient Boosting	Winner
Accuracy	81%	84%	GBM ✓
ROC-AUC	0.844	0.856	GBM ✓
Churn Recall (Class 1)	0.91	0.93	GBM ✓
Non-Churn Recall (Class 0)	0.61	0.64	GBM ✓

🏆

Final model: Gradient Boosting

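GBM outperforms Random Forest on every metric in the table. Its churn recall of 0.93 means it correctly identifies 93% of the test-set drivers who left, which is the figure that matters most for proactive retention. The trade-off is a non-churn recall of 0.64: roughly 36% of drivers who stay are also flagged, so part of the retention budget will go to drivers who were not at risk.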
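For readers unfamiliar with per-class recall, the sketch below shows how the table's metrics are derived from a confusion matrix. The labels and predictions are a tiny invented example, not the case-study results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels and predictions (1 = churned, 0 = stayed).
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 0, 0, 0, 1, 1])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

churn_recall = recall_score(y_true, y_pred, pos_label=1)      # tp / (tp + fn)
non_churn_recall = recall_score(y_true, y_pred, pos_label=0)  # tn / (tn + fp)
accuracy = accuracy_score(y_true, y_pred)                     # (tp + tn) / total

print(f"churn recall:     {churn_recall:.2f}")
print(f"non-churn recall: {non_churn_recall:.2f}")
print(f"accuracy:         {accuracy:.2f}")
```

Churn recall answers "of the drivers who left, how many did we flag?", while non-churn recall answers "of the drivers who stayed, how many did we correctly leave alone?".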

Python — ensemble_models.py

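from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, roc_auc_score

# X_train_s / X_test_s are the scaled splits from the Train-Test + Scaling step.

# Random Forest, with class weights inversely proportional to class frequency
rf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
rf.fit(X_train_s, y_train)
print("RF ROC-AUC:", roc_auc_score(y_test, rf.predict_proba(X_test_s)[:, 1]))
# 0.844

# Gradient Boosting (GradientBoostingClassifier has no class_weight parameter)
gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_train_s, y_train)
print("GBM ROC-AUC:", roc_auc_score(y_test, gb.predict_proba(X_test_s)[:, 1]))
# 0.856 (higher than RF)

# Per-class precision and recall for the final model
print(classification_report(y_test, gb.predict(X_test_s)))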
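The gb model above is trained without class weighting, because GradientBoostingClassifier does not accept class_weight. If weighting is wanted for the GBM as well, one option is to pass per-sample weights to fit. This is a suggested variant on synthetic data, not what the case study ran, and the name gb_weighted is introduced here for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(42)

# Synthetic imbalanced data: class 1 is the minority.
X = rng.normal(size=(600, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=600) > 0.8).astype(int)

# "balanced" gives each sample the weight n_samples / (n_classes * count_of_its_class),
# so both classes carry the same total weight during training.
weights = compute_sample_weight(class_weight="balanced", y=y)

gb_weighted = GradientBoostingClassifier(random_state=42)
gb_weighted.fit(X, y, sample_weight=weights)

print("total weight, class 0:", weights[y == 0].sum())
print("total weight, class 1:", weights[y == 1].sum())
```

Whether weighting helps depends on which error is costlier for the retention team; it tends to raise recall on the rarer class at the expense of the more common one.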

Business implications

🎯

93% of churners identified

Gradient Boosting correctly flags 93 out of every 100 drivers who will leave. That's the number that matters for OLA's retention programme.

💰

Business value drives churn most

Drivers with lower total business value are more likely to leave, and it is the model's strongest signal. OLA should focus retention incentives on drivers whose business value is low or declining.
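A ranking like this is typically read from the fitted model's feature importances. The sketch below uses synthetic data in which low business_value is made the strongest driver by construction; the feature names and coefficients are hypothetical and it does not reproduce the case-study ranking.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 1000

# Synthetic standardized features with hypothetical names.
X = pd.DataFrame({
    "business_value": rng.normal(size=n),
    "tenure_months": rng.normal(size=n),
    "rating": rng.normal(size=n),
    "noise": rng.normal(size=n),
})

# Synthetic target: lower business value raises churn risk the most (by construction).
score = -2.0 * X["business_value"] - 1.0 * X["tenure_months"] - 0.5 * X["rating"]
y = (score + rng.normal(scale=0.5, size=n) > 0).astype(int)

gb = GradientBoostingClassifier(random_state=0).fit(X, y)

# Impurity-based importances sum to 1 across features.
importances = pd.Series(gb.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Importances describe what the model relies on, not what causes churn. sklearn.inspection.permutation_importance on held-out data is a common cross-check.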

📅

Tenure is a risk signal

Shorter-tenured drivers churn more. A structured onboarding programme for the first 3–6 months could significantly reduce early churn.

⭐

Rating decline precedes churn

Falling driver ratings often precede departure, which makes rating trajectory an early warning signal the model can flag before the driver leaves.
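One simple way to expose rating trajectory to a model is a per-driver decline flag built from the monthly records. The sketch below is an assumption about how such a feature could be built; the column names report_month, rating, and rating_declined are hypothetical and the rows are invented.

```python
import pandas as pd

# Made-up monthly ratings for two drivers.
monthly = pd.DataFrame({
    "Driver_ID": [1, 1, 1, 2, 2, 2],
    "report_month": pd.to_datetime(
        ["2019-01-01", "2019-02-01", "2019-03-01"] * 2
    ),
    "rating": [3, 3, 2, 1, 2, 2],
})

# Order each driver's records chronologically before taking first/last.
monthly = monthly.sort_values(["Driver_ID", "report_month"])

trend = monthly.groupby("Driver_ID")["rating"].agg(
    first_rating="first",
    last_rating="last",
)

# 1 if the most recent rating is below the earliest one, else 0.
trend["rating_declined"] = (trend["last_rating"] < trend["first_rating"]).astype(int)
print(trend)
```

A mean rating alone hides the direction of change, so a flag like this (or a slope over the last few months) is what lets the model see a decline.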
