Feature Engineering · EDA · Completed

Delhivery
Logistics Pipeline Analysis

Feature engineering and exploratory analysis on Delhivery's shipment data to find where actual delivery times diverge from routing predictions — and why.

Type: Feature Engineering · EDA
Domain: Logistics / Supply Chain
Dataset: Delhivery shipment records
Tools: Python · Pandas · Scikit-learn
Course: Scaler Academic Case Study
Avg Days per Delivery: 12.5
Busiest Corridor: GGN→BLR
Slowest Category: Furniture (20.8 days)
01 — Business Problem

Why don't predictions match reality?

Delhivery uses OSRM (Open Source Routing Machine) to predict delivery times and distances. But OSRM operates on road network data — it doesn't know about traffic jams, loading delays, or idle time between segments. The goal: quantify the gap between predicted and actual delivery performance, and identify where the biggest inefficiencies are.

🚚
The core insight
Segment-level data undercounts total trip time. Something is happening between segments — waiting, loading, re-routing — that the tracking system doesn't capture. Finding and fixing that gap improves both forecasting and customer SLAs.
02 — Methodology

How I cleaned and engineered features

01
Data Cleaning
Converted all values to string type, stripped whitespace, replaced nan/null/None placeholders with "Unknown". Handled mixed-type columns that prevented numerical operations.
astype(str) · str.strip() · replace()
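A minimal sketch of this cleaning pass, assuming it is applied to text columns. The sample frame, its column names, and the helper `clean_text_column` are hypothetical and only illustrate the `astype(str)` / `str.strip()` / replace pattern named above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; column names are illustrative, not the dataset's schema.
raw = pd.DataFrame({
    "source_name": ["  Gurgaon ", "nan", None, "Bangalore"],
    "route_type": ["FTL", " null", "Carting ", np.nan],
})

# Tokens treated as missing once the values have been cast to strings.
PLACEHOLDERS = {"nan", "null", "none", ""}


def clean_text_column(s: pd.Series) -> pd.Series:
    """Cast to string, trim whitespace, and map placeholder tokens to "Unknown"."""
    s = s.astype(str).str.strip()
    # Also catch values that are still missing after the cast.
    is_placeholder = s.isna() | s.str.lower().isin(PLACEHOLDERS)
    return s.mask(is_placeholder, "Unknown")


cleaned = raw.apply(clean_text_column)
print(cleaned)
```

Numeric columns would need converting back (for example with `pd.to_numeric(..., errors="coerce")`) before any arithmetic; the write-up does not say how that was handled.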
02
Outlier Capping (IQR Method)
Applied IQR-based capping to all key numerical columns (delivery_time, osrm_time, distances). Prevented extreme outliers from distorting aggregations.
quantile() · np.where() · IQR
03
Feature Engineering — Time Gaps
Created derived features: gap between actual_time and osrm_time, gap between segment sum and total trip time, distance mismatch ratio.
Column arithmetic · ratio features
04
StandardScaler Normalisation
Scaled 9 numerical columns to standardise ranges before comparison analysis. Prevents high-magnitude columns (distance) from dominating low-magnitude ones (time ratios).
StandardScaler · fit_transform()
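A short sketch of the scaling step under stated assumptions: the values are made up, and only three of the nine columns are shown. `StandardScaler` centres each column to mean 0 and scales it to unit variance.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values; the write-up scales 9 columns, three are shown here.
df = pd.DataFrame({
    "actual_time": [120.0, 300.0, 90.0, 510.0],
    "osrm_time": [100.0, 220.0, 80.0, 400.0],
    "osrm_distance": [45.0, 160.0, 30.0, 390.0],
})
num_cols = ["actual_time", "osrm_time", "osrm_distance"]

scaler = StandardScaler()
# fit_transform returns a NumPy array, so wrap it to keep column names and index.
scaled = pd.DataFrame(
    scaler.fit_transform(df[num_cols]),
    columns=num_cols,
    index=df.index,
)
print(scaled.round(3))
```

Keeping the scaled values in a separate frame leaves the original units available for reporting figures such as average days per delivery.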
05
Route & Hub Analysis
Aggregated by origin-destination pairs to find busiest corridors and longest average delivery routes. Identified Gurgaon as the dominant hub.
groupby · sort_values · nlargest()
Python — feature_engineering.py
import numpy as np

# Assumes `df` is the cleaned shipment DataFrame loaded earlier in the script.

# IQR-based outlier capping: clamp values to [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
def cap_outliers(col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
    df[col] = np.where(df[col] < lower, lower,
                       np.where(df[col] > upper, upper, df[col]))

# Engineered gap features (positive = actual exceeds the OSRM estimate)
df['time_gap'] = df['actual_time'] - df['osrm_time']
df['dist_gap'] = df['actual_distance_to_destination'] - df['osrm_distance']

# Five busiest corridors by row count
top_routes = (
    df.groupby(['source_center', 'destination_center'])
      .size()
      .sort_values(ascending=False)
      .head(5)
)
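The other two derived features from step 03, the gap between the segment sum and total trip time and the distance mismatch ratio, could be computed along these lines. This is a sketch, not the project's code: the column names (`trip_id`, `segment_time`, `trip_total_time`, `actual_distance`) and the choice of actual/OSRM as the ratio are assumptions.

```python
import pandas as pd

# Hypothetical segment-level rows; column names are assumptions for illustration.
segments = pd.DataFrame({
    "trip_id": ["T1", "T1", "T1", "T2", "T2"],
    "segment_time": [40.0, 55.0, 30.0, 60.0, 80.0],
    "trip_total_time": [150.0, 150.0, 150.0, 155.0, 155.0],
    "actual_distance": [20.0, 30.0, 15.0, 35.0, 50.0],
    "osrm_distance": [18.0, 25.0, 15.0, 30.0, 40.0],
})

# Roll segments up to one row per trip.
trips = segments.groupby("trip_id").agg(
    segment_time_sum=("segment_time", "sum"),
    trip_total_time=("trip_total_time", "first"),
    actual_distance=("actual_distance", "sum"),
    osrm_distance=("osrm_distance", "sum"),
)

# Time the trip took that no individual segment accounts for.
trips["unaccounted_time"] = trips["trip_total_time"] - trips["segment_time_sum"]
# Values above 1 mean the distance covered exceeded the routed distance.
trips["dist_mismatch_ratio"] = trips["actual_distance"] / trips["osrm_distance"]
print(trips)
```

A positive `unaccounted_time` is the between-segment waiting, loading, or re-routing time that the core insight refers to.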
03 — Key Findings

Where the gaps are

⏱️
Actual time consistently exceeds OSRM prediction
Most trips take longer than OSRM predicts. On the actual vs predicted scatter, points cluster on the actual-time side of the 45° diagonal, so the routing estimate routinely understates delivery time.
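One way to put a number on this finding without a plot is to measure the share of trips whose actual time exceeds the OSRM estimate. The sketch below uses made-up values; only the column names `actual_time` and `osrm_time` come from the project code.

```python
import pandas as pd

# Hypothetical trip-level values for illustration only.
df = pd.DataFrame({
    "actual_time": [120.0, 300.0, 90.0, 510.0, 75.0],
    "osrm_time": [100.0, 220.0, 95.0, 400.0, 60.0],
})

# True where the trip took longer than the routing estimate.
slower = df["actual_time"] > df["osrm_time"]
share_slower = slower.mean()
# Median overrun is less sensitive to extreme trips than the mean.
median_overrun = (df["actual_time"] - df["osrm_time"]).median()

print(f"{share_slower:.0%} of trips exceed the OSRM estimate")
print(f"median overrun: {median_overrun}")
```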
📦
Segment sum ≠ total trip time
Total trip time is almost always higher than the sum of its segment times. Time is being lost between segments that tracking doesn't capture.
🏭
Gurgaon is the dominant hub
Gurgaon_Bilaspur_HB appears as source or destination in the most high-volume corridors. Bottlenecks here cascade across the network.
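A rough way to check hub dominance is to count how often each centre appears at either end of a trip. The centre codes below are hypothetical stand-ins, and counting trip endpoints is one possible measure rather than necessarily the one used in the analysis.

```python
import pandas as pd

# Hypothetical trips; centre codes are illustrative stand-ins.
routes = pd.DataFrame({
    "source_center": ["GGN", "GGN", "BLR", "GGN", "DEL", "BLR"],
    "destination_center": ["BLR", "BLR", "GGN", "DEL", "GGN", "HYD"],
})

# Stack both endpoint columns and count appearances per centre.
hub_trips = pd.concat(
    [routes["source_center"], routes["destination_center"]]
).value_counts()

print(hub_trips)
print("dominant hub:", hub_trips.idxmax())
```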
🪑
Furniture category is slowest
Furniture Office deliveries average 20.79 days, roughly 65% longer than the 12.5-day overall average. Large items likely need category-specific routing and handling.
⚠️
Critical operational gap
The gap between segment-level tracking and total trip time means Delhivery's SLA forecasts are structurally optimistic. The fix: add inter-segment tracking to capture idle/wait time properly.
04 — Tech Stack
Python 3 · Pandas · NumPy · Scikit-learn (StandardScaler) · Seaborn · Matplotlib