Delhivery Logistics Pipeline Analysis

Feature engineering and exploratory analysis on Delhivery's shipment data to find where actual delivery times diverge from routing predictions — and why.

Type: Feature Engineering · EDA

Domain: Logistics / Supply Chain

Dataset: Delhivery shipment records

Tools: Python · Pandas · Scikit-learn

Course: Scaler Academic Case Study

Why don't predictions match reality?

Delhivery uses OSRM (Open Source Routing Machine) to predict delivery times and distances. But OSRM operates on road network data — it doesn't know about traffic jams, loading delays, or idle time between segments. The goal: quantify the gap between predicted and actual delivery performance, and identify where the biggest inefficiencies are.

🚚

The core insight

Segment-level data undercounts total trip time. Something is happening between segments — waiting, loading, re-routing — that the tracking system doesn't capture. Finding and fixing that gap improves both forecasting and customer SLAs.

How I cleaned and engineered features

Data Cleaning

Converted all values to string type, stripped whitespace, replaced nan/null/None placeholders with "Unknown". Handled mixed-type columns that prevented numerical operations.

astype(str) · str.strip() · replace()
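
A minimal sketch of this cleaning step, assuming the text columns are listed explicitly. The helper name `clean_text_columns`, the placeholder list, and the sample column names are illustrative assumptions, not taken from the project code.

```python
import pandas as pd

# Strings treated as missing once values are cast to str (assumed list)
PLACEHOLDERS = ["nan", "null", "None", ""]

def clean_text_columns(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Cast columns to str, strip whitespace, map placeholders to 'Unknown'."""
    out = df.copy()
    for col in cols:
        out[col] = (
            out[col].astype(str)
            .str.strip()
            .replace(PLACEHOLDERS, "Unknown")
            .fillna("Unknown")  # covers pandas versions that keep missing values on astype(str)
        )
    return out

# Toy frame with hypothetical column names
raw = pd.DataFrame({
    "source_name": ["  Gurgaon_Bilaspur_HB ", None, "null"],
    "route_type": ["FTL", " Carting", float("nan")],
})
cleaned = clean_text_columns(raw, ["source_name", "route_type"])
print(cleaned)
```

Restricting the cast to named text columns leaves the numeric columns untouched, so later arithmetic on times and distances still works.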

Outlier Capping (IQR Method)

Applied IQR-based capping to all key numerical columns (delivery_time, osrm_time, distances). Prevented extreme outliers from distorting aggregations.

quantile() · np.where() · IQR

Feature Engineering — Time Gaps

Created derived features: gap between actual_time and osrm_time, gap between segment sum and total trip time, distance mismatch ratio.

Column arithmetic · ratio features
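
The segment-sum gap and the distance mismatch ratio could be built roughly as follows. This is a sketch on toy data: `trip_uuid`, `segment_actual_time` and `trip_total_time` are assumed column names, and the direction of the ratio (OSRM distance over recorded distance) is also an assumption.

```python
import pandas as pd

# Toy segment-level rows; column names other than
# actual_distance_to_destination and osrm_distance are assumptions.
segments = pd.DataFrame({
    "trip_uuid": ["t1", "t1", "t2", "t2", "t2"],
    "segment_actual_time": [30.0, 45.0, 20.0, 25.0, 15.0],
    "trip_total_time": [90.0, 90.0, 70.0, 70.0, 70.0],
    "actual_distance_to_destination": [10.0, 18.0, 8.0, 9.0, 5.0],
    "osrm_distance": [12.0, 20.0, 10.0, 12.0, 6.0],
})

# Roll segments up to one row per trip
trips = segments.groupby("trip_uuid").agg(
    segment_time_sum=("segment_actual_time", "sum"),
    trip_total_time=("trip_total_time", "first"),
    actual_distance=("actual_distance_to_destination", "sum"),
    osrm_distance=("osrm_distance", "sum"),
)

# Time that no segment accounts for
trips["inter_segment_gap"] = trips["trip_total_time"] - trips["segment_time_sum"]
# Ratio above 1 means the routed distance exceeds the recorded distance
trips["dist_mismatch_ratio"] = trips["osrm_distance"] / trips["actual_distance"]
print(trips)
```

On real data, trips with a zero recorded distance would need to be filtered or masked before taking the ratio.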

StandardScaler Normalisation

Scaled 9 numerical columns to standardise ranges before comparison analysis. Prevents high-magnitude columns (distance) from dominating low-magnitude ones (time ratios).

StandardScaler · fit_transform()
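
A short sketch of the scaling step on toy data. Only three columns are shown here; the full list of nine columns is not given in this write-up, so the selection below is illustrative.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy values; the real analysis scales 9 numerical columns
df = pd.DataFrame({
    "actual_time": [60.0, 120.0, 180.0],
    "osrm_time": [50.0, 90.0, 130.0],
    "osrm_distance": [10.0, 400.0, 790.0],
})
num_cols = ["actual_time", "osrm_time", "osrm_distance"]

scaler = StandardScaler()
# fit_transform returns a NumPy array, so wrap it back into a labelled frame
scaled = pd.DataFrame(
    scaler.fit_transform(df[num_cols]),
    columns=num_cols,
    index=df.index,
)
print(scaled.round(3))
```

Writing the result to a separate frame keeps the original units available for reporting figures such as average delivery days.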

Route & Hub Analysis

Aggregated by origin-destination pairs to find busiest corridors and longest average delivery routes. Identified Gurgaon as the dominant hub.

groupby · sort_values · nlargest()
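
One way to rank routes by average delivery time and to count how often each center appears as either endpoint. This is a sketch on toy data with placeholder center names, not the project's actual aggregation.

```python
import pandas as pd

# Toy trips; center names are placeholders
trips = pd.DataFrame({
    "source_center": ["A", "A", "B", "C", "A"],
    "destination_center": ["B", "B", "A", "A", "C"],
    "actual_time": [100.0, 120.0, 80.0, 300.0, 250.0],
})

# Volume and average time per origin-destination pair
route_stats = trips.groupby(["source_center", "destination_center"]).agg(
    n_trips=("actual_time", "count"),
    avg_time=("actual_time", "mean"),
)
longest_routes = route_stats.nlargest(3, "avg_time")

# Count each center once per trip endpoint (source or destination)
hub_counts = pd.concat(
    [trips["source_center"], trips["destination_center"]]
).value_counts()

print(longest_routes)
print(hub_counts)
```

Counting both endpoints is what lets a hub stand out even when it is rarely the origin.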

Python — feature_engineering.py

import numpy as np

# df is the shipment DataFrame loaded earlier in the script

# IQR-based outlier capping: clamp values to [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
def cap_outliers(df, col):
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    df[col] = np.where(df[col] < lower, lower,
              np.where(df[col] > upper, upper, df[col]))
    return df

# Engineered gap features (positive time_gap = slower than OSRM predicted)
df['time_gap'] = df['actual_time'] - df['osrm_time']
df['dist_gap'] = df['actual_distance_to_destination'] - df['osrm_distance']

# Busiest corridors: trip count per origin-destination pair, top 5
top_routes = (
    df.groupby(['source_center', 'destination_center'])
      .size()
      .sort_values(ascending=False)
      .head(5)
)

Where the gaps are

⏱️

Actual time consistently exceeds OSRM prediction

Most trips take longer than OSRM predicts. On the actual-vs-predicted scatter, points cluster on the side of the 45° diagonal where actual time exceeds predicted time, so the OSRM-based estimate routinely understates delivery time.
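
The size of this effect can be summarised with two numbers: the share of trips slower than predicted and the median actual-to-predicted ratio. A sketch on toy values, which are illustrative and not results from the dataset:

```python
import pandas as pd

# Toy values only; not figures from the Delhivery data
df = pd.DataFrame({
    "actual_time": [60.0, 120.0, 45.0, 200.0],
    "osrm_time": [50.0, 90.0, 50.0, 110.0],
})

# Fraction of trips where the actual time exceeds the OSRM estimate
share_slower = (df["actual_time"] > df["osrm_time"]).mean()
# Typical multiplicative overrun; above 1 means slower than predicted
median_ratio = (df["actual_time"] / df["osrm_time"]).median()

print(share_slower, median_ratio)
```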

📦

Segment sum ≠ total trip time

Total trip time is almost always higher than the sum of its segment times. Time is being lost between segments that tracking doesn't capture.

🏭

Gurgaon is the dominant hub

Gurgaon_Bilaspur_HB appears as source or destination in the most high-volume corridors. Bottlenecks here cascade across the network.

🪑

Furniture category is slowest

Avg 20.79 days for Furniture Office delivery — 65% slower than the overall average. Large items need logistics-specific routing improvements.

⚠️

Critical operational gap

The gap between segment-level tracking and total trip time means Delhivery's SLA forecasts are structurally optimistic. The fix: add inter-segment tracking to capture idle/wait time properly.
