Home About Expertise Projects Blogs Contact
Original ProjectSQLCompleted

Ride-Sharing
Analytics Database

A full SQL analytics project simulating an Uber/Ola-style ride-hailing database — built on 50,000 real NYC Yellow Taxi trips enriched with synthetic drivers, riders, and surge pricing tables.

TypeSQL · Data Engineering · Analytics
DomainRide-Hailing / Transportation
DatasetNYC Yellow Taxi Jan 2023 · 50K rows
ToolsSQL · Python · DuckDB · SQLite · Faker
CategoryOriginal Project
50K
Real Trip Records
200
Synthetic Drivers
2,000
Synthetic Riders
25+
SQL Queries Written
01 — The Idea

Build a real analytics DB from scratch.

Most SQL portfolio projects use a pre-built dataset and run a few queries. This one goes further — the database itself was designed and built from scratch, with a proper schema, foreign key relationships, integrity checks, and a multi-layer analytics pipeline on top of it.

The base data is the NYC Yellow Taxi dataset (January 2023) — real public trip records downloaded from the NYC TLC. On top of that, synthetic tables for Drivers, Riders, and Surge Pricing were generated to simulate the full data model of a ride-hailing company.

🚕
Why NYC Taxi data
It's real, public, and messy in all the right ways — missing passengers, zero fares, outstation trips, reversed timestamps. Cleaning it and building a proper database on top of it is exactly the kind of work a data engineer does in practice.
02 — Database Schema

4 tables. One clean data model.

TableSourceRowsKey Fields
trips_cleanNYC TLC (real)50,000trip_id, driver_id, rider_id, pickup_ts, fare_amount, flags
driversSynthetic (Faker)200driver_id, name, join_date, rating
ridersSynthetic (Faker)2,000rider_id, name, join_date
surge_pricingSynthetic rules5surge_id, hour_range, multiplier
SQL — schema.sql
-- trips_clean: real NYC taxi trips + engineered flags
CREATE TABLE trips_clean (
    trip_id          INTEGER PRIMARY KEY AUTOINCREMENT,
    pickup_ts        TIMESTAMP,
    dropoff_ts       TIMESTAMP,
    passenger_count  INTEGER,
    trip_distance    REAL,
    pu_loc           INTEGER,
    do_loc           INTEGER,
    fare_amount      REAL,
    total_amount     REAL,
    driver_id        INTEGER,
    rider_id         INTEGER,
    -- Engineered flags
    cancellation_flag INTEGER,
    no_show_flag     INTEGER,
    is_free          INTEGER,
    outstation_flag  INTEGER,
    FOREIGN KEY (driver_id) REFERENCES drivers(driver_id),
    FOREIGN KEY (rider_id) REFERENCES riders(rider_id)
);
03 — Data Pipeline

From raw Parquet to clean database

01
Load Raw Data via DuckDB
Loaded NYC Yellow Taxi Jan 2023 Parquet file directly from TLC's CloudFront URL using DuckDB — no manual download required. Sampled 200K rows as staging data.
DuckDB · read_parquet() · CloudFront URL
02
Clean & Engineer Flags
Applied business logic to create 5 flag columns: cancellation_flag (distance=0 or passengers=0), no_show_flag, is_free (fare=0), is_adjustment (negative fares), outstation_flag. Marked invalid records (dropoff before pickup).
Pandas · Boolean masking · Flag engineering
03
Sample 50K Clean Rows
Filtered out invalid records and sampled exactly 50,000 clean rows. Saved to CSV and loaded into SQLite.
df.sample() · sqlite3 · to_sql()
04
Generate Synthetic Tables
Used Python Faker library to generate 200 realistic drivers with names, join dates, and ratings. Generated 2,000 riders. Created 5 surge pricing rules. Assigned driver_id and rider_id to each trip using Zipf-weighted random assignment — simulating how some drivers take many more trips than others.
Faker · numpy Zipf weights · random.seed(42)
05
Write SQL Queries — 25+
Organised queries into 4 sections: data integrity checks, core KPIs, advanced analytics, and research-style analyses. Saved to queries.sql with a companion insights markdown file.
SQLite · GROUP BY · JOINs · CTEs · Window Functions
Python — data_pipeline.py
import duckdb, pandas as pd, numpy as np, sqlite3
from faker import Faker

# Step 1: Load from NYC TLC CloudFront (no manual download)
url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet"
df_stage = duckdb.query(f"SELECT * FROM read_parquet('{url}') LIMIT 200000").to_df()

# Step 2: Engineer flags
df['cancellation_flag'] = ((df['trip_distance']==0)|(df['passenger_count']==0)).astype(int)
df['outstation_flag']   = (df['trip_distance'] > 50).astype(int)

# Step 3: Synthetic drivers with Zipf workload distribution
weights = np.random.zipf(1.5, 200).astype(float)
weights /= weights.sum()
df['driver_id'] = np.random.choice(range(1,201), size=len(df), p=weights)

# Step 4: Load to SQLite
conn = sqlite3.connect("taxi_sample.db")
df.to_sql("trips_clean", conn, if_exists="replace", index=False)
04 — Analytics — 4 Sections

25+ queries across 4 analytical layers

SectionQueriesFocus
Data Integrity6 queriesOrphan trips, future timestamps, null IDs, zero-fare anomalies
Core KPIs7 queriesDaily revenue, avg fare per km, trips per hour, top drivers, airport vs downtown
Advanced Analytics7 queriesSurge pricing impact, cancellation patterns, loyalty riders, payment type breakdown
Research Analyses5 queriesDriver productivity (earnings/hour), cohort retention, Gini coefficient, surge elasticity
SQL — advanced_analytics.sql
-- Loyalty riders: same rider + same driver, 3+ trips
SELECT rider_id, driver_id, COUNT(*) AS trips_together
FROM trips_clean
GROUP BY rider_id, driver_id
HAVING trips_together >= 3
ORDER BY trips_together DESC;

-- Gini coefficient for driver income inequality
WITH earnings AS (
    SELECT driver_id, SUM(total_amount) AS total_earned
    FROM trips_clean GROUP BY driver_id
),
ranked AS (
    SELECT total_earned,
           ROW_NUMBER() OVER (ORDER BY total_earned) AS rank_n,
           COUNT(*) OVER () AS total_n
    FROM earnings
)
SELECT
    1 - 2.0 * SUM((total_n - rank_n + 0.5) * total_earned)
    / (total_n * SUM(total_earned)) AS gini_coefficient
FROM ranked;
-- Result: Gini ~0.62 → high income inequality among drivers
05 — Key Insights

What the data revealed

💰
Daily revenue: $450K–$550K
With average fares of $28–31, the dataset shows stable daily revenue patterns with weekend spikes — consistent with real Uber/Ola revenue dynamics.
✈️
Airport zone (132) earns 3× more
Average fare from JFK/LaGuardia zones is ~$75 vs ~$20 for downtown Manhattan. Airport trips are the high-value segment.
📉
Gini coefficient ~0.62
High income inequality among drivers — top 20% of drivers earn 65%+ of total revenue. Same pattern observed in real gig economy platforms.
🔁
Loyalty riders are rare
Only a small fraction of riders take 3+ trips with the same driver — most rides are transactional, not relationship-driven.
Surge pricing boosts revenue 18–23%
Peak hour multipliers (1.5–2.0×) during morning and evening rush significantly impact total revenue without meaningfully reducing trip volume.
878 zero-passenger trips flagged
Data integrity checks caught 878 trips with no passengers — likely cancellations that weren't properly marked in the source data.
06 — Project Structure

How it's organised

FolderContents
data/yellow_2023_01_sample_with_flags_enriched.csv — 50K cleaned trips with flags and synthetic IDs
sql/schema.sql — table definitions · queries.sql — all 25+ analysis queries
docs/queries_and_insights.md — every query with its business insight explained
notebooks/eda_and_loading.ipynb — data cleaning, flag engineering, synthetic table generation
07 — Tech Stack
SQLPython 3DuckDBSQLitePandasFakerNumPyNYC TLC DataZipf Distribution
← Back to Projects View on GitHub ↗