A full SQL analytics project simulating an Uber/Ola-style ride-hailing database — built on 50,000 real NYC Yellow Taxi trips enriched with synthetic drivers, riders, and surge pricing tables.
Most SQL portfolio projects use a pre-built dataset and run a few queries. This one goes further — the database itself was designed and built from scratch, with a proper schema, foreign key relationships, integrity checks, and a multi-layer analytics pipeline on top of it.
The base data is the NYC Yellow Taxi dataset (January 2023) — real public trip records downloaded from the NYC TLC. On top of that, synthetic tables for Drivers, Riders, and Surge Pricing were generated to simulate the full data model of a ride-hailing company.
| Table | Source | Rows | Key Fields |
|---|---|---|---|
| trips_clean | NYC TLC (real) | 50,000 | trip_id, driver_id, rider_id, pickup_ts, fare_amount, flags |
| drivers | Synthetic (Faker) | 200 | driver_id, name, join_date, rating |
| riders | Synthetic (Faker) | 2,000 | rider_id, name, join_date |
| surge_pricing | Synthetic rules | 5 | surge_id, hour_range, multiplier |
-- trips_clean: real NYC taxi trips + engineered flags CREATE TABLE trips_clean ( trip_id INTEGER PRIMARY KEY AUTOINCREMENT, pickup_ts TIMESTAMP, dropoff_ts TIMESTAMP, passenger_count INTEGER, trip_distance REAL, pu_loc INTEGER, do_loc INTEGER, fare_amount REAL, total_amount REAL, driver_id INTEGER, rider_id INTEGER, -- Engineered flags cancellation_flag INTEGER, no_show_flag INTEGER, is_free INTEGER, outstation_flag INTEGER, FOREIGN KEY (driver_id) REFERENCES drivers(driver_id), FOREIGN KEY (rider_id) REFERENCES riders(rider_id) );
import duckdb, pandas as pd, numpy as np, sqlite3 from faker import Faker # Step 1: Load from NYC TLC CloudFront (no manual download) url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet" df_stage = duckdb.query(f"SELECT * FROM read_parquet('{url}') LIMIT 200000").to_df() # Step 2: Engineer flags df['cancellation_flag'] = ((df['trip_distance']==0)|(df['passenger_count']==0)).astype(int) df['outstation_flag'] = (df['trip_distance'] > 50).astype(int) # Step 3: Synthetic drivers with Zipf workload distribution weights = np.random.zipf(1.5, 200).astype(float) weights /= weights.sum() df['driver_id'] = np.random.choice(range(1,201), size=len(df), p=weights) # Step 4: Load to SQLite conn = sqlite3.connect("taxi_sample.db") df.to_sql("trips_clean", conn, if_exists="replace", index=False)
| Section | Queries | Focus |
|---|---|---|
| Data Integrity | 6 queries | Orphan trips, future timestamps, null IDs, zero-fare anomalies |
| Core KPIs | 7 queries | Daily revenue, avg fare per km, trips per hour, top drivers, airport vs downtown |
| Advanced Analytics | 7 queries | Surge pricing impact, cancellation patterns, loyalty riders, payment type breakdown |
| Research Analyses | 5 queries | Driver productivity (earnings/hour), cohort retention, Gini coefficient, surge elasticity |
-- Loyalty riders: same rider + same driver, 3+ trips SELECT rider_id, driver_id, COUNT(*) AS trips_together FROM trips_clean GROUP BY rider_id, driver_id HAVING trips_together >= 3 ORDER BY trips_together DESC; -- Gini coefficient for driver income inequality WITH earnings AS ( SELECT driver_id, SUM(total_amount) AS total_earned FROM trips_clean GROUP BY driver_id ), ranked AS ( SELECT total_earned, ROW_NUMBER() OVER (ORDER BY total_earned) AS rank_n, COUNT(*) OVER () AS total_n FROM earnings ) SELECT 1 - 2.0 * SUM((total_n - rank_n + 0.5) * total_earned) / (total_n * SUM(total_earned)) AS gini_coefficient FROM ranked; -- Result: Gini ~0.62 → high income inequality among drivers
| Folder | Contents |
|---|---|
| data/ | yellow_2023_01_sample_with_flags_enriched.csv — 50K cleaned trips with flags and synthetic IDs |
| sql/ | schema.sql — table definitions · queries.sql — all 25+ analysis queries |
| docs/ | queries_and_insights.md — every query with its business insight explained |
| notebooks/ | eda_and_loading.ipynb — data cleaning, flag engineering, synthetic table generation |