Original ProjectSQLCompleted

Ride-Sharing
Analytics Database

A full SQL analytics project simulating an Uber/Ola-style ride-hailing database — built on 50,000 real NYC Yellow Taxi trips enriched with synthetic drivers, riders, and surge pricing tables.

TypeSQL · Data Engineering · Analytics

DomainRide-Hailing / Transportation

DatasetNYC Yellow Taxi Jan 2023 · 50K rows

ToolsSQL · Python · DuckDB · SQLite · Faker

CategoryOriginal Project

50K

Real Trip Records

200

Synthetic Drivers

2,000

Synthetic Riders

25+

SQL Queries Written

01 — The Idea

Build a real analytics DB from scratch.

Most SQL portfolio projects use a pre-built dataset and run a few queries. This one goes further — the database itself was designed and built from scratch, with a proper schema, foreign key relationships, integrity checks, and a multi-layer analytics pipeline on top of it.

The base data is the NYC Yellow Taxi dataset (January 2023) — real public trip records downloaded from the NYC TLC. On top of that, synthetic tables for Drivers, Riders, and Surge Pricing were generated to simulate the full data model of a ride-hailing company.

🚕

Why NYC Taxi data

It's real, public, and messy in all the right ways — missing passengers, zero fares, outstation trips, reversed timestamps. Cleaning it and building a proper database on top of it is exactly the kind of work a data engineer does in practice.

02 — Database Schema

4 tables. One clean data model.

Table	Source	Rows	Key Fields
trips_clean	NYC TLC (real)	50,000	trip_id, driver_id, rider_id, pickup_ts, fare_amount, flags
drivers	Synthetic (Faker)	200	driver_id, name, join_date, rating
riders	Synthetic (Faker)	2,000	rider_id, name, join_date
surge_pricing	Synthetic rules	5	surge_id, hour_range, multiplier

SQL — schema.sql
-- trips_clean: real NYC taxi trips + engineered flags
CREATE TABLE trips_clean (
    trip_id          INTEGER PRIMARY KEY AUTOINCREMENT,
    pickup_ts        TIMESTAMP,
    dropoff_ts       TIMESTAMP,
    passenger_count  INTEGER,
    trip_distance    REAL,
    pu_loc           INTEGER,
    do_loc           INTEGER,
    fare_amount      REAL,
    total_amount     REAL,
    driver_id        INTEGER,
    rider_id         INTEGER,
    -- Engineered flags
    cancellation_flag INTEGER,
    no_show_flag     INTEGER,
    is_free          INTEGER,
    outstation_flag  INTEGER,
    FOREIGN KEY (driver_id) REFERENCES drivers(driver_id),
    FOREIGN KEY (rider_id) REFERENCES riders(rider_id)
);

03 — Data Pipeline

From raw Parquet to clean database

Load Raw Data via DuckDB

Loaded NYC Yellow Taxi Jan 2023 Parquet file directly from TLC's CloudFront URL using DuckDB — no manual download required. Sampled 200K rows as staging data.

DuckDB · read_parquet() · CloudFront URL

Clean & Engineer Flags

Applied business logic to create 5 flag columns: cancellation_flag (distance=0 or passengers=0), no_show_flag, is_free (fare=0), is_adjustment (negative fares), outstation_flag. Marked invalid records (dropoff before pickup).

Pandas · Boolean masking · Flag engineering

Sample 50K Clean Rows

Filtered out invalid records and sampled exactly 50,000 clean rows. Saved to CSV and loaded into SQLite.

df.sample() · sqlite3 · to_sql()

Generate Synthetic Tables

Used Python Faker library to generate 200 realistic drivers with names, join dates, and ratings. Generated 2,000 riders. Created 5 surge pricing rules. Assigned driver_id and rider_id to each trip using Zipf-weighted random assignment — simulating how some drivers take many more trips than others.

Faker · numpy Zipf weights · random.seed(42)

Write SQL Queries — 25+

Organised queries into 4 sections: data integrity checks, core KPIs, advanced analytics, and research-style analyses. Saved to queries.sql with a companion insights markdown file.

SQLite · GROUP BY · JOINs · CTEs · Window Functions

Python — data_pipeline.py
import duckdb, pandas as pd, numpy as np, sqlite3
from faker import Faker

# Step 1: Load from NYC TLC CloudFront (no manual download)
url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet"
df_stage = duckdb.query(f"SELECT * FROM read_parquet('{url}') LIMIT 200000").to_df()

# Step 2: Engineer flags
df['cancellation_flag'] = ((df['trip_distance']==0)|(df['passenger_count']==0)).astype(int)
df['outstation_flag']   = (df['trip_distance'] > 50).astype(int)

# Step 3: Synthetic drivers with Zipf workload distribution
weights = np.random.zipf(1.5, 200).astype(float)
weights /= weights.sum()
df['driver_id'] = np.random.choice(range(1,201), size=len(df), p=weights)

# Step 4: Load to SQLite
conn = sqlite3.connect("taxi_sample.db")
df.to_sql("trips_clean", conn, if_exists="replace", index=False)

04 — Analytics — 4 Sections

25+ queries across 4 analytical layers

Section	Queries	Focus
Data Integrity	6 queries	Orphan trips, future timestamps, null IDs, zero-fare anomalies
Core KPIs	7 queries	Daily revenue, avg fare per km, trips per hour, top drivers, airport vs downtown
Advanced Analytics	7 queries	Surge pricing impact, cancellation patterns, loyalty riders, payment type breakdown
Research Analyses	5 queries	Driver productivity (earnings/hour), cohort retention, Gini coefficient, surge elasticity

SQL — advanced_analytics.sql
-- Loyalty riders: same rider + same driver, 3+ trips
SELECT rider_id, driver_id, COUNT(*) AS trips_together
FROM trips_clean
GROUP BY rider_id, driver_id
HAVING trips_together >= 3
ORDER BY trips_together DESC;

-- Gini coefficient for driver income inequality
WITH earnings AS (
    SELECT driver_id, SUM(total_amount) AS total_earned
    FROM trips_clean GROUP BY driver_id
),
ranked AS (
    SELECT total_earned,
           ROW_NUMBER() OVER (ORDER BY total_earned) AS rank_n,
           COUNT(*) OVER () AS total_n
    FROM earnings
)
SELECT
    1 - 2.0 * SUM((total_n - rank_n + 0.5) * total_earned)
    / (total_n * SUM(total_earned)) AS gini_coefficient
FROM ranked;
-- Result: Gini ~0.62 → high income inequality among drivers

05 — Key Insights

What the data revealed

💰

Daily revenue: $450K–$550K

With average fares of $28–31, the dataset shows stable daily revenue patterns with weekend spikes — consistent with real Uber/Ola revenue dynamics.

✈️

Airport zone (132) earns 3× more

Average fare from JFK/LaGuardia zones is ~$75 vs ~$20 for downtown Manhattan. Airport trips are the high-value segment.

📉

Gini coefficient ~0.62

High income inequality among drivers — top 20% of drivers earn 65%+ of total revenue. Same pattern observed in real gig economy platforms.

🔁

Loyalty riders are rare

Only a small fraction of riders take 3+ trips with the same driver — most rides are transactional, not relationship-driven.

⚡

Surge pricing boosts revenue 18–23%

Peak hour multipliers (1.5–2.0×) during morning and evening rush significantly impact total revenue without meaningfully reducing trip volume.

❌

878 zero-passenger trips flagged

Data integrity checks caught 878 trips with no passengers — likely cancellations that weren't properly marked in the source data.

06 — Project Structure

How it's organised

Folder	Contents
data/	yellow_2023_01_sample_with_flags_enriched.csv — 50K cleaned trips with flags and synthetic IDs
sql/	schema.sql — table definitions · queries.sql — all 25+ analysis queries
docs/	queries_and_insights.md — every query with its business insight explained
notebooks/	eda_and_loading.ipynb — data cleaning, flag engineering, synthetic table generation

07 — Tech Stack

SQLPython 3DuckDBSQLitePandasFakerNumPyNYC TLC DataZipf Distribution

← Back to Projects View on GitHub ↗

Ride-SharingAnalytics Database