01 — Problem Statement
What should Netflix produce more of?
Netflix has a content library of over 8,800 titles spread across 190+ countries. The business question: which content types, genres, and regions are driving growth, and where should the platform focus its production investment? This project uses exploratory data analysis to answer exactly that.
💡 Why this matters
Content strategy decisions at streaming platforms are worth billions. A wrong bet — wrong genre, wrong market, wrong format — wastes production budgets. EDA on historical content data is one of the clearest ways to surface what's working.
02 — Dataset Overview
What the data contains
The Netflix titles dataset covers every title available on the platform as of mid-2021 — 8,807 entries spanning release years from 1925 to 2021.
| Column | Description | Missing % |
| --- | --- | --- |
| type | Movie or TV Show | 0% |
| title | Content title | 0% |
| director | Director name | 30% |
| cast | Actor list | 10% |
| country | Country of origin | 9% |
| date_added | When added to Netflix | 0.1% |
| rating | Content rating (TV-MA, PG, etc.) | 0.01% |
| duration | Minutes (Movies) or Seasons (TV) | 0.01% |
| listed_in | Genre tags | 0% |
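The Missing % column can be reproduced with a one-line pandas expression. A minimal sketch, assuming the data is already in a DataFrame; the `sample` frame below is a made-up stand-in, not rows from the real dataset.

```python
import pandas as pd

# Hypothetical four-row sample standing in for the real dataset
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "director": ["A. Director", None, None, "B. Director"],
    "country": ["United States", "India", None, "Japan"],
})

# isna() marks missing cells; the column mean is the missing share
missing_pct = (sample.isna().mean() * 100).round(2)
print(missing_pct.sort_values(ascending=False))
```

On the full dataset the same expression would be applied to `df` instead of `sample`.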
03 — Methodology
How I approached it
01
Data Loading & Type Fixing
Loaded the CSV, converted date_added to datetime, and extracted year_added and month_added. Parsed duration into numeric and unit columns. Cast low-cardinality columns such as type and rating to the category dtype.
pd.to_datetime() · str.extract() · astype('category')
02
Missing Value Treatment
Director (30% missing) and cast (10% missing) were filled with "Unknown". Country, rating, and duration were filled with their respective modes. No rows were dropped.
fillna() · mode()
03
Univariate Analysis
Distribution of release years (1925–2021). Type split: 70% Movies vs 30% TV Shows. Rating distribution: TV-MA dominates. Duration: movies cluster at 90–100 min.
value_counts() · histplot() · countplot()
04
Bivariate & Time Analysis
Content added per year (spike post-2016, peak 2019). Movies vs TV Shows trend over time. Country-wise production breakdown.
countplot(hue) · groupby() · barplot()
05
Business Insight Extraction
Cross-referenced content type, genre, country, and rating to identify strategic patterns for content investment recommendations.
Correlation heatmap · Multi-axis analysis
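The missing-value treatment described in step 02 might look roughly like the following. This is a sketch under assumptions: the `sample` frame is invented for illustration, and the column groupings simply mirror the description above rather than the original notebook.

```python
import pandas as pd

# Hypothetical sample with gaps in the same columns the write-up mentions
sample = pd.DataFrame({
    "director": ["A. Director", None, "B. Director"],
    "cast": [None, "Actor One, Actor Two", None],
    "country": ["India", "India", None],
    "rating": ["TV-MA", None, "TV-MA"],
})

filled = sample.copy()

# Free-text columns: a placeholder keeps the row without inventing a value
for col in ["director", "cast"]:
    filled[col] = filled[col].fillna("Unknown")

# Categorical columns: fill with the most frequent value (the mode)
for col in ["country", "rating"]:
    filled[col] = filled[col].fillna(filled[col].mode()[0])

print(filled)
```

Filling country with the mode keeps every row but nudges the top country's count upward, which is worth remembering when reading the country breakdown.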
04 — Key Code
The analysis
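import pandas as pd
# Load the raw CSV (file name assumed; adjust to your local copy)
df = pd.read_csv('netflix_titles.csv')
# Parse dates and extract time features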
df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), errors='coerce')
df['year_added'] = df['date_added'].dt.year
df['month_added'] = df['date_added'].dt.month_name()
# Split duration into number + unit
df[['duration_num', 'duration_unit']] = df['duration'].str.extract(r'(\d+)\s*(\w+)')
df['duration_num'] = pd.to_numeric(df['duration_num'], errors='coerce')
# Type distribution
print(df['type'].value_counts())
# Movie 6131 (~70%)
# TV Show 2676 (~30%)
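To get the percentage split directly instead of working it out from raw counts, value_counts() accepts normalize=True. A small sketch on a made-up ten-row frame (the `sample` data is hypothetical and chosen only to make the arithmetic obvious):

```python
import pandas as pd

# Hypothetical sample: 7 movies and 3 TV shows
sample = pd.DataFrame({"type": ["Movie"] * 7 + ["TV Show"] * 3})

# normalize=True returns fractions; scale to percentages and round
type_share = sample["type"].value_counts(normalize=True).mul(100).round(1)
print(type_share)
```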
[Figure: Content added per year]
Content additions peaked in 2019, Netflix's most aggressive expansion year.
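The per-year counts behind a chart like this can be derived with a groupby on year_added and type. A minimal sketch, assuming date_added strings follow a "Month D, YYYY" pattern; the `sample` rows are invented for illustration and the explicit format string is my assumption, not something the notebook is known to use.

```python
import pandas as pd

# Hypothetical rows; dates are made up
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "Movie"],
    "date_added": ["September 25, 2021", "January 1, 2019", " July 6, 2019",
                   "March 3, 2019", "December 15, 2018", "June 2, 2018"],
})

# Strip stray whitespace, then parse; unparseable values become NaT
sample["date_added"] = pd.to_datetime(
    sample["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)
sample["year_added"] = sample["date_added"].dt.year

# Rows = year, columns = type, values = number of titles added
per_year = sample.groupby(["year_added", "type"]).size().unstack(fill_value=0)
peak_year = per_year.sum(axis=1).idxmax()
print(per_year)
print("Peak year:", peak_year)
```

The resulting `per_year` table can be passed straight to a plotting call such as `per_year.plot(kind="bar")` to draw the movies versus TV shows trend.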
05 — Key Findings
What the data revealed
🎬 Movies dominate at 70%
Netflix is primarily a movie platform by title count. TV Show growth accelerated post-2018, but movies still lead by more than 2:1.
🌏 USA & India are top content markets
US leads production significantly, with India as the fastest-growing content market — critical for subscriber growth in South Asia.
📅 2019 was the content peak
Netflix added the most titles in 2019, just before the pandemic shifted strategy toward original productions.
⏱️Sweet spot: 90–100 minute movies
Movies cluster most tightly at 90–100 minutes, and the broader 90–120 minute band accounts for most of the catalogue. Very short (<60 min) and very long (>150 min) movies are rare.
📺TV-MA is the dominant rating
Mature content dominates the catalogue, which suggests Netflix skews toward adult audiences rather than family-friendly content.
🎭 Drama + International = biggest genres
Dramas, International Movies, and Comedies are the top 3 genres by catalogue size.
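Because listed_in holds several comma-separated genre tags per title, counting genres requires splitting the field first. One way to do it, sketched on invented rows (the `sample` values are hypothetical):

```python
import pandas as pd

# Hypothetical genre strings in the same comma-separated style
sample = pd.DataFrame({
    "listed_in": [
        "Dramas, International Movies",
        "Comedies, Dramas",
        "International Movies, Comedies, Dramas",
        "Documentaries",
    ]
})

# Split into lists, give each tag its own row, trim spaces, then count
genre_counts = (
    sample["listed_in"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
print(genre_counts)
```

Note that a title with three tags contributes to three genre counts, so the totals add up to more than the number of titles.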
📊 Range summary
Release years span 1925–2021. Top countries: US, India, UK, Japan, South Korea. Most TV Shows have only 1–2 seasons — short-run series dominate the format.
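The claim about short-run series can be checked by reusing the parsed duration number for TV Show rows only. A hedged sketch with made-up rows; `short_run_share` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical rows mixing TV shows and one movie
sample = pd.DataFrame({
    "type": ["TV Show", "TV Show", "TV Show", "TV Show", "Movie"],
    "duration": ["1 Season", "2 Seasons", "1 Season", "5 Seasons", "94 min"],
})

# Same regex as the key code: leading number, then the unit word
parts = sample["duration"].str.extract(r"(\d+)\s*(\w+)")
sample["duration_num"] = pd.to_numeric(parts[0], errors="coerce")

# Share of TV shows with at most two seasons
tv = sample[sample["type"] == "TV Show"]
short_run_share = (tv["duration_num"] <= 2).mean()
print(f"{short_run_share:.0%} of sample TV shows have 1-2 seasons")
```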
06 — Recommendations
Business recommendations
01
Double down on India
India is the second-largest content producer and the fastest-growing content market. More Indian originals should support subscriber growth there.
02
Invest in limited series format
Shows with 1–2 seasons already dominate the TV catalogue. Fewer episodes typically mean lower production cost and faster release cycles.
03
Fix metadata quality
30% of titles have no director data. Poor metadata means weak search, weak discovery, weak recommendations.
04
Target 90–120 min runtime for movies
This is where the catalogue concentrates, which makes it the likeliest viewer sweet spot. Avoid commissioning very long or very short films unless the content justifies it.
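The runtime bands referred to here can be made explicit by bucketing movie durations with pd.cut. A minimal sketch; the `minutes` values and the bin edges are assumptions chosen for illustration, not figures from the dataset.

```python
import pandas as pd

# Hypothetical movie runtimes in minutes
minutes = pd.Series([45, 88, 92, 95, 99, 104, 118, 131, 162], name="duration_num")

# Left-closed bins: [0, 60), [60, 90), [90, 120), [120, 150), [150, inf)
bins = [0, 60, 90, 120, 150, float("inf")]
labels = ["<60", "60-89", "90-119", "120-149", "150+"]
runtime_bins = pd.cut(minutes, bins=bins, labels=labels, right=False)

# Count titles per band, kept in band order
bin_counts = runtime_bins.value_counts().sort_index()
print(bin_counts)
```

Applied to the movie rows of the full dataset, the same binning would show how strongly the 90–120 minute band dominates.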
07 — Tech Stack
Tools used
Python 3 · Pandas · NumPy · Matplotlib · Seaborn · Google Colab