Recommender Systems · Matrix Factorization · Completed

Zee Entertainment
Movie Recommender System

Building a movie recommendation engine using Pearson Correlation, Cosine Similarity, and SVD Matrix Factorization — evaluated on a 1 million+ rating dataset.

Type: Recommender Systems · Collaborative Filtering
Domain: OTT / Streaming
Dataset: 1M+ ratings · 6,040 users · MovieLens format
Tools: Python · Surprise · Scikit-learn · Pandas
Course: Scaler Academic Case Study
Ratings Processed: 1M+
SVD RMSE: 0.88
SVD MAPE: 27%
CF Methods Compared: 3
01 — Business Problem

What should a Zee subscriber watch next?

Zee Entertainment runs one of India's largest OTT platforms. Recommendation quality directly drives watch time and subscriber retention. A bad recommendation wastes a user's time and increases churn. The goal: build a recommendation system that surfaces genuinely relevant content from a catalogue of thousands of titles.

🎬
The dataset
MovieLens-format data with 1M+ ratings across 6,040 users and thousands of movies. Users have rated at least 20 movies each, with an average of 166 ratings per user — a relatively dense dataset for collaborative filtering.
02 — Dataset Profile

Who's in the data

Attribute | Value | Note
Total ratings | 1,000,209 | Dense dataset, good for CF
Unique users | 6,040 | All have rated 20+ movies
Avg ratings per user | 166 | High engagement
Max ratings per user | 2,314 | Some power users
Dominant gender | ~75% male | Gender-biased dataset
Top age group | 25–34 | Largest rating contributor
Dominant decade | 1990s movies | Dataset skews toward older films
Most common rating | 4 out of 5 | Positive bias in ratings
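Profile figures like these can be derived from the raw ratings table with a few pandas aggregations. The following is a minimal sketch, assuming a DataFrame with `UserID`, `MovieID`, and `Rating` columns (the column names used in the project's SVD code). The toy rows are invented for illustration and are not the project's data.

```python
import pandas as pd

# Toy ratings, invented for illustration only (not the project's data).
df = pd.DataFrame({
    "UserID":  [1, 1, 1, 2, 2, 3],
    "MovieID": [10, 20, 30, 10, 30, 20],
    "Rating":  [4, 5, 4, 3, 4, 2],
})

# Number of ratings contributed by each user.
ratings_per_user = df.groupby("UserID").size()

n_users = df["UserID"].nunique()
n_movies = df["MovieID"].nunique()

profile = {
    "total_ratings": len(df),
    "unique_users": n_users,
    "avg_ratings_per_user": ratings_per_user.mean(),
    "max_ratings_per_user": ratings_per_user.max(),
    "most_common_rating": df["Rating"].mode().iloc[0],
    # Share of user-movie pairs that actually have a rating.
    "density": len(df) / (n_users * n_movies),
}
print(profile)
```

The `density` entry is a useful companion to the per-user averages: a dataset can have highly engaged users and still leave most user-movie pairs unrated.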
03 — Three Recommendation Approaches

Building up from simple to complex

01
Pearson Correlation (User-User CF)
Computed Pearson correlation between user rating vectors. Recommended movies highly rated by similar users. Limited by sparsity: the correlation is noisy when two users share only a few co-rated movies.
corrwith() · sort_values()
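The user-user approach can be sketched with `corrwith()` and `sort_values()` on a small pivot table. This is an illustration under assumptions, not the project's code: the toy ratings are invented, the function names `similar_users` and `recommend` are hypothetical, and recommending from the single most similar user is a simplification.

```python
import pandas as pd

# Toy ratings, invented for illustration only.
ratings = pd.DataFrame({
    "UserID":  [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "MovieID": ["A", "B", "C", "A", "B", "C", "D", "A", "B", "C", "D"],
    "Rating":  [5, 3, 1, 4, 3, 2, 5, 1, 3, 5, 1],
})

# Rows are users, columns are movies; unrated cells stay NaN.
user_item = ratings.pivot_table(index="UserID", columns="MovieID", values="Rating")

def similar_users(user_item, target):
    """Pearson correlation of every other user with the target user."""
    by_user = user_item.T  # movies x users, so each column is one user
    sims = by_user.corrwith(by_user[target])  # uses co-rated movies only
    return sims.drop(target).dropna().sort_values(ascending=False)

def recommend(user_item, target, top_n=5):
    """Movies the most similar user rated that the target has not seen."""
    neighbour = similar_users(user_item, target).index[0]
    unseen = user_item.loc[target].isna()
    candidates = user_item.loc[neighbour][unseen].dropna()
    return candidates.sort_values(ascending=False).head(top_n)

print(similar_users(user_item, 1))
print(recommend(user_item, 1))
```

In this toy data user 1 and user 2 agree on every movie they both rated, so user 2's rating for movie D drives the recommendation. With only three co-rated movies the correlation is fragile, which is the sparsity problem described above.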
02
Cosine Similarity (Item-Item CF)
Built an item-item similarity matrix using cosine similarity. More robust than Pearson in this project: cosine similarity depends only on the angle between two rating vectors, not on their magnitude. Produces more consistent, genre-aligned recommendations.
cosine_similarity · pivot_table
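The item-item approach can be sketched with `pivot_table` and scikit-learn's `cosine_similarity`. This is a sketch under assumptions rather than the project's implementation: the toy ratings are invented, the helper `similar_items` is hypothetical, and filling unrated cells with 0 is one common choice among several.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy ratings, invented for illustration only.
ratings = pd.DataFrame({
    "UserID":  [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "MovieID": ["A", "B", "C", "A", "B", "C", "D", "A", "B", "C", "D"],
    "Rating":  [5, 3, 1, 4, 3, 2, 5, 1, 3, 5, 1],
})

# Rows are movies, columns are users; unrated cells become 0.
item_user = ratings.pivot_table(
    index="MovieID", columns="UserID", values="Rating", fill_value=0
)

# Square movie x movie matrix of cosine similarities.
sim = pd.DataFrame(
    cosine_similarity(item_user),
    index=item_user.index,
    columns=item_user.index,
)

def similar_items(sim, movie_id, top_n=3):
    """Most similar movies to movie_id, excluding the movie itself."""
    return sim[movie_id].drop(movie_id).sort_values(ascending=False).head(top_n)

print(similar_items(sim, "A"))
```

Because the similarity matrix depends only on the items, it can be computed once and reused for every user, which is one practical reason item-item CF is often preferred over user-user CF.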
03
SVD Matrix Factorization
Used the Surprise library's SVD algorithm, which decomposes the user-item matrix into latent factors. These factors capture hidden patterns (e.g., a user's taste for action films, even for titles they haven't rated). Best-performing model.
Surprise SVD · train_test_split · accuracy.rmse
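To make the latent-factor idea concrete, here is a minimal NumPy sketch of matrix factorization trained by stochastic gradient descent. It is a simplified stand-in, not Surprise's implementation: it omits the per-user and per-item bias terms that Surprise's SVD also learns, and the function names, hyperparameters, and toy ratings are all assumptions made for illustration.

```python
import numpy as np

def factorize(ratings, n_users, n_items, n_factors=2,
              n_epochs=500, lr=0.02, reg=0.02, seed=0):
    """Learn P (users x factors) and Q (items x factors) so that
    rating is approximated by global mean + P[u] . Q[i]."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0.0, 0.1, size=(n_users, n_factors))
    Q = rng.normal(0.0, 0.1, size=(n_items, n_factors))
    mu = np.mean([r for _, _, r in ratings])
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + P[u] @ Q[i])
            p_old = P[u].copy()  # keep the pre-update user vector
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return mu, P, Q

def rmse(ratings, mu, P, Q):
    """Root-mean-square error of the factor model on the given triples."""
    errs = [r - (mu + P[u] @ Q[i]) for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errs))))

# Toy (user index, item index, rating) triples, invented for illustration.
ratings = [(0, 0, 5), (0, 1, 3), (0, 2, 1),
           (1, 0, 4), (1, 1, 3), (1, 2, 2), (1, 3, 5),
           (2, 0, 1), (2, 1, 3), (2, 2, 5), (2, 3, 1)]

mu, P, Q = factorize(ratings, n_users=3, n_items=4)

baseline = rmse(ratings, mu, np.zeros_like(P), np.zeros_like(Q))  # mean only
fitted = rmse(ratings, mu, P, Q)
print(f"mean-only RMSE: {baseline:.3f}, factor-model RMSE: {fitted:.3f}")

# Predicted rating for a pair that was never observed (user 0, item 3).
print(mu + P[0] @ Q[3])
```

The last line shows the point of the decomposition: once every user and item has a factor vector, a score exists for any pair, including pairs with no rating.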
Python — svd_recommender.py
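import numpy as np  # needed for the MAPE calculation below
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy

# Load data in Surprise format.
# df is the ratings DataFrame with UserID, MovieID and Rating columns.
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['UserID', 'MovieID', 'Rating']], reader)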

trainset, testset = train_test_split(data, test_size=0.2)

svd = SVD()
svd.fit(trainset)

predictions = svd.test(testset)
rmse = accuracy.rmse(predictions)  # RMSE ≈ 0.88

# MAPE calculation
mape = np.mean([abs((p.r_ui - p.est)/p.r_ui) for p in predictions if p.r_ui != 0])
# MAPE ≈ 0.27 → 27% average prediction error
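Because MAPE divides each error by the true rating, the same one-star miss costs more on a low rating than on a high one. That is worth keeping in mind when reading RMSE and MAPE side by side. A small worked example with invented numbers (not the project's predictions):

```python
import numpy as np

# Three hypothetical predictions, each off by exactly one star.
actual = np.array([5.0, 4.0, 1.0])
predicted = np.array([4.0, 3.0, 2.0])

errors = actual - predicted

# RMSE treats all three misses the same.
rmse = np.sqrt(np.mean(errors ** 2))

# MAPE scales each miss by the true rating: 1/5, 1/4 and 1/1.
per_rating = np.abs(errors) / actual
mape = per_rating.mean()

print(rmse, per_rating, mape)
```

Here the RMSE is exactly one star, while the miss on the 1-star movie alone contributes a 100% error to the MAPE average.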
04 — Model Comparison

Which approach works best

Method | RMSE | MAPE | Quality
Pearson Correlation | High | High | Weak: sparse data problem
Cosine Similarity | Medium | Medium | Better: genre-consistent recs
SVD Matrix Factorization | 0.88 | 0.27 | Best: latent factor learning
SVD is production-ready
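An RMSE of 0.88 on a 1–5 scale means the root-mean-square prediction error is just under one star. A MAPE of 27% means predictions deviate from the actual rating by about 27% on average, which is acceptable given that most user-movie pairs are unrated. Of the three methods compared, SVD should be the deployed model.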
05 — Tech Stack
Python 3 · Pandas · Scikit-learn · Surprise Library · SVD · Cosine Similarity · Pearson Correlation