Zee Entertainment Movie Recommender

Zee Entertainment
Movie Recommender System

Building a movie recommendation engine using Pearson Correlation, Cosine Similarity, and SVD Matrix Factorization, evaluated on a dataset of more than 1 million ratings.

Type: Recommender Systems · Collaborative Filtering

Domain: OTT / Streaming

Dataset: 1M+ ratings · 6,040 users · MovieLens format

Tools: Python · Surprise · Scikit-learn · Pandas

Course: Scaler Academic Case Study

What should a Zee subscriber watch next?

Zee Entertainment runs one of India's largest OTT platforms. Recommendation quality directly drives watch time and subscriber retention. A bad recommendation wastes a user's time and increases churn. The goal: build a recommendation system that surfaces genuinely relevant content from a catalogue of thousands of titles.

The dataset

MovieLens-format data with 1M+ ratings across 6,040 users and thousands of movies. Users have rated at least 20 movies each, with an average of 166 ratings per user — a relatively dense dataset for collaborative filtering.

Who's in the data

Attribute	Value	Note
Total ratings	1,000,209	Dense dataset — good for CF
Unique users	6,040	All have rated 20+ movies
Avg ratings per user	166	High engagement
Max ratings per user	2,314	Some power users
Dominant gender	~75% male	Gender-biased dataset
Top age group	25–34	Largest rating contributor
Dominant decade	1990s movies	Dataset skews older films
Most common rating	4 out of 5	Positive bias in ratings
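
The headline figures in this table come down to a few pandas aggregations. A minimal sketch, assuming the ratings sit in a DataFrame with the UserID, MovieID and Rating columns used later in this case study; the tiny inline frame and the helper name profile_ratings are hypothetical stand-ins, not the original analysis code.

```python
import pandas as pd

# Hypothetical toy ratings; the real dataset has 1,000,209 rows.
ratings = pd.DataFrame({
    "UserID":  [1, 1, 1, 2, 2, 3],
    "MovieID": [10, 20, 30, 10, 30, 20],
    "Rating":  [4, 5, 3, 4, 2, 4],
})

def profile_ratings(df):
    """Compute the kind of summary statistics listed in the table."""
    per_user = df.groupby("UserID")["Rating"].count()
    n_users = df["UserID"].nunique()
    n_movies = df["MovieID"].nunique()
    return {
        "total_ratings": len(df),
        "unique_users": n_users,
        "avg_ratings_per_user": per_user.mean(),
        "max_ratings_per_user": per_user.max(),
        "most_common_rating": df["Rating"].mode().iloc[0],
        # Share of the user x movie grid that actually holds a rating
        "density": len(df) / (n_users * n_movies),
    }

stats = profile_ratings(ratings)
print(stats)
```

The density figure is worth checking on the real data before calling a dataset dense or sparse, since it depends on the number of distinct movies as well as the number of ratings.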

Building up from simple to complex

Pearson Correlation (User-User CF)

Computed the Pearson correlation between user rating vectors and recommended movies that similar users rated highly. Sparsity limits this approach: the correlation is computed only over movies both users have rated, so with few co-rated movies the estimate is noisy.

corrwith() · sort_values()
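
One way the user-user step could look with corrwith() and sort_values(). This is a sketch under assumptions, not the original notebook code: the toy ratings, the function name recommend_user_user, the choice to keep only positively correlated neighbours, and the similarity-weighted average are all illustrative.

```python
import pandas as pd

# Hypothetical toy ratings: B rates like A, C rates the opposite way.
ratings = pd.DataFrame({
    "UserID":  ["A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"],
    "MovieID": ["m1", "m2", "m3", "m1", "m2", "m3", "m4", "m1", "m2", "m3", "m5"],
    "Rating":  [5, 4, 1, 4, 5, 2, 5, 1, 2, 5, 5],
})

def recommend_user_user(df, target, top_n=5):
    # Rows = movies, columns = users; unrated cells are NaN
    matrix = df.pivot_table(index="MovieID", columns="UserID", values="Rating")
    # Pearson correlation of every user's column with the target's column,
    # computed only over the movies both users rated
    sims = matrix.corrwith(matrix[target]).drop(target).dropna()
    # Assumption: only positively correlated users count as neighbours
    neighbours = sims[sims > 0].sort_values(ascending=False)
    # Candidate movies are the ones the target has not rated
    unseen = matrix.index[matrix[target].isna()]
    cand = matrix.loc[unseen, neighbours.index]
    # Similarity-weighted mean of the neighbours' ratings per candidate
    weighted = cand.mul(neighbours, axis=1).sum(axis=1)
    weights = cand.notna().mul(neighbours, axis=1).sum(axis=1)
    scores = (weighted / weights).dropna()
    return scores.sort_values(ascending=False).head(top_n)

recs = recommend_user_user(ratings, "A")
print(recs)
```

In the toy data only m4 is recommended to user A: m5 was rated solely by C, whose ratings run opposite to A's, so it has no positive neighbour behind it.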

Cosine Similarity (Item-Item CF)

Built an item-item similarity matrix using cosine similarity. In this case study it proved more robust than Pearson: cosine similarity depends only on the angle between two rating vectors, not on their length, and it produced more consistent, genre-aligned recommendations.

cosine_similarity · pivot_table
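
The item-item step can be sketched with pivot_table and scikit-learn's cosine_similarity. The toy ratings and the helper name similar_movies are hypothetical, and filling unrated cells with 0 is an assumption; the case study does not say how missing ratings were handled.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy ratings: two action titles liked by the same users,
# one drama liked by a different group.
ratings = pd.DataFrame({
    "UserID":  [1, 2, 4, 1, 2, 4, 1, 3, 4],
    "MovieID": ["Action1", "Action1", "Action1",
                "Action2", "Action2", "Action2",
                "Drama1", "Drama1", "Drama1"],
    "Rating":  [5, 4, 1, 4, 5, 1, 1, 5, 5],
})

def similar_movies(df, movie_id, top_n=5):
    # Rows = movies, columns = users; unrated cells become 0 (assumption)
    matrix = df.pivot_table(index="MovieID", columns="UserID",
                            values="Rating", fill_value=0)
    # Pairwise cosine similarity between movie rows
    sim = pd.DataFrame(cosine_similarity(matrix),
                       index=matrix.index, columns=matrix.index)
    # Most similar movies first, excluding the movie itself
    return sim[movie_id].drop(movie_id).sort_values(ascending=False).head(top_n)

top = similar_movies(ratings, "Action1")
print(top)
```

Here Action2 ranks well ahead of Drama1 for Action1, which is the genre-consistent behaviour described above.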

SVD Matrix Factorization

Used the Surprise library's SVD algorithm, which decomposes the user-item matrix into latent factors. These factors capture hidden patterns (e.g., a user's taste for action films), so the model can score movies the user has not rated yet. Best-performing model.

Surprise SVD · train_test_split · accuracy.rmse

Python — svd_recommender.py

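import numpy as np
from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split

# df is assumed to be a pandas DataFrame of ratings, loaded earlier,
# with UserID, MovieID and Rating columns.

# Load data in Surprise format (ratings run from 1 to 5)
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[['UserID', 'MovieID', 'Rating']], reader)

# Hold out 20% of the ratings for evaluation
trainset, testset = train_test_split(data, test_size=0.2)

# Fit SVD with Surprise's default hyperparameters
svd = SVD()
svd.fit(trainset)

# Score the held-out ratings; accuracy.rmse prints and returns the RMSE
predictions = svd.test(testset)
rmse = accuracy.rmse(predictions)  # RMSE ≈ 0.88

# MAPE: mean of |actual - predicted| / actual over the test predictions
mape = np.mean([abs((p.r_ui - p.est) / p.r_ui) for p in predictions if p.r_ui != 0])
# MAPE ≈ 0.27 → 27% average prediction error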
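
The latent-factor idea behind the SVD model can be illustrated without Surprise. Below is a small NumPy sketch of biased matrix factorization trained with stochastic gradient descent, the same family of model as Surprise's SVD. It is not Surprise's implementation: the toy ratings, the function names (train_mf, predict, rmse_on) and the hyperparameters are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy ratings as (user index, item index, rating) triples.
# Users 0 and 1 prefer items 0 and 1; users 2 and 3 prefer items 2 and 3.
triples = [(0, 0, 5), (0, 1, 4), (0, 2, 1), (1, 0, 4), (1, 1, 5), (1, 3, 1),
           (2, 2, 5), (2, 3, 4), (2, 0, 1), (3, 2, 4), (3, 3, 5)]

def train_mf(triples, n_users, n_items, n_factors=2, n_epochs=200,
             lr=0.01, reg=0.02, seed=0):
    """Fit r_hat = mu + b_u + b_i + p_u . q_i by SGD on the known ratings."""
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in triples])      # global mean rating
    bu, bi = np.zeros(n_users), np.zeros(n_items)  # user and item biases
    P = rng.normal(0, 0.1, (n_users, n_factors))   # user latent factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))   # item latent factors
    for _ in range(n_epochs):
        for u, i, r in triples:
            err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu = P[u].copy()  # keep the old user factors for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return mu, bu, bi, P, Q

def predict(model, u, i):
    mu, bu, bi, P, Q = model
    # Clip to the 1-5 rating scale
    return float(np.clip(mu + bu[u] + bi[i] + P[u] @ Q[i], 1, 5))

def rmse_on(model, triples):
    return float(np.sqrt(np.mean([(r - predict(model, u, i)) ** 2
                                  for u, i, r in triples])))

baseline = train_mf(triples, 4, 4, n_epochs=0)  # untrained starting point
model = train_mf(triples, 4, 4)
print("RMSE before training:", rmse_on(baseline, triples))
print("RMSE after training: ", rmse_on(model, triples))
print("Predicted rating, user 0 on unrated item 3:", predict(model, 0, 3))
```

The last line is the point of the exercise: user 0 never rated item 3, yet the learned factors still produce a score for it. Note that the RMSE printed here is measured on the training ratings, unlike the held-out RMSE in the Surprise code above.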

Which approach works best

Method	RMSE	MAPE	Quality
Pearson Correlation	High	High	Weak — sparse data problem
Cosine Similarity	Medium	Medium	Better — genre-consistent recs
SVD Matrix Factorization	0.88	0.27	Best — latent factor learning

SVD is production-ready

An RMSE of 0.88 on a 1–5 scale means the typical prediction error is a little under one star. A MAPE of 27% is acceptable for sparse recommendation data, keeping in mind that MAPE weights errors on low ratings more heavily. SVD should be the deployed model.
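
To see how the two metrics behave, here is a small worked example on made-up numbers (not results from this project) in which every prediction misses by exactly one star.

```python
import numpy as np

# Hypothetical actual and predicted ratings; every miss is exactly one star
actual = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
predicted = np.array([4.0, 3.0, 4.0, 3.0, 2.0])

errors = actual - predicted

# RMSE: square root of the mean squared error, in rating-scale units
rmse = float(np.sqrt(np.mean(errors ** 2)))

# MAPE: mean of absolute error divided by the actual rating
per_rating_pct = np.abs(errors) / actual
mape = float(np.mean(per_rating_pct))

print("RMSE:", rmse)
print("Percentage error per rating:", per_rating_pct)
print("MAPE:", mape)
```

The same one-star miss counts as a 20% error on a 5-star rating and a 100% error on a 1-star rating, so MAPE says as much about the mix of ratings in the test set as it does about the model.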
