Building a movie recommendation engine using Pearson correlation, cosine similarity, and SVD matrix factorization, evaluated on a dataset of more than 1 million ratings.
Zee Entertainment runs one of India's largest OTT platforms. Recommendation quality directly drives watch time and subscriber retention. A bad recommendation wastes a user's time and increases churn. The goal: build a recommendation system that surfaces genuinely relevant content from a catalogue of thousands of titles.
| Attribute | Value | Note |
|---|---|---|
| Total ratings | 1,000,209 | Dense for a ratings dataset; well suited to collaborative filtering (CF) |
| Unique users | 6,040 | All have rated 20+ movies |
| Avg ratings per user | 166 | High engagement |
| Max ratings per user | 2,314 | Some power users |
| Dominant gender | ~75% male | Gender-biased dataset |
| Top age group | 25–34 | Largest rating contributor |
| Dominant decade | 1990s movies | Dataset skews toward older films |
| Most common rating | 4 out of 5 | Positive bias in ratings |
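
The figures in the table above can be reproduced with a few pandas aggregations. The following is a minimal sketch, assuming the ratings are in a DataFrame with `UserID`, `MovieID`, and `Rating` columns. The function name `profile_ratings` and the tiny stand-in frame are hypothetical and used only for illustration.

```python
import pandas as pd

# Tiny stand-in frame with made-up values; the real frame has about 1M rows.
df = pd.DataFrame({
    "UserID":  [1, 1, 1, 2, 2, 3],
    "MovieID": [10, 20, 30, 10, 30, 20],
    "Rating":  [4, 5, 3, 4, 4, 2],
})

def profile_ratings(df: pd.DataFrame) -> dict:
    """Summarise a ratings frame with UserID, MovieID and Rating columns."""
    per_user = df.groupby("UserID").size()  # number of ratings per user
    n_users = df["UserID"].nunique()
    n_movies = df["MovieID"].nunique()
    return {
        "total_ratings": len(df),
        "unique_users": n_users,
        "avg_ratings_per_user": float(per_user.mean()),
        "max_ratings_per_user": int(per_user.max()),
        "most_common_rating": int(df["Rating"].mode().iloc[0]),
        # Share of user-movie cells that actually hold a rating.
        "density": len(df) / (n_users * n_movies),
    }

stats = profile_ratings(df)
print(stats)
```

The `density` entry is the share of the user-movie matrix that is filled. It gives a number to check the "dense" label against, since even a well-populated ratings matrix leaves most cells empty.
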

    import numpy as np
    from surprise import Dataset, Reader, SVD, accuracy
    from surprise.model_selection import train_test_split

    # Load data in Surprise format.
    # df is the ratings DataFrame with UserID, MovieID and Rating columns.
    reader = Reader(rating_scale=(1, 5))
    data = Dataset.load_from_df(df[['UserID', 'MovieID', 'Rating']], reader)

    # Hold out 20% of the ratings for evaluation
    trainset, testset = train_test_split(data, test_size=0.2)

    # Fit SVD with default hyperparameters and predict the held-out ratings
    svd = SVD()
    svd.fit(trainset)
    predictions = svd.test(testset)

    rmse = accuracy.rmse(predictions)
    # RMSE ≈ 0.88

    # MAPE calculation (true ratings of 0 are skipped to avoid division by zero)
    mape = np.mean([abs((p.r_ui - p.est) / p.r_ui) for p in predictions if p.r_ui != 0])
    # MAPE ≈ 0.27 → 27% average prediction error

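
To make the two reported metrics concrete, here is a small sketch that computes RMSE and MAPE by hand with NumPy on a handful of made-up ratings. The helper names `rmse_score` and `mape_score` and the sample values are assumptions for illustration. They are not part of the Surprise API.

```python
import numpy as np

def rmse_score(actual, predicted):
    """Root mean squared error between true and predicted ratings."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mape_score(actual, predicted):
    """Mean absolute percentage error, skipping true ratings equal to 0."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mask = actual != 0  # avoid division by zero
    return float(np.mean(np.abs((actual[mask] - predicted[mask]) / actual[mask])))

# Made-up true ratings and predictions on a 1-5 scale
actual = [4, 5, 2, 1]
predicted = [3.5, 4.0, 3.0, 2.0]

print(f"RMSE: {rmse_score(actual, predicted):.4f}")
print(f"MAPE: {mape_score(actual, predicted):.4f}")
```

MAPE divides each error by the true rating, so a one-star miss on a 1-star movie counts as a 100% error while the same miss on a 5-star movie counts as 20%. That makes MAPE more sensitive to errors on low ratings than RMSE is.
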
| Method | RMSE | MAPE | Quality |
|---|---|---|---|
| Pearson Correlation | High | High | Weak — sparse data problem |
| Cosine Similarity | Medium | Medium | Better — genre-consistent recs |
| SVD Matrix Factorization | 0.88 | 0.27 | Best — latent factor learning |
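
The two similarity-based methods in the table can be sketched in a few lines of NumPy. The source does not say whether similarities were computed between users or between items, so the example below assumes generic rating vectors with `NaN` marking unrated entries and compares them on co-rated positions only. All function and variable names are hypothetical.

```python
import numpy as np

def co_rated(u, v):
    """Restrict two rating vectors to positions both have rated (NaN = unrated)."""
    mask = ~np.isnan(u) & ~np.isnan(v)
    return u[mask], v[mask]

def cosine_sim(u, v):
    """Cosine similarity over co-rated positions; 0.0 when there is no overlap."""
    a, b = co_rated(u, v)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def pearson_sim(u, v):
    """Pearson correlation over co-rated positions; 0.0 when it is undefined."""
    a, b = co_rated(u, v)
    if len(a) < 2:
        return 0.0  # correlation needs at least two co-rated positions
    a_c, b_c = a - a.mean(), b - b.mean()
    denom = np.linalg.norm(a_c) * np.linalg.norm(b_c)
    return float(a_c @ b_c / denom) if denom > 0 else 0.0

nan = np.nan
# Made-up rating vectors over five movies
alice = np.array([5.0, 4.0, nan, 1.0, nan])
bob = np.array([4.0, 5.0, 2.0, 1.0, nan])
carol = np.array([nan, nan, 4.0, nan, 5.0])

print("alice vs bob  :", cosine_sim(alice, bob), pearson_sim(alice, bob))
print("alice vs carol:", cosine_sim(alice, carol), pearson_sim(alice, carol))
```

The `alice` and `carol` vectors share no rated movies, so neither measure carries any information for that pair. This is one way the sparse-data problem noted for Pearson correlation shows up: the correlation is undefined or unstable when the overlap is small.
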