Clustering · K-Means · Completed

Scaler Learner Segmentation via Clustering

K-Means and hierarchical clustering on Scaler's learner dataset to identify distinct learner segments based on salary (CTC) and years of experience.

Type: K-Means · Hierarchical Clustering
Domain: Ed-Tech / HR Analytics
Dataset: Scaler learner records
Tools: Python · Scikit-learn · SciPy · Seaborn
Course: Scaler Academic Case Study
Optimal Clusters: 3
Primary Algorithm: K-Means
Clustering Features: CTC + Experience
01 — Business Problem

Who are Scaler's different types of learners?

Scaler wants to understand its learner base better. Learners have different profiles: some are freshers trying to break into tech, others are experienced engineers upskilling for senior roles. Identifying these segments helps Scaler personalise course recommendations, mentorship pairing, and placement support.

🎯
Why clustering
Unlike classification, clustering is unsupervised — we don't know the segments in advance. The algorithm discovers natural groupings in the data. This is exploratory analysis, not prediction.
02 — Data Preparation

Cleaning a messy salary dataset

01
Missing Value Handling
Dropped rows with a missing job_position (a large number of rows), then dropped any remaining rows containing a null. The dataset is smaller but complete.
dropna(subset=['job_position'])
02
Experience Feature Engineering
Created 'experience' = current_year - orgyear, with current_year set to 2024. This converts the joining year into years of experience, a more meaningful clustering dimension.
current_year - df['orgyear']
03
Duplicate Removal
Removed duplicate rows to prevent bias in clustering. Duplicates inflate the density of certain points and pull cluster centres towards them.
drop_duplicates()
04
Outlier Removal
Applied the IQR method to the CTC column. Removed salaries above Q3 + 1.5×IQR, and salaries below ₹100,000 (unrealistic entries).
IQR · quantile() · df[df['ctc']>=100000]
05
StandardScaler Normalisation
Standardised CTC and experience to zero mean and unit variance before clustering. Without scaling, high-magnitude CTC values dominate the Euclidean distance calculations and experience contributes almost nothing.
StandardScaler · fit_transform()
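
The five steps above can be sketched as one small pipeline. This is a minimal sketch rather than the project's actual code: the helper name `prepare_learners` and the sample rows are made up for illustration, and only the column names (`job_position`, `orgyear`, `ctc`) and thresholds come from the steps described.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def prepare_learners(df: pd.DataFrame, current_year: int = 2024) -> pd.DataFrame:
    """Apply preparation steps 01-04 to a raw learner DataFrame (hypothetical helper)."""
    out = df.dropna(subset=["job_position"])   # 01: rows without a job position
    out = out.dropna()                         # 01: any remaining nulls
    out = out.assign(experience=current_year - out["orgyear"])  # 02
    out = out.drop_duplicates()                # 03

    # 04: IQR upper fence plus a fixed lower floor on CTC
    q1, q3 = out["ctc"].quantile([0.25, 0.75])
    upper = q3 + 1.5 * (q3 - q1)
    out = out[(out["ctc"] >= 100_000) & (out["ctc"] <= upper)]
    return out.reset_index(drop=True)


# Tiny made-up sample, only to show the shape of the input
raw = pd.DataFrame({
    "job_position": ["Backend Engineer", None, "Data Analyst", "Data Analyst",
                     "SDE", "QA Engineer", "SDE II"],
    "orgyear": [2020, 2018, 2015, 2015, 2022, 2019, 2012],
    "ctc": [900_000, 700_000, 1_500_000, 1_500_000, 50_000, 1_100_000, 200_000_000],
})
clean = prepare_learners(raw)

# 05: standardise both features to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(clean[["ctc", "experience"]])
print(clean)
```

Running the IQR filter after duplicate removal means the quartiles are computed on unique records, which is one reasonable ordering; the original project may have ordered the steps differently.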
Python — clustering.py
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# df is the cleaned DataFrame produced by the preparation steps above
# Scale features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[['ctc', 'experience']])

# Elbow method to find optimal k
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
# Elbow at k=3 → 3 clusters optimal

# Hierarchical: Ward linkage on the first 2,000 rows only
# (linkage memory cost grows quadratically with the number of rows)
Z = linkage(X_scaled[:2000], method='ward')
dendrogram(Z, no_labels=True)
# Big vertical gap at top → 3 clusters confirmed
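
One way to check that the two methods agree, beyond reading the dendrogram by eye, is to cut the Ward tree into three flat clusters and compare the result with the K-Means labels. The sketch below assumes synthetic blob data as a stand-in for the scaled (ctc, experience) matrix, so the score it prints says nothing about the Scaler dataset itself.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix: NOT the Scaler data
X, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0], [5, 5], [10, 0]],
    cluster_std=0.8,
    random_state=42,
)
X_scaled = StandardScaler().fit_transform(X)

# Final K-Means fit with the k read off the elbow plot
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

# Cut the Ward tree into at most 3 flat clusters
Z = linkage(X_scaled, method="ward")
hier_labels = fcluster(Z, t=3, criterion="maxclust")

# Adjusted Rand index: 1.0 means identical partitions,
# values near 0 mean agreement no better than chance
ari = adjusted_rand_score(kmeans_labels, hier_labels)
print(f"Agreement between K-Means and Ward clustering (ARI): {ari:.3f}")
```

The adjusted Rand index ignores the label ids themselves, so it does not matter that K-Means numbers clusters from 0 and `fcluster` from 1.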
03 — The Three Learner Segments

Who Scaler's learners are

Cluster 0
Early Career
Low CTC · Low Experience
Freshers & Junior devs
Breaking into tech
Cluster 1
Mid Career
Mid CTC · Mid Experience
3–7 years in industry
Upskilling for senior roles
Cluster 2
Senior / High Earners
High CTC · High Experience
8+ years · Leadership path
Niche upskilling
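
K-Means assigns cluster ids arbitrarily, so "Cluster 0" is not guaranteed to be the low-CTC group on every run. A minimal sketch of one way to attach stable segment names is to rank clusters by their centroids. The helper `name_segments`, the ranking rule (sum of scaled centroid coordinates), and the sample points are all assumptions made for illustration, not the project's method.

```python
import numpy as np
from sklearn.cluster import KMeans

SEGMENT_NAMES = ["Early Career", "Mid Career", "Senior / High Earners"]


def name_segments(km: KMeans, labels: np.ndarray) -> np.ndarray:
    """Map raw K-Means label ids to segment names, ordered by centroid position."""
    # Rank clusters by the sum of their scaled (ctc, experience) centroid coordinates
    order = np.argsort(km.cluster_centers_.sum(axis=1))
    id_to_name = {int(cluster_id): SEGMENT_NAMES[rank]
                  for rank, cluster_id in enumerate(order)}
    return np.array([id_to_name[int(label)] for label in labels])


# Made-up scaled points: three low, three mid, three high on both features
X_scaled = np.array([
    [-1.2, -1.1], [-1.0, -1.3], [-1.1, -0.9],
    [0.0, 0.1], [0.1, -0.1], [-0.1, 0.0],
    [1.2, 1.1], [1.0, 1.2], [1.1, 1.0],
])
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)
segments = name_segments(km, km.labels_)
print(segments)
```

Ranking by centroid works here because both features rise together across the three segments; if a cluster had high CTC but low experience, a different rule would be needed.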
📚
Segment-specific course recommendations
Early career learners need DSA and system design fundamentals. Mid-career learners need ML/AI upskilling. Senior learners need leadership, architecture, or niche specialisation.
🎓
Mentorship matching improvement
Pairing learners within the same cluster for peer learning — and with the cluster above them for mentorship — creates more relevant connections.
💼
Placement strategy differentiation
Early career: entry-level FAANG prep. Mid: lateral moves to product companies. Senior: leadership roles and startups.
📊
CTC is skewed — median is better
Mean CTC is pulled up by outliers (max 200M). Median salary is the appropriate central tendency measure for each cluster.
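
The effect is easy to see by computing both statistics per cluster. The rows below are made up for illustration, and the `cluster` column is assumed to hold the fitted K-Means labels.

```python
import pandas as pd

# Made-up rows; one extreme CTC value sits in cluster 2
df = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "ctc": [400_000, 500_000, 600_000,
            1_200_000, 1_400_000, 1_600_000,
            3_000_000, 3_500_000, 200_000_000],
})

# Mean and median side by side for each cluster
summary = df.groupby("cluster")["ctc"].agg(["mean", "median", "count"])
print(summary)
```

In this sample the single extreme value drags the cluster 2 mean far above every other salary in the group, while the median stays at a typical value.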
04 — Tech Stack
Python 3 · Pandas · Scikit-learn · K-Means · Hierarchical Clustering · SciPy Dendrogram · Seaborn