Scaler Learner Segmentation

Scaler Learner Segmentation via Clustering

K-Means and Hierarchical clustering on Scaler's learner dataset to identify distinct learner segments based on salary, experience, and job profile.

Type: K-Means · Hierarchical Clustering

Domain: Ed-Tech / HR Analytics

Dataset: Scaler learner records

Tools: Python · Scikit-learn · SciPy · Seaborn

Course: Scaler Academic Case Study

Who are Scaler's different types of learners?

Scaler wants to understand their learner base better. Different learners have different profiles — some are freshers trying to break into tech, others are experienced engineers upskilling for senior roles. Identifying these segments helps Scaler personalise course recommendations, mentorship pairing, and placement support.

Why clustering

Unlike classification, clustering is unsupervised — we don't know the segments in advance. The algorithm discovers natural groupings in the data. This is exploratory analysis, not prediction.

Cleaning a messy salary dataset

Missing Value Handling

Dropped the rows missing job_position (a large number of rows), then dropped any remaining row with a null value. The dataset is smaller as a result, but every retained record is complete.

dropna(subset=['job_position'])
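The two-stage drop can be sketched on a toy frame as follows. This is a minimal illustration, not the case study's actual notebook: the rows are invented, and only the column names job_position, orgyear and ctc come from the write-up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the learner records (values are made up for illustration)
raw = pd.DataFrame({
    "job_position": ["Backend Engineer", None, "Data Analyst", "QA Engineer"],
    "orgyear": [2018.0, 2020.0, np.nan, 2015.0],
    "ctc": [1200000, 800000, 600000, 1500000],
})

# Count missing values per column before dropping anything
missing_before = raw.isna().sum()

# Stage 1: drop rows that have no job_position
step1 = raw.dropna(subset=["job_position"])

# Stage 2: drop any remaining row that still contains a null
cleaned = step1.dropna()

print(missing_before)
print(f"rows: {len(raw)} -> {len(step1)} -> {len(cleaned)}")
```

Printing the row count after each stage makes it easy to see how much data each rule costs.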

Experience Feature Engineering

Created 'experience' = 2024 - orgyear, using 2024 as the reference year (current_year). This converts the joining year into years of experience, a more meaningful clustering dimension than a raw calendar year.

current_year - df['orgyear']
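A small sketch of the derived feature, using 2024 as the reference year stated above. The negative-experience filter is an assumption added for illustration (a joining year in the future is treated as a data-entry error); the write-up does not say whether such rows were handled.

```python
import pandas as pd

CURRENT_YEAR = 2024  # reference year stated in the write-up

# Toy joining years; 2026 is deliberately invalid
df = pd.DataFrame({"orgyear": [2023, 2019, 2010, 2026]})

# Years of experience = reference year minus joining year
df["experience"] = CURRENT_YEAR - df["orgyear"]

# Assumption (not from the source): drop rows whose joining year is in the future
df = df[df["experience"] >= 0].reset_index(drop=True)

print(df)
```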

Duplicate Removal

Removed duplicate rows to prevent bias in clustering. Duplicates inflate the density of certain points and pull cluster centres towards them.

drop_duplicates()
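To see why duplicates matter, the toy example below compares the mean of the feature columns (which is what a K-Means centroid is) before and after de-duplication. The three rows are invented for illustration.

```python
import pandas as pd

# Toy data: the first row appears twice
df = pd.DataFrame({
    "ctc": [600000, 600000, 1200000],
    "experience": [2, 2, 6],
})

n_dupes = int(df.duplicated().sum())            # rows that repeat an earlier row
centre_before = df[["ctc", "experience"]].mean()

deduped = df.drop_duplicates().reset_index(drop=True)
centre_after = deduped[["ctc", "experience"]].mean()

print(f"duplicates removed: {n_dupes}")
print("centre before:", centre_before.to_dict())
print("centre after: ", centre_after.to_dict())
```

The repeated low-CTC row drags the centre towards itself; removing it moves the centre back to the midpoint of the two distinct learners.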

Outlier Removal

Applied the IQR method to the CTC column: removed salaries above Q3 + 1.5×IQR. Also removed salaries below ₹100,000 as unrealistic entries.

IQR · quantile() · df[df['ctc']>=100000]
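The filter described above could look roughly like this. filter_ctc is a hypothetical helper name, and computing the quartiles on the full column before applying the ₹100,000 floor is an assumption; the write-up does not state the order of the two steps.

```python
import pandas as pd

def filter_ctc(df, floor=100_000, k=1.5):
    """Keep rows whose ctc lies between a fixed floor and Q3 + k * IQR."""
    q1 = df["ctc"].quantile(0.25)
    q3 = df["ctc"].quantile(0.75)
    upper = q3 + k * (q3 - q1)          # upper IQR fence
    mask = (df["ctc"] >= floor) & (df["ctc"] <= upper)
    return df[mask]

# Toy salaries with one implausibly low and one extreme high value
df = pd.DataFrame({
    "ctc": [5000, 400000, 600000, 800000, 1000000, 1200000, 200000000]
})
filtered = filter_ctc(df)

print(filtered["ctc"].tolist())
```

On this toy column Q1 is 500,000 and Q3 is 1,100,000, so the upper fence is 2,000,000 and both extreme rows are dropped.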

StandardScaler Normalisation

Standardised CTC and experience to zero mean and unit variance before clustering, so both features sit on a comparable scale. Without scaling, high-magnitude CTC values dominate the distance calculations.

StandardScaler · fit_transform()
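A quick numerical check of the scaling argument, on three invented learners. Before scaling, the Euclidean distance between two rows is almost entirely the CTC difference; after StandardScaler each column has mean 0 and standard deviation 1, so both features contribute.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: ctc (rupees), experience (years); toy values
X = np.array([
    [400000, 1],
    [800000, 5],
    [1200000, 9],
], dtype=float)

# Raw distance between the first two learners is dominated by the CTC gap
raw_dist = np.linalg.norm(X[0] - X[1])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, both columns are on the same footing
scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[1])

print(f"raw distance:    {raw_dist:.2f}")
print(f"scaled distance: {scaled_dist:.4f}")
print("column means:", X_scaled.mean(axis=0))
print("column stds: ", X_scaled.std(axis=0))
```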

Python — clustering.py

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Scale features so CTC and experience contribute comparably to distances
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[['ctc', 'experience']])

# Elbow method: record inertia (within-cluster sum of squares) for k = 1..10
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
# Elbow at k=3 → 3 clusters optimal

# Hierarchical: Ward linkage on the first 2,000 rows (full linkage needs O(n²) memory)
Z = linkage(X_scaled[:2000], method='ward')
dendrogram(Z, truncate_mode='lastp', p=30)  # show only the last 30 merges
plt.show()
# Big vertical gap at top → 3 clusters confirmed
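Reading the elbow and the dendrogram by eye can be cross-checked numerically. The sketch below is an illustration on synthetic blobs, not the Scaler data: it cuts a Ward linkage into three flat clusters with fcluster, compares them with K-Means labels using the adjusted Rand index, and reports a silhouette score. The blob centres and spread are arbitrary assumptions chosen to give three well-separated groups.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic stand-in for the scaled (ctc, experience) matrix
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 0], [5, 5]],
    cluster_std=0.5,
    random_state=42,
)

# K-Means with k=3
km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_demo)

# Ward linkage, then cut the tree so that at most 3 flat clusters remain
Z_demo = linkage(X_demo, method="ward")
hc_labels = fcluster(Z_demo, t=3, criterion="maxclust")

# Agreement between the two partitions (1.0 means identical up to relabelling)
agreement = adjusted_rand_score(km_labels, hc_labels)

# Silhouette: closer to 1 means tighter, better separated clusters
sil = silhouette_score(X_demo, km_labels)

print(f"adjusted Rand index: {agreement:.3f}")
print(f"silhouette score:    {sil:.3f}")
```

On real learner data the two methods will rarely agree perfectly; a high adjusted Rand index is simply extra evidence that the chosen k reflects structure in the data rather than an artefact of one algorithm.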

Who Scaler's learners are

Cluster 0

Early Career
Low CTC · Low Experience
Freshers & Junior devs
Breaking into tech

Cluster 1

Mid Career
Mid CTC · Mid Experience
3–7 years in industry
Upskilling for senior roles

Cluster 2

Senior / High Earners
High CTC · High Experience
8+ years · Leadership path
Niche upskilling
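K-Means assigns cluster numbers arbitrarily, so cluster 0 is not guaranteed to be the low-CTC group on every run. One way to attach the segment names above is to profile each cluster by its median CTC and experience and name the clusters in ascending CTC order. The rows and the cluster column below are invented for illustration.

```python
import pandas as pd

# Toy learners with arbitrary cluster ids, as K-Means might assign them
df = pd.DataFrame({
    "ctc": [400000, 500000, 1000000, 1200000, 3000000, 3500000],
    "experience": [1, 2, 4, 6, 10, 12],
    "cluster": [2, 2, 0, 0, 1, 1],
})

# Median profile per cluster, ordered from lowest to highest CTC
profile = (
    df.groupby("cluster")[["ctc", "experience"]]
    .median()
    .sort_values("ctc")
)

# Name the clusters in ascending CTC order
profile["segment"] = ["Early Career", "Mid Career", "Senior / High Earners"]

# Map each learner's cluster id to its segment name
df["segment"] = df["cluster"].map(profile["segment"])

print(profile)
```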

Segment-specific course recommendations

Early career learners need DSA and system design fundamentals. Mid-career learners need ML/AI upskilling. Senior learners need leadership, architecture, or niche specialisation.

Mentorship matching improvement

Pairing learners within the same cluster for peer learning — and with the cluster above them for mentorship — creates more relevant connections.

Placement strategy differentiation

Early career: entry-level FAANG prep. Mid: lateral moves to product companies. Senior: leadership roles and startups.

CTC is skewed — median is better

Mean CTC is pulled up by outliers (max 200M). Median salary is the appropriate central tendency measure for each cluster.
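A five-value toy series shows the effect. The 200,000,000 entry mirrors the maximum quoted above; the other salaries are invented.

```python
import pandas as pd

# Four ordinary salaries and one extreme value
ctc = pd.Series([500000, 600000, 700000, 800000, 200000000])

mean_ctc = ctc.mean()      # dragged far above every typical salary
median_ctc = ctc.median()  # unaffected by the single extreme value

print(f"mean:   {mean_ctc:,.0f}")
print(f"median: {median_ctc:,.0f}")
```

The mean lands at 40,520,000, far above four of the five salaries, while the median stays at 700,000.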
