Academic Case Study · Cybersecurity · Completed

Network Anomaly Detection

Machine learning to detect malicious network traffic — Logistic Regression and Random Forest classifiers on 123 engineered features, deployed as a Flask API with a Tableau dashboard.

Type: Classification · Anomaly Detection
Domain: Cybersecurity · Network Security
Dataset: Network Anomaly Traffic Dataset
Tools: Python · Scikit-learn · Flask · Tableau · ngrok
Category: Academic Case Study
Random Forest Accuracy: 99.9%
ROC-AUC Score: 0.999
Features After Encoding: 123
API Deployed: Flask
01 — Problem Statement

Detect cyber attacks before they cause damage.

Network anomalies such as DDoS attacks, unauthorised access, port scanning, and malicious data exfiltration are expensive and dangerous. Firewalls and rule-based systems catch known patterns. Machine learning can also flag unfamiliar ones: unusual traffic behaviour that doesn't match any predefined rule.

The goal: build a model that classifies network connections as normal or attack in real time, with high precision and minimal false positives — then expose it as an API that any network monitoring system can call.

🔒
The scale of the problem
A single corporate network generates millions of connection events per day. A model that operates at 99.9% accuracy misclassifies roughly 1 connection in 1,000, counting both false alerts and missed attacks. At that volume the errors still add up to thousands per day, so the output is best suited to a security triage system where humans review flagged traffic.
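The alert arithmetic can be made concrete. This is a minimal sketch that reuses the Random Forest error counts from the confusion matrix in this case study and assumes a hypothetical volume of one million connections per day; it also assumes live traffic has the same normal/attack mix as the test split, which may not hold on a real network.

```python
# Error counts reported for the Random Forest on the test split
tn, fp, fn, tp = 13_418, 4, 10, 11_763

# Hypothetical daily volume (an assumption, not a figure from the case study)
connections_per_day = 1_000_000

total = tn + fp + fn + tp
error_rate = (fp + fn) / total          # every mistake, false alerts and misses
false_alert_rate = fp / total           # normal connections flagged as attacks
missed_attack_rate = fn / total         # attacks labelled as normal

expected_errors = error_rate * connections_per_day
expected_false_alerts = false_alert_rate * connections_per_day
expected_missed_attacks = missed_attack_rate * connections_per_day

print(f"error rate: {error_rate:.4%}")
print(f"misclassifications per day: {expected_errors:.0f}")
print(f"false alerts per day: {expected_false_alerts:.0f}")
print(f"missed attacks per day: {expected_missed_attacks:.0f}")
```

Splitting the error rate this way shows that accuracy alone does not say how many alerts an analyst will see; that depends on the false positive count specifically.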
02 — Dataset

What network traffic looks like as data

The dataset captures per-connection network traffic attributes — bytes transferred, protocol types, connection flags, error rates, and service types. Each row is one network connection, labelled as normal or one of several attack categories (collapsed to a binary label).

Feature Group | Examples | Relevance
Traffic Volume | srcbytes, dstbytes | Highest importance
Connection Behaviour | samesrvrate, diffsrvrate, count | High importance
Connection Status | flag_SF, loggedin, lastflag | Medium importance
Protocol (encoded) | protocoltype_tcp, protocoltype_icmp | Medium importance
Service (encoded) | service_http, service_ftp | Supporting features
📊
Class distribution
Normal and attack traffic were both well represented in the dataset (the test split holds 13,422 normal and 11,773 attack connections). The target column (attack label) was binarised: 0 = normal, 1 = any attack type. After deduplication, the data was split 80% for training and 20% for testing.
03 — Methodology

End-to-end pipeline

01
Data Cleaning & Preprocessing
Dropped duplicates. Lowercased all column names. Binarised the target variable (attack → 1, normal → 0). Applied one-hot encoding to categorical columns (protocoltype, service, flag) — producing 123 total features.
drop_duplicates() · get_dummies() · label binarisation
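The cleaning steps above can be sketched on a toy frame. The rows, the capitalised column names, and the target column name `attack` are all made up for illustration; only the sequence of operations comes from the case study.

```python
import pandas as pd

# Tiny synthetic frame standing in for the real dataset (values are made up)
raw = pd.DataFrame({
    "SrcBytes": [181, 239, 0, 0, 181],
    "DstBytes": [5450, 486, 0, 0, 5450],
    "ProtocolType": ["tcp", "tcp", "icmp", "udp", "tcp"],
    "Service": ["http", "http", "ecr_i", "private", "http"],
    "Flag": ["SF", "SF", "SF", "S0", "SF"],
    "Attack": ["normal", "normal", "smurf", "neptune", "normal"],
})

# 1. Drop exact duplicate rows (the last row repeats the first)
df = raw.drop_duplicates().copy()

# 2. Lowercase the column names
df.columns = df.columns.str.lower()

# 3. Binarise the target: normal -> 0, any attack type -> 1
df["attack"] = (df["attack"] != "normal").astype(int)

# 4. One-hot encode the categorical columns
df = pd.get_dummies(df, columns=["protocoltype", "service", "flag"])

print(df.shape)
print(sorted(df.columns))
```

On the real dataset the same four steps are what produce the 123 encoded features; the toy frame only yields a handful of columns because it has few distinct category values.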
02
EDA & Visualisation
Plotted normal vs attack distribution. Analysed feature distributions — srcbytes and dstbytes showed heavy right skew and outliers. Built correlation heatmap on a 5,000-row sample. Visualised protocol and service behaviour in Tableau.
countplot · hist · heatmap · boxplot · Tableau
03
Hypothesis Testing — T-Test
Tested whether srcbytes differs significantly between normal and attack traffic. The t-test indicated a statistically significant difference at the 5% level (p = 0.035). Normal mean: 13K bytes. Attack mean: 82K bytes. Attack connections send roughly 6× more source bytes on average.
scipy.stats.ttest_ind()
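A minimal sketch of this kind of test, run on synthetic right-skewed byte counts rather than the real data. The case study does not say which variant of `ttest_ind` was used; `equal_var=False` (Welch's t-test) is an assumption made here because the two groups are unlikely to share a variance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic, heavy-tailed stand-ins for srcbytes in each class (made-up parameters)
normal_src = rng.lognormal(mean=6.0, sigma=1.5, size=2000)
attack_src = rng.lognormal(mean=7.5, sigma=1.5, size=2000)

# Welch's t-test: compares the two means without assuming equal variances
t_stat, p_value = stats.ttest_ind(attack_src, normal_src, equal_var=False)

print(f"normal mean: {normal_src.mean():.0f}")
print(f"attack mean: {attack_src.mean():.0f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```

With distributions this skewed, a rank-based test or a log transform is often reported alongside the t-test as a robustness check.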
04
Hypothesis Testing — Chi-Square
Tested whether protocol type (TCP/UDP/ICMP) is associated with attack likelihood. Chi-Square p-value ≈ 3e-79 — extremely significant. ICMP protocol shows the highest anomaly ratio.
chi2_contingency() · pd.crosstab()
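The protocol test can be sketched with a contingency table. The counts below are invented so that ICMP has the highest attack share, mirroring the finding qualitatively; they are not the dataset's real counts, and the column name `attack` is assumed.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up connections: 60 tcp, 25 udp, 15 icmp with differing attack shares
df = pd.DataFrame({
    "protocoltype": ["tcp"] * 60 + ["udp"] * 25 + ["icmp"] * 15,
    "attack": [0] * 45 + [1] * 15 + [0] * 20 + [1] * 5 + [0] * 3 + [1] * 12,
})

# Rows: protocol, columns: 0 = normal, 1 = attack
table = pd.crosstab(df["protocoltype"], df["attack"])

# Test of independence between protocol and the attack label
chi2, p_value, dof, expected = chi2_contingency(table)

# Share of connections per protocol that are attacks
anomaly_ratio = table[1] / table.sum(axis=1)

print(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3g}")
print(anomaly_ratio.sort_values(ascending=False))
```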
05
Model 1 — Logistic Regression
Baseline model. Applied StandardScaler and SimpleImputer (median strategy). Achieved 98.9% accuracy, with precision and recall of about 0.99 across both classes.
LogisticRegression(max_iter=1000) · StandardScaler · SimpleImputer
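One way to wire the baseline together is a scikit-learn `Pipeline`, which fits the preprocessing on the training split only and reapplies it at prediction time. This is a sketch on synthetic data from `make_classification`, not the author's code; the step order (impute, then scale) and the stratified split are choices made here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=8, random_state=42)

# Blank out about 1% of values so the imputer has something to fill
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.01] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

baseline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(X_train, y_train)

test_accuracy = baseline.score(X_test, y_test)
print(f"baseline accuracy on synthetic data: {test_accuracy:.3f}")
```

Bundling the steps also means a single object can be saved and served, so the API cannot drift out of sync with the training-time preprocessing.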
06
Model 2 — Random Forest
Improved on the Logistic Regression baseline with 100 estimators. Accuracy: 99.94%. ROC-AUC: 0.999. Confusion matrix: TN=13,418, FP=4, FN=10, TP=11,763. Only 14 misclassifications out of 25,195 test samples.
RandomForestClassifier(n_estimators=100) · roc_curve() · auc()
Python — model_pipeline.py
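import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# df is the cleaned, deduplicated DataFrame with the binarised target
# One-hot encode categoricals → 123 features
df = pd.get_dummies(df, columns=['protocoltype', 'service', 'flag'])

# 80/20 split ('attack' as the target name and random_state are assumptions)
X = df.drop(columns=['attack'])
y = df['attack']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scale + impute (StandardScaler skips NaNs when fitting and passes them
# through, so the median imputer can run afterwards); fit on train only
scaler = StandardScaler()
imputer = SimpleImputer(strategy='median')
X_train = imputer.fit_transform(scaler.fit_transform(X_train))
X_test  = imputer.transform(scaler.transform(X_test))

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Results (values reported in the case study)
print("Accuracy:", accuracy_score(y_test, rf.predict(X_test)))
# 0.9994
print("ROC-AUC:", roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))
# 0.999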
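Step 06 also lists `roc_curve()` and `auc()`. A minimal sketch of how those two calls and a confusion matrix fit together, using synthetic data in place of the network dataset, so the printed numbers will not match the reported ones:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded feature matrix
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Probability of the positive (attack) class for each test row
attack_scores = rf.predict_proba(X_test)[:, 1]

# ROC curve points across all thresholds, then the area under that curve
fpr, tpr, thresholds = roc_curve(y_test, attack_scores)
roc_auc = auc(fpr, tpr)

# Confusion matrix at the default 0.5 threshold, flattened to four counts
tn, fp, fn, tp = confusion_matrix(y_test, rf.predict(X_test)).ravel()

print(f"ROC-AUC: {roc_auc:.3f}")
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```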
04 — Model Results

The numbers in detail

Model | Accuracy | Precision | Recall | ROC-AUC
Logistic Regression | 98.9% | ~0.99 | ~0.99 | not reported
Random Forest ✓ | 99.94% | ~0.999 | ~0.999 | 0.999

Confusion Matrix

(actual class) | Predicted Normal | Predicted Attack
Actual Normal | TN = 13,418 | FP = 4
Actual Attack | FN = 10 | TP = 11,763
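The headline metrics follow directly from these four counts. A short worked calculation, treating attack as the positive class:

```python
# Counts from the Random Forest confusion matrix above
tn, fp, fn, tp = 13_418, 4, 10, 11_763

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)      # of connections flagged as attacks, share that were attacks
recall = tp / (tp + fn)         # of real attacks, share that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"test samples: {total}")
print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
```

The accuracy rounds to the reported 99.94%, and precision and recall both round to the ~0.999 shown in the results table.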
⚠️
Why 99.9% needs context
The high accuracy partly reflects strong class separability in this dataset: normal and attack traffic are clearly different in the feature space. Real-world network data is noisier and more adversarial. The model would need retraining on live data and adversarial samples to maintain performance in production.
05 — Flask API Deployment

From model to live API

The trained Random Forest model was saved with joblib and wrapped in a Flask REST API. ngrok exposed it publicly for testing — simulating a real deployment where any network monitoring agent can POST connection features and get an anomaly prediction in milliseconds.
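Saving the model and its column order might look like the sketch below. The file names `model.pkl` and `features.pkl` match those loaded by the API; the toy model, the feature names `f0`..`f4`, and the temporary directory are placeholders.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy model with hypothetical feature names, standing in for the trained forest
feature_names = [f"f{i}" for i in range(5)]
X_values, y = make_classification(n_samples=200, n_features=5, random_state=42)
X = pd.DataFrame(X_values, columns=feature_names)
rf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# Persist the model and the exact training column order side by side
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "model.pkl")
features_path = os.path.join(out_dir, "features.pkl")
joblib.dump(rf, model_path)
joblib.dump(list(X.columns), features_path)

# Reload, as the API would at startup, and confirm the round trip
restored_model = joblib.load(model_path)
restored_features = joblib.load(features_path)
same_predictions = bool(
    (restored_model.predict(X[restored_features]) == rf.predict(X)).all())
print(restored_features, same_predictions)
```

Storing the column list matters because the model only sees positions, not names, once the data reaches it as an array.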

Python — flask_app.py
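from flask import Flask, request, jsonify
import joblib
import pandas as pd

# Artifacts saved at training time: the fitted model and its column order
model         = joblib.load("model.pkl")
feature_names = joblib.load("features.pkl")

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # silent=True returns None instead of raising on a missing or invalid body
    data = request.get_json(silent=True)
    if not isinstance(data, dict):
        return jsonify({"error": "expected a JSON object of feature values"}), 400
    # Fill missing features with 0. Convenient, but a sparse request is then
    # scored as a mostly-zero connection, so clients should send every feature.
    input_data = {col: data.get(col, 0) for col in feature_names}
    # Build the row in the training column order
    df = pd.DataFrame([input_data], columns=feature_names)
    # Inputs need the same imputation and scaling as the training data,
    # unless model.pkl already bundles those steps.
    prediction = model.predict(df)
    # 0 = Normal, 1 = Attack
    return jsonify({"prediction": int(prediction[0])})

if __name__ == "__main__":
    app.run()

# Example request:
# POST /predict
# { "srcbytes": 1000, "dstbytes": 500 }
# Response: { "prediction": 0 }  → Normal traffic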
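The endpoint expects features that are already one-hot encoded. If a monitoring agent only has raw fields such as `protocoltype: "tcp"`, a small helper could do the encoding and column alignment before prediction. The helper name `to_model_row` and the shortened feature list are hypothetical; this is one possible approach, not part of the deployed API.

```python
import pandas as pd

# Hypothetical subset of the training columns (the real list has 123 entries)
feature_names = [
    "srcbytes", "dstbytes",
    "protocoltype_icmp", "protocoltype_tcp", "protocoltype_udp",
    "flag_S0", "flag_SF",
]

def to_model_row(record, feature_names):
    """Turn one raw connection record into a single-row frame in training column order."""
    row = pd.DataFrame([record])
    categorical = [c for c in ("protocoltype", "service", "flag") if c in row.columns]
    if categorical:
        # Produces columns such as protocoltype_tcp for the values present
        row = pd.get_dummies(row, columns=categorical)
    # reindex drops columns the model never saw and zero-fills the missing ones
    return row.reindex(columns=feature_names, fill_value=0).astype(float)

row = to_model_row(
    {"srcbytes": 1000, "dstbytes": 500, "protocoltype": "tcp", "flag": "SF"},
    feature_names,
)
print(row.iloc[0].to_dict())
```

Category values that were not seen during training are silently dropped by `reindex`, so a production version would probably log or reject them.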
🔗 Live Tableau Dashboard

Interactive visualisation of traffic distribution, protocol vs anomaly trends, and service-level attack patterns.

06 — Top Feature Importances

What signals anomalies

Rank | Feature | Category | Importance
#1 | srcbytes | Traffic Volume | High
#2 | dstbytes | Traffic Volume | High
#3 | samesrvrate | Connection Behaviour | Med
#4 | diffsrvrate | Connection Behaviour | Med
#5 | flag_SF | Connection Status | Med
#6 | count | Connection Frequency | Low
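A ranking like this is typically read from the fitted forest's `feature_importances_` attribute. The sketch below uses synthetic data and generic feature names, so its ordering says nothing about the real features; it only shows the mechanics.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with generic column names (not the real network features)
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances, normalised by scikit-learn to sum to 1
importances = (
    pd.Series(rf.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances.round(3))
```

Impurity-based importances can favour high-cardinality numeric features such as byte counts, so permutation importance is a common cross-check.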
07 — Key Findings
📡
Attack traffic transfers 6× more data
Normal mean srcbytes: 13K. Attack mean: 82K. The t-test (p = 0.035) indicates the difference is statistically significant at the 5% level, so it is unlikely to be random variation alone.
🌐
ICMP shows highest anomaly ratio
Protocol type is strongly associated with attack likelihood (Chi-Square p≈3e-79). ICMP connections are disproportionately malicious in this dataset; ICMP is often used in ping floods and ICMP tunnelling.
🎯
Only 14 errors out of 25K test samples
FP=4, FN=10. The model produces very few false positives, which suits triage workflows: many security teams accept a small number of missed attacks in exchange for not being flooded with false alarms.
Sub-millisecond prediction via API
The Flask API returns predictions in <1ms — fast enough for real-time network traffic analysis at scale.
08 — Tech Stack
Python 3 · Scikit-learn · Random Forest · Logistic Regression · Flask · ngrok · Tableau · Scipy Stats · joblib · Seaborn