Academic Case Study · Cybersecurity · Completed

Network Anomaly Detection

Machine learning to detect malicious network traffic — Logistic Regression and Random Forest classifiers on 123 engineered features, deployed as a Flask API with a Tableau dashboard.

Type: Classification · Anomaly Detection

Domain: Cybersecurity · Network Security

Dataset: Network Anomaly Traffic Dataset

Tools: Python · Scikit-learn · Flask · Tableau · ngrok

Tableau: View Dashboard ↗

Category: Academic Case Study

Random Forest Accuracy: 99.9%

ROC-AUC Score: 0.999

Features After Encoding: 123

API Deployed: Flask

01 — Problem Statement

Detect cyber attacks before they cause damage.

Network anomalies — DDoS attacks, unauthorised access, port scanning, malicious data exfiltration — are expensive and dangerous. Firewalls and rule-based systems catch known patterns. Machine learning catches the unknown ones — unusual traffic behaviour that doesn't match any predefined rule.

The goal: build a model that classifies network connections as normal or attack in real time, with high precision and minimal false positives — then expose it as an API that any network monitoring system can call.

🔒

The scale of the problem

A single corporate network generates millions of connection events per day. A model that operates at 99.9% accuracy misclassifies roughly 1 in every 1,000 connections (false alerts and missed attacks combined), or about 1,000 per million events. That is acceptable for a security triage system where humans review flagged traffic.
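The scale argument above is simple arithmetic, sketched here. The daily connection volume is a hypothetical figure chosen for illustration; only the 99.9% accuracy comes from the case study.

```python
# Back-of-envelope error volume at a given accuracy.
accuracy = 0.999                  # headline figure from the case study
connections_per_day = 2_000_000   # hypothetical volume, for illustration only

error_rate = 1 - accuracy
errors_per_thousand = error_rate * 1_000
errors_per_day = error_rate * connections_per_day

print(f"Errors per 1,000 connections: {errors_per_thousand:.1f}")
print(f"Errors per day at the assumed volume: {errors_per_day:.0f}")
```

Accuracy counts false positives and false negatives together, so the number of false alerts alone would be lower than this total.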

02 — Dataset

What network traffic looks like as data

The dataset captures per-connection network traffic attributes — bytes transferred, protocol types, connection flags, error rates, and service types. Each row is one network connection, labelled as normal or one of several attack categories (collapsed to a binary label).

Feature Group	Examples	Relevance
Traffic Volume	srcbytes, dstbytes	Highest importance
Connection Behaviour	samesrvrate, diffsrvrate, count	High importance
Connection Status	flag_SF, loggedin, lastflag	Medium importance
Protocol (encoded)	protocoltype_tcp, protocoltype_icmp	Medium importance
Service (encoded)	service_http, service_ftp	Supporting features

📊

Class distribution

Normal and attack traffic were both well represented in the dataset. The target column (attack label) was binarised: 0 = normal, 1 = any attack type. After deduplication, the data was split 80% for training and 20% for testing.
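A minimal sketch of the deduplicate-then-split step on a toy frame. The column names, the toy values, the `stratify` option and the `random_state` are assumptions for illustration; the case study only states that an 80/20 split followed deduplication.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset (values are made up).
df = pd.DataFrame({
    "srcbytes": [100, 200, 100, 5000, 7000, 100, 300, 9000, 250, 8000, 120, 6000],
    "attack":   [0,   0,   0,   1,    1,    0,   0,   1,    0,   1,    0,   1],
})

# Remove exact duplicate rows before splitting so the same row
# cannot land in both the training and the test set.
df = df.drop_duplicates()

X = df[["srcbytes"]]
y = df["attack"]

# 80/20 split; stratify keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(X_train), len(X_test))
```

Deduplicating before the split matters here: identical rows on both sides of the split would inflate the test accuracy.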

03 — Methodology

End-to-end pipeline

Data Cleaning & Preprocessing

Dropped duplicates. Lowercased all column names. Binarised the target variable (attack → 1, normal → 0). Applied one-hot encoding to categorical columns (protocoltype, service, flag) — producing 123 total features.

drop_duplicates() · get_dummies() · label binarisation
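The cleaning steps listed above can be sketched on a toy frame as follows. The raw column capitalisation, the label values ("normal", "smurf", "neptune") and the sample rows are assumptions; the real dataset's schema may differ.

```python
import pandas as pd

# Toy rows; the first and last are exact duplicates.
raw = pd.DataFrame({
    "ProtocolType": ["tcp", "udp", "icmp", "tcp", "tcp"],
    "Service":      ["http", "ftp", "http", "http", "http"],
    "Flag":         ["SF", "SF", "REJ", "SF", "SF"],
    "SrcBytes":     [200, 50, 1000, 300, 200],
    "Attack":       ["normal", "normal", "smurf", "neptune", "normal"],
})

df = raw.drop_duplicates().copy()

# Lowercase all column names.
df.columns = df.columns.str.lower()

# Binarise the target: normal -> 0, any attack type -> 1.
df["attack"] = (df["attack"] != "normal").astype(int)

# One-hot encode the categorical columns.
df = pd.get_dummies(df, columns=["protocoltype", "service", "flag"])

print(df.columns.tolist())
```

On the full dataset the same `get_dummies` call is what expands the three categorical columns into the 123 features reported above.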

EDA & Visualisation

Plotted normal vs attack distribution. Analysed feature distributions — srcbytes and dstbytes showed heavy right skew and outliers. Built correlation heatmap on a 5,000-row sample. Visualised protocol and service behaviour in Tableau.

countplot · hist · heatmap · boxplot · Tableau

Hypothesis Testing — T-Test

Tested whether srcbytes differs significantly between normal and attack traffic. The t-test indicated a statistically significant difference (p = 0.035). Normal mean: 13K bytes. Attack mean: 82K bytes. On average, attack connections send about 6× more source bytes.

scipy.stats.ttest_ind()
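A hedged sketch of the two-sample test using synthetic data. The lognormal samples are invented stand-ins for srcbytes, and `equal_var=False` (Welch's variant) is an assumption made here because skewed byte counts rarely have equal variances; the case study does not say which variant was used.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic, right-skewed byte counts (not the project's data).
normal_bytes = rng.lognormal(mean=6.0, sigma=1.0, size=500)
attack_bytes = rng.lognormal(mean=7.0, sigma=1.0, size=500)

# Welch's t-test: does mean srcbytes differ between the two groups?
t_stat, p_value = stats.ttest_ind(attack_bytes, normal_bytes, equal_var=False)

print(f"normal mean: {normal_bytes.mean():.0f}")
print(f"attack mean: {attack_bytes.mean():.0f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```

With heavy skew and outliers, a rank-based test such as `scipy.stats.mannwhitneyu` is a reasonable cross-check on the t-test result.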

Hypothesis Testing — Chi-Square

Tested whether protocol type (TCP/UDP/ICMP) is associated with attack likelihood. Chi-Square p-value ≈ 3e-79 — extremely significant. ICMP protocol shows the highest anomaly ratio.

chi2_contingency() · pd.crosstab()
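The association test can be sketched with `pd.crosstab` and `chi2_contingency` as below. The counts are made up so that ICMP has the highest attack share; they are not the dataset's real frequencies.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: 60 TCP, 20 UDP and 20 ICMP connections with made-up labels.
df = pd.DataFrame({
    "protocoltype": ["tcp"] * 60 + ["udp"] * 20 + ["icmp"] * 20,
    "attack": [0] * 40 + [1] * 20 + [0] * 15 + [1] * 5 + [0] * 2 + [1] * 18,
})

# Contingency table: protocol (rows) by label (columns).
table = pd.crosstab(df["protocoltype"], df["attack"])

chi2, p_value, dof, expected = chi2_contingency(table)

# Share of attack connections per protocol, highest first.
anomaly_ratio = df.groupby("protocoltype")["attack"].mean().sort_values(ascending=False)

print(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3g}")
print(anomaly_ratio)
```

The p-value says whether protocol and label are associated at all; the per-protocol ratio is what shows which protocol is most often malicious.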

Model 1 — Logistic Regression

Baseline model. Applied StandardScaler + SimpleImputer (median strategy). Achieved 98.9% accuracy. Good precision and recall across both classes.

LogisticRegression(max_iter=1000) · StandardScaler · SimpleImputer
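One way to wire the baseline together is a scikit-learn `Pipeline`, sketched here on synthetic data. The use of `Pipeline`, the `make_classification` data and the injected missing values are assumptions for illustration; the case study only names the three components.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem standing in for the traffic data.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    n_clusters_per_class=1, class_sep=2.0, random_state=42,
)

# Blank out about 1% of values so the imputer has work to do.
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.01] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Impute -> scale -> classify; every step is fitted on training data only.
baseline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(X_train, y_train)

accuracy = baseline.score(X_test, y_test)
print(f"Baseline accuracy on synthetic data: {accuracy:.3f}")
```

Bundling the preprocessing with the model also means a single object can be saved and served, so the API applies the same transformations as training.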

Model 2 — Random Forest

Improved on LR with 100 estimators. Accuracy: 99.94%. ROC-AUC: 0.999. Confusion matrix: TN=13418, FP=4, FN=10, TP=11763. Only 14 misclassifications out of 25,195 test samples.

RandomForestClassifier(n_estimators=100) · roc_curve() · auc()

Python — model_pipeline.py
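import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# df: the cleaned, deduplicated DataFrame with a binary target column.
# The target name 'attack' and the split's random_state are assumptions.

# One-hot encode categoricals → 123 features
df = pd.get_dummies(df, columns=['protocoltype', 'service', 'flag'])

# 80/20 train/test split
X = df.drop(columns=['attack'])
y = df['attack']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scale + impute, fitted on the training split only.
# StandardScaler ignores NaNs when fitting and passes them through,
# so the imputer can run after it.
scaler = StandardScaler()
imputer = SimpleImputer(strategy='median')
X_train = imputer.fit_transform(scaler.fit_transform(X_train))
X_test  = imputer.transform(scaler.transform(X_test))

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Results
print("Accuracy:", accuracy_score(y_test, rf.predict(X_test)))
# 0.9994
print("ROC-AUC:", roc_auc_score(y_test, rf.predict_proba(X_test)[:,1]))
# 0.999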
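The Model 2 step also lists `roc_curve()` and `auc()`. A minimal sketch of how those two calls, plus the confusion matrix, are typically obtained is shown here on synthetic data; none of the numbers it prints relate to the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded traffic features.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    n_clusters_per_class=1, class_sep=2.0, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# ROC curve from the predicted probability of the positive class.
scores = rf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)
roc_auc = auc(fpr, tpr)

# For binary labels, ravel() yields the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_test, rf.predict(X_test)).ravel()

print(f"ROC-AUC: {roc_auc:.3f}")
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

The `fpr` and `tpr` arrays are what would be plotted to draw the ROC curve itself.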

04 — Model Results

The numbers in detail

Model	Accuracy	Precision	Recall	ROC-AUC
Logistic Regression	98.9%	~0.99	~0.99	—
Random Forest ✓	99.94%	~0.999	~0.999	0.999

Confusion Matrix

	Predicted Normal	Predicted Attack
Actual Normal	TN = 13,418	FP = 4
Actual Attack	FN = 10	TP = 11,763
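The summary metrics follow directly from the four confusion-matrix counts above. This short calculation derives them; the false positive rate is not reported elsewhere in the case study and is included only because it falls out of the same counts.

```python
# Random Forest test-set counts from the confusion matrix above.
tn, fp, fn, tp = 13_418, 4, 10, 11_763

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)            # flagged connections that are real attacks
recall = tp / (tp + fn)               # real attacks that get flagged
false_positive_rate = fp / (fp + tn)  # normal traffic flagged by mistake

print(f"Test samples: {total}")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"False positive rate: {false_positive_rate:.5f}")
```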

⚠️

Why 99.9% needs context

High accuracy is partly because this dataset has strong class separability — normal and attack traffic are clearly different in the feature space. Real-world network data is noisier and more adversarial. The model would need retraining on live data and adversarial samples to maintain performance in production.

05 — Flask API Deployment

From model to live API

The trained Random Forest model was saved with joblib and wrapped in a Flask REST API. ngrok exposed it publicly for testing — simulating a real deployment where any network monitoring agent can POST connection features and get an anomaly prediction in milliseconds.
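A sketch of the save step, assuming the model and the ordered feature list are stored as two separate joblib files (the API code loads `model.pkl` and `features.pkl`). The tiny training frame and the temporary directory are illustrative only.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up training frame; the real model uses 123 features.
X = pd.DataFrame({
    "srcbytes": [100, 200, 90_000, 80_000],
    "dstbytes": [500, 400, 0, 10],
})
y = [0, 0, 1, 1]
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# Persist the model and the exact column order it was trained on.
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "model.pkl")
features_path = os.path.join(out_dir, "features.pkl")
joblib.dump(model, model_path)
joblib.dump(list(X.columns), features_path)

# Reload, as the API would do at start-up.
restored = joblib.load(model_path)
feature_names = joblib.load(features_path)
print(feature_names, restored.predict(X).tolist())
```

Saving the column order alongside the model lets the API rebuild inputs in the same layout the model saw during training.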

Python — flask_app.py
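from flask import Flask, request, jsonify
import joblib
import pandas as pd

model         = joblib.load("model.pkl")
feature_names = joblib.load("features.pkl")

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Parse the JSON body; reject anything that is not a JSON object
    data = request.get_json(silent=True)
    if not isinstance(data, dict):
        return jsonify({"error": "expected a JSON object of features"}), 400
    # Fill missing features with 0 (a convenience default; it can mask incomplete inputs)
    input_data = {col: data.get(col, 0) for col in feature_names}
    # Keep the training column order; values must be preprocessed as in training
    df = pd.DataFrame([input_data], columns=feature_names)
    prediction = model.predict(df)
    # 0 = Normal, 1 = Attack
    return jsonify({"prediction": int(prediction[0])})

if __name__ == "__main__":
    app.run()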

# Example request:
# POST /predict
# { "srcbytes": 1000, "dstbytes": 500 }
# Response: { "prediction": 0 }  → Normal traffic
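For completeness, a client-side sketch using only the standard library. The URL assumes the Flask development server on its default local port; the public ngrok address would replace it. The helper names `build_predict_request` and `predict` are hypothetical.

```python
import json
import urllib.request

# Assumed local address of the Flask dev server (not from the case study).
DEFAULT_URL = "http://127.0.0.1:5000/predict"

def build_predict_request(features, url=DEFAULT_URL):
    """Build a JSON POST request carrying one connection's features."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def predict(features, url=DEFAULT_URL, timeout=5.0):
    """Send the request and return the decoded JSON response."""
    request = build_predict_request(features, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Building the request does not need a running server.
req = build_predict_request({"srcbytes": 1000, "dstbytes": 500})
print(req.get_method(), req.full_url)
```

Calling `predict({"srcbytes": 1000, "dstbytes": 500})` sends the request and needs the API to be running.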

🔗 Live Tableau Dashboard

Interactive visualisation of traffic distribution, protocol vs anomaly trends, and service-level attack patterns.

View Tableau Dashboard ↗

06 — Top Feature Importances

What signals anomalies

Rank	Feature	Category	Importance
#1	srcbytes	Traffic Volume	High
#2	dstbytes	Traffic Volume	High
#3	samesrvrate	Connection Behaviour	Med
#4	diffsrvrate	Connection Behaviour	Med
#5	flag_SF	Connection Status	Med
#6	count	Connection Frequency	Low
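A ranking like the one above is usually read from the fitted forest's `feature_importances_` attribute. The sketch below borrows the six feature names from the table but fits on synthetic data, so the order it prints is meaningless for real traffic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Names taken from the table; the values are synthetic.
feature_names = ["srcbytes", "dstbytes", "samesrvrate",
                 "diffsrvrate", "flag_SF", "count"]

X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3,
    n_redundant=0, random_state=42,
)
X = pd.DataFrame(X, columns=feature_names)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances; they sum to 1 across all features.
importances = (
    pd.Series(rf.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances)
```

Impurity-based importances can favour high-cardinality numeric features such as byte counts, so `sklearn.inspection.permutation_importance` is a useful cross-check.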

07 — Key Findings

📡

Attack traffic transfers 6× more data

Normal mean srcbytes: 13K. Attack mean: 82K. The t-test (p = 0.035) indicates the difference is statistically significant at the 5% level and unlikely to be random variation.

🌐

ICMP shows highest anomaly ratio

Protocol type is significantly associated with attack likelihood (Chi-Square p ≈ 3e-79). ICMP connections are disproportionately malicious; ICMP is often used in ping floods and ICMP tunnelling.

🎯

Only 14 errors out of 25K test samples

FP=4, FN=10. The model errs on the side of few false positives, which keeps alert volume low for security teams, at the cost of 10 missed attacks in the test set.

⚡

Sub-millisecond prediction via API

The Flask API returns predictions in <1ms — fast enough for real-time network traffic analysis at scale.
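A latency figure like this can be checked with a small timing loop. The sketch below times only the in-process `predict` call on a synthetic model, so it leaves out HTTP, JSON parsing and network overhead; the model size, data and repeat count are assumptions.

```python
import statistics
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic model standing in for the trained Random Forest.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = (X[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Time repeated single-row predictions.
row = X[:1]
timings_ms = []
for _ in range(50):
    start = time.perf_counter()
    model.predict(row)
    timings_ms.append((time.perf_counter() - start) * 1000)

median_ms = statistics.median(timings_ms)
print(f"Median single-row predict latency: {median_ms:.3f} ms")
```

Results depend on hardware and on the number of trees, so the median over many calls is more informative than a single measurement.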

08 — Tech Stack

Python 3 · Scikit-learn · Random Forest · Logistic Regression · Flask · ngrok · Tableau · Scipy Stats · joblib · Seaborn
