Machine learning to detect malicious network traffic — Logistic Regression and Random Forest classifiers on 123 engineered features, deployed as a Flask API with a Tableau dashboard.
Network anomalies such as DDoS attacks, unauthorised access, port scanning, and malicious data exfiltration are expensive and dangerous. Firewalls and rule-based systems catch known patterns. Machine learning can also flag unknown ones: unusual traffic behaviour that doesn't match any predefined rule.
The goal: build a model that classifies network connections as normal or attack in real time, with high precision and minimal false positives — then expose it as an API that any network monitoring system can call.
The dataset captures per-connection network traffic attributes — bytes transferred, protocol types, connection flags, error rates, and service types. Each row is one network connection, labelled as normal or one of several attack categories (collapsed to a binary label).
| Feature Group | Examples | Relevance |
|---|---|---|
| Traffic Volume | srcbytes, dstbytes | Highest importance |
| Connection Behaviour | samesrvrate, diffsrvrate, count | High importance |
| Connection Status | flag_SF, loggedin, lastflag | Medium importance |
| Protocol (encoded) | protocoltype_tcp, protocoltype_icmp | Medium importance |
| Service (encoded) | service_http, service_ftp | Supporting features |
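The collapse of attack categories into a binary label could look like the minimal sketch below. The column name `attack` and the category strings are assumptions for illustration, since the source does not name the label column.

```python
import pandas as pd

# Hypothetical label column and category names; the real dataset's may differ.
df = pd.DataFrame({
    "srcbytes": [181, 0, 0, 239],
    "attack": ["normal", "neptune", "portsweep", "normal"],
})

# Anything that is not "normal" counts as an attack (1); normal traffic is 0.
df["is_attack"] = (df["attack"] != "normal").astype(int)

print(df["is_attack"].tolist())  # [0, 1, 1, 0]
```

The 0/1 encoding matches the API's convention of 0 = Normal, 1 = Attack.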
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# One-hot encode categoricals → 123 features
df = pd.get_dummies(df, columns=['protocoltype', 'service', 'flag'])

# X_train, X_test, y_train, y_test come from a train/test split of df (not shown)

# Scale + impute (StandardScaler ignores NaNs when fitting and passes them through)
scaler = StandardScaler()
imputer = SimpleImputer(strategy='median')
X_train = imputer.fit_transform(scaler.fit_transform(X_train))
X_test = imputer.transform(scaler.transform(X_test))

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Results
print("Accuracy:", accuracy_score(y_test, rf.predict(X_test)))  # 0.9994
print("ROC-AUC:", roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))  # 0.999
```
| Model | Accuracy | Precision | Recall | ROC-AUC |
|---|---|---|---|---|
| Logistic Regression | 98.9% | ~0.99 | ~0.99 | — |
| Random Forest ✓ | 99.94% | ~0.999 | ~0.999 | 0.999 |
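A side-by-side comparison like the one in the table can be produced with a loop over both estimators. The sketch below runs on synthetic data from `make_classification`, not the project's dataset, so its numbers will not match the table. Wrapping the preprocessing in a pipeline (impute, then scale) is a choice made for this sketch rather than the project's exact code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; NOT the project's dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, estimator in models.items():
    # Same preprocessing for both models so the comparison is like for like.
    pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), estimator)
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    proba = pipe.predict_proba(X_test)[:, 1]  # probability of the positive class
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Fitting the imputer and scaler inside the pipeline keeps test-set statistics out of training, and the fitted pipeline can be saved as a single object.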
| | Predicted Normal | Predicted Attack |
|---|---|---|
| Actual Normal | TN = 13,418 | FP = 4 |
| Actual Attack | FN = 10 | TP = 11,763 |
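The headline metrics can be re-derived from the four confusion-matrix counts above. This is a short arithmetic check, not part of the original pipeline:

```python
# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 13_418, 4, 10, 11_763

total = tn + fp + fn + tp
accuracy = (tp + tn) / total          # share of all connections classified correctly
precision = tp / (tp + fp)            # share of predicted attacks that were real
recall = tp / (tp + fn)               # share of real attacks that were caught
false_positive_rate = fp / (fp + tn)  # share of normal traffic flagged as attack

print(f"accuracy  = {accuracy:.4f}")   # 0.9994
print(f"precision = {precision:.4f}")  # 0.9997
print(f"recall    = {recall:.4f}")     # 0.9992
print(f"FPR       = {false_positive_rate:.5f}")  # 0.00030
```

The 4 false positives out of 13,422 normal connections are what "minimal false positives" amounts to in practice, while the 10 missed attacks set the recall.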
The trained Random Forest model was saved with joblib and wrapped in a Flask REST API. ngrok exposed it publicly for testing — simulating a real deployment where any network monitoring agent can POST connection features and get an anomaly prediction in milliseconds.
```python
from flask import Flask, request, jsonify
import joblib
import pandas as pd

# Note: if model.pkl holds only the classifier, inputs must first receive the
# same scaling/imputation used in training (or save a full Pipeline instead).
model = joblib.load("model.pkl")
feature_names = joblib.load("features.pkl")

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    data = request.json
    # Fill missing features with 0 (safe default)
    input_data = {col: data.get(col, 0) for col in feature_names}
    # Column order follows feature_names, matching the training matrix
    df = pd.DataFrame([input_data])
    prediction = model.predict(df)
    return jsonify({"prediction": int(prediction[0])})  # 0 = Normal, 1 = Attack

# Example request:
#   POST /predict
#   { "srcbytes": 1000, "dstbytes": 500 }
# Response: { "prediction": 0 }  → Normal traffic
```
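Because the model expects one-hot columns such as `protocoltype_tcp`, a client that sends raw categorical values needs them mapped onto those columns before prediction. The helper below is a hypothetical sketch of that step: `build_feature_row` is not part of the original API, and the short `feature_names` list stands in for the real 123 columns. It also reports keys it could not place, which the zero-fill default would otherwise hide.

```python
def build_feature_row(payload, feature_names,
                      categorical=("protocoltype", "service", "flag")):
    """Map a raw connection dict onto the model's feature columns.

    Returns (row, unknown): row has every feature, defaulting to 0;
    unknown lists payload keys or one-hot columns the model does not know.
    """
    row = {name: 0 for name in feature_names}
    unknown = []
    for key, value in payload.items():
        if key in categorical:
            # Raw categorical value -> one-hot column, e.g. protocoltype_tcp
            column = f"{key}_{value}"
            if column in row:
                row[column] = 1
            else:
                unknown.append(column)
        elif key in row:
            row[key] = value
        else:
            unknown.append(key)
    return row, unknown


# Hypothetical subset of the feature list, for illustration only.
feature_names = ["srcbytes", "dstbytes", "protocoltype_tcp",
                 "protocoltype_icmp", "flag_SF"]
payload = {"srcbytes": 1000, "protocoltype": "tcp", "flag": "SF", "typo_field": 1}

row, unknown = build_feature_row(payload, feature_names)
print(row)
print(unknown)  # ['typo_field']
```

Returning the unknown keys lets the endpoint warn or reject a request instead of silently predicting on an all-zero row when a client misspells a field.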
Interactive visualisation of traffic distribution, protocol vs anomaly trends, and service-level attack patterns.
| Rank | Feature | Category |
|---|---|---|
| #1 | srcbytes | Traffic Volume |
| #2 | dstbytes | Traffic Volume |
| #3 | samesrvrate | Connection Behaviour |
| #4 | diffsrvrate | Connection Behaviour |
| #5 | flag_SF | Connection Status |
| #6 | count | Connection Frequency |
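A ranking like this is typically derived by sorting per-feature importance scores. With a fitted scikit-learn forest, the scores would come from `rf.feature_importances_` and the names from the training columns. The sketch below uses made-up values, and `top_features` is a hypothetical helper rather than the project's code.

```python
def top_features(names, importances, k=6):
    """Return the k (name, importance) pairs with the largest importance."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical importance values for illustration only; not the project's results.
names = ["count", "srcbytes", "flag_SF", "dstbytes"]
importances = [0.05, 0.30, 0.10, 0.25]

for rank, (name, score) in enumerate(top_features(names, importances, k=3), start=1):
    print(f"#{rank} {name}: {score:.2f}")
# #1 srcbytes: 0.30
# #2 dstbytes: 0.25
# #3 flag_SF: 0.10
```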