Industry Simulation · Machine Learning · Completed

British Airways
Data Science Simulation

Web scraping 1,000 customer reviews, sentiment analysis with TextBlob, and a Random Forest model to predict which customers will complete a booking — all in the British Airways Forage simulation.

Platform: Forage · British Airways
Completed: May 4, 2025
Tasks: Web Scraping · NLP · ML
Tools: Python · BeautifulSoup · TextBlob · Scikit-learn
1,000+ Reviews Scraped
85.1% Model Accuracy
10 Top Features Identified
2 Tasks Completed
🏅
British Airways · Forage Certificate
Completed: May 4, 2025
Verification Code: z5GL5FQviDkXuuvD2
01 — Simulation Overview

What this simulation involved

The British Airways Forage simulation covers two real data science tasks that reflect what BA's data teams actually do: understanding customer sentiment from unstructured review data and predicting which customers will follow through with a booking.

✈️
Why this matters to British Airways
Airlines operate on thin margins, so understanding why customers book, or don't, has direct revenue impact. Sentiment analysis of reviews can surface operational issues that internal data may miss.
02 — Task 1 — Web Scraping & Sentiment Analysis

Mining 1,000 customer reviews

British Airways customer reviews from Skytrax (airlinequality.com) were scraped using Python's requests and BeautifulSoup libraries. Requesting 10 pages at 100 reviews per page yields 1,000 customer opinions on everything from cabin crew to food to delays.

01
Web Scraping
Built a loop to paginate through 10 pages of BA reviews on airlinequality.com. Each page scraped with requests.get() and parsed with BeautifulSoup to extract the review text.
requests · BeautifulSoup · for loop pagination
02
Data Cleaning
Stripped whitespace, removed rows with missing reviews, and used a regex to remove special characters. Saved the cleaned reviews to CSV.
str.strip() · dropna() · re.sub()
03
Sentiment Analysis with TextBlob
Applied TextBlob sentiment polarity scoring to each review. Scores range from -1 (negative) to +1 (positive). Generated a histogram of the sentiment distribution.
TextBlob().sentiment.polarity
04
Word Cloud Visualisation
Combined all reviews into one string and generated a word cloud to surface the most frequently mentioned topics — delay, service, food, staff emerged prominently.
WordCloud · matplotlib
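A word cloud is essentially a picture of word frequencies, so the counts behind step 04 can be inspected directly. The sketch below is a minimal illustration, assuming a small hand-picked stopword list and two made-up reviews; `top_words` and `STOPWORDS` are hypothetical names, not part of the original notebook.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; WordCloud ships its own, larger set)
STOPWORDS = {"the", "a", "an", "and", "was", "were", "to", "of", "on", "in", "but", "is", "by"}

def top_words(reviews, n=5):
    """Return the n most common non-stopword tokens across all reviews."""
    text = " ".join(reviews).lower()       # combine reviews into one string
    tokens = re.findall(r"[a-z']+", text)  # keep alphabetic tokens only
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)

# Made-up sample reviews, for illustration only
sample = [
    "The flight was delayed and the food was cold.",
    "Friendly staff, but the flight was delayed by two hours.",
]
print(top_words(sample, n=2))
```

A dictionary built from these counts could also be handed to WordCloud's `generate_from_frequencies` if a picture is still wanted.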
Python — scraping_and_sentiment.py
import pandas as pd
import requests
from bs4 import BeautifulSoup
from textblob import TextBlob

base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
reviews = []

# Scrape 10 pages × 100 reviews
for i in range(1, 11):
    url = f"{base_url}/page/{i}/?pagesize=100"
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on an HTTP error page
    soup = BeautifulSoup(response.text, "html.parser")
    # Each review body sits in a <div itemprop="reviewBody">
    for review in soup.find_all("div", itemprop="reviewBody"):
        reviews.append(review.get_text())

# Collect the scraped text in a DataFrame so each review can be scored
df = pd.DataFrame({"reviews": reviews})

# Sentiment scoring
df["sentiment"] = df["reviews"].apply(lambda x: TextBlob(x).sentiment.polarity)
# Range: -1 (negative) to +1 (positive)
# BA reviews: skewed negative (delays, service complaints)
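Between scraping and scoring sits the cleaning described in step 02. Here is a minimal sketch of that step, assuming that "special characters" means anything other than letters, digits, whitespace and basic punctuation. The `clean_review` helper, its regex and the sample strings are illustrative, not the original code.

```python
import re

def clean_review(text):
    """Strip whitespace and drop characters outside a basic allow-list."""
    text = text.strip()
    # Assumed definition of "special characters": anything that is not a
    # letter, digit, whitespace, or common punctuation mark
    text = re.sub(r"[^A-Za-z0-9\s.,!?'-]", "", text)
    # Collapse the runs of whitespace left behind by removals
    return re.sub(r"\s+", " ", text).strip()

# Made-up raw strings, for illustration only
raw = ["  ✅ Trip Verified | Great crew!  ", "", "Delayed 3 hours… no updates."]
# Drop empty entries (the list equivalent of dropna()), then clean the rest
cleaned = [clean_review(r) for r in raw if r and r.strip()]
print(cleaned)
```

With pandas, the same helper could be applied as `df["reviews"].dropna().apply(clean_review)`.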
03 — Task 2 — Booking Prediction Model

Predicting who will book

Using the BA customer booking dataset, the goal was to build a model that predicts booking_complete (1 or 0) from features like purchase lead time, trip type, flight hour, and add-on preferences.

01
Data Loading & Exploration
Loaded CSV with ISO-8859-1 encoding. Checked for missing values (none found). Explored feature distributions and unique values per column.
pd.read_csv(encoding='ISO-8859-1') · info() · describe()
02
Feature Engineering & Encoding
One-hot encoded sales_channel, trip_type, flight_day. Dropped high-cardinality columns route and booking_origin to prevent overfitting.
get_dummies() · drop(columns=)
03
Train-Test Split
80/20 split with random_state=42 for reproducibility. Confirmed shapes of training and test sets.
train_test_split(test_size=0.2)
04
Random Forest + Evaluation
Trained RandomForestClassifier. Evaluated with accuracy_score and classification_report. Plotted top 10 feature importances as a Seaborn bar chart.
RandomForestClassifier · feature_importances_
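To make the encoding in step 02 concrete, here is a small sketch on a made-up three-row frame. The row values are invented, and `purchase_lead` is an assumed column name for the purchase lead time feature. With `drop_first=True`, the first category of each encoded column is dropped so the remaining indicator columns are not perfectly collinear.

```python
import pandas as pd

# Made-up rows; purchase_lead is an assumed column name
toy = pd.DataFrame({
    "trip_type": ["RoundTrip", "OneWay", "CircleTrip"],
    "purchase_lead": [120, 14, 45],
})

# One indicator column per category, minus the first (alphabetical) category
encoded = pd.get_dummies(toy, columns=["trip_type"], drop_first=True)
print(list(encoded.columns))
```

The dropped category (`CircleTrip` here) becomes the implicit baseline: a row with all indicator columns false belongs to it.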
Python — booking_prediction.py
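import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split

# Load the booking data (file name is an assumption; the write-up does not give it)
df = pd.read_csv("customer_booking.csv", encoding="ISO-8859-1")

# Encode categoricals
df_enc = pd.get_dummies(df, columns=['sales_channel','trip_type','flight_day'], drop_first=True)
df_enc = df_enc.drop(columns=['route','booking_origin'])

X = df_enc.drop('booking_complete', axis=1)
y = df_enc['booking_complete']

# 80/20 split, seeded for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)  # predict once, reuse for both metrics
print("Accuracy:", accuracy_score(y_test, y_pred))
# Reported accuracy: 85.1%
print(classification_report(y_test, y_pred))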
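Step 04 also mentions ranking the top 10 feature importances, which the listing above does not show. Below is a minimal sketch of that ranking, assuming a fitted model that exposes `feature_importances_` and training columns available as `X.columns`. The `top_features` helper, the feature names and the numbers are made up for illustration.

```python
def top_features(names, importances, k=10):
    """Pair each feature name with its importance and return the k largest."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Invented names and importances, for illustration only (not the model's real values)
names = ["purchase_lead", "flight_hour", "length_of_stay", "wants_extra_baggage"]
importances = [0.40, 0.30, 0.25, 0.05]

for name, score in top_features(names, importances, k=3):
    print(f"{name}: {score:.2f}")
```

With the fitted forest, the call would look like `top_features(X.columns, rf.feature_importances_)`, and the resulting pairs can be passed to a bar chart.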
04 — Key Findings

What the simulation taught

🕒
Purchase lead time is the strongest predictor
Customers who book far in advance are significantly more likely to complete. Purchase lead time is the model's top feature by importance.
😤
BA reviews skew negative
Sentiment analysis revealed that delayed flights, cabin crew service, and in-flight food are the most complained-about topics — consistent with public perception.
📊
Real data pipelines start with messy sources
Web scraping produces unstructured, inconsistent text. Cleaning and structuring that data is where most of the work actually lives.
⚖️
Class imbalance is the hidden challenge
~83% of customers don't complete bookings — the same imbalance problem that appears in real credit risk and churn models.
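With an imbalance like this, it helps to compute the accuracy a trivial "always predict the majority class" model would reach before judging a classifier. Below is a minimal sketch, using a synthetic label list built to match the roughly 83/17 split quoted above; `majority_baseline_accuracy` is an illustrative helper, not part of the original notebook.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Synthetic labels: 83 incomplete bookings (0) and 17 completed (1)
labels = [0] * 83 + [1] * 17
baseline = majority_baseline_accuracy(labels)
print(f"Majority-class baseline: {baseline:.1%}")
```

Against a floor of about 83%, the reported 85.1% accuracy is a modest gain, which is why the per-class precision and recall from classification_report deserve as much attention as the headline number.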
05 — Tech Stack
Python 3 · BeautifulSoup · Requests · TextBlob · Pandas · Scikit-learn · Random Forest · WordCloud · Matplotlib