British Airways Data Science Simulation

This project covers web scraping 1,000 customer reviews, sentiment analysis with TextBlob, and a Random Forest model to predict which customers will complete a booking, all within the British Airways Forage simulation.

Platform: Forage · British Airways

Completed: May 4, 2025

Tasks: Web Scraping · NLP · ML

Tools: Python · BeautifulSoup · TextBlob · Scikit-learn

What this simulation involved

The British Airways Forage simulation covers two real data science tasks that reflect what BA's data teams actually do: understanding customer sentiment from unstructured review data and predicting which customers will follow through with a booking.

✈️

Why this matters to British Airways

Airlines operate on thin margins, so understanding why customers book, or why they don't, has a direct revenue impact. Sentiment analysis of reviews can surface operational issues that internal data may miss.

Mining 1,000 customer reviews

British Airways customer reviews from Skytrax were scraped using Python's requests and BeautifulSoup libraries. 10 pages × 100 reviews per page = 1,000 customer opinions on everything from cabin crew to food to delays.

Web Scraping

Built a loop to paginate through 10 pages of BA reviews on airlinequality.com. Each page scraped with requests.get() and parsed with BeautifulSoup to extract the review text.

requests · BeautifulSoup · for loop pagination

Data Cleaning

Stripped whitespace, removed rows with missing reviews, used regex to remove special characters. Saved cleaned reviews to CSV.

str.strip() · dropna() · re.sub()
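The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the exact code used in the simulation: the `clean_reviews` helper, the sample strings, and the choice of regex (keep letters, digits and whitespace) are all assumptions.

```python
import re

import pandas as pd


def clean_reviews(raw_reviews):
    """Return a DataFrame with one cleaned review per row (illustrative helper)."""
    df = pd.DataFrame({"reviews": raw_reviews})
    # Drop rows where the review is missing before applying string methods
    df = df.dropna(subset=["reviews"]).copy()
    # Trim leading and trailing whitespace
    df["reviews"] = df["reviews"].str.strip()
    # Replace anything other than letters, digits and whitespace with a space
    df["reviews"] = df["reviews"].apply(lambda text: re.sub(r"[^A-Za-z0-9\s]", " ", text))
    # Collapse the runs of whitespace left behind by the substitution
    df["reviews"] = df["reviews"].apply(lambda text: re.sub(r"\s+", " ", text).strip())
    # Drop reviews that ended up empty after cleaning
    return df[df["reviews"] != ""].reset_index(drop=True)


# Made-up sample input, including a missing value and a whitespace-only review
sample = ["  Great crew & food!!  ", None, "Delayed 3 hours... no food.", "   "]
cleaned = clean_reviews(sample)
print(cleaned["reviews"].tolist())
```

The cleaned frame could then be written out with `cleaned.to_csv(...)`, matching the "saved cleaned reviews to CSV" step.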

Sentiment Analysis with TextBlob

Applied TextBlob sentiment polarity scoring to each review. Scores range from -1 (negative) to +1 (positive). Generated a histogram of the sentiment distribution.

TextBlob().sentiment.polarity

Word Cloud Visualisation

Combined all reviews into one string and generated a word cloud to surface the most frequently mentioned topics — delay, service, food, staff emerged prominently.

WordCloud · matplotlib
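A word cloud is essentially a picture of term frequencies, so the same signal can be checked numerically. The sketch below swaps the WordCloud library for a plain `collections.Counter` count; the `top_terms` helper, the tiny stop-word list, and the sample reviews are hypothetical and only illustrate the idea.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list for illustration only
STOP_WORDS = {"the", "a", "an", "and", "was", "were", "is", "to", "of",
              "on", "in", "for", "with", "but", "very", "our"}


def top_terms(reviews, n=5):
    """Count lowercase word occurrences across all reviews, skipping stop words."""
    counts = Counter()
    for review in reviews:
        words = re.findall(r"[a-z]+", review.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


# Made-up reviews standing in for the scraped text
sample_reviews = [
    "The flight was delayed and the food was cold",
    "Friendly staff but the flight was delayed",
    "Food was fine, staff were helpful",
]
print(top_terms(sample_reviews, n=4))
```

A frequency table like this makes it easier to confirm that a term is prominent because it is common, rather than because of how the cloud happened to be laid out.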

Python — scraping_and_sentiment.py

import pandas as pd
import requests
from bs4 import BeautifulSoup
from textblob import TextBlob

base_url = "https://www.airlinequality.com/airline-reviews/british-airways"
reviews = []

# Scrape 10 pages × 100 reviews
for i in range(1, 11):
    url = f"{base_url}/page/{i}/?pagesize=100"
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on an HTTP error
    soup = BeautifulSoup(response.text, "html.parser")
    for review in soup.find_all("div", itemprop="reviewBody"):
        reviews.append(review.get_text())

# Collect the scraped text into a DataFrame
df = pd.DataFrame({"reviews": reviews})

# Sentiment scoring
df["sentiment"] = df["reviews"].apply(lambda x: TextBlob(x).sentiment.polarity)
# Range: -1 (negative) to +1 (positive)
# BA reviews: skewed negative (delays, service complaints)

Predicting who will book

Using the BA customer booking dataset, the goal was to build a model that predicts booking_complete (1 or 0) from features like purchase lead time, trip type, flight hour, and add-on preferences.

Data Loading & Exploration

Loaded CSV with ISO-8859-1 encoding. Checked for missing values (none found). Explored feature distributions and unique values per column.

pd.read_csv(encoding='ISO-8859-1') · info() · describe()

Feature Engineering & Encoding

One-hot encoded sales_channel, trip_type, flight_day. Dropped high-cardinality columns route and booking_origin to prevent overfitting.

get_dummies() · drop(columns=)
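To make the encoding step concrete, here is a small sketch of what `get_dummies()` with `drop_first=True` does to categorical columns. The toy rows, the category values, and the `purchase_lead` column name are assumptions for illustration and are not taken from the BA dataset.

```python
import pandas as pd

# Toy rows; the values and the purchase_lead column name are made up
toy = pd.DataFrame({
    "sales_channel": ["Internet", "Mobile", "Internet"],
    "trip_type": ["RoundTrip", "OneWay", "RoundTrip"],
    "purchase_lead": [120, 7, 45],
})

# drop_first=True drops the first category of each column (in sorted order),
# so a two-level column becomes a single indicator column
encoded = pd.get_dummies(toy, columns=["sales_channel", "trip_type"], drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

Dropping the first level avoids carrying a redundant indicator column. For a tree-based model this mainly keeps the feature matrix smaller.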

Train-Test Split

80/20 split with random_state=42 for reproducibility. Confirmed shapes of training and test sets.

train_test_split(test_size=0.2)
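With an imbalanced target, one optional refinement is to stratify the split so that both sets keep the same class ratio. The sketch below uses synthetic labels. The `stratify=y` argument is an addition for illustration and was not necessarily part of the original split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 830 non-bookers (0) and 170 bookers (1)
y = np.array([0] * 830 + [1] * 170)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# 80/20 split; stratify=y keeps the class ratio the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```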

Random Forest + Evaluation

Trained RandomForestClassifier. Evaluated with accuracy_score and classification_report. Plotted top 10 feature importances as a Seaborn bar chart.

RandomForestClassifier · feature_importances_
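Ranking features by importance can be sketched as below. The data comes from scikit-learn's `make_classification`, so the `feature_i` names are placeholders rather than the BA columns, and the plotting step is left out.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the booking features
X_arr, y = make_classification(
    n_samples=500, n_features=12, n_informative=4, random_state=42
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(12)])

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Pair each importance with its column name, then keep the ten largest
importances = pd.Series(rf.feature_importances_, index=X.columns)
top10 = importances.sort_values(ascending=False).head(10)
print(top10)
```

A series like `top10` can be handed to a bar chart, for example `sns.barplot(x=top10.values, y=top10.index)`. Note that these impurity-based importances show how much a feature is used in splits, not the direction of its effect.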

Python — booking_prediction.py

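import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split

# df is the booking DataFrame loaded earlier with pd.read_csv(..., encoding='ISO-8859-1')

# Encode categoricals
df_enc = pd.get_dummies(df, columns=['sales_channel','trip_type','flight_day'], drop_first=True)
df_enc = df_enc.drop(columns=['route','booking_origin'])

X = df_enc.drop('booking_complete', axis=1)
y = df_enc['booking_complete']

# 80/20 split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Predict once and reuse the predictions for both metrics
y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
# 85.1%
print(classification_report(y_test, y_pred))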

What the simulation taught

🕒

Purchase lead time is the strongest predictor

Purchase lead time was the model's top feature by importance. Customers who book far in advance were more likely to complete a booking.

😤

BA reviews skew negative

Sentiment scores skewed negative, and the word cloud pointed to delayed flights, cabin crew service, and in-flight food as the most frequently mentioned complaint topics, which is consistent with public perception.

📊

Real data pipelines start with messy sources

Web scraping produces unstructured, inconsistent text. Cleaning and structuring that data is where most of the work actually lives.

⚖️

Class imbalance is the hidden challenge

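~83% of customers don't complete bookings, so a model that always predicts "no booking" would already score about 83% accuracy, which puts the 85.1% result in context. The same imbalance problem appears in real credit risk and churn models.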
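The effect of that imbalance on accuracy can be shown with a few lines of plain Python. This is a sketch on made-up labels that mirror the ~83/17 split, and the `majority_baseline` helper is hypothetical.

```python
def majority_baseline(labels):
    """Score a 'model' that always predicts the most common class."""
    majority = max(set(labels), key=labels.count)
    predictions = [majority] * len(labels)
    # Accuracy: share of predictions that match the true label
    accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
    # Recall on class 1: share of actual bookers the baseline catches
    positives = sum(1 for t in labels if t == 1)
    true_positives = sum(1 for p, t in zip(predictions, labels) if p == 1 and t == 1)
    recall = true_positives / positives if positives else 0.0
    return accuracy, recall


# Made-up labels: 83 non-bookers (0) and 17 bookers (1)
labels = [0] * 83 + [1] * 17
accuracy, recall = majority_baseline(labels)
print(f"accuracy={accuracy:.2f}, recall on bookers={recall:.2f}")
```

This is why the per-class precision and recall in `classification_report` are worth reading alongside the headline accuracy.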
