Transforming Unstructured Feedback
This project was built to turn unstructured app review comments into structured complaint information that could actually support complaint handling. The aim was not just to predict one label from text, but to build a full workflow that starts with real public review data, prepares it for analysis, trains machine learning models on it, and then makes the result usable in an app.
- Main business idea: Building an automated triage system for customer reviews.
- Core outputs: Identifying review types, severity, urgency, and routing team.
- Final deliverable: A live machine learning tool that identifies and categorises complaints in real time.
This was an important project because it was built on actual scraped review data rather than on a dummy or synthetic text set. That made the cleaning, labelling, and modelling decisions much more realistic.
Real-World Data Acquisition
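The source data came from Google Play Store reviews for real Nigerian banking apps. I scraped review comments and related app metadata from major institutions including Opay, Alat by Wema, Union Bank, Access Bank, GT Bank, OneBank by Sterling, Providus Bank, Globus Bank, and Kuda.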
Using Google Play Store reviews was a deliberate choice. The project needed real customer language, not sample text written only for demonstration.
Importing Libraries
import warnings
warnings.filterwarnings("ignore")
import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from wordcloud import WordCloud
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.calibration import CalibratedClassifierCV
import joblib
try:
    from xgboost import XGBClassifier
    XGB_AVAILABLE = True
except ImportError:
    # xgboost is optional; continue without it if it is not installed
    XGB_AVAILABLE = False
Loading and Inspecting the Dataset
FILE_PATH = r"C:\Users\User\Documents\Complaints ml EDA\merged_output_final_with_category.csv"
df = pd.read_csv(FILE_PATH)
print("data loaded")
print("shape:", df.shape)
print("columns:")
print(df.columns.tolist())
print(df.head())

data loaded
shape: (201874, 13)
columns:
['UID', 'App_ID', 'App_Name', 'Company_Name', 'Name', 'Comments', 'Star_rating', 'is_complaint', 'complaint_severity', 'urgency', 'escalation_risk', 'complaint_category', 'route_team']
          UID  App_ID     App_Name           Company_Name              Name
0  UID0000001  APP001  UnionMobile  Union Bank of Nigeria    Okeke Echezona
1  UID0000002  APP001  UnionMobile  Union Bank of Nigeria      David Sunday
2  UID0000003  APP001  UnionMobile  Union Bank of Nigeria  Chinonso Francis
3  UID0000004  APP001  UnionMobile  Union Bank of Nigeria   INNOCENT LIGHTS
4  UID0000005  APP001  UnionMobile  Union Bank of Nigeria      Advance Stan
Manual Complaint Fields
The raw data was useful, but it was not yet enough for supervised complaint modelling. The key complaint fields had to be created and reviewed manually so the dataset could support the business problem properly. The manually assigned complaint fields were: complaint indicator, severity, urgency, escalation risk, complaint category, and route team.
FILE_PATH = r"C:\Users\User\Documents\Complaints ml EDA\merged_output_final_with_category.csv"
TEXT_COL = "Comments"
TARGET_COL = "complaint_category"
df = pd.read_csv(FILE_PATH, encoding="utf-8-sig")
df.columns = df.columns.str.strip()
cols_to_use = ["Comments", "complaint_category", "Star_rating", "is_complaint", "complaint_severity", "urgency", "escalation_risk", "route_team"]
df = df[cols_to_use].copy()
print("selected columns shape:", df.shape)
print(df.head())

selected columns shape: (201874, 8)
                                            Comments    complaint_category  Star_rating  is_complaint  complaint_severity
0  Since December 2022 this union bank app have b...    Login / Auth / OTP            1           Yes                 5.0
1  Very bad. there are telling me to update my ap...  App Stability / Perf            1           Yes                 4.0
2                                           Good app                   NaN            3            No                 NaN
3                            Can this app loan money                   NaN            5            No                 NaN
4  There's no update and I need to transfer cash ...  App Stability / Perf            1           Yes                 4.0
Cleaning Review Data
data = df.copy()
data = data.dropna(how="all").reset_index(drop=True)
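# In the run shown here, astype(str) turned missing values into the literal string "nan"
# (visible in the counts below), so fillna("") had no effect on them; the "nan" text rows
# are filtered out explicitly in the next step.
data[TEXT_COL] = data[TEXT_COL].astype(str).fillna("").str.strip()
for col in ["complaint_category", "Star_rating", "is_complaint", "complaint_severity", "urgency", "escalation_risk", "route_team"]:
    data[col] = data[col].astype(str).fillna("").str.strip()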
data = data[(data[TEXT_COL] != "") & (data[TEXT_COL].str.lower() != "nan")].copy()
print("cleaned shape:", data.shape)
print(data.head())
print("\nis_complaint counts")
print(data["is_complaint"].value_counts(dropna=False))
print("\ncomplaint_category counts")
print(data["complaint_category"].value_counts(dropna=False).head(20))

cleaned shape: (201867, 8)

is_complaint counts
is_complaint
No     158986
Yes     42881
Name: count, dtype: int64

complaint_category counts
complaint_category
nan                              158986
General Complaint                 16333
App Stability / Performance        8835
Login / Authentication / OTP       8147
Account / KYC / Profile            2941
Cards / ATM / POS                  1585
Customer Support / Dispute         1416
Loans / Credit                      899
Charges / Fees                      884
Failed Transaction / Reversal       827
Alerts / Notifications              748
Feature Request / UX                266
Name: count, dtype: int64
Binarizing Complaint Indicators
def normalize_is_complaint(x):
    x = str(x).strip().lower()
    if x in ["1", "yes", "y", "true", "complaint", "is complaint"]:
        return 1
    elif x in ["0", "no", "n", "false", "non complaint", "non-complaint", "not complaint", "not a complaint"]:
        return 0
    else:
        return np.nan
data["is_complaint_binary"] = data["is_complaint"].apply(normalize_is_complaint)
print(data["is_complaint_binary"].value_counts(dropna=False))

is_complaint_binary
0    158986
1     42881
Name: count, dtype: int64
Splitting Complaint-Only Data
For the second stage (category prediction), I isolated only the rows that were confirmed as complaints. This ensures the category model is not confused by general positive feedback.
complaint_only = data[data["is_complaint_binary"] == 1].copy()
print("complaint-only shape:", complaint_only.shape)

complaint-only shape: (42881, 10)
Text Preprocessing and Feature Setup
def clean_text(text):
    text = str(text).lower()
    text = re.sub(r"http\S+|www\S+|https\S+", " ", text)
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text
data["clean_text"] = data[TEXT_COL].apply(clean_text)
data["text_length_chars"] = data[TEXT_COL].apply(len)
data["text_length_words"] = data[TEXT_COL].apply(lambda x: len(str(x).split()))
print(data[[TEXT_COL, "clean_text", "text_length_chars", "text_length_words"]].head())

                                            Comments                                         clean_text  text_length_chars  text_length_words
0  Since December 2022 this union bank app have b...  since december 2022 this union bank app have b...                336                 70
1  Very bad. there are telling me to update my ap...  very bad there are telling me to update my app...                 86                 18
2                                           Good app                                           good app                  8                  2
3                            Can this app loan money                            can this app loan money                 23                  5
4  There's no update and I need to transfer cash ...  there s no update and i need to transfer cash ...                 75                 15
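To make the effect of each cleaning step concrete, here is a small self-contained sketch that repeats the `clean_text` function and applies it to one made-up review. The sample string is hypothetical and not taken from the dataset.

```python
import re


def clean_text(text):
    """Lowercase, strip URLs, replace non-alphanumerics with spaces, collapse whitespace."""
    text = str(text).lower()
    text = re.sub(r"http\S+|www\S+|https\S+", " ", text)
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text


# Hypothetical review text, made up for illustration only.
sample = "Can't LOGIN since update!! See https://example.com/help"
cleaned = clean_text(sample)
print(cleaned)
```

Note that apostrophes are replaced by spaces, so contractions are split into two tokens. The same effect is visible in the output above, where "There's" becomes "there s".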
Exploratory Data Analysis
print("complaint category counts")
print(data["complaint_category"].value_counts())
print("\nstar rating counts")
print(data["Star_rating"].value_counts().sort_index())
print("\ncomplaint severity counts")
print(data["complaint_severity"].value_counts())
print("\nurgency counts")
print(data["urgency"].value_counts())
print("\nescalation risk counts")
print(data["escalation_risk"].value_counts())
print("\nroute team counts")
print(data["route_team"].value_counts())
print("\nis complaint counts")
print(data["is_complaint"].value_counts())
print("\ntext length summary")
print(data[["text_length_chars", "text_length_words"]].describe())

complaint category counts
complaint_category
nan                              158986
General Complaint                 16333
App Stability / Performance        8835
Login / Authentication / OTP       8147
Account / KYC / Profile            2941
Cards / ATM / POS                  1585
Customer Support / Dispute         1416
Loans / Credit                      899
Charges / Fees                      884
Failed Transaction / Reversal       827
Alerts / Notifications              748
Feature Request / UX                266
Name: count, dtype: int64

star rating counts
Star_rating
1     30947
2      6507
3     10415
4     21998
5    132000
Name: count, dtype: int64

complaint severity counts
complaint_severity
nan    158986
4.0     25846
2.0      6111
5.0      5637
3.0      5287
Name: count, dtype: int64

urgency counts
urgency
1    159040
4     19867
5     12640
3      5179
2      5141
Name: count, dtype: int64

escalation risk counts
escalation_risk
1    158986
4     25470
2      6846
5      6104
3      4461
Name: count, dtype: int64

route team counts
route_team
nan                                158986
General Operations / Review         17232
Authentication & Account Access     11836
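# complaints_viz is not defined in the code shown; it is assumed here to be the
# complaint-only rows of data (42,881 rows, matching the outputs below)
complaints_viz = data[data["is_complaint_binary"] == 1].copy()
plt.figure(figsize=(14, 6))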
complaints_viz["complaint_category"].value_counts().plot(kind="bar")
plt.title("Complaint Category Distribution")
plt.xlabel("Complaint Category")
plt.ylabel("Count")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

Complaint Category Distribution:
General Complaint 16333 ████████████████████████████████
App Stability / Performance 8835 █████████████████
Login / Authentication / OTP 8147 ████████████████
Account / KYC / Profile 2941 █████
Cards / ATM / POS 1585 ███
Customer Support / Dispute 1416 ██
Loans / Credit 899 █
Charges / Fees 884 █
Failed Transaction / Reversal 827 █
Alerts / Notifications 748 █
Feature Request / UX 266 ▏
plt.figure(figsize=(8, 5))
data["Star_rating"].value_counts().sort_index().plot(kind="bar")
plt.title("Star Rating Distribution")
plt.xlabel("Star Rating")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

Star Rating Distribution:
1 ★ 30947 ████████
2 ★ 6507 █
3 ★ 10415 ██
4 ★ 21998 █████
5 ★ 132000 ████████████████████████████████
plt.figure(figsize=(8, 5))
complaints_viz["complaint_severity"].value_counts().plot(kind="bar")
plt.title("Complaint Severity Distribution")
plt.xlabel("Complaint Severity")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

Complaint Severity Distribution:
4.0 25846 ████████████████████████████████
2.0 6111 ███████
5.0 5637 ██████
3.0 5287 ██████
plt.figure(figsize=(8, 5))
complaints_viz["urgency"].value_counts().plot(kind="bar")
plt.title("Urgency Distribution")
plt.xlabel("Urgency")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

Urgency Distribution (complaints only):
4 19867 ████████████████████████████████
5 12640 ████████████████████
3 5179 ████████
2 5141 ████████
plt.figure(figsize=(8, 5))
complaints_viz["escalation_risk"].value_counts().plot(kind="bar")
plt.title("Escalation Risk Distribution")
plt.xlabel("Escalation Risk")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

Escalation Risk Distribution (complaints only):
4 25470 ████████████████████████████████
2 6846 ████████
5 6104 ███████
3 4461 █████
plt.figure(figsize=(12, 6))
complaints_viz["route_team"].value_counts().plot(kind="bar")
plt.title("Route Team Distribution")
plt.xlabel("Route Team")
plt.ylabel("Count")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

Route Team Distribution:
General Operations / Review 17232 ████████████████████████████████
Authentication & Account Access 11836 █████████████████████
App Stability & Performance 9101 ████████████████
Payments & Transactions 3297 █████
Customer Support & Dispute Res. 1699 ███
Product / Feature Requests 716 █
plt.figure(figsize=(12, 5))
plt.hist(complaints_viz["text_length_words"], bins=50)
plt.title("Comment Length in Words")
plt.xlabel("Number of Words")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

count: 42881 mean: 19.85 std: 17.32 min: 1.00 25%: 7.00 50%: 14.00 75%: 27.00 max: 219.00
The middle half of the complaint comments fall between 7 and 27 words (the interquartile range), with a long right tail.
plt.figure(figsize=(12, 5))
plt.hist(data["text_length_chars"], bins=50)
plt.title("Comment Length in Characters")
plt.xlabel("Number of Characters")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

count: 201867 mean: 32.04 std: 43.77 min: 1.00 25%: 8.00 50%: 17.00 75%: 38.00 max: 2074.00
Heavily right-skewed. Most comments are under 40 characters.
avg_words_by_category = data.groupby("complaint_category")["text_length_words"].mean().sort_values(ascending=False)
plt.figure(figsize=(14, 6))
avg_words_by_category.plot(kind="bar")
plt.title("Average Comment Length by Complaint Category")
plt.xlabel("Complaint Category")
plt.ylabel("Average Number of Words")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

Failed Transaction / Reversal: 42.7
Customer Support: 38.5
Cards / ATM / POS: 33.4
Login / Auth: 33.1
Account / KYC: 28.7
...
data["Star_rating_numeric"] = pd.to_numeric(data["Star_rating"], errors="coerce")
print(data["Star_rating_numeric"].describe())

count: 201867.0 mean: 4.08 std: 1.49 min: 1.0 25%: 4.0 50%: 5.0 75%: 5.0 max: 5.0
avg_rating_by_category = data.groupby("complaint_category")["Star_rating_numeric"].mean().sort_values(ascending=False)
plt.figure(figsize=(14, 6))
avg_rating_by_category.plot(kind="bar")
plt.title("Average Star Rating by Complaint Category")
plt.xlabel("Complaint Category")
plt.ylabel("Average Star Rating")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

App Stability: 1.96
Feature Request: 1.94
Loans / Credit: 1.86
...
cat_star = pd.crosstab(complaints_viz["complaint_category"], data["Star_rating"])
plt.figure(figsize=(12, 8))
plt.imshow(cat_star, aspect="auto")
plt.colorbar()
plt.xticks(range(len(cat_star.columns)), cat_star.columns)
plt.yticks(range(len(cat_star.index)), cat_star.index)
plt.title("Complaint Category vs Star Rating")
plt.tight_layout()
plt.show()

General Complaint: 8476 (1*), 2640 (2*), ...
App Stability: 4736 (1*), 1198 (2*), ...
...
cat_severity = pd.crosstab(complaints_viz["complaint_category"], data["complaint_severity"])
plt.figure(figsize=(12, 8))
plt.imshow(cat_severity, aspect="auto")
plt.colorbar()
plt.xticks(range(len(cat_severity.columns)), cat_severity.columns, rotation=45)
plt.yticks(range(len(cat_severity.index)), cat_severity.index)
plt.title("Complaint Category vs Complaint Severity")
plt.tight_layout()
plt.show()

General Complaint: 3182 (2.0), 2644 (3.0), ...
App Stability: 1253 (2.0), 854 (3.0), ...
...
cat_risk = pd.crosstab(complaints_viz["complaint_category"], data["escalation_risk"])
plt.figure(figsize=(12, 8))
plt.imshow(cat_risk, aspect="auto")
plt.colorbar()
plt.xticks(range(len(cat_risk.columns)), cat_risk.columns, rotation=45)
plt.yticks(range(len(cat_risk.index)), cat_risk.index)
plt.title("Complaint Category vs Escalation Risk")
plt.tight_layout()
plt.show()

General Complaint: 3527 (2), 2077 (3), ...
App Stability: 1364 (2), 758 (3), ...
...
urgency_by_category = pd.crosstab(complaints_viz["complaint_category"], data["urgency"])
urgency_by_category.plot(kind="bar", stacked=True, figsize=(14, 6))
plt.title("Urgency by Category")
plt.xlabel("Complaint Category")
plt.ylabel("Count")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

General Complaint: 2537 (Urg-2), 2200 (Urg-3), ...
App Stability: 1063 (Urg-2), 764 (Urg-3), ...
...
severity_by_category = pd.crosstab(complaints_viz["complaint_category"], data["complaint_severity"])
severity_by_category.plot(kind="bar", stacked=True, figsize=(14, 6))
plt.title("Complaint Severity by Category")
plt.xlabel("Complaint Category")
plt.ylabel("Count")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

General Complaint: 3182 (Sev-2), 2644 (Sev-3), ...
App Stability: 1253 (Sev-2), 854 (Sev-3), ...
...
risk_by_category = pd.crosstab(complaints_viz["complaint_category"], data["escalation_risk"])
risk_by_category.plot(kind="bar", stacked=True, figsize=(14, 6))
plt.title("Escalation Risk by Category")
plt.xlabel("Complaint Category")
plt.ylabel("Count")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

General Complaint: 3527 (Risk-2), 2077 (Risk-3), ...
App Stability: 1364 (Risk-2), 758 (Risk-3), ...
...
route_summary = data.groupby("route_team").agg({"text_length_words": "mean", "Star_rating_numeric": "mean"}).sort_values("text_length_words", ascending=False)
print(route_summary)

Payments: 42.16 words, 1.86 rating
Support: 38.85 words, 1.69 rating
Auth: 33.09 words, 1.66 rating
...
plt.figure(figsize=(12, 6))
route_summary["text_length_words"].plot(kind="bar")
plt.title("Average Comment Length by Route Team")
plt.xlabel("Route Team")
plt.ylabel("Average Number of Words")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

Payments: 42.2
Support: 38.9
Auth: 33.1
...
plt.figure(figsize=(12, 6))
route_summary["Star_rating_numeric"].plot(kind="bar")
plt.title("Average Star Rating by Route Team")
plt.xlabel("Route Team")
plt.ylabel("Average Star Rating")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

App Stability: 1.96
Product: 1.94
Payments: 1.86
...
all_text = " ".join(data["clean_text"].dropna().astype(str))
word_counts = Counter(all_text.split())
common_words_df = pd.DataFrame(word_counts.most_common(20), columns=["word", "count"])
print(common_words_df)
plt.figure(figsize=(12, 6))
plt.bar(common_words_df["word"], common_words_df["count"])
plt.title("Top 20 Most Common Words")
plt.xlabel("Word")
plt.ylabel("Count")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

word: i (79009), the (68920), app (68388), ...
Insights from Analysis
From exploring the complaint fields and ratings, several things became clear:
- Ratings vs. Complaints: While 1-star reviews are often complaints, many are not. Similarly, some 5-star reviews contain genuine complaints. This confirmed that star rating alone is not a reliable proxy for complaint status.
- Category Dominance: General complaints and stability/performance issues account for the majority of the negative feedback.
- Text Length: Complaint reviews tend to be longer on average than non-complaint reviews, as users take more time to explain their frustration.
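The rating-versus-complaint point above can be checked by computing the share of complaints at each star rating. The sketch below does this in plain Python on a handful of made-up (rating, is_complaint) pairs; the rows are hypothetical and only illustrate the calculation, not the project's actual figures.

```python
from collections import defaultdict

# Hypothetical (star_rating, is_complaint) pairs, made up for illustration only.
reviews = [
    (1, True), (1, True), (1, False),
    (3, True), (3, False),
    (5, False), (5, False), (5, False), (5, True),
]


def complaint_share_by_rating(pairs):
    """Return {rating: fraction of reviews at that rating flagged as complaints}."""
    totals = defaultdict(int)
    complaints = defaultdict(int)
    for rating, is_complaint in pairs:
        totals[rating] += 1
        if is_complaint:
            complaints[rating] += 1
    return {rating: complaints[rating] / totals[rating] for rating in sorted(totals)}


shares = complaint_share_by_rating(reviews)
for rating, share in shares.items():
    print(f"{rating} star: {share:.0%} complaints")
```

On the real dataframe, a row-normalised `pd.crosstab(data["Star_rating"], data["is_complaint"], normalize="index")` should give the same kind of table. If any rating shows a share well away from 0% or 100%, the rating cannot stand in for the complaint label.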
The Problem with Direct Category Prediction
The first modelling attempt focused directly on complaint category prediction. The model could return a complaint category, but it had no way to decide whether a review was a complaint in the first place. That meant non-complaint reviews could still be pushed into complaint classes simply because the workflow assumed every incoming comment belonged there.
The fix was to step back and change both the data flow and the modelling logic. First, the system needed to answer: is this a complaint or not? Only after that could it answer: if it is a complaint, what category does it belong to?
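The staged logic described above can be sketched as a small function that only calls the category model once the detector has flagged a complaint. This is a minimal illustration rather than the project's actual app code: the function name `triage`, the returned keys, and the two keyword-based stand-in models are all assumptions, used so the example runs without trained pipelines. Any object with a scikit-learn style `predict` method could be passed in instead.

```python
def triage(comment, detector, categoriser):
    """Run complaint detection first; categorise only confirmed complaints."""
    is_complaint = int(detector.predict([comment])[0]) == 1
    if not is_complaint:
        # Non-complaints never reach the category model.
        return {"is_complaint": False, "complaint_category": None}
    category = categoriser.predict([comment])[0]
    return {"is_complaint": True, "complaint_category": category}


# Hypothetical stand-in models; real use would pass the trained pipelines.
class KeywordDetector:
    def predict(self, texts):
        return [1 if ("not" in t.lower() or "fail" in t.lower()) else 0 for t in texts]


class KeywordCategoriser:
    def predict(self, texts):
        return [
            "Login / Authentication / OTP" if "otp" in t.lower() else "General Complaint"
            for t in texts
        ]


detector, categoriser = KeywordDetector(), KeywordCategoriser()
praise = triage("Good app", detector, categoriser)
complaint = triage("OTP is not coming", detector, categoriser)
print(praise)
print(complaint)
```

Keeping the gate in one function means the category model is trained and evaluated only on complaints, while the app can still accept any incoming comment.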
Stage 1: Complaint Detection (Binary Classification)
print(data["is_complaint"].value_counts(dropna=False))
print(data["is_complaint"].unique())

is_complaint
No     158986
Yes     42881
Name: count, dtype: int64
['Yes' 'No']
data_binary = data.dropna(subset=["is_complaint_binary"]).copy()
data_binary["is_complaint_binary"] = data_binary["is_complaint_binary"].astype(int)
print("binary dataset shape:", data_binary.shape)
print(data_binary["is_complaint_binary"].value_counts())
X_bin = data_binary["clean_text"]
y_bin = data_binary["is_complaint_binary"]
print("X_bin rows:", len(X_bin))
print("y_bin rows:", len(y_bin))
X_train_bin, X_test_bin, y_train_bin, y_test_bin = train_test_split(
    X_bin, y_bin, test_size=0.20, random_state=42, stratify=y_bin
)
print("binary train size:", len(X_train_bin))
print("binary test size:", len(X_test_bin))

binary dataset shape: (201867, 13)
is_complaint_binary
0    158986
1     42881
Name: count, dtype: int64
X_bin rows: 201867
y_bin rows: 201867
binary train size: 161493
binary test size: 40374
binary_models = {
    "LogisticRegression": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=20000, ngram_range=(1, 2),
                                  stop_words="english", min_df=2)),
        ("clf", LogisticRegression(max_iter=3000, class_weight="balanced"))
    ]),
    "LinearSVC": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=20000, ngram_range=(1, 2),
                                  stop_words="english", min_df=2)),
        ("clf", LinearSVC(class_weight="balanced"))
    ]),
    "MultinomialNB": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=20000, ngram_range=(1, 2),
                                  stop_words="english", min_df=2)),
        ("clf", MultinomialNB())
    ])
}
binary_results = []
trained_binary_models = {}
for name, model in binary_models.items():
    print(f"\ntraining binary model: {name}")
    model.fit(X_train_bin, y_train_bin)
    preds = model.predict(X_test_bin)
    acc = accuracy_score(y_test_bin, preds)
    f1 = f1_score(y_test_bin, preds, average="weighted")
    print("accuracy:", acc)
    print("weighted f1:", f1)
    print(classification_report(y_test_bin, preds))
    binary_results.append({"model": name, "accuracy": acc, "weighted_f1": f1})
    trained_binary_models[name] = model
binary_results_df = pd.DataFrame(binary_results).sort_values(by="weighted_f1", ascending=False)
print(binary_results_df)

training binary model: LogisticRegression
accuracy: 0.9362708673899044
weighted f1: 0.9377382995897665
precision recall f1-score support
0 0.98 0.94 0.96 31798
1 0.81 0.92 0.86 8576
accuracy 0.94 40374
macro avg 0.89 0.93 0.91 40374
weighted avg 0.94 0.94 0.94 40374
training binary model: LinearSVC
accuracy: 0.9335463417050577
weighted f1: 0.9347657844828152
precision recall f1-score support
0 0.97 0.94 0.96 31798
1 0.81 0.90 0.85 8576
accuracy 0.93 40374
macro avg 0.89 0.92 0.90 40374
weighted avg 0.94 0.93 0.93 40374
training binary model: MultinomialNB
accuracy: 0.9401595085946401
weighted f1: 0.9402330314373568
precision recall f1-score support
0 0.96 0.96 0.96 31798
1 0.86 0.86 0.86 8576
accuracy 0.94 40374
macro avg 0.91 0.91 0.91 40374
weighted avg 0.94 0.94 0.94 40374
model accuracy weighted_f1
2 MultinomialNB 0.940160 0.940233
0 LogisticRegression 0.936271 0.937738
1 LinearSVC 0.933546 0.934766

plt.figure(figsize=(8, 5))
plt.bar(binary_results_df["model"], binary_results_df["weighted_f1"])
plt.title("Complaint Detection Model Comparison")
plt.xlabel("Model")
plt.ylabel("Weighted F1")
plt.tight_layout()
plt.show()

Complaint Detection - Model Comparison (Weighted F1):
MultinomialNB 0.9402 ████████████████████████████████
LogisticRegression 0.9378 ███████████████████████████████
LinearSVC 0.9348 ██████████████████████████████

best_binary_model_name = binary_results_df.iloc[0]["model"]
best_binary_model = trained_binary_models[best_binary_model_name]
print("best binary model:", best_binary_model_name)

best binary model: MultinomialNB
bin_preds = best_binary_model.predict(X_test_bin)
cm_bin = confusion_matrix(y_test_bin, bin_preds)
plt.figure(figsize=(6, 5))
plt.imshow(cm_bin, aspect="auto")
plt.colorbar()
plt.xticks([0, 1], ["Non-Complaint", "Complaint"])
plt.yticks([0, 1], ["Non-Complaint", "Complaint"])
plt.title(f"Confusion Matrix - {best_binary_model_name}")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.tight_layout()
plt.show()

Confusion Matrix - MultinomialNB (Complaint Detection):
                      Predicted
                      Non-Complaint  Complaint
Actual Non-Complaint  30555          1243
Actual Complaint      1149           7427
True Negative Rate: 96.1%
True Positive Rate: 86.6%
Overall Accuracy: 94.0%
Stage 2: Complaint Category Prediction
complaint_only = data_binary[data_binary["is_complaint_binary"] == 1].copy()
complaint_only = complaint_only[
    (complaint_only["complaint_category"].notna()) &
    (complaint_only["complaint_category"].astype(str).str.strip() != "") &
    (complaint_only["complaint_category"].astype(str).str.lower() != "nan")
].copy()
print("complaint-only shape:", complaint_only.shape)
print(complaint_only["complaint_category"].value_counts())
X_cat = complaint_only["clean_text"]
y_cat = complaint_only["complaint_category"]
print("X_cat rows:", len(X_cat))
print("y_cat rows:", len(y_cat))
X_train_cat, X_test_cat, y_train_cat, y_test_cat = train_test_split(
    X_cat, y_cat, test_size=0.20, random_state=42, stratify=y_cat
)
print("category train size:", len(X_train_cat))
print("category test size:", len(X_test_cat))

complaint-only shape: (42881, 13)
complaint_category
General Complaint                16333
App Stability / Performance       8835
Login / Authentication / OTP      8147
Account / KYC / Profile           2941
Cards / ATM / POS                 1585
Customer Support / Dispute        1416
Loans / Credit                     899
Charges / Fees                     884
Failed Transaction / Reversal      827
Alerts / Notifications             748
Feature Request / UX               266
Name: count, dtype: int64
X_cat rows: 42881
y_cat rows: 42881
category train size: 34304
category test size: 8577
category_models = {
    "LogisticRegression": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=20000, ngram_range=(1, 2),
                                  stop_words="english", min_df=2)),
        ("clf", LogisticRegression(max_iter=3000))
    ]),
    "LinearSVC": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=20000, ngram_range=(1, 2),
                                  stop_words="english", min_df=2)),
        ("clf", LinearSVC())
    ]),
    "MultinomialNB": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=20000, ngram_range=(1, 2),
                                  stop_words="english", min_df=2)),
        ("clf", MultinomialNB())
    ])
}
category_results = []
trained_category_models = {}
for name, model in category_models.items():
    print(f"\ntraining category model: {name}")
    model.fit(X_train_cat, y_train_cat)
    preds = model.predict(X_test_cat)
    acc = accuracy_score(y_test_cat, preds)
    f1 = f1_score(y_test_cat, preds, average="weighted")
    print("accuracy:", acc)
    print("weighted f1:", f1)
    print(classification_report(y_test_cat, preds))
    category_results.append({"model": name, "accuracy": acc, "weighted_f1": f1})
    trained_category_models[name] = model
category_results_df = pd.DataFrame(category_results).sort_values(by="weighted_f1", ascending=False)
print(category_results_df)

training category model: LogisticRegression
accuracy: 0.8679025300221522
weighted f1: 0.860208839903338
precision recall f1-score support
Account / KYC / Profile 0.93 0.65 0.76 588
Alerts / Notifications 0.83 0.48 0.61 150
App Stability / Performance 0.86 0.89 0.87 1767
Cards / ATM / POS 0.94 0.80 0.87 317
Charges / Fees 0.95 0.56 0.70 177
Customer Support / Dispute 0.79 0.76 0.77 283
Failed Transaction / Reversal 0.82 0.44 0.57 165
Feature Request / UX 1.00 0.08 0.14 53
General Complaint 0.85 0.98 0.91 3267
Loans / Credit 0.82 0.64 0.72 180
Login / Authentication / OTP 0.92 0.89 0.91 1630
accuracy 0.87 8577
macro avg 0.88 0.65 0.71 8577
weighted avg 0.87 0.87 0.86 8577
training category model: LinearSVC
accuracy: 0.9167541098286114
weighted f1: 0.9146686268058456
precision recall f1-score support
Account / KYC / Profile 0.94 0.79 0.86 588
Alerts / Notifications 0.87 0.78 0.82 150
App Stability / Performance 0.89 0.93 0.91 1767
Cards / ATM / POS 0.97 0.91 0.94 317
Charges / Fees 0.96 0.86 0.91 177
Customer Support / Dispute 0.85 0.85 0.85 283
Failed Transaction / Reversal 0.84 0.56 0.67 165
Feature Request / UX 0.81 0.40 0.53 53
General Complaint 0.92 0.98 0.95 3267
Loans / Credit 0.87 0.87 0.87 180
Login / Authentication / OTP 0.94 0.92 0.93 1630
accuracy 0.92 8577
macro avg 0.90 0.80 0.84 8577
weighted avg 0.92 0.92 0.91 8577
model accuracy weighted_f1
1 LinearSVC 0.916754 0.914669
0 LogisticRegression 0.867903 0.860209

plt.figure(figsize=(8, 5))
plt.bar(category_results_df["model"], category_results_df["weighted_f1"])
plt.title("Complaint Category Model Comparison")
plt.xlabel("Model")
plt.ylabel("Weighted F1")
plt.tight_layout()
plt.show()

Category Prediction - Model Comparison (Weighted F1):
LinearSVC 0.9147 ████████████████████████████████
LogisticRegression 0.8602 █████████████████████████████
MultinomialNB 0.6532 ██████████████████████

best_category_model_name = category_results_df.iloc[0]["model"]
best_category_model = trained_category_models[best_category_model_name]
print("best category model:", best_category_model_name)

best category model: LinearSVC
cat_preds = best_category_model.predict(X_test_cat)
labels_sorted_cat = sorted(y_cat.unique())
cm_cat = confusion_matrix(y_test_cat, cat_preds, labels=labels_sorted_cat)
plt.figure(figsize=(12, 10))
plt.imshow(cm_cat, aspect="auto")
plt.colorbar()
plt.xticks(range(len(labels_sorted_cat)), labels_sorted_cat, rotation=45, ha="right")
plt.yticks(range(len(labels_sorted_cat)), labels_sorted_cat)
plt.title(f"Confusion Matrix - {best_category_model_name}")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.tight_layout()
plt.show()

Confusion Matrix - LinearSVC (Category Prediction):
Precision Recall F1 Support
Account / KYC / Profile 0.94 0.79 0.86 588
Alerts / Notifications 0.87 0.78 0.82 150
App Stability / Performance 0.89 0.93 0.91 1767
Cards / ATM / POS 0.97 0.91 0.94 317
Charges / Fees 0.96 0.86 0.91 177
Customer Support / Dispute 0.85 0.85 0.85 283
Failed Transaction / Reversal 0.84 0.56 0.67 165
Feature Request / UX 0.81 0.40 0.53 53
General Complaint 0.92 0.98 0.95 3267
Loans / Credit 0.87 0.87 0.87 180
Login / Authentication / OTP 0.94 0.92 0.93 1630
Overall Accuracy: 91.7%
Top Predictive Words by Category
if best_category_model_name == "LinearSVC":
    tfidf = best_category_model.named_steps["tfidf"]
    clf = best_category_model.named_steps["clf"]
    feature_names = np.array(tfidf.get_feature_names_out())
    classes = clf.classes_
    for i, class_label in enumerate(classes):
        # indices of the 10 largest coefficients for this class, smallest of the ten first
        top10 = np.argsort(clf.coef_[i])[-10:]
        print(f"\ntop words for class: {class_label}")
        print(feature_names[top10])

top words for class: Account / KYC / Profile
['functioning' 'uninstalling' 'profile' 'complaining' 'uninstalled' 'bvn' 'morning' 'uninstall' 'happening' 'opening']
top words for class: Alerts / Notifications
['debit alert' 'debit alerts' 'send email' 'ticket' 'sms' 'notifications' 'alerts' 'notification' 'email' 'alert']
top words for class: App Stability / Performance
['upgraded' 'crashing' 'change' 'updated' 'download' 'error' 'slow' 'update' 'upgrade' 'network']
top words for class: Cards / ATM / POS
['sim card' 'advisable' 'dollar card' 'id card' 'debit card' 'virtual cards' 'mastercard' 'pos' 'atm' 'card']
top words for class: Charges / Fees
['transfer recharge' 'school fees' 'school' 'recharged' 'charged' 'fees' 'fee' 'recharge' 'charge' 'charges']
top words for class: Customer Support / Dispute
['customer service' 'customer care' 'dispute' 'complaint' 'support']
Final Workflow Summary
The project progressed from data collection to a structured pipeline that can be reproduced and updated as new reviews arrive.
- Scraped real Google Play Store reviews from Opay, Alat by Wema, Union Bank, Access Bank, GT Bank, OneBank by Sterling, Providus Bank, Globus Bank, and Kuda.
- Reviewed and structured the dataset around the actual complaint task instead of using a dummy text set.
- Manually added the complaint indicator, severity, urgency, escalation risk, complaint category, and route team fields.
- Cleaned the review text and prepared the data for exploration and modelling.
- Explored complaint patterns, rating patterns, and class balance before model training.
- Tested an initial complaint-category model and identified the structural problem with category-only prediction.
- Went back to redesign the workflow into a staged pipeline with complaint detection first and complaint categorisation next.
- Compared text classifiers, selected the stronger models, and saved the final pipeline.
- Built a Streamlit app that generates the complaint outputs from a single pasted comment.
Results and Accuracy Discussion
| Stage | Best Model | Accuracy | Weighted F1 |
|---|---|---|---|
| Complaint Detection | MultinomialNB | 94.0% | 0.940 |
| Category Prediction | LinearSVC | 91.7% | 0.915 |
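The headline figures in the table are weighted by class size, which hides how the smaller categories behave. As a rough check, the sketch below recomputes the macro and weighted recall for LinearSVC from the per-class recall and support values printed in its classification report. Because the inputs are already rounded to two decimals, the results only agree with the report to about two decimals.

```python
# Per-class (recall, support) for LinearSVC, copied from the classification report.
report = {
    "Account / KYC / Profile": (0.79, 588),
    "Alerts / Notifications": (0.78, 150),
    "App Stability / Performance": (0.93, 1767),
    "Cards / ATM / POS": (0.91, 317),
    "Charges / Fees": (0.86, 177),
    "Customer Support / Dispute": (0.85, 283),
    "Failed Transaction / Reversal": (0.56, 165),
    "Feature Request / UX": (0.40, 53),
    "General Complaint": (0.98, 3267),
    "Loans / Credit": (0.87, 180),
    "Login / Authentication / OTP": (0.92, 1630),
}

total_support = sum(support for _, support in report.values())

# Macro average: every class counts equally, regardless of size.
macro_recall = sum(recall for recall, _ in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
weighted_recall = sum(recall * support for recall, support in report.values()) / total_support

print(f"macro recall:    {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

The macro figure is noticeably lower than the weighted one because the two weakest classes, "Feature Request / UX" and "Failed Transaction / Reversal", together make up only 218 of the 8,577 test rows.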
Class Imbalance & Accuracy Discussion
The models performed well overall but were noticeably weaker on the smaller classes. The main reason is class imbalance. The dataset contained far more non-complaint comments (158,986) than complaint comments (42,881). Within complaints, some categories like “General Complaint” (16,333) dominated, while others like “Feature Request / UX” (266) and “Failed Transaction / Reversal” (827) had far fewer examples. This shows in the LinearSVC report, where recall falls to 0.40 for “Feature Request / UX” and 0.56 for “Failed Transaction / Reversal”, against 0.98 for “General Complaint”.
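One way to put the complaint-detection accuracy in context is to compare it with a trivial baseline that labels every review as a non-complaint. The sketch below computes that baseline from the class counts quoted above. It uses the full dataset counts rather than the test split, which should be very close because the split was stratified.

```python
non_complaints = 158_986
complaints = 42_881
total = non_complaints + complaints

# Accuracy of a trivial model that predicts "non-complaint" for every review.
majority_baseline = non_complaints / total
complaint_share = complaints / total

print(f"total reviews: {total}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"complaint share: {complaint_share:.3f}")
```

A model has to beat roughly 79% accuracy before it adds anything, which is why the per-class precision and recall for the complaint class are more informative than accuracy alone.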
What would improve the results further:
- More complaint data: Collecting a larger volume of genuine complaint reviews, especially for the underrepresented categories, would give the model more training signal where it currently struggles most.
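- Targeted rebalancing: Oversampling techniques like SMOTE, or class weights tuned for the minority complaint categories, could help balance the learning process. The category models above were trained without class weights, unlike the LogisticRegression and LinearSVC complaint detectors, which used class_weight="balanced".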
- Broader scraping: Expanding the scraping to more banking apps and longer time windows would increase the natural variety of complaint language available to the model.
- Transformer-based models: Moving from TF-IDF with classical classifiers to fine-tuned transformer models like BERT could capture deeper contextual patterns in the review text.
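To give a sense of what class weighting would do for the category model, the sketch below applies the "balanced" heuristic that scikit-learn documents for `class_weight="balanced"`, namely n_samples / (n_classes * class_count), to the category counts reported earlier. It uses the full complaint counts rather than the training split, so the numbers are indicative only, and it does not retrain anything.

```python
# Complaint category counts as reported in the exploratory analysis.
category_counts = {
    "General Complaint": 16333,
    "App Stability / Performance": 8835,
    "Login / Authentication / OTP": 8147,
    "Account / KYC / Profile": 2941,
    "Cards / ATM / POS": 1585,
    "Customer Support / Dispute": 1416,
    "Loans / Credit": 899,
    "Charges / Fees": 884,
    "Failed Transaction / Reversal": 827,
    "Alerts / Notifications": 748,
    "Feature Request / UX": 266,
}

n_samples = sum(category_counts.values())
n_classes = len(category_counts)

# "Balanced" weight per class: n_samples / (n_classes * class_count).
weights = {
    label: n_samples / (n_classes * count)
    for label, count in category_counts.items()
}

for label, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{label:32s} {weight:7.3f}")
```

Under this heuristic an error on a "Feature Request / UX" example would cost roughly sixty times as much as an error on a "General Complaint" example, which is the kind of pressure the minority classes currently lack.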