01. Objective

Transforming Unstructured Feedback

This project was built to turn unstructured app review comments into structured complaint information that could actually support complaint handling. The aim was not just to predict one label from text, but to build a full workflow that starts with real public review data, prepares it for analysis, trains machine learning models on it, and then makes the result usable in an app.

  • Main business idea: Building an automated triage system for customer reviews.
  • Core outputs: Identifying review types, severity, urgency, and routing team.
  • Final deliverable: A live machine learning tool that identifies and categorises complaints in real time.

This was an important project because it was built on actual scraped review data rather than on a dummy or synthetic text set. That made the cleaning, labelling, and modelling decisions much more realistic.

[ STRATEGY ]
Business Vision

Identifying review types, severity, and urgency for rapid triage.

[ DATA ]
Core Outputs

Severity, Urgency, Escalation Risk, and Routing.

[ DELIVERY ]
Final Result

Streamlit app for real-time automated triage.

Dataset

Real-World Data Acquisition

The source data came from Google Play Store reviews for real Nigerian banking apps. I scraped review comments and related app metadata from major institutions including:

Opay
Alat by Wema
Union Bank
Access Bank
GT Bank
OneBank
Providus Bank
Globus Bank
Kuda

Using Google Play Store reviews was a deliberate choice. The project needed real customer language, not sample text written only for demonstration.

Setup

Importing Libraries

[ CODE ]imports.py
import warnings
warnings.filterwarnings("ignore")
import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from wordcloud import WordCloud
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.calibration import CalibratedClassifierCV
import joblib

# xgboost is optional: fall back cleanly if it is not installed.
try:
    from xgboost import XGBClassifier
    XGB_AVAILABLE = True
except ImportError:
    XGB_AVAILABLE = False
ETL

Loading and Inspecting the Dataset

[ CODE ]load_data.py
FILE_PATH = r"C:\Users\User\Documents\Complaints ml EDA\merged_output_final_with_category.csv"
df = pd.read_csv(FILE_PATH)
print("data loaded")
print("shape:", df.shape)
print("columns:")
print(df.columns.tolist())
print(df.head())
[ OUTPUT ]Terminal
data loaded
shape: (201874, 13)
columns:
['UID', 'App_ID', 'App_Name', 'Company_Name', 'Name', 'Comments', 'Star_rating', 'is_complaint', 'complaint_severity', 'urgency', 'escalation_risk', 'complaint_category', 'route_team']

  UID        App_ID  App_Name     Company_Name          Name
0 UID0000001 APP001  UnionMobile  Union Bank of Nigeria Okeke Echezona
1 UID0000002 APP001  UnionMobile  Union Bank of Nigeria David Sunday
2 UID0000003 APP001  UnionMobile  Union Bank of Nigeria Chinonso Francis
3 UID0000004 APP001  UnionMobile  Union Bank of Nigeria INNOCENT LIGHTS
4 UID0000005 APP001  UnionMobile  Union Bank of Nigeria Advance Stan

Structure

Manual Complaint Fields

The raw data was useful, but it was not yet enough for supervised complaint modelling. The key complaint fields had to be created and reviewed manually so the dataset could support the business problem properly. The manually assigned complaint fields were: complaint indicator, severity, urgency, escalation risk, complaint category, and route team.

[ CODE ]select_columns.py
FILE_PATH = r"C:\Users\User\Documents\Complaints ml EDA\merged_output_final_with_category.csv"
TEXT_COL = "Comments"
TARGET_COL = "complaint_category"

df = pd.read_csv(FILE_PATH, encoding="utf-8-sig")
df.columns = df.columns.str.strip()

cols_to_use = ["Comments", "complaint_category", "Star_rating", "is_complaint", "complaint_severity", "urgency", "escalation_risk", "route_team"]
df = df[cols_to_use].copy()
print("selected columns shape:", df.shape)
print(df.head())
[ OUTPUT ]Terminal
selected columns shape: (201874, 8)
  Comments                                          complaint_category   Star_rating is_complaint complaint_severity
0 Since December 2022 this union bank app have b... Login / Auth / OTP   1           Yes          5.0
1 Very bad. there are telling me to update my ap... App Stability / Perf 1           Yes          4.0
2 Good app                                          NaN                  3           No           NaN
3 Can this app loan money                           NaN                  5           No           NaN
4 There's no update and I need to transfer cash ... App Stability / Perf 1           Yes          4.0
Process

Cleaning Review Data

[ CODE ]clean_data.py
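data = df.copy()
data = data.dropna(how="all").reset_index(drop=True)
# Fill missing comments before casting: astype(str) would otherwise turn NaN into the string "nan".
data[TEXT_COL] = data[TEXT_COL].fillna("").astype(str).str.strip()

# Label columns are cast to strings as-is, so missing labels show up as the literal "nan" in the counts below.
for col in ["complaint_category", "Star_rating", "is_complaint", "complaint_severity", "urgency", "escalation_risk", "route_team"]:
    data[col] = data[col].astype(str).str.strip()

# Drop empty comments, plus any comment that is literally "nan".
data = data[(data[TEXT_COL] != "") & (data[TEXT_COL].str.lower() != "nan")].copy()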
print("cleaned shape:", data.shape)
print(data.head())
print("\nis_complaint counts")
print(data["is_complaint"].value_counts(dropna=False))
print("\ncomplaint_category counts")
print(data["complaint_category"].value_counts(dropna=False).head(20))
[ OUTPUT ]Terminal
cleaned shape: (201867, 8)

is_complaint counts
is_complaint
No     158986
Yes     42881
Name: count, dtype: int64

complaint_category counts
complaint_category
nan                           158986
General Complaint              16333
App Stability / Performance     8835
Login / Authentication / OTP    8147
Account / KYC / Profile         2941
Cards / ATM / POS               1585
Customer Support / Dispute      1416
Loans / Credit                   899
Charges / Fees                   884
Failed Transaction / Reversal    827
Alerts / Notifications           748
Feature Request / UX             266
Name: count, dtype: int64
Normalize

Binarizing Complaint Indicators

[ CODE ]normalize_complaint.py
def normalize_is_complaint(x):
    x = str(x).strip().lower()
    if x in ["1", "yes", "y", "true", "complaint", "is complaint"]:
        return 1
    elif x in ["0", "no", "n", "false", "non complaint", "non-complaint", "not complaint", "not a complaint"]:
        return 0
    else:
        return np.nan

data["is_complaint_binary"] = data["is_complaint"].apply(normalize_is_complaint)
print(data["is_complaint_binary"].value_counts(dropna=False))
[ OUTPUT ]Terminal
is_complaint_binary
0    158986
1     42881
Name: count, dtype: int64
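
Any raw value outside the two accepted lists is mapped to NaN and later dropped from the binary dataset. A minimal sketch of an audit step that counts such values before they disappear; the helper name `audit_unmapped` and the sample values are hypothetical, and `math.nan` stands in for `np.nan` so the sketch has no dependencies.

```python
import math
from collections import Counter

POSITIVE = {"1", "yes", "y", "true", "complaint", "is complaint"}
NEGATIVE = {"0", "no", "n", "false", "non complaint", "non-complaint",
            "not complaint", "not a complaint"}

def normalize_is_complaint(x):
    """Same mapping as above, returning math.nan for unknown values."""
    x = str(x).strip().lower()
    if x in POSITIVE:
        return 1
    if x in NEGATIVE:
        return 0
    return math.nan

def audit_unmapped(values):
    """Count raw labels that normalize to NaN (hypothetical helper)."""
    unmapped = Counter()
    for v in values:
        result = normalize_is_complaint(v)
        if isinstance(result, float) and math.isnan(result):
            unmapped[str(v).strip().lower()] += 1
    return unmapped

# Illustrative values only, not taken from the dataset.
sample = ["Yes", " no ", "TRUE", "maybe", "N/A", "maybe"]
print(audit_unmapped(sample))
```

In this dataset every value mapped cleanly (the output above shows only 0 and 1), so the audit would return an empty counter.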

Splitting Complaint-Only Data

For the second stage (category prediction), I isolated only the rows that were confirmed as complaints. This ensures the category model is not confused by general positive feedback.

[ CODE ]split_complaint_only.py
complaint_only = data[data["is_complaint_binary"] == 1].copy()
print("complaint-only shape:", complaint_only.shape)
[ OUTPUT ]Terminal
complaint-only shape: (42881, 10)
Prep

Text Preprocessing and Feature Setup

[ CODE ]preprocess_text.py
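def clean_text(text):
    text = str(text).lower()
    # Remove URLs.
    text = re.sub(r"http\S+|www\S+|https\S+", " ", text)
    # Replace everything except ASCII letters, digits and whitespace with a space.
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    # Collapse repeated whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    return text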

data["clean_text"] = data[TEXT_COL].apply(clean_text)
data["text_length_chars"] = data[TEXT_COL].apply(len)
data["text_length_words"] = data[TEXT_COL].apply(lambda x: len(str(x).split()))

print(data[[TEXT_COL, "clean_text", "text_length_chars", "text_length_words"]].head())
[ OUTPUT ]Terminal
Comments                                          clean_text                                         text_length_chars text_length_words
0 Since December 2022 this union bank app have b... since december 2022 this union bank app have b...  336               70
1 Very bad. there are telling me to update my ap... very bad there are telling me to update my app...   86               18
2 Good app                                          good app                                            8                2
3 Can this app loan money                           can this app loan money                             23                5
4 There's no update and I need to transfer cash ... there s no update and i need to transfer cash ...   75               15
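
To make the effect of `clean_text` concrete, here is a small sketch that runs the same function on a few made-up strings (the examples are illustrative, not rows from the dataset). It shows three side effects worth knowing about: URLs disappear, apostrophes split contractions, and currency symbols and emoji are dropped.

```python
import re

def clean_text(text):
    text = str(text).lower()
    text = re.sub(r"http\S+|www\S+|https\S+", " ", text)
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Illustrative inputs only.
examples = [
    "There's no update!! Visit https://example.com NOW",
    "₦5,000 debited twice 😡",
    "OTP didn't arrive",
]
for raw in examples:
    print(repr(raw), "->", repr(clean_text(raw)))
```

The stray single letters left behind by contractions ("s", "t") are not a problem downstream if the vectorizer keeps its default token pattern, which only accepts tokens of two or more characters. Amounts such as "5,000" are split into "5 000", which is a trade-off of this simple cleaning rule.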
Insights

Exploratory Data Analysis

[ CODE ]summary_counts.py
print("complaint category counts")
print(data["complaint_category"].value_counts())
print("\nstar rating counts")
print(data["Star_rating"].value_counts().sort_index())
print("\ncomplaint severity counts")
print(data["complaint_severity"].value_counts())
print("\nurgency counts")
print(data["urgency"].value_counts())
print("\nescalation risk counts")
print(data["escalation_risk"].value_counts())
print("\nroute team counts")
print(data["route_team"].value_counts())
print("\nis complaint counts")
print(data["is_complaint"].value_counts())
print("\ntext length summary")
print(data[["text_length_chars", "text_length_words"]].describe())
[ OUTPUT ]Terminal
complaint category counts
complaint_category
nan                           158986
General Complaint              16333
App Stability / Performance     8835
Login / Authentication / OTP    8147
Account / KYC / Profile         2941
Cards / ATM / POS               1585
Customer Support / Dispute      1416
Loans / Credit                   899
Charges / Fees                   884
Failed Transaction / Reversal    827
Alerts / Notifications           748
Feature Request / UX             266
Name: count, dtype: int64

star rating counts
Star_rating
1     30947
2      6507
3     10415
4     21998
5    132000
Name: count, dtype: int64

complaint severity counts
complaint_severity
nan    158986
4.0     25846
2.0      6111
5.0      5637
3.0      5287
Name: count, dtype: int64

urgency counts
urgency
1    159040
4     19867
5     12640
3      5179
2      5141
Name: count, dtype: int64

escalation risk counts
escalation_risk
1    158986
4     25470
2      6846
5      6104
3      4461
Name: count, dtype: int64

route team counts
route_team
nan                                  158986
General Operations / Review           17232
Authentication & Account Access       11836
[ CODE ]plot_complaint_category.py
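# complaints_viz is not defined earlier in this write-up; it is assumed here
# to be the complaint-only rows of data (42,881 rows).
complaints_viz = data[data["is_complaint_binary"] == 1].copy()
plt.figure(figsize=(14, 6))
complaints_viz["complaint_category"].value_counts().plot(kind="bar")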
plt.title("Complaint Category Distribution")
plt.xlabel("Complaint Category")
plt.ylabel("Count")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
Complaint Category Distribution:
General Complaint              16333  ████████████████████████████████
App Stability / Performance     8835  █████████████████
Login / Authentication / OTP    8147  ████████████████
Account / KYC / Profile         2941  █████
Cards / ATM / POS               1585  ███
Customer Support / Dispute      1416  ██
Loans / Credit                   899  █
Charges / Fees                   884  █
Failed Transaction / Reversal    827  █
Alerts / Notifications           748  █
Feature Request / UX             266  ▏
[ CODE ]plot_star_rating.py
plt.figure(figsize=(8, 5))
data["Star_rating"].value_counts().sort_index().plot(kind="bar")
plt.title("Star Rating Distribution")
plt.xlabel("Star Rating")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
Star Rating Distribution:
1 ★    30947  ████████
2 ★     6507  █
3 ★    10415  ██
4 ★    21998  █████
5 ★   132000  ████████████████████████████████
[ CODE ]plot_severity.py
plt.figure(figsize=(8, 5))
complaints_viz["complaint_severity"].value_counts().plot(kind="bar")
plt.title("Complaint Severity Distribution")
plt.xlabel("Complaint Severity")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
Complaint Severity Distribution:
4.0    25846  ████████████████████████████████
2.0     6111  ███████
5.0     5637  ██████
3.0     5287  ██████
[ CODE ]plot_urgency.py
plt.figure(figsize=(8, 5))
complaints_viz["urgency"].value_counts().plot(kind="bar")
plt.title("Urgency Distribution")
plt.xlabel("Urgency")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
Urgency Distribution (complaints only):
4    19867  ████████████████████████████████
5    12640  ████████████████████
3     5179  ████████
2     5141  ████████
[ CODE ]plot_escalation_risk.py
plt.figure(figsize=(8, 5))
complaints_viz["escalation_risk"].value_counts().plot(kind="bar")
plt.title("Escalation Risk Distribution")
plt.xlabel("Escalation Risk")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
Escalation Risk Distribution (complaints only):
4    25470  ████████████████████████████████
2     6846  ████████
5     6104  ███████
3     4461  █████
[ CODE ]plot_route_team.py
plt.figure(figsize=(12, 6))
complaints_viz["route_team"].value_counts().plot(kind="bar")
plt.title("Route Team Distribution")
plt.xlabel("Route Team")
plt.ylabel("Count")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
Route Team Distribution:
General Operations / Review          17232  ████████████████████████████████
Authentication & Account Access      11836  █████████████████████
App Stability & Performance           9101  ████████████████
Payments & Transactions               3297  █████
Customer Support & Dispute Res.       1699  ███
Product / Feature Requests             716  █
[ CODE ]plot_text_length_words.py
plt.figure(figsize=(12, 5))
plt.hist(complaints_viz["text_length_words"], bins=50)
plt.title("Comment Length in Words")
plt.xlabel("Number of Words")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
count: 42881
mean: 19.85
std: 17.32
min: 1.00
25%: 7.00
50%: 14.00
75%: 27.00
max: 219.00

The middle half of complaint comments fall between 7 and 27 words, with a long right tail.
[ CODE ]plot_text_length_chars.py
plt.figure(figsize=(12, 5))
plt.hist(data["text_length_chars"], bins=50)
plt.title("Comment Length in Characters")
plt.xlabel("Number of Characters")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
count: 201867
mean: 32.04
std: 43.77
min: 1.00
25%: 8.00
50%: 17.00
75%: 38.00
max: 2074.00

Heavily right-skewed. Most comments are under 40 characters.
[ CODE ]plot_avg_words_by_category.py
avg_words_by_category = data.groupby("complaint_category")["text_length_words"].mean().sort_values(ascending=False)
plt.figure(figsize=(14, 6))
avg_words_by_category.plot(kind="bar")
plt.title("Average Comment Length by Complaint Category")
plt.xlabel("Complaint Category")
plt.ylabel("Average Number of Words")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
Failed Transaction / Reversal: 42.7
Customer Support: 38.5
Cards / ATM / POS: 33.4
Login / Auth: 33.1
Account / KYC: 28.7
...
[ CODE ]star_rating_numeric.py
data["Star_rating_numeric"] = pd.to_numeric(data["Star_rating"], errors="coerce")
print(data["Star_rating_numeric"].describe())
[ OUTPUT ]Terminal
count: 201867.0
mean: 4.08
std: 1.49
min: 1.0
25%: 4.0
50%: 5.0
75%: 5.0
max: 5.0
[ CODE ]plot_avg_rating_by_category.py
avg_rating_by_category = data.groupby("complaint_category")["Star_rating_numeric"].mean().sort_values(ascending=False)
plt.figure(figsize=(14, 6))
avg_rating_by_category.plot(kind="bar")
plt.title("Average Star Rating by Complaint Category")
plt.xlabel("Complaint Category")
plt.ylabel("Average Star Rating")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
App Stability: 1.96
Feature Request: 1.94
Loans / Credit: 1.86
...
[ CODE ]plot_category_vs_star.py
cat_star = pd.crosstab(complaints_viz["complaint_category"], complaints_viz["Star_rating"])
plt.figure(figsize=(12, 8))
plt.imshow(cat_star, aspect="auto")
plt.colorbar()
plt.xticks(range(len(cat_star.columns)), cat_star.columns)
plt.yticks(range(len(cat_star.index)), cat_star.index)
plt.title("Complaint Category vs Star Rating")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
General Complaint: 8476 (1*), 2640 (2*), ...
App Stability: 4736 (1*), 1198 (2*), ...
...
[ CODE ]plot_category_vs_severity.py
cat_severity = pd.crosstab(complaints_viz["complaint_category"], complaints_viz["complaint_severity"])
plt.figure(figsize=(12, 8))
plt.imshow(cat_severity, aspect="auto")
plt.colorbar()
plt.xticks(range(len(cat_severity.columns)), cat_severity.columns, rotation=45)
plt.yticks(range(len(cat_severity.index)), cat_severity.index)
plt.title("Complaint Category vs Complaint Severity")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
General Complaint: 3182 (2.0), 2644 (3.0), ...
App Stability: 1253 (2.0), 854 (3.0), ...
...
[ CODE ]plot_category_vs_escalation.py
cat_risk = pd.crosstab(complaints_viz["complaint_category"], complaints_viz["escalation_risk"])
plt.figure(figsize=(12, 8))
plt.imshow(cat_risk, aspect="auto")
plt.colorbar()
plt.xticks(range(len(cat_risk.columns)), cat_risk.columns, rotation=45)
plt.yticks(range(len(cat_risk.index)), cat_risk.index)
plt.title("Complaint Category vs Escalation Risk")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
General Complaint: 3527 (2), 2077 (3), ...
App Stability: 1364 (2), 758 (3), ...
...
[ CODE ]plot_urgency_by_category.py
urgency_by_category = pd.crosstab(complaints_viz["complaint_category"], complaints_viz["urgency"])
urgency_by_category.plot(kind="bar", stacked=True, figsize=(14, 6))
plt.title("Urgency by Category")
plt.xlabel("Complaint Category")
plt.ylabel("Count")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
General Complaint: 2537 (Urg-2), 2200 (Urg-3), ...
App Stability: 1063 (Urg-2), 764 (Urg-3), ...
...
[ CODE ]plot_severity_by_category.py
severity_by_category = pd.crosstab(complaints_viz["complaint_category"], complaints_viz["complaint_severity"])
severity_by_category.plot(kind="bar", stacked=True, figsize=(14, 6))
plt.title("Complaint Severity by Category")
plt.xlabel("Complaint Category")
plt.ylabel("Count")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
General Complaint: 3182 (Sev-2), 2644 (Sev-3), ...
App Stability: 1253 (Sev-2), 854 (Sev-3), ...
...
[ CODE ]plot_risk_by_category.py
risk_by_category = pd.crosstab(complaints_viz["complaint_category"], complaints_viz["escalation_risk"])
risk_by_category.plot(kind="bar", stacked=True, figsize=(14, 6))
plt.title("Escalation Risk by Category")
plt.xlabel("Complaint Category")
plt.ylabel("Count")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
General Complaint: 3527 (Risk-2), 2077 (Risk-3), ...
App Stability: 1364 (Risk-2), 758 (Risk-3), ...
...
[ CODE ]route_team_summary.py
route_summary = data.groupby("route_team").agg({"text_length_words": "mean", "Star_rating_numeric": "mean"}).sort_values("text_length_words", ascending=False)
print(route_summary)
[ OUTPUT ]Terminal
Payments: 42.16 words, 1.86 rating
Support: 38.85 words, 1.69 rating
Auth: 33.09 words, 1.66 rating
...
[ CODE ]plot_avg_words_by_route.py
plt.figure(figsize=(12, 6))
route_summary["text_length_words"].plot(kind="bar")
plt.title("Average Comment Length by Route Team")
plt.xlabel("Route Team")
plt.ylabel("Average Number of Words")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
Payments: 42.2
Support: 38.9
Auth: 33.1
...
[ CODE ]plot_avg_rating_by_route.py
plt.figure(figsize=(12, 6))
route_summary["Star_rating_numeric"].plot(kind="bar")
plt.title("Average Star Rating by Route Team")
plt.xlabel("Route Team")
plt.ylabel("Average Star Rating")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
App Stability: 1.96
Product: 1.94
Payments: 1.86
...
[ CODE ]top_words.py
all_text = " ".join(data["clean_text"].dropna().astype(str))
word_counts = Counter(all_text.split())
common_words_df = pd.DataFrame(word_counts.most_common(20), columns=["word", "count"])
print(common_words_df)

plt.figure(figsize=(12, 6))
plt.bar(common_words_df["word"], common_words_df["count"])
plt.title("Top 20 Most Common Words")
plt.xlabel("Word")
plt.ylabel("Count")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
word: i (79009), the (68920), app (68388), ...

Insights from Analysis

From exploring the complaint fields and ratings, several things became clear:

  • Ratings vs. Complaints: While 1-star reviews are often complaints, many are not. Similarly, some 5-star reviews contain genuine complaints. This confirmed that star rating alone is not a reliable proxy for complaint status.
  • Category Dominance: General complaints and stability/performance issues together account for about 59% of the 42,881 complaints (16,333 and 8,835 respectively).
  • Text Length: Complaint reviews tend to be longer on average than non-complaint reviews, as users take more time to explain their frustration.
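
The category-dominance point can be checked with a few lines of arithmetic. This sketch uses the complaint counts printed earlier and computes the share held by the two largest categories, plus the ratio between the largest and smallest class, which gives a rough sense of how imbalanced the category target is.

```python
# Complaint counts per category, copied from the EDA output above.
category_counts = {
    "General Complaint": 16333,
    "App Stability / Performance": 8835,
    "Login / Authentication / OTP": 8147,
    "Account / KYC / Profile": 2941,
    "Cards / ATM / POS": 1585,
    "Customer Support / Dispute": 1416,
    "Loans / Credit": 899,
    "Charges / Fees": 884,
    "Failed Transaction / Reversal": 827,
    "Alerts / Notifications": 748,
    "Feature Request / UX": 266,
}

total = sum(category_counts.values())
top_two = (category_counts["General Complaint"]
           + category_counts["App Stability / Performance"])
top_two_share = top_two / total
imbalance_ratio = max(category_counts.values()) / min(category_counts.values())

print(f"total complaints: {total}")
print(f"top-two share: {top_two_share:.1%}")
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```

The two largest categories hold a little under 59% of complaints, and the largest class is roughly 61 times the size of the smallest. That imbalance is worth keeping in mind when reading per-class scores later on.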
Model V1

The Problem with Direct Category Prediction

The first modelling attempt focused directly on complaint category prediction. The model could return a complaint category, but it had no way to decide whether a review was a complaint in the first place. That meant non-complaint reviews could still be pushed into complaint classes simply because the workflow assumed every incoming comment belonged there.

The fix was to step back and change both the data flow and the modelling logic. First, the system needed to answer: is this a complaint or not? Only after that could it answer: if it is a complaint, what category does it belong to?
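
The two-step logic can be sketched as a small gating function. This is an illustration under stated assumptions, not the app's actual code: `triage`, `stub_detector` and `stub_categorizer` are hypothetical names, and the keyword stubs simply stand in for the trained models.

```python
from typing import Callable, Optional

def triage(text: str,
           is_complaint: Callable[[str], bool],
           categorize: Callable[[str], str]) -> Optional[str]:
    """Return a category for complaints, or None for non-complaints."""
    if not is_complaint(text):
        return None          # stage 1 rejects: stage 2 never runs
    return categorize(text)  # stage 2 only sees confirmed complaints

# Hypothetical keyword stubs standing in for the trained models.
def stub_detector(text: str) -> bool:
    return any(w in text.lower() for w in ("failed", "cannot", "error"))

def stub_categorizer(text: str) -> str:
    if "otp" in text.lower():
        return "Login / Authentication / OTP"
    return "General Complaint"

print(triage("Good app", stub_detector, stub_categorizer))
print(triage("OTP failed again", stub_detector, stub_categorizer))
```

The point of the gate is that a review like "Good app" never reaches the category model, so it cannot be forced into a complaint class.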

Stage 1

Stage 1 — Complaint Detection (Binary Classification)

[ CODE ]inspect_binary.py
print(data["is_complaint"].value_counts(dropna=False))
print(data["is_complaint"].unique())
[ OUTPUT ]Terminal
is_complaint
No     158986
Yes     42881
Name: count, dtype: int64
['Yes' 'No']
[ CODE ]prepare_binary.py
data_binary = data.dropna(subset=["is_complaint_binary"]).copy()
data_binary["is_complaint_binary"] = data_binary["is_complaint_binary"].astype(int)
print("binary dataset shape:", data_binary.shape)
print(data_binary["is_complaint_binary"].value_counts())

X_bin = data_binary["clean_text"]
y_bin = data_binary["is_complaint_binary"]
print("X_bin rows:", len(X_bin))
print("y_bin rows:", len(y_bin))

X_train_bin, X_test_bin, y_train_bin, y_test_bin = train_test_split(
    X_bin, y_bin, test_size=0.20, random_state=42, stratify=y_bin
)
print("binary train size:", len(X_train_bin))
print("binary test size:", len(X_test_bin))
[ OUTPUT ]Terminal
binary dataset shape: (201867, 13)
is_complaint_binary
0    158986
1     42881
Name: count, dtype: int64
X_bin rows: 201867
y_bin rows: 201867
binary train size: 161493
binary test size: 40374
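
The split sizes above can be reproduced by hand. To my understanding, `train_test_split` rounds a fractional test size up and gives the remainder to the training set; the sketch below applies that rule (the helper name `split_sizes` is hypothetical) and also checks that stratification keeps the complaint share nearly identical in the test part. The figure 8576 is the complaint count in the test split, which appears as the class-1 support in the evaluation reports.

```python
import math

def split_sizes(n_samples, test_size):
    """Train/test row counts for a fractional test_size (test part rounded up)."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_train, n_test = split_sizes(201867, 0.20)
print(n_train, n_test)

# Complaint share in the full data versus the stratified test split.
overall_share = 42881 / 201867
test_share = 8576 / n_test
print(f"{overall_share:.4f} {test_share:.4f}")
```

The same arithmetic applied to the 42,881 complaint rows gives 34,304 training rows and 8,577 test rows.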
[ CODE ]train_binary_models.py
binary_models = {
    "LogisticRegression": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=20000, ngram_range=(1, 2),
                                   stop_words="english", min_df=2)),
        ("clf", LogisticRegression(max_iter=3000, class_weight="balanced"))
    ]),
    "LinearSVC": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=20000, ngram_range=(1, 2),
                                   stop_words="english", min_df=2)),
        ("clf", LinearSVC(class_weight="balanced"))
    ]),
    "MultinomialNB": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=20000, ngram_range=(1, 2),
                                   stop_words="english", min_df=2)),
        ("clf", MultinomialNB())
    ])
}

binary_results = []
trained_binary_models = {}

for name, model in binary_models.items():
    print(f"\ntraining binary model: {name}")
    model.fit(X_train_bin, y_train_bin)
    preds = model.predict(X_test_bin)
    acc = accuracy_score(y_test_bin, preds)
    f1 = f1_score(y_test_bin, preds, average="weighted")
    print("accuracy:", acc)
    print("weighted f1:", f1)
    print(classification_report(y_test_bin, preds))
    binary_results.append({"model": name, "accuracy": acc, "weighted_f1": f1})
    trained_binary_models[name] = model

binary_results_df = pd.DataFrame(binary_results).sort_values(by="weighted_f1", ascending=False)
print(binary_results_df)
[ OUTPUT ]Terminal
training binary model: LogisticRegression
accuracy: 0.9362708673899044
weighted f1: 0.9377382995897665
              precision    recall  f1-score   support
           0       0.98      0.94      0.96     31798
           1       0.81      0.92      0.86      8576
    accuracy                           0.94     40374
   macro avg       0.89      0.93      0.91     40374
weighted avg       0.94      0.94      0.94     40374

training binary model: LinearSVC
accuracy: 0.9335463417050577
weighted f1: 0.9347657844828152
              precision    recall  f1-score   support
           0       0.97      0.94      0.96     31798
           1       0.81      0.90      0.85      8576
    accuracy                           0.93     40374
   macro avg       0.89      0.92      0.90     40374
weighted avg       0.94      0.93      0.93     40374

training binary model: MultinomialNB
accuracy: 0.9401595085946401
weighted f1: 0.9402330314373568
              precision    recall  f1-score   support
           0       0.96      0.96      0.96     31798
           1       0.86      0.86      0.86      8576
    accuracy                           0.94     40374
   macro avg       0.91      0.91      0.91     40374
weighted avg       0.94      0.94      0.94     40374

                model  accuracy  weighted_f1
2       MultinomialNB  0.940160     0.940233
0  LogisticRegression  0.936271     0.937738
1           LinearSVC  0.933546     0.934766
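
LogisticRegression and LinearSVC were trained with `class_weight="balanced"`, while MultinomialNB was not. In scikit-learn, the balanced mode weights each class by `n_samples / (n_classes * class_count)`. A short worked example of what that means for this training split; the helper name `balanced_class_weights` is hypothetical, and the training counts are derived as full-data counts minus the test supports shown in the reports.

```python
def balanced_class_weights(class_counts):
    """n_samples / (n_classes * count) for each class."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Training-split counts: full-data counts minus the test supports above.
train_counts = {0: 158986 - 31798, 1: 42881 - 8576}
weights = balanced_class_weights(train_counts)
print({label: round(w, 3) for label, w in weights.items()})
```

Each complaint row therefore counts for roughly 3.7 times as much as a non-complaint row during training. That is consistent with the two weighted models showing higher complaint recall (0.92 and 0.90) than MultinomialNB (0.86), at the cost of lower complaint precision.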
[ CODE ]plot_binary_comparison.py
plt.figure(figsize=(8, 5))
plt.bar(binary_results_df["model"], binary_results_df["weighted_f1"])
plt.title("Complaint Detection Model Comparison")
plt.xlabel("Model")
plt.ylabel("Weighted F1")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
Complaint Detection - Model Comparison (Weighted F1):
MultinomialNB       0.9402  ████████████████████████████████
LogisticRegression  0.9378  ███████████████████████████████
LinearSVC           0.9348  ██████████████████████████████
[ CODE ]best_binary_model.py
best_binary_model_name = binary_results_df.iloc[0]["model"]
best_binary_model = trained_binary_models[best_binary_model_name]
print("best binary model:", best_binary_model_name)
[ OUTPUT ]Terminal
best binary model: MultinomialNB
[ CODE ]binary_confusion_matrix.py
bin_preds = best_binary_model.predict(X_test_bin)
cm_bin = confusion_matrix(y_test_bin, bin_preds)
plt.figure(figsize=(6, 5))
plt.imshow(cm_bin, aspect="auto")
plt.colorbar()
plt.xticks([0, 1], ["Non-Complaint", "Complaint"])
plt.yticks([0, 1], ["Non-Complaint", "Complaint"])
plt.title(f"Confusion Matrix - {best_binary_model_name}")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
Confusion Matrix - MultinomialNB (Complaint Detection):
                    Predicted
                    Non-Complaint  Complaint
Actual Non-Cmplnt     30555         1243
Actual Complaint       1149         7427

True Negative Rate:  96.1%
True Positive Rate:  86.6%
Overall Accuracy:    94.0%
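
The rates above follow directly from the four cells of the confusion matrix. A short worked example, using only the counts printed above, that also derives precision and F1 for the complaint class:

```python
# Counts copied from the confusion matrix above.
tn, fp = 30555, 1243   # actual non-complaints
fn, tp = 1149, 7427    # actual complaints

tnr = tn / (tn + fp)                 # share of non-complaints correctly passed over
tpr = tp / (tp + fn)                 # recall for the complaint class
precision = tp / (tp + fp)           # share of flagged reviews that are real complaints
f1 = 2 * precision * tpr / (precision + tpr)

print(f"TNR {tnr:.3f}  TPR {tpr:.3f}  precision {precision:.3f}  F1 {f1:.3f}")
```

Precision and recall for complaints both come out at about 0.86, matching the classification report. In practical terms, 1,149 real complaints in the test split were missed and 1,243 non-complaints were flagged.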
Stage 2

Stage 2 — Complaint Category Prediction

[ CODE ]prepare_category.py
complaint_only = data_binary[data_binary["is_complaint_binary"] == 1].copy()
complaint_only = complaint_only[
    (complaint_only["complaint_category"].notna()) &
    (complaint_only["complaint_category"].astype(str).str.strip() != "") &
    (complaint_only["complaint_category"].astype(str).str.lower() != "nan")
].copy()

print("complaint-only shape:", complaint_only.shape)
print(complaint_only["complaint_category"].value_counts())

X_cat = complaint_only["clean_text"]
y_cat = complaint_only["complaint_category"]
print("X_cat rows:", len(X_cat))
print("y_cat rows:", len(y_cat))

X_train_cat, X_test_cat, y_train_cat, y_test_cat = train_test_split(
    X_cat, y_cat, test_size=0.20, random_state=42, stratify=y_cat
)
print("category train size:", len(X_train_cat))
print("category test size:", len(X_test_cat))
[ OUTPUT ]Terminal
complaint-only shape: (42881, 13)
complaint_category
General Complaint              16333
App Stability / Performance     8835
Login / Authentication / OTP    8147
Account / KYC / Profile         2941
Cards / ATM / POS               1585
Customer Support / Dispute      1416
Loans / Credit                   899
Charges / Fees                   884
Failed Transaction / Reversal    827
Alerts / Notifications           748
Feature Request / UX             266
Name: count, dtype: int64
X_cat rows: 42881
y_cat rows: 42881
category train size: 34304
category test size: 8577
[ CODE ]train_category_models.py
category_models = {
    "LogisticRegression": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=20000, ngram_range=(1, 2),
                                   stop_words="english", min_df=2)),
        ("clf", LogisticRegression(max_iter=3000))
    ]),
    "LinearSVC": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=20000, ngram_range=(1, 2),
                                   stop_words="english", min_df=2)),
        ("clf", LinearSVC())
    ]),
    "MultinomialNB": Pipeline([
        ("tfidf", TfidfVectorizer(max_features=20000, ngram_range=(1, 2),
                                   stop_words="english", min_df=2)),
        ("clf", MultinomialNB())
    ])
}

category_results = []
trained_category_models = {}

for name, model in category_models.items():
    print(f"\ntraining category model: {name}")
    model.fit(X_train_cat, y_train_cat)
    preds = model.predict(X_test_cat)
    acc = accuracy_score(y_test_cat, preds)
    f1 = f1_score(y_test_cat, preds, average="weighted")
    print("accuracy:", acc)
    print("weighted f1:", f1)
    print(classification_report(y_test_cat, preds))
    category_results.append({"model": name, "accuracy": acc, "weighted_f1": f1})
    trained_category_models[name] = model

category_results_df = pd.DataFrame(category_results).sort_values(by="weighted_f1", ascending=False)
print(category_results_df)
[ OUTPUT ]Terminal
training category model: LogisticRegression
accuracy: 0.8679025300221522
weighted f1: 0.860208839903338
                               precision  recall  f1-score  support
Account / KYC / Profile             0.93    0.65      0.76      588
Alerts / Notifications              0.83    0.48      0.61      150
App Stability / Performance         0.86    0.89      0.87     1767
Cards / ATM / POS                   0.94    0.80      0.87      317
Charges / Fees                      0.95    0.56      0.70      177
Customer Support / Dispute          0.79    0.76      0.77      283
Failed Transaction / Reversal       0.82    0.44      0.57      165
Feature Request / UX                1.00    0.08      0.14       53
General Complaint                   0.85    0.98      0.91     3267
Loans / Credit                      0.82    0.64      0.72      180
Login / Authentication / OTP        0.92    0.89      0.91     1630
accuracy                                             0.87     8577
macro avg                           0.88    0.65      0.71     8577
weighted avg                        0.87    0.87      0.86     8577

training category model: LinearSVC
accuracy: 0.9167541098286114
weighted f1: 0.9146686268058456
                               precision  recall  f1-score  support
Account / KYC / Profile             0.94    0.79      0.86      588
Alerts / Notifications              0.87    0.78      0.82      150
App Stability / Performance         0.89    0.93      0.91     1767
Cards / ATM / POS                   0.97    0.91      0.94      317
Charges / Fees                      0.96    0.86      0.91      177
Customer Support / Dispute          0.85    0.85      0.85      283
Failed Transaction / Reversal       0.84    0.56      0.67      165
Feature Request / UX                0.81    0.40      0.53       53
General Complaint                   0.92    0.98      0.95     3267
Loans / Credit                      0.87    0.87      0.87      180
Login / Authentication / OTP        0.94    0.92      0.93     1630
accuracy                                             0.92     8577
macro avg                           0.90    0.80      0.84     8577
weighted avg                        0.92    0.92      0.91     8577

                model  accuracy  weighted_f1
1           LinearSVC  0.916754     0.914669
0  LogisticRegression  0.867903     0.860209
[ CODE ]plot_category_comparison.py
plt.figure(figsize=(8, 5))
plt.bar(category_results_df["model"], category_results_df["weighted_f1"])
plt.title("Complaint Category Model Comparison")
plt.xlabel("Model")
plt.ylabel("Weighted F1")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
Category Prediction - Model Comparison (Weighted F1):
LinearSVC           0.9147  ████████████████████████████████
LogisticRegression  0.8602  █████████████████████████████
MultinomialNB       0.6532  ██████████████████████
[ CODE ]best_category_model.py
best_category_model_name = category_results_df.iloc[0]["model"]
best_category_model = trained_category_models[best_category_model_name]
print("best category model:", best_category_model_name)
[ OUTPUT ]Terminal
best category model: LinearSVC
[ CODE ]category_confusion_matrix.py
cat_preds = best_category_model.predict(X_test_cat)
labels_sorted_cat = sorted(y_cat.unique())
cm_cat = confusion_matrix(y_test_cat, cat_preds, labels=labels_sorted_cat)
plt.figure(figsize=(12, 10))
plt.imshow(cm_cat, aspect="auto")
plt.colorbar()
plt.xticks(range(len(labels_sorted_cat)), labels_sorted_cat, rotation=45, ha="right")
plt.yticks(range(len(labels_sorted_cat)), labels_sorted_cat)
plt.title(f"Confusion Matrix - {best_category_model_name}")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.tight_layout()
plt.show()
[ OUTPUT ]Terminal
Confusion Matrix - LinearSVC (Category Prediction), per-class summary:
                               Precision  Recall  F1    Support
Account / KYC / Profile        0.94       0.79    0.86      588
Alerts / Notifications         0.87       0.78    0.82      150
App Stability / Performance    0.89       0.93    0.91     1767
Cards / ATM / POS              0.97       0.91    0.94      317
Charges / Fees                 0.96       0.86    0.91      177
Customer Support / Dispute     0.85       0.85    0.85      283
Failed Transaction / Reversal  0.84       0.56    0.67      165
Feature Request / UX           0.81       0.40    0.53       53
General Complaint              0.92       0.98    0.95     3267
Loans / Credit                 0.87       0.87    0.87      180
Login / Authentication / OTP   0.94       0.92    0.93     1630

Overall Accuracy: 91.7%
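
With raw counts, a confusion matrix for this task is dominated by "General Complaint" (3,267 test rows), which makes the small classes hard to read. One option is to normalise each row so the diagonal shows per-class recall. The snippet below is a minimal sketch on hypothetical toy labels, not the project's actual predictions; `y_true` and `y_pred` stand in for `y_test_cat` and `cat_preds`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels standing in for y_test_cat and cat_preds.
y_true = ["General Complaint"] * 8 + ["Feature Request / UX"] * 2
y_pred = ["General Complaint"] * 8 + ["General Complaint", "Feature Request / UX"]

labels = sorted(set(y_true))

# normalize="true" divides each row by its row total,
# so every row sums to 1 and the diagonal is per-class recall.
cm_norm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")

for label, row in zip(labels, cm_norm):
    print(f"{label:<22}", np.round(row, 2))
```

The resulting array can be passed to `plt.imshow` in the same way as `cm_cat`.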
Interpretability

Top Predictive Words by Category

[ CODE ]top_words_per_class.py
if best_category_model_name == "LinearSVC":
    tfidf = best_category_model.named_steps["tfidf"]
    clf = best_category_model.named_steps["clf"]
    feature_names = np.array(tfidf.get_feature_names_out())
    classes = clf.classes_
    # coef_ holds one row of term weights per class (one-vs-rest).
    for i, class_label in enumerate(classes):
        # argsort is ascending, so the last entry is the strongest positive term.
        top10 = np.argsort(clf.coef_[i])[-10:]
        print(f"\ntop words for class: {class_label}")
        print(feature_names[top10])
[ OUTPUT ]Terminal
top words for class: Account / KYC / Profile
['functioning' 'uninstalling' 'profile' 'complaining' 'uninstalled' 'bvn'
 'morning' 'uninstall' 'happening' 'opening']

top words for class: Alerts / Notifications
['debit alert' 'debit alerts' 'send email' 'ticket' 'sms'
 'notifications' 'alerts' 'notification' 'email' 'alert']

top words for class: App Stability / Performance
['upgraded' 'crashing' 'change' 'updated' 'download' 'error' 'slow'
 'update' 'upgrade' 'network']

top words for class: Cards / ATM / POS
['sim card' 'advisable' 'dollar card' 'id card' 'debit card'
 'virtual cards' 'mastercard' 'pos' 'atm' 'card']

top words for class: Charges / Fees
['transfer recharge' 'school fees' 'school' 'recharged' 'charged'
 'fees' 'fee' 'recharge' 'charge' 'charges']

top words for class: Customer Support / Dispute
['customer service' 'customer care' 'dispute' 'complaint' 'support']
Evolution

Final Workflow Summary

The project progressed from data collection to a structured pipeline that can be reproduced and updated as new reviews arrive.

  • Scraped real Google Play Store reviews from Opay, Alat by Wema, Union Bank, Access Bank, GT Bank, OneBank by Sterling, Providus Bank, Globus Bank, and Kuda.
  • Reviewed and structured the dataset around the actual complaint task instead of using a dummy text set.
  • Manually added the complaint indicator, severity, urgency, escalation risk, complaint category, and route team fields.
  • Cleaned the review text and prepared the data for exploration and modelling.
  • Explored complaint patterns, rating patterns, and class balance before model training.
  • Tested an initial complaint-category model and identified the structural problem with category-only prediction.
  • Redesigned the workflow into a staged pipeline, with complaint detection first and complaint categorisation second.
  • Compared text classifiers, selected the stronger models, and saved the final pipeline.
  • Built a Streamlit app that generates the complaint outputs from a single pasted comment.
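
The staged design in the steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the saved pipeline: the toy reviews, the `build_text_pipeline` helper, and the `triage` function are hypothetical, and only the detection and category stages are shown. The choice of MultinomialNB for detection and LinearSVC for categorisation follows the best models reported for each stage.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_text_pipeline(clf):
    """Hypothetical helper: TF-IDF features followed by a classifier."""
    return Pipeline([("tfidf", TfidfVectorizer(ngram_range=(1, 2))), ("clf", clf)])


# Hypothetical toy data; the real project trains on the scraped reviews.
reviews = [
    "great app, very fast and easy to use",
    "i love this bank app",
    "the app keeps crashing after the update",
    "my card was declined at the atm",
    "i was charged twice for one transfer",
    "cannot login, otp never arrives",
]
is_complaint = [0, 0, 1, 1, 1, 1]
categories = [
    None,
    None,
    "App Stability / Performance",
    "Cards / ATM / POS",
    "Charges / Fees",
    "Login / Authentication / OTP",
]

# Stage 1: complaint vs non-complaint, trained on every review.
detector = build_text_pipeline(MultinomialNB()).fit(reviews, is_complaint)

# Stage 2: complaint category, trained on complaint rows only.
complaint_texts = [t for t, flag in zip(reviews, is_complaint) if flag == 1]
complaint_labels = [c for c in categories if c is not None]
categoriser = build_text_pipeline(LinearSVC()).fit(complaint_texts, complaint_labels)


def triage(comment):
    """Run stage 2 only when stage 1 flags the comment as a complaint."""
    if detector.predict([comment])[0] == 0:
        return {"is_complaint": False, "category": None}
    category = categoriser.predict([comment])[0]
    return {"is_complaint": True, "category": str(category)}


print(triage("the app keeps crashing after the update"))
```

Keeping the two stages separate means the category model never has to learn what a non-complaint looks like, which is the structural problem the category-only model ran into.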
Analytics

Results and Accuracy Discussion

Stage                 Best Model      Accuracy   Weighted F1
Complaint Detection   MultinomialNB   94.0%      0.940
Category Prediction   LinearSVC       91.7%      0.915
Constraint

Class Imbalance & Accuracy Discussion

The models performed well overall, but performance was uneven across classes. The main reason is class imbalance. The dataset contained far more positive and non-complaint comments (158,986) than complaint comments (42,881). Within complaints, “General Complaint” (16,333) dominated, while categories such as “Feature Request / UX” (266) and “Failed Transaction / Reversal” (827) had far fewer examples. The LinearSVC report reflects this: recall was 0.98 for “General Complaint” but only 0.40 for “Feature Request / UX” and 0.56 for “Failed Transaction / Reversal”.
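
The effect of the imbalance on the headline score can be checked with simple arithmetic. The sketch below recomputes the macro and weighted F1 from the rounded per-class F1 and support values in the LinearSVC report; the `report` dictionary is just those printed numbers typed in by hand, so the results agree with the report only up to rounding.

```python
# (f1, support) per class, copied from the LinearSVC classification report.
report = {
    "Account / KYC / Profile": (0.86, 588),
    "Alerts / Notifications": (0.82, 150),
    "App Stability / Performance": (0.91, 1767),
    "Cards / ATM / POS": (0.94, 317),
    "Charges / Fees": (0.91, 177),
    "Customer Support / Dispute": (0.85, 283),
    "Failed Transaction / Reversal": (0.67, 165),
    "Feature Request / UX": (0.53, 53),
    "General Complaint": (0.95, 3267),
    "Loans / Credit": (0.87, 180),
    "Login / Authentication / OTP": (0.93, 1630),
}

total_support = sum(support for _, support in report.values())

# Macro F1: every class counts equally, so weak minority classes pull it down.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted F1: each class counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(f"macro f1:    {macro_f1:.2f}")     # 0.84
print(f"weighted f1: {weighted_f1:.3f}")  # 0.915
```

The gap between 0.84 and 0.915 comes almost entirely from the small classes: "General Complaint" alone accounts for 3,267 of the 8,577 test rows, so its 0.95 F1 carries far more weight than the 0.53 for "Feature Request / UX".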

What would improve the results further:

  • More complaint data: Collecting a larger volume of genuine complaint reviews, especially for the underrepresented categories, would give the model more training signal where it currently struggles most.
  • Rebalancing the classes: Oversampling techniques like SMOTE, or class weighting tuned for the minority complaint categories, could help balance the learning process.
  • Broader scraping: Expanding the scraping to more banking apps and longer time windows would increase the natural variety of complaint language available to the model.
  • Transformer-based models: Moving from TF-IDF with classical classifiers to fine-tuned transformer models like BERT could capture deeper contextual patterns in the review text.
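
Of these options, class weighting is the cheapest to try because it only changes one argument on the existing classifier. The sketch below is a minimal illustration on hypothetical toy data, not a result from this project: it shows how scikit-learn's `"balanced"` setting weights each class by `n_samples / (n_classes * class_count)` and how that setting plugs into a TF-IDF + LinearSVC pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy data with an 8:2 imbalance between two categories.
texts = [
    "this app is terrible", "worst bank app ever", "very bad experience",
    "useless app", "so frustrating", "nothing works",
    "rubbish service", "i am tired of this app",
    "please add dark mode", "please add fingerprint option",
]
labels = np.array(["General Complaint"] * 8 + ["Feature Request / UX"] * 2)

# "balanced" weights are n_samples / (n_classes * class_count):
# 10 / (2 * 2) = 2.5 for the minority class, 10 / (2 * 8) = 0.625 for the majority.
classes = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
for cls, w in zip(classes, weights):
    print(f"{cls:<22} {w:.3f}")

# The same weighting applied inside the classifier.
weighted_model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LinearSVC(class_weight="balanced")),
])
weighted_model.fit(texts, labels)
print(weighted_model.predict(["please add a budgeting feature"]))
```

Whether this helps the real minority categories would need to be checked on the held-out test split, ideally by comparing macro F1 and per-class recall before and after.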
Resources

Downloads and Links