Caught in the Web: Using Machine Learning to Combat Scam SMS Messages

Technology
Kreeshi Shavdia, Ethan Kawamara, Tony Munene, Shawn Mwenje

28 Mar

Aisha was mindlessly scrolling through her phone when an SMS pinged. The message announced a job vacancy at a well-known company, offering an attractive yet believable salary. It ended with, ‘Click this link to find out more.’

In increasingly tough times, Aisha couldn’t ignore the opportunity. She clicked the link, which led to a professional-looking website, and submitted her application, including her personal details. She felt optimistic when she received a follow-up SMS – informing her that she needed a certificate of good conduct to proceed. Time was of the essence, they claimed. Without it, the company would move on. The sender included a phone number and assured her that for Ksh 2,000, the certificate could be processed within hours. Panicked at the thought of losing the opportunity, Aisha sent the money.

Only later did she realise she had fallen victim to a scam.

Unfortunately, Aisha’s story is not unique. Across Kenya and beyond, almost 2.7 billion people experienced cyber scams in 2024! But what types of scam messages exist, and how can they be effectively identified and flagged?

Types of Scam Messages

Fraudulent SMS attacks often follow well-orchestrated patterns, preying on human vulnerabilities. Below are some common types:

Smishing (SMS phishing): Fraudsters trick victims into revealing personal or financial information via text messages. These messages often appear to come from reputable organisations, such as banks or mobile service providers, and may contain links to fake websites. Aisha’s story is one such case.
M-Pesa Fraud: Fraudsters target mobile money users by sending fake payment confirmation messages or claiming accidental transfers. Victims are then coerced into sending back ‘refunds,’ which go straight to the fraudsters’ accounts.
Hoax SMS: These messages often announce false wins in lotteries or promotions, requiring victims to pay a ‘processing fee’ to claim their prize.
Impersonation Scams: Scammers impersonate trusted entities like mobile service providers, demanding urgent action, such as updating account details or paying overdue bills, under threat of service disruption.

According to a study published in the East African Journal of Information Technology, fraudsters exploit mobile networks using these tactics. It notes that mobile money services such as M-Pesa are often targeted because of their widespread use and accessibility.

Classifying scam messages is vital to protect consumers and maintain trust in communication systems. Fraudulent calls and SMS top the list of consumer complaints to the Communications Authority of Kenya, and victims face financial losses, emotional distress, and potential identity theft. The study emphasises the need for proactive measures, including machine learning models, to monitor and flag suspicious activity in real time, especially as fraudsters continually evolve their methods and make it hard for traditional detection systems to keep up. By identifying and flagging scam messages early, mobile service providers, regulatory authorities, and technology platforms can curb these activities and safeguard users.

Using Machine Learning to Flag Scam SMS

Machine learning (ML) offers a powerful solution to the problem of scam text classification. By training models on large datasets of labelled messages, ML algorithms can identify patterns and classify texts as genuine or fraudulent.

How Machine Learning Works in Scam Detection

Preprocessing Text Data
Messages are cleaned by removing stop words and punctuation. Stop words are common words in any language that do not carry significant meaning for analysis. Examples in English include ‘the,’ ‘is,’ and ‘and.’ Removing stop words reduces noise in the data, allowing models to focus on meaningful terms. The remaining text is then tokenised, that is, broken down into smaller units such as words or characters. Tokenisation is crucial because most machine learning models work with individual tokens rather than raw sentences.
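The cleaning step can be sketched in a few lines of Python. This is a minimal illustration rather than a production pipeline: the stop-word list is a tiny, made-up sample, the function name `preprocess` is hypothetical, and the sketch splits the text into tokens before filtering stop words so that whole words can be matched.

```python
import string

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "is", "and", "a", "to", "this", "for", "of", "you", "your"}


def preprocess(message):
    """Lower-case, strip ASCII punctuation, tokenise on whitespace, drop stop words."""
    lowered = message.lower()
    # string.punctuation covers ASCII symbols only; curly quotes would need extra handling.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("Click this link to claim the prize!"))
# → ['click', 'link', 'claim', 'prize']
```

Real SMS traffic in Kenya often mixes English, Swahili, and Sheng, so a deployed system would likely need stop-word lists and tokenisation rules suited to each.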
Vectorisation
Techniques such as count vectorisation and term frequency-inverse document frequency (TF-IDF) are used to convert text into numerical data. Machine learning models require numerical input, so this step is what allows them to analyse messages and learn patterns.
Count Vectorisation creates a matrix of word frequencies. Each unique word in the text becomes a column, and each message is represented as a row. The value in each cell indicates the number of times a word appears in the respective message.
TF-IDF refines count vectorisation by considering word importance. Term frequency refers to how often a word appears in a document. Inverse document frequency (IDF) gives less weight to common words across all documents, emphasising rare but meaningful terms.
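To make the two representations concrete, here is a small sketch using only the Python standard library. It assumes the messages have already been tokenised, uses the raw count as the term frequency, and uses the textbook weighting idf = log(N / df); library implementations often apply smoothed or normalised variants, so their numbers will differ. The function names are illustrative.

```python
import math
from collections import Counter


def count_vectorise(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in message i."""
    vocab = sorted({word for doc in docs for word in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix


def tf_idf(matrix):
    """Weight each count by idf = log(N / df), the unsmoothed textbook form."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    # Document frequency: the number of messages that contain each term.
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / df[j]) for j in range(n_terms)]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]


docs = [
    ["win", "cash", "prize", "now"],
    ["meeting", "moved", "now"],
    ["claim", "cash", "cash", "now"],
]
vocab, counts = count_vectorise(docs)
weights = tf_idf(counts)

print(vocab)      # columns of the matrix, in alphabetical order
print(counts[2])  # word counts for the third message
print(weights[2]) # the same row after TF-IDF weighting
```

In this toy corpus the word "now" appears in every message, so its TF-IDF weight is zero, while "cash" keeps a positive weight in the third message because it is repeated there and absent from one of the others.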
Model Training
Algorithms like Naïve Bayes, logistic regression, and support vector machines (SVM) are trained on these features to learn the distinctions between scam and legitimate messages.
A) Naïve Bayes is a probabilistic algorithm based on Bayes' Theorem. It assumes that, given the class of a message, the words in it occur independently of one another, hence ‘naïve’. The probability of a message being a scam is calculated using the probabilities of each word appearing in scam versus legitimate messages. This algorithm is simple, fast, and effective for text classification tasks like spam detection.
B) Logistic regression is a statistical method that predicts the probability of a message belonging to one of two classes, such as scam or legitimate. It takes the numerical representation of a message, such as word frequencies or TF-IDF values. Each feature, such as the frequency of a word, is multiplied by a weight that the algorithm learns during training, and the results are added together. To convert this weighted sum into a probability, logistic regression applies a sigmoid function. If the probability is close to 1, the model predicts the message is a scam. If it’s closer to 0, it predicts the message is legitimate.
C) Support vector machine (SVM) is a machine learning algorithm used for classification tasks. It works by finding the best boundary, called a hyperplane, that separates different classes of data, such as scam and legitimate messages. Each message is treated as a point whose position is determined by numerical features such as word frequencies or TF-IDF scores. SVM then finds the line (in 2D), plane (in 3D), or hyperplane (in higher dimensions) that divides scam messages from legitimate ones while maximising the margin, that is, the distance between the boundary and the nearest messages of each class.
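As an illustration of the first of these algorithms, the sketch below implements a small multinomial Naïve Bayes classifier from scratch. It is a toy under stated assumptions: the four training messages are invented, add-one (Laplace) smoothing is used so that unseen words do not produce zero probabilities, and the function names are hypothetical. Logistic regression and SVM would be trained on the same kind of features but learn their decision rule differently.

```python
import math
from collections import Counter


def train_naive_bayes(messages, labels):
    """Count words per class; messages are token lists, labels are strings."""
    word_counts = {}                # label -> Counter of word occurrences
    class_counts = Counter(labels)  # label -> number of training messages
    vocab = set()
    for tokens, label in zip(messages, labels):
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict(tokens, model):
    """Return the label with the highest log-posterior score."""
    word_counts, class_counts, vocab = model
    total_msgs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_msgs in class_counts.items():
        # Log prior: how common this class was in the training data.
        score = math.log(n_msgs / total_msgs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Add-one smoothing keeps unseen words from zeroing the probability.
            p = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data, for illustration only.
train_msgs = [
    ["win", "cash", "prize"],
    ["claim", "prize", "now"],
    ["meeting", "at", "noon"],
    ["lunch", "at", "noon"],
]
train_labels = ["scam", "scam", "legit", "legit"]
model = train_naive_bayes(train_msgs, train_labels)

print(predict(["claim", "cash"], model))     # → scam
print(predict(["lunch", "meeting"], model))  # → legit
```

Working with log-probabilities rather than multiplying raw probabilities avoids numerical underflow on longer messages.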
Evaluation and Deployment

Once trained, the model is tested on unseen data to measure its accuracy, precision, and recall. After ensuring reliable performance, it can be deployed to classify incoming messages in real time, helping users identify potential scams.
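The three metrics can be computed directly from a model's predictions on held-out messages. The sketch below treats ‘scam’ as the positive class; the labels and predictions are invented for illustration and do not come from any real system.

```python
def evaluate(y_true, y_pred, positive="scam"):
    """Compute accuracy, precision, and recall for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # scams caught
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false alarms
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # scams missed
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Invented held-out labels and predictions, for illustration only.
y_true = ["scam", "scam", "scam", "legit", "legit", "legit", "legit", "legit"]
y_pred = ["scam", "scam", "legit", "scam", "legit", "legit", "legit", "legit"]

metrics = evaluate(y_true, y_pred)
print(metrics)
```

Precision answers "of the messages flagged as scams, how many really were?", while recall answers "of the real scams, how many were flagged?". Because scam messages are usually a minority of traffic, accuracy alone can look high even when many scams slip through, which is why all three are reported.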

Safaricom, Kenya’s leading mobile network operator, has adopted AI-powered solutions to enhance fraud detection and prevention. According to their newsroom, AI systems analyse patterns in messaging and user behaviour to identify anomalies indicative of scam attempts. Features used to detect fraudulent messages before they reach users could include transaction frequency, text content, the sender's history, and the language used. As would be expected, algorithms are trained on historical data to predict and block scams in real time. These systems improve over time by learning from new scams, so they remain effective as fraud tactics evolve.

Conclusion

As the digital landscape becomes more intricate and interconnected, the importance of safeguarding personal data and financial transactions cannot be overstated. Scam SMS attacks are not just technical challenges but social and economic threats. Stories like Aisha’s serve as stark reminders of the real-world impact these scams have on individuals, families, and businesses. These scams exploit people's trust and often leave lasting consequences, both financially and psychologically. As technology evolves, so must our defences, ensuring a safer digital ecosystem for all.

The adoption of AI and machine learning by companies like Safaricom is a positive step forward, but continued innovation and vigilance are required to stay one step ahead of the scammers. While machine learning offers powerful tools for detecting and preventing these fraud attempts, it is not a silver bullet. Effective scam prevention requires a multi-faceted approach, incorporating not only advanced technology but also public awareness campaigns, user education, encouragement to report suspicious activity and collaboration among mobile service providers, regulatory bodies, and law enforcement. Through these combined efforts, we can protect vulnerable individuals from falling prey to fraud, while creating a more resilient system that adapts and responds to the evolving threat landscape.

JEPA Africa https://www.jepaafrica.com
