DOI : 10.17577/IJERTCONV14IS010033
- Open Access

- Authors : Lathashree, Mr. Sunith Kumar T
- Paper ID : IJERTCONV14IS010033
- Volume & Issue : Volume 14, Issue 01, Techprints 9.0
- Published (First Online) : 01-03-2026
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
Multimodal Spam Detection: Combining Text, Image, and Audio Analysis
Lathashree, Mr. Sunith Kumar T
PG Student, St. Joseph Engineering College, Mangalore
Assistant Professor, St. Joseph Engineering College, Mangalore
Abstract – As digital communication channels continue to evolve, the risk of spam spreading through formats such as text messages, emails, images, and audio recordings has increased significantly. To detect spam in text, image, and audio content, this system uses machine learning pipelines dedicated to each input type. Each content type passes through its own conversion step: raw text is analyzed directly, images pass through Tesseract optical character recognition (OCR) to reveal embedded text, and audio recordings are transcribed into written form by speech recognition. A supervised Naive Bayes classifier trained on labeled data then separates spam from legitimate content with reasonable accuracy. The system also connects to Gmail to filter email in real time and reports performance figures that help trace faults.
Keywords: Multimodal anti-spam, machine learning, OCR, speech-to-text, real-world classification, Flask, Gmail filtering.
INTRODUCTION
With the rapid proliferation of digital communication platforms, spam has evolved beyond traditional text messages and emails to include image-based advertisements, phishing voice calls, and multimedia content. Spam not only clutters communication channels but also opens the door to scams, identity theft, and the unintentional spread of false information. Spam filters based exclusively on text analysis typically miss deceptive messages that evade them by carrying their content in non-text parts such as pictures, sound, or other media. Consequently, there is a clear need for an intelligent detection system that accepts various types of content and identifies spam correctly in each of them. To that end, this study proposes an extended spam detection methodology that uses machine learning, speech recognition, and image processing to detect spam in text, image, and audio messages. Unlike systems that work on a single channel, this solution integrates text, image, and audio streams to enhance reliability and reduce errors. Moreover, the system includes a real-time Gmail interface and is implemented as a Flask-based web application, which gives end users and system administrators simple tools to monitor spam levels, retrain models, and view predictive analytics. Accepting different input forms helps the system identify spam more reliably even as the nature of spam content continues to evolve.
LITERATURE REVIEW
Recent years have seen a notable rise in the use of deep learning and combined data processing methods to improve the accuracy of spam detection across multiple content formats. Zhang et al. [1] introduced MMTD, a multilingual and multimodal detection framework that applies transformer-based models to both textual inputs and document-like images. Their approach removed the need for OCR and reached 99.8% accuracy on a dataset that included content in various languages. In a related study, Lee and Kim [2] proposed converting spam messages into two-dimensional Unicode images and applying convolutional neural networks for classification. This method, which rendered spam content as images in both color and black-and-white variations, proved highly effective, especially when applied to messages in different scripts and languages. Ahmed et al. [3] reviewed a wide range of models used for spam detection, starting with classical techniques such as Naive Bayes and Support Vector Machines and extending to recent advances in deep learning. Their study also drew attention to major difficulties in detecting spam on platforms such as email and IoT devices, including uneven data distribution, time-sensitive processing, and the demand for systems that can evolve over time. Further extending the multimodal perspective, Zhang et al. [4] developed a fusion-based model that processes both textual and image data by extracting visual features with CNNs and representing text with CBOW embeddings. These components were then classified using SVM and Random Forest, showing improved performance over earlier fusion techniques. Collectively, these studies underscore the growing relevance of integrating multiple data types, including textual content, image features, and metadata, to enhance the adaptability and precision of spam detection systems across languages and platforms.
METHODOLOGY
Traditional Spam Filtering Architecture
Fig. 1. Traditional spam filtering architecture using layered mechanisms such as content filters, blacklists, and header analysis.
Traditional spam filtering systems typically rely on a layered approach, incorporating mechanisms such as content filters, header inspection, blacklists, and challenge-response protocols. These filters are primarily rule-based and focus on analyzing the textual content of messages. As shown in Figure 1, the traditional architecture consists of multiple filtering stages, each contributing to the detection of unwanted messages using specific rules and known indicators. While effective for detecting simple spam, this type of system often fails when dealing with more complex or obfuscated spam, such as that hidden within images or audio recordings.
To overcome these limitations, the proposed system adopts a multimodal spam detection framework capable of handling input from text, image, and audio sources. The trained model analyzes the processed data to automatically distinguish spam from legitimate content. Textual data, including user inputs, email bodies, and batch CSV entries, is first preprocessed through cleaning steps such as tokenization, stop-word filtering, and feature conversion before being analyzed.
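The cleaning steps named above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the authors' code: the stop-word list, the function names `preprocess` and `tfidf_vectors`, and the smoothed IDF formula are assumptions, and TF-IDF is used for the feature conversion because the Implementation section names it.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (an assumption; the paper does not give its own).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "you", "your"}


def preprocess(text):
    """Lowercase the text, split it into alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using a smoothed TF-IDF."""
    tokenized = [preprocess(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


if __name__ == "__main__":
    print(preprocess("Claim your FREE prize now!"))
    print(tfidf_vectors(["free prize", "free lunch"]))
```

Terms that occur in every document receive the lowest weight, so words shared by spam and legitimate messages contribute less to the features than distinctive ones.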
For image inputs, the system uses OCR to identify and retrieve textual content hidden within the visuals. Once extracted, the text is passed through the same classification process used for regular text inputs. Audio messages are first converted into written text using speech recognition tools and then processed with the same filtering methods applied to ordinary text messages. The platform connects to Gmail and scans each incoming email as it arrives to protect against unwanted content. The system's effectiveness is assessed regularly by reviewing both false positives and missed spam, which allows its accuracy to be refined over time. Developed using Python's Flask framework, the platform provides user-friendly dashboards for both users and admins, including features for system retraining, account management, and monitoring spam activity trends. Whether the task is to check a single suspicious message or to analyze entire contact lists in bulk, the system handles it through one interface. The system is built to be adaptable, allowing it to respond to emerging spam strategies and to support additional content types as requirements evolve.
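The routing described in this section, where each modality is first reduced to text and then handed to one shared classifier, can be sketched as follows. The names `make_detector`, `classify`, `ocr`, and `transcribe` are hypothetical, and the OCR and speech recognition steps are injected as plain callables so the sketch runs without external tools; in a real deployment they could wrap a Tesseract binding and a speech recognition client.

```python
def make_detector(classify, ocr, transcribe):
    """Build detect(kind, payload), which funnels every modality into one classifier.

    classify:   callable taking text and returning a label (hypothetical interface)
    ocr:        callable taking image data and returning the embedded text
    transcribe: callable taking audio data and returning the spoken text
    """
    extractors = {
        "text": lambda payload: payload,  # raw text needs no conversion
        "image": ocr,
        "audio": transcribe,
    }

    def detect(kind, payload):
        if kind not in extractors:
            raise ValueError(f"unsupported input type: {kind}")
        text = extractors[kind](payload)
        return classify(text)

    return detect


if __name__ == "__main__":
    # Stand-in components, for illustration only.
    detect = make_detector(
        classify=lambda text: "spam" if "prize" in text.lower() else "ham",
        ocr=lambda image_bytes: "WIN A PRIZE TODAY",
        transcribe=lambda audio_bytes: "see you at noon",
    )
    print(detect("image", b"fake image bytes"))
    print(detect("audio", b"fake audio bytes"))
```

Keeping the extractors separate from the classifier means a new content type only needs a new text extractor, which matches the adaptability goal stated above.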
RESULT AND DISCUSSION
The proposed multimodal spam detection system was evaluated using a variety of test inputs: text messages, image files containing embedded text, audio recordings, and real-time emails fetched from Gmail. The system demonstrated strong performance across all input types. The system's text classification module, trained on labeled examples of spam and non-spam messages, achieved over 98% accuracy with few misclassification errors. These results highlight the model's reliability in detecting spam in both emails and messaging platforms. For image-based spam, the integration of Tesseract OCR successfully extracted hidden textual content, enabling the classifier to detect obfuscated or promotional spam. Similarly, the speech-to-text module accurately transcribed audio content and achieved high classification accuracy when tested on spam voice recordings such as robocalls and scam messages. In terms of usability and system performance, the Flask-based web application provided a responsive and user-friendly interface for both end users and administrators. The admin dashboard offered effective tools for retraining the model, analyzing detection trends, and visualizing results through spam analytics charts. Batch CSV uploads and real-time Gmail integration further showcased the system's scalability and real-world applicability. The unified approach to handling multimodal inputs in a single framework proved to be both efficient and robust. These outcomes suggest that combining input types such as written messages, images, and voice data helps the system detect spam more accurately and adapt more easily to diverse real-world communication formats.
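The paper reports accuracy and misclassification errors without showing how they are computed. A minimal sketch of one common way to derive such figures from predicted and true labels follows; the function name `evaluate`, the label strings "spam" and "ham", and the sample data are assumptions for illustration and are not the paper's results.

```python
def evaluate(y_true, y_pred, positive="spam"):
    """Count confusion-matrix cells and derive accuracy, precision, and recall."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1  # legitimate message flagged as spam
        else:
            if truth == positive:
                fn += 1  # spam that slipped through
            else:
                tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positives": fp,
        "false_negatives": fn,
    }


if __name__ == "__main__":
    # Made-up labels, used only to exercise the function.
    truth = ["spam", "spam", "ham", "ham", "ham"]
    predicted = ["spam", "ham", "ham", "spam", "ham"]
    print(evaluate(truth, predicted))
```

Reporting false positives and false negatives alongside accuracy matters for spam filtering because the two error types have different costs: a blocked legitimate email is usually worse than a delivered spam message.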
IMPLEMENTATION
The multimodal spam detection system was implemented using Python and the Flask web framework to offer a responsive, user-friendly interface. The platform is built using a modular approach, where text, image, and audio inputs are managed independently before being funneled through a common classification process. Text data is first cleaned by breaking it into individual terms, removing common or irrelevant words, and converting it into a machine-learning-friendly format using TF-IDF encoding. When dealing with images, the system applies Tesseract OCR to identify and extract embedded text, which is then processed in the same manner as standard textual data. Similarly, audio files are transcribed into text using Google's Speech Recognition API and processed accordingly. A key feature of the platform is its interactive dashboard, which enables users to detect spam through manual text input, by uploading images, audio files, or CSVs, and by fetching emails directly from Gmail. The dashboard includes pie charts that visually represent the proportion of spam versus non-spam messages, helping administrators quickly understand detection outcomes and system behavior patterns. With its flexible architecture, real-time input handling, and continuous performance tracking, the system is built to perform efficiently in fast-moving, real-world scenarios.
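The common classification step can be illustrated with a small Naive Bayes classifier written from scratch. This is a sketch under stated assumptions and not the authors' implementation: it assumes the multinomial variant with Laplace smoothing, it works on raw word counts instead of TF-IDF weights to stay short, and the class name `NaiveBayesSpamFilter` and the training sentences are invented for the example.

```python
import math
import re
from collections import Counter, defaultdict


class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes with Laplace smoothing over word counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing strength
        self.class_docs = Counter()              # label -> number of training documents
        self.term_counts = defaultdict(Counter)  # label -> term -> count
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_docs[label] += 1
            for term in self.tokenize(text):
                self.term_counts[label][term] += 1
                self.vocabulary.add(term)
        return self

    def predict(self, text):
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            denom = sum(self.term_counts[label].values()) + self.alpha * vocab_size
            for term in self.tokenize(text):
                if term in self.vocabulary:  # ignore words never seen in training
                    score += math.log((self.term_counts[label][term] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    model = NaiveBayesSpamFilter().fit(
        ["win a free prize now", "free cash offer claim now",
         "meeting at noon tomorrow", "lunch with the team tomorrow"],
        ["spam", "spam", "ham", "ham"],
    )
    print(model.predict("claim your free prize"))
    print(model.predict("team meeting tomorrow"))
```

Because text recovered by OCR or transcription is an ordinary string, the same `predict` call can serve all three input types, which is the design the paragraph above describes.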
CONCLUSION
This research presents a unified spam detection approach designed to handle the increasing variety of spam content found in text, image-based, and audio messages. By integrating machine learning with Optical Character Recognition (OCR) and speech-to-text technologies, the system detects spam in messages that traditional text-only filters may overlook. The system uses the same Naive Bayes model to analyze text, image, and audio inputs, while added components such as Gmail integration, bulk predictions via CSV files, and a web-accessible dashboard make it practical and user-friendly. Experimental results confirmed the system's robustness, with high accuracy in classifying spam across all input types. The system is built using Flask and follows a component-based structure, making it adaptable to different use cases and easy to connect with existing messaging or email services. This project demonstrates how analyzing various types of input, such as text, images, and audio, can lead to more reliable spam detection and opens up possibilities for future improvements such as real-time processing and deep learning integration.
REFERENCES
[1] Z. Zhang, Z. Deng, W. Zhang, and L. Bu, "MMTD: A Multilingual and Multimodal Spam Detection Model Combining Text and Document Images," *Applied Sciences*, vol. 13, no. 21, pp. 1-18, 2023.
[2] J.-S. Lee and H. Kim, "Visualization Technology and Deep-Learning for Multilingual Spam Message Detection," *Electronics*, vol. 12, no. 3, pp. 1-14, 2023.
[3] N. Ahmed, M. N. Uddin, A. S. M. Kayes, A. M. Al Islam, and P. Watters, "Machine Learning Techniques for Spam Detection in Email and IoT Platforms," *Security and Privacy*, vol. 2022, Article ID 1862888, pp. 1-20, 2022.
[4] Z. Zhang, E. Damiani, H. Al Hamadi, C. Y. Yeun, and F. Taher, "A Late Multi-Modal Fusion Model for Detecting Hybrid Spam E-Mail," *Journal of Computer Science and Engineering*, vol. 18, no. 1, pp. 45-57, 2023.
[5] X. Carreras and L. Marquez, "Boosting Trees for Anti-Spam Email Filtering," *arXiv preprint*, cs/0109015, 2001.
[6] B. Mehta, M. Hofmann, and W. Nejdl, "Detecting Image Spam Using Visual Features and Near Duplicate Detection," in *Proc. 17th Int'l Conf. on WWW*, 2008.
[7] J. Gomes, A. C. Moreira, and E. F. Couto, "Improving Spam Detection Based on Structural Similarity," *arXiv preprint*, cs/0504012, 2005.
[8] Wikipedia Contributors, "Image spam," *Wikipedia, The Free Encyclopedia*.
[9] S. Varadarajan, A. Velusamy, and R. Nagarajan, "Advancing Image Spam Detection: Evaluating Machine Learning Models Through Comparative Analysis," *Applied Sciences*, vol. 15, no. 11, p. 6158, 2023.
[10] Y. Li, R. Zhang, W. Rong, and X. Mi, "SpamDam: Towards Privacy-Preserving and Adversary-Resistant SMS Spam Detection," *arXiv preprint*, arXiv:2404.09481, Apr. 2024.
