
AI-Based Multilingual Fake News Detection Using Machine Learning Algorithms

DOI : 10.5281/zenodo.20541588


Mohit Sinha

Computer Science Engineering Galgotias University Greater Noida, India

Ranjan Sah

Computer Science Engineering Galgotias University Greater Noida, India

Ms. Shweta Mayor Sabharwal

Assistant Professor Galgotias University Greater Noida, India

Abstract – The rapid growth of digital and social media has redefined how information is shared and has also enabled the mass spread of false news. Misleading content significantly impacts public opinion, social harmony, and decision-making processes. This paper proposes an AI-based fake news detection system that uses machine learning and natural language processing (NLP) techniques to classify news articles. Five supervised learning algorithms, namely Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, and Support Vector Machine (SVM), were implemented and evaluated on a labeled dataset. Text preprocessing techniques such as tokenization, stopword removal, and TF-IDF vectorization were applied to convert text into numeric form. A multilingual feature was introduced using automated language detection and translation to support non-English news, particularly Hindi. The Decision Tree achieved the highest accuracy and F1-score (99.61% and 99.58%), with SVM and Gradient Boosting close behind, indicating that classical models are effective at identifying fake news. The resulting system is scalable and language-independent, enabling it to identify misleading content across multiple online platforms.

Keywords – Fake News Detection, Machine Learning, NLP, TF-IDF, Multilingual Translation, SVM.

  1. INTRODUCTION

    The rapid expansion of social media and digital news platforms has made information easily accessible, but it has also accelerated the spread of fake news [1]. False information can influence elections, fuel social unrest, and erode public trust in credible journalism. Because of the sheer volume and speed of online content, manual verification is no longer practical, which underlines the need for automated systems that detect falsified information.

    Machine learning techniques combined with natural language processing have proven effective in analyzing text and classifying news as real or fake [10]. Algorithms such as Logistic Regression, Random Forest, and Gradient Boosting are widely used for binary text classification. However, most existing models are limited to English-language datasets and fail to address multilingual misinformation, especially in countries like India where fake news circulates in languages such as Hindi. This research presents an AI-based multilingual fake news detection system.

    The system incorporates automatic language detection and translation to convert non-English news into English before processing. Text is preprocessed using tokenization, stopword removal, and TF-IDF vectorization, followed by classification using multiple supervised ML models. A comparative performance analysis identifies the most efficient algorithm based on accuracy, precision, recall, and F1-score.

    Key contributions of this work include:

    1. A platform that identifies misinformation in multiple languages through the use of translation software and ML methods.

    2. Comparative evaluation of five ML algorithms on a benchmark dataset.

    3. A scalable and language-independent approach suitable for real-time applications.

    Novelty. Although standard supervised classifiers and TF-IDF features are common in the literature, the contribution here lies in coupling a single, unified five-model comparison pipeline with an automatic translate-then-classify layer aimed specifically at Indian-language misinformation. Rather than treating each algorithm in isolation, the same preprocessing and feature space is shared across all five models so that their behaviour can be compared fairly, and Hindi inputs are routed through detection and translation before they reach that shared pipeline. This combination of fair multi-model evaluation and a lightweight language bridge is what separates the present system from earlier English-only studies [10], [14].

  2. LITERATURE REVIEW

    Fake news detection has attracted considerable research interest, and researchers have investigated various approaches involving linguistic analysis and artificial intelligence algorithms. Early studies were principally concerned with feature-based approaches using conventional classifiers such as Naive Bayes, Logistic Regression, and Support Vector Machines (SVM). These models relied on bag-of-words, TF-IDF, and n-gram features to classify news as real or fake [10]. Rashlin et al. and Shu et al. showed that TF-IDF combined with Logistic Regression yields competitive accuracy on English datasets [2]. However, these approaches tended to miss deeper linguistic context.

    To overcome this limitation, ensemble learning methods such as Random Forest and Gradient Boosting were introduced. Studies by Potthast et al. and Ahmed et al. [4] indicated that ensemble models outperform single classifiers because of their improved generalization capability. These models reduce overfitting and are effective across a variety of datasets, which makes them well suited to fake news detection tasks.

    SVMs were preferred because of their effectiveness in high-dimensional text spaces. Research by Pe 4zos et al. [6] reported that SVM with linear kernels, when used together with TF-IDF features, achieves high accuracy and F1-scores. Most of these approaches, however, were evaluated mainly on English-language datasets, which limits their applicability in multilingual settings.

    More recent work has turned to multilingual and cross-lingual fake news detection. Alam et al. and Bozarth and Budak suggested translation-based systems for non-English content, although problems remain in translation accuracy and preservation of meaning [2], [3]. Deep learning models such as LSTM, CNN, and BERT have also been applied and offer better contextual understanding, but they are time-consuming to train and require large computational resources and large datasets [7], [8].

    The literature shows that traditional ML algorithms with efficient preprocessing can learn quickly at a low computational cost. However, multilingual flexibility remains an important open issue. This study addresses it by combining automated language detection and translation with ML to build a scalable, language-independent fake news detector [5].

    Recent work continues to confirm this trade-off. Studies published in 2024 and 2025 report that classical models built on careful preprocessing and TF-IDF features remain competitive with far heavier architectures while staying interpretable and cheap to run [10], [14]. At the same time, transformer-based detectors such as BERT and its hybrids have pushed accuracy higher on English benchmarks [11], [12], and dedicated Hindi and Indic-language datasets have begun to appear, underlining how little of the misinformation problem is covered by English-only systems [14], [15]. The present study is positioned within this recent line of work by keeping a classical, low-cost core and adding a translation bridge for Hindi.

  3. METHODOLOGY / SYSTEM ARCHITECTURE

    1. System Overview

      The proposed system detects fake news by processing textual data using natural language processing (NLP) and machine learning (ML) techniques. The workflow consists of five sequential phases: data collection, data cleaning, feature extraction, model training, and prediction. A multilingual layer is integrated to translate non-English input into English before classification.

    2. System Architecture

      The architecture comprises the following modules:

      1. Data Collection: Collections of genuine and fabricated news articles (Fake.csv and True.csv) are combined into a single dataset.

      2. Preprocessing: Text is converted to lowercase and cleaned by removing punctuation, URLs, numbers, special characters, and stopwords.

      3. Feature Extraction: The cleaned text is converted into numerical features suitable for model training through TF-IDF vectorization.

      4. Model Training: Five supervised algorithms (Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, and Support Vector Machine) are trained and assessed based on accuracy, precision, recall, and F1-score.

      5. Multilingual Processing: The system detects the input language using langdetect. If the text is non-English (e.g., Hindi), it is translated into English using googletrans before classification.
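The preprocessing module described above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the authors' function: the name `clean_text`, the tiny `STOPWORDS` set, and the regular expressions are assumptions made for the example, and a real run would use a full stopword list.

```python
import re

# Illustrative stopword subset (an assumption); a full list would be used in practice.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "in", "on", "of", "to", "and", "at"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # punctuation, digits, special characters


def clean_text(text: str) -> str:
    """Lowercase the text, then strip URLs, non-letter characters, and stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # remove links before touching punctuation
    text = NON_LETTER_RE.sub(" ", text)  # keep only lowercase letters and whitespace
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


sample = "BREAKING: The Moon is made of cheese!!! Read at https://example.com/news 2024"
print(clean_text(sample))
```

Because non-English input is translated to English before this step, restricting the output to the letters a to z is a reasonable simplification here; it would not be appropriate if text were cleaned in its original script.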

    3. Workflow Summary

      TABLE I
      Processes Involved in Fake News Detection

      | Process | Description |
      |---|---|
      | Data Input | Fake.csv and True.csv merged into a unified dataset. |
      | Preprocessing | Removal of noise, punctuation, URLs, numbers and stopwords. |
      | Feature Extraction | TF-IDF vectorizer converts text to numerical vectors. |
      | Model Training | LR, DT, RF, GB, and SVM algorithms trained and compared. |
      | Multilingual Layer | Detects language, translates to English, classifies. |
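To make the feature extraction step concrete, the sketch below computes TF-IDF weights by hand for three toy documents. It uses the textbook definitions tf = count / document length and idf = ln(N / df), which is an assumption for illustration: scikit-learn's `TfidfVectorizer` applies a smoothed idf and L2 normalization by default, so its numbers differ. The function name `tfidf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, vectors) with tf = count/len(doc) and idf = ln(N/df).

    Assumes every document is non-empty after cleaning.
    """
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([
            (counts[term] / len(doc)) * math.log(n_docs / df[term])
            for term in vocab
        ])
    return vocab, vectors


docs = ["moon made cheese", "moon landing confirmed", "cheese prices rise"]
vocab, vectors = tfidf(docs)
for term, weight in zip(vocab, vectors[0]):
    print(f"{term:10s} {weight:.3f}")
```

Terms that occur in fewer documents receive a larger weight, which is why "made" outweighs "moon" and "cheese" in the first document.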

    4. Algorithms Used

      • Logistic Regression: A linear classifier that predicts probabilities using a sigmoid function; used as a baseline model. The probability is given by:

        $$h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}} \tag{1}$$

      • Decision Tree: Splits features into hierarchical nodes but is prone to overfitting with complex data.

      • Random Forest: An ensemble of many decision trees whose individual outputs are combined by majority voting; it lowers the overfitting of a single tree and improves stability across varied data.

      • Gradient Boosting: Sequentially builds weak learners to minimize classification error.

      • Support Vector Machine (SVM): Constructs an optimal hyperplane in high-dimensional space; provides high accuracy on textual data.
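Two of the ideas in this list are small enough to show directly: the sigmoid of Eq. (1) used by Logistic Regression, and the majority vote that combines the trees of a Random Forest. The sketch below is illustrative only. The weight vector `theta`, the feature vector `x`, and the "fake"/"real" label strings are made-up values, not parameters learned in this study.

```python
import math
from collections import Counter


def sigmoid_probability(theta, x):
    """Eq. (1): h_theta(x) = 1 / (1 + exp(-theta^T x))."""
    z = sum(t * xi for t, xi in zip(theta, x))  # dot product theta^T x
    return 1.0 / (1.0 + math.exp(-z))


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical two-feature example: z = 0.8 * 1.0 + (-1.2) * 0.5 = 0.2
probability = sigmoid_probability([0.8, -1.2], [1.0, 0.5])
print(round(probability, 3))

# Hypothetical votes from five trees
print(majority_vote(["fake", "real", "fake", "fake", "real"]))
```

A probability above 0.5 corresponds to a positive value of the linear score, which is how the sigmoid output is usually turned into a class label.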

      Fig. 1. Data input and preprocessing pipeline.

      Fig. 2. Model development, assessment, and inference pipeline.

    5. Language Translation Layer

      To handle multilingual input, the system integrates automatic language detection. If the language is not English, the content is translated into English before passing through the preprocessing and prediction pipeline. This ensures language-independent classification.

      In practice this layer works in three steps. First, langdetect inspects the raw input and returns a language code. Second, if that code is not English, googletrans forwards the text to a translation service and returns an English version. Third, the translated text is cleaned and vectorized exactly like a native English article, so a single trained model serves every language. The practical effect is visible in manual testing: a Hindi weather headline written in Devanagari was detected, translated, and then classified correctly as real news, matching the label it would have received in English. Because the classifier only ever sees English features, accuracy on translated Hindi inputs stays close to the English results, with the main source of error being translation quality rather than the model itself.

      Fig. 3. Multilingual support module.
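The three-step routing can be sketched as a single function that receives its detector, translator, and classifier as arguments. This keeps the example runnable offline: the `toy_*` functions, the sample headline, and its translation are hypothetical stand-ins, not the langdetect or googletrans APIs and not the trained models. In a real deployment the first two callables would wrap those libraries.

```python
def classify_multilingual(text, detect_language, translate_to_english, classify_english):
    """Detect the language, translate non-English text, then classify the English text."""
    language = detect_language(text)
    if language != "en":
        text = translate_to_english(text)
    return classify_english(text)


# --- Hypothetical stand-ins so the sketch runs without network access ---
HINDI_HEADLINE = "आज दिल्ली में बारिश होगी"
TOY_TRANSLATIONS = {HINDI_HEADLINE: "It will rain in Delhi today"}


def toy_detect(text):
    # Treat any Devanagari character (U+0900 to U+097F) as a sign of Hindi.
    return "hi" if any("\u0900" <= ch <= "\u097f" for ch in text) else "en"


def toy_translate(text):
    # Lookup table in place of a translation service.
    return TOY_TRANSLATIONS[text]


def toy_classify(text):
    # Keyword rule in place of the trained TF-IDF classifier.
    return "fake" if "miracle cure" in text.lower() else "real"


print(classify_multilingual(HINDI_HEADLINE, toy_detect, toy_translate, toy_classify))
```

Passing the three components in as functions mirrors the design described above: the classifier never needs to know which language the input started in.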

  4. IMPLEMENTATION AND RESULTS

    1. Implementation Environment

      The system was developed using Python 3.8 in the Anaconda environment. Jupyter Notebook (.ipynb) was used for coding and testing. Key libraries include Pandas, NumPy, Scikit-learn, Matplotlib, langdetect, and googletrans. The Fake News dataset was sourced from Kaggle and consists of two labeled files: Fake.csv and True.csv [9]. The hardware configuration used for implementation included an Intel i5 processor, 8 GB RAM, and the Windows 10 (64-bit) operating system.

    2. Execution Workflow

      The implementation followed a structured pipeline:

      1. Data Collection and Merging: The Fake and True news datasets were combined into a single DataFrame and shuffled to remove ordering bias.

      2. Text Preprocessing: A custom function was developed to convert text to lowercase and remove URLs, punctuation, digits, HTML tags, stopwords, and special characters.

      3. Feature Extraction: The cleaned text was transformed using the TF-IDF vectorizer to create high-dimensional feature vectors suitable for machine learning.

      4. Model Training and Testing: The dataset was split into 75% training and 25% testing sets. Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, and Support Vector Machine models were then trained and evaluated.

      5. Multilingual Input Testing: For user inputs, the system detects the language of the text. If the input is not in English (e.g., Hindi), it is translated using googletrans, processed, and classified by the trained models.
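Steps 1 and 4 of the workflow, merging the two labelled sources, shuffling, and holding out 25% for testing, can be illustrated without any external library. The helper name `shuffle_and_split`, the fixed seed, and the placeholder rows are assumptions for the example; with scikit-learn the same split is usually done with `train_test_split(..., test_size=0.25)`.

```python
import random


def shuffle_and_split(records, test_fraction=0.25, seed=42):
    """Shuffle labelled records, then split them into train and test lists."""
    rng = random.Random(seed)   # fixed seed (an assumption) for repeatable splits
    shuffled = list(records)    # copy so the caller's list is not mutated
    rng.shuffle(shuffled)       # removes the fake-then-true ordering
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder rows standing in for the contents of Fake.csv and True.csv
fake_rows = [(f"fake article {i}", "fake") for i in range(6)]
true_rows = [(f"true article {i}", "real") for i in range(6)]

train, test = shuffle_and_split(fake_rows + true_rows)
print(len(train), len(test))
```

Shuffling before the split matters because the merged data starts with all fake articles followed by all true ones; an unshuffled 75/25 cut would leave the test set with a single class.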

    3. Performance Evaluation

      TABLE II
      Performance Comparison of Machine Learning Algorithms

      | Algorithm | Acc. (%) | Prec. (%) | Rec. (%) | F1 (%) |
      |---|---|---|---|---|
      | Logistic Regression | 98.84 | 99.0 | 99.0 | 98.77 |
      | Decision Tree | 99.61 | 99.9 | 99.9 | 99.58 |
      | Random Forest | 98.99 | 99.0 | 99.0 | 98.93 |
      | Gradient Boosting | 99.54 | 99.8 | 99.9 | 99.51 |
      | Support Vector Machine (SVM) | 99.55 | 100.0 | 99.0 | 99.52 |

    4. Result Analysis

      Among all models, the Decision Tree achieved the best accuracy of 99.61%. However, both SVM and Gradient Boosting provided more stable performance in terms of accuracy and F1-score, which makes them better suited for deployment. The inclusion of the multilingual layer ensured that the system could accurately classify both English and translated Hindi news inputs, demonstrating its multilingual functionality.

      Fig. 4. Accuracy and F1-score comparison across the five models.
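For readers less familiar with the four metrics reported in Table II, the sketch below shows how they follow from a binary confusion matrix. The counts used here are invented purely for illustration and are not taken from the experiments; the function name `classification_metrics` is likewise an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    Assumes at least one predicted positive and one actual positive,
    so no denominator is zero.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of items flagged fake, how many were fake
    recall = tp / (tp + fn)      # of truly fake items, how many were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Hypothetical counts for 200 test articles, treating "fake" as the positive class
accuracy, precision, recall, f1 = classification_metrics(tp=95, fp=5, fn=3, tn=97)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

F1 is the harmonic mean of precision and recall, so it drops sharply when either one is low, which is why it is reported alongside accuracy.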

    5. Comparison with Deep-Learning Approaches

      To place these results in context, Table III lists accuracies reported by recent transformer- and deep-learning-based detectors. It should be read with care, since those figures come from different datasets and splits and are not directly comparable; the table is meant only to show the broad performance band. The comparison indicates that the proposed classical pipeline reaches accuracy in the same range as BERT-based systems on similar Kaggle-style data, while needing far less computation, memory, and training data. Transformer models retain a clear advantage in capturing context and sarcasm, which is where the present system is weakest, but for a lightweight, interpretable, and multilingual deployment the classical approach remains a reasonable choice.

      TABLE III
      Indicative Comparison with Deep-Learning Models

      | Model (Reference) | Acc. (%) | Notes |
      |---|---|---|
      | BERT on Kaggle dataset [11] | 99.23 | Deep contextual model; high compute. |
      | GBERT (GPT + BERT) [12] | 95.30 | Hybrid transformer; large training data. |
      | BERT-CNN / BERT-LSTM fusion [13] | 96.20 | Blended deep model. |
      | Proposed (Decision Tree) | 99.61 | Classical ML + TF-IDF; low compute. |
      | Proposed (SVM) | 99.55 | Classical ML + TF-IDF; stable, low compute. |

  5. BENEFIT TO SOCIETY

    Beyond its technical results, the system is meant to serve a practical social purpose. Misinformation spreads fastest in regional languages, where fewer automated checks exist, so a tool that accepts Hindi as well as English can help ordinary readers, students, and small newsrooms verify a claim before sharing it. Fact-checking organisations and social-media moderators could use the same pipeline to triage large volumes of posts and flag suspicious items for human review, reducing the manual effort needed during elections, health emergencies, and other periods when false information is most damaging. Because the model is light enough to run on a basic computer, it is also suitable for schools and community groups that want to teach digital literacy without expensive infrastructure. In this way the work contributes, in a small but concrete manner, to safer and more trustworthy online information for multilingual audiences.

  6. CONCLUSION AND FUTURE WORK

The proposed multilingual fake news detection system successfully identified and categorized news articles as real or fake using machine learning methods. By integrating natural language processing (NLP) with Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, and Support Vector Machine (SVM) algorithms, the system achieved high accuracy on the benchmark dataset. The preprocessing pipeline, a combination of text normalization, stopword removal, and TF-IDF vectorization, proved essential for improving feature quality and classification performance.

Among the algorithms tested, the Decision Tree and SVM achieved the highest accuracy (about 99.6%), demonstrating the capability of classical supervised models when suitably optimized. The addition of a multilingual translation layer, built on automatic language detection and translation APIs, made the model applicable to non-English data, so that it can handle languages such as Hindi with minimal loss in performance. This flexibility positions the system as a language-independent and scalable tool for detecting fake news online.

Although the system delivers good results, it could be improved further with larger datasets and deep learning models such as BERT, LSTM, or other transformer-based models, which capture the semantic and contextual nuances of text. Integrating real-time data collection through web scraping and API-based fact-checking modules would improve the practical usefulness of the system. Future studies may also consider hybrid ensemble methods that combine traditional ML with neural networks to enhance generalization across other languages and domains.

In summary, this paper demonstrates that combining machine learning with multilingual preprocessing offers a practical and effective method of detecting fake news. With further improvements in real-time analytics and contextual understanding, such systems may play a crucial role in reducing misinformation and promoting trustworthy information across global media environments.

Acknowledgment

The authors express their sincere gratitude to Galgotias University, Department of CSE, for providing the technical resources, finance, and advice needed to complete this research project. Special thanks go to the project supervisor and faculty members for their valuable contribution and support to the work titled "AI-Based Multilingual Fake News Detection Using Machine Learning Algorithms". The authors are also grateful to the open-source developer community and the providers of publicly available datasets that contributed to the successful completion of this research.

References

  1. H. F. Villela, F. Correa, J. S. de A. N. Ribeiro, A. Rabelo and D. B. Carvalho, Fake news detection: a systematic literature review of machine learning algorithms and datasets, J. Interactive Systems, vol. 14, no. 1, 2023.

  2. S. Han, Cross-lingual transfer learning for fake news detector in a low-resource language, arXiv preprint arXiv:2208.12482, 2022.

  3. Multilingual deep learning framework for fake news detection using 11 relational variables like sentiment, realities, or data, Knowledge and Information Systems, 2023.

  4. R. C. Thompson, S. Joseph, and T. T. Adeliyi, Machine learning methods for false news identification, Information, vol. 16, no. 3, 2023.

  5. M. Rani and C. Virmani, Detection of fake news on social media: a review, Proc. ICICC 2022, 8 pp., 2022.

  6. N. N. Prachi, M. H. Rafi, E. Alam, R. Khan and others, Identification of false information via machine learning and natural language processing algorithms, J. Advances in Information Technology, vol. 13, no. 6, Dec.

  7. M. Hoy and T. Koulouri, A systematic review on the detection of fake news articles, arXiv preprint arXiv:2110.11240, Oct. 2021.

  8. False Information Detection: An In-depth Assessment of Theoretical and Practical Approaches, Technologies, vol. 12, no. 11, 2024.

  9. https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset

  10. R. C. Thompson, S. Joseph and T. T. Adeliyi, Machine Learning Strategies for Fake News Detection, Information, vol. 16, no. 3, art. 189, 2025.

  11. Fake News Detection Using Machine Learning and Deep Learning Algorithms: A Comprehensive Review, Computers, vol. 14, no. 9, art. 394, 2025.

  12. P. K. Mishra et al., GBERT: A hybrid deep learning model based on GPT-BERT for fake news detection, Heliyon, vol. 10, 2024.

  13. A. O. Balogun et al., BERT-based blended approach for fake news detection, J. Big Data and Artificial Intelligence, vol. 2, no. 1, pp. 7-15, 2024.

  14. S. Garg and D. K. Sharma, Fake news detection in the Hindi language using multi-modality via transfer and ensemble learning, Internet Technology Letters, 2024.

  15. S. Bansal et al., MMCFND: Multimodal Multilingual Caption-aware Fake News Detection for Low-resource Indic Languages, arXiv preprint arXiv:2410.10407, 2024.