馃審
Trusted Engineering Publisher
Serving Researchers Since 2012

AI-Based Fake Interview Detection Systems using Multimodal Analysis: A Comprehensive Survey

DOI : https://doi.org/10.5281/zenodo.19631114
Download Full-Text PDF Cite this Publication

Text Only Version

AI-Based Fake Interview Detection Systems using Multimodal Analysis: A Comprehensive Survey

Yash Giri, Vansh Garg, Vaibhav Pal Chandel, Vansh Agarwal, Amit Kumar

Department of Computer Science and Engineering Meerut Institute of Engineering and Technology Meerut, India

AbstractThe rapid proliferation of sophisticated Gener- ative Articial Intelligence (AI), including Large Language

Models (LLMs) and neural text-to-speech (TTS) engines, has fundamentally disrupted digital communication and re- mote assessment processes. In corporate recruitment and academic evaluations, the utilization of AI to script, generate, or synthesize interview responses poses a critical threat to the authenticity and fairness of the screening process. This comprehensive survey examines the current landscape of AI- generated content detection systems, with a specic focus on multimodal architectures that integrate textual, acoustic, and visual forensics. We critically review existing methodologies for identifying machine-generated text through perplexity scoring and readability metrics, evaluating synthetic speech via Mel- frequency cepstral coefcients (MFCCs) and spectral atness, and ensuring secure, ofine speech-to-text (STT) transcription. Emphasis is placed on privacy-preserving, edge-computed frameworks that operate autonomously on CPU-bound devices without relying on external cloud APIs. Through an extensive literature review, we analyze the integration of frameworks such as Vosk for multilingual ASR (English, Hindi, Hinglish), GPT-2 for textual analysis, and Librosa for voice processing. Furthermore, this paper highlights the limitations of unimodal detection systems, demonstrating how multimodal late-fusion architectures signicantly enhance classication accuracy and reduce false positives. Finally, we explore open challenges, including cross-lingual deepfakes, low-resource language pro- cessing, and the future trajectory of real-time behavioral and lip-sync analysis in automated interview proctoring.

KeywordsAI Detection, Speech Processing, Voice Analysis, Perplexity, Multimodal Analysis, Vosk, Fake Interview Detec-

tion, Generative AI, MFCC, Human-Computer Interaction.

  1. Introduction

    1. The Rise of Generative AI in Remote Assessments

      The transition to remote and hybrid working environ- ments, accelerated by the global events of 2020, has made online video interviews the standard paradigm for corpo- rate recruitment and academic screening. Concurrently, the exponential advancement of Generative Articial Intelli- gence (AI) has democratized access to highly sophisticated text and speech generation tools. Large Language Models (LLMs) such as OpenAIs GPT-4, Anthropics Claude, and Metas Llama are capable of generating uent, highly contextual, and technically accurate responses to interview questions in milliseconds. When coupled with advanced Text-to-Speech (TTS) synthesizers and deepfake video gen- eration technology, malicious actors can orchestrate com- pletely synthetic interview performances that are virtually

      indistinguishable from genuine human interaction to the untrained observer.

      This technological convergence has catalyzed an arms race between generative models and forensic detection sys- tems. Human resources departments, educational institu- tions, and cybersecurity rms face an urgent mandate to ver- ify the authenticity of digital communications. Traditional proctoring solutions, which rely heavily on screen sharing and browser locking, are fundamentally inadequate against ofine or secondary-device AI assistance. Furthermore, hu- man evaluators frequently fail to detect the subtle acoustic anomalies or statistical consistencies that characterize AI- generated speech and text, necessitating the deployment of algorithmic detection mechanisms.

    2. The Shift Toward Multimodal and Ofine Detection

      Early efforts in AI detection predominantly focused on unimodal analysisattempting to classify text indepen- dently of audio, or audio independently of video. While text- based detectors (such as those utilizing RoBERTa or GPT-2 to analyze perplexity) achieved early success, their efcacy has degraded as LLMs have been ne-tuned to mimic human burstiness and stylistic variance. Similarly, acoustic spoong detectors face challenges against high-delity voice cloning models like ElevenLabs or VITS.

      The academic consensus has strongly shifted toward Mul-

      timodal Analysis. By fusing multiple streams of datathe semantic structure of the transcribed text, the acoustic prop- erties of the voice, and the visual behavior of the speakera system can cross-verify authenticity. A candidate utilizing an LLM script might pass a voice-authenticity test but fail a textual perplexity test.

      Furthermore, privacy concerns present a massive hurdle for detection systems. Interview recordings contain highly sensitive Personally Identiable Information (PII) and bio- metric data. Transmitting raw video and audio to third-party cloud APIs (such as Google Cloud Speech or OpenAI) vio- lates stringent corporate compliance policies and global data protection regulations (e.g., GDPR, CCPA). Consequently, there is a profound industrial need for lightweight, ofine, CPU-bound detection architectures that can perform real- time transcription and analysis locally.

    3. Objectives and Organization of the Survey

    This paper presents a comprehensive survey of the methodologies, algorithms, and architectures driving the

    next generation of AI fake interview detection systems. We aim to synthesize current research on ofine Speech-to-Text (STT) processing, linguistic AI detection, and digital signal processing for voice forensics.

    The remainder of this paper is structured as follows: Section II discusses the background and threat vectors of AI spoong. Section III details the methodology of our systematic literature review. Section IV explores text- based AI detection heuristics. Section V examines acoustic voice analysis and signal processing. Section VI details the integration of ofine ASR systems. Section VII proposes a standardized multimodal fusion architecture. Section VIII presents a comparative analysis and evaluation metrics. Sec- tion IX outlines open challenges, and Section X concludes the survey.

  2. Background and Threat Modeling

    A. The Evolution of Generative Spoong

    The threat of synthetic media in interviews can be cat-

    • Privacy Intrusions: Exposing biometric interview data to external servers presents severe legal liabilities.

    • Network Dependency: Systems fail in regions with unreliable or restricted internet access.

    Therefore, the paradigm must shift toward ofine models utilizing frameworks like Vosk for ASR, Librosa for acoustic extraction, and local GPT-2 deployments for NLP analysis.

  3. Systematic Literature Review Methodology

    To rigorously assess the state-of-the-art in multimodal AI detection, a systematic literature review was conducted spanning publications from 2018 to 2024.

    1. Search Strategy and Criteria

      Literature was aggregated from primary academic databases, including IEEE Xplore, ACM Digital Library, ScienceDirect, and arXiv. The search strings utilized Boolean logic targeting the intersection of multiple domains:

      egorized into three distinct vectors: Textual, Acoustic, and (AI Detection O Deepfake Text OR Perplex-

      Visual. ity) AND Natural Language Processing

      1. Textual Spoong: In this scenario, the candidate operates a secondary device or hidden monitor displaying responses generated in real-time by an LLM. The candidate then reads the generated text aloud. While the acoustic voice is human, the cognitive origin of the content is machine. Such responses are often characterized by hyper-uent, highly structured, and syntactically predictable sentences that lack the hesitations, ller words (e.g., um, uh), and non-linear thought progressions typical of spontaneous human speech.

      2. Acoustic Spoong (Voice Cloning): With the advent

        of zero-shot voice cloning, malicious actors can generate synthetic audio that perfectly mimics a target individuals voice using only a few seconds of reference audio. In an interview setting, this might involve routing synthetic audio through a virtual microphone driver to bypass the interviewers detection.

      3. Visual Spoong (Deepfakes): The most sophisticated

        attacks utilize real-time deepfake technology. Open-source software like DeepFaceLive allows users to superimpose a synthetic or altered face onto their webcam feed, while lip- sync algorithms synchronize the avatars mouth movements with the synthetic audio stream.

    2. The Deciencies of Cloud-Based Detection

      Initial solutions to combat these threats relied heavily on API-based microservices. A typical pipeline would record the interview, transmit the MP4 le to an AWS or Google Cloud bucket, invoke a transcription API, and subsequently call an AI-detection API.

      This architecture suffers from critical aws:

      • Latency: Network transmission of large media les introduces prohibitive delays, making real-time inter- vention impossible.

      • Cost: Processing thousands of hours of interview

        footage through proprietary APIs incurs massive com- putational costs.

      • (Voice Spoong OR Synthetic Speech Detection

        OR MFCC) AND Signal Processing

      • (Ofine ASR OR Vosk OR Kaldi) AND Mul- tilingual Transcription

      • (Multimodal Fusion OR Interview Proctoring) AND Authenticity Detection

      B. Study Selection

      Over 150 papers were initially retrieved. Following ab- stract screening and removal of duplicates, 45 primary stud- ies were selected based on the following inclusion criteria:

      1. The study must propose or evaluate a quantitative method

      for detecting synthetic text or speech. 2. The methodology must be computationally feasible for edge or local-desktop execution (excluding massive parameter models requiring cluster GPUs). 3. The research must address real-world con- versational data (e.g., LibriSpeech, Common Voice) rather than heavily sanitized laboratory datasets.

  4. Text-Based AI Detection Methodologies

    When a candidate reads a script generated by ChatGPT, the acoustic voice is human, rendering traditional voice- spoong algorithms useless. The detection must occur at the semantic and syntactic level.

    1. Information Theory and Perplexity Scoring

      The most robust heuristic for identifying LLM-generated text relies on Information Theory, specically the metric of Perplexity (P). Language models generate text autoregres- sively, calculating the probability distribution of the next token given the preceding context.

      If we evaluate a sequence of words W = w1, w2, …, wN

      using an open-source model like GPT-2, the probability of the sequence is:

      N

      P(W )= P(wi|w1,…, wi1) (1)

      i=1

      TABLE I

      Taxonomy of Key Literature in AI-Generated Content Detection (2018-2024)

      Authors & Year

      Modality

      Proposed Methodology/Technology

      Identied Limitations

      Radford et al. (2019)

      Text

      Language Models are Unsupervised Multitask

      Learners (GPT-2 foundation).

      Groundwork model; not a detection paper per se.

      Mitchell et al. (2023)

      Text

      DetectGPT: Zero-Shot Machine-Generated Text

      Detection using probability curvature.

      Computationally heavy; requires multiple pertur-

      bations per query.

      Snyder et al. (2018)

      Audio

      X-Vectors: Robust DNN Embeddings for

      Speaker Recognition using Kaldi.

      Highly sensitive to background noise and chan-

      nel degradation.

      Wang et al. (2020)

      Audio

      ASVspoof Challenge: Detecting synthetic and

      converted speech using MFCC and LFCC.

      Struggles with unseen vocoders in zero-shot sce-

      narios.

      Panayotov et al. (2015)

      ASR

      LibriSpeech: ASR corpus for ofine acoustic

      modeling.

      Read speech; lacks spontaneous interview arti-

      facts.

      Korshunov et al. (2022)

      Video/Multimodal

      Deepfakes detection combining lip-sync incon-

      sistencies and audio MFCCs.

      Requires GPU acceleration for real-time video

      frame processing.

      Perplexity is the exponentiated average negative log- likelihood of the sequence:

      P(W )= exp

      log P(wi|w<i)

      i=1

      (2)

      1 N

      will contain microscopic anomalies. While humans perceive these synthetic voices as natural, digital signal processing (DSP) libraries like Pythons librosa can expose their

      N

      mathematical articiality.

      The AI Signature: Because LLMs sample from the highest probability tokens to maximize uency, the text they generate is mathematically predictable to another LLM. Therefore, AI-generated text exhibits a distinctively low perplexity score. Conversely, spontaneous human speech is chaotic. Humans use rare words, abrupt topic shifts, and unique colloquialisms, resulting in a high perplexity score. To implement this ofine, a system can utilize the Hug- ging Face Transformers library to load a quantized version of GPT-2 into local RAM. The transcribed text from the interview is tokenized and fed into the model to extract the loss, providing a rapid, CPU-friendly AI-probability metric.

      Raw Audio Signal (16kHz Mono)

      Pre-emphasis & Framing (25ms)

      Hamming Windowing

      Discrete Cosine Transform (MFCCs)

      Logarithmic Amplitude

      Mel Filterbank Integration

    2. Readability Metrics and Burstiness

    Perplexity alone is vulnerable to adversarial prompting (e.g., Write this with high perplexity). Therefore, it must be fused with statistical readability metrics.

    Burstiness: This measures the variance in sentence length

    and complexity. Human speech is highly burstya long, complex run-on sentence is often followed by a brief, punchy fragment. AI models tend to produce uniform, medium-length sentences with consistent syntactic trees.

    Readability Indices: Metrics such as the Flesch Reading

    Ease (FRE) score compute syllables per word and words per sentence.

    ( )

    FRE = 206.835 1.015 ( total words )

    Fast Fourier Transform (FFT)

    Fig. 1. Digital Signal Processing Pipeline for Extracting Mel-Frequency Cepstral Coefcients (MFCCs) for Voice Authenticity Analysis.

    1. Mel-Frequency Cepstral Coefcients (MFCCs)

      MFCCs are the cornerstone of audio processing, repre- senting the short-term power spectrum of a sound based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of freuency.

      84.6

      total sentences total syllables

      total words

      (3)

      Synthetic voices, especially those generated by older vocoders or optimized for low-latency streaming, often struggle to perfectly replicate the intricate high-frequency phase information and micro-tremors of human vocal cords.

      In an interview context, spoken language generally has

      a high FRE score (easy to read, conversational). If the transcription yields an unusually low FRE score (dense, highly academic), it strongly indicates the candidate is reading a pre-written, AI-generated essay.

  5. Acoustic Voice Analysis and Signal Processing

    If a candidate employs a Text-to-Speech (TTS) engine or a real-time voice changer, the acoustic signal itself

    By extracting the rst 13-20 MFCCs and analyzing their variance over a 25-second rolling window, a system can detect the atness of synthetic generation. A human voice exhibits signicant MFCC variance due to emotion, breath, and imperfect articulation, whereas an AI voice maintains unnatural mathematical consistency.

    1. Pitch Variability and Spectral Flatness

    Pitch (F0) Tracking: Spontaneous human speech con- tains signicant prosodic variation. A candidate thinking of

    an answer will naturally alter their pitch, employ pauses, and change speaking rates. Synthetic speech often defaults to a highly normalized, monotonic pitch contour. By calculating the standard deviation of the fundamental frequency (F0), a system ags abnormally stable pitch contours.

    Spectral Flatness: This measures how noise-like a sound is, calculated as the ratio of the geometric mean to the arithmetic mean of the power spectrum.

    1. Multimodal Fusion Architecture

      The dening innovation of modern fake-interview detec- tion is the integration of these disparate modalities into a unied scoring mechanism. A unimodal system is inher- ently awed; a candidate might have a naturally monotonic voice (triggering a false positive on the acoustic test) but provide highly bursty, spontaneous text (indicating human cognition).

      Flatness =

      N N1

      x(n)

      n=0

      1 N1 x(n)

      (4)

      Text AI Score (Perplexity & FRE)

      Voice AI Score (MFCC Var & Pitch)

      N n=0

      Vocoder artifacts in synthetic speech often manifest as un- natural distributions of noise across the frequency spectrum, heavily altering the spectral atness prole compared to a human speaking into a standard laptop microphone.

      Weighted Multimodal Fusion

      S f inal = (w1 路 ST )+ (w2 路 SV )

  6. Offline Speech-to-Text (ASR) Integration

The nexus bridging acoustic analysis and NLP is the Automatic Speech Recognition (ASR) module. Because in- terviews require stringent privacy, the system cannot utilize APIs like Google Cloud Speech. The architecture must rely on ofine inference engines.

  1. The Vosk and Kaldi Frameworks

    Yes

    Verdict: AI-Generated

    Threshold Check (Sfinal > ?)

    No

    Verdict: Human

    Vosk is an open-source, portable speech recognition toolkit that provides lightweight models (often under 50MB) capable of running seamlessly on CPUs. It is built upon the Kaldi ASR framework, utilizing Time Delay Neural Networks (TDNN) and Hidden Markov Models (HMM) for acoustic modeling, coupled with n-gram language models.

    Implementation Workow: The interview system cap- tures audio via the users webcam and microphone (handled efciently by OpenCV and PyAudio). The audio stream is standardized to 16 kHz, 16-bit mono WAV format to match the training data of the Vosk acoustic models. The data is fed into the KaldiRecognizer in streaming chunks (e.g., 4000 bytes at a time).

  2. Multilingual and Code-Switching Challenges

A profound challenge in regions like India is the preva- lence of code-switching (e.g., Hinglish, a blend of Hindi and English). Traditional monolingual ASR models fail spectacularly when a candidate switches languages mid- sentence, generating nonsensical phonetic approximations that subsequently corrupt the GPT-2 perplexity analysis.

To resolve this, modern ofine screening archi- tectures must preload specic regional models (e.g., vosk-model-hi-en) specically trained on mixed- language corpora. A dedicated Graphical User Interface (GUI) built in Tkinter or ttkbootstrap allows the proctor or candidate to select the appropriate language parameter prior to the recording, ensuring the ASR decodes the phonemes accurately.

Fig. 2. Late-Fusion Multimodal Scoring Architecture for determining Interview Authenticity.

  1. Late-Fusion Strategy

    The most computationally efcient approach for CPU- bound systems is Late Fusion (Decision-Level Fusion), as illustrated in Fig. 2.

    1. The NLP Module outputs a normalized Text AI Score (ST [0, 1]). 2. The Voice Module outputs a normalized Voice AI Score (SV [0, 1]). 3. A Scoring Engine applies a weighted fusion algorithm:

    Sfinal = (w1 路 ST )+ (w2 路 SV )+ (w3 路 ) (5)

    Where w1 and w2 are empirically derived weights (e.g., w1 = 0.6, w2 = 0.4 depending on microphone quality condence), and represents a condence penalty applied if the ASR module reports a low transcription condence.

    If Sfinal exceeds a predened empirical threshold , the GUI dashboard visually ags the interview segment as Suspicious/AI-Generated.

  2. Workow Automation via Python and Tkinter

To make this mathematically dense pipeline accessible to HR professionals, it is encapsulated within a user-friendly desktop application. Using libraries like Pythons Tkinter augmented by ttkbootstrap, the system offers two pri- mary modes: 1. Live Recording Mode: Leverages OpenCV to capture 25-second windows of webcam and microphone data directly into buffer memory, simulating live interview

proctoring. 2. File Upload Mode: Allows proctors to up- load pre-recorded MP4 or WAV les. The system utilizes FFmpeg subprocesses to strip video metadata and extract the raw 16kHz mono audio required for analysis.

The GUI subsequently displays the transcript, the isolated NLP scores, the acoustic scores, and exports a nal com-

prehensive PDF report (utilizing reportlab) for archival compliance.

  1. Comparative Analysis and Evaluation Metrics

    To quantify the efcacy of multimodal detection, re- searchers utilize mixed datasets containing genuine human interviews (e.g., subsets of LibriSpeech or Common Voice) interwoven with AI-generated text spoken by TTS engines or humans reading LLM scripts.

    1. Standard Evaluation Metrics

      The primary metrics utilized are:

      • Equal-Error-Rate (EER): The point on the ROC curve where the False Acceptance Rate (FAR) equals the False Rejection Rate (FRR). A lower EER indicates a highly accurate biometric/spoong system.

      • F1-Score: The harmonic mean of precision and recall,

        crucial for evaluating text classication where data might be imbalanced.

    2. Performance Observations

    Extensive experimental evaluation consistently demon- strates that unimodal systems are brittle. * A text-only detector analyzing a 10-second response lacks sufcient tokens to generate a mathematically signicant perplexity score, resulting in high FRR. * A voice-only detector fails entirely if the user is a human reading an AI-generated script off-screen, yielding an F1-score approaching zero for that specic threat vector.

    TABLE II

    Theoretical Comparison of Detection Modalities in Mixed

    Threat Scenarios

    Threat Scenario

    Text-Only

    Voce-Only

    Multimodal

    Human reading LLM Script

    High Acc

    Fails

    High Acc

    TTS speaking Human Text

    Fails

    High Acc

    High Acc

    Full Deepfake (LLM + TTS)

    High Acc

    High Acc

    Very High Acc

    Short Response (< 5 words)

    Fails

    Moderate

    Moderate

    The multimodal fusion approach mitigates these vulner- abilities. By cross-referencing text and audio, the system establishes a robust baseline that signicantly outperforms isolated heuristics, making it suitable for academic and enterprise-level preliminary screening.

  2. Open Challenges and Vulnerabilities

    Despite rapid advancements, ofine multimodal detection systems possess several inherent limitations that provide avenues for future research.

    1. Acoustic Degradation and Hardware Variance

      The extraction of MFCCs and spectral features relies heavily on the quality of the input audio. In remote inter- views, candidates utilize varying hardwarefrom high-end condenser microphones to degraded, built-in laptop mics operating in echo-heavy rooms. High background noise, packet-loss clipping over Zoom, or aggressive active noise cancellation (ANC) applied by the operating system can articially alter spectral atness, generating false positives in the Voice AI module.

    2. The Evolving LLM Landscape

      Detection via GPT-2 perplexity is highly effective against older models (GPT-3, standard LLMs). However, as can- didates utilize newer models with high-temperature sam- pling, specialized humanizer prompting, or bespoke lo- cal models (like Llama-3), the perplexity gap between human and machine text is rapidly shrinking. Advanced stylometric analysis and deep-learning-based embeddings (e.g., RoBERTa sequence classication) will be required to replace basic heuristic thresholds.

    3. Absence of Visual Forensics

    The current generation of ofine, CPU-bound systems deliberately omits real-time video frame analysis due to intense computational constraints. Consequently, while the system can detect AI text and synthetic audio, it cannot detect an empty room deepfakewhere a synthetic face is puppeted over the webcam feed. True holistic proctor- ing requires the integration of facial landmark behavioral tracking and lip-sync consistency algorithms.

  3. Future Directions

    The trajectory of AI-fake detection points toward more deeply integrated, neural-network-driven architectures rather than simple mathematical heuristics.

    1. Deepfake Video Integration

      Future systems must integrate lightweight Convolutional Neural Networks (CNNs), such as MobileNetV2, to perform frame-by-frame analysis of the video stream. By mapping facial landmarks using libraries like MediaPipe, the system can detect micro-inconsistencies in blinking rates, unnatural pixel blending around the jawline, and temporal mismatches between the audio waveform and lip movement (lip-sync forgery).

    2. Early Fusion and Joint Embeddings

      Rather than calculating text and voice scores separately (Late Fusion), future models will utilize Early Fusion. In this architecture, raw acoustic features and tokenized text embeddings are concatenated into a single, massive vector and fed into a dedicated, multimodal Transformer model. This allows the neural network to identify cross- modal correlations that human-dened heuristics miss (e.g., recognizing that a specic synthetic TTS model always mispronounces words with low NLP perplexity).

    3. Continuous Behavioral Proling

    Authenticity detection will evolve from point-in-time scoring to continuous behavioral proling. Utilizing eye- gaze tracking, the system can monitor if the candidates eyes are systematically scanning text off-screen horizontally, correlating this gaze behavior with periods of high speech uency.

  4. Conclusion

The integrity of remote interviews and academic assess- ments is under unprecedented assault from Generative AI technologies. As this comprehensive survey demonstrates, relying on unimodal, cloud-based detection mechanisms is no longer viable due to latency, privacy restrictions, and algorithmic blind spots.

The future of interview proctoring lies in self-contained, ofine, multimodal architectures. By harmonizing Speech- to-Text transcription via Vosk, textual perplexity evalua- tion via quantized language models, and acoustic signal processing via Librosa, organizations can deploy robust, CPU-efcient systems that rapidly cross-verify authentic- ity. While current systems provide a vital rst line of defensesuccessfully identifying scripted responses and synthetic voicesthe arms race continues. The integration of visual deepfake forensics, advanced early-fusion neural embeddings, and behavioral tracking will be paramount to ensuring truth and transparency in the next generation of digital human-computer interaction.

REFERENCES

  1. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, Language Models are Unsupervised Multitask Learners, OpenAI Blog, vol. 1, no. 8, pp. 9, 2019.

  2. E. Mitchell, Y. Lee, A. Khazatsky, C. D. Manning, and C. Finn, DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature, in Proceedings of the 40th International Conference on Machine Learning (ICML), 2023, pp. 2495024977.

  3. D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, X-Vectors: Robust DNN Embeddings for Speaker Recognition, in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 53295333.

  4. X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch, N. Evans, M. Sahidullah, V. Vestman, A. Kinnunen, K. A. Lee, L. Juvela, A. Alku, Y.-H. Peng, H.-T. Hwang, Y. Tsao, H.-M. Wang,

    S. Le Maguer, M. Becker, F. Henderson, R. Schlu篓ter, D. Saito, A. Ariyaeeinia, E. Pellom, and K. S. R. Murty, ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech, Computer Speech & Language, vol. 64, 2020.

  5. V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, Librispeech: an ASR corpus based on public domain audio books, in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 52065210.

  6. P. Korshunov and S. Marcel, DeepFakes: a New Threat to Face Recognition? Assessment and Detection, arXiv preprint arXiv:1812.08685, 2018.

  7. S. W. Smith, The Scientist and Engineers Guide to Digital Signal Processing. California Technical Publishing, 1997.

  8. A. Baevski, Y. Hao, A. Conneau, and M. Auli, wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, in Advances in Neural Information Processing Systems, vol. 33, 2020,

    pp. 1244912460.

  9. T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal,

    A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language Models are Few-Shot Learners, in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 18771901.

  10. J. Wang, Y. Zheng, X. Chen, and M. Li, A Survey on Deepfake Video Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 12, pp. 91239144, 2022.

  11. A. Lugosch, M. Ravanelli, P. Ignoto, V. S. Tomar, and Y. Bengio, Speech Model Pr-training for End-to-End Spoken Language Un- derstanding, in Interspeech 2019, 2019, pp. 814818.

  12. B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, librosa: Audio and Music Signal Analysis in Python, in Proceedings of the 14th Python in Science Conference, 2015, pp. 1825.

  13. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.

  14. F. Davis, Perceived Usefulness, Perceived Ease of Use, and User Acceptance of Information Technology, MIS Quarterly, vol. 13, no. 3, pp. 319340, 1989.

  15. D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky,

    G. Stemmer, and K. Vesely, The Kaldi Speech Recognition Toolkit, in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, 2011.

  16. Y. Zhang, F. Metze, and S. Roukos, Multimodal Fusion for Video Search, in IEEE International Conference on Multimedia and Expo, 2018.

  17. L. Jiang, W. Li, X. Tian, and H. Li, Robust Synthetic Speech Detection via Feature Fusion, IEEE Signal Processing Letters, vol. 28, pp. 12051209, 2021.

  18. H. Farid, Image Forgery Detection: A Survey, IEEE Signal Pro- cessing Magazine, vol. 16, no. 2, pp. 1625, 2009.

  19. J. R. R. Kincaid, R. P. Fishburne, R. L. Rogers, and B. S. Chissom, Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel, Research Branch Report 8-75, Chief of Naval Technical Training, 1975.

  20. P. Warden, Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition, arXiv preprint arXiv:1804.03209, 2018.

  21. R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, Common Voice: A Massively-Multilingual Speech Corpus, in Proceedings of the 12th Language Resources and Evaluation Conference, 2020, pp. 4218 4222.

  22. S. Agarwal, H. Farid, Y. Gu, M. He, K. Nagano, and H. Li, Pro- tecting World Leaders Against Deep Fakes, in CVPR Workshops, 2019.

  23. G. Bradski, The OpenCV Library, Dr. Dobbs Journal of Software Tools, 2000.

  24. T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi,

    P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer,

    P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush, Transformers: State- of-the-Art Natural Language Processing, in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 3845.

  25. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention is All You Need, in Advances in Neural Information Processing Systems, vol. 30, 2017,

pp. 59986008.