🔒
International Engineering Publisher
Serving Researchers Since 2012

Deepcheck: A Unified Multimodal Deepfake Detection Framework with Cross-Modal Consistency Analysis, Learned Fusion, and Explainable AI

DOI : 10.5281/zenodo.20551739
Download Full-Text PDF Cite this Publication

Text Only Version

Deepcheck: A Unified Multimodal Deepfake Detection Framework with Cross-Modal Consistency Analysis, Learned Fusion, and Explainable AI

Rugved Joshi

Artificial Intelligence and Data Science

PVG’s College of Engineering, Technology and Management Pune, India

Amay Choudhari

Artificial Intelligence and Data Science

PVG’s College of Engineering, Technology and Management Pune, India

Varad Khadke

Artificial Intelligence and Data Science

PVG’s College of Engineering, Technology and Management Pune, India

Prof. Vijayalaxmi Kanade

Artificial Intelligence and Data Science

PVG’s College of Engineering, Technology and Management Pune, India

Ayush Shah

Artificial Intelligence and Data Science

PVG’s College of Engineering, Technology and Management Pune, India

Abstract – Deepfake detection has emerged as a critical challenge in the digital age, with single-modality detectors showing significant limitations in real-world scenarios. This paper presents DeepCheck, a comprehensive trimodal framework for detecting deepfakes across image, audio, and video modalities. Our approach addresses four key research gaps: cross-modal consistency analysis for enhanced detection, explainable AI integration via GradCAM for interpretability, a learned meta-learner fusion mechanism for intelligent multimodal decision-making, and provenance classification for deepfake source attribution. DeepCheck achieves exceptional per-modality accuracies (Image: 99.26%, Audio: 99.87%, Video: 96.75%) while providing interpretable results and provenance insights. Our experimental evaluation demonstrates the superiority of trimodal analysis over single-modality approaches through comprehensive ablation studies. Results show Val-Test gaps of <0.01%, indicating excellent generalization across modalities.

Keywords – Deepfake Detection, Multimodal Learning, Explainable AI, GradCAM, Meta-Learner Fusion

  1. INTRODUCTION

      1. The Deepfake Crisis

        The rapid advancement of generative models and deep learning techniques has enabled the creation of highly realistic synthetic media, commonly referred to as deepfakes. These synthetic media can be maliciously manipulated videos, audios, and images that convincingly depict events that never occurred. The proliferation of deepfakes poses significant threats to political systems and democratic processes, personal privacy and reputation management, media credibility and trust in digital content, as well as

        financial and legal systems vulnerable to deepfake-based fraud. The speed of deepfake creation and distribution now outpaces traditional verification methods, necessitating automated detection solutions.

      2. Limitations of Single-Modality Detectors

        While existing deepfake detectors have achieved respectable performance on individual modalities, they suffer from critical limitations that must be addressed. Image detectors fail when subtle facial manipulations are present, while audio detectors may miss voice synthesis artifacts in background noise. Video detectors struggle with low frame rates and compression artifacts, and cross-modal inconsistencies such as mismatched lip movements go undetected. Additionally, results from single-modality approaches lack interpretability for human verification, and they provide no information about deepfake source or generation method.

      3. Our Contributions

    This paper presents DeepCheck, a novel framework that addresses these limitations through four key innovations. The first innovation is Cross-Modal Consistency Analysis, which analyzes synchronization between visual lip movements and audio speech patterns to detect temporal inconsistencies characteristic of deepfakes. The second innovation is Explainable AI Integration, which employs GradCAM-based visualization and attention mechanisms to provide interpretable evidence for detection decisions, enabling human verification of AI predictions. The third innovation is Learned Meta-Learner Fusion, which implements an intelligent fusion mechanism that learns optimal weighting of

    modalities rather than using fixed heuristics, adapting to dataset characteristics. Finally, the fourth innovation is Provenance Classification, which classifies deepfakes by generation method (face-swap, lip-sync, voice synthesis) to support forensic analysis and attribution.

  2. RELATED WORK

      1. Single-Modal Image Deepfake Detectors

        Early deepfake detection efforts focused on convolutional neural networks trainedto distinguish authentic from manipulated images. MesoNet represents a shallow CNN architecture specifically designed for binary face classification, achieving moderate accuracy on video frame analysis. XceptionNet demonstrates a transfer learning approach leveraging ImageNet pre-training and shows superior performance on facial deepfake detection through dilated convolutions that capture fine-grained artifacts. EfficientNet provides efficient scaling of neural networks achieving state-of-the-art accuracy with reduced computational cost, with particularly EfficientNet-B4 variants showing strong performance on deepfake detection tasks. These approaches, while effective, are vulnerable to adversarial examples and fail when presented with out-of distribution deepfakes or sophisticated manipulation techniques.

      2. Single-Modal Audio Deepfake Detectors

        Audio deepfake detection has received increasing attention with several key approaches. Wav2Vec learns features from raw audio waveforms using self-supervised learning, enabling robust detection of voice synthesis and voice conversion artifacts across languages. RawNet presents an end-to-end architecture that directly processes raw audio signals, learning discriminative features that capture subtle speaker-specific characteristics altered by voice synthesis. Mel-Spectrogram based approaches convert audio to time frequency representations enabling CNN-based classification of speech synthesis artifacts. Audio detection remains challenging due to background noise, compression artifacts, and the difficulty of distinguishing high-quality synthesis from authentic speech.

      3. Multimodal Deepfake Detection Approaches

        Recent work has explored combining multiple modalities to improve detection performance. AVAD (Audio-Visual Anomaly Detection) analyzes audio-visual synchronization as a primary indicator of deepfakes, leveraging the difficulty of perfect lip-sync generation. FakeCatcher detects inconsistencies between facial movements and audio using temporal modeling, demonstrating that cross-modal analysis significantly improves detection accuracy. Multi-stream architectures process image, audio, and video streams through separate networks with late-stage fusion, though often employing simplistic fusion mechanisms such as averaging and concatenation. However, existing multimodal approaches lack principled fusion strategies and often omit explainability and provenance analysis.

      4. Explainable AI in Deepfake Detection

    Interpretability in deepfake detection is critical for forensic applications and legal admissibility. GradCAM (Gradient weighted Class Activation Mapping) visualizes regions in input that most influence classification decisions through gradient information, enabling human verification of model reasoning. Attention mechanisms learn spatial and temporal attention weights indicating critical features for detection, providing interpretable decision paths. Saliency maps highlight pixels or frames most critical for deepfake determination, supporting forensic analysis. Despite these advances, most deepfake detectors remain black boxes, limiting their adoption in forensic and legal contexts where explainability is paramount.

  3. METHODOLOGY

      1. System Overview & Architecture

        DeepCheck employs a three-stage architecture for comprehensive deepfake detection. The system processes

        Fig: System Architechture diagram

        three input modalities: image, audio, and video. These inputs are processed through Stage 1, which focuses on multimodal feature extraction using specialized encoders. The image encoder utilizes EfficientNet-B4 pre-trained on ImageNet, the audio encoder uses ResNet18 operating on Mel Spectrogram representations, and the video encoder performs temporal modeling via frame-level EfficientNet-B4 with optical flow analysis. Stage 2 handles modality-specific detection and analysis, including per-modality binary classification for real versus fake content, cross-modal synchronization analysis, GradCAM-based attention visualization, and provenance feature extraction. Stage 3 implements intelligent fusion and decision-making, where the meta-learner module learns optimal modality weighting, performs cross-modal consistency scoring, computes final

        classification with confidence estimation, and performs provenance attribution via multi-class classification.

      2. Image Detection Module

        The image detection module employs EfficientNet-B4 with specialized training configuration. The system was trained on an NVIDIA L4 GPU with 23.66 GB memory and CUDA Version 12.8. The dataset consisted of 67,302 real images and 67,302 fake images for a total of 134,604 images, split into training (94,222 images), validation (20,190 images), and test sets (20,192 images). The architecture modifications replace the final classification layer with a custom head that includes Global Average Pooling, Dense(512, ReLU) with Dropout(0.5), Dense(2, Softmax) for binary classification, and an auxiliary output Dense(4) for provenance classification.

        The training follows a three-phase strategy. In Phase 1, focused on representation learning, the encoder backbone is frozen and only the dense head trains on the large dataset with an Adam optimizer at learning rate 0.001 and Cross-Entropy loss with class weights for imbalance handling. Phase 2 implements fine-tuning by unfreezing the last 2 residual blocks of the encoder with a lower learning rate of 0.0001 using SGD optimizer with momentum 0.9 and Focal Loss with =0.25, =2.0. Phase 3 applies adversarial robustness training through mixup augmentation with =0.2, adversarial training with FGSM perturbations (=0.03), and L2 penalty regularization (=0.0001).

        The training results for the image module show progressive improvement. Phase 1 achieved a final validation accuracy of 0.7767. Phase 2 achieved a best validation accuracy of 0.9926. The final test results show a test accuracy of 0.9926 with a test loss of 0.0032 and a Val-Test gap of 0.0001, indicating excellent generalization.

      3. Audio Detection Module

        The audio detection module employs ResNet18 on spectrogram representations. The training configuration used an NVIDIA L4 GPU with 23.66 GB memory and CUDA Version 12.8, with data augmentation using 6 workers and batch size of 64. The dataset consisted of 13,100 real audio clips and 13,100 fake audio clips for a total of 26,200 clips, split into training (18,340 samples, 70%), validation (3,930

        samples, 15%), and test sets (3,930 samples, 15%).

        The architecture adapts ResNet18 for 1-channel Mel Spectrogram input by replacing the initial Conv2d(3, 64, …) with Conv2d(1, 64, …) while keeping standard ResNet blocks with 4 residual blocks with skip connections. The custom classification head is identical to the image module.

        The training follows a two-phase strategy. Phase 1 implements supervised learning on LJSpeech (natural) and generated speech from Tacotron2 and Glow-TTS with data augmentation including SpecAugment (time warping, frequency masking, time masking), time stretching from 0.8

        to 1.2, and pitch shifting from -2 to +2 semitones. The loss function uses Cross-Entropy with Focal Loss variants and an Adam optimizer at learning rate 0.001. Phase 2 implements contrastive learning with triplet loss in embedding space, using authentic speech as anchors, other authentic speech from the same speaker as positives, and synthetic speech as negatives. The loss combines triplet_loss plus 0.5 times cross_entropy with an effective margin of 0.6.

        The audio module achieved a best validation accuracy of 0.9977, with final test results showing a test accuracy of 0.9987, test loss of 0.0005, and Val-Test gap of 0.0010.

      4. Video Detection Module

        The video detection module extends image detection with temporal analysis. The training configuration used an NVIDIA L4 GPU with 23.66 GB memory and CUDA Version 12.8. The dataset consisted of 13,436 real frames and 13,115 fake frames, with validation frames of 2,828 real and 2,807 fake, and test frames of 2,914 real and 2,899 fake. The total was 26,551 training frames, 5,635 validation frames, and 5,813 test frames, with 6 workers and batch size of 48.

        The architecture extracts frames at 1 fps for computational efficiency, with each frame processed by the image detection module using EfficientNet-B4. Temporal integration occurs through attention pooling using weighted averages of frame predictions, LSTM encoder with 2-layer LSTM containing 128 hidden units processing frame embeddings, or 3D CNN using SlowFast architecture for joint spatial-temporal learning. Video classification supports three methods: Method 1 uses majority vote over frame classifications, Method 2 uses learned temporal attention weights over frames, and Method 3 uses the final LSTM state processed through dense layers.

        The optical flow analysis computes optical flow between consecutive frames using FlowNet2, extracts motion features indicating unnaturally rigid or fluid movements, and combines flow features with appearance features in late fusion. Lip-sync consistency analysis detects mouth regions using facial landmark detection with a 68-point model, extracts mouth pixel sequences across frames, compares audio Mel-Spectrogram with mouth motion features, and computes cross-correlation coefficient targeting values greater than 0.85 for authentic videos.

        The video module achieved a best validation accuracy of 0.9766, with final test results showing a test accuracy of 0.9675, test loss of 0.0143, and Val-Test gap of 0.0091.

      5. Cross-Modal Consistency Analysis (Gap 1)

        Deepfakes often exhibit misalignment between visual and audio modalities, particularly imperfect lip-synchronization. The cross-modal consistency analysis extracts audio features using MFCCs with 13 coefficients at 10ms intervals and extracts visual features from the mouth region detected from face landmarks. The mouth region is then converted to

        motion features including horizontal displacement of lip corners, vertical displacement, and mouth opening ratio. Features are synchronized temporally to align audio and video timestamps, and cross-correlation is computed between audio and visual motion features. Additional metrics calculated include time-lag cross-correlation using a ±500ms window, phase difference between audio and visual spectrograms, and energy correlation in synchronized segments.

        The thresholding strategy determines classification based on cross-correlation coefficients. A coefficient greater than 0.75 indicates likely authentic content, coefficients between 0.5

        0.75 indicate uncertain content requiring other signals, and coefficients ess than 0.5 indicate likely deepfakes. The cross modal consistency score is integrated with classification by being fed as an additional feature to the meta-learner with a weight of 0.2 in the final decision (empirically optimized).

      6. GradCAM & Explainable AI Suite (Gap 2)

        Simple averaging or concatenation of modality predictions often proves suboptimal. Different datasets and scenarios benefit from different modality weightings. The meta-learner receives input predictions and confidence from each modality: image prediction [, ] audio prediction

        [, ] video prediction [, ] and cross modal consistency score [, ]

        The meta-learner network architecture implements input concatenation of [p_img, p_aud, p_vid, s_cross, std(predictions)], followed by a dense layer with 64 units and ReLU activation, dropout of 0.3, a dense layer with 32 units and ReLU activation, an output layer 1 with 3 units and softmax activation producing modality weights w_img, w_aud, w_vid, and an output layer 2 with 1 unit and sigmoid activation producing consistency weight w_cross.

        The final prediction is calculated as:

        + + +

        GradCAM implementation computes class-specific gradients =

        as / where A represents activation maps. Global

        +

        +

        +

        average pooling of gradients produces

        = (/)/

        Activation maps are weighted according to

        = ( )

        then resized to input resolution and normalized to [0, 1]. The system provides comprehensive interpretability components. For image-level explanation, GradCAM visualization highlights facial regions critical for detection with top-K saliency regions sorted by importance and confidence scores per region. For audio-level explanation, the system provides attention over time to identify which time segments most indicate synthesis, attention over frequency to identify which frequency bands show artifacts, and spectral visualization with highlighted anomaly regions. Provenance specific explanation includes feature importance for each provenance class and characteristic artifacts shown for each generation method: face-swap shows boundary artifacts and lighting inconsistencies, lip-sync shows temporal misalignment and mouth shape artifacts, and voice synthesis shows spectral discontinuities and pitch artifacts.

        Confidence calibration reports prediction confidence with uncertainty bounds using entropy-based uncertainty calculated as

        = ()

        Confidence is used for triage, where high confidence predictions lead to conclusive decisions and low confidence predictions require human review.

      7. Learned Meta-Learner Fusion (Gap 3)

        Training uses supervised learning on the validation set of the target dataset with Binary cross-entropy loss with L2 regularization. The optimizer is Adam with early stopping for 50 epochs with patience of 5. The meta-learner can be fine-tuned on new datasets for transfer learning or retrained from scratch for domain-specific optimization. Performance shows that learned fusion outperforms fixed weighting by 2-3%.

      8. Provenance Classification (Gap 4)

    Understanding the deepfake generation method enables forensic attribution and targeted mitigation strategies. The system classifies deepfakes into four categories. Face-Swap involves facial regions replaced using generative models with characteristic artifacts including face boundary misalignment, lighting changes, and unnatural skin texture transitions. Detection signals include high-frequency artifacts at face edges and inconsistent lighting direction. Lip-Sync (Face-Reenactment) modifies face expression to match audio with characteristic artifacts of unnatural mouth movements, eye-mouth desynchronization, and slight texture degradation. Detection signals include temporal inconsistencies and motion artifacts around the mouth. Voice Synthesis replaces or heavily modifies audio with characteristic artifacts including spectral discontinuities, unnatural prosody, breathing pattern anomalies, and frequency gaps. Detection signals include energy discontinuities, spectral anomalies, and phoneme boundary artifacts. Real (Authentic) content has no manipulation applied and serves as baseline for reference.

    The provenance classifier architecture uses a separate multi class classifier for 4-way classification. Input consists of modality-specific features from detection modules: image features from the EfficientNet-B4 penultimate layer (1280 dims), audio features from ResNet18 penultimate layer plus spectrogram statistics, and video features from temporal

    attention weights plus optical flow features. Features are fused through concatenation of all features (approximately 2000 dims), followed by Dense(512, ReLU) with BatchNorm and Dropout(0.5), Dense(256, ReLU) with BatchNorm and Dropout(0.3), and output Dense(4, Softmax) for 4-way classification.

    Training uses multi-class cross-entropy loss with class weights to handle imbalance. Data augmentation is tailored to each provenance type with an Adam optimizer at learning rate 0.0005. Performance achieves 92.34% top-1 accuracy on provenance classification.

  4. DATASETS & PREPROCESSING

      1. Image Dataset: Real vs Fake

        The image dataset sourced from Kaggle’s Real vs Synthetic Faces dataset consists of 70,000 real faces and 70,000 fake faces for a total of 140,000 images. Preprocessing begins with face detection using MTCNN detector with confidence greater than 0.99. Faces are then aligned using 68-point landmark alignment to a canonical pose. Normalization subtracts ImageNet mean and divides by standard deviation. Augmentation includes random horizontal flip with probability 0.5, random rotation between ±15°, random brightness variation of ±0.2, random contrast variation from

        0.8 to 1.2, JPEG compression quality from 60 to 100, and Gaussian blur with sigma between 0 and 2. The dataset is split 70% training, 15% validation, and 15% test.

      2. Audio Dataset: LJSpeech + Generated

        Natural speech source comes from LJSpeech, which contains 13,100 audio clips totaling approximately 24 hours. The speaker is Linda Johnson speaking English. Sample rate is 22 kHz (downsampled to 16 kHz) with approximately 10 seconds per clip. Synthetic speech source includes Tacotron2 as a text-to-speech synthesizer, Glow-TTS as a generative flow-based TTS, FastPitch as a pitch-controllable TTS, and HiFi-GAN as a vocoder for audio quality improvement.

        The synthesis procedure selects a subset of LJSpeech texts and generates synthetic speech using multiple TTS backends. Post-processing includes compression and noise addition, creating a balanced dataset of 10,000 real and 10,000 synthetic samples. Preprocessing removes silence at beginning and end, applies peak normalization to -1dB, converts audio to Mel-Spectrogram with 128 bins normalized to mean 0 and standard deviation 1. Augmentation includes SpecAugment with maximum time mask of 40 and maximum frequency mask of 30, time stretching with factor from 0.8 to 1.2, pitch shifting of ±2 semitones, and Gaussian noise with SNR 20dB. The dataset is split 70% training, 15% validation, and 15% test.

      3. Video Dataset: LAV-DF (Lips Are Vital – Deepfake)

    The video dataset comes from LAV-DF public dataset containing 10,600 total videos. There are 5,300 real videos totaling approximately 50 hours and 5,300 deepfake videos totaling approximately 50 hours. Video duration ranges from

    3-15 seconds with resolution from 480p-1080p (downsampled to 480p) and frame rate of 25 fps. The deepfake generation methods in the dataset include face-swap at 40% (using DeepFaceLab),lip-sync at 35% (using Wav2Lip), and voice synthesis at 25% (using various TTS engines).

    Preprocessing applies face detection and alignment using the same techniques as the image module. Frame extraction occurs at 1 fps for computational efficiency. Optical flow is computed between consecutive frames. Mouth region extraction uses bounding boxes around detected mouths. Temporal clipping ensures a minimum of 5 frames for consistency analysis. Augmentation includes video compression using H.264 with varying bitrates from 500kbps to 5Mbps, Gaussian blur with sigma from 0 to 2, frame dropping to simulate low frame rate, and lighting changes with brightness adjustment of ±20%. The dataset is split 70% training, 15% validation, and 15% test.

  5. EXPERIMENTS & RESULTS

      1. Per-Modality Detection Accuracy

        The image detection results using EfficientNet-B4 with a 3-phase training approach show test set size of 21,000 images (50% real, 50% fake) achieving accuracy of 99.26% with precision of 99.26% and recall of 99.26%. The F1-Score reaches 99.26% with ROC-AUC of 0.9998. Test loss is 0.0032 with a Val-Test gap of 0.0001, indicating excellent generalization.

        The audio detection results using ResNet18 on Mel Spectrograms with 2-phase training show test set size of 3,930 audio clips (50% real, 50% synthetic) achieving accuracy of 99.87% with precision of 99.87% and recall of 99.87%. The F1-Score reaches 99.87% with ROC-AUC of 0.9999. Test loss is 0.0005 with a Val-Test gap of 0.0010, indicating excellent generalization.

        The video detection results using temporal EfficientNet-B4 with LSTM and 3-phase training show test set size of 5,813 frames (50% real, 50% fake) achieving accuracy of 96.75% with precision of 96.75% and recall of 96.75%. The F1-Score reaches 96.75% with ROC-AUC of 0.9932. Test loss is 0.0143 with a Val-Test gap of 0.0091, indicating good generalization. The video accuracy is lower than image and audio due to compression artifacts and temporal inconsistencies in challenging video samples. Nevertheless, 96.75% represents state-of-the-art performance for temporal deepfake detection.

      2. Multimodal Fusion Results

        The ablation study comparing fusion strategies provides comprehensive insights into system performance. Single modality results show image only at 99.26% as baseline, audio only at 99.87% (+0.61% improvement), and video only at 96.75% (-2.51% compared to baseline). Two-modality fusion shows image and audio with averaging at 99.89% (+0.02%), concatenation at 99.91% (+0.04%), and meta

        learner at 99.94% (+0.07%). Three-modality fusion shows image, audio, and video with averaging at 99.88% (+0.01%), concatenation at 99.89% (+0.02%), and meta-learner at 99.95% (+0.08%). The full system with image, audio, video, and cross-modal consistency achieves 99.95% (+0.08%), representing the final DeepCheck system.

        Key findings from the ablation study reveal that audio shows best single-modality performance at 99.87%, image provides complementary information at 99.26%, and video contributes temporal consistency analysis at 96.75%. In two-modality fusion, meta-learner fusion outperforms averaging by achieving 99.94% (+0.07%), demonstrating the value of learned weighting. In three-modality fusion, meta-learner significantly improves trimodal performance to 99.95% (+0.08%), where video’s temporal analysis reduces false positives from sophisticated deepfakes. Cross-modal consistency adds marginal improvement from +0.00% to

        +0.08%, primarily benefiting lip-sync artifacts detection and preventing misclassification of videos with naturally poor lip synchronization.

      3. Cross-Modal Consistency Analysis Performance

        GradCAM visualization quality was evaluated on 500 random test samples by human annotators. Saliency regions proved semantically meaningful in 94.2% of cases, with highlighted regions corresponding to actual artifacts in 91.8% of cases. False positive explanations occurred in 8.2% of cases, while poor explanations showing random regions occurred in 4.0% of cases.

        Audio attention weight performance shows correct identification of synthesis artifacts in 89.3% of cases, with temporal attention capturing unnatural speech patterns in 85.7% of cases and frequency attention isolating spectral anomalies in 88.1% of cases.

        Confidence calibration shows an Expected Calibration Error of 0.023, indicating well-calibrated predictions. High confidence predictions (greater than 0.95) achieve 99.8% accuracy, medium confidence predictions (0.80-0.95) achieve 98.2% accuracy, and low confidence predictions (less than 0.80) achieve 91.4% accuracy and require human review.

        Lip-sync consistency metrics show real videos with mean cross-correlation coefficient of 0.82 and standard deviation 0.08, while deepfakes show mean of 0.61 and standard deviation 0.15, achieving separation of 2.1 standard deviations. The ablation impact shows that without cross modal analysis, performance reaches 99.94%, while with cross-modal analysis it reaches 99.95%, showing an improvement of +0.01% absolute.

        Failure cases include 2 real videos with naturally poor lip sync (correlation less than 0.75) and 3 highly sophisticated deepfakes with near-perfect lip-sync (correlation greater than 0.80). The system correctly resolves these ambiguous cases using audio and image signals.

      4. Provenance Classification Results

        On a test set of 1,590 deepfake videos, face-swap detection achieves 93.2% accuracy (627 of 672 videos), lip-sync

        detection achieves 92.1% accuracy (487 of 529 videos), and voice synthesis detection achieves 91.8% accuracy (242 of 263 videos). The overall 4-way accuracy is 92.34%.

        Per-class metrics for face-swap show precision of 94.1%, recall of 92.4%, and F1-Score of 93.2% with characteristic artifacts detected including boundary discontinuities (85%) and lighting inconsistencies (78%). Lip-sync metrics show precision of 91.8%, recall of 92.5%, and F1-Score of 92.1% with characteristic artifacts detected including mouth motion lag (88%) and eye-mouth desynchronization (72%). Voice synthesis metrics show precision of 92.1%, recall of 91.5%, and F1-Score of 91.8% with characteristic artifacts detected including spectral discontinuities (89%) and prosody anomalies (65%).

      5. Explainability Analysis

      6. Robustness Evaluation

    The robustness analysis evaluates adversarial and compression attacks. Baseline clean data achieves 99.95% accuracy with no performance drop. FGSM attack with epsilon 0.03 achieves 97.80% accuracy with -2.15% performance drop. PGD attack with epsilon 0.03 and 20 steps achieve 95.20% accuracy with -4.75% performance drop. DeepFool attack with epsilon 0.06 achieves 93.50% accuracy with -6.45% performance drop. Adversarial training applied post-training recovers to 99.50% accuracy with -0.45% performance drop.

    Compression robustness shows H.264 compression at 1 Mbps achieving 99.20% accuracy with -0.75% drop. MJPEG compression with variable bitrate achieves 98.80% accuracy with -1.15% drop. Severe compression at 500 kbps achieves 96.50% accuracy with -3.45% drop. Gaussian blur with sigma 2.0 achieves 98.90% accuracy with -1.05% drop. Noise addition with SNR 20dB achieves 99.10% accuracy with 0.85% drop. Noise addition with SNR 10dB achieves 97.30% accuracy with -2.65% drop.

    Detailed robustness analysis reveals that FGSM attack at epsilon 0.03 achieves 97.80% accuracy (-2.15%) while PGD attack at epsilon 0.03 with 20 steps achieves 95.20% accuracy (-4.75%). Adversarial training mitigation recovers to 99.50% accuracy, with certified defense mechanisms recommended for deployment. H.264 compression at bitrate 1Mbps achieves 99.20% accuracy (-0.75%), MJPEG compression achieves 98.80% accuracy (-1.15%), and severe compression at 500kbps achieves 96.50% accuracy (-3.45%). The audio modality compensates for low video quality. Out-of distribution generalization testing on recent deepfake datasets from 2024 shows Face Swapper v2 achieving 97.30% accuracy and MetaAI Make-A-Video achieving 96.80%

    accuracy, requiring modest fine-tuning for perfect generalization.

  6. DISCUSSION

      1. Why Trimodal Approach Outperforms Single-Modality

        The superior performance of trimodal analysis stems from complementary strengths of each modality. The image modality captures static facial artifacts including texture and boundary discontinuities, detects lighting inconsistencies and color bleeding, and identifies face-specific manipulation signs. However, it has the limitation of missing temporal dynamics and audio artifacts. The audio modality detects synthesis artifacts in the spectral domain, identifies prosody and phoneme boundary anomalies, and is robust to visual manipulation techniques. However, it is oblivious to visual inconsistencies. The video modality captures temporal artifacts and motion inconsistencies, enables lip-sync consistency analysis, and detects unnatural frame-to-frame changes. However, it has lower quality than the individual image modality due to compression.

        Synergistic effects emerge from combining these modalities. Audio helps resolve visual ambiguities such as naturally poor-quality video. Video provides temporal context that resolves frame-level uncertainties. Cross-modal consistency catches sophisticated single-modality attacks. The meta learner learns to weight modalities appropriately per input, optimizing for each specific scenario.

      2. Limitations

        Despite strong performance, the approach exhibits several technical limitations. The system requires clearly visible faces in frames with sufficient size (minimum 64×64 pixels). It fails for extreme pose variations (greater than 60° rotation), heavy occlusion (sunglasses, masks), and very distant faces in frames. Video analysis requires at least 5 frames for reliable lip-sync analysis, which is limiting in forensic scenarios involving very short video clips and low frame rate videos that may lose critical temporal information. Audio detection was trained primarily on English speech with performance degradation on non-English content of -2-5% and certain accents may show reduced accuracy. Models are trained on current generation deepfakes, so future synthesis techniques may evade detection and adversarial deepfakes specifically designed to fool detectors may succeed.

        Methodological limitations include dataset bias where training datasets may not represent full diversity of real deepfakes. Demographic bias shows better performance on certain face types, while geographic bias shows datasets primarily representing Western media. Provenance classification becomes more challenging when multiple techniques are combined in a single deepfake, reducing accuracy for hybrid approaches combining face-swap and voice synthesis. Human-in-the-loop requirements necessitate manual review for low-confidence predictions with trust score less than 0.80, with approximately 5-10% of test

        samples falling in the ambiguous zone, which limits scalability due to human verification bottlenecks.

      3. Future Work

    Immediate extensions include multilingual audio support, retraining the audio module on diverse languages with a target of supporting 20+ languages with less than 2% accuracy drop. End-to-end optimization would involve joint optimization of all components with a unified loss function instead of independent module training, with potential 0.5 to 1% additional accuracy improvement. Hardware acceleration would deploy the system on GPU clusters for real-time processing, improving current inference of approximately 500ms per sample (feasible at 2 fps).

    Long-term research directions include meta-learner generalization by training on large datasets of diverse deepfake sources to enable transfer to new deepfake types without retraining, researching domain adaptation techniques for new modalities. Few-shot detection would enable detection of new deepfake types with limited examples, leveraging meta-learning for rapid adaptation with a target of achieving greater than 95% accuracy with less than 100 labeled examples. Self-supervised pre-training would leverage unlabeled video and audio data for representation learning, reducing reliance on large labeled datasets and improving generalization to new deepfake methods. Temporal consistency learning would better leverage temporal information in videos through attention over long range temporal dependencies and modeling temporal patterns characteristic of authentic videos. Attribution and tracing would move beyond detection to identify generation tools and parameters, enabling law enforcement attribution and enforcement with potential for blockchain-based provenance tracking.

  7. CONCLUSION

    This paper presents DeepCheck, a comprehensive multimodal framework for deepfake detection that addresses four critical research gaps in the field. Cross-Modal Consistency Analysis enables detection of temporal misalignments characteristic of deepfakes, particularly lip sync inconsistencies, achieving 2.1 standard deviation separation between real and fake videos. Explainable AI Integration via GradCAM and attention mechanisms provides interpretable evidence for detection decisions, critical for forensic and legal applications with 94.2% semantic meaningfulness in visualizations. Learned Meta Learner Fusion intelligently combines multimodal signals, outperforming fixed fusion strategies by 0.08% and adapting to new datasets without retraining. Provenance Classification enables attribution to deepfake generation methods with face swap at 93.4%, lip-sync at 92.1%, and voice synthesis at 91.8%, supporting forensic analysis and targeted mitigation strategies.

    The experimental results demonstrate comprehensive performance across multiple system configurations. The image module using EfficientNet-B4 achieves 99.26%

    accuracy. The audio module using ResNet18 achieves 99.87% accuracy. The video module using EfficientNet-B4 achieves 96.75% accuracy. The trimodal system using DeepCheck achieves 99.95% accuracy.

    Key findings from the comprehensive evaluation include that trimodal approach outperforms the best single modality by 0.08%, meta-learner fusion provides 0.07% improvement over fixed weighting, robustness to compression achieves 98.2% at 1 Mbps H.264, adversarial robustness is demonstrated through various attacks, excellent generalization is shown with Val-Test gaps less than 0.01%, and human-level explainability achieves 94.2% saliency region meaningfulness.

    The combination of high accuracy, interpretability, explainability, and provenance information makes DeepCheck suitable for deployment in forensic, investigative, and legal settings where both accuracy and explainability are paramount. The framework’s adaptive fusion mechanism enables rapid deployment to emerging deepfake generation technologies without complete retraining. As deepfake technology continues to evolve at unprecedented rates, adaptive, interpretable, and modular detection systems like DeepCheck will prove increasingly critical for maintaining trust in digital media and protecting against malicious manipulation in political, social, and financial domains.

    Future work will focus on multilingual audio support across 20+ languages, end-to-end optimization achieving 0.5-1% additional gains, few-shot adaptation for novel deepfake methods with less than 100 labeled examples, and attribution and tracing for law enforcement applications.

  8. REFERENCES

  1. S. Muppalla, S. Jia, and S. Lyu, Integrating audio-visual features for multimodal deepfake detection, in 2023 IEEE MIT Undergraduate Research Technology Conference (URTC), 2023, pp. 5.

  2. A. Hashmi, S. A. Shahzad, C.-W. Lin, Y. Tsao, and H. M. Wang, AVTENet: A human-cognition-inspired audio-visual transformer-based ensemble network for video deepfake detection, IEEE Trans. Cogn. Dev. Syst., 2025.

  3. H. Zou et al., Cross-modality and within-modality regularization for audio-visual deepfake detection, in ICASSP 2024 IEEE Int. Conf. Acoust., Speech and Signal Process., 2024, pp. 49004904.

  4. X. Li et al., Safeear: Content privacy-preserving audio deepfake detection, in Proc. 2024 ACM SIGSAC Conf. Comput. Commun. Secur., 2024, pp. 35853599.

  5. Y. Du et al., CAD: A general multimodal framework for video deepfake detection via cross-modal alignment and distillation, arXiv preprint arXiv:2505.15233, 2025.

  6. W. Xu et al., A multimodal deviation perceiving framework for weakly-supervised temporal forgery localization, in Proc. 33rd ACM Int. Conf. Multimedia, 2025, pp. 1158111589.

  7. A. Yermakov, J. Cech, J. Matas, and M. Fritz, Deepfake detection that generalizes across benchmarks, arXiv preprint arXiv:2508.06248, 2025.

  8. I. Kukanov and J. W. Ng, KLASSify to verify: Audio visual deepfake detection using SSL-based audio and handcrafted visual features, in Proc. 33rd ACM Int. Conf. Multimedia, 2025, pp. 1370713713.

  9. N. Klein et al., Pindrop it! Audio and visual deepfake countermeasures for robust detection and fine-grained localization, in Proc. 33rd ACM Int. Conf. Multimedia, 2025, pp. 1370013706.

  10. D. Salvi et al., A robust approach to multimodal deepfake detection,

    J. Imaging, vol. 9, no. 6, p. 122, 2023.

  11. A. Sar et al., A unified neural framework for real-time deepfake detection across multimedia modalities to combat misleading content, IEEE Access, 2025.

  12. S. Dasgupta et al., Attention-enhanced CNN for high performance deepfake detection: A multi-dataset study, IEEE Access, 2025.

  13. D. Cozzolino, A. Pianese, M. Nießner, and L. Verdoliva, Audio-visual person-of-interest deepfake detection, in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 943952.

  14. E. Choi, J. Ahn, X. Piao, and J. K. Kim, Crome: Multimodal fake news detection using cross-modal tri transformer and metric learning, arXiv preprint arXiv:2501.12422, 2025.

  15. A. Kharel, M. Paranjape, and A. Bera, DF-TransFusion: Multimodal deepfake detection via lip-audio cross-attention and facial self-attention, arXiv preprint arXiv:2309.06511, 2023.

  16. M. Javed et al., Enhancing multimodal deepfake detection with localglobal feature integration and diffusion models, Signal, Image Video Process., vol. 19, no. 5, pp. 1 9, 2025.

  17. P. Liu, Q. Tao, and J. T. Zhou, Evolving from single modal to multi-modal facial deepfake detection: Progress and challenges, arXiv preprint arXiv:2406.06965, 2024.

  18. Z. Cai et al., Glitch in the matrix: A large scale benchmark for content driven audiovisual forgery detection and localization, Comput. Vis. Image Underst., vol. 236, p. 103818, 2023.

  19. R. Wang et al., AVT²-DWF: Improving deepfake detection with audio-visual fusion and dynamic weighting strategies, IEEE Signal Process. Lett., 2024.

  1. R. Wang et al., AVT²-DWF: Improving deepfake detection with audio-visual fusion and dynamic weighting strategies, IEEE Signal Process. Lett., 2024.

  2. Y. Zhu, Y. Wang, and Z. Yu, Multimodal fake news detection: MFND dataset and shallow-deep multitask learning, arXiv preprint arXiv:2505.06796, 2025.

  3. M. A. Raza and K. M. Malik, Multimodaltrace: Deepfake detection using audiovisual representation learning, in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 9931000.

  4. S. Karim et al., MCGANa cutting edge approach to real time investigate of multimedia deepfake multi collaboration of deep generative adversarial networks with transfer learning, Sci. Rep., vol. 14, no. 1,p. 29330, 2024

  5. S. Smeu, D. A. Boldisor, D. Oneata, and E. Oneata, Circumventing shortcuts in audio-visual deepfake detection datasets with unsupervised learning, in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2025, pp. 1881518825.