Deepcheck: A Unified Multimodal Deepfake Detection Framework with Cross-Modal Consistency Analysis, Learned Fusion, and Explainable AI

Rugved Joshi; Varad Khadke; Ayush Shah; Amay Choudhari; Prof. Vijayalaxmi Kanade

doi:10.5281/zenodo.20551739

Volume 15, Issue 06 (June 2026)

Open Access
Paper ID : IJERTV15IS060022
Published (First Online): 05-06-2026
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

Rugved Joshi

Artificial Intelligence and Data Science

PVG’s College of Engineering, Technology and Management, Pune, India

Amay Choudhari

Artificial Intelligence and Data Science

PVG’s College of Engineering, Technology and Management, Pune, India

Varad Khadke

Artificial Intelligence and Data Science

PVG’s College of Engineering, Technology and Management, Pune, India

Prof. Vijayalaxmi Kanade

Artificial Intelligence and Data Science

PVG’s College of Engineering, Technology and Management, Pune, India

Ayush Shah

Artificial Intelligence and Data Science

PVG’s College of Engineering, Technology and Management, Pune, India

Abstract – Deepfake detection has emerged as a critical challenge in the digital age, and single-modality detectors show significant limitations in real-world scenarios. This paper presents DeepCheck, a trimodal framework for detecting deepfakes across the image, audio, and video modalities. Our approach addresses four research gaps: cross-modal consistency analysis for enhanced detection, explainable AI integration via GradCAM for interpretability, a learned meta-learner fusion mechanism for multimodal decision-making, and provenance classification for deepfake source attribution. DeepCheck achieves high per-modality accuracies (Image: 99.26%, Audio: 99.87%, Video: 96.75%) while providing interpretable results and provenance insights. Ablation studies show that trimodal analysis outperforms single-modality approaches. Validation-test accuracy gaps are below 0.01%, indicating that the models generalize well from validation to test data across modalities.

Keywords – Deepfake Detection, Multimodal Learning, Explainable AI, GradCAM, Meta-Learner Fusion

INTRODUCTION
This paper presents DeepCheck, a framework that addresses the limitations of single-modality detectors through four key innovations. The first is Cross-Modal Consistency Analysis, which analyzes synchronization between visual lip movements and audio speech patterns to detect the temporal inconsistencies characteristic of deepfakes. The second is Explainable AI Integration, which employs GradCAM-based visualization and attention mechanisms to provide interpretable evidence for detection decisions, enabling human verification of AI predictions. The third is Learned Meta-Learner Fusion, a fusion mechanism that learns the optimal weighting of modalities rather than using fixed heuristics, adapting to dataset characteristics. The fourth is Provenance Classification, which classifies deepfakes by generation method (face-swap, lip-sync, voice synthesis) to support forensic analysis and attribution.
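As a rough illustration of the cross-modal consistency idea, the sketch below scores lip-audio agreement as the best Pearson correlation between a per-frame mouth-opening signal and an audio energy envelope over a small range of frame lags. The signals, the function names, and the lag window are assumptions made for illustration; they are not taken from the DeepCheck implementation.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def sync_score(mouth_opening, audio_energy, max_lag=2):
    """Return (best correlation, lag) over lags in [-max_lag, max_lag] frames."""
    best = (-1.0, 0)
    n = len(mouth_opening)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            m, a = mouth_opening[lag:], audio_energy[:n - lag]
        else:
            m, a = mouth_opening[:n + lag], audio_energy[-lag:]
        r = pearson(m, a)
        if r > best[0]:
            best = (r, lag)
    return best

# Hypothetical per-frame signals: the mouth tracks the audio in the "real"
# clip and is unrelated to it in the "fake" clip.
audio = [0.1, 0.8, 0.3, 0.9, 0.2, 0.7, 0.1, 0.6]
real_mouth = [0.2, 0.9, 0.4, 1.0, 0.3, 0.8, 0.2, 0.7]
fake_mouth = [0.5, 0.5, 0.6, 0.4, 0.5, 0.6, 0.4, 0.5]

real_r, real_lag = sync_score(real_mouth, audio)
fake_r, fake_lag = sync_score(fake_mouth, audio)
```

A low best-lag correlation would then count as evidence of manipulation. A practical system would likely replace the hand-built signals with learned audio and visual embeddings.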
RELATED WORK
Interpretability in deepfake detection is critical for forensic applications and legal admissibility. GradCAM (Gradient-weighted Class Activation Mapping) uses gradient information to visualize the input regions that most influence a classification decision, enabling human verification of model reasoning. Attention mechanisms learn spatial and temporal attention weights that indicate the features most critical for detection, providing interpretable decision paths. Saliency maps highlight the pixels or frames most critical for deepfake determination, supporting forensic analysis. Despite these advances, most deepfake detectors remain black boxes, which limits their adoption in forensic and legal contexts where explainability is paramount.
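The GradCAM computation itself is short. Each feature map of a chosen convolutional layer is weighted by the spatial average of its gradient with respect to the class score, the weighted maps are summed, and a ReLU keeps only positive evidence. The sketch below works on plain nested lists with made-up numbers. In practice the activations and gradients would come from a framework hook on the detector's last convolutional layer, which is not shown here.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap for one convolutional layer.

    activations[k][i][j]: feature map k at spatial position (i, j).
    gradients[k][i][j]:   d(class score) / d(activations[k][i][j]).
    Returns an H x W heatmap scaled to [0, 1].
    """
    num_maps = len(activations)
    height, width = len(activations[0]), len(activations[0][0])

    # Channel weights: global-average-pool the gradients of each feature map.
    alphas = [sum(sum(row) for row in g) / (height * width) for g in gradients]

    # Weighted sum of feature maps, then ReLU to keep only positive evidence.
    cam = [[max(0.0, sum(alphas[k] * activations[k][i][j] for k in range(num_maps)))
            for j in range(width)] for i in range(height)]

    # Normalize for display; an all-zero map is returned unchanged.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam

# Toy 2-channel, 2x2 example with made-up numbers.
acts = [[[1.0, 0.0], [0.0, 2.0]],
        [[0.0, 3.0], [1.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]],      # channel 0 supports the class
         [[-1.0, -1.0], [-1.0, -1.0]]]  # channel 1 argues against it
heatmap = grad_cam(acts, grads)
print(heatmap)  # [[0.5, 0.0], [0.0, 1.0]]
```

The heatmap is then upsampled to the input resolution and overlaid on the image or frame for human inspection.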
METHODOLOGY
Understanding the deepfake generation method enables forensic attribution and targeted mitigation strategies. The system classifies content into four categories. Face-Swap replaces facial regions using generative models; characteristic artifacts include face boundary misalignment, lighting changes, and unnatural skin texture transitions, and detection signals include high-frequency artifacts at face edges and inconsistent lighting direction. Lip-Sync (Face-Reenactment) modifies facial expression to match the audio; characteristic artifacts include unnatural mouth movements, eye-mouth desynchronization, and slight texture degradation, and detection signals include temporal inconsistencies and motion artifacts around the mouth. Voice Synthesis replaces or heavily modifies the audio; characteristic artifacts include spectral discontinuities, unnatural prosody, breathing pattern anomalies, and frequency gaps, and detection signals include energy discontinuities, spectral anomalies, and phoneme boundary artifacts. Real (Authentic) content has no manipulation applied and serves as the reference baseline.

The provenance classifier is a separate multi-class classifier for 4-way classification. Its input consists of modality-specific features from the detection modules: image features from the EfficientNet-B4 penultimate layer (1280 dims), audio features from the ResNet18 penultimate layer plus spectrogram statistics, and video features from temporal attention weights plus optical flow features. All features are concatenated (approximately 2000 dims) and passed through Dense(512, ReLU) with BatchNorm and Dropout(0.5), Dense(256, ReLU) with BatchNorm and Dropout(0.3), and an output layer Dense(4, Softmax) for 4-way classification.
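To get a sense of the size of this classification head, the sketch below counts its trainable parameters from the layer widths given above. It assumes an input of exactly 2000 dimensions (the text gives only an approximate figure) and standard dense and batch-normalization layers. Dropout adds no parameters.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def batchnorm_params(n_features):
    """Trainable scale and shift of a batch-normalization layer."""
    return 2 * n_features

def head_params(input_dim, hidden=(512, 256), num_classes=4):
    """Trainable parameters of Dense+BatchNorm hidden blocks and a Dense output."""
    total, n_in = 0, input_dim
    for n_out in hidden:
        total += dense_params(n_in, n_out) + batchnorm_params(n_out)
        n_in = n_out
    return total + dense_params(n_in, num_classes)

# 2000 is the approximate fused feature size, treated here as exact.
print(head_params(2000))  # 1158404
```

Under these assumptions the first dense layer accounts for roughly 88% of the head's parameters, so the fused feature size is the main driver of its capacity.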

Training uses multi-class cross-entropy loss with class weights to handle class imbalance. Data augmentation is tailored to each provenance type, and the model is optimized with Adam at a learning rate of 0.0005. The classifier achieves 92.34% top-1 accuracy on provenance classification.
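The effect of class weighting can be seen in a small worked example. The sketch below derives inverse-frequency weights and evaluates a weighted cross-entropy, normalized by the sum of the weights of the samples in the batch (one common convention). The class counts and the softmax outputs are hypothetical, and the paper does not state how its class weights were chosen.

```python
import math

def inverse_frequency_weights(class_counts):
    """Weight each class by total / (num_classes * count), so rare classes weigh more."""
    total = sum(class_counts)
    k = len(class_counts)
    return [total / (k * c) for c in class_counts]

def weighted_cross_entropy(probs, labels, weights):
    """Sum of -w[y] * log p[y], divided by the sum of the weights used."""
    num = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    den = sum(weights[y] for y in labels)
    return num / den

# Hypothetical class counts for (real, face-swap, lip-sync, voice synthesis).
counts = [500, 200, 175, 125]
weights = inverse_frequency_weights(counts)

# Two softmax outputs: a confident "real" and an uncertain "voice synthesis".
probs = [[0.90, 0.05, 0.03, 0.02],
         [0.40, 0.10, 0.10, 0.40]]
labels = [0, 3]
loss = weighted_cross_entropy(probs, labels, weights)
```

Here the rare voice-synthesis sample carries four times the weight of the common real sample, so its uncertain prediction dominates the loss.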
DATASETS & PREPROCESSING
The video dataset is the public LAV-DF dataset, containing 10,600 videos in total: 5,300 real videos (approximately 50 hours) and 5,300 deepfake videos (approximately 50 hours). Video duration ranges from 3 to 15 seconds, resolution from 480p to 1080p (downsampled to 480p), and the frame rate is 25 fps. The deepfake generation methods in the dataset are face-swap at 40% (using DeepFaceLab), lip-sync at 35% (using Wav2Lip), and voice synthesis at 25% (using various TTS engines).
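For reference, the stated proportions translate into per-method video counts as follows. The sketch assumes the percentages are shares of the 5,300 deepfake videos rather than of the whole dataset, which the text does not state explicitly.

```python
total_videos = 10_600
fake_videos = 5_300
method_share = {"face-swap": 40, "lip-sync": 35, "voice synthesis": 25}  # percent of fakes

# Integer arithmetic avoids floating-point rounding in the counts.
method_counts = {name: fake_videos * pct // 100 for name, pct in method_share.items()}
real_videos = total_videos - fake_videos

print(method_counts)  # {'face-swap': 2120, 'lip-sync': 1855, 'voice synthesis': 1325}
```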

Preprocessing applies face detection and alignment using the same techniques as the image module. Frames are extracted at 1 fps for computational efficiency, and optical flow is computed between consecutive frames. Mouth regions are extracted using bounding boxes around detected mouths. Temporal clipping ensures a minimum of 5 frames for consistency analysis. Augmentation includes H.264 video compression with bitrates from 500 kbps to 5 Mbps, Gaussian blur with sigma from 0 to 2, frame dropping to simulate low frame rates, and lighting changes with brightness adjustments of ±20%. The dataset is split into 70% training, 15% validation, and 15% test.
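The frame sampling and the dataset split can be sketched as follows. The helper names are hypothetical, and the assumption that the test set takes any rounding remainder is ours.

```python
def sampled_frame_indices(duration_s, native_fps=25, sample_fps=1):
    """Indices of the native frames kept when sampling at sample_fps."""
    step = native_fps // sample_fps
    return list(range(0, int(duration_s * native_fps), step))

def split_sizes(n, train=70, val=15):
    """Train/validation/test sizes in whole items; the test set takes the remainder."""
    n_train = n * train // 100
    n_val = n * val // 100
    return n_train, n_val, n - n_train - n_val

MIN_FRAMES = 5  # minimum clip length for consistency analysis, from the text

indices = sampled_frame_indices(8)        # an 8-second clip at 25 fps
long_enough = len(indices) >= MIN_FRAMES
sizes = split_sizes(10_600)

print(indices)  # [0, 25, 50, 75, 100, 125, 150, 175]
print(sizes)    # (7420, 1590, 1590)
```

In practice the split would also be stratified by label and generation method, and all clips derived from one source video would be kept in the same partition to avoid leakage.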
EXPERIMENTS & RESULTS
The robustness analysis evaluates adversarial and compression attacks, with all changes reported relative to the clean baseline. On clean data, the baseline achieves 99.95% accuracy. Under an FGSM attack with epsilon 0.03, accuracy is 97.80% (-2.15%). Under a PGD attack with epsilon 0.03 and 20 steps, accuracy is 95.20% (-4.75%). Under a DeepFool attack with epsilon 0.06, accuracy is 93.50% (-6.45%). Adversarial training, applied as a post-training step, recovers accuracy to 99.50% (-0.45%).
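For readers unfamiliar with FGSM, the sketch below applies the one-step attack to a toy logistic model, where the input gradient has a closed form. The model weights and the input are hypothetical. Only the epsilon of 0.03 is taken from the text, and the attack on the actual DeepCheck networks would obtain the gradient through automatic differentiation instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability that x is fake under a logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """One-step FGSM: move each input by eps in the direction that raises the loss.

    For binary cross-entropy on a logistic model, d(loss)/dx = (p - y) * w.
    Inputs are clipped back to the valid [0, 1] range.
    """
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [min(1.0, max(0.0, xi + eps * sign(g))) for xi, g in zip(x, grad)]

# Hypothetical weights and a "fake" input (label 1) with features in [0, 1].
w, b = [2.0, -1.0, 1.5], -1.0
x, y = [0.6, 0.2, 0.5], 1
x_adv = fgsm(w, b, x, y, eps=0.03)

clean_p = predict(w, b, x)
adv_p = predict(w, b, x_adv)
```

The perturbed input stays within 0.03 of the original in every coordinate, yet the predicted fake probability drops. PGD repeats this step several times with a projection back into the epsilon ball.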

Compression robustness is measured against the 99.95% clean baseline. H.264 compression at 1 Mbps yields 99.20% accuracy (-0.75%), MJPEG compression with variable bitrate yields 98.80% (-1.15%), and severe compression at 500 kbps yields 96.50% (-3.45%). Gaussian blur with sigma 2.0 yields 98.90% (-1.05%). Noise addition yields 99.10% (-0.85%) at an SNR of 20 dB and 97.30% (-2.65%) at an SNR of 10 dB.
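The noise conditions can be reproduced with a short helper that scales Gaussian noise to a target signal-to-noise ratio, using SNR_dB = 10 log10(P_signal / P_noise). The sketch below is an assumption about how such a test could be set up, with a synthetic tone standing in for real audio. The paper does not describe its noise-injection procedure.

```python
import math
import random

def add_noise_at_snr(signal, snr_db, seed=0):
    """Add Gaussian noise scaled so the result has the requested SNR in dB."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(signal, noise)]

def measured_snr_db(clean, noisy):
    """SNR in dB recovered from a clean signal and its noisy version."""
    signal_power = sum(s * s for s in clean)
    noise_power = sum((a - s) ** 2 for s, a in zip(clean, noisy))
    return 10 * math.log10(signal_power / noise_power)

# A synthetic 440 Hz tone sampled at 16 kHz stands in for real audio.
clean = [math.sin(2 * math.pi * 440 * t / 16_000) for t in range(1_600)]
noisy_20 = add_noise_at_snr(clean, 20)
noisy_10 = add_noise_at_snr(clean, 10)
```

Lowering the SNR from 20 dB to 10 dB raises the noise power tenfold, which is consistent with the larger accuracy drop at 10 dB.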

Detailed robustness analysis reveals that the FGSM attack at epsilon 0.03 achieves 97.80% accuracy (-2.15%), while the PGD attack at epsilon 0.03 with 20 steps achieves 95.20% accuracy (-4.75%). Adversarial training recovers accuracy to 99.50%, and certified defense mechanisms are recommended for deployment. H.264 compression at 1 Mbps achieves 99.20% accuracy (-0.75%), MJPEG compression achieves 98.80% accuracy (-1.15%), and severe compression at 500 kbps achieves 96.50% accuracy (-3.45%); the audio modality compensates for low video quality. Out-of-distribution generalization testing on recent deepfake datasets from 2024 shows 97.30% accuracy on Face Swapper v2 and 96.80% accuracy on MetaAI Make-A-Video, suggesting that modest fine-tuning would be needed to close the remaining gap.
DISCUSSION
Immediate extensions include multilingual audio support, which would retrain the audio module on diverse languages with a target of supporting 20+ languages at less than a 2% accuracy drop. End-to-end optimization would jointly optimize all components under a unified loss function instead of training the modules independently, with a potential additional accuracy improvement of 0.5 to 1%. Hardware acceleration would deploy the system on GPU clusters for real-time processing, improving on the current inference time of approximately 500 ms per sample (about 2 fps).

Long-term research directions include meta-learner generalization: training on large datasets of diverse deepfake sources to enable transfer to new deepfake types without retraining, and researching domain adaptation techniques for new modalities. Few-shot detection would enable detection of new deepfake types from limited examples, leveraging meta-learning for rapid adaptation, with a target of greater than 95% accuracy from fewer than 100 labeled examples. Self-supervised pre-training would leverage unlabeled video and audio data for representation learning, reducing reliance on large labeled datasets and improving generalization to new deepfake methods. Temporal consistency learning would make better use of temporal information in videos through attention over long-range temporal dependencies and by modeling the temporal patterns characteristic of authentic videos. Attribution and tracing would move beyond detection to identify generation tools and parameters, supporting law enforcement attribution, with potential for blockchain-based provenance tracking.
CONCLUSION

This paper presents DeepCheck, a comprehensive multimodal framework for deepfake detection that addresses four critical research gaps in the field. Cross-Modal Consistency Analysis enables detection of the temporal misalignments characteristic of deepfakes, particularly lip-sync inconsistencies, achieving a separation of 2.1 standard deviations between real and fake videos. Explainable AI Integration via GradCAM and attention mechanisms provides interpretable evidence for detection decisions, which is critical for forensic and legal applications, with 94.2% semantic meaningfulness in visualizations. Learned Meta-Learner Fusion combines multimodal signals, outperforming fixed fusion strategies by 0.08% and adapting to new datasets without retraining. Provenance Classification enables attribution to deepfake generation methods, with face-swap at 93.4%, lip-sync at 92.1%, and voice synthesis at 91.8%, supporting forensic analysis and targeted mitigation strategies.

The experimental results cover multiple system configurations. The image module (EfficientNet-B4) achieves 99.26% accuracy, the audio module (ResNet18) achieves 99.87% accuracy, and the video module (EfficientNet-B4) achieves 96.75% accuracy. The full trimodal DeepCheck system achieves 99.95% accuracy.

Key findings from the evaluation are as follows: the trimodal approach outperforms the best single modality by 0.08%; meta-learner fusion provides a 0.07% improvement over fixed weighting; robustness to compression reaches 98.2% at 1 Mbps H.264; adversarial robustness is demonstrated under several attacks; generalization is strong, with Val-Test gaps of less than 0.01%; and human-level explainability reaches 94.2% saliency region meaningfulness.
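The difference between fixed weighting and learned fusion can be illustrated on toy data. The sketch below compares an equal-weight average of per-modality fake probabilities with a logistic-regression meta-learner trained by gradient descent. The scores are made up so that the audio score is reliable while the video score is noisy. The choice of logistic regression is an assumption, since the paper does not specify the form of its meta-learner, and accuracy is measured on the training samples purely for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fixed_fusion(scores):
    """Baseline: equal-weight average of the per-modality fake probabilities."""
    return sum(scores) / len(scores)

def train_meta_learner(samples, labels, lr=0.5, epochs=5000):
    """Fit logistic-regression fusion weights with full-batch gradient descent."""
    w, b = [0.0] * len(samples[0]), 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in zip(samples, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def learned_fusion(scores, w, b):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, scores)) + b)

def accuracy(predict, samples, labels):
    hits = sum((predict(x) >= 0.5) == bool(y) for x, y in zip(samples, labels))
    return hits / len(samples)

# Made-up (image, audio, video) fake probabilities; label 1 = fake.
samples = [(0.9, 0.8, 0.7), (0.8, 0.9, 0.2), (0.6, 0.9, 0.1), (0.3, 0.8, 0.2),
           (0.1, 0.2, 0.3), (0.2, 0.1, 0.8), (0.5, 0.2, 0.9), (0.3, 0.1, 0.4)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

w, b = train_meta_learner(samples, labels)
fixed_acc = accuracy(fixed_fusion, samples, labels)
learned_acc = accuracy(lambda s: learned_fusion(s, w, b), samples, labels)
print(fixed_acc, learned_acc)  # 0.75 1.0
```

In a real pipeline the meta-learner would be fit on held-out validation predictions of the three modules, so that it does not inherit their overfitting.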

The combination of high accuracy, interpretability, and provenance information makes DeepCheck suitable for deployment in forensic, investigative, and legal settings where both accuracy and explainability are paramount. The framework’s adaptive fusion mechanism enables rapid deployment against emerging deepfake generation technologies without complete retraining. As deepfake technology continues to evolve rapidly, adaptive, interpretable, and modular detection systems like DeepCheck will become increasingly important for maintaining trust in digital media and protecting against malicious manipulation in political, social, and financial domains.

Future work will focus on multilingual audio support across 20+ languages, end-to-end optimization achieving 0.5-1% additional gains, few-shot adaptation for novel deepfake methods with less than 100 labeled examples, and attribution and tracing for law enforcement applications.