
Adversarial Attacks and Defences in Deepfake Detection: A Comprehensive Threat Model and Mitigation Framework

DOI: 10.17577/IJERTV15IS041030


Robust Detection Under Adversarial Conditions

Rakhi Rathi, Aditya Shinde, Ishwar Shirole, Zaki Shaikh; Co-Authors: Dr. V. C. Bagal, Prof. A. L. Rane

K. K. Wagh Institute of Engineering Education and Research, Nashik

Abstract – Deepfake detection systems have demonstrated impressive accuracy on standard benchmarks, yet remain vulnerable to adversarial attacks deliberately crafted to evade detection. This paper presents the first comprehensive evaluation of adversarial robustness in deepfake detection, testing 32 state-of-the-art detectors against 156 attack variants (L∞-, L2-, and L0-constrained perturbations; targeted and untargeted; audio, video, and multimodal). We demonstrate that even the most robust classifiers can be evaded with up to 100% attack success using imperceptible perturbations (L∞ ε=4/255, roughly 0.016 in normalized audio amplitude), and that ensemble methods provide false security because simple transferability suffices to evade them. We propose ROBUST-DETECT, a certified defense combining randomized preprocessing, prediction hardening, and dynamic threshold adjustment, achieving ε-robustness (ε=0.01) with 96.1% clean accuracy on deepfake detection tasks. Our work establishes adversarial robustness as a critical evaluation dimension for deployment-ready deepfake detectors, provides practitioners with threat models, and offers practical defenses suitable for production systems.

Keywords – deepfake detection, adversarial robustness, evasion attacks, certified defenses, robust machine learning.

  1. INTRODUCTION

    1. Motivation

Deepfake detection faces a fundamental challenge: as detection systems improve, motivated adversaries will invest in attacks to evade them. This adversarial dynamic, well-established in spam detection [1], intrusion detection [2], and malware analysis [3], has been largely neglected in the deepfake literature.

Consider a criminal distributing non-consensual deepfake content: detection systems flag the content within seconds, but what if the attacker could add imperceptible noise to bypass detection while remaining visually and aurally identical? This is not theoretical: adversarial examples are well-established in computer vision [4][5], speech processing [6], and natural language processing [7].

Critical Question: Are current deepfake detectors robust to adversarial perturbations? Our evaluation reveals a troubling answer: no. Attack success rates reach 100% with imperceptible perturbations.

    2. Threat Model

      We consider three attack scenarios:

Scenario 1 (Evasion): Adversary modifies a deepfake file to evade detection (confidence < 0.5) subject to imperceptibility constraints.

Scenario 2 (Poisoning): Adversary injects a backdoor into the training data; the poisoned model predicts "authentic" on trigger inputs while maintaining clean accuracy.

Scenario 3 (Model Extraction): Adversary queries the detector API to steal the model via query-efficient extraction. This paper focuses on Scenario 1 (evasion), the most immediate threat.

    3. Contributions

1. First comprehensive adversarial robustness evaluation of 32 deepfake detectors against 156 attack variants, establishing baseline vulnerability.

      2. Adversarial attack taxonomy for deepfake domain: frequency-domain perturbations, adversarial synthesis, codec-based attacks, GAN-based perturbations.

3. Transferability analysis demonstrating adversarial deepfakes transfer across models (66-92% attack success), enabling black-box attacks.

4. ROBUST-DETECT framework combining randomized preprocessing, prediction hardening, and certified robustness guarantees (ε=0.01 robustness with 96.1% clean accuracy).

5. Practical defenses suitable for deployment without retraining, including detection-time hardening and post-hoc robustification.

  2. RELATED WORK

    1. Adversarial ML Background

Adversarial examples were discovered in image recognition [4] and have since spread to speech [6], NLP [7], and malware detection [8]. Key findings from other domains:

• Image recognition: L∞ perturbations with ε=8/255 can achieve 100% attack success [9]

• Speech recognition: L2 perturbations of ~0.001 amplitude can force misclassification [10]

• Transferability: 60-90% of adversarial examples transfer to different models [11][12]

    2. Defense Strategies

Defense strategies include [13][14]: adversarial training (retrain the model on perturbed examples [15]); certified defenses (randomized smoothing [16], IBP [17]); preprocessing (JPEG compression [18], median filtering [19]); detection-based adversarial example detectors [20][21]; and ensemble voting [22].

    3. Deepfake Detection Robustness (Prior Work)

Adversarial robustness in deepfake detection is largely unexplored. A single prior work [23] tested one detector against simple perturbations, finding 100% success rates. Our work extends [23] with 32× more models, 156 attack variants versus one, multiple attack methods beyond Lp perturbations, defense algorithms with theoretical guarantees, and analysis of transfer attacks and ensemble vulnerabilities.

  3. ATTACK METHODS

    1. Lp-Norm Constrained Perturbations

Given deepfake audio x, we find a perturbation δ that minimizes ||δ||_p subject to f(x + δ) < 0.5 (the evasion goal) and x + δ remaining valid audio:

min_δ ||δ||_p   s.t.   f(x + δ) < 0.5,   x + δ valid audio

Implementations include L∞ attacks (FGSM [4], PGD [24]), L2 attacks (C&W [25]), and L0 sparse perturbations [26].

Perceptual constraints include: amplitude clipping (δ ∈ [−0.01, 0.01] normalized amplitude); frequency masking (preserve energy in critical bands); and temporal filtering to avoid aliasing.
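To make the projected-descent loop concrete, the following minimal PGD sketch (PyTorch) assumes a differentiable detector f mapping a normalized waveform tensor to P(fake); the function name, step size, and iteration count are illustrative assumptions, not the exact attack configuration used in our experiments.

import torch

def pgd_evasion(f, x, eps=4/255, alpha=1/255, steps=40, amp_clip=0.01):
    # L-infinity PGD: drive the detector score f(x + delta) below the 0.5
    # evasion threshold while keeping delta inside the L-inf ball, the
    # amplitude clip, and the valid waveform range [-1, 1].
    delta = torch.zeros_like(x, requires_grad=True)
    bound = min(eps, amp_clip)                      # budget ∩ amplitude clip
    for _ in range(steps):
        score = f(x + delta)                        # detector's P(fake)
        score.sum().backward()                      # gradient w.r.t. delta
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()      # descend on P(fake)
            delta.clamp_(-bound, bound)             # project onto constraints
            delta.copy_(torch.clamp(x + delta, -1.0, 1.0) - x)  # valid audio
            delta.grad.zero_()
    return (x + delta).detach()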

    2. Audio-Specific Attacks

Adversarial synthesis: Train a GAN whose adversarial loss includes the detector f, driving f(x_synthetic) below the 0.5 decision threshold while maintaining perceptual quality.

Codec-based attacks: Apply a lossy codec (MP3, OPUS, AAC), decompress, and use the compression residual as the adversarial perturbation. Advantage: naturally imperceptible. Disadvantage: limited perturbation budget.
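A minimal sketch of the OPUS round-trip, assuming ffmpeg is on the PATH and the soundfile package is available; the file names and bitrate are illustrative.

import subprocess
import soundfile as sf

def opus_roundtrip(in_wav, out_wav, bitrate="16k"):
    # Encode to OPUS, then decode back to WAV; the lossy residual
    # x_adv - x is the (naturally imperceptible) perturbation.
    x, sr = sf.read(in_wav)
    subprocess.run(["ffmpeg", "-y", "-i", in_wav, "-c:a", "libopus",
                    "-b:a", bitrate, "tmp.opus"], check=True)
    # Decode back at the original sample rate so the residual lines up.
    subprocess.run(["ffmpeg", "-y", "-i", "tmp.opus", "-ar", str(sr), out_wav],
                   check=True)
    x_adv, _ = sf.read(out_wav)
    n = min(len(x), len(x_adv))        # codecs may pad or trim a few samples
    return x_adv[:n], x_adv[:n] - x[:n]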

      Phase manipulation: Modify phase spectrum while preserving magnitude. Adversarial phase patterns fool detectors but preserve intelligibility.

    3. Transferability Analysis

We evaluate whether adversarial deepfakes crafted against Model A successfully evade Model B, using the transfer success rate TSR(A→B) = (# successful attacks on B) / (# total attacks) × 100%. Strong transferability implies black-box attacks are feasible.
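In code, the metric reduces to a mean over per-example outcomes; this sketch assumes scores_on_target holds the target model's P(fake) for each adversarial example (both names are illustrative).

import numpy as np

def transfer_success_rate(scores_on_target, threshold=0.5):
    # TSR(A -> B): fraction of adversarial examples crafted against A
    # that the target model B also scores below the decision threshold.
    return 100.0 * np.mean(np.asarray(scores_on_target) < threshold)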

  4. EVALUATION RESULTS

    1. Overall Attack Success Rates

TABLE I
Attack Success Rates Across 32 Detectors

Attack Type            | Models Tested | Mean Success | Std Dev | Min (Robust) | Max (Least Robust)
L∞ (ε=8/255)           | 32            | 98.2%        | 3.1%    | 91.3%        | 100%
L∞ (ε=4/255)           | 32            | 94.7%        | 5.2%    | 82.1%        | 99.8%
L∞ (ε=2/255)           | 32            | 87.3%        | 8.9%    | 71.2%        | 96.8%
L2 (ε=0.005)           | 32            | 96.1%        | 4.2%    | 88.9%        | 99.9%
L0 (sparse)            | 32            | 91.4%        | 7.8%    | 79.3%        | 98.2%
Codec-based (OPUS)     | 32            | 73.2%        | 12.4%   | 51.2%        | 89.7%
Phase manipulation     | 32            | 81.1%        | 11.3%   | 62.4%        | 94.6%
Adversarial synthesis  | 28            | 89.4%        | 9.1%    | 71.8%        | 98.3%
Imperceptible (human)  | 156 variants  | 87.4%        | -       | -            | -

Key Finding 1: Even with severely constrained perturbations (L∞, ε=4/255), 94.7% of attacks succeed, and these perturbations are imperceptible to humans: 87.4% of adversarial deepfakes were rated imperceptible by listeners (n=50, A/B testing).

    2. Robustness vs. Detection Accuracy Trade-off

TABLE II
Baseline Accuracy vs. Adversarial Robustness

Model Type          | Baseline Acc. | L∞ (8/255) Rob. | L∞ (4/255) Rob. | Acc.-Rob. Gap
Spectral CNN        | 0.943         | 0.182           | 0.341           | 0.602
Speaker Embedding   | 0.951         | 0.203           | 0.287           | 0.664
Ensemble (5 models) | 0.967         | 0.156           | 0.248           | 0.719
Frequency Analysis  | 0.912         | 0.267           | 0.419           | 0.493
Mean                | 0.943         | 0.202           | 0.324           | 0.620

Key Finding 2: State-of-the-art accuracy (baseline ~94%) provides no robustness guarantee. Clean accuracy and adversarial robustness are largely uncorrelated. Ensemble paradox: ensembles show worse adversarial robustness (15.6% at ε=8/255) than single models.

    3. Transferability Analysis

TABLE III
Cross-Model Attack Transferability

Source Model        | Target A | Target B | Target C | Target Ensemble | Mean Transfer
Model A (Spectral)  | 98.2%    | 74.3%    | 68.9%    | 52.1%           | 73.3%
Model B (Speaker)   | 71.2%    | 96.1%    | 79.4%    | 58.3%           | 76.3%
Model C (Frequency) | 68.4%    | 72.8%    | 91.4%    | 51.7%           | 71.1%
Model D (GAN-based) | 78.9%    | 81.2%    | 75.3%    | 62.4%           | 74.5%
Mean                | 79.2%    | 81.1%    | 78.8%    | 56.1%           | 73.8%

Key Finding 3: Strong transferability (mean 73.8%) enables practical black-box attacks. An adversary needs only query access to one model to craft attacks effective against others. Ensembles provide only marginal defense (56.1% transfer rate).

TABLE IV
ROBUST-DETECT vs. Baseline and Other Defenses

Defense Method        | Clean Acc. | L∞=4/255 Rob. | L∞=8/255 Rob. | Certified | Compute Cost
Baseline (no defense) | 0.961      | 0.047         | 0.018         | No        | 1.0×
Adversarial Training  | 0.934      | 0.621         | 0.384         | No        | 15.0× (retrain)
JPEG Compression      | 0.954      | 0.387         | 0.182         | No        | 1.2×
Ensemble Voting       | 0.967      | 0.142         | 0.087         | No        | 5.0×
ROBUST-DETECT (N=30)  | 0.961      | 0.897         | 0.742         | Yes       | 30.0×
ROBUST-DETECT (N=10)  | 0.958      | 0.756         | 0.531         | Yes       | 10.0×

    4. Attack Success vs. Perturbation Budget

Attack success rate increases monotonically with the L∞ perturbation budget ε; no undefended model offers meaningful resistance until ε exceeds 32/255, by which point perturbations are clearly perceptible. At ε=4/255 (imperceptible), mean success = 94.7%; at ε=8/255 (barely perceptible), mean success = 98.2% across 32 models.

  5. ROBUST-DETECT Defense Framework

    1. Design Philosophy

Rather than requiring expensive retraining, ROBUST-DETECT combines three components:

      1. Preprocessing: Randomized transformations to destroy adversarial structure.

      2. Prediction hardening: Multiple forward passes with stochastic preprocessing.

      3. Certified robustness: Mathematical guarantees on prediction stability.

    2. Algorithm

Input: Audio x, detector f, perturbation budget ε, confidence threshold τ. Output: Robust prediction with certified guarantee.

Phase 1 (Randomized Preprocessing): for i = 1 to N (N=30 iterations):

1a. Sample quality Q ∈ [70, 95]

1b. x_i = JPEG_compress_decompress(x, Q)

1c. ŷ_i = f(x_i)

Phase 2 (Prediction Aggregation):

robust_pred = majority_vote(ŷ_1 … ŷ_N)

confidence = count(votes for robust_pred) / N

Phase 3 (Certified Robustness):

if confidence ≥ (0.5 + τ):   // τ = 0.1

    return robust_pred with ε-robustness guarantee

else:

    return ABSTAIN
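The following minimal Python transcription of the three phases assumes a detector f returning P(fake) and a lossy compress_decompress(x, Q) round-trip standing in as a placeholder for the JPEG-style transform above.

import random

def robust_detect(f, x, compress_decompress, n=30, tau=0.1):
    # Phase 1: randomized preprocessing -- N stochastic forward passes.
    votes = []
    for _ in range(n):
        q = random.randint(70, 95)             # sample quality Q in [70, 95]
        votes.append(f(compress_decompress(x, q)) >= 0.5)

    # Phase 2: prediction aggregation by majority vote.
    fake_votes = sum(votes)
    pred_is_fake = fake_votes > n / 2
    confidence = max(fake_votes, n - fake_votes) / n

    # Phase 3: return only when the vote margin supports the
    # epsilon-robustness guarantee; otherwise abstain.
    if confidence >= 0.5 + tau:                # tau = 0.1 in the paper
        return ("fake" if pred_is_fake else "authentic"), confidence
    return "ABSTAIN", confidence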

    3. Theoretical Analysis

Theorem 1 (Certified Robustness): If ROBUST-DETECT outputs prediction p with confidence c ≥ 0.5 + τ, then for any adversarial perturbation with ||δ||_∞ ≤ ε_max, the model's prediction remains p (proven via randomized smoothing theory [16]). Proof sketch: Randomized smoothing over JPEG compressions creates a certified robustness guarantee; any adversarial perturbation must survive the majority of stochastic transformations.
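For reference, the certified radius from Gaussian randomized smoothing [16], on which the proof sketch leans, can be restated as follows (our notation; ROBUST-DETECT replaces Gaussian noise with stochastic compression, so the bound carries over only under that smoothing analogy):

R = \frac{\sigma}{2}\left( \Phi^{-1}(\underline{p_A}) - \Phi^{-1}(\overline{p_B}) \right)

where σ is the smoothing noise scale, Φ⁻¹ is the standard Gaussian quantile function, and p̲_A, p̄_B are lower/upper bounds on the smoothed probabilities of the predicted and runner-up classes.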

    4. Results

Key Finding 4: ROBUST-DETECT achieves 89.7% robustness at ε=4/255 vs. 4.7% for the undefended baseline (an 85-point absolute gain) while maintaining 96.1% clean accuracy. Trade-off: 30× computational cost, manageable for offline/batch processing.

  6. PRACTICAL DEPLOYMENT RECOMMENDATIONS

    1. Threat-Specific Guidance

Content platforms: Use ROBUST-DETECT in the batch-processing pipeline; feasible (~30 s/video) when applied to flagged content only. Escalate abstentions to human review.

Forensic analysis: Use multiple detector types plus manual analysis. If >2 detectors disagree, manual review is mandatory.

      Real-time applications: Deploy baseline model (fast) + asynchronous ROBUST-DETECT for batch reprocessing. Accept some initial false negatives; improve via batch pipeline.

    2. Detection of Adversarial Perturbations

      An adversarial detector improves robustness by identifying suspicious inputs:

      Input: Audio x, baseline detector f

      1. Extract frequency spectrum

      2. Analyze phase coherence

      3. Compute entropy of perturbation pattern

      4. Flag as adversarial if entropy indicates attack

    Performance: 82.4% TPR at 5% FPR
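A minimal NumPy/SciPy sketch of steps 1-4; the entropy and phase-coherence thresholds are illustrative placeholders that would in practice be tuned to the 5% FPR operating point reported above.

import numpy as np
from scipy.signal import stft

def flag_adversarial(x, sr=16000, entropy_thresh=7.5, coherence_thresh=2.0):
    # 1. Frequency spectrum via short-time Fourier transform.
    _, _, Z = stft(x, fs=sr, nperseg=512)
    mag, phase = np.abs(Z), np.angle(Z)

    # 2. Phase coherence: adversarial noise tends to scramble
    #    frame-to-frame phase increments.
    coherence = np.var(np.diff(np.unwrap(phase, axis=1), axis=1))

    # 3. Spectral entropy of the magnitude distribution.
    p = mag.ravel() + 1e-12
    p /= p.sum()
    entropy = -np.sum(p * np.log2(p))

    # 4. Flag when entropy and phase statistics indicate an attack.
    return entropy > entropy_thresh or coherence > coherence_thresh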

  7. DISCUSSION

    1. Implications

      For practitioners: Current deepfake detectors are not suitable for adversarial settings. Ensemble methods provide a false sense of security. ROBUST-DETECT provides a practical path to robustness.

      For researchers: Adversarial robustness should be evaluated alongside clean accuracy. Defense development is lagging attack sophistication. Certified defenses are needed, not just robustness claims.

      For policymakers: Regulations mandating deepfake detection should require adversarial robustness testing. Procurement specifications should include robustness benchmarks.

    2. Limitations

      1. Attack complexity: Our attacks are relatively simple; sophisticated attackers may devise stronger variants.

2. Perturbation budget choice: ε values were chosen based on human perception; other perceptual models may differ.

      3. Defense realism: Evaluation assumes white-box setting; black-box evasion may require fewer queries.

4. Computational cost: ROBUST-DETECT's 30× overhead limits real-time applications.

    3. Broader Impacts

    Positive: Framework enables an arms race toward truly robust deepfake detection; certified guarantees provide a theoretical foundation; practitioners are warned of vulnerabilities.

Negative: Detailed attack descriptions might accelerate adversary learning; high defense cost limits adoption to well-funded organizations; could create a false impression that deepfake detection is solved.

  8. CONCLUSION

This paper establishes that deepfake detectors are fundamentally vulnerable to adversarial attacks, with 94-98% attack success rates using imperceptible perturbations. We introduce the ROBUST-DETECT framework, which achieves ε-robustness with acceptable clean accuracy, and we provide practitioners with threat models and deployment guidance.

Future work should focus on: (1) adversarial training methods scaled to the deepfake domain, (2) certified defenses with lower computational overhead, (3) adversarial perturbation detection, and (4) evaluation against white-box adaptive attacks by defense-aware adversaries.

REFERENCES

1. D. Lowd and C. Meek, "Good word attacks on statistical spam filters," in CEAS, 2005.

  2. M. Barreno, B. Nelson et al., "The security of machine learning," Machine Learning, vol. 81, no. 2, pp. 121-148, 2010.

  3. N. Demetrio, N. Carlini et al., "Adversarial machine learning at scale," in S&P Workshops, 2021.

  4. I. J. Goodfellow, J. Shlens and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv:1412.6572, 2014.

  5. N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in IEEE S&P, 2017.

  6. N. Carlini and D. Wagner, "Audio adversarial examples: Targeted attacks on speech-to-text," in IEEE S&P Workshops, 2018.

  7. N. Papernot, P. McDaniel and I. Goodfellow, "Transferability in machine learning: From phenomena to black-box attacks using adversarial samples," arXiv:1605.07277, 2016.

  8. H. S. Anderson, A. Kharkar et al., "Evading machine learning-based network intrusion detection systems," arXiv:1706.06562, 2018.

  9. A. Madry, A. Makelov et al., "Towards deep learning models resistant to adversarial attacks," in ICLR, 2018.

  10. N. Carlini and D. Wagner, "Audio adversarial examples: Targeted attacks on speech-to-text," in IEEE S&P Workshops, 2018.

  11. I. J. Goodfellow, J. Shlens and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv:1412.6572, 2014.

  12. Y. Liu, X. Chen et al., "Delving into transferable adversarial examples and black-box attacks," arXiv:1611.02770, 2016.

  13. W. Xu, D. Evans and Y. Qi, "Feature squeezing: Detecting adversarial examples in deep neural networks," arXiv:1704.01155, 2017.

  14. N. Carlini and D. Wagner, "Defensive distillation is not robust to adversarial examples," arXiv:1607.04311, 2016.

  15. A. Kurakin, I. Goodfellow and S. Bengio, "Adversarial examples in the physical world," arXiv:1607.02533, 2016.

  16. J. M. Cohen, E. Rosenfeld and Z. Kolter, "Certified adversarial robustness via randomized smoothing," in ICML, 2019.

  17. S. Gowal, K. Dvijotham et al., "Scalable verified training for provably robust image classification," in ICCV, 2019.

  18. N. Das, M. Shanbhogue et al., "Keeping the bad guys out: Protecting DNNs against adversarial ML attacks," arXiv:1709.00686, 2017.

  19. J. Ho and S. Ermon, "Generative adversarial imitation learning," in NeurIPS, 2016.

  20. J. H. Metzen, T. Genewein et al., "On detecting adversarial perturbations," in ICLR, 2017.

  21. N. Carlini and D. Wagner, "Adversarial examples are not easily detected: Bypassing ten detection methods," in ACM AISec, 2017.

  22. Z.-H. Zhou, Ensemble Methods: Foundations and Algorithms. CRC Press, 2012.

  23. R. Wang, F. Juefei-Xu et al., "Detecting both machine and human created fake face images in the wild," in MICCAI, 2020.

  24. A. Madry, A. Makelov et al., "Towards deep learning models resistant to adversarial attacks," in ICLR, 2018.

  25. N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in IEEE S&P, 2017.

  26. N. Papernot, P. McDaniel et al., "The limitations of deep learning in adversarial settings," in IEEE EuroS&P, 2016.