DOI : 10.17577/IJERTCONV14IS010031- Open Access

- Authors : Kavana, Nishmitha J
- Paper ID : IJERTCONV14IS010031
- Volume & Issue : Volume 14, Issue 01, Techprints 9.0
- Published (First Online) : 01-03-2026
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
Detection of Deepfakes Using Long Distance Attention
Kavana, Nishmitha J
Student, St. Joseph Engineering College, Mangalore
Assistant Professor, St. Joseph Engineering College, Mangalore
Abstract – Developments in generative models have enabled the routine generation of highly convincing but misleading multimedia, creating new challenges to content authenticity and undermining the credibility of visual content and the dependability of digital communications. Since real and tampered face features often differ only at a micro level, making identification much more difficult, existing detection techniques typically treat the problem as a binary classification challenge, which frequently proves insufficient. In this study we present a thorough deepfake detection method that reframes the issue as a fine-grained classification problem. We present a spatial-temporal architecture that utilizes a new long-distance attention mechanism capable of detecting temporal inconsistencies between consecutive frames as well as spatial artifacts within individual frames, strengthening the model's capacity to extract global contextual information and highlight important facial regions. Our approach combines facial recognition algorithms with convolutional neural networks (CNNs), specifically Xception, trained on both image and video datasets. The model achieves 97% accuracy on image-based inputs and 85% accuracy on video-based inputs, outperforming state-of-the-art methods on several publicly available benchmarks. Our study offers a precise and deployable deepfake detection method to counter identity theft, deception, and other harmful uses of synthetic media.
Keywords: Xception, Facial recognition, Image dataset, Video dataset, Model accuracy, Digital security.
-
INTRODUCTION
Quick developments in deepfake technology have significantly changed the digital landscape by making it possible to create incredibly realistic edited videos and images. Deepfakes have been abused to cause serious issues like political disinformation, celebrity impersonation, the dissemination of false information, and fraudulent schemes,
despite the fact that they may be helpful in accessibility, education, and film. These synthetic media productions, which primarily affect celebrities and public figures, are putting digital trust and content integrity at greater risk.
This study proposes a deepfake detection technique that combines convolutional neural networks, specifically Xception, with a long-range attention mechanism. Our method frames detection as a fine-grained classification problem to better identify the subtle differences between real and fake facial data, in contrast to traditional approaches that treat detection as binary classification. The suggested system follows the entire machine learning pipeline: data collection, cleaning, analysis, labeling, model training, and deployment.
We used both image and video datasets to build the model. The image collection was carefully divided into training, testing, and validation groups to ensure accurate evaluation. The model works well with both static and dynamic content, as evidenced by its 97% accuracy on image inputs and 85% accuracy on video inputs. This study offers a dependable and scalable method for detecting altered content and creating safe digital environments by fusing attention-based learning with spatial-temporal modeling.
Fig 1. ARCHITECTURE DIAGRAM
-
LITERATURE SURVEY
Research into detection techniques that can keep up with increasingly complex alteration methods is growing because deepfake content is becoming more and more prevalent. Traditional techniques often focused on subtle visual cues, such as irregular eye movements or facial animation artifacts. However, detecting deepfakes with conventional techniques has become more challenging since the advent of Generative Adversarial Networks (GANs). To automatically identify patterns that distinguish authentic from fake content, recent research has focused on deep learning techniques, specifically Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). Rössler et al. showed how well Xception, which employs depthwise separable convolutions for efficient feature extraction, detects fake face imagery using the FaceForensics++ dataset.
Popular attention-driven frameworks for temporal learning include transformer models and associated long-distance attention variants, which model temporal and contextual relationships in visual sequences and have become helpful tools for spotting irregularities in the temporal structure of video frames. Several researchers have proposed spatial-temporal networks to jointly model inter-frame interactions and intra-frame artifacts. The DeepRhythm model, for instance, takes advantage of the temporal rhythms of facial blood flow, while other models, such as LipForensics, concentrate on errors in lip-sync and audio-visual coherence. Notwithstanding these developments, problems still exist, including a dearth of diverse and annotated datasets, bias in the datasets, and problems with generalization.
Researchers have used pre-training, transfer learning, and data augmentation on large datasets such as FaceForensics++ and
DFDC (Deepfake Detection Challenge) to overcome dataset barriers. These datasets provide more dependable training because they contain both authentic and fraudulent content from various sources. But new datasets and approaches are always needed as deepfake techniques advance. Multi-modal deepfake detection is an additional innovative technique that improves accuracy by combining textual, auditory, and visual cues. Multi-modal learning, which incorporates video frames, audio tracks, and transcripts, has demonstrated better results in related tasks such as content moderation and fake news detection (particularly on platforms like YouTube).
These days, spectrogram analysis, semantic segmentation, natural language processing (NLP) for comment analysis, and cross-modal retrieval techniques are all used in AI-powered moderation systems. Even though extra tools like Random Forests, Support Vector Machines (SVMs), and Transfer Learning have been adopted, deep learning continues to be the industry standard because of its exceptional feature extraction capabilities. Long-distance attention mechanisms in deepfake detection are still not well understood, despite recent advancements. By capturing long-range correlations between frames, these mechanisms enable models to identify subtle but persistent forgery patterns. By incorporating long-distance attention into the detection process, this study bridges that gap and improves the model's overall resilience and temporal reasoning.
-
METHODOLOGY
Using a fine-grained classification framework improved by long-distance attention, this paper uses a thorough and organized approach to deepfake detection. The suggested system makes use of a multi-phase pipeline that moves methodically from collecting data to deploying the model. Every step is essential to guaranteeing the detection system's resilience and utility. The actions listed below were taken:
-
DATA COLLECTION
Both image and video datasets are used to train and assess the model. The image collection, which includes both authentic and altered portraits of people, is separated into subsets for training (80%), validation (10%), and testing (10%). Real and fake video clips from benchmark datasets like FaceForensics++ and the Deepfake Detection Challenge are included in the video dataset. For temporal analysis, video frames were extracted at predetermined intervals.
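Fixed-interval frame extraction of this kind can be sketched as a small helper that maps a video's native frame rate to a target sampling rate (the 5 fps target comes from the preprocessing section; the function name and interface are illustrative):

```python
def sample_frame_indices(total_frames, video_fps, target_fps=5):
    """Return the indices of frames to keep when subsampling a video
    at target_fps, given its native frame rate and total frame count."""
    if target_fps >= video_fps:
        return list(range(total_frames))     # nothing to skip
    step = video_fps / target_fps            # frames between kept samples
    indices, t = [], 0.0
    while round(t) < total_frames:
        indices.append(int(round(t)))
        t += step
    return indices
```

A decoder such as OpenCV's `VideoCapture` would then seek to exactly these indices, so every clip contributes the same temporal density regardless of its native frame rate.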
-
DATA CLEANING
To eliminate duplicate, low-resolution, fuzzy, or corrupt entries, preprocessing was applied to the raw image and video data. To enable efficient frame-by-frame examination, the frame rate and resolution of each video file were standardized. To preserve the dataset's integrity, samples with missing information or those that were incorrectly classified were eliminated.
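Cleaning of this kind can be sketched as a simple filter that drops exact duplicates and under-sized samples; the `(bytes, (width, height))` sample layout and the size thresholds are hypothetical conventions for illustration:

```python
import hashlib

def clean_samples(samples, min_width=128, min_height=128):
    """Drop exact duplicates (by content hash) and low-resolution entries.
    samples: iterable of (image_bytes, (width, height)) tuples."""
    seen, kept = set(), []
    for data, (w, h) in samples:
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen or w < min_width or h < min_height:
            continue                      # duplicate or too small: discard
        seen.add(digest)
        kept.append((data, (w, h)))
    return kept
```

Blur and corruption checks (e.g. a variance-of-Laplacian threshold) would slot into the same loop.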
-
DATA MANIPULATION AND PREPROCESSING
Video samples were recorded at five frames per second (fps). Multi-task cascaded convolutional networks (MTCNNs) were used to detect and crop face regions. Several preprocessing methods were used to increase the model's generalizability. Images were standardized and reduced to a fixed resolution to ensure consistency and increase training efficacy across the dataset. Histogram equalization was used
to adjust lighting conditions and enhance the contrast of facial features across frames.
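Histogram equalization as applied here can be sketched in NumPy; this is a minimal illustration for 8-bit grayscale images, and a production pipeline would typically use OpenCV's `cv2.equalizeHist` instead:

```python
import numpy as np

def equalize_histogram(gray):
    """Histogram-equalize an 8-bit grayscale image (H x W uint8 array),
    spreading the occupied intensity range over the full 0..255 scale."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]                 # first non-zero CDF value
    denom = max(gray.size - cdf_min, 1)       # avoid division by zero
    # Map each intensity so the output histogram is approximately flat
    lut = np.clip(np.round((cdf - cdf_min) / denom * 255), 0, 255).astype(np.uint8)
    return lut[gray]
```

Applied per frame, this flattens lighting differences so the same face looks similar to the network across dim and bright clips.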
-
DATA ANNOTATION
Each image and video frame used in this study was classified as either authentic or fraudulent. Precise labeling is crucial for the model to distinguish between authentic and modified content. The following annotation techniques were applied:
-
Source Dataset Metadata: The FaceForensics++ and DFDC datasets supplied ground-truth labels indicating whether the media was real or had been altered. These metadata labels were the primary source of annotations.
-
Manual Verification: Part of the data was manually reviewed using visualization tools. This ensured that the labels were accurate and assisted in identifying samples that were unclear or mislabeled.
-
Automated Heuristics: Using recognized manipulation signatures, like compression artifacts, erratic lip-sync, or odd eye blinking, samples were automatically verified or marked for additional examination. These heuristics were created using prior research and baseline models that had already been trained.
-
-
DATA SPLITTING
-
Image data: Separated into training, validation, and testing subsets.
-
Video data: Split into discrete clips at the video level to prevent information leaking between the training and test datasets. This stops frames from the same video from appearing in different subsets.
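The leakage-free split can be sketched as follows; the `<video_id>_<frame_no>.jpg` filename layout and the 80/10/10 ratios are illustrative conventions:

```python
import random

def split_by_video(frame_paths, train=0.8, val=0.1, seed=42):
    """Group frames by source video, then split whole videos so that no
    video contributes frames to more than one subset (prevents leakage)."""
    by_video = {}
    for path in frame_paths:
        vid = path.rsplit("_", 1)[0]          # assumes <video_id>_<frame_no>.jpg
        by_video.setdefault(vid, []).append(path)
    videos = sorted(by_video)
    random.Random(seed).shuffle(videos)       # deterministic shuffle
    n = len(videos)
    n_train, n_val = int(n * train), int(n * val)
    groups = {
        "train": videos[:n_train],
        "val": videos[n_train:n_train + n_val],
        "test": videos[n_train + n_val:],
    }
    return {name: [f for v in vids for f in by_video[v]]
            for name, vids in groups.items()}
```

Splitting at the frame level instead would let near-identical frames from one clip land in both train and test, inflating the measured accuracy.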
-
-
MODEL SELECTION
We chose a set of deep learning models that are appropriate for both temporal and spatial forgery detection:
-
Xception Network: Chosen for its proficiency in depth-wise separable convolutions, which learn fine-grained features.
-
Face Recognition Model: This model recognizes differences in identification over time and extracts face embeddings.
-
Long-Distance Attention Module: Used to improve temporal feature extraction and detect minute changes between consecutive frames.
This combination allows the model to detect anomalies at both the spatial (frame) and temporal (sequence) levels.
-
-
MODEL TRAINING
The model training phase, which sought to enhance generalization, prevent overfitting, and maximize learning efficacy, used both image and video datasets. The following combinations and strategies were employed:
-
Loss Function: Binary cross-entropy was used to quantify the discrepancy between expected and actual labels. This
loss function works well for binary classification issues such as deepfake detection.
-
Optimizer: The Adam optimizer was employed due to its efficacy with sparse gradients and adaptive learning features. A learning rate scheduler was included in order to dynamically adjust the learning rate in response to validation performance.
-
Batch Size: Following testing, it was found that a batch size of 32 or 64 ensured consistent model convergence by balancing memory usage and training velocity.
-
Early Stopping: To lessen the chance of overfitting, early stopping was employed. The training procedure was stopped when the validation loss stopped improving over successive epochs.
-
Data Augmentation: Techniques like rotation, zooming, horizontal flipping, brightness manipulation, and noise injection were employed to fictitiously expand the training dataset's size. This improved the model's robustness and reduced overfitting.
-
Training Environment: TensorFlow/Keras with GPU acceleration was utilized for training, significantly cutting down on training time and enabling faster analysis of larger models and datasets.
High accuracy and good generalization performance across both static (image) and dynamic (video) inputs were made possible by this training setup.
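The loss and early-stopping behaviour described above can be sketched independently of any framework; this is a minimal illustration, and in practice Keras's `BinaryCrossentropy` loss and `EarlyStopping` callback provide the same logic:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean BCE between labels (0 = real, 1 = fake) and predicted probabilities."""
    p = np.clip(y_pred, eps, 1 - eps)        # guard against log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

class EarlyStopper:
    """Signal a stop when validation loss fails to improve for `patience` epochs."""
    def __init__(self, patience=3):
        self.patience, self.best, self.bad_epochs = patience, float("inf"), 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

The training loop would call `should_stop` once per epoch with the current validation loss and break as soon as it returns `True`.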
-
-
MODEL EVALUATION AND TESTING
Both image and video datasets were used to thoroughly assess the trained models' performance. To guarantee a thorough grasp of the model's efficacy in identifying deepfake content, a range of common categorization criteria were used. The evaluation metrics listed below were employed:
-
Accuracy: Measures the model's overall correctness by dividing the number of correctly predicted samples by the total number of samples.
-
Precision: Indicates the percentage of positive identifications (samples flagged as fake) that were actually correct; high precision means few false positives.
-
Recall (Sensitivity): Shows the model's ability to detect genuine deepfakes, reducing false negatives.
-
F1-Score: A fair evaluation metric that is especially useful when the dataset is unbalanced, the F1-Score is the harmonic mean of recall and precision.
-
AUC (Area Under the ROC Curve): Indicates how well the model can differentiate between authentic and fraudulent content at different threshold values. An increased AUC suggests improved discrimination performance.
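For reference, the threshold-based metrics above can be computed with a short helper; this is a sketch, and `sklearn.metrics` offers equivalent battle-tested functions:

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = fake)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

AUC is omitted here because it needs continuous scores rather than hard labels; `sklearn.metrics.roc_auc_score` handles that case.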
The model achieved:
-
97% accuracy on image datasets, due to high- resolution spatial features.
-
85% accuracy on video datasets, slightly lower due to motion blur, compression artifacts, and more complex temporal dynamics.
-
MODEL DEPLOYMENT
To make the solution practical and accessible, a lightweight deployment framework was developed:
-
Flask API: Provides endpoints for uploading and detecting deepfake content.
-
TensorFlow Lite: Optimizes the trained model for real-time inference on edge devices.
-
Webcam Interface: Integrated to detect manipulated faces in live video streams.
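A minimal Flask endpoint of the kind described might look as follows; the `predict_fake_probability` stub stands in for the actual TensorFlow Lite inference and is purely illustrative:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_fake_probability(media_bytes: bytes) -> float:
    # Placeholder: a real deployment would decode the media, run face
    # detection, and invoke the TFLite interpreter here. (Hypothetical stub.)
    return 0.5

@app.route("/detect", methods=["POST"])
def detect():
    """Accept uploaded media bytes and return a deepfake verdict as JSON."""
    media = request.get_data()
    score = predict_fake_probability(media)
    return jsonify({
        "fake_probability": score,
        "label": "fake" if score >= 0.5 else "real",
    })
```

A client would POST the raw image or video bytes to `/detect` and read the `label` field from the JSON response.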
-
-
RESULTS
The performance of the proposed deepfake detection technique was evaluated using a range of quantitative metrics on both image and video datasets. The results confirm the model's capacity to distinguish between modified and real media in a range of forms.
-
PERFORMANCE OF IMAGE DATASETS
When tested on the image dataset, which was separated into subsets for training (80%), validation (10%), and testing (10%), the model showed excellent accuracy and generalization. The following are important metrics:
-
Accuracy: 97%
-
Precision: 96.5%
-
Recall: 97.2%
-
F1-Score: 96.8%
-
AUC (Area Under the ROC Curve): 0.98
These results demonstrate that facial recognition embeddings, long-distance attention, and the Xception architecture together were able to detect subtle spatial anomalies in still photos. The application of sophisticated data augmentation, balanced sampling, and label verification enabled the model to identify even finely altered fake photos with few false positives.
Fig 2. Confusion Matrix
-
-
PERFORMANCE OF THE VIDEO DATASET
Additionally, deepfake videos from datasets like FaceForensics++ and DFDC were used to test the model. The deepfakes were generated under various conditions, such as differing compression levels, illumination, and motion blur. Both frame-level analysis and temporal consistency tests utilizing long-distance attention were part of the video evaluation. The following was the performance:
-
Accuracy: 85%
-
Precision: 84.1%
-
Recall: 83.6%
-
F1-Score: 83.8%
-
AUC: 0.90
The model continued to perform well in spite of the inherent difficulties in video-based detection, including inter-frame noise, compression errors, and fast facial movements. When it came to identifying temporal irregularities like lip-sync errors and awkward frame transitions, the attention mechanism was very successful.
-
-
ANALYTICAL PERSPECTIVES
A comparative analysis shows that:
-
Because static features are simpler to record and examine, image-based detection produces higher accuracy.
-
Despite being more complicated, video-based detection benefits from temporal modeling that can reveal frame-to-frame forgery patterns.
-
Performance gains were mostly attributed to the long-distance attention layer, especially in the video domain, where long-range dependencies are essential.
-
-
-
DISCUSSION
The experimental results of the study clearly show the potential of combining long-distance attention, convolutional neural networks (CNNs), and facial recognition methods for identifying deepfake content in images and videos. The robustness and adaptability of the proposed framework were validated by the model's remarkable 97% accuracy on the image dataset and robust 85% accuracy on the video dataset.
One significant conclusion drawn from the data is that deepfake detection on images usually yields superior results to detection on videos. This is probably because images are typically static and steady, so convolutional filters and fine-grained classifiers can more easily identify facial abnormalities such as blending errors, pixel-level inconsistencies, or disparities around the eyes. On the other hand, dynamic problems like motion blur, shifting lighting, and compression artifacts are often present in video-based detection, which reduces model confidence and obscures forgery evidence.
Adding a long-distance attention mechanism significantly improved performance by enabling the model to identify temporal abnormalities across frames, such as abrupt transitions, lip-sync issues, or shifting facial landmarks in video sequences. Unlike standard CNNs, which are limited to local context, this attention-based layer allowed the model to handle global dependencies in time-series data, making it more effective at identifying subtle anomalies that occur over time rather than within a single frame.
Furthermore, by evaluating identity-level coherence, the application of facial recognition-based embedding comparison provides an additional layer of validation. When small facial movements might otherwise be mistaken for manipulations, this method helped eliminate false positives.
But some problems still exist. Despite the impressive 85% video accuracy, the system still struggles with occlusion, low resolution, and inadequate lighting, among other real-world conditions. For increased robustness, more research into multi-modal approaches that take into account speech patterns, facial dynamics, and audio inputs is required. Furthermore, frequent model upgrades and exposure to a variety of training datasets are required to prevent performance degradation, because deepfake generation techniques are constantly changing.
Overall, the findings highlight the value of hybrid models that include attention mechanisms, facial recognition, and spatial- temporal analysis. The results lend credence to the implementation of such algorithms in practical applications where quick and accurate deepfake detection is essential, like social media moderation, news verification, video surveillance, and cybersecurity.
-
CONCLUSION
We propose a robust deepfake detection system that leverages deep CNN topologies, face recognition, and an extended attention mechanism to achieve better performance across both image and video data. The suggested approach successfully captures the minute temporal and spatial irregularities that differentiate authentic media from edited content by framing deepfake detection as a fine-grained classification task rather than a conventional binary problem. Evaluation results from
extensive testing on both datasets show that the system correctly detected deepfakes in 97% of static image samples and 85% of video samples. With the long-distance attention component being crucial in identifying frame-to-frame irregularities in videos, these results show that the model can handle both static and dynamic data.
This model's successful use across a range of datasets validates its potential for real-world implementation in domains like cybersecurity, law enforcement, and digital media verification. Despite the promising results, there are still problems, particularly when it comes to improving video detection in surroundings that are noisy or compressed. Future studies should look into multi-modal deepfake detection that combines adversarial training with textual signals, audio, and lip-sync analysis to further increase model resistance.
In conclusion, our study provides a versatile and high- performing solution to the increasing threat of synthetic media, paving the way for more secure and dependable digital communication networks.
-
REFERENCES
[1] Lu, Wei, et al. "Detection of Deepfake Videos Using Long-Distance Attention." IEEE Transactions on Neural Networks and Learning Systems, vol. 35, no. 7, Jul. 2024, pp. 9366-79. https://doi.org/10.1109/TNNLS.2022.3233063.
[2] Waseem, Saima, et al. "DeepFake on Face and Expression Swap: A Review." IEEE Access, vol. 11, 2023, pp. 117865-906. https://doi.org/10.1109/ACCESS.2023.3324403.
[3] Ur Rehman Ahmed, Naveed, et al. "Visual Deepfake Detection: Review of Techniques, Tools, Limitations, and Future Prospects." IEEE Access, vol. 13, 2025, pp. 1923-61. https://doi.org/10.1109/ACCESS.2024.3523288.
[4] Tolosana, Ruben, et al. "DeepFakes Evolution: Analysis of Facial Regions and Fake Detection Performance." Pattern Recognition. ICPR International Workshops and Challenges, edited by Alberto Del Bimbo et al., vol. 12665, Springer International Publishing, 2021, pp. 442-56. https://doi.org/10.1007/978-3-030-68821-9_38.
[5] Verdoliva, Luisa. "Media Forensics and DeepFakes: An Overview." IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 5, Aug. 2020, pp. 910-32. https://doi.org/10.1109/JSTSP.2020.3002101.
[6] Nirkin, Yuval, et al. "DeepFake Detection Based on the Discrepancy Between the Face and Its Context." arXiv, 2020. https://doi.org/10.48550/ARXIV.2008.12262.
[7] Li, Weichuang, et al. "Detection of GAN-Generated Images by Estimating Artifact Similarity." IEEE Signal Processing Letters, vol. 29, 2022, pp. 862-66. https://doi.org/10.1109/LSP.2021.3130525.
[8] Korshunov, Pavel, and Sebastien Marcel. "DeepFakes: A New Threat to Face Recognition? Assessment and Detection." arXiv, 2018. https://doi.org/10.48550/ARXIV.1812.08685.
[9] Tolosana, Ruben, et al. "Deepfakes and Beyond: A Survey of Face Manipulation and Fake Detection." Information Fusion, vol. 64, Dec. 2020, pp. 131-48. https://doi.org/10.1016/j.inffus.2020.06.014.
