DOI : https://doi.org/10.5281/zenodo.18802817
- Open Access

- Authors : Pooja Kajale, Prachi More, Shantanu Halgaonkar, Krushna Karande, Nikita Gite, Pooja Jadhav
- Paper ID : IJERTV15IS020474
- Volume & Issue : Volume 15, Issue 02, February 2026
- Published (First Online): 27-02-2026
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Machine Learning Based Deepfake Detection System Using Facial Features
Pooja Kajale
Dept. of CS & Design Engineering, Dr. V. V. Patil COE Ahilyanagar, India
Krushna Karande
Dept. of CS & Design Engineering, Dr. V. V. Patil COE Ahilyanagar, India
Prachi More
Dept. of CS & Design Engineering, Dr. V. V. Patil COE Ahilyanagar, India
Nikita Gite
Dept. of CS & Design Engineering, Dr. V. V. Patil COE Ahilyanagar, India
Shantanu Halgaonkar
Dept. of CS & Design Engineering, Dr. V. V. Patil COE Ahilyanagar, India
Pooja Jadhav
Dept. of CS & Design Engineering, Dr. V. V. Patil COE Ahilyanagar, India
Abstract – The rapid progress of deep learning and generative models has enabled the creation of highly realistic deepfake videos, making it increasingly difficult to distinguish manipulated content from real videos. Such synthetic videos can closely reproduce genuine facial expressions, head movements, and lip synchronization, which makes them hard to expose with traditional forensic methods. This poses a serious challenge to the safety and credibility of digital media. This paper describes a simple and efficient deepfake detector that considers both the visual specifics and the dynamics of moving faces in videos. Individual video frames are analyzed with a ResNet50 model, which learns facial texture and structural cues that can indicate manipulation. A Bidirectional LSTM network learns temporal variations in facial movements over time and identifies unusual behavioral patterns such as irregular blinking, abnormal head movement, and lip-sync errors. A Multi-Head Self-Attention mechanism additionally focuses on the facial regions and time intervals most prone to manipulation. The proposed end-to-end architecture offers a robust and scalable solution for detecting deepfake videos and can accommodate various deepfake generation approaches.
Index Terms – Deepfake Detection, ResNet50, Bidirectional LSTM (BiLSTM), Multi-Head Self-Attention, Facial Feature Analysis, Video Manipulation Detection, Multimedia Security
- INTRODUCTION
The explosive growth of artificial intelligence has sped up the creation of synthetic media, especially deepfake videos built on highly elaborate GAN architectures [22]-[24]. Deepfakes alter facial expressions, physical characteristics, and lip movements, often producing content that is virtually indistinguishable from real imagery. Although such technology enables creative uses in entertainment and education, it simultaneously poses significant security and ethical issues in areas such as political disinformation, impersonation-based scams, coercion, and social engineering attacks.
Current deepfake detection strategies address either static frames or continuous video sequences. Spatial detection methods examine frames one by one for aberrations in texture, boundary blending, or abnormal pixel patterns [12]. However, as GANs improve, spatial artefacts become less and less perceptible. Temporal detection methods, on the other hand, find inconsistencies in blinking behaviour [14], unnatural head motion [5], or lip-sync mismatch [6]. Such temporal cues are considerably more informative, but they can be bypassed by sophisticated models that produce seamless transitions. A nagging problem is that no single technique is universally effective. A hybrid methodology that captures both the visual texture of individual frames and the motion dynamics across frames is therefore imperative. This is the driving force behind a strong architecture consisting of ResNet50 for extracting detailed spatial features, BiLSTM for modelling temporal dependencies, and MHSA for highlighting manipulation-susceptible facial regions. The proposed model strengthens deepfake detection by simultaneously analysing pixel-level texture changes and higher-level facial dynamics.
The main goal of this paper is to propose a deepfake detection framework with a scalable architecture that generalises well across various datasets. The architecture is designed to detect minute inconsistencies in manipulated videos and hence offers a robust verification system for multimedia applications. Subsequent sections present an extensive literature review, the system methodology, mathematical modelling, an architectural description, and projected outcomes from theoretical evaluation.
- LITERATURE SURVEY
Deepfake detection has attracted much scholarly attention. Early investigations used convolutional neural network-based feature extraction to scrutinize inconsistencies and superfluous artefacts in texture. Al-Dulaimi and Kurnaz [1] proposed a hybrid CNN and LSTM approach for image-level detection and showed that combining spatial and temporal representations improves classification accuracy. Sabir et al. [2] used recurrent convolutional approaches to detect manipulation patterns across a series of frames, making the method suitable for video-level analysis of deepfakes. Pant et al. [3] discussed deepfake generation techniques and highlighted the importance of LSTM models for capturing manipulation sequences. FakeCatcher [4] proposed a unique biological-signal detection mechanism, using photoplethysmography-based pulse inconsistencies that are hard to synthesise artificially. Optical flow models [5] studied pixel movement patterns to tell apart real and fake videos. Lip-sync detection strategies [6] used the synchronisation of audio and lip motion to reveal forgeries. Although effective in some situations, such approaches often require high-quality inputs. Transformer and attention-based techniques have also gained significant momentum. CViT [7] and multitask learning models [8] captured global dependencies within frames, while hierarchical frameworks such as HOLA [9] increased detection robustness through contextual aggregation. Shen et al. [10] considered language-based priors for deepfake detection, and Tariq et al. [11] proposed multimodal and explainable deepfake detection frameworks based on motion, texture, and audio. Datasets such as FaceForensics++ [18], DFDC [19], Celeb-DF [20], and DeeperForensics [21] have become benchmark tests for evaluating detection models. Concurrently, improvements in generative adversarial networks (ProgressiveGAN [23], StyleGAN [24]) drive the development of stronger detection mechanisms. Korshunov and Marcel [17], [25] explored GAN fingerprints in the frequency spectrum, providing new insights into artefact-level detection. Collectively, the literature calls for hybrid models able to handle high-quality and temporally stable deepfakes. The proposed ResNet50 + BiLSTM + MHSA architecture is in line with these findings.
- PROPOSED METHODOLOGY
The methodology follows an end-to-end pipeline that integrates multiple processing stages.
- Data Collection and Preprocessing
The model uses standard deepfake datasets: FaceForensics++, DFDC, Celeb-DF, and DeeperForensics. Preprocessing includes frame extraction, face detection using Dlib/MTCNN, face cropping and alignment using landmark points, resizing to 224×224 pixels, and normalisation according to ImageNet statistics. Attention-map and region-of-interest processing relies on landmark detection for the eyes, mouth, and jaw.
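For illustration, a minimal Python sketch of this preprocessing stage, assuming OpenCV for frame extraction and facenet-pytorch's MTCNN for face detection and alignment; the function name extract_faces and the 30-frame cap are our assumptions, not specified in the paper:

```python
import cv2
import torch
from PIL import Image
from facenet_pytorch import MTCNN          # face detector/aligner (assumed choice)
from torchvision import transforms

# ImageNet statistics, as stated in the preprocessing description
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

mtcnn = MTCNN(image_size=224, margin=20, post_process=False)

def extract_faces(video_path: str, max_frames: int = 30) -> torch.Tensor:
    """Return a (k, 3, 224, 224) tensor of normalized, aligned face crops."""
    cap = cv2.VideoCapture(video_path)
    faces = []
    while len(faces) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        face = mtcnn(rgb)                   # detect, crop, and align; None if no face
        if face is not None:
            faces.append(normalize(face / 255.0))   # scale to [0, 1], then normalize
    cap.release()
    if not faces:
        raise RuntimeError("no face detected in any sampled frame")
    return torch.stack(faces)
```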
- Extraction of the Spatial Features by ResNet50
With weights pretrained on ImageNet, ResNet50 extracts deep visual information from each face frame. Residual connections capture both high-level semantic information and fine-grained textures. The output of the last convolutional block serves as the frame-level embedding, which is sent to the temporal module.
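A minimal PyTorch sketch of this step, using torchvision's pretrained ResNet50 with the classification head removed so the globally pooled output of the last convolutional block becomes the 2048-dimensional frame embedding:

```python
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()    # drop the ImageNet classifier, keep 2048-d features
backbone.eval()

@torch.no_grad()
def frame_embeddings(frames: torch.Tensor) -> torch.Tensor:
    """(k, 3, 224, 224) preprocessed face crops -> (k, 2048) embeddings."""
    return backbone(frames)
```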
- Temporal Modelling using BiLSTM
The bidirectional LSTM network takes sequences of ResNet50 embeddings and thus captures both forward and backward temporal dependencies. This module is key to detecting anomalies such as irregular blinking, lip-sync mismatches, head-pose instability, and inter-frame discontinuities.
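A sketch of the temporal module under our assumptions (a hidden size of 512 and a single layer are illustrative; the paper does not fix these hyperparameters):

```python
import torch
import torch.nn as nn

class TemporalModel(nn.Module):
    """Bidirectional LSTM over the sequence of ResNet50 frame embeddings."""
    def __init__(self, feat_dim: int = 2048, hidden: int = 512):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden,
                              num_layers=1, batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, k, 2048) -> (batch, k, 2 * hidden); each time step
        # concatenates the forward and backward hidden states.
        out, _ = self.bilstm(x)
        return out
```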
- Multi-Head Self-Attention (MHSA)
MHSA refines the sequence embeddings by enabling selective attention over temporal and spatial features. Multiple attention heads capture different appearances of manipulation, allowing the model to emphasise the eye and mouth areas, monitor texture smoothness, and respond to sudden motion changes.
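A sketch using PyTorch's built-in multi-head attention in self-attention mode; eight heads and the residual/LayerNorm wrapper are our assumptions:

```python
import torch
import torch.nn as nn

class AttentionRefiner(nn.Module):
    """Multi-head self-attention over the BiLSTM output sequence."""
    def __init__(self, dim: int = 1024, heads: int = 8):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(embed_dim=dim, num_heads=heads,
                                          batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # Queries, keys, and values all come from the same sequence
        # (batch, k, dim), so each frame attends to every other frame.
        attended, _ = self.mhsa(seq, seq, seq)
        return self.norm(seq + attended)    # residual connection
```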
- Classification Layer
A fully connected classifier with a softmax output produces the probabilities for the real vs. fake labels. Confidence scores derived from the softmax probabilities are exported to aid interpretability.
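One plausible way the pieces fit together, reusing the module sketches above; mean pooling over frames before the classifier is our assumption:

```python
import torch
import torch.nn as nn

class DeepfakeDetector(nn.Module):
    """ResNet50 backbone -> BiLSTM -> MHSA -> fully connected classifier."""
    def __init__(self, backbone: nn.Module, temporal: nn.Module,
                 attention: nn.Module, dim: int = 1024):
        super().__init__()
        self.backbone, self.temporal, self.attention = backbone, temporal, attention
        self.classifier = nn.Linear(dim, 2)          # z = Wx + b, Real vs. Fake

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        b, k, c, h, w = frames.shape                 # (batch, frames, 3, 224, 224)
        feats = self.backbone(frames.view(b * k, c, h, w)).view(b, k, -1)
        seq = self.attention(self.temporal(feats))   # (b, k, dim)
        return self.classifier(seq.mean(dim=1))      # logits per video

# Softmax over the logits yields the exported confidence scores:
#   probs = torch.softmax(model(frames), dim=-1)
```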
- SYSTEM DESIGN AND ARCHITECTURE
The proposed system architecture is designed as an end-to-end deepfake detection framework that processes images or videos uploaded by the user and produces a classification result along with a confidence score. The system begins with the user interface, where the user uploads an image or video and later views the detection results. The input is securely transmitted to the backend through an API interface using REST/HTTPS protocols to ensure safe communication. An input manager handles the incoming data and forwards it to the preprocessing stage. During preprocessing, video inputs are decomposed into frames, faces are detected and aligned, and normalization is applied to standardize the data, ensuring robustness against variations in pose, lighting, and resolution.
Following preprocessing, the system performs spatial feature extraction. For video inputs, a Bi-directional Long Short-Term Memory (BiLSTM) network is employed to capture temporal dependencies and motion-based inconsistencies across frames, which are common indicators of deepfakes. These extracted features are then passed to an attention module based on multi-head self-attention, which enables the model to focus on the most discriminative facial regions and temporal segments that contribute significantly to real or fake classification. The refined features are subsequently combined in the feature fusion stage, where spatial, temporal, and attention-weighted features are integrated into a unified representation.
Fig. 1. System Architecture of the Proposed Deepfake Detection System
The fused feature vector is then fed into the classification module, which consists of fully connected layers followed by a Softmax function to predict whether the input is real or fake. The output module generates the final decision along with a confidence score, providing both interpretability and reliability of the prediction. Optionally, the system supports data storage, where extracted features, predictions, and confidence scores can be stored for auditing or future analysis. In parallel, an offline training phase is carried out by researchers using benchmark datasets such as FaceForensics++ and DFDC. During this phase, CNN and RNN-based models are trained and fine-tuned under supervised learning to optimize detection performance before deployment.
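Purely for illustration, a minimal backend endpoint matching the described REST upload flow; the route name, response fields, and run_detector helper are hypothetical, not part of the paper:

```python
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

@app.post("/detect")
async def detect(file: UploadFile = File(...)):
    data = await file.read()                  # uploaded image or video bytes
    label, confidence = run_detector(data)    # hypothetical inference helper
    return {"label": label, "confidence": confidence}
```

HTTPS termination and authentication would sit in front of such an endpoint in a deployment.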
- MATHEMATICAL MODEL
- Input Representation
Let
$$M = \{m_1, m_2, \dots, m_n\} \tag{1}$$
denote the set of video samples, where $n$ is the total number of videos. Each video $m$ is represented as a temporal sequence of facial frames. This formulation enables joint spatial-temporal modeling of facial appearance and motion patterns, which is essential for deepfake detection.
- Frame Extraction
Each video $m$ is decomposed into a sequence of $k$ consecutive frames:
$$F(m) = \{I_1, I_2, \dots, I_k\} \tag{2}$$
where $I_i$ denotes the $i$-th frame. This step converts the video stream into discrete image samples suitable for frame-level processing and temporal analysis.
- Face Detection and Preprocessing
For each frame $I_i$, a face detection algorithm identifies the facial region via bounding-box coordinates $b_i$. The detected face is cropped as:
$$X_i = \mathrm{Crop}(I_i, b_i) \tag{3}$$
The cropped face image undergoes preprocessing:
$$x_i = \mathrm{Preprocess}(X_i) \tag{4}$$
where preprocessing includes resizing to $224 \times 224$, normalization using ImageNet statistics, and illumination correction. These operations reduce variations due to lighting, scale, and background noise.
- ResNet50 Feature Extraction
Each preprocessed face image $x_i$ is passed through a ResNet50 network pretrained on ImageNet:
$$f_i = \phi(x_i), \quad f_i \in \mathbb{R}^d \tag{5}$$
where $\phi(\cdot)$ represents the ResNet50 feature extraction function. The extracted feature vector $f_i$ encodes discriminative spatial information such as facial textures, edges, and structural artifacts introduced during manipulation.
- BiLSTM Sequence Modeling
To capture temporal dependencies across frames, the sequence
$$\{f_1, f_2, \dots, f_k\} \tag{6}$$
is processed using a Bidirectional Long Short-Term Memory (BiLSTM) network. The hidden state at time step $t$ is given by:
$$h_t = [\overrightarrow{h_t};\, \overleftarrow{h_t}] \tag{7}$$
where $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ represent forward and backward temporal dependencies, respectively. This enables detection of temporal inconsistencies such as irregular blinking, lip-sync mismatches, and unnatural head movements.
- Self-Attention Mechanism
To determine which frames contribute more significantly to the prediction, the model applies a self-attention mechanism:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V \tag{8}$$
Here, each frame's feature vector is mapped to the query $Q$, key $K$, and value $V$ matrices. The attention mechanism assigns higher weights to the more informative frames, improving robustness against irrelevant or noisy frames.
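Eq. (8) transcribes directly to code; the sketch below is the single-head case underlying each head of the MHSA module, assuming Q, K, and V of shape (k, d_k):

```python
import math
import torch

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (k, k) frame-to-frame weights
    return torch.softmax(scores, dim=-1) @ V            # attention-weighted values
```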
Fig. 2. Working Flow Of Deepfake Detection
- Classification
The attention-enhanced feature representation is passed through a fully connected layer:
$$z = Wx + b \tag{9}$$
The softmax function converts the logits into class probabilities:
$$p(y = c \mid x) = \frac{\exp(z_c)}{\sum_{j=1}^{C} \exp(z_j)} \tag{10}$$
where $C = 2$ corresponds to the Real and Fake classes. This ensures a valid probability distribution:
$$\sum_{c=1}^{C} p(y = c \mid x) = 1 \tag{11}$$
- Loss Function
The model is trained using cross-entropy loss with $\ell_2$ regularization to prevent overfitting:
$$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p\left(y^{(i)} \mid x^{(i)}\right) + \lambda \lVert \theta \rVert_2^2 \tag{12}$$
This loss penalizes incorrect predictions and encourages the model to generalize well by limiting large parameter values.
- Video-Level Aggregation
Since predictions are made at the frame level, the final video-level probability is obtained by averaging:
$$p^{(m)}(c) = \frac{1}{k} \sum_{i=1}^{k} p(c \mid x_i) \tag{13}$$
This aggregation ensures that the model's judgment is based on the overall evidence from all frames rather than a single moment. The same computation is used for all classes.
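A hedged sketch connecting Eqs. (10)-(13) to PyTorch: cross-entropy covers the log-probability term, the optimizer's weight_decay plays the role of the $\lambda \lVert \theta \rVert_2^2$ penalty, and frame probabilities are averaged for the video-level decision. The hyperparameter values are illustrative:

```python
import torch
import torch.nn as nn

def make_training_objects(model: nn.Module):
    criterion = nn.CrossEntropyLoss()        # Eq. (10) + log term of Eq. (12)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                                 weight_decay=1e-4)   # L2 penalty of Eq. (12)
    return criterion, optimizer

def video_probability(frame_logits: torch.Tensor) -> torch.Tensor:
    """Eq. (13): average per-frame class probabilities over the k frames."""
    frame_probs = torch.softmax(frame_logits, dim=-1)   # (k, 2)
    return frame_probs.mean(dim=0)                      # (2,) video-level p(c)
```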
- EXPECTED RESULTS AND ANALYSIS
- Dataset Overview
After preprocessing, frame extraction, face detection, normalization, and data augmentation, the deepfake video dataset was divided into training, validation, and testing subsets. The dataset consisted of two categories: Real and Fake videos. The preprocessing steps ensured uniform frame size, improved data quality, and reduced noise.
TABLE I
DETAILED DISTRIBUTION OF VIDEOS USED FOR TRAINING, VALIDATION, AND TESTING

Dataset Split   Real Videos   Fake Videos   Total
Training        8,500         8,776         17,276
Validation      1,300         1,290         2,590
Testing         650           661           1,311

- Training and Validation Performance
The training and validation performance of the proposed model is illustrated using accuracy and loss curves. The accuracy curve shows a steady increase during training, indicating effective learning of discriminative features. The close alignment between training and validation accuracy demonstrates good generalization capability and minimal overfitting. Similarly, the loss curve exhibits a consistent decline in both the training and validation phases, confirming stable convergence of the optimization process. These observations indicate that the proposed architecture effectively learns meaningful spatial and temporal features from video data.
Fig. 3. Training and validation accuracy curve of the proposed deepfake detection model
- Confusion Matrix Analysis
The confusion matrix provides a detailed evaluation of classification performance across the real and fake video classes. A high number of correctly classified samples is observed for both categories, reflecting the robustness of the proposed deepfake detection system. The low misclassification rate further validates the effectiveness of the feature extraction and temporal modeling strategy employed.
Fig. 4. Training and validation loss curve of the proposed deepfake detection model.
Fig. 5. Confusion matrix of the proposed deepfake detection system.
- CONCLUSION AND FUTURE WORK
This work presents a hybrid deepfake detection framework that integrates ResNet50, BiLSTM, and a Multi-Head Self-Attention mechanism to effectively analyze both spatial and temporal characteristics of facial data. ResNet50 is employed to extract high-level spatial features from individual video frames, capturing subtle visual artifacts introduced during manipulation. Temporal dependencies and motion inconsistencies across consecutive frames are then modeled using a BiLSTM network, enabling the system to detect dynamic irregularities that are difficult to identify through spatial analysis alone. Furthermore, the Multi-Head Self-Attention module enhances the feature representation by emphasizing the most discriminative regions and temporal segments, thereby improving the robustness and accuracy of the detection process for manipulated videos with fine-grained alterations.
The proposed architecture offers a comprehensive and scalable solution for identifying deepfake content, demonstrating strong potential based on projected results. While the current implementation focuses primarily on visual cues, future research will explore the integration of audio-based features to further strengthen detection performance. Additional directions include real-time performance optimization, deployment on resource-constrained mobile devices, and enhancing the model's generalization capability to effectively handle emerging and previously unseen deepfake generation techniques. These advancements aim to improve the practical applicability and resilience of deepfake detection systems in real-world scenarios.
- ACKNOWLEDGMENT
We would like to express our sincere gratitude to Prof. P.P. Kajale, our guide, for her continuous guidance, support, and encouragement throughout this project. We also extend our thanks to the faculty and staff of the Department of Computer Science and Design, Dr. Vithalrao Vikhe Patil College of Engineering, Ahmednagar, for providing the resources and environment necessary to carry out this research.
- REFERENCES
- O. A. H. H. Al-Dulaimi and S. Kurnaz, "A Hybrid CNN-LSTM Approach for Precision Deepfake Image Detection," Electronics, vol. 13, no. 9, pp. 1–15, 2024.
- E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, and P. Natarajan, "Recurrent Convolutional Strategies for Face Manipulation Detection in Videos," in Proc. IEEE CVPR Workshops, pp. 80–87, 2019.
- S. Pant, C. Gosavi, and S. Barekar, "Deepfake Detection using LSTM and Survey of Deepfake Creation Technologies," Int. J. Intelligent Systems and Applications in Engineering, vol. 12, no. 6, pp. 840–845, 2023.
- U. Ciftci, I. Demir, and L. Yin, "FakeCatcher: Detection of Synthetic Portrait Videos Using Biological Signals," in Proc. IEEE/CVF CVPR, pp. 10159–10169, 2020.
- I. Amerini, L. Galteri, R. Caldelli, and A. Del Bimbo, "Deepfake Video Detection through Optical Flow Based CNN," in Proc. IEEE CVPR Workshops, pp. 1–9, 2019.
- S. Agarwal, S. Farid, M. Guarnera, H. Li, and R. Ng, "Detecting Deepfake Videos Using Lip-Sync Inconsistencies," in Proc. IEEE/CVF CVPR Workshops, Seattle, WA, USA, June 2020, pp. 1–10, doi: 10.1109/CVPRW50498.2020.00123.
- D. Wodajo and S. Atnafu, "Deepfake Video Detection Using Convolutional Vision Transformer," arXiv preprint arXiv:2103.13054, 2021. Available: https://arxiv.org/abs/2103.13054
- M. Zou, B. Yu, Y. Zhan, S. Lyu, and K. Ma, "Semantics-Oriented Multitask Learning for Deepfake Detection," arXiv preprint arXiv:2403.04192, 2024. Available: https://arxiv.org/abs/2403.04192
- X. Wu, Y. Li, Y. Zhou, Z. Liu, and S. Lyu, "HOLA: Hierarchical Contextual Aggregation for Deepfake Detection," arXiv preprint arXiv:2507.22781, 2025. Available: https://arxiv.org/abs/2507.22781
- G. Shen, Y. Li, and X. Yang, "AuthGuard: Generalizable Deepfake Detection via Language Guidance," arXiv preprint arXiv:2506.04501, 2025. Available: https://arxiv.org/abs/2506.04501
- S. Tariq, A. Kumar, and M. Rahman, "Multimodal, Explainable, and Interactive Deepfake Detection Framework," arXiv preprint arXiv:2508.07596, 2025. Available: https://arxiv.org/abs/2508.07596
- P. Zhou, X. Han, V. I. Morariu, and L. S. Davis, "Two-Stream Neural Networks for Tampered Face Detection," in Proc. IEEE CVPR Workshops, 2018.
- Z. Sun, Y. Han, Z. Hua, N. Ruan, and W. Jia, "Improving the Efficiency and Robustness of Deepfakes Detection through Precise Geometric Features," in Proc. IEEE/CVF CVPR, pp. 1–10, 2021.
- T. Jung, S. Kim, and K. Kim, "DeepVision: Deepfake Detection using Human Eye Blinking Pattern," IEEE Access, vol. 8, pp. 83144–83154, 2020.
- Y. Li, M. Chang, and S. Lyu, "Exposing DeepFake Videos by Detecting Face Warping Artifacts," arXiv preprint arXiv:1811.00656, 2018. Available: https://arxiv.org/abs/1811.00656
- D. Güera and E. J. Delp, "Deepfake Video Detection Using Recurrent Neural Networks," in Proc. IEEE AVSS, pp. 1–6, 2018.
- P. Korshunov and S. Marcel, "DeepFakes: A New Threat to Face Recognition?," in Proc. IEEE BTAS, pp. 1–6, 2018.
- A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner, "FaceForensics++: Learning to Detect Manipulated Facial Images," in Proc. IEEE/CVF ICCV, pp. 1–11, 2019.
- B. Dolhansky, J. Bitton, B. Pflaum, J. Lu, R. Howes, M. Wang, and C. C. Ferrer, "The DeepFake Detection Challenge (DFDC) Dataset," arXiv preprint arXiv:2006.07397, 2020. Available: https://arxiv.org/abs/2006.07397
- Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu, "Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics," in Proc. IEEE/CVF CVPR, pp. 3207–3216, 2020.
- L. Jiang, H. Zhou, H. Zhang, Z. Zhang, and S. Tang, "DeeperForensics-1.0: A Large-Scale Dataset for Real-World Face Forgery Detection," in Proc. IEEE/CVF CVPR, pp. 2889–2898, 2020.
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative Adversarial Networks," Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
- T. Karras, T. Aila, S. Laine, and J. Lehtinen, "Progressive Growing of GANs for Improved Quality, Stability, and Variation," in Proc. Int. Conf. Learning Representations (ICLR), 2018.
- T. Karras, S. Laine, and T. Aila, "A Style-Based Generator Architecture for Generative Adversarial Networks," in Proc. IEEE/CVF CVPR, pp. 4401–4410, 2019.
- P. Korshunov, A. Nunez, and S. Marcel, "Detection of GAN-Based Facial Manipulation," IEEE Trans. Information Forensics and Security, vol. 16, pp. 1841–1852, 2021.
