A Multi-Modal AI Proctoring System for Remote Examinations

DOI: 10.17577/IJERTV15IS051154

Shardool Patil, Sudarshan Bankar, Irfan Patel, Vallabh Patil, and P. P. Joshi

Department of Computer Engineering, Pune Institute of Computer Technology, Pune, India

Abstract: The large-scale transition in higher education from analog to digital environments through remote learning has created a need for effective methods of evaluating students remotely, and thus for reliable solutions to support this type of assessment. First-generation online proctoring tools relied largely on single-modality surveillance, such as simple facial detection. These early solutions are now being exploited by sophisticated methods of academic dishonesty, such as reading off-screen material with peripheral vision or receiving assistance from another person during an assessment. We therefore developed a new automated proctoring solution called ProctorGuard. This article presents a Multi-Modal Fusion Engine that combines, in parallel, geometric head pose estimation (PnP), fine-grained eye gaze estimation using L2CS-Net, real-time object detection using YOLOv8, and audio forensics. Through an initial dynamic calibration of baselines followed by smart temporal prioritization across all sensor modalities, ProctorGuard is able to distinguish between normal movement and actual attempts to cheat. In addition, the authors have developed an end-to-end software architecture incorporating both evidence logging and examiner dashboards.

Index Terms: Online Proctoring, Multi-Modal Fusion, YOLOv8, Head Pose Estimation, L2CS-Net, Audio Forensics, Academic Integrity, Deep Learning.

  1. Introduction

    There is no doubt that the last five years have caused an earthquake in the world of academics. There was an explosion in the number of people looking for ways to access educational content remotely, as well as a massive move toward digitalizing all aspects of higher education. As a result of this trend, remote exams went from being a second-best option to the first choice for millions of students across the globe. At the same time, the increased distance between where students take tests and where they are monitored has created a significant crisis concerning students' academic integrity.

    Without the physical presence of a professional invigilator, students now have countless advanced cheating techniques available to them. Prior to 2015, automated proctoring systems were built on single-modality monitoring, mainly employing simple facial recognition and/or minimal movement analysis. As these systems became widespread, the numerous limitations of unimodal monitoring quickly became evident, and even as technology began to allow surveillance of multiple behavioral channels at once, a large number of sophisticated methods for evading proctoring were developed.

    For example, if a candidate maintains eye contact directly with their computer screen while using peripheral vision to glance at cheat sheets positioned outside the frame, current automated proctoring systems are completely ineffective against this specific eyes-forward strategy.

    In addition, systems that rely solely on video will never detect auditory cheating. Examples include an off-camera accomplice near the desk whispering information; the candidate reading test questions out loud so that a helper can hear them; or receiving correct answers through nearly imperceptible earpieces. Therefore, in order to eliminate these forms of cheating, today's proctoring systems must transition to Multi-Modal Behavioral Analysis.

    In this paper, we detail the implementation of ProctorGuard, a system that operationalizes the theoretical models explored in recent academic surveys. ProctorGuard is a multi-modal AI proctoring system that goes beyond traditional eye-tracking. It utilizes a Multi-Modal Fusion Engine combining Computer Vision, Deep Learning, and Audio Forensics. The core contributions of this implementation paper are:

    • Integration of geometric head pose estimation (PnP) with L2CS-Net fine-grained gaze estimation to eliminate the head-forward, eyes-away vulnerability.

    • Use of YOLOv8 for Real-Time Contraband Detection.

    • Development of a dynamic calibration module that learns a user's normal resting position so as to avoid structural false positives.

    • A Multi-Modal Temporal Prioritizer that aggregates multiple modalities over time and weighs hard evidence above softer visual cues.

    • An end-to-end architecture comprising an examiner dashboard and automatic evidence logging.

  2. Background and Previous Research

    Automated invigilation has been continually advancing toward finer-grained data collection. This paper's implementation builds upon many of the same key technologies as previous implementations.

    1. Estimating Head Pose and Gaze

      The orientation of a person's head can provide an initial indication that a student is focused. In addition to head pose, estimating the student's gaze direction provides a fine-grained measure of attentiveness.

      In early implementations of automated invigilation, solutions detected 2D facial landmarks and solved the Perspective-n-Point (PnP) problem to recover 3D head rotation and translation. However, head pose alone is generally insufficient, so it is equally important to estimate the student's gaze. More recent architectures such as L2CS-Net have shown high levels of success using a classification-regression approach to estimate 3D gaze vectors from standard webcams. These models have shown significant ability to handle the uncertainty associated with unconstrained environments.

    2. Forensic Analysis of Objects and Auditory Data

    While visual attention tracking is a valuable tool for determining whether a student is attentive during an exam, it alone does not account for all aspects of a student's behavior. As such, both audio and visual forensic analysis are necessary to build a complete picture of the examination session.

    Visual forensics uses computer vision techniques such as object detection to verify that the student's environment contains only what they are supposed to be focusing on, i.e., the screen displaying the test question. Object detection is accomplished through various machine-learning-based approaches. One of these approaches includes the YOLO (You Only Look Once) framework, which has greatly improved the speed and efficiency of real-time object detection. A version of this model called YOLOv8 is now being utilized as an anchor-free detector for small objects, including mobile phones that may be partially obscured while being held in the student's hand.

    Audio forensic analysis focuses on processing the audio stream to detect verbal assistance provided by another individual. Voice activity detection (VAD) is one method of analyzing the audio stream: VAD detects when there is sufficient energy present in the signal to classify it as voice.

    Fig. 1. ProctorGuard High-Level System Architecture, illustrating the flow of video and audio streams into the Fusion Engine.

  3. System Architecture

    ProctorGuard has been developed with a modular design and an asynchronous pipeline, so that it can operate in real time on commodity hardware. The overall system is made up of three sensory streams that feed into a fusion engine, plus an additional component that reports results externally.

    1. Visual Stream: Head Orientation Estimation

      The visual stream takes in the video feed from a standard 30 frames-per-second webcam. Using the Perspective-n-Point (PnP) geometric head pose estimation algorithm, we estimate the head's orientation (roll, pitch, yaw). First, using the MediaPipe Face Mesh, we extract 468 three-dimensional facial points. From these 468 points, we isolate six two-dimensional points: the tip of the nose, the chin, the outer corner of each eye, and the left and right corners of the mouth.

      Using a generic 3D human face model, the camera matrix K, and distortion coefficients derived from the webcam, we solve the PnP problem to find the rotation vector R and translation vector T:

      $$ s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \left[ R \mid T \right] \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \tag{1} $$

      where (u, v) are the 2D image coordinates and (X, Y, Z) are the 3D world coordinates. The rotation vector is then converted to Euler angles to extract actionable yaw and pitch metrics.
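      The following is a minimal sketch of this step using OpenCV's solvePnP, assuming the six 2D landmarks have already been extracted from the MediaPipe mesh. The 3D model coordinates below are commonly used generic-face reference values, not necessarily ProctorGuard's exact model; the focal-length approximation is likewise an assumption.

```python
import cv2
import numpy as np

# Generic 3D face model points (nose tip, chin, eye corners, mouth corners).
# Widely used reference values; ProctorGuard's exact model may differ.
MODEL_POINTS = np.array([
    (0.0, 0.0, 0.0),           # nose tip
    (0.0, -330.0, -65.0),      # chin
    (-225.0, 170.0, -135.0),   # left eye, outer corner
    (225.0, 170.0, -135.0),    # right eye, outer corner
    (-150.0, -150.0, -125.0),  # left mouth corner
    (150.0, -150.0, -125.0),   # right mouth corner
], dtype=np.float64)

def head_pose_from_landmarks(image_points, frame_w, frame_h):
    """Solve Eq. (1) for R and T, then convert R to Euler angles (degrees).
    `image_points` is a (6, 2) float array of the six 2D landmarks."""
    focal = frame_w  # common approximation: focal length ~ image width
    K = np.array([[focal, 0, frame_w / 2],
                  [0, focal, frame_h / 2],
                  [0, 0, 1]], dtype=np.float64)
    dist = np.zeros((4, 1))  # assume negligible lens distortion
    ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS, image_points, K, dist,
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> rotation matrix
    # Standard ZYX Euler-angle extraction from the rotation matrix
    pitch = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    yaw = np.degrees(np.arctan2(-R[2, 0], np.sqrt(R[2, 1]**2 + R[2, 2]**2)))
    roll = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    return pitch, yaw, roll
```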

    2. Vision Stream: Gaze Estimation (L2CS-Net)

      In order to identify the head-forward, eyes-away abnormality in the video stream, we use L2CS-Net. In our experiments, we utilized pre-trained weights (l2csnet_gaze360.pkl) trained under unconstrained conditions. To predict where a person is looking, L2CS-Net crops the localized face bounding box from the image and feeds the crop into a ResNet-50 backbone. The model then outputs two independent continuous values: gaze yaw (gy) and gaze pitch (gp). Therefore, even if head pose estimation is noisy, the system can still independently measure the direction in which the eyes are pointing.
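      L2CS-Net's classification-regression head is typically decoded with a soft-argmax over angle bins. The sketch below shows the idea; the bin layout (90 bins of 4 degrees) follows the Gaze360 checkpoint convention and is an assumption here, as is the function name.

```python
import numpy as np

def decode_gaze_angle(logits, n_bins=90, bin_width=4.0, angle_offset=180.0):
    """Soft-argmax decoding for an L2CS-style classification-regression head:
    softmax over angle bins, then the expected bin index is mapped back to a
    continuous angle in degrees."""
    probs = np.exp(logits - np.max(logits))  # numerically stable softmax
    probs /= probs.sum()
    expected_bin = float(np.dot(probs, np.arange(n_bins)))
    return expected_bin * bin_width - angle_offset

# The network emits two independent logit vectors per face crop, so:
# gaze_yaw = decode_gaze_angle(yaw_logits)
# gaze_pitch = decode_gaze_angle(pitch_logits)
```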

    3. Object Detection Stream (YOLOv8)

      Using YOLOv8-nano (yolov8n.pt), we detect objects that are not permitted during the examination. Since YOLOv8-nano is designed to run as fast as possible on low-end CPUs (which may include some student computers), it keeps CPU load and temperature low. For now, the system is designed to find three classes of prohibited items (Mobile Phone, Book, and Person); Person is included in case someone enters the room who should not be there.

      As previously discussed, one major problem with detecting hand-held phones is that candidates tend to move them around quickly before putting them back in a pocket, so at webcam frame rates motion blur becomes a significant challenge. To address this, we implemented a temporal-confidence buffer: when a class of object is detected with high confidence in frame t, we temporarily reduce the confidence threshold in the same localized area for frames t+1 to t+3. This prevents motion-blurred views of an object from being ignored simply because the object moved too quickly.
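      A minimal sketch of this stream using the ultralytics API is shown below. The thresholds are illustrative assumptions, and for brevity the buffer here relaxes the threshold per class rather than per localized image region as described above.

```python
from ultralytics import YOLO

# COCO class ids: 0 = person, 67 = cell phone, 73 = book
WATCHED = {0: "Person", 67: "Mobile Phone", 73: "Book"}
BASE_CONF, RELAXED_CONF, RELAX_FRAMES = 0.50, 0.30, 3  # assumed values

model = YOLO("yolov8n.pt")
relax_until = {}  # class id -> frame index until which the lower threshold applies

def detect_prohibited(frame, frame_idx):
    hits = []
    results = model(frame, verbose=False)[0]
    for box in results.boxes:
        cls, conf = int(box.cls), float(box.conf)
        if cls not in WATCHED:
            continue
        threshold = RELAXED_CONF if frame_idx <= relax_until.get(cls, -1) else BASE_CONF
        if conf >= threshold:
            hits.append((WATCHED[cls], conf))
            # a high-confidence hit relaxes the threshold for the next few frames,
            # so motion-blurred follow-up views are not dropped
            if conf >= BASE_CONF:
                relax_until[cls] = frame_idx + RELAX_FRAMES
    return hits
```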

    4. Audio Forensics Stream

    We use audio_monitor.py, a Python module that uses PyAudio to sample audio from the default microphone in discrete chunks. Using a fast Fourier transform (FFT), we perform a spectral analysis of these samples. We employ a dual-threshold method: (1) amplitude thresholding, which detects both loud events and normal conversation; and (2) whisper detection using spectral energy analysis: whispers often lack a fundamental pitch but contain substantial spectral energy at higher frequencies (e.g., hissing sounds). We therefore filter out frequencies below 2000 Hz and above 5000 Hz and examine the remaining band to identify whispering or other forms of stealthy voice communication.
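    A minimal sketch of this dual-threshold check is shown below; the sample rate, chunk size, and both thresholds are illustrative assumptions rather than the values used in audio_monitor.py.

```python
import numpy as np
import pyaudio

RATE, CHUNK = 16000, 2048                    # assumed sample rate and chunk size
AMP_THRESH, BAND_RATIO_THRESH = 0.02, 0.35   # illustrative thresholds

def analyze_chunk(samples):
    """Dual-threshold check on one audio chunk (float32 in [-1, 1]):
    1) RMS amplitude catches loud events and normal speech;
    2) share of spectral energy in the 2-5 kHz band catches whispers."""
    rms = np.sqrt(np.mean(samples ** 2))
    spectrum = np.abs(np.fft.rfft(samples)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / RATE)
    band = (freqs >= 2000) & (freqs <= 5000)
    band_ratio = spectrum[band].sum() / (spectrum.sum() + 1e-12)
    if rms > AMP_THRESH:
        return "Speech"
    if band_ratio > BAND_RATIO_THRESH:
        return "Whisper"
    return "Quiet"

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paFloat32, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)
samples = np.frombuffer(stream.read(CHUNK), dtype=np.float32)
print(analyze_chunk(samples))
```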

  4. Core Implementation Modules

    1. Dynamic Calibration Phase

      Early proctoring systems were pretty rigid. They essentially assumed everyone sits perfectly straight, which just isn't how bodies work. Different desk setups, spinal curves, and natural resting positions mean everyone's normal looks a bit different. To fix this, ProctorGuard kicks things off with a Dynamic Calibration Phase.

      For the first N seconds (we usually set this to 5 seconds), we just ask the user to look right at the center of their screen. Using the Perspective-n-Point (PnP) algorithm, we map their 2D facial landmarks into 3D space to grab their actual head orientation. Then, we average out their head and gaze angles to establish a personal baseline:

      $$ O_{hy} = \frac{1}{N} \sum_{i=1}^{N} h_{y,i}, \qquad O_{gy} = \frac{1}{N} \sum_{i=1}^{N} g_{y,i} \tag{2} $$

      where O_hy is the Head Yaw Offset and O_gy is the Gaze Yaw Offset (the pitch offsets O_hp and O_gp are computed analogously). The FusionEngine class stores these offsets via the set_calibration() method, ensuring all subsequent analyses are calculated relative to the user's natural posture.
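      A minimal sketch of this averaging step is shown below; `read_angles` is a hypothetical helper returning the current per-frame angles, and the keyword signature of set_calibration() is an assumption.

```python
import time
import numpy as np

def calibrate_baseline(read_angles, duration_s=5.0):
    """Average head/gaze angles over the calibration window (Eq. 2).
    `read_angles` is assumed to return (head_yaw, head_pitch, gaze_yaw,
    gaze_pitch) for the current frame; the name is illustrative."""
    samples = []
    t_end = time.time() + duration_s
    while time.time() < t_end:
        samples.append(read_angles())
    offsets = np.mean(samples, axis=0)  # per-angle mean over N frames
    return dict(zip(("O_hy", "O_hp", "O_gy", "O_gp"), offsets.tolist()))

# usage sketch:
# fusion_engine.set_calibration(**calibrate_baseline(read_angles))
```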

    2. The Fusion Engine and Priority Logic

      The fusion_engine.py script is essentially the brain of our system. After a personal baseline has been determined, it continuously calculates how much each sensor reading has deviated (the delta) from the person's individual center point:

      $$ \Delta_{hp} = \mathrm{pitch}_{head} - O_{hp} \tag{3} $$
      $$ \Delta_{hy} = \mathrm{yaw}_{head} - O_{hy} \tag{4} $$
      $$ \Delta_{gp} = \mathrm{pitch}_{gaze} - O_{gp} \tag{5} $$
      $$ \Delta_{gy} = \mathrm{yaw}_{gaze} - O_{gy} \tag{6} $$

      As mentioned previously, different sensors can disagree: for example, a smartphone may be visible in front of the camera while the student's head remains perfectly in line with the center of the screen. Treating all signals equally would produce many conflicting messages. Our solution is a decision tree based on priority levels, with two types of evidence. Hard evidence: any time YOLOv8 detects a phone or the FFT analysis determines whispering in real time, a violation is triggered immediately. Soft evidence: an odd head position or tilt detected by the facial mesh is not conclusive on its own, so it needs another level of review before it can be considered an actual violation. A condensed sketch of this logic is shown below.
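      In the sketch, the threshold values mirror Algorithm 1; the method names other than set_calibration() are assumptions about fusion_engine.py, not its actual API.

```python
class FusionEngine:
    """Condensed sketch of the priority logic in fusion_engine.py."""
    HEAD_YAW_THRESH, HEAD_PITCH_THRESH, GAZE_THRESH = 30.0, 35.0, 30.0

    def set_calibration(self, O_hy, O_hp, O_gy, O_gp):
        self.offsets = (O_hy, O_hp, O_gy, O_gp)

    def evaluate(self, head_yaw, head_pitch, gaze_yaw,
                 object_state, audio_state):
        O_hy, O_hp, O_gy, _ = self.offsets
        d_hy, d_hp = head_yaw - O_hy, head_pitch - O_hp  # Eqs. (3)-(4)
        d_gy = gaze_yaw - O_gy                            # Eq. (6)
        # Hard evidence: trusted modalities trigger immediately.
        if object_state == "Phone Detected":
            return "hard", "Unauthorized Object (Phone)"
        if audio_state in ("Whisper", "Speech"):
            return "hard", "Auditory Violation"
        # Soft evidence: visual cues that still need temporal confirmation.
        if abs(d_hy) > self.HEAD_YAW_THRESH:
            return "soft", "Head Turned Side"
        if abs(d_hp) > self.HEAD_PITCH_THRESH:
            return "soft", "Head Tilted Up/Down"
        if abs(d_gy) > self.GAZE_THRESH:
            return "soft", "Eyes Looking Side"
        return None, ""
```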

    3. Temporal Filtering

      Temporal filters (as seen in Algorithm 1) add a layer of common sense. Humans aren't statues: they stretch, look down at their keyboards, and glance away when trying to remember an answer. If the system flagged every one of those micro-movements, students could hardly even blink during their tests.

      To make up for the lack of human patience, we applied temporal filters. When a soft visual anomaly occurs (such as looking to one side very intensely), a timer starts; the student must sustain the behavior for a full 1.5 seconds (the default value of our TIME_THRESH) before the system considers it an actual alert. In other words, if a student simply glances down at her keyboard while typing, the instantaneous angle may exceed the threshold established in the FusionEngine, but because the glance lasts a second or less, it fails the time-based requirement and no alert is raised.
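      A minimal sketch of this timer-based debounce is shown below; the class name and structure are illustrative, with only the 1.5-second default taken from the paper.

```python
import time

TIME_THRESH = 1.5  # seconds a soft anomaly must persist (paper default)

class TemporalFilter:
    """Debounce for soft evidence: a soft anomaly becomes an alert only
    after it has persisted for TIME_THRESH seconds."""
    def __init__(self):
        self.anomaly_start = None

    def update(self, soft_anomaly_active):
        if not soft_anomaly_active:
            self.anomaly_start = None        # reset buffer on normal behavior
            return False
        if self.anomaly_start is None:
            self.anomaly_start = time.time() # start timing the anomaly
        return (time.time() - self.anomaly_start) >= TIME_THRESH
```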

    4. Evidence Logging and Dashboarding

    When the system finally determines that an actual violation has occurred, it is captured by the logger.py module. The exact video frame that contained the violation is saved along with the timestamp and the specific violation data (for example, "Eyes Looking Side, +32 degrees"), and then securely recorded to provide reviewers an irrefutable audit trail.

    We also created a real-time invigilation dashboard as part of our implementation, built using Streamlit (app.py).

    Algorithm 1 Temporal Priority Logic Evaluation
    Require: Δhy, Δhp, Δgy, Δgp, ObjectState, AudioState
    1:  is_suspicious ← False
    2:  reason ← ""
    3:  if ObjectState = "Phone Detected" then
    4:      is_suspicious ← True
    5:      reason ← "Unauthorized Object (Phone)"
    6:  else if AudioState = "Whisper" or AudioState = "Speech" then
    7:      is_suspicious ← True
    8:      reason ← "Auditory Violation"
    9:  else if |Δhy| > HEAD_YAW_THRESH (30°) then
    10:     is_suspicious ← True
    11:     reason ← "Head Turned Side"
    12: else if |Δhp| > HEAD_PITCH_THRESH (35°) then
    13:     is_suspicious ← True
    14:     reason ← "Head Tilted Up/Down"
    15: else if |Δgy| > GAZE_THRESH (30°) then
    16:     is_suspicious ← True
    17:     reason ← "Eyes Looking Side"
    18: end if
    19: if is_suspicious then
    20:     Update temporal buffer
    21:     if Duration > TIME_THRESH (1.5 s) then
    22:         return True, reason
    23:     end if
    24: else
    25:     Reset temporal buffer
    26: end if
    27: return False, ""

    Fig. 2. Pseudocode for the Temporal Priority Logic Evaluation

    This application allows proctors to view the stream in real time, with overlaid bounding-box information and 3D gaze vector overlays. This represents the best of both worlds: scalable AI processing combined with human judgment.
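    A minimal sketch of the evidence-logging step is shown below, in the spirit of logger.py; the function name, directory layout, and metadata fields are assumptions for illustration.

```python
import cv2
import json
import time
from pathlib import Path

EVIDENCE_DIR = Path("evidence")  # illustrative layout, not the paper's exact one
EVIDENCE_DIR.mkdir(exist_ok=True)

def log_violation(frame, reason, detail):
    """Persist the offending video frame plus machine-readable metadata
    so reviewers get a timestamped, auditable record."""
    ts = time.strftime("%Y%m%dT%H%M%S")
    cv2.imwrite(str(EVIDENCE_DIR / f"{ts}_{reason}.jpg"), frame)
    record = {"timestamp": ts, "reason": reason, "detail": detail}
    with open(EVIDENCE_DIR / "violations.jsonl", "a") as fh:
        fh.write(json.dumps(record) + "\n")

# e.g. log_violation(frame, "Eyes_Looking_Side", "+32 degrees")
```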

  5. Experimental Setup and Evaluation

    To assess whether ProctorGuard works as intended, we created a testing environment that mimics a typical test-taking experience using common equipment found in homes or offices; this represents how students would likely be examined remotely. We used off-the-shelf hardware so that our results reflect a real-world use case.

    1. Testing Environment

      It wouldn't mean much if this only worked on a $10,000 supercomputer. We needed to know ProctorGuard could run smoothly on the kind of hardware an average student actually uses at home. So, we tested it on a mid-tier workstation: an Intel Core i7 processor, 16 GB of DDR4 RAM, and a standard 720p built-in webcam. We also deliberately skipped using a dedicated GPU for the deep learning models to make sure the CPU could handle the load entirely on its own.

    2. Performance Metrics

      We tested the effectiveness of ProctorGuard by simulating student behavior according to a set of predetermined parameters, categorized into two groups: benign behaviors (e.g., a student shifting position in their chair or pausing to think about answers) and malicious behaviors (e.g., a student using a cell phone or whispering to another person). In machine learning, you can't just look at basic accuracy, especially since actual cheating is relatively rare. We leaned heavily on the F1-Score to see how well we balanced finding the real cheaters (Recall) without falsely accusing honest, fidgety students (Precision).

      TABLE I
      Detection Accuracy by Modality and Fusion

      Configuration          Precision   Recall   F1-Score
      Pose-Only Baseline       0.72       0.65      0.68
      Pose + Gaze              0.81       0.78      0.79
      YOLOv8 + Audio Only      0.88       0.70      0.78
      ProctorGuard Fusion      0.92       0.89      0.90

      Using just head pose would not have been sufficient, as students could simply freeze their necks and move only their eyes to cheat, leaving us with a poor F1-score of 0.68. But when we fed everything into the Fusion Engine (pose, gaze, YOLOv8 object detection, and audio analysis), our F1-score shot up to 0.90. We hit 92 percent precision specifically because the temporal filtering successfully weeded out innocent movements while keeping recall high. The improvements in precision and recall shown in Table I confirm that combining the modalities significantly improves the overall performance of the ProctorGuard system.
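      As a quick sanity check on the fused row of Table I, the F1-score is the harmonic mean of precision and recall:

      $$ F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.92 \times 0.89}{0.92 + 0.89} = \frac{1.6376}{1.81} \approx 0.90 $$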

    3. Computational Overhead

    Running four heavy deep learning networks concurrently is tough on a standard CPU. To keep laptops from overheating and crashing mid-exam, we had to get creative with our optimization. We used the nano version of YOLOv8 and ran object detection only on every fifth frame. A phone doesn't magically materialize in a millisecond, so we lost no real security by doing this. This trick kept CPU usage hovering safely between 45 and 55 percent. The whole pipeline takes about 65 milliseconds from start to finish, giving a smooth perceived rate of over 15 frames per second: more than enough detail to catch any funny business without melting the student's computer.
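    A minimal sketch of this frame-skipping loop is shown below; `evaluate_frame` and `detector` are hypothetical names standing in for the fusion engine's per-frame update and the YOLOv8 call.

```python
DETECT_EVERY = 5  # run YOLOv8 only on every 5th frame (paper's setting)

def process_stream(frames, fusion_engine, detector):
    """Frame-skipping loop: the expensive object detector runs on every
    5th frame, and its last result is reused in between, since prohibited
    items persist across consecutive frames."""
    object_state = "Clear"
    for idx, frame in enumerate(frames):
        if idx % DETECT_EVERY == 0:
            object_state = detector(frame)  # expensive: YOLOv8 inference
        # head pose, gaze, and audio states still update every frame (cheap)
        yield fusion_engine.evaluate_frame(frame, object_state)
```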

  6. Challenges and Limitations

    ProctorGuard represents an important technological development; nevertheless, a number of technical and environmental limitations continue to challenge the deployment of edge-based proctoring systems:

    • Low-Light Environments: Both MediaPipe Face Mesh and L2CS-Net experience significant degradation in performance under low-light or severely backlit conditions. Likewise, when a light source casts shadows across a subject's face, their eye-gaze vectors tend to become unpredictable and may lead to false-positive determinations.

    • VFOA Ambiguity: Visual Focus of Attention (VFOA) ambiguity has yet to be fully solved. While the system can determine the direction of a student's visual fixation to within approximately 30 degrees, it has no cameras beyond the single webcam on the machine running the program, so there is simply no way to know for certain whether a student glancing to the side is consulting an unauthorized sticky note tacked to the edge of the desk or merely looking away while considering an answer.

    • Adversarial Virtual Cameras: A technically savvy student may inject a pre-recorded video using virtual camera software such as OBS Studio. Although our system performs local inference on the incoming video feed very well, the current version of the application does NOT include cryptographic liveness detection to verify that the video stream originates from an actual peripheral device and not simply from a file on the computer.

  7. Future Scope

    Future versions of ProctorGuard will implement a cloud-hybrid model to provide increased security compared with the current edge-only processing:

    • Liveness Verification: Developing challenge-response methods to verify that a student is actively present at the computer, for example by requesting that the student follow a randomly moving point displayed on the screen, to prevent virtual injection of camera images.

    • Multi-Camera Support: Use a smartphone as an additional hand camera connected through the internet to view the desk, eliminating the VFOA ambiguity and capturing misuse of devices outside the main video frame.

    • Transformer-Based Temporal Models: Replace the priority-tree logic with a spatio-temporal graph convolutional network (ST-GCN) or a vision transformer (ViT) to recognize and learn sequential patterns of behavior occurring over time.

  8. Conclusion

The security and integrity of online examinations can no longer rely on simple, uni-modal tracking methodologies. Online testing has to move beyond unimodal techniques: tracking is one way to detect cheating, but it will have to become more sophisticated as students continue to cheat using increasingly advanced tactics.

In this paper we introduced ProctorGuard, which uses a multi-modal fusion engine combining geometric head pose estimation, L2CS-Net-based eye tracking, YOLOv8 object detection, and spectrogram-based audio forensic analysis to create a robust, high-fidelity behavioral representation of each test taker. We showed how our dynamic calibration phase, combined with our temporal priority logic engine, results in lower false-positive rates than most existing solutions. The new framework focuses on empirical evidence, runs on commodity hardware, and presents a more accurate, more detailed, and fairer method of monitoring candidates in today's academic environment.

Acknowledgment

The authors would like to sincerely thank the Department of Computer Engineering at Pune Institute of Computer Technology (PICT) for providing the computational resources and academic guidance necessary to complete this project. Special thanks to the open-source communities maintaining the MediaPipe, YOLO, and L2CS-Net repositories.

  15. ProctorEdge: Advanced AI Examination Monitoring and Security Sys-tem (2025).