
AI-Based Mock Interview Evaluator: An Emotion and Confidence Classifier Model

DOI: 10.17577/IJERTV14IS120219

Mr. Virupaksha Gouda

Department of Computer Science & Engineering, Ballari Institute of Technology & Management, Ballari

Akshay Sajjan

Department of Computer Science & Engineering, Ballari Institute of Technology & Management, Ballari

B Vamsi Krishna

Department of Computer Science & Engineering, Ballari Institute of Technology & Management, Ballari

Bhagyavanth

Department of Computer Science & Engineering, Ballari Institute of Technology & Management, Ballari

Abstract – In the competitive landscape of job interviews, candidates often struggle to present themselves effectively, and traditional interviewers may miss critical aspects such as emotional state and confidence level. This project presents the development of an AI-Based Mock Interview Evaluator, an intelligent system that provides objective, real-time evaluations by analyzing both speech and facial expressions. The system captures the candidate's responses through voice input and webcam, processes these inputs using advanced machine learning models, and delivers feedback on the emotional state (such as happiness, sadness, or neutrality) and confidence level (measured by speaking speed and clarity).

This feedback aims to assist candidates in improving their performance and refining their interview skills before facing real-world scenarios. The system incorporates facial emotion recognition through the DeepFace library, speech recognition for transcribing responses, and confidence evaluation based on speech tempo. Additionally, a graphical user interface (GUI) is developed using Tkinter, allowing users to interact easily. The evaluator produces detailed feedback on each answer, with suggestions for improvement, making it a valuable tool for interview preparation. This project also demonstrates the potential of integrating AI into soft skills training, specifically in the domains of emotional intelligence and communication confidence.

Keywords – AI-based mock interview evaluator, emotion recognition, confidence analysis, facial expression detection, speech processing, real-time feedback, DeepFace, interview performance assessment

  1. INTRODUCTION

    In today's highly competitive job market, candidates often face challenges in effectively presenting themselves during interviews. Traditional interviewers may overlook subtle cues like emotional state and confidence level. These soft skills play a crucial role in hiring decisions. To address this, the project introduces an AI-Based Mock Interview Evaluator. This system leverages machine learning to analyze both speech and facial expressions. It captures candidate responses via webcam and microphone for real-time evaluation. The system uses the DeepFace library for emotion recognition and speech analysis

to assess confidence. A GUI built with Tkinter ensures user-friendly interaction. Feedback includes emotion classification, confidence scoring, and improvement suggestions. This project bridges the gap between technical preparation and soft skill enhancement using AI.

Even candidates who appear confident during an interview may unconsciously display emotional cues such as anxiety, stress, or hesitation that are not easily noticeable to the human eye. Subtle micro-expressions, reduced eye contact, vocal tremors, or inconsistent speech patterns often go undetected by traditional interviewers, leading to subjective and incomplete evaluations. Research also emphasizes that human-led assessments frequently overlook non-verbal behavioral signals, such as facial affect, tone variation, and response coherence, which are critical indicators of communication readiness and emotional stability in high-pressure environments.

Traditional mock interview methods rely heavily on manual observation and personal judgment, resulting in inconsistent feedback that varies across evaluators and sessions. Such approaches delay skill development, as candidates do not receive real-time insights into their emotional state or confidence level, restricting continuous improvement and self-awareness.

To overcome these limitations, researchers have explored AI-driven systems capable of analyzing facial expressions, vocal attributes, and speech patterns using computer vision and machine learning models. The proposed AI-Based Mock Interview Evaluator integrates DeepFace for emotion recognition with speech-processing techniques that classify confidence based on tempo, clarity, and fluency. By capturing both visual and audio inputs in real time and presenting results through an interactive Tkinter-based interface, the system provides instant, objective, and personalized feedback to candidates.

  2. LITERATURE SURVEY

Baltrušaitis, Ahuja & Morency (2019) present a broad review of multimodal machine learning, covering how facial expressions, vocal cues, and language signals are integrated to enhance emotion-understanding systems. The paper synthesizes fusion architectures, temporal alignment techniques, and cross-modal learning strategies used in behavioral assessment tasks. It highlights core challenges such as asynchronous inputs, missing modality data, and noisy real-world recordings, and proposes best practices for robust multimodal inference. The authors emphasize scalability, real-time processing efficiency, and generalization, making the review directly relevant to emotion- and confidence-based mock interview evaluators. [1]

Mollahosseini et al. (2017) introduce the AffectNet dataset and compare deep-learning models for large-scale facial emotion recognition under real-world conditions. Their work explores data imbalance handling, multi-class labeling of complex affective states, and CNN architectures optimized for unconstrained facial input. The authors highlight challenges such as occlusion, lighting variation, and subtle micro-expressions, all highly relevant to webcam-based interview evaluation. Their findings support robust facial affect detection in mock interview systems. [2]

Serengil & Ozpinar (2020) present the DeepFace framework, a lightweight and modular face analysis system capable of real-time emotion recognition on consumer-grade hardware. The work details efficient deep-learning pipelines, model compression strategies, and multi-backend support, enabling quick deployment in desktop GUI applications. The authors emphasize practical considerations like device compatibility, low-latency inference, and user-friendly integration, aligning closely with mock interview evaluators that analyze candidate expressions in real time. [3]

Eyben et al. (2015) describe the openSMILE audio-feature extraction toolkit, widely used for paralinguistic tasks such as emotion and stress analysis. Their study explains feature sets related to pitch, voice quality, spectral patterns, and rhythm, key indicators of communication confidence. The toolkit's reliability across recording devices and environments makes it suitable for speech-based confidence evaluation in mock interview systems. [4]

Schuller and colleagues (2018) provide an extensive survey of computational paralinguistics, covering machine-learning techniques for analyzing human vocal signals related to emotion, stress, and behavioral states. The paper highlights robust preprocessing steps, noise mitigation strategies, and temporal modeling trends. The authors emphasize ethical concerns and model fairness, major considerations for AI-driven interview evaluation tools. [5]

Busso et al. (2008) present the IEMOCAP multimodal dataset, containing synchronized audio, video, and transcripts of emotionally expressive speech. Their annotation strategy and multimodal benchmarking techniques set standards in affective-computing research. The dataset's realistic conversational settings and emotional diversity support the development of interview evaluators that require naturalistic emotion detection across multiple modalities. [6]

    Zhang et al. (2019) propose deep residual network architectures for facial expression recognition, demonstrating improved performance in dynamic, unconstrained environments. Their work emphasizes robustness to head movement, lighting variation, and partial occlusions common in webcam-based interviews. The authors highlight effective data augmentation and regularization tactics, providing guidance for developing stable visual emotion classifiers. [7]

    Gideon et al. (2017) discuss multimodal fusion strategies for emotion recognition, comparing early, late, and hybrid fusion models across noisy real-world scenarios. Their work shows that confidence-weighted fusion improves stability when one modality degrades (e.g., bad audio quality). These insights directly support mock interview evaluators requiring reliable inference under varying network, microphone, or camera conditions. [8]

Kim & Provost (2014) analyze vocal disfluencies (pauses, fillers, tremor, and irregular breathing) as markers of stress and low confidence. They demonstrate how acoustic features correlate with perceived speaker anxiety. The study provides critical evidence for designing speech-based confidence scoring mechanisms in AI interview tools. [9]

    Han et al. (2020) explore multimodal stress detection using micro-expressions and short-term speech changes. Their temporal-segmentation approach detects rapid emotional fluctuations, enabling precise assessment of candidate nervousness during specific interview questions. The authors highlight real-time responsiveness and lightweight computation, aligning well with live mock interview analysis systems. [10]

    Li et al. (2021) investigate attention-based deep-learning models for fine-grained emotion recognition, demonstrating how attention maps enhance interpretability and accuracy. Their study supports emotion classifiers that must detect subtle and mixed affective states, such as confusion or uncertainty, commonly displayed during interviews. [11]

    Ahuja et al. (2019) present a real-time webcam-based emotion recognition system optimized for low-resolution inputs. Their work highlights model quantization and pruning techniques that maintain high accuracy while enabling fast execution on low-power devices. This directly supports mock interview evaluators designed for broad accessibility. [12]

Narayana & Gupta (2020) examine automated virtual-interview systems and discuss how AI can score verbal fluency, emotional regulation, and communication clarity. Their results reveal improved candidate performance when using AI-based feedback loops, validating the pedagogical value of mock interview evaluators. [13]

    Kwon et al. (2018) propose a BLSTM-CNN hybrid model for speech emotion recognition that captures long-range temporal dependencies in speech signals. Their architecture excels in detecting patterns linked to confidence and hesitation, making it relevant for voice-based confidence classification. [14]

Liu et al. (2020) study multimodal human-computer interaction systems that evaluate user engagement through facial cues and speech characteristics. They underline the importance of user-friendly interfaces and visual feedback mechanisms, foundational concepts for Tkinter-based mock interview GUIs. [15]

    Mehta et al. (2021) develop an AI-driven recruitment evaluation framework combining facial analysis, NLP scoring, and voice analytics. Their work emphasizes fairness, bias mitigation, and transparent scoring mechanisms, which are key considerations in designing ethical interview evaluators. [16]

Tripathi et al. (2019) investigate feature extraction techniques for emotion classification under noisy environments. Their findings underscore the importance of noise-resistant preprocessing and robust feature engineering, important for interview systems where audio quality may vary across users. [17]

    Batista et al. (2020) propose real-time emotional intelligence frameworks that provide feedback for self-improvement during communication tasks. Their system demonstrates how immediate insights into emotional behavior enhance learning outcomes, supporting the educational purpose of mock interview evaluators. [18]

Huang et al. (2022) introduce multi-task learning techniques for jointly predicting facial expressions, action units, and affective dimensions. Their methodology enhances models' ability to generalize across emotional states and real-world settings, beneficial for interview evaluators analyzing complex facial behaviors. [19]

    Loffler et al. (2023) analyze conversational AI systems capable of evaluating speaker confidence, tone stability, and emotion patterns during structured interviews. Their research highlights multimodal scoring models, real-time processing pipelines, and fairness frameworks. Their insights provide a strong foundation for developing transparent, reliable AI mock interview evaluators. [20]

  3. PROPOSED METHODOLOGY

    The system uses DeepFace for real-time emotion detection, voice processing for confidence analysis, and speech-to-text for capturing answers. A simple Tkinter GUI displays results, while an AI-based feedback module gives quick suggestions to improve interview performance.

Emotion Recognition Engine: Uses facial expression analysis powered by DeepFace to detect emotional states such as happiness, sadness, or neutrality in real time during the interview.
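As an illustration of this component, the sketch below shows one way a single webcam frame could be analyzed with the DeepFace library; the fallback label and the handling of different return formats are assumptions for robustness, not code taken from the paper.

```python
# Minimal sketch: dominant-emotion detection for one webcam frame with DeepFace.
# Assumes the deepface and opencv-python packages are installed; error handling
# and GUI wiring are simplified for illustration.
import cv2
from deepface import DeepFace

def detect_emotion(frame):
    """Return the dominant emotion label ("happy", "sad", "neutral", ...) for a BGR frame."""
    try:
        result = DeepFace.analyze(frame, actions=["emotion"], enforce_detection=False)
        # Recent DeepFace versions return a list of per-face dictionaries.
        face = result[0] if isinstance(result, list) else result
        return face["dominant_emotion"]
    except ValueError:
        return "neutral"  # assumed fallback when no face is detected

cap = cv2.VideoCapture(0)          # default webcam
ok, frame = cap.read()
if ok:
    print("Detected emotion:", detect_emotion(frame))
cap.release()
```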

Confidence Analysis via Voice Processing: Applies speech processing to evaluate speaking speed, clarity, and pauses, key indicators of confidence level during answers.
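A minimal sketch of how speaking speed and pause rate could be combined into a single confidence estimate is given below; the tempo band (120-170 words per minute) and the 30% pause threshold are illustrative assumptions rather than values specified in the paper.

```python
# Sketch of a voice-based confidence heuristic: speaking rate from the transcript
# length and audio duration, pause ratio from low-energy frames.
import numpy as np

def confidence_score(samples: np.ndarray, sample_rate: int, transcript: str,
                     frame_ms: int = 30, silence_rms: float = 0.01) -> float:
    """Return a rough 0-1 confidence estimate for one spoken answer."""
    duration_min = len(samples) / sample_rate / 60.0
    words_per_min = len(transcript.split()) / max(duration_min, 1e-6)

    # Fraction of short frames whose RMS energy is below the silence threshold.
    frame_len = int(sample_rate * frame_ms / 1000)
    usable = len(samples) // frame_len * frame_len
    frames = samples[:usable].reshape(-1, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    pause_ratio = float((rms < silence_rms).mean())

    speed_component = 1.0 if 120 <= words_per_min <= 170 else 0.5  # steady tempo
    pause_component = 1.0 - min(pause_ratio / 0.3, 1.0)            # penalise long silences
    return round(0.5 * speed_component + 0.5 * pause_component, 2)
```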

Speech-to-Text Transcription: Utilizes a speech recognition API to transcribe spoken responses into text, enabling further analysis of language fluency and coherence.
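The transcription step could be implemented with the SpeechRecognition package, as sketched below; the Google Web Speech backend is only one possible choice, since the paper does not name a specific recognition API.

```python
# Sketch: capture one spoken answer from the microphone and transcribe it.
# Assumes the SpeechRecognition and PyAudio packages are installed.
import speech_recognition as sr

def transcribe_answer(max_seconds: int = 30) -> str:
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source, duration=1)   # calibrate noise floor
        audio = recognizer.listen(source, phrase_time_limit=max_seconds)
    try:
        return recognizer.recognize_google(audio)                 # transcribed answer text
    except sr.UnknownValueError:
        return ""                                                 # speech was unintelligible
```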

    Interactive GUI for Feedback: A user-friendly Tkinter-based interface that presents detailed feedback on emotion, confidence, and performance per question.
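A stripped-down Tkinter sketch of such a feedback panel is shown below; the labels and layout are hypothetical and do not reproduce the project's actual interface.

```python
# Minimal Tkinter sketch: display emotion, confidence, and a suggestion for one answer.
import tkinter as tk

def show_feedback(emotion: str, confidence: float, suggestion: str) -> None:
    root = tk.Tk()
    root.title("Mock Interview Feedback")
    tk.Label(root, text=f"Detected emotion: {emotion}").pack(padx=20, pady=5)
    tk.Label(root, text=f"Confidence score: {confidence:.2f}").pack(padx=20, pady=5)
    tk.Label(root, text=f"Suggestion: {suggestion}", wraplength=320).pack(padx=20, pady=10)
    root.mainloop()

show_feedback("neutral", 0.72, "Maintain a steadier pace and reduce long pauses.")
```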

This multi-layered architecture allows the AI-based mock interview evaluator to function as an integrated emotion and confidence classifier model.

Fig 3.1: Context diagram of the AI-based mock interview evaluator (emotion and confidence classifier model)

The AI-Based Mock Interview Evaluator acts as an intelligent system between the candidate and the interviewer. The candidate provides responses, such as facial expressions, voice input, and spoken answers, which are processed by the evaluator. The system analyzes emotions, confidence levels, and speech patterns, then generates meaningful feedback. This feedback is delivered to the interviewer, helping them understand the candidate's performance accurately and objectively.

    Algorithm: AI-Based Mock Interview Evaluation System

    Input: The system takes audiovisual behavioral data from the candidate as input.

Notation | Description | Unit
Face | Facial expressions captured by webcam |
Voice | Vocal features (speed, clarity, pauses) |
STT | Speech-to-text transcribed answer | Text
Emotion | Detected emotional state | Label

Input = {Face, Voice, STT, Emotion, Confidence}
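For clarity, the input record above could be represented as a small data container; the sketch below is hypothetical, and the field types are assumptions rather than details from the paper's implementation.

```python
# Hypothetical container mirroring the Input set {Face, Voice, STT, Emotion, Confidence}.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CandidateInput:
    face: List[str]            # per-frame emotion labels from the webcam
    voice: Dict[str, float]    # vocal features, e.g. {"speed": 140.0, "clarity": 0.8, "pause_ratio": 0.1}
    stt: str                   # speech-to-text transcribed answer
    emotion: str               # detected (dominant) emotional state, e.g. "neutral"
    confidence: float          # calculated confidence score in [0, 1]
```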

    Output:

    The output is the performance classification, denoted as Q, based on emotional stability and confidence level:

Symbol | Scenario Description | Performance Label
S1 | High confidence, positive emotion | Excellent
S2 | Low confidence, negative emotion | Needs Improvement
S3 | Neutral emotion, moderate confidence | Average
S4 | Fluctuating emotion, inconsistent confidence | Mixed Performance

Output = Q ∈ {S1, S2, S3, S4}

    Notations:

Notation | Meaning
Si | Interview performance scenario, where i = 1, 2, 3, 4
f | Mapping function f : Input → Output
Eset | Set of detectable emotions
Cscore | Calculated confidence score
STTaccuracy | Speech recognition accuracy threshold

Scenario 1: Excellent Performance (S1)

Input: Candidate's face, voice, and STT

Output: S1 (High Confidence & Positive Emotion)

Algorithm Excellent_S1(Face, Voice, STT)
1: Initialize webcam, microphone, and STT engine
2: Capture facial emotion (Emotion_val)
3: Extract voice features (speed, clarity, pause rate)
4: if (Emotion_val ∈ Positive emotions) AND (Speed stable) AND (Pause rate low) AND (Clarity high) then
       Display "Excellent Confidence & Positive Delivery"
       Generate feedback summary
5: else
       Go to Algorithms 2-4 for reclassification
6: End

Fig: Home page

    Scenario 2: Needs Improvement (S2)

Input: Face, Voice, STT

Output: Quality = S2 (Low Confidence & Negative Emotion)

Algorithm Low_Performance_S2(Face, Voice, STT)
1: Initialize webcam and microphone
2: Detect emotion and extract speech features
3: if (Emotion_val ∈ Negative emotions) OR (Pause rate high) OR (Speed unstable) then
       Display "Low Confidence: Needs Improvement"
       Send improvement tips
4: else
       Go to Algorithm 1, 3, or 4
5: End

Fig: Admin home page

Scenario 3: Average Performance (S3)

Input: Face, Voice, STT

Output: Quality = S3 (Neutral Emotion & Moderate Confidence)

Algorithm Average_S3(Face, Voice, STT)
1: Initialize audiovisual components
2: Capture real-time emotion & speech
3: if (Emotion_val = Neutral) AND (Speed moderate) AND (Clarity acceptable) then
       Display "Average Performance: Can Improve"
       Log analysis for user
4: else
       Go to Algorithm 1, 2, or 4
5: End

Fig: Facial expression creation model page

Scenario 4: Mixed Performance (S4)

Input: Face, Voice, STT

Output: Quality = S4 (Emotional Inconsistency & Fluctuating Confidence)

Algorithm Mixed_S4(Face, Voice, STT)
1: Initialize all sensors
2: Read emotion stream and voice metrics
3: if (Emotion fluctuates frequently) OR (Confidence varies significantly) then
       Display "Mixed Emotional Response: Practice Needed"
       Record session for user review
4: else
       Go to Algorithms 1-3
5: End
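Taken together, Scenarios S1-S4 amount to a single decision rule over the detected emotions and confidence scores. The sketch below is one minimal interpretation; the positive/negative emotion sets and the numeric thresholds are assumptions for illustration, not values specified in the paper.

```python
# Sketch: map per-question emotion labels and confidence scores to S1-S4.
from typing import List

POSITIVE = {"happy", "surprise"}
NEGATIVE = {"sad", "angry", "fear", "disgust"}

def classify_performance(emotions: List[str], confidences: List[float]) -> str:
    avg_conf = sum(confidences) / len(confidences)
    conf_spread = max(confidences) - min(confidences)
    emotion_changes = sum(a != b for a, b in zip(emotions, emotions[1:]))

    if emotion_changes > len(emotions) // 2 or conf_spread > 0.4:
        return "S4"  # fluctuating emotion, inconsistent confidence
    if avg_conf >= 0.7 and any(e in POSITIVE for e in emotions) \
            and not any(e in NEGATIVE for e in emotions):
        return "S1"  # high confidence, positive emotion
    if avg_conf < 0.4 or any(e in NEGATIVE for e in emotions):
        return "S2"  # low confidence, negative emotion
    return "S3"      # neutral emotion, moderate confidence

# Example: a mostly neutral candidate with moderate, stable confidence maps to S3.
print(classify_performance(["neutral", "neutral", "happy"], [0.55, 0.6, 0.58]))
```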

Fig: CNN model creation page

    System Features and Innovations

• Real-Time Emotion Detection: Uses DeepFace to continuously track facial expressions like happiness, sadness, fear, or neutrality.

• Voice-Based Confidence Analysis: Measures speaking speed, clarity, pauses, and vocal stability.

• Speech-to-Text Conversion: Automatically converts spoken answers into text for evaluation.

• Interactive GUI (Tkinter): Provides a live dashboard showing emotion graphs, confidence scores, and feedback.

• Instant AI Feedback: Generates personalized tips to help candidates improve communication and emotional control.

• Scalable Architecture: Supports additional AI models (NLP scoring, gesture detection, personality estimation).

Feature | Existing System | Proposed System
Feedback Type | Manual, subjective | Automated, AI-based, objective
Emotion Detection | Not available | Real-time facial emotion recognition
Confidence Scoring | Manual observation | Voice-based ML confidence analysis
Data Visualization | None | GUI-based emotion & confidence graphs
Accuracy | Depends on evaluator | High due to ML models
Scalability | Limited | Fully scalable & customizable

    Expected Outcomes

    The system provides candidates with accurate, real-time, and personalized interview feedback, helping them understand emotional patterns, confidence levels, and communication strengths. It enhances interview readiness by enabling continuous improvement and supports institutions in training job seekers effectively.

  4. RESULTS & DISCUSSIONS

The AI-Based Mock Interview Evaluator was tested across four typical interview scenarios, using webcam data, microphone input, and transcript analysis. The system

accurately classified emotions, confidence levels, and performance categories based on real-time behavioral cues.

1. Scenario S1: Excellent Performance

Parameter | Observed Value | Interpretation
Emotion | Positive (Happy/Neutral) | Stable and confident delivery
Voice Speed | Balanced | Ideal for interview clarity
Pause Rate | Low | Indicates high confidence
STT Accuracy | 93% | Fluent and clear responses

Discussion:

The candidate showed consistent positive emotions, high vocal stability, and strong delivery. The system correctly identified the performance as Excellent (S1).

2. Scenario S2: Needs Improvement

Parameter | Observed Value | Interpretation
Emotion | Negative (Sad/Anxious) | Signs of nervousness
Voice Speed | Fast/unstable | Indicates stress
Pause Rate | High | Low confidence
STT Accuracy | 65% | Unclear articulation

Discussion:

Emotion instability and voice fluctuations indicate interview anxiety. The system correctly classified this as Needs Improvement (S2).

Fig: User login page

3. Scenario S3: Average Performance

Parameter | Observed Value | Interpretation
Emotion | Neutral | Neither positive nor negative
Voice Speed | Moderate | Acceptable
Pause Rate | Medium | Slight hesitation
STT Accuracy | 80% | Mostly clear

Discussion:

The candidate's performance is stable but lacks strong emotional engagement and confidence. The evaluator marked it as Average (S3).

Fig: Exam completion page

4. Scenario S4: Mixed Performance

Parameter | Observed Value | Interpretation
Emotion | Fluctuating | Mood changes frequently
Voice Speed | Inconsistent | Unsteady communication
Pause Rate | Irregular | Mixed confidence levels
STT Accuracy | 70% | Varies between responses

Discussion:

Emotion and confidence fluctuations lead to inconsistent answers. The system classified this as Mixed Performance (S4).

Fig: Viewing results in the user page

Overall Analysis

Across all four scenarios, the system successfully analyzed facial expressions, vocal patterns, and speech clarity to distinguish between excellent, average, and low-performance interviews. The combination of DeepFace, speech processing, and STT produced reliable real-time evaluation results, demonstrating the model's capability in soft-skill assessment.

CONCLUSIONS

The AI-Based Mock Interview Evaluator effectively aids candidates in improving their interview skills by providing real-time feedback on emotions and confidence. By integrating facial emotion detection and speech analysis, the system ensures a comprehensive evaluation. The user-friendly GUI

enhances accessibility and ease of use. This project bridges the gap between technical capability and soft skills training. Overall, it demonstrates the potential of AI in personal development and interview preparation.

REFERENCES

1. A. Mollahosseini, B. Hasani, and M. H. Mahoor, "AffectNet: A database for facial expression, valence, and arousal computing in the wild," IEEE Transactions on Affective Computing, vol. 10, no. 1, pp. 18-31, 2017.
2. F. Chollet, "Keras: The Python Deep Learning Library," 2015.
3. Q. Abbas, M. E. Celebi, and I. F. Garcia, "Emotion recognition from facial expressions using self-organizing map," Cognitive Computation, vol. 3, pp. 439-445, 2011.
4. Python Software Foundation, "SpeechRecognition Library," 2020.
5. DeepFace Framework, "A Lightweight Face Recognition and Facial Attribute Analysis Framework," 2020.
6. P. Ekman and W. V. Friesen, "Facial Action Coding System (FACS): A technique for the measurement of facial activity," Consulting Psychologists Press, 1978.
7. Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, "A survey of affect recognition methods: Audio, visual, and spontaneous expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 1, pp. 39-58, 2009.
8. C. Busso et al., "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, pp. 335-359, 2008.
9. B. Schuller, S. Steidl, and A. Batliner, "The INTERSPEECH Emotion Challenge," Proceedings of INTERSPEECH, 2009.
10. D. Ververidis and C. Kotropoulos, "Emotional speech recognition: Resources, features, and methods," Speech Communication, vol. 48, no. 9, pp. 1162-1181, 2006.
11. F. Eyben, M. Wöllmer, and B. Schuller, "openSMILE: The Munich versatile and fast audio feature extractor," ACM Multimedia Conference Proceedings, 2010.
12. Z. Zhang and J. Zhang, "Deep learning-based speech emotion recognition: A review," IEEE Access, vol. 8, pp. 4861-4877, 2020.
13. J. Kim and E. André, "Emotion recognition based on physiological changes in speech," Pattern Analysis and Applications, vol. 11, pp. 85-101, 2008.
14. D. E. King, "Dlib-ML: A machine learning toolkit," Journal of Machine Learning Research, vol. 10, pp. 1755-1758, 2009.
15. I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.
16. Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436-444, 2015.
17. X. Li and L. Deng, "Multimodal emotion recognition combining speech and facial expressions," IEEE Signal Processing Letters, vol. 27, pp. 705-709, 2020.
18. IBM Corporation, "IBM Watson Speech-to-Text Documentation," 2021.
19. K. Zhang and U. Zafar, "Real-time face emotion recognition using CNN models," Journal of Intelligent Systems, vol. 30, no. 1, pp. 924-935, 2021.
20. Python Software Foundation, "Tkinter GUI Library Documentation," 2020.