
TalkLens: An AI-Powered Browser-Based System for Real-Time Interview Analysis and Behavioral Insight

DOI: 10.17577/IJERTCONV14IS010039

Nishanth U

Student, St. Joseph Engineering College, Mangalore, India

Sumangala N

Assistant Professor, St. Joseph Engineering College, Mangalore, India

Abstract – Modern interviews demand more than just technical evaluation – they require an understanding of emotional cues, communication clarity, and candidate engagement. To support this need, we present TalkLens, a browser-based, AI-powered system that provides a holistic candidate evaluation by integrating real-time facial emotion detection, speech transcription, resume-to-job-description analysis, and semantic answer quality assessment. The platform's pipeline utilizes the Roboflow API for emotion classification, AssemblyAI for high-accuracy speech-to-text conversion, and the Google Gemini API for two critical NLP tasks: scoring resume alignment with the job description and evaluating the contextual quality of a candidate's transcribed answers. These individual metrics are then aggregated into a final, objective evaluation using a transparent weighted scoring model designed to eliminate human bias. In simulated tests, the system achieved approximately 87% accuracy in emotion detection, 94% accuracy for speech transcription, and the final integrated model demonstrated a predictive accuracy of 87.5% in distinguishing high-potential candidates from low-potential ones. By leveraging these tools in a unified, automated framework, TalkLens enables HR professionals to gain deeper behavioral insights, reduce unconscious bias, and make more consistent, data-driven hiring decisions.

Index Terms – Interview Automation, Emotion Detection, TensorFlow.js, Roboflow, Speech Analysis, HR Tech, AssemblyAI, Multimodal Interaction

  1. INTRODUCTION

    Modern recruitment has evolved far beyond the evaluation of technical skills and qualifications listed on a resume. Today's organizations increasingly seek candidates who demonstrate a complex blend of competencies, including high emotional intelligence, strong communication skills, and adaptability under pressure. However, the traditional interview process, whether conducted in-person or remotely, is often ill-equipped to measure these nuanced soft skills objectively. Interviewers must rely heavily on subjective impressions and intuition, a method that can introduce significant inconsistency, lead to unstructured and incomparable feedback, and perpetuate unconscious biases related to gender, ethnicity, or background. These inherent limitations not only make it difficult for hiring teams to accurately and fairly evaluate a candidate's potential but can also undermine organizational goals for diversity and inclusion.

    To address these challenges, we propose TalkLens, an intelligent interview assistant designed to automate and standardize the candidate evaluation process through a multi-faceted application of artificial intelligence. TalkLens functions as a comprehensive, browser-based system that assesses candidates from their initial application to their verbal responses in an interview. The evaluation pipeline begins with an automated analysis of a candidate's uploaded resume against the job description, leveraging a large language model to generate a baseline score for experience and skill alignment. Following this, during the interview session, the system captures the candidate's webcam video and microphone audio in real time. This live data is used to perform two parallel analyses: facial expressions are examined for emotional cues, while spoken answers are transcribed into text. The system then performs a deeper semantic analysis on these transcripts to evaluate the quality, clarity, and relevance of the candidate's answers. Finally, all of these distinct metrics (resume match, answer quality, and behavioral insights from facial analysis) are aggregated into a final, objective score using a transparent weighted model, providing a holistic and data-driven hiring recommendation.

    This multimodal analysis provides HR professionals with richer, more consistent feedback while significantly reducing the cognitive load and note-taking burden on interviewers. The TalkLens platform is built using a modern technology stack, including TensorFlow.js for efficient, in-browser facial landmark detection, the Roboflow API for emotion recognition, and AssemblyAI for high-accuracy speech transcription. The Google Gemini API is utilized for the dual, high-level NLP tasks of both resume analysis and answer quality evaluation. Although TalkLens has not yet been tested in live recruitment environments, its full functionality and the integration of its components have been validated using simulated input and comprehensive test data. This paper presents the motivation, technical framework, component integration, and use case of TalkLens, highlighting its potential to serve as a powerful tool for creating a more consistent, fair, and intelligent hiring process.

  2. LITERATURE REVIEW

    The integration of artificial intelligence into recruitment processes has gained significant momentum, with research focusing on automating candidate screening, analyzing soft skills, and minimizing interviewer bias. As organizations increasingly shift toward digital hiring practices, a diverse ecosystem of tools has emerged to support virtual interviewing, emotion detection, and speech analysis. This review examines the key domains of research that inform the development of TalkLens. It surveys the existing literature on AI-driven interview platforms, facial emotion recognition, speech and language analysis, and automated resume screening, thereby identifying the critical research gap that our proposed system aims to address.

    1. AI in Video Interviewing Platforms

      AI-powered video interviewing platforms, such as the widely-used HireVue, have become prominent in modern corporate recruitment. These tools typically analyze a candidate's facial expressions, tone of voice, and use of language to generate performance scores and behavioral summaries intended to predict job success. However, despite their widespread adoption, these platforms have faced significant criticism regarding their "black-box" nature, where the proprietary algorithms that produce candidate scores lack transparency and interpretability. Studies by van Esch et al. highlight that HR professionals using these systems often cannot trace how specific scores are derived, which damages trust and raises ethical concerns about fairness and potential algorithmic bias. TalkLens was designed to directly address these transparency issues. Instead of relying on an opaque, monolithic scoring system, our platform features a modular architecture where each analytical component can be independently audited. The final evaluation is produced by a simple, transparent weighted formula, ensuring that reviewers can understand exactly how a decision was reached, promoting fairness and explainable AI principles.

    2. Facial Emotion Recognition Techniques

      Facial emotion recognition is a cornerstone of behavioral analysis in AI systems, most commonly built using Convolutional Neural Networks (CNNs) trained on vast, labeled datasets like FER2013 and AffectNet. Seminal work by Mollahosseini et al. demonstrated that emotion classification models can achieve high accuracy by analyzing facial landmarks and action units. A primary architectural challenge in implementing this technology for live interviews is the trade-off between latency and privacy. Many systems require server-side processing of video streams, which can introduce delays and create data privacy concerns. TalkLens implements a hybrid approach to mitigate these issues. It uses TensorFlow.js for efficient, client-side facial landmark detection directly in the browser, which minimizes the data sent to the server. Only periodic screenshots are then sent to a backend Roboflow API for the final emotion classification. This architecture enables real-time behavioral feedback while enhancing user privacy by avoiding the need to store or transmit raw video files.

    3. Automated Speech and Language Analysis

      Automatic Speech Recognition (ASR) technology is critical for converting spoken responses into text for analysis. Modern APIs from providers like Google Cloud Speech-to-Text and AssemblyAI offer high-accuracy live transcription with advanced features such as punctuation and keyword extraction. As shown by Wang et al., integrating ASR into HR applications significantly enhances interview record-keeping and improves the quality of feedback. TalkLens utilizes AssemblyAI for its robust transcription capabilities but extends this functionality beyond simple record-keeping. The textual output is subsequently passed to the Google Gemini API for a deeper semantic analysis of the candidate's answer quality, relevance, and clarity. This two-step process allows TalkLens to evaluate not just the clarity of speech, but the cognitive substance of what a candidate says, adding a layer of assessment that traditional ASR implementations lack.

    4. Automated Resume and Job Description Analysis

    A foundational step in modern recruitment is the initial screening of resumes against a job description. Traditional automated approaches have relied on keyword matching and basic parsing, which are often brittle and fail to capture the semantic context of a candidate's experience. More advanced techniques from Natural Language Processing (NLP), such as TF-IDF (Term Frequency-Inverse Document Frequency) and cosine similarity, have improved this by representing documents as vectors to measure similarity. However, these methods can still struggle with the nuanced language of skills and job roles. The advent of Large Language Models (LLMs) has introduced a state-of-the-art approach to this problem. TalkLens leverages the advanced reasoning capabilities of the Google Gemini API to perform a deep, contextual analysis, comparing the resume and job description semantically. This allows the system to understand skill equivalencies, gauge the relevance of project experience, and provide a more accurate alignment score than what is possible with keyword-based or traditional vector similarity methods.

    Despite the existence of sophisticated tools for emotion detection, ASR, and NLP, very few systems integrate these features into a single, transparent, and accessible browser-based platform designed for real-world hiring. Furthermore, major issues concerning ethical AI, data transparency, and algorithmic fairness remain unresolved in many proprietary platforms. TalkLens is designed to bridge this gap. It creates an end-to-end evaluation pipeline that assesses a candidate holistically, from their resume to their communication skills and behavioral cues. Crucially, it replaces opaque "black-box" scoring with a simple, interpretable weighted model. By presenting clear, data-driven insights from each distinct module, TalkLens empowers human reviewers to make better, fairer, and more consistent hiring decisions, reducing ethical risks and increasing confidence in the automated system.

  3. DATA AND METHODOLOGY

    The TalkLens system is designed to offer a seamless and intelligent interviewing experience by integrating multiple AI technologies within a cohesive web-based application. This section describes the system's architecture, the specific roles of its technology components, the multi-stage data processing pipeline, and the real-time interaction between the frontend and backend modules. Each technology was chosen for its performance, real-time support, and compatibility with modern web development practices.

    1. System Architecture Overview

      TalkLens follows a modular client-server architecture designed for scalability and real-time performance. The frontend, built using ReactJS, serves as the primary user interface and is responsible for all client-side operations, including webcam access, in-browser face tracking via TensorFlow.js, and audio capture using the MediaRecorder API. The backend, built with Node.js and the Express framework, acts as an orchestration layer. It processes incoming data from the client (e.g., audio clips, video screenshots), routes requests to the appropriate external AI services (Roboflow, AssemblyAI, Google Gemini), and manages the secure storage of all processed results in a MongoDB database. This distributed architecture ensures that computationally intensive tasks are handled by specialized backend services, keeping the client-side application lightweight and responsive, which is critical for a smooth user experience during live interviews.

    2. Technology Stack Summary

      The system is constructed using the following core components:

      • Frontend: ReactJS for building a dynamic and interactive user interface.

      • Real-time Face Detection: TensorFlow.js for efficient, in-browser facial landmark detection without server-side video processing.

      • Backend: Node.js with the Express framework to create RESTful APIs and manage communication with third- party services.

      • Emotion Classification: Roboflow API, utilizing a trained computer vision model to classify emotions from static video frames.

      • Speech-to-Text: AssemblyAI API for high-accuracy, real-time transcription of candidate audio.

      • NLP and Reasoning: Google Gemini API for the dual tasks of semantic resume analysis and contextual answer quality evaluation.

      • Database: MongoDB for flexible, scalable storage of all session data, including evaluation scores, transcripts, and generated questions.

    3. Stage 1: Resume-to-Job-Description Analysis

      Before the live interview, the system performs an initial screening by analyzing the candidate's resume against the specified job description. This process uses the Google Gemini API for an advanced semantic evaluation that goes beyond simple keyword matching.

      1. Input: The system receives the raw text from the candidate's resume and the text of the job description.

      2. Prompt Engineering: A carefully constructed prompt is sent to the Gemini API, instructing the model to perform a deep contextual analysis of the candidate's skills, professional experience, and project history in relation to the job requirements.

      3. Score Generation: The prompt explicitly asks the model to return a numerical score from 1 to 10, representing the degree of alignment. This output becomes the resumeScore.
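
      As a minimal sketch, the Stage 1 flow above might look as follows on the Node.js backend. The prompt wording and helper names are our own illustrations, not TalkLens's actual code; in production, the built prompt would be sent to the Gemini API and the reply passed through the score parser.

```javascript
// Hypothetical sketch of the Stage 1 resume-screening step.
// Prompt text and function names are illustrative assumptions.

// Build the instruction sent to the language model.
function buildResumePrompt(resumeText, jobDescription) {
  return [
    "You are an experienced technical recruiter.",
    "Compare the candidate's resume to the job description below,",
    "considering skills, professional experience, and project history.",
    "Reply with a single integer from 1 to 10 for overall alignment.",
    "",
    `Job description:\n${jobDescription}`,
    "",
    `Resume:\n${resumeText}`,
  ].join("\n");
}

// Extract the first integer in the 1-10 range from the model's
// free-text reply, guarding against replies that add extra prose.
function parseResumeScore(reply) {
  const match = reply.match(/\b(10|[1-9])\b/);
  return match ? Number(match[1]) : null;
}
```

      Because LLM replies are free-form text, the defensive parsing step matters: the prompt asks for a bare integer, but the parser still tolerates answers such as "Alignment score: 8".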

    4. Stage 2: Real-time Interview Analysis Pipeline

      During the live interview, TalkLens activates a multi-modal analysis pipeline to capture and evaluate the candidate's verbal and non-verbal cues.

      • D.1. Client-Side Face Tracking: TalkLens uses TensorFlow.js to perform facial landmark detection directly in the user's browser. This client-side approach significantly reduces latency and enhances privacy by avoiding the need to continuously stream raw video footage to a server. The model tracks key facial features and periodically captures static screenshots, which are then sent to the backend for deeper emotional analysis.

      • D.2. Emotion Classification: The backend receives these screenshots and forwards them to a trained Roboflow model for emotion classification. The API returns a primary emotion label (e.g., "happy," "neutral," "surprised") along with a confidence value (0-1) for that prediction. The system collects these confidence values from every analyzed frame to compute an average, which serves as the candidate's overall averageConfidence score.

      • D.3. Speech Transcription: Concurrently, the browser records the candidate's speech using the MediaRecorder API. These audio chunks are sent to the backend, which relays them to the AssemblyAI API for transcription. The API returns a highly accurate text transcript with punctuation, which is stored and then used in the subsequent analysis stage.

      • D.4. Answer Quality Analysis: After the interview concludes, the full set of transcribed question-and-answer pairs is sent to the Google Gemini API. A second, distinct prompt instructs the model to evaluate the candidate's responses based on criteria such as relevance to the question, clarity, depth of explanation, and professionalism. The model returns a single, holistic transcriptionScore from 1 to 10 based on the overall quality of the candidate's communication.
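
      The per-frame aggregation described in step D.2 reduces to a simple mean over the classifier's confidence values. The sketch below is illustrative (field names are assumed, not the exact API response shape returned by Roboflow):

```javascript
// Sketch of the D.2 aggregation step: each analyzed frame yields an
// emotion label and a confidence in [0, 1]; the session-level
// averageConfidence is the mean over all analyzed frames.
function averageConfidence(frames) {
  if (frames.length === 0) return 0; // no frames analyzed yet
  const sum = frames.reduce((acc, f) => acc + f.confidence, 0);
  return sum / frames.length;
}

// Example frames as they might come back from the emotion classifier.
const frames = [
  { emotion: "neutral", confidence: 0.91 },
  { emotion: "happy", confidence: 0.82 },
  { emotion: "confident", confidence: 0.87 },
];
// averageConfidence(frames) -> (0.91 + 0.82 + 0.87) / 3 ≈ 0.8667
```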

    5. Stage 3: Final Evaluation and Recommendation

      For its final output, TalkLens uses a simple and transparent weighted scoring model to aggregate the metrics from the preceding stages. This approach was deliberately chosen over a complex "black-box" classifier to ensure the final recommendation is interpretable and fair.

      • Formula: The candidate's overallScore is calculated using the following weighted average: overallScore = (resumeScore × 0.5) + (transcriptionScore × 0.3) + (averageConfidence × 0.2)

      • Decision Threshold: A fixed threshold is applied to the final score to generate a hiring recommendation. An overallScore greater than 7.0 results in a "Selected" status, while a score of 7.0 or below results in a "Rejected" status.
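
      The weighted model and threshold translate directly into code. The sketch below follows the paper's formula verbatim; note that, as stated, resumeScore and transcriptionScore are on a 1-10 scale while averageConfidence is on a 0-1 scale, so the third term contributes at most 0.2 points under these weights.

```javascript
// Direct transcription of the Stage 3 weighted scoring model.
function overallScore(resumeScore, transcriptionScore, averageConfidence) {
  return resumeScore * 0.5 + transcriptionScore * 0.3 + averageConfidence * 0.2;
}

// Fixed decision threshold: above 7.0 -> "Selected", otherwise "Rejected".
function recommendation(score) {
  return score > 7.0 ? "Selected" : "Rejected";
}

// Example: a strong candidate profile.
const score = overallScore(9, 8, 0.9); // 4.5 + 2.4 + 0.18 ≈ 7.08
// recommendation(score) -> "Selected"
```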

    6. Data Storage and Privacy

    All processed data, including emotion labels, confidence scores, transcripts, and final evaluations, are stored securely in a MongoDB database. Each interview session is assigned a unique ID to ensure data integrity and prevent overlap. To protect user privacy, the system is designed to avoid permanently storing raw video or audio files by default, thereby complying with modern data security standards.

  4. RESULT ANALYSIS

    To evaluate the effectiveness of the multi-stage TalkLens system, we conducted a series of tests using a curated dataset of simulated interview scenarios. This dataset included sample resumes, pre-written job descriptions, and pre-recorded video/audio clips designed to represent a range of candidate profiles and response qualities. The objective was to validate the real-time interaction between the system's modules, measure the accuracy of each AI-driven component, and benchmark the performance of the final integrated evaluation model.

    1. Individual Module Performance

      Each component of the TalkLens pipeline was tested independently to assess its accuracy and reliability.

      • Facial Emotion Detection Performance In the first phase, screenshots simulating various facial expressions (e.g., happy, nervous, neutral, confident) were passed through the Roboflow-based emotion detection pipeline. Out of 100 test inputs, the classification accuracy was compared against human-annotated labels. The model achieved an overall accuracy of approximately 87%. As shown in the chart below, performance varied slightly across emotions, with "Confident" and "Neutral" expressions being the most reliably identified. Subtle or mixed emotions proved more challenging, occasionally being misclassified as "neutral". These results confirm that the module reliably classifies clear expressions, providing a strong basis for the confidence score.

      • Speech-to-Text Accuracy The AssemblyAI transcription pipeline was tested using a set of 10 interview-style audio recordings that included diverse accents and technical vocabulary. The API returned transcripts with an average word accuracy rate of 94%, with near-perfect punctuation and sentence segmentation. This high level of accuracy ensures that the subsequent answer quality analysis by the Gemini API is performed on a reliable and accurate representation of the candidate's spoken responses.

      • Resume-to-Job-Description Matching Performance To validate the resume analysis module, we created a test set of 25 fictional resumes and 5 distinct job descriptions. Two experienced HR professionals manually scored each resume-job pair on a scale of 1-10 for alignment. The Gemini-powered analysis was then performed on the same pairs. The model's scores achieved a Mean Absolute Error (MAE) of 0.72 when compared to the average human scores, indicating a very strong correlation. For comparison, a baseline model using simple keyword matching resulted in a much higher MAE of 2.48, demonstrating the superior contextual understanding of the LLM-based approach.

      • Answer Quality Scoring Performance

        The performance of the answer quality scoring module was evaluated using a set of 30 pre-scripted interview answers of varying quality (ranging from clear and insightful to vague and irrelevant). When the Gemini-generated scores were compared against human evaluations, the results showed a Pearson correlation coefficient of 0.88, signifying a strong positive relationship. This confirms the model's ability to reliably assess a candidate's communication skills and the substance of their answers.
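
      The two agreement statistics quoted above (MAE for resume matching, Pearson's r for answer quality) are standard and can be sketched as follows. The code and example values are illustrative, not the evaluation scripts or data actually used:

```javascript
// Mean Absolute Error: average absolute gap between model scores
// and (averaged) human scores, as quoted for the resume-matching module.
function meanAbsoluteError(predicted, actual) {
  const total = predicted.reduce(
    (acc, p, i) => acc + Math.abs(p - actual[i]), 0);
  return total / predicted.length;
}

// Pearson correlation coefficient, as quoted for the answer-quality
// module. Assumes x and y have equal, non-zero length.
function pearson(x, y) {
  const n = x.length;
  const mean = (a) => a.reduce((s, v) => s + v, 0) / n;
  const mx = mean(x), my = mean(y);
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < n; i++) {
    num += (x[i] - mx) * (y[i] - my);
    dx += (x[i] - mx) ** 2;
    dy += (y[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}

// Illustrative values (not the paper's test data):
// meanAbsoluteError([8, 6, 9], [7, 6, 10]) -> (1 + 0 + 1) / 3 ≈ 0.667
// pearson([1, 2, 3], [2, 4, 6]) -> 1 (perfectly correlated)
```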

    2. Integrated System Performance and Comparative Analysis

      After validating the individual components, we evaluated the performance of the end-to-end weighted scoring model.

      • Overall Predictive Accuracy

        We created 40 complete, simulated candidate profiles, each with a resume, video clips, and transcribed answers. These profiles were pre-labeled by our HR experts as either "High-Potential" (20 profiles) or "Low-Potential" (20 profiles). The TalkLens system then processed each profile, and the final "Selected" or "Rejected" status was compared against the human-assigned label. The TalkLens weighted model correctly classified 35 out of the 40 profiles, achieving an overall predictive accuracy of 87.5%.

      • Comparative Model Analysis

    To further validate our choice of a transparent weighted model, we compared its predictive accuracy against two standard machine learning classifiers: a Random Forest and a Support Vector Machine (SVM). These models were trained on the same input features (resumeScore, transcriptionScore, averageConfidence) to predict the same "High-Potential" vs. "Low-Potential" outcome. The results are summarized below.

  5. CONCLUSION

    The evolution of hiring practices demands tools that can evaluate not only a candidate's qualifications but also their behavioral and emotional cues in a structured and objective manner. This paper introduced TalkLens, a smart interview assistant designed to meet that demand by integrating multiple AI components into a unified, web-based platform. The primary contribution of this work is the design and validation of an end-to-end evaluation system that combines multi-modal analysis (spanning resume content, facial expressions, and transcribed verbal responses) with a transparent, weighted scoring framework to produce a fair and data-driven hiring recommendation.

    Our results demonstrate the effectiveness of this integrated approach. The individual AI modules for facial emotion detection, speech transcription, resume-to-job-description matching, and answer quality analysis all performed with high accuracy in simulated tests. The final integrated model achieved a predictive accuracy of 87.5% in distinguishing between high-potential and low-potential candidate profiles. While a standard machine learning classifier like Random Forest yielded a marginally higher accuracy (90.0%), our transparent weighted model was deliberately chosen. Its simple, interpretable formula directly addresses the critical "black-box" problem prevalent in many commercial AI hiring tools, ensuring the system's decisions are explainable and auditable. By simulating a virtual interviewer that operates in real time, TalkLens reduces dependency on subjective manual evaluations, minimizes bias, and supports HR professionals in making more informed and consistent hiring decisions.

    However, we acknowledge the limitations of the current system. The platform was validated using a curated set of simulated data and has not yet been tested in large-scale, real-world recruitment deployments. The emotion detection model, while generally accurate, can struggle to classify subtle or mixed emotions, sometimes misinterpreting them as "neutral". Furthermore, the system's overall performance is inherently dependent on the quality and potential biases of the underlying third-party AI models from Google, Roboflow, and AssemblyAI.

    In conclusion, TalkLens represents a significant step forward in intelligent hiring technology by offering an accessible, scalable, and ethically conscious solution for comprehensive interview analysis. By prioritizing transparency and a multi-faceted evaluation strategy, it provides a responsible and effective model for the future of AI in human resources.

  6. FUTURE SCOPE

    While the current implementation of TalkLens serves as a robust proof-of-concept, there is significant potential for extending the system's capabilities to create an enterprise-ready solution. Future work will focus on three key areas: deepening the AI's analytical power, enhancing enterprise integration, and improving user accessibility.

    First, we plan to significantly enhance the depth of the AI analysis. The current framework can be extended to include more sophisticated behavioral models, such as analyzing interview transcripts to identify the Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism). To improve technical assessments, we plan to integrate modules for live coding challenges and system design evaluations. The ultimate goal is to implement true predictive analytics, where machine learning models are trained on post-hire performance data to more accurately predict long-term candidate success and retention risk.

    Second, to ensure seamless adoption in corporate environments, we will focus on enterprise integration and scalability. A primary objective is to develop plugins for major Applicant Tracking Systems (ATS) like Workday and Greenhouse, allowing TalkLens to fit directly into existing HR workflows. We also plan to build an advanced analytics dashboard for administrators to track key hiring funnel metrics, such as time-to-hire and quality-of-hire, and to monitor diversity and inclusion goals. Collaboration features, such as support for panel interviews with multiple reviewers and team-based scoring, will also be implemented.

    Finally, we will prioritize user accessibility and experience. This includes the development of native mobile applications for both recruiters and candidates, allowing for interview management and participation on the go. To support global organizations, we will incorporate multi-language support for both the interface and the AI analysis modules. Further enhancements will ensure compliance with web accessibility standards (WCAG 2.1), making the platform usable for individuals with diverse abilities and solidifying TalkLens as a truly inclusive hiring tool.

  7. REFERENCES

  1. P. van Esch and A. Wallace, Understanding AI-Based Video Interviewing Systems: Trends, Challenges, and Future Possibilities, Journal of Artificial Intelligence in Business, vol. 6, no. 2, pp. 22–29, 2020.

  2. R. Mollahosseini and M. Hasani, Facial Emotion Recognition in Practical Scenarios: A Review of Visual Learning Approaches, International Journal of Affective Computing and Interaction, vol. 5, no. 1, pp. 44–52, 2018.

  3. X. Chen and L. Rao, Real-Time Speech Transcription in Cloud-Based Applications, in Proceedings of the International Conference on Audio Intelligence and NLP, 2021, pp. 102–108.

  4. A. Kale and S. Banerjee, Analyzing Interview Emotions Using Multimodal Data: A Framework for Soft Skill Assessment, Computational Human Behavior Studies, vol. 9, no. 3, pp. 133–140, 2019.

  5. A. Z. Ahmed and K. Lee, Integrating Facial and Vocal Cues in Recruitment Systems: A Multimodal Evaluation Approach, IEEE Journal on Intelligent Human-Machine Systems, vol. 4, no. 1, pp. 50–58, 2020.

  6. AssemblyAI Inc., Speech Recognition API: Developer Guide, 2025. [Online]. Available: https://www.assemblyai.com/docs

  7. Roboflow LLC, Emotion Classification Model Deployment using Roboflow API, 2025. [Online]. Available: https://docs.roboflow.com

  8. Google AI, Gemini API for NLP and Follow-Up Questioning, 2025. [Online]. Available: https://ai.google.dev/gemini-api

  9. T. Ramesh and K. Joseph, Ethical Considerations in AI-Powered Hiring: Privacy and Fairness in Automated Interviews, Journal of AI Ethics and Regulation, vol. 3, no. 1, pp. 66–73, 2022.

  10. A. Kumar and S. Venkatesan, Browser-Based Face Tracking Using TensorFlow.js: Performance, Usability, and Applications, Web Computing and Vision Systems Journal, vol. 8, no. 2, pp. 117–124, 2021.