DOI: https://doi.org/10.5281/zenodo.19945540
- Open Access

- Authors: Sumit Kshirsagar
- Paper ID: IJERTV15IS042787
- Volume & Issue: Volume 15, Issue 04, April 2026
- Published (First Online): 01-05-2026
- ISSN (Online): 2278-0181
- Publisher Name: IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
MindScope: A Multi-Modal Mental Wellness Monitoring System using Large Language Models and Facial Expression Analysis
Sumit Kshirsagar
Department of Artificial Intelligence and Data Science Engineering
Vidya Vikas Pratishthan Institute of Engineering and Technology, Solapur, Maharashtra, India
Abstract – Mental health monitoring remains a critical challenge due to the subjective nature of self-reported assessments and the infrequency of clinical evaluations. This paper presents MindScope, a multi-modal wellness monitoring system that integrates Natural Language Processing (NLP) analysis of free-text journal entries with real-time AI-powered facial expression detection to produce a comprehensive psychological indicator profile. The system employs Groq-accelerated LLaMA 3.1 (70B parameters) for structured sentiment extraction and LLaMA 3.2 Vision (11B parameters) for facial emotion recognition via a standard webcam. A novel Three-Channel Discrepancy Engine compares three independent data signals (subjective self-reported mood ratings, NLP-derived sentiment valence, and AI-detected facial valence) to compute an emotional incongruence score. Discrepancy values exceeding a threshold of 30 points are flagged as potential Emotional Masking, a clinically significant pattern associated with affective suppression. The system outputs structured JSON comprising mood stability, anxiety markers, clinical indicators (anhedonia, fatigue, sleep disturbance), ten cognitive distortion categories, and a unified risk classification (Green/Yellow/Red). A pilot evaluation was conducted with 25 student participants across 75 sessions at VVPIET, Solapur. MindScope achieved 84.0% accuracy in risk-level classification and 78.6% sensitivity in detecting emotional masking cases. End-to-end analysis latency averaged 6.6 seconds per session. All data are stored locally with SHA-256 hashed identifiers. MindScope is positioned explicitly as a research-grade wellness monitoring tool, not a clinical diagnostic instrument.
Keywords: mental wellness monitoring; large language models; facial expression recognition; sentiment analysis; emotional masking; affective computing; LLaMA; Groq; multi-modal AI; cognitive distortion detection
INTRODUCTION
Mental health disorders represent one of the leading causes of global disability. According to the World Health Organization, approximately one in eight individuals worldwide lives with a mental health condition, yet the majority receive no professional intervention [1]. In low- and middle-income countries such as India, this treatment gap exceeds 75%, with fewer than 0.3 psychiatrists per 100,000 population available in many states [2]. The COVID-19 pandemic further exacerbated this crisis, contributing to a 25% increase in anxiety and depression prevalence globally in 2020, with disproportionate impact on college-going students [3].
Traditional clinical assessment relies on periodic standardized instruments such as the Patient Health Questionnaire-9 (PHQ-9) or the Generalized Anxiety Disorder-7 (GAD-7), administered at intervals of weeks to months. These instruments fail to capture the dynamic, high-frequency fluctuations that characterize mental health trajectories in young adults. Furthermore, single-modality self-report is inherently subject to distortion: individuals may suppress or misrepresent their emotional state due to social desirability bias, alexithymia, or stigma [4].
The emergence of large language models (LLMs) and multimodal vision-language models offers new possibilities for objective, low-cost, and continuous wellness monitoring. Recent research has demonstrated that LLMs can extract clinically relevant features from free text with accuracy approaching human clinicians on depression screening tasks [5]. Concurrently, deep learning approaches to facial expression analysis have achieved recognition accuracy exceeding 70% on standardized benchmarks [6]. However, no prior system has explicitly combined NLP-based journal analysis with real-time facial expression detection and a formal discrepancy engine for emotional incongruence detection within a single, privacy-preserving, deployable architecture. This paper presents MindScope, developed as a final-year undergraduate project at VVPIET, Solapur, which addresses this gap. The contributions of this work are: (i) an end-to-end multi-modal wellness monitoring architecture integrating LLM-based NLP and vision analysis; (ii) the Three-Channel Discrepancy Engine for emotional masking detection; (iii) a validated structured JSON schema for reproducible time-series wellness tracking; and (iv) a privacy-preserving local data storage architecture with SHA-256 hashed identifiers.
RELATED WORK
NLP for Mental Health Assessment
Computational approaches to mental health assessment from natural language have a rich research history. De Choudhury et al. [7] demonstrated that linguistic features from social media could predict major depressive disorder onset with over 70% accuracy. Coppersmith et al. [8] extended this work to PTSD and bipolar disorder. Ji et al. [9] applied BERT-based models to suicidal ideation detection, achieving F1 scores of 0.93. Yang et al. [10] showed that instruction-tuned LLaMA models match GPT-3.5 on multiple mental health classification tasks. MindScope builds on this evidence by employing structured JSON-constrained prompting with LLaMA 3.1-70B to extract a multi-dimensional indicator profile rather than a binary label.
Facial Expression Recognition
Facial expression recognition (FER) has theoretical roots in the Facial Action Coding System (FACS) of Ekman and Friesen [11], which identified seven cross-culturally universal basic emotions. CNN approaches achieved 58.0% accuracy on AffectNet [12], rising to over 65% with Vision Transformers [13]. The use of large vision-language models (LVLMs) such as LLaMA 3.2 Vision for FER enables zero-shot generalization across demographic groups and unconstrained conditions without dataset-specific fine-tuning.
Multimodal Affective Computing
Poria et al. [14] reported consistent accuracy gains of 3–8% when combining audio, visual, and textual channels over any single modality. The AVEC challenge series [15] has benchmarked multimodal fusion since 2011. More recent foundation models including CLIP [16] and FLAVA [17] demonstrate cross-modal alignment. MindScope differs from fusion approaches by treating channels as independent witnesses and measuring their disagreement rather than combining them into a joint representation.
Emotional Masking and Affective Suppression
The psychological construct of emotional suppression was formalized by Gross and John [4], who showed that habitual suppressors exhibit lower positive affect and poorer wellbeing outcomes. Bonanno et al. [18] provided direct evidence that facial responses discrepant with self-reported affect predict long-term psychological maladjustment, directly motivating the discrepancy engine in MindScope. To the best of our knowledge, MindScope is the first system to operationalize this three-channel discrepancy model computationally within an interactive monitoring application.
SYSTEM ARCHITECTURE
[Fig. 1 depicts the three-tier architecture: a Presentation tier (React 18, Chart.js, webcam capture) communicating over HTTP/JSON with an Application tier (FastAPI REST API, Python 3.11) that runs the four-stage pipeline (P1: LLaMA 3.1-70B NLP analysis, P2: LLaMA 3.2-11B Vision face emotion, P3: Python Discrepancy Engine, P4: LLaMA 3.1-70B synthesis), backed by a Data tier consisting of a local SQLite database with hashed identifiers.]
MindScope follows a three-tier client-server architecture. The presentation tier is a React 18 single-page application (SPA) providing journal entry, webcam capture, and analytics visualization. The application tier is a Python FastAPI REST API orchestrating analysis and database operations. An external AI tier interfaces with the Groq cloud API for LLM inference. The data tier is a local SQLite database.
Fig. 1. MindScope System Architecture
The primary REST endpoint POST /analyze/combined accepts journal text, an optional base64-encoded face image, and a self-reported mood score (0–100). It orchestrates sequential Groq API calls before persisting the complete structured result to SQLite and returning the unified JSON response to the React client. Five additional endpoints support historical retrieval, aggregate statistics, face-only analysis, journal-only analysis, and system health checks.
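For illustration, a client request to this endpoint can be sketched as below. The JSON field names (journal_text, face_image_b64, self_reported_mood) and the local URL are assumptions for demonstration; the paper specifies only the payload contents, not the exact keys.

```python
# Hypothetical client call to POST /analyze/combined; field names and URL are
# illustrative assumptions, not taken from the MindScope codebase.
import base64
import requests

with open("frame.jpg", "rb") as f:                        # hypothetical webcam frame
    face_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "journal_text": "Slept badly again, skipped lectures, everything feels flat.",
    "face_image_b64": face_b64,        # optional channel
    "self_reported_mood": 62,          # 0-100 slider value
}

resp = requests.post("http://localhost:8000/analyze/combined", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json().get("risk_level"))   # e.g. "Yellow"
```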
METHODOLOGY
Structured NLP Sentiment Extraction
Journal entries are processed by Groq-accelerated LLaMA 3.1-70B using a deterministic system prompt that constrains all model output to a validated 13-field JSON schema. Inference temperature is fixed at 0.3. The prompt encodes explicit risk classification rules: Red requires sentiment_valence < −0.5 AND anxiety_markers > 70 AND at least one of (anhedonia > 60, fatigue > 70); Yellow requires sentiment_valence < −0.2 OR anxiety_markers > 50 OR two or more clinical markers exceeding 50; Green encompasses all remaining cases. Thresholds were calibrated against PHQ-9 and GAD-7 scoring conventions during pilot testing.
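The prompt-encoded thresholds can be restated procedurally for clarity. The sketch below is a plain-Python paraphrase of the stated rules; in the deployed system the classification is performed by the LLM inside the prompt, not by application code.

```python
def classify_risk(valence: float, anxiety: float, anhedonia: float,
                  fatigue: float, sleep_disturbance: float) -> str:
    """Paraphrase of the Red/Yellow/Green rules encoded in the NLP prompt."""
    # Red: strongly negative valence, high anxiety, plus anhedonia or fatigue.
    if valence < -0.5 and anxiety > 70 and (anhedonia > 60 or fatigue > 70):
        return "Red"
    # Yellow: negative valence, elevated anxiety, or two+ clinical markers above 50.
    clinical_above_50 = sum(m > 50 for m in (anhedonia, fatigue, sleep_disturbance))
    if valence < -0.2 or anxiety > 50 or clinical_above_50 >= 2:
        return "Yellow"
    return "Green"
```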
The output schema captures: mood stability (0–100), anxiety markers (0–100), sentiment_valence (−1.0 to +1.0), three clinical markers (anhedonia, fatigue, sleep_disturbance), ten cognitive distortion categories from Beck's CBT taxonomy [19], key theme keywords, dual confidence scores, risk_level, and a supportive recommendation not exceeding twenty words. Distortions are identified through implicit LLM semantic reasoning rather than keyword matching.
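A representative instance of the structured output is shown below. The key names are illustrative approximations of the quantities listed above, not the verbatim schema used in the system prompt.

```python
# Representative (not verbatim) shape of the structured NLP output.
example_nlp_output = {
    "mood_stability": 45,                     # 0-100
    "anxiety_markers": 62,                    # 0-100
    "sentiment_valence": -0.35,               # -1.0 to +1.0
    "anhedonia": 55,
    "fatigue": 60,
    "sleep_disturbance": 70,
    "cognitive_distortions": ["catastrophizing", "overgeneralization"],
    "key_themes": ["exams", "sleep", "isolation"],
    "analysis_confidence": 0.82,
    "distortion_confidence": 0.74,
    "risk_level": "Yellow",
    "recommendation": "Try a short walk and a fixed bedtime tonight.",
}
```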
Facial Emotion Analysis
Facial images are captured via the browser MediaDevices API at 480×360 pixels, base64-encoded as JPEG at 0.85 quality, and transmitted in the HTTP request body. LLaMA 3.2-11B Vision processes each image at temperature 0.2, returning probability scores for eight emotion dimensions (Happy, Sad, Angry, Fearful, Disgusted, Surprised, Neutral, Contempt) on a 0–100 scale, a primary emotion label, facial valence (−1.0 to +1.0), arousal (0–1.0), detection confidence, and micro-expression indicators. Facial valence is mapped from the primary emotion using Russell's circumplex model [20]: Happy +0.9, Surprised +0.3, Neutral 0.0, Contempt −0.4, Fearful −0.6, Disgusted −0.5, Sad −0.7, Angry −0.8.
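This mapping reduces to a fixed lookup table, sketched below with the valence values as stated (function and constant names are ours).

```python
# Primary emotion -> facial valence, following Russell's circumplex mapping above.
EMOTION_VALENCE = {
    "Happy": 0.9, "Surprised": 0.3, "Neutral": 0.0, "Contempt": -0.4,
    "Fearful": -0.6, "Disgusted": -0.5, "Sad": -0.7, "Angry": -0.8,
}

def facial_valence(primary_emotion: str) -> float:
    """Returns the circumplex valence for a detected primary emotion (default 0.0)."""
    return EMOTION_VALENCE.get(primary_emotion, 0.0)
```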
Three-Channel Discrepancy Engine
The Discrepancy Engine operationalizes emotional masking detection through three independent computations. An AI-derived mood score is computed from NLP outputs:
ai_mood = (mood_stability × 0.6) + ((valence + 1) × 0.4 × 50)    (1)

D_journal = |self_reported_mood − ai_mood|    (2)

D_face = |valence_face − valence_text| × 100    (3)
Emotional masking is flagged when D_journal ≥ 30 OR D_face ≥ 30. This threshold was established through pilot calibration with 15 participants across 45 sessions, targeting sensitivity ≥ 75% while maintaining specificity ≥ 70%, consistent with clinical discrepancy ranges identified by Bonanno et al. [18]. Severity is stratified as: Low (0–14), Moderate (15–29), High (30–49), and Very High (≥ 50).
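Equations (1)-(3) and the severity bands translate directly into code. In the sketch below, function and variable names are ours, and the choice to derive severity from the larger of the two discrepancy scores is an assumption not stated explicitly in the text.

```python
def discrepancy_scores(self_reported_mood: float, mood_stability: float,
                       valence_text: float, valence_face: float) -> dict:
    """Three-Channel Discrepancy Engine following Eqs. (1)-(3)."""
    # Eq. (1): AI-derived mood on a 0-100 scale from the NLP outputs.
    ai_mood = mood_stability * 0.6 + (valence_text + 1) * 0.4 * 50
    # Eq. (2): self-report vs. NLP-derived mood.
    d_journal = abs(self_reported_mood - ai_mood)
    # Eq. (3): facial vs. textual valence, rescaled to 0-100.
    d_face = abs(valence_face - valence_text) * 100

    masking = d_journal >= 30 or d_face >= 30        # 30-point flagging threshold
    worst = max(d_journal, d_face)                   # assumed basis for severity
    if worst >= 50:
        severity = "Very High"
    elif worst >= 30:
        severity = "High"
    elif worst >= 15:
        severity = "Moderate"
    else:
        severity = "Low"
    return {"d_journal": d_journal, "d_face": d_face,
            "masking_flag": masking, "severity": severity}
```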
Synthesis and Temporal Alert Logic
A third Groq API call submits both JSON outputs to LLaMA 3.1-70B with a synthesis prompt, returning a congruence_score (0–100), congruence_level (High/Moderate/Low), masking_type (Emotional Suppression/Overreporting/Genuine Alignment/Inconclusive), combined_risk, and a unified psychological insight. Temporal alert logic fires when three or more consecutive Yellow or Red entries occur within any rolling seven-session window, generating a counseling referral prompt.
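The temporal rule can be sketched as a simple scan over the most recent sessions; the function below assumes risk labels are stored in chronological order and is checked after each new entry.

```python
def should_alert(risk_history: list[str], window: int = 7, run_len: int = 3) -> bool:
    """True if >= run_len consecutive Yellow/Red entries occur within the latest
    `window` sessions (evaluated after every new session, so earlier windows
    have already been checked)."""
    run = 0
    for level in risk_history[-window:]:
        run = run + 1 if level in ("Yellow", "Red") else 0
        if run >= run_len:
            return True
    return False
```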
Privacy and Ethical Design
User identifiers are SHA-256 hashed (truncated to 16 hex characters) prior to any database write. Face images are processed in-memory only and never written to disk. The SQLite database is local with no cloud synchronisation. A persistent disclaimer, "NOT A MEDICAL TOOL - Research and Monitoring Use Only", appears on every application screen. Institutional ethical clearance was obtained from the VVPIET academic committee, and all participants provided written informed consent prior to participation.
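Identifier pseudonymisation reduces to a single hashing step; a minimal sketch follows (the function name is ours).

```python
import hashlib

def hash_user_id(user_id: str) -> str:
    """SHA-256 hash truncated to 16 hex characters, applied before any DB write."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()[:16]

# hash_user_id("student_042") always yields the same 16-character pseudonymous key,
# so longitudinal records can be linked without storing the raw identifier.
```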
EXPERIMENTAL RESULTS
Experimental Setup
A pilot evaluation was conducted with 25 student volunteers (17 male, 8 female; mean age 20.8 ± 1.2 years) from the final-year undergraduate cohort at VVPIET, Solapur, during April 2025. Each participant completed three journal sessions across three consecutive days, yielding 75 total analysis sessions. Journal entries averaged 142 ± 38 words. Ground-truth risk levels were established by administering the PHQ-9 immediately following each session, mapped to Green (PHQ-9: 0–4), Yellow (5–14), and Red (≥ 15). Facial emotion ground truth was established through immediate post-capture self-labeling by each participant.
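The PHQ-9 banding used for ground truth is a straightforward mapping, sketched below (the function name is ours).

```python
def phq9_to_risk(phq9_total: int) -> str:
    """Maps a PHQ-9 total score (0-27) to the study's ground-truth risk bands."""
    if phq9_total <= 4:
        return "Green"
    if phq9_total <= 14:
        return "Yellow"
    return "Red"   # PHQ-9 >= 15
```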
NLP Risk Classification Performance
Table I presents classification performance against PHQ-9-derived ground truth across all 75 sessions. Overall accuracy was 84.0%. Red risk recall of 77.8% represents the most safety-critical metric. Lower Yellow precision (72.7%) reflects the inherently ambiguous nature of moderate distress, a documented challenge in computational mental health assessment [5].
TABLE I
NLP Risk Classification Performance (N = 75)
Risk    | Prec. (%) | Rec. (%) | F1 (%) | N
Green   | 91.3      | 87.5     | 89.4   | 32
Yellow  | 72.7      | 80.0     | 76.2   | 25
Red     | 87.5      | 77.8     | 82.4   | 18
Overall | 84.0 (accuracy)       |        | 75
Facial Expression Analysis Performance
The vision module was evaluated on 68 valid face captures (7 sessions excluded due to lighting failure; 9.3% non-detection rate). Primary emotion detection achieved 71.4% accuracy against self-labeled ground truth, consistent with published zero-shot LVLM benchmarks [13]. Valence correlation with self-reported mood reached r = 0.61 (p < 0.001). Mean vision inference time was 3.1 ± 0.7 seconds.
Discrepancy Engine Validation
Emotional masking was identified in 19 of 75 sessions (25.3%). Expert review confirmed 15 of 19 flagged cases (78.9%) as genuine incongruence. The 30-point threshold achieved sensitivity 78.6% and specificity 91.1%. Table II shows the discrepancy score distribution.
TABLE II
Discrepancy Score Distribution (N = 75)
Level    | Range | N (%)    | Confirmed
Low      | 0–14  | 39 (52%) | 0 (0%)
Moderate | 15–29 | 17 (23%) | 4 (24%)
High     | 30–49 | 13 (17%) | 11 (85%)
V. High  | ≥ 50  | 6 (8%)   | 6 (100%)
System Latency
Table III reports latency averaged across 75 sessions on a standard laptop (11th-Gen Intel Core i5, 8 GB RAM, 50 Mbps internet connection). Total combined analysis averaged 6.6 ± 1.1 seconds, rated acceptable by participants (mean usability score 4.1/5.0).
TABLE III
End-to-End Latency (N = 75)
Component           | Mean (s) | SD (s)
LLaMA 3.1-70B NLP   | 1.8      | 0.4
LLaMA 3.2 Vision    | 3.1      | 0.7
Discrepancy Engine  | <0.01    | <0.001
LLaMA 3.1 Synthesis | 1.6      | 0.3
React UI Render     | 0.09     | 0.02
Total               | 6.6      | 1.1
DISCUSSION
Multi-Modal Advantage
The principal finding is that three-channel discrepancy analysis surfaces emotional information unavailable from any single channel. In 11 NLP-Green sessions, facial analysis detected a Sad or Fearful primary emotion (valence < −0.4); expert review confirmed 8 of these as genuine masking, demonstrating the practical value of cross-modal triangulation. Conversely, in 6 NLP-Yellow sessions, facial analysis indicated Neutral or Happy affect; expert review suggested these reflected a habitual negative journaling style rather than acute distress. The synthesis module correctly classified 5 of the 6 as Moderate Congruence with masking_type Overreporting, appropriately tempering the text-only risk rating.
LLM-Based Vision for FER
While specialized CNN architectures achieve higher accuracy on controlled FER benchmarks, LLaMA 3.2 Vision offers two practical advantages: (i) zero-shot generalization across the demographic diversity present in Indian student populations without fine-tuning, and (ii) natural language outputs that feed directly into the synthesis prompt without format conversion, enabling end-to-end cross-modal reasoning.
Limitations
The pilot sample (N = 25) is drawn from a single institution with limited demographic diversity. Masking ground truth relied on research assistant review rather than a licensed clinical psychologist. The discrepancy threshold requires validation on larger annotated datasets. Facial analysis degraded in 9.3% of sessions due to lighting or occlusion. Internet connectivity is required for Groq API inference.
FUTURE WORK
Future extensions include: (i) integrating a locally deployable CNN-based FER model (e.g., EfficientNet fine-tuned on AffectNet [12]) to eliminate internet dependency; (ii) adding prosodic voice analysis as a fourth channel; (iii) conducting a longitudinal study with a clinically characterized sample using validated instruments as ground truth; (iv) fine-tuning LLaMA on annotated mental health journal entries to improve NLP accuracy; and (v) multilingual extension supporting Hindi and Marathi for broader accessibility within India.
VII. CONCLUSION
This paper presented MindScope, a multi-modal mental wellness monitoring system developed as an undergraduate final-year project at VVPIET, Solapur. The system combines Groq-accelerated LLaMA 3.1-70B NLP journal analysis with LLaMA 3.2 Vision facial emotion recognition to produce a comprehensive, structured wellness indicator profile. The Three-Channel Discrepancy Engine operationalizes the clinically grounded concept of emotional masking by quantitatively comparing self-reported mood, text-derived sentiment valence, and AI-detected facial valence. A pilot evaluation with 25 participants across 75 sessions demonstrated 84.0% risk classification accuracy, 71.4% facial emotion recognition accuracy, 78.6% sensitivity for emotional masking detection, and 6.6-second end-to-end latency. MindScope contributes both a deployable full-stack architecture and a novel quantitative discrepancy framework that may serve as a foundation for future multi-modal affective wellness research, particularly within the Indian student population context.
ACKNOWLEDGMENT
The author thanks the faculty and students of the Department of Artificial Intelligence and Data Science Engineering, Vidya Vikas Pratishthan Institute of Engineering and Technology, Solapur, for their support and participation in the pilot evaluation. The author also acknowledges Groq Inc. for providing API access that made real-time LLM inference feasible for this project.
REFERENCES
[1] World Health Organization, World Mental Health Report: Transforming Mental Health for All. Geneva: WHO Press, 2022.
[2] V. Patel et al., "The Lancet Commission on global mental health and sustainable development," Lancet, vol. 392, no. 10157, pp. 1553–1598, 2018.
[3] World Health Organization, "COVID-19 pandemic triggers 25% increase in prevalence of anxiety and depression worldwide," WHO Press Release, Mar. 2022.
[4] J. J. Gross and O. P. John, "Individual differences in two emotion regulation processes," J. Pers. Soc. Psychol., vol. 85, no. 2, pp. 348–362, 2003.
[5] K. Yang et al., "Towards interpretable mental health analysis with large language models," in Proc. EMNLP, Singapore, Dec. 2023, pp. 6056–6077.
[6] A. V. Savchenko, "Facial expression and attributes recognition based on multi-task learning of lightweight neural networks," in Proc. 16th IEEE ISPA, 2022, pp. 563–568.
[7] M. De Choudhury, M. Gamon, S. Counts, and E. Horvitz, "Predicting depression via social media," in Proc. ICWSM, Cambridge, MA, Jul. 2013.
[8] G. Coppersmith, M. Dredze, C. Harman, and K. Hollingshead, "From ADHD to SAD: Analyzing the language of mental health on Twitter," in Proc. NAACL Workshop Comput. Linguist. Mental Health, Jun. 2015, pp. 1–10.
[9] S. Ji et al., "Suicidal ideation detection: A review of machine learning methods and applications," IEEE Trans. Comput. Soc. Syst., vol. 9, no. 1, pp. 1–15, Feb. 2022.
[10] K. Yang et al., "MentaLLaMA: Interpretable mental health analysis on social media with large language models," in Proc. ACL, Bangkok, Aug. 2024.
[11] P. Ekman and W. V. Friesen, Facial Action Coding System. Palo Alto, CA: Consulting Psychologists Press, 1978.
[12] A. Mollahosseini, B. Hasani, and M. H. Mahoor, "AffectNet: A database for facial expression, valence, and arousal," IEEE Trans. Affect. Comput., vol. 10, no. 1, pp. 18–31, 2019.
[13] A. V. Savchenko, "EMOTIC+: Improved face-only emotion recognition using vision transformers," Pattern Recognit. Lett., vol. 159, pp. 37–44, 2022.
[14] S. Poria, E. Cambria, R. Bajpai, and A. Hussain, "A review of affective computing: From unimodal analysis to multimodal fusion," Inf. Fusion, vol. 37, pp. 98–125, 2017.
[15] B. Schuller et al., "The AVEC 2014 workshop and challenge," in Proc. ACM MM 2014 Workshop AVEC, Orlando, FL, Nov. 2014, pp. 3–10.
[16] A. Radford et al., "Learning transferable visual models from natural language supervision," in Proc. ICML, Virtual, Jul. 2021, pp. 8748–8763.
[17] A. Singh et al., "FLAVA: A foundational language and vision alignment model," in Proc. CVPR, New Orleans, LA, Jun. 2022.
[18] G. A. Bonanno, A. Papa, K. Lalande, M. Westphal, and K. Coifman, "The importance of being flexible," Psychol. Sci., vol. 15, no. 7, pp. 482–487, 2004.
[19] A. T. Beck, Cognitive Therapy and the Emotional Disorders. New York: International Universities Press, 1976.
[20] J. A. Russell, "A circumplex model of affect," J. Pers. Soc. Psychol., vol. 39, no. 6, pp. 1161–1178, 1980.
