Design and Implementation of a Hybrid Multimodal Deep Learning Model for Mental Health Detection with Adaptive Conversational Support

Prof. Jaya Nag Mathur; Prof. Chetana Shravage; Anjali Sinkar; Joya Tamboli; Shraddha Satpute; Girija Raskar

doi:10.5281/zenodo.20846605

Volume 15, Issue 06 (June 2026)

Design and Implementation of a Hybrid Multimodal Deep Learning Model for Mental Health Detection with Adaptive Conversational Support

DOI : 10.5281/zenodo.20846605

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 8
Authors : Prof. Jaya Nag Mathur, Prof. Chetana Shravage, Anjali Sinkar, Joya Tamboli, Shraddha Satpute, Girija Raskar
Paper ID : IJERTV15IS060953
Volume & Issue : Volume 15, Issue 06 , June – 2026
Published (First Online): 25-06-2026
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

Design and Implementation of a Hybrid Multimodal Deep Learning Model for Mental Health Detection with Adaptive Conversational Support

Prof. Jaya Nag Mathur, Prof. Chetana Shravage, Anjali Sinkar

Dept. Articial Intelligence & Data Science, Dr. D. Y. Patil Institute of Technology Pimpri, India

Joya Tamboli, Shraddha Satpute, Girija Raskar

Dept. Articial Intelligence & Data Science, Dr. D. Y. Patil Institute of Technology Pimpri, India

Abstract – Its increasingly common worldwide to see mental health issues like stress, anxiety, and depression. A lot of people dealing with these problems dont get diagnosed because they cant easily access help or because of the social stigma involved. But now, with progress in AI and technology that can recognize emotions, we can build systems that watch how users are feeling by looking at different kinds of data. One promising idea is to use this tech to spot people who might be struggling with their mental health and offer them help sooner than we usually can. This project aims to create a special kind of system called a Hybrid Multimodal Deep Learning Architecture for Detecting Mental Health and Providing Adaptive Conversational Support. This system will bring together different technologies, like analyzing text, recognizing facial expressions, and using psychological questionnaires. It will be using various ML and deep learning ways, such as Logistic Regression, SVMs, LSTM, BERT Transformers, and CNNs. The main idea is to blend these technologies using these algorithms to create a thorough way to gure out mental health indicators from what users put into the system. The text-based parts will use NLP. Facial emotion recognition will use DL models which are trained on FER2013 dataset. For structured mental health answers, it will use a questionnaire similar to the DASS-21. The information from each part will be combined using a multimodal fusion technique to gure out the overall risk level for mental health. Once a persons emotional state is known, a smart chatbot will offer support and help them reach their desired emotional state. The system also includes a way to detect suicide risk.Also a chat bot running live, to guide and detect the face and eye movements to identify different mood shifts while you are typing. The chat-bot also give some guided replies when it detects how user feels and makes the chats feel warmer, and calmer for user. If red ags such as deep despair or talk of suicide pop up then it instantly displays hotlines, emergency numbers and expert resources without any delay

Index TermsMental Health Detection, Multimodal Deep Learning, Emotion Recognition, Conversational AI, BERT, CNN, Affective Computing.

Introduction

Mental health issues are very prevalent as of late world- wide – this includes mental illnesses as anxiety, stress, and depression that can also greatly hinder a persons emotional quality of life and daily functioning. The World Health Organization (WHO) stated that nearly 1 in 8 individuals worldwide is living with a mental health problem [2], [20]. Early detection and intervention are key in reducing both the severity and long-term effects of mental illnesses. However, individuals may be reluctant to seek help because of stigma, a lack of understanding about what mental health is and what professionals do, limited access to care [17], [27]. Due to recent advancements in AI and ML, there are now intelligent systems that can detect mental illness based on behavioral and emotional information. Through the use of natural language processing (NLP), these systems are able to analyze different types of text-based information (e.g.speech from conversa- tions) to determine if there is a linguistic pattern that would be suggestive of depression or anxiety [3],[6].Affective comput- ing techniques are able to identify the feelings of individuals (based on their facial features or actions) as demonstrated by systems with deep learning methods, such as CNNs [4], [23]. Furthermore, transformer-based architectures, like BERT, enhance an understanding of language in context, which aids in providing a more accurate determination of the emotion behind the text provided as input [11], [40]. In addition to emotional recognition abilities, AI-enabled conversational agents are well-positioned to assist individuals in managing their mental health.Today, newer therapy chatbots utilize state- of-the-art NLP methods and hybrid conversational structures to offer customized support and direction to users [12], [13], [15].

Literature Review

Articial intelligence (AI) is being increasingly employed for analyzing behavioral and emotional data by employing computer-based techniques for the detection of mental health conditions as stress, anxiety, and depression. The primary focus of the initial studies has been to use machine learning techniques for analyzing textual and structured psychological data. Researchers such as Chaturvedi et al. [1] and Kumar and Singh [3] have attempted to use supervised algorithms such as SVM and the Random Forest for the classication of depression symptoms by using datasets such as DASS-21 and PHQ-9.

in [8], [12] and Han et al. in [9] have developed a Multimodal Framework that utilizes deep learning techniques to blend different emotional data inputs and attain an accuracy rate of 92% to 95%. This implies that a deeper understanding of human emotions can be achieved by integrating different data inputs.

TABLE I

Comparative Analysis of AI Approaches Used in Mental Health

k

Detection

35%

25%

15%

25%

Approach	Data Modal- ity	Model Type	Benchmar Dataset	Accuracy / F1 (%)
SVM /	Text	Traditional	DASS-	7078
Random	(survey	Machine	21,
Forest [3], [21]	re- sponses)	Learning	PHQ-9
CNN-	Facial	Deep	FER-	8287
Based	Images	Learning	2013,
Emotion			Affect-
Recogni-			Net
<>tion [7],
[25],
[27]
RNN /	Text /	Sequential	Reddit,	8488
LSTM	Audio	Deep	IEMO-
Models [10], [30]	Conver- sations	Learning	CAP
Transform	Dialogue	NLP	Twitter,	8993
Models	/ Contex-	Trans-	Woebot
(BERT,	tual Text	former	Logs
GPT)		Architec-
[14],		ture
[17],
[18]
Multimod	Text +	Hybrid	AffectNet,	9295
Fusion	Facial +	Deep	DAIC-
Systems [8], [12], [29]	Audio	Learn- ing	WOZ

Text (35%)

Facial Image (25%)

Audio (15%)

Multimodal (25%)

Fig. 1. Contribution of Different Modalities

er

These traditional machine learning models had acceptable accuracy rates, ranging from 70% to 78%, but since they relied on human-designed features, they had trouble iden- tifying complex emotional patterns. Because social media sites provide a wealth of textual data, researchers may now use NLP techniques to identify mental health disorders. By analyzing language patterns taken from online exchanges, Benton et al. [6] demonstrated how multi-task learning models may be used to predict psychological disorders. In the area of sentiment analysis and emotion recognition in texts, the accuracy has been improved with the introduction of BERT and GPT transformer-based architectures that have greatly improved the context understanding of texts [14]. Another important area of research in the application of emotional computing algorithms is Facial Emotion Recognition using DL algorithms. The accuracy of these systems has been greatly improved compared to the earlier systems that relied on the use of visual characteristics. When it comes to the recognition of emotional states from face photos, Convolutional Neural Networks (CNN) with the help of datasets like FER-2013 and AffectNet have achieved an accuracy of 82% to 87% [7], [25], [27]. Emotional signals from speech or textual interactions have also been examined by using sequential deep learning methods like RNN and LSTM networks. When applied to data sets like Reddit or IEMOCAP, these methods have shown that temporal trends in emotional data can be recognized with an accuracy between 84% and 88% [10], [30]. However, recent studies have focused more on Multimodal Emotion Recognition, which involves the integration of different data inputs such as text, voice, and facial expressions. Tzirakis et al.

al

As mentioned earlier, recent studies have shown that there is a shift from the traditional ML techniques towards DL and multimodal techniques for mental health identication. Though traditional techniques have shown good results in mental health identication, the use of multimodal techniques for data detection has shown better results.

Methodology

The suggested system is anticipated to employ a Hybrid Multimodal Deep Learning Framework for Detecting Mental Health Disorders and Versatile Conversation Support. The suggested system employs text data evaluation, facial emo- tion evaluation, and psychological evaluation assessment for possible psychological health assessment.
1. Data Collection
  
  The system makes use of two public data sets and real-time user data.
  1. Reddit Depression Dataset: The dataset consists of textual posts made by people who are depressed as well as those who
    
    are not. The dataset is used to build a natural language pro- cessing algorithm that recognizes depressed language patterns.
  2. FER2013 Facial Emotion Dataset: This dataset includes seven emotional classes: angry, disgusted, afraid, joyous, neu- tral, sad, and startled. The dataset is then being used to build a DL model that recognizes face expressions.
  3. Questionnaire-Based Input: The DASS-21, is the basis for a structured psychiatric questionnaire. Included are questions about emotional condition and stress.
2. Data Preprocessing and Classication
  
  The users answers and chats are handled via Natural Language Processing (NLP). The pre-processing of the data includes the following steps: Cleaning and normalizing text
  
  Tokenization
  - Handling Stop Words Extraction of TF-IDF Features For text classication, a number of DL and ML approaches
  have been used. The following algorithms are among the methods used for the project: Logistic Regression , SVMs LSTM The Transformer Model, or BERT
  
  Of the algorithms used, the BERT algorithm offers the most contextual information.
3. Facial Emotion Recognition
  
  The FER2013 dataset is used to identify face expressions using deep learning algorithms. These methods are as follows:
  1. Haar Cascade Classiers for face detection.
  2. Preprocessing and normalization of images .
  3. Convolutional neural networks for the extraction of features .
  4. Convolutional Neural Networks and MobileNet for Emo- tion Classication .
4. Questionnaire-Based Psychological Screening
  
  In order to determine the symptoms associated with anxiety, stress, and depression, the questionnaire module analyzes user replies.
  
  The structured psychological assessment complements the machine learning algorithms predictions.
5. Multimodal Fusion Mechanism
  
  Weighted decision fusion is utilized in improving the reli- ability of the results from text prediction, results from face expression detection, and results from the questionnaire score. This is how the nal mental health risk score is determined: 0.5
  
  × Text Prediction Score 0.3 × Emotion Detection Scoe 0.2 × Questionnaire Score The results from mental health detection can be improved by combining multiple emotional signals at any given time.
6. Adaptive Conversational Chatbot
  
  Based on the identied level of mental health risk, an AI-powered conversational chatbot engages with the user. It offers encouraging reactions and emotional support, based on detected emotional state of a user. The chatbot uses a number of combination :
  - Rules-based conversational responses
  - Natural language processing for intent detection
  The adaptive conversational aid helps users communicate their emotions.
7. Report Generation
Following this evaluation procedure, the system creates a mental health screening report based on the following:
- An overview of the questionnaire replies
- A facial expression of emotion .
- Risk level for mental health .
- Interventions or actions to be performed .
Text Analysis

SVM / BERT

Questionnaire

Risk Scoring

Chatbot Support

+ Helpline Alert

Mental Health Risk Prediction

Multimodal Fusion

Emotion Model

CNN (FER2013)

Data Processing

Text & Image Prep

User Input Text / Face / Questionnaire

A users mental health state is summarized in the mental health screening report.

Fig. 2. Proposed Multimodal Mental Health Detection Framework

Results and Discussion

The articial intelligence techniques used for mental health detection are included in the experimental review. These techniques include machine learning, deep learning, and mul- timodal fusion. The effectiveness of these techniques is eval- uated using their capacity to identify emotional patterns and mental health indicators from text, conversations, and facial expressions.

The structured textual data and questionnaire responses were initially passed through conventional machine learning techniques as Logistic Regression, Random Forest, and SVM. These techniques are largely based on manually crafted elements like linguistic metadata and TF-IDF. Previous studies have indicated that the accuracy of the techniques for survey- based datasets like DASS-21 and PHQ-9 ranges from 70% to 78% [3], [21]. These techniques are easy to understand

TABLE II

System Components and Algorithms Used in the Proposed Mental Health Detection System

Traditional ML (15%)

CNN (20%)

LSTM/RNN (18%)

BERT (22%)

Multimodal (25%)

20%

18%

15%

22%

25%

System Component	Algorithm / Model Used	Dataset / Input Source	Purpose
Text Preprocessing	+ TF-IDF, Tokenization, Stop-word Removal	Reddit Depression Dataset	Convert textual responses into numerical features for classication
Text Classication	Support Vector Machine (SVM), BERT Transformer	Reddit Depression Dataset	Detect depressive language patterns from user text input
Facial Emotion Recognition	Convolutional Neural Network (CNN)	FER2013 Dataset	Identify emotional states from facial expressions
Psychological Screening	Rule-based scoring (DASS-21 inspired)	User Questionnaire	Evaluate user mental health through structured responses
Multimodal Fusion	Weighted Decision Fusion	Text + Facial + Questionnaire Data	Combine multiple emotional signals for nal risk prediction
Conversational Support	Rule-based Chatbot with NLP intent detection	User chat input	Provide emotional guidance and mental health support
Suicide Risk Detection	Keyword-based detection + risk scoring	User text responses	Identify critical mental health signals and recommend helpline assistance
Report Generation	Automated summary module	Model outputs	Generate mental health screening report

Fig. 3. Model Contribution in Mental Health Detection

TABLE III

Recog-

Model / Com- po- nent	Dataset Used	Accuracy (%)	Precision	Recall	F1-Score
Logistic	Reddit	74.2	0.73	0.72	0.72
Re-	De-
gres-	pres-
sion	sion
(Base- line)	Dataset
Support	Reddit	81.5	0.80	0.81	0.80
Vector	De-
Ma-	pres-
chine	sion
(SVM)	Dataset
BERT	Reddit	90.3	0.90	0.89	0.89
Trans-	De-
former	pres-
	sion
	Dataset
CNN	FER201	85.6	0.84	0.85	0.84
Emo-	Dataset
tion

nition
Propose	Text +	93.1	0.92	0.91	0.91
Multi-	Facial		0.92	0.91	0.91
Multi-	Facial
modal	Emo-
Fu-	tion +
sion	Ques-
Model	tion- naire

Performance Evaluation of Different Models for Mental Health Detection

3

d

and computationally inexpensive, but they are ineffective in identifying the deeper emotional content of the text.

Deep learning has made major strides in the recognition of emotions. CNNs are trained using the FER-2013 dataset to increase emotion recognition accuracy. CNN uses the sup- plied image to automatically identify the persons emotional state. Research has demonstrated that CNNs face expression detection accuracy ranges from 82% to 87% [7], [25], and

[27].

Furthermore, by taking use of temporal patterns of emo- tional signals from speech and text-based discussions, sequen- tial DL techniques such as RNN and LSTM networks can be employed to further improve the systems performance. Applying these methods to conversational datasets such as Reddit and IEMOCAP has demonstrated 84% to 88% accuracy [10], [30].

Fig. 4. Dashboard

Fig. 5. Chatbot

Fig. 9. Guided Breathing

Fig. 6. Music for relaxation

Fig. 7. Statistics

Fig. 8. Emotion check through face detection

Fig. 10. Report

Conclusion and Future Scope

Although there has been a lot of advancement in the use of AI to identify mental health problems, there are still a number of important issues that need to be resolved before these technologies can be widely used or trusted ethically. The current models repeatedly show high accuracy, their assessment is usually based on few data sources, such as the DAIC-WOZ dataset, PHQ-9 questionnaires, and Reddit posts.Because these data sources lack representational diver- sity, the outcomes might not be fully relevant to people from diverse cultural or geographic origins. To ensure the guarantee unbiased and reliable diagnoses for various groups, future studies should focus on building large datasets that include a variety of languages, balanced demographic samples, and di- verse populations. Even though these integrated signal systems have the potential to lead the eld, the current top models, their high computational requirements make them impractical for edge based applications on smartwatches or mobile devices .

Compact transformer architectures present a potential remedy since they can increase processing speed without sacricing efcacy when paired with knowledge distillation techniques. These systems scale effectively by combining cloud-based support with on-device computing, which makes it easier to expand the use of continuous mental health monitoring. One major issue with AI is that its decision-making process is frequently opaque. It can be challenging for therapists to track the processes that lead to conclusions because many deep learning models function as black boxes. A little number of tools can nowadays can grasp emotions perfectly, which is showing a gentle way that puts the care rst and then offers that personalized individualized advice before that issues becomes worse. These are the systems that can be converted into the digital health platforms, devices that we can wear on tness trackers, or our mobile phones which help the people maintain everyday well-being. To make that technology enhance the capability to improve life and rather than replac- ing the human clinical advices, future advancements should combine the computer scientists, mental health therapists, and specialized doctors.

References

S. S. Chaturvedi, R. K. Tiwari, and M. Singh, A Survey on AI Techniques for Mental Health Detection, IEEE Access, vol. 9, pp. 1234512358, 2021.
World Health Organization, Depression and Other Common Mental Disorders: Global Health Estimates, WHO, 2017.
P. Kumar and R. Singh, Machine Learning Approaches for Depression Detection, IEEE Transactions on Affective Computing, vol. 11, no. 3,

pp. 456467, 2020.
J. Li, M. Chen, and H. Zhang, Facial Expression Recognition Based on Deep Learning for Mental Health Analysis, IEEE Access, vol. 10,

pp. 5632156330, 2022.
X. Zhao, L. Peng, and W. Zhang, Multimodal Emotion Recognition Us- ing Attention-Based Deep Learning, in Proc. IEEE Int. Conf. Affective Computing and Intelligent Interaction, 2022.
A. Benton, M. Mitchell, and D. Hovy, Multi-task Learning for Mental Health Prediction, in Proc. EACL, 2017.
A. Mollahosseini, B. Hasani, and M. H. Mahoor, AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild, IEEE Trans. Affective Computing, vol. 10, no. 1, pp. 1831, 2019.
P. Tzirakis, J. Zhang, and B. Schuller, End-to-End Speech and Facial Expression Emotion Recognition, IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 2, pp. 282293, 2020.
J. Han et al., Multimodal Emotion Recognition with Temporal Fusion Networks, IEEE Transactions on Multimedia, vol. 24, pp. 12361248, 2022.
J. Weizenbaum, ELIZAA Computer Program for the Study of Natural Language Communication, Communications of the ACM, vol. 9, no. 1,

pp. 3645, 1966.
T. Wolf et al., Transformers: State-of-the-Art Natural Language Pro- cessing, in Proc. EMNLP, 2020.
J. Fitzpatrick, A. Darcy, and M. Vierhile, Delivering Cognitive Behav- ioral Therapy via Chatbot, JMIR Mental Health, vol. 4, no. 2, 2017.
A. Inkster, K. Stillwell, and D. Jones, Machine Learning and Chatbot- Based Therapy, Frontiers in Digital Health, vol. 3, pp. 112, 2021.
J. Park et al., Emotion-Aware Conversational Agents for Mental Health, IEEE Access, vol. 10, pp. 102321102334, 2022.
A. S. Mahmood and R. Li, Hybrid Conversational Frameworks for Emotion-Adaptive Chatbots, IEEE Transactions on Human-Machine Systems, vol. 52, no. 6, pp. 12471258, 2023.
P. Ekman and W. V. Friesen, Facial Action Coding System. Consulting Psychologists Press, 1978.
A. M. Rahman, A. as, and V. B. Mendonc¸a, Machine Learning Techniques for Stress and Anxiety Detection: A Survey, IEEE Access, vol. 10, pp. 5602456037, 2022.
S. Lovibond and P. Lovibond, Manual for the Depression Anxiety Stress Scales (DASS-21), Psychology Foundation of Australia, 1995.
N. Patel et al., Privacy-Preserving Mental Health Chatbots Using Transformer Models, IEEE Trans. Affective Computing, 2023.
R. Kessler and T. U¨ stu¨n, The WHO World Mental Health Surveys. Cambridge University Press, 2008.
A. Ghosh et al., Cloud-Based Scalable Framework for Emotion-Aware AI Systems, IEEE Cloud Computing, vol. 9, no. 2, pp. 1827, 2022.
J. Deng et al., FER2013: A Benchmark Dataset for Facial Emotion Recognition, in Proc. IEEE CVPR Workshops, 2013.
J. Zhao et al., Facial Emotion Recognition Based on CNN with Attention Mechanism, IEEE Access, vol. 8, pp. 4035740368, 2020.
T. Tzirakis et al., End-to-End Multimodal Emotion Recognition Using Deep Neural Networks, IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 13011309, 2017.
K. Krafka et al., Eye Tracking for Everyone, in Proc. IEEE CVPR,

pp. 21762184, 2016.
S. Hochreiter and J. Schmidhuber, Long Short-Term Memory, Neural Computation, vol. 9, no. 8, pp. 17351780, 1997.
K. R. Choudhary et al., Digital Psychiatry and AI: Challenges and Promise, Indian Journal of Psychological Medicine, 2023.
T. Ahmed et al., AI-Based Screening for Mental Disorders: Systematic Review, IEEE Journal of Biomedical and Health Informatics, vol. 27, no. 3, 2023.
UNESCO, Ethics of Articial Intelligence: Global Framework. UN- ESCO Publishing, 2021.
J. Deng et al., Cross-Cultural Emotion Recognition: A Survey, IEEE Access, vol. 9, 2021.
S. Mirsamadi et al., Automatic Speech Emotion Recognition Using Deep Recurrent Networks, in Proc. ICASSP, 2017.