DOI : https://doi.org/10.5281/zenodo.19471333
- Open Access

- Authors : M. Tejasree, Roqia Tabassum, S. Pooja, P. Srinidhi, R. Rakshiitha, S. Ram Teja
- Paper ID : IJERTV15IS031750
- Volume & Issue : Volume 15, Issue 03 , March – 2026
- Published (First Online): 08-04-2026
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
Emotion Based Music Recommendation System Using LSTM-CNN Architecture
M. Tejasree (1), Roqia Tabassum (2), S. Pooja (3), P. Srinidhi (4), R. Rakshiitha (5), S. Ram Teja (6)
(1,3,4,5,6) Computer Science Student
(2) Assistant Professor, Department of Computer Science and Engineering, Sphoorthy Engineering College, Hyderabad, India
Abstract: This paper presents an Emotion-Based Music Recommendation System using a hybrid LSTM-CNN architecture to recommend music based on the user's emotional state. The system detects human emotions from facial expressions and maps the detected emotions to suitable music playlists. Convolutional Neural Networks (CNN) are used for facial feature extraction, while Long Short-Term Memory (LSTM) networks are used to improve emotion classification by capturing temporal dependencies and emotional patterns. Unlike traditional music recommendation systems that rely on user listening history or ratings, the proposed system dynamically adapts recommendations based on real-time emotional analysis. The system is implemented as an offline web-based application using the Flask framework, with features such as user authentication, language-based song filtering, and local music streaming. A local database is used to store song metadata and user information, enabling efficient data retrieval and offline functionality. Experimental results demonstrate that the proposed system improves emotion detection accuracy and provides more relevant music recommendations, thereby enhancing user experience and personalization. The system highlights the integration of deep learning-based emotion recognition with intelligent music recommendation to create adaptive and user-centric entertainment systems.
Keywords – Emotion Detection, Music Recommendation, CNN, LSTM, Deep Learning, Personalization
-
Introduction
Emotion recognition has emerged as a significant research area in human-computer interaction, as understanding human emotions can greatly enhance system intelligence and user experience. Human emotions are commonly expressed through facial expressions, and psychological studies indicate that these expressions can be categorized into universal emotional states, forming the foundation for emotion-aware systems [1]. With the advancement of deep learning techniques, Convolutional Neural Networks (CNN) have proven highly effective in analyzing visual data, particularly for facial expression recognition. CNN models are capable of extracting meaningful spatial features from images, making them suitable for accurate emotion detection [2]. In addition, Long Short-Term Memory (LSTM) networks are widely used for capturing temporal dependencies in data, which further enhances the performance of emotion classification systems [3].
In the domain of music recommendation, traditional systems primarily rely on user preferences and listening history, which may not always reflect the user's current emotional state. Studies in music emotion recognition highlight the importance of aligning music recommendations with human emotions to improve personalization and user satisfaction [4].
Therefore, this work presents the development of an Emotion-Based Music Recommendation System that integrates CNN and LSTM techniques to detect user emotions and provide personalized music recommendations. The proposed system aims to improve user experience by offering adaptive and emotion-aware suggestions in an efficient and intelligent manner.
-
Literature Review
Music emotion recognition has been widely studied to understand the relationship between human emotions and musical preferences. Research shows that identifying emotional characteristics in music can significantly enhance user satisfaction by providing more relevant and personalized recommendations [4]. These studies form the foundation for emotion-aware music recommendation systems.
Recommender systems have evolved over time from traditional approaches such as collaborative filtering and content-based filtering to more advanced intelligent systems. The Recommender Systems Handbook highlights various techniques used to improve recommendation accuracy and personalization, but most systems still rely on historical user data rather than real-time emotional input [5].
With the advancement of deep learning, Convolutional Neural Networks (CNN) have demonstrated excellent performance in image processing tasks, including facial expression recognition. CNN models can automatically extract important features from images, making them highly suitable for emotion detection applications [2]. In addition, Long Short-Term Memory (LSTM) networks are effective in handling sequential data and capturing temporal dependencies, which helps improve the accuracy of emotion classification [3].
Furthermore, deep learning frameworks have enabled the integration of multiple techniques to build more robust and adaptive systems. Modern deep learning approaches provide the capability to combine visual and sequential data for better performance in real-time applications, including emotion-based recommendation systems [8].
-
Methodology
The methodology of the proposed Emotion-Based Music Recommendation System focuses on integrating emotion detection with intelligent music recommendation to provide a personalized user experience. The system is designed to operate in an offline environment using a web-based architecture built with the Flask framework.
Initially, the system performs user authentication through a login and registration module. User credentials are stored and validated using a local JSON-based database, ensuring secure and offline access. Once authenticated, the user is redirected to the main dashboard interface.
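A minimal sketch of how such a JSON-backed credential store might look, using only the Python standard library. The file layout, the function names `register_user` and `verify_user`, and the use of salted PBKDF2 hashes are assumptions made for illustration; the paper does not describe how credentials are protected on disk.

```python
import hashlib
import hmac
import json
import os
from pathlib import Path


def _hash_password(password: str, salt: bytes) -> str:
    # PBKDF2-HMAC-SHA256 from the standard library; the iteration count is an assumption.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000).hex()


def _load(db_path: Path) -> dict:
    # A missing file is treated as an empty user table.
    if not db_path.exists():
        return {}
    return json.loads(db_path.read_text(encoding="utf-8"))


def register_user(db_path: Path, username: str, password: str) -> bool:
    """Add a user; return False if the name is already taken."""
    users = _load(db_path)
    if username in users:
        return False
    salt = os.urandom(16)
    users[username] = {"salt": salt.hex(), "hash": _hash_password(password, salt)}
    db_path.write_text(json.dumps(users, indent=2), encoding="utf-8")
    return True


def verify_user(db_path: Path, username: str, password: str) -> bool:
    """Check a login attempt against the stored salted hash."""
    record = _load(db_path).get(username)
    if record is None:
        return False
    actual = _hash_password(password, bytes.fromhex(record["salt"]))
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(record["hash"], actual)
```

Storing only a salted hash means the JSON file never contains a plaintext password, which matters for a database that lives on the user's own disk.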
The system then requests camera access to capture real-time facial images. Upon permission, a single frame is captured and preprocessed to remove noise and enhance image quality. The processed image is forwarded to the emotion detection module. This module utilizes Convolutional Neural Networks (CNN) to extract spatial features from facial expressions for accurate emotion recognition [2]. Additionally, Long Short-Term Memory (LSTM) networks are incorporated to capture temporal dependencies and improve classification performance [3]. If deep learning models are unavailable, a heuristic-based fallback method is used to ensure continuous system functionality.
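The fallback behaviour can be sketched as a small selection function. The paper does not specify the heuristic, so `heuristic_detector` below is a placeholder that always reports Neutral; `choose_detector` and the `load_model` callable are likewise hypothetical names used only for this illustration.

```python
from typing import Callable, Dict, List, Optional

Probabilities = Dict[str, float]
Detector = Callable[[List[List[float]]], Probabilities]

EMOTIONS = ("Happy", "Sad", "Neutral", "Energetic")


def heuristic_detector(image: List[List[float]]) -> Probabilities:
    # Placeholder heuristic (an assumption): with no model available, report
    # Neutral with full confidence so the rest of the pipeline keeps working.
    return {name: (1.0 if name == "Neutral" else 0.0) for name in EMOTIONS}


def choose_detector(load_model: Callable[[], Optional[Detector]]) -> Detector:
    """Return the deep learning detector if it loads, else the heuristic."""
    try:
        model = load_model()
    except (ImportError, OSError):
        # For example, the deep learning library is not installed
        # or the weights file is missing.
        model = None
    return model if model is not None else heuristic_detector
```

Because both detectors share one call signature, the recommendation code does not need to know which of them produced the probabilities.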
The detected emotion is evaluated based on confidence scores. If the confidence level is below a predefined threshold, the system assigns a neutral emotion to maintain consistency in recommendations. This approach ensures robustness and avoids incorrect predictions.
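The thresholding rule can be illustrated in a few lines. The threshold value of 0.5 and the function name `resolve_emotion` are assumptions; the paper does not state the threshold it uses.

```python
from typing import Dict, Tuple


def resolve_emotion(probabilities: Dict[str, float], threshold: float = 0.5) -> Tuple[str, float]:
    """Pick the most probable emotion, falling back to Neutral below the threshold."""
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence < threshold:
        # Low-confidence prediction: treat the mood as Neutral.
        return "Neutral", confidence
    return label, confidence


print(resolve_emotion({"Happy": 0.82, "Sad": 0.05, "Neutral": 0.08, "Energetic": 0.05}))
# ('Happy', 0.82)
print(resolve_emotion({"Happy": 0.30, "Sad": 0.28, "Neutral": 0.22, "Energetic": 0.20}))
# ('Neutral', 0.3)
```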
Following emotion detection, the system retrieves song data from a local database such as SQLite or CSV files. The dataset contains attributes such as song title, artist, mood, language, and file path. Songs are filtered based on the detected emotion and user-selected language preferences. Music emotion recognition plays a crucial role in aligning songs with user mood to enhance satisfaction [4].
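For the SQLite case, the filtering step could look like the sketch below. The table name `songs`, the column names, and the sample rows are assumptions chosen to mirror the attributes listed above; they are not taken from the authors' database.

```python
import sqlite3
from typing import List, Tuple

SCHEMA = """
CREATE TABLE IF NOT EXISTS songs (
    id        INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    artist    TEXT NOT NULL,
    mood      TEXT NOT NULL,
    language  TEXT NOT NULL,
    file_path TEXT NOT NULL
)
"""


def songs_for(conn: sqlite3.Connection, mood: str, language: str) -> List[Tuple[str, str, str]]:
    """Return (title, artist, file_path) rows matching the mood and language."""
    query = (
        "SELECT title, artist, file_path FROM songs "
        "WHERE mood = ? AND language = ? ORDER BY title"
    )
    # Parameter placeholders keep user-selected values out of the SQL text.
    return conn.execute(query, (mood, language)).fetchall()


# Demo with an in-memory database and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO songs (title, artist, mood, language, file_path) VALUES (?, ?, ?, ?, ?)",
    [
        ("Song A", "Artist 1", "Happy", "English", "music/song_a.mp3"),
        ("Song B", "Artist 2", "Sad", "English", "music/song_b.mp3"),
        ("Song C", "Artist 3", "Happy", "Hindi", "music/song_c.mp3"),
    ],
)
print(songs_for(conn, "Happy", "English"))
# [('Song A', 'Artist 1', 'music/song_a.mp3')]
```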
The recommendation module processes the filtered data and ranks songs using content-based filtering techniques. These techniques ensure that the recommended songs closely match the detected emotional state of the user [5]. The final playlist is displayed on the user interface, allowing users to select and play songs directly from local storage.
The system also supports dynamic interaction, enabling users to change language preferences or request new recommendations. This iterative process ensures continuous adaptation and improved user experience.
Overall, the methodology integrates emotion detection, data processing, and recommendation techniques into a unified system. The use of deep learning models enhances accuracy and adaptability, making the system efficient and scalable for real-time applications [8].
-
System Architecture
The system architecture of the proposed Emotion-Based Music Recommendation System is designed to ensure efficient integration of emotion detection, data processing, and music recommendation modules. The architecture follows a modular and scalable design, enabling smooth interaction between components while maintaining offline functionality.
-
Overall Architecture Overview
The system consists of five major layers responsible for processing user input, detecting emotions, and generating personalized music recommendations.
- User Interface Layer: This layer provides interaction between the user and the system, including camera activation, image capture, language selection, and music playback.
- Data Acquisition and Preprocessing Layer: This layer captures facial images using a webcam and performs preprocessing such as resizing to 48×48 pixels, normalization, reshaping, and label encoding. The preprocessing is based on the FER2013 dataset, and the Disgust class is merged with Angry to handle class imbalance.
- Emotion Detection Layer: This layer uses a hybrid LSTM-CNN model to detect emotions from facial images. The model extracts spatial and temporal features and classifies emotions into four categories: Happy, Sad, Neutral, and Energetic.
- Recommendation Engine: Generates song recommendations based on detected emotion and language preference using emotion-based filtering and ranking.
- Database Layer: Stores user credentials, song metadata, and file paths for music playback using local storage such as SQLite, CSV, or JSON.
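The Disgust-to-Angry merge mentioned for the preprocessing layer can be sketched as a label remapping. The index order below is the standard FER2013 encoding; the function name and the choice to re-index the remaining classes to 0..5 are assumptions. How the merged FER2013 labels are then mapped onto the four output categories is not described in the paper, so that step is left out.

```python
from typing import Dict, List

# Standard FER2013 label indices.
FER2013_LABELS = {
    0: "Angry", 1: "Disgust", 2: "Fear", 3: "Happy",
    4: "Sad", 5: "Surprise", 6: "Neutral",
}

# Disgust (1) is folded into Angry (0); the other classes shift down by one.
MERGED_INDEX: Dict[int, int] = {0: 0, 1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5}


def merge_disgust_into_angry(labels: List[int]) -> List[int]:
    """Relabel Disgust as Angry and re-index the remaining classes to 0..5."""
    return [MERGED_INDEX[label] for label in labels]


print(merge_disgust_into_angry([0, 1, 2, 3, 4, 5, 6]))
# [0, 0, 1, 2, 3, 4, 5]
```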
-
-
Architecture Diagram
Figure 1 presents the overall activity diagram of the Emotion-Based Music Recommendation System. It shows how user input is captured via the web interface, analyzed by the LSTM-CNN emotion detection module, and processed by the recommendation engine along with language preferences. The personalized playlist is then displayed for playback, highlighting modularity, real-time processing, and offline functionality.
Fig. 1. System Architecture
-
User Interface Layer
The User Interface (UI) is developed as a web-based platform using the Flask framework. It provides functionalities such as camera selection, image capture, language selection, and music playback controls.
-
Data Acquisition Layer
This layer captures real-time user input through a webcam. The system requests camera access and captures facial images. The acquired images are preprocessed using techniques such as:
- Noise reduction
- Image resizing
- Normalization
This ensures that the input data is suitable for deep learning models.
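As an illustration of the resizing and normalization steps, the sketch below works on a grayscale image stored as nested Python lists. It is a dependency-free stand-in, not the authors' code: the paper lists OpenCV among its tools, which would normally perform these operations, and the nearest-neighbour sampling and function names here are assumptions. Noise reduction is not shown.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels, 0..255, row-major


def resize_nearest(image: Image, size: int = 48) -> Image:
    """Resize a grayscale image to size x size with nearest-neighbour sampling."""
    src_h, src_w = len(image), len(image[0])
    return [
        [image[row * src_h // size][col * src_w // size] for col in range(size)]
        for row in range(size)
    ]


def normalize(image: Image) -> List[List[float]]:
    """Scale pixel values from 0..255 to 0.0..1.0."""
    return [[pixel / 255.0 for pixel in row] for row in image]


def preprocess(image: Image, size: int = 48) -> List[List[float]]:
    """Resize to the 48x48 model input, then normalize."""
    return normalize(resize_nearest(image, size))
```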
-
-
Emotion Detection Layer
The Emotion Detection Layer forms the core of the system. It integrates deep learning models to classify user emotions:
- CNN (Convolutional Neural Network): Extracts spatial features from facial expressions [2].
- LSTM (Long Short-Term Memory): Captures temporal dependencies to improve classification accuracy [3].
Low-confidence predictions are assigned a neutral emotion to maintain reliability.
-
-
Recommendation Engine
The recommendation module processes the detected emotion and generates a personalized playlist by performing:
- Emotion-based filtering
- Language preference filtering
- Content-based ranking
This ensures that the recommended songs closely align with the user's current emotional state, enhancing personalization and satisfaction [4], [5].
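The three steps above can be sketched as a filter followed by a similarity sort. The paper does not say which song attributes drive the content-based ranking, so the per-song `features` vector (read here as valence and energy), the `EMOTION_PROFILE` targets, and the cosine-similarity scoring are all assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical (valence, energy) targets per emotion; not taken from the paper.
EMOTION_PROFILE: Dict[str, Tuple[float, float]] = {
    "Happy": (0.9, 0.7),
    "Sad": (0.2, 0.2),
    "Neutral": (0.5, 0.4),
    "Energetic": (0.7, 0.95),
}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_songs(songs: List[dict], emotion: str, language: str, top_k: int = 10) -> List[dict]:
    """Filter by mood and language, then order by similarity to the emotion profile."""
    target = EMOTION_PROFILE[emotion]
    candidates = [s for s in songs if s["mood"] == emotion and s["language"] == language]
    candidates.sort(key=lambda s: cosine(s["features"], target), reverse=True)
    return candidates[:top_k]
```

Filtering first keeps the ranking step cheap, since similarity is only computed for songs that already match the mood and language.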
-
-
Database Layer
The system uses a local database (SQLite/CSV/JSON) to store:
- User credentials
- Song metadata (title, artist, mood, language)
- File paths for music playback
Offline storage ensures fast data retrieval and uninterrupted system performance without internet dependency.
-
-
System Workflow
The overall workflow of the system is as follows:
- User logs into the system
- Webcam captures facial image
- Image is preprocessed
- CNN + LSTM models detect emotion
- Emotion is validated using confidence score
- Songs are filtered based on emotion and language
- Personalized playlist is displayed
- User interacts with recommendations
-
-
Key Features of the Architecture
- Modular and scalable design
- Real-time emotion detection
- Offline functionality
- Robust fallback mechanism
- Personalized recommendation engine
-
-
LSTM-CNN Architecture
The system uses a hybrid LSTM-CNN architecture for emotion detection from facial expressions. This architecture combines the strengths of Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks to improve emotion classification accuracy.
CNN is responsible for extracting spatial features from facial images, such as edges, textures, and facial landmarks. These features are then passed to the LSTM network, which captures temporal dependencies and sequential patterns in facial expressions. The combination of CNN and LSTM enables the system to learn both spatial and temporal characteristics, resulting in improved performance.
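The paper does not state how the CNN output for a single captured frame is arranged into a sequence for the LSTM. One common arrangement, assumed here purely for illustration, treats each row of the final feature map as one timestep, so a map of shape (height, width, channels) becomes `height` vectors of length `width * channels`. The function name is hypothetical.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [height][width][channels]


def feature_map_to_sequence(feature_map: FeatureMap) -> List[List[float]]:
    """Flatten each row of a CNN feature map into one timestep vector."""
    return [[value for cell in row for value in cell] for row in feature_map]


# A 6x6 map with 64 channels becomes 6 timesteps of 384 features each.
demo_map = [[[0.0] * 64 for _ in range(6)] for _ in range(6)]
sequence = feature_map_to_sequence(demo_map)
print(len(sequence), len(sequence[0]))
# 6 384
```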
-
Working of LSTM-CNN Model
The working of the model is described as follows:
- Input Image: The system captures a facial image or video frame as input.
- Convolution Layer: Applies filters to extract important features from the image.
- ReLU Activation: Introduces non-linearity using the function f(x) = max(0, x).
- Max Pooling: Reduces feature map size while retaining important information.
- Feature Extraction: Generates feature vectors representing facial expressions.
- LSTM Layer: Processes sequential information and captures temporal patterns.
- Fully Connected Layer: Combines extracted features for classification.
- Softmax Layer: Converts output into probability values.
- Emotion Output: The final emotion is predicted (Happy, Sad, Neutral, or Energetic).
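Three of the steps above (ReLU, max pooling, and softmax) are simple enough to write out directly. The sketch below uses plain Python lists to show what each operation computes; a 2×2 non-overlapping pooling window is an assumption, and a real model would rely on the corresponding TensorFlow/Keras layers instead.

```python
import math
from typing import List

Matrix = List[List[float]]


def relu(feature_map: Matrix) -> Matrix:
    """f(x) = max(0, x), applied element-wise."""
    return [[max(0.0, value) for value in row] for row in feature_map]


def max_pool_2x2(feature_map: Matrix) -> Matrix:
    """Keep the largest value in each non-overlapping 2x2 window."""
    rows = len(feature_map) // 2 * 2
    cols = len(feature_map[0]) // 2 * 2
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


fmap = [[-1.0, 2.0, 0.5, -3.0],
        [4.0, -2.0, 1.0, 0.0],
        [0.0, 0.0, -1.0, -1.0],
        [3.0, 1.0, -5.0, 2.0]]
print(max_pool_2x2(relu(fmap)))
# [[4.0, 1.0], [3.0, 2.0]]
```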
-
-
Model Flow
Input Image → CNN Feature Extraction → LSTM → Fully Connected Layer → Softmax → Emotion Output
-
-
Implementation
The Emotion-Based Music Recommendation System is implemented as an offline web-based application using the Flask framework. The system integrates facial emotion detection, music recommendation, and local music playback into a unified platform.
-
Tools and Technologies
The system is developed using Python, Flask, OpenCV, and TensorFlow/Keras, with SQLite/CSV/JSON for local data storage and HTML, CSS, and JavaScript for the user interface.
-
System Modules
The implementation consists of the following key modules:
- User Authentication: Manages login and registration using a local JSON database.
- Image Capture and Preprocessing: Captures facial images via webcam and performs preprocessing such as resizing, noise reduction, normalization, and grayscale conversion.
- Emotion Detection: Uses an LSTM-CNN model to classify emotions into Happy, Sad, Neutral, and Energetic. The emotion with the highest probability is selected, with low-confidence predictions assigned as Neutral.
- Recommendation Engine: Filters songs based on detected emotion and language preference using content-based ranking.
- Music Playback: Enables users to play recommended songs directly from local storage.
- Database: Stores user data and song metadata for efficient offline access.
-
-
System Workflow
- User logs into the system
- Facial image is captured and preprocessed
- Emotion is detected using the LSTM-CNN model
- Songs are filtered based on emotion and language
- Recommended playlist is displayed
- User plays selected songs
The implementation demonstrates efficient real-time emotion detection and personalized music recommendation in an offline environment.
-
-
Results and Discussion
-
Experimental Results
The offline system was tested using a webcam-based emotion detection module integrated with a local music recommendation database. The system detects the user's facial emotion and recommends songs based on the detected mood and selected language. The application runs completely offline, ensuring user privacy.
-
Initial Interface: Figure 2 shows the Initial Interface of the Emotion-Based Music Recommendation System. The interface acts as the main control panel, containing options for camera selection, starting the camera, capturing an image, selecting language, and displaying recommended songs. Initially, the detected mood is shown as Unknown since no image has yet been captured.
-
Language Selection Interface: Figure 3 shows the language selection interface. Users can select their preferred language, such as English, Hindi, or others, before generating song recommendations. This allows the recommendation engine to filter songs by both mood and language, enhancing personalization.
Fig. 2. Initial Interface of Mood-Based Song Recommender System
Fig. 3. Language Selection Interface
-
Emotion Detection Result: Figure 4 illustrates the emotion detection process. The webcam captures the user's facial image, which is analyzed by the LSTM-CNN model to determine the emotional state. In this example, the detected emotion is Happy, confirming correct functioning of the offline emotion detection module.
Fig. 4. Emotion Detection Result
-
Song Recommendation Output: Figure 5 displays the song recommendation results. After detecting the user's emotion, the system recommends songs that match the detected mood and chosen language. Each recommended song is presented with audio playback controls, allowing immediate playback within the system.
Fig. 5. Song Recommendation Output
-
Experimental Summary: The experiments show that:
- Webcam captures images successfully.
- Emotion detection predicts mood accurately.
- Songs are filtered correctly by mood and language.
- Recommended songs are loaded and played.
- Entire system operates offline without internet.
-
-
-
Performance Evaluation
The performance of the Emotion-Based Music Recommendation System was evaluated based on emotion detection accuracy, system response time, and recommendation functionality. The emotion detection model was trained using the FER2013 dataset and tested on facial images captured through a webcam.
TABLE I
System Performance Evaluation
| Parameter | Value |
| --- | --- |
| Emotion Detection Accuracy | 86.5% |
| Number of Emotion Classes | 4 |
| Dataset Used | FER2013 |
| Total Songs in Database | 524 |
| Average Response Time | 1.8 sec |
| System Mode | Offline |
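The paper does not describe how the accuracy and response-time figures in Table I were measured. A sketch of how such figures are typically computed is given below; the function names, the `run_once` callable standing in for one capture-to-playlist cycle, and the number of repeats are assumptions.

```python
import time
from typing import Callable, List, Sequence


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def average_response_time(run_once: Callable[[], object], repeats: int = 10) -> float:
    """Mean wall-clock seconds for one capture-to-playlist cycle."""
    timings: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


print(accuracy(["Happy", "Sad", "Neutral", "Happy"], ["Happy", "Sad", "Happy", "Happy"]))
# 0.75
```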
-
Discussion
The Mood-Based Song Recommender successfully integrates emotion detection with a music recommendation engine in an offline environment. Key observations include:
- Offline operation improves privacy and reduces dependency on internet connectivity.
- Emotion detection accuracy depends on lighting, camera quality, and facial visibility.
- Recommendation quality relies on proper categorization of the song database by mood and language.
- Fast response time is achieved as computations are performed locally.
Overall, the system demonstrates that emotion-based recommendations significantly enhance user experience by automatically aligning music suggestions with the user's mood.
-
-
Conclusion
This paper presented an offline Emotion-Based Music Recommendation System using an LSTM-CNN architecture for facial emotion recognition. The system detects user emotions and recommends songs based on mood and language from a local music database. The model was trained using the FER2013 dataset and achieved an emotion detection accuracy of 86.5% with an average response time of 1.8 seconds. The system successfully integrates emotion detection, recommendation, and music playback in a single offline application. The results show that emotion-based music recommendation enhances user experience by providing personalized music suggestions. Future work can include improving emotion recognition accuracy and developing a cloud-based recommendation system with larger datasets.
-
Future Work
In future work, the emotion detection accuracy can be improved by using advanced deep learning models. More emotion categories can be added to improve music recommendation quality. The system can also be integrated with online music streaming services such as Spotify or YouTube Music. Additionally, the system can be developed as a mobile application for real-time emotion-based music recommendation.
Acknowledgment
The authors sincerely thank Sphoorthy Engineering College, Hyderabad, India, for providing the necessary facilities and support to carry out this research. Special thanks to Assistant Professor Roqia Tabassum and the Department of Computer Science and Engineering for their guidance and mentorship throughout the development of this project.
References
[1] P. Ekman, "An Argument for Basic Emotions," Cognition and Emotion, vol. 6, no. 3-4, pp. 169-200, 1992.
[2] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems (NIPS), 2012.
[3] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[4] Y. Kim et al., "Music Emotion Recognition: A State of the Art Review," Proceedings of the International Society for Music Information Retrieval (ISMIR), 2010.
[5] F. Ricci, L. Rokach, and B. Shapira, Recommender Systems Handbook, Springer, 2015.
[6] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.
[7] A. Mollahosseini, D. Chan, and M. H. Mahoor, "Going Deeper in Facial Expression Recognition Using Deep Neural Networks," IEEE Winter Conference on Applications of Computer Vision, 2016.
[8] S. Li and W. Deng, "Deep Facial Expression Recognition: A Survey," IEEE Transactions on Affective Computing, 2022.
[9] H. Lee and K. Lee, "Music Recommendation System Using Emotion Recognition Based on Deep Learning," IEEE Access, 2021.
[10] S. Zhang, L. Yao, A. Sun, and Y. Tay, "Deep Learning Based Recommender System: A Survey and New Perspectives," ACM Computing Surveys, 2019.
[11] R. Panda, R. Malheiro, and R. Paiva, "Novel Audio Features for Music Emotion Recognition," IEEE Transactions on Affective Computing, 2020.
[12] B. Schuller, S. Steidl, and A. Batliner, "The INTERSPEECH 2009 Emotion Challenge," INTERSPEECH, 2009.
