DOI : 10.17577/IJERTCONV14IS010044- Open Access

- Authors : Prathiksha K, Ms Jayashree M
- Paper ID : IJERTCONV14IS010044
- Volume & Issue : Volume 14, Issue 01, Techprints 9.0
- Published (First Online) : 01-03-2026
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
Explainable AI-Based Feedback Rating Prediction for Healthcare Applications
Prathiksha K
Department of Computer Applications St. Joseph Engineering College, Mangalore
Ms Jayashree M Assistant Professor Department of Computer Applications
St. Joseph Engineering College, Mangalore
Abstract- Accurately and honestly evaluating patient feedback is important for guaranteeing high-quality care in the expanding field of digital healthcare. To predict and explain star ratings (1–5) from free-text patient reviews, this paper presents an Explainable AI-based framework that combines a fine-tuned DistilBERT model with LIME. The model was trained on more than 61,000 anonymised NHS GP reviews and achieves an accuracy of 81.24%. By providing token-level visualisations, LIME improves model transparency and enables stakeholders to understand how particular words in a review affect the predicted rating. The entire framework is used by AyushCare, a cross-platform application for doctor consultations built with Flutter, Node.js, and MongoDB. By supporting end-user rating display and doctor-level interpretability, it improves decision-making in clinical and administrative workflows.
The suggested approach bridges the gap between black-box AI models and interpretable decision-making in healthcare review analysis.
Index Terms- BERT, Explainable AI, Healthcare Feedback, LIME, Rating Prediction, Sentiment Classification
INTRODUCTION

The patient experience has been recognized as a key factor in evaluating quality of care and service delivery. As the digitisation of healthcare systems has rapidly extended patients' access to information, online reviews have quickly become the primary channel through which patients express their satisfaction or dissatisfaction. These reviews are useful repositories of information about clinical outcomes, healthcare professional behaviour, service delays, and overall care experiences. However, their unstructured format and high volume pose a major challenge for healthcare professionals trying to identify problems, assess trends, and implement data-driven improvements; manually analysing thousands of reviews is not only subjective but also time-consuming, leading to inconsistent decisions. Automating the analysis of patient reviews is further complicated by frequent mismatches between the sentiment expressed in the text and the star rating. Commonly used sentiment analysis techniques typically return binary labels (positive versus negative) that do not provide enough information for a fuller assessment of performance. Additionally, although many effective machine learning models are available, they are frequently black boxes that offer little or no transparency into how they arrive at their, sometimes biased, predictions.
This paper presents an explainable AI-based rating prediction model that addresses these issues by predicting star ratings (1–5) directly from patient feedback using BERT and explaining predictions using LIME (Local Interpretable Model-Agnostic Explanations). The tool is designed to integrate into the AyushCare platform to assist healthcare administrators and clinicians in understanding patient sentiment at scale. Unlike conventional models, this approach combines state-of-the-art deep contextual understanding (via BERT) with post-hoc interpretability (via LIME). This dual approach builds trust in AI-generated decisions by providing clear, well-defined explanations of the reasoning behind each prediction. Such clarity is especially important for real-world applications in sensitive domains like healthcare.
LITERATURE REVIEW
Although review rating prediction is a relatively new subfield within natural language processing (NLP), notable advances have been made in recent years to enhance both prediction accuracy and model interpretability. Early approaches often relied on classical machine learning algorithms with basic text features, but modern solutions now leverage deep learning and transformer-based architectures.
In [1], the authors benchmarked traditional models such as Naïve Bayes, SVM, Random Forest, and neural networks on a large dataset of vegan and vegetarian restaurant reviews. Using TF-IDF and GloVe embeddings, they found that BERT and SVM obtained the highest rating-prediction accuracy, at 74%. However, one significant limitation of their work was the lack of interpretability, as BERT's predictions were treated as a black box. The dataset was also highly imbalanced, with over 70% of reviews being 5-star. This led to low-rating classes, such as 1-star reviews, having poor recall (0.35) and F1-score (0.46). The model still struggled to generalise across all rating levels despite the authors' use of class weighting, showing a clear disparity in performance for underrepresented sentiments.
To improve both structural and sentiment granularity in rating prediction, the authors of [2] put forth the Review Text Granularity (RTG) model. The model uses classifiers such as Logistic Regression (LG) to compute continuous sentiment scores (SSs) from review texts; these scores are then fed into several regression algorithms for numerical rating prediction, including Linear Regression (LRR), Ridge Regression (RER), and Lasso Regression (LAR). Although these conventional models produced dependable results (e.g., 87% accuracy with LG and an RMSE of 0.2677 with LRR), the RTG framework relied solely on TF-IDF-based text representations: it lacked support for explainability tools like LIME or SHAP and did not use contextual word embeddings like BERT or GloVe. The primary drawback is that, despite its efficiency, the RTG model cannot provide transparency in prediction reasoning or capture deep semantic context, which prevents its use in sensitive fields where interpretability and context matter, such as customer experience systems or healthcare.
For product review analysis, a hybrid approach combining a sentiment classifier and a star rating predictor was presented in [3]. In addition to testing deep learning techniques like Bi-LSTM and Bi-GRU, the study investigated rule-based and machine learning models (such as Naïve Bayes, Logistic Regression, and SVM). However, the deep learning models did not outperform the traditional methods, due to a shortage of data and a domain mismatch. Because deep learning was not fully utilised in their final implementation, the system could not handle contextual details in complicated or ambiguous reviews. This presents a gap that can be addressed by integrating fine-tuned transformer-based models like BERT or DistilBERT.
In [4], the authors focused on automating sentiment classification for feedback analysis using a combination of rule-based, machine learning, and deep learning techniques, and found that BERT-based models significantly outperformed traditional methods, achieving over 90% accuracy. However, the model was trained on an unbalanced dataset without any balancing technique, which may have skewed performance toward the majority classes. Additionally, BERT was treated as a black box, and the work lacked real-time explanation techniques such as LIME or SHAP, limiting its transparency in critical applications.
In [5], the authors evaluated Naïve Bayes and LSTM models for feedback rating prediction using Word2Vec embeddings on Amazon product reviews. The findings revealed that Naïve Bayes outperformed LSTM in accuracy, which can be attributed to the model's simplicity and the limitations posed by the dataset. The key gap identified lies in the models' inability to capture deep semantic context and provide interpretable predictions. As a future enhancement, the authors proposed using larger, more diverse datasets and fine-tuning transformer-based models like BERT to improve accuracy, generalizability, and transparency across application areas.
Despite these advancements in sentiment classification and rating prediction, gaps remain in the integration of explainable AI, context-aware modeling, and domain adaptation, particularly in healthcare feedback systems where decision transparency is crucial. This work addresses these gaps by applying a BERT-based model enhanced with LIME for interpretability and evaluating its performance on real-world GP reviews.
METHODOLOGY
This section outlines the complete workflow used to build and evaluate the proposed explainable AI model for predicting star ratings from patient feedback. The methodology covers data preprocessing, model architecture, training setup, and interpretability techniques. A fine-tuned DistilBERT transformer model is used for classification, while LIME is integrated to enhance model transparency.
System Design
The architecture of the proposed system is divided into three major components: text encoding, rating classification, and prediction interpretability.
Initially, the DistilBERT tokeniser is applied to tokenise the cleaned review text, preserving context in subword format. The pretrained distilbert-base-uncased model is then used to create rich contextual embeddings from the tokenised input, encapsulating the structure of the reviews.
In the second step, the output embeddings from DistilBERT serve as input to a classification head containing a fully connected layer. Training was performed without class weightings, using a standard cross-entropy loss. Despite the dataset being imbalanced, the model generalised well, owing to the volume of the dataset and the contextual information provided by DistilBERT; this reduced predictive bias towards the majority examples (5-star reviews).
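As an illustration of this step, the sketch below applies a fully connected layer with softmax over the five rating classes to a pooled embedding, followed by the cross-entropy loss used in training. The 768-dimensional pooled vector and the random weights are illustrative assumptions, not the trained model:

```python
import numpy as np

def classification_head(pooled, W, b):
    # fully connected layer over the pooled DistilBERT embedding,
    # followed by a numerically stable softmax over the 5 rating classes
    logits = pooled @ W + b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def cross_entropy(probs, label):
    # standard cross-entropy loss for a single example
    return -np.log(probs[label])

rng = np.random.default_rng(0)
pooled = rng.standard_normal(768)            # hypothetical 768-d pooled output
W, b = rng.standard_normal((768, 5)), np.zeros(5)
probs = classification_head(pooled, W, b)    # probability for each star class
```

In the real system these weights are learned jointly with DistilBERT during fine-tuning rather than drawn at random.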
To ensure interpretability, the model applies LIME (Local Interpretable Model-Agnostic Explanations) at the last step. This technique perturbs the input text to build local surrogate approximations around a prediction and, from these, estimates the significance of each word with respect to the predicted class.
This allows for transparent and understandable insights to be generated that are important for healthcare applications by focusing on the parts of the review that most influenced the model's decision.
Fig. 1. Architecture of the proposed explainable rating prediction system.
Data Source
The England NHS GP Reviews (2022–2024) dataset, which includes over 61,000 anonymised patient reviews uploaded to the NHS public portal, is used in this study. Every record includes a star rating from 1 to 5 and a free-text review summarising the patient's experience. Two columns were used in the study: Review Comment Text, which holds the textual feedback, and Star Rating, which provides the ground-truth label. Several preprocessing steps prepared the raw text for model input. First, reviews with missing or null values were removed. To standardise the text, the remaining reviews were converted to lower case and special characters were removed through regular expressions. Any reviews left with zero characters after cleaning were also discarded. Finally, the star ratings were converted to integer labels in the range 0 to 4, matching the number of classification categories the model expects, using scikit-learn's LabelEncoder.
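The cleaning steps above can be sketched as follows. The sample rows and the exact regular expressions are illustrative assumptions, not the actual NHS data or the paper's exact code:

```python
import re
from sklearn.preprocessing import LabelEncoder

def clean_review(text: str) -> str:
    # lower-case and strip special characters, as described above
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove special characters
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

# hypothetical sample rows: (Review Comment Text, Star Rating)
rows = [("Great GP, very kind!!", 5), ("Rude staff.", 1), ("Awful wait...", 2),
        ("OK visit", 3), ("Good care overall", 4), (None, 3)]

rows = [(t, r) for t, r in rows if t]              # drop missing/null reviews
rows = [(clean_review(t), r) for t, r in rows]
rows = [(t, r) for t, r in rows if len(t) > 0]     # drop empty after cleaning

encoder = LabelEncoder()
labels = encoder.fit_transform([r for _, r in rows])   # ratings 1–5 -> labels 0–4
```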
Fig. 2. Distribution of ratings in dataset
Technologies and Algorithms
The primary model utilized in this research is DistilBERT, a lighter and faster variant of the original BERT model, known for preserving approximately 97% of its language-understanding capability while offering improved efficiency. The tokeniser and model weights are loaded using the Hugging Face Transformers library.
To ensure uniform input dimensions, tokenisation is carried out with padding and truncation enabled and a maximum token length of 128. This study did not use any particular class-imbalance handling strategies, such as weighted loss or oversampling. Instead, a conventional cross-entropy loss was utilised, and the model depended on DistilBERT's generalisation ability to handle the skewed class distribution. The model is trained for four epochs with a batch size of 16 and optimised using the Adam optimiser with a learning rate of 3e-5. Standard classification metrics, namely accuracy and the macro-averaged F1-score, are used to evaluate the model.
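These hyperparameters can be collected into a Hugging Face `TrainingArguments` configuration roughly as follows. The output directory and the evaluation/saving strategies are assumptions based on the training setup described here, not the paper's exact script:

```python
from transformers import TrainingArguments

# Values stated in the text; everything else is an illustrative assumption.
training_args = TrainingArguments(
    output_dir="./distilbert-ratings",     # hypothetical path
    num_train_epochs=4,
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    evaluation_strategy="epoch",           # evaluate after every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,           # keep the best-performing checkpoint
)
```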
To explain individual predictions, LimeTextExplainer is applied. This technique uses perturbation sampling to build interpretable surrogate models around specific predictions, highlighting the most influential tokens for each predicted class.
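The core mechanism behind this can be illustrated without the `lime` library itself: randomly mask words in the input, query the black-box model on the perturbed texts, and fit a linear surrogate whose coefficients approximate per-word importance. The toy black box below (which simply reacts to the word "rude") and all names are illustrative assumptions, not the trained classifier:

```python
import random
import numpy as np
from sklearn.linear_model import Ridge

def toy_black_box(texts):
    # hypothetical stand-in for the classifier: high "negative" score
    # whenever the word "rude" is present
    return np.array([1.0 if "rude" in t.split() else 0.1 for t in texts])

def lime_style_weights(text, black_box, n_samples=200, seed=0):
    rng = random.Random(seed)
    tokens = text.split()
    masks, samples = [], []
    for _ in range(n_samples):
        mask = [int(rng.random() > 0.5) for _ in tokens]   # keep/drop each word
        masks.append(mask)
        samples.append(" ".join(t for t, keep in zip(tokens, mask) if keep))
    preds = black_box(samples)                             # query the black box
    surrogate = Ridge(alpha=1.0).fit(np.array(masks), preds)
    return dict(zip(tokens, surrogate.coef_))              # per-word importance

weights = lime_style_weights("the receptionist was rude and unhelpful", toy_black_box)
```

With this toy setup, the surrogate assigns by far the largest weight to "rude", mirroring the token-level importance scores that LimeTextExplainer reports.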
Implementation Details
The complete process was implemented in Python 3.10 on Google Colab, using a Tesla T4 GPU to accelerate training. Key libraries include Hugging Face Transformers and Datasets for managing models and data, PyTorch for model training, Scikit-learn for preprocessing and evaluation, and LIME for interpretability.
Using stratified sampling, the processed dataset was split into 80% training and 20% testing subsets and transformed into a Hugging Face Dataset object. The best-performing model was automatically saved as part of the training setup, with evaluation scheduled after every epoch. After training, the tokenizer and model were saved to disk and exported as a zip file for deployment.
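The stratified split can be reproduced with scikit-learn; the toy labels below (skewed toward a majority class, as in the real data) are illustrative assumptions:

```python
from sklearn.model_selection import train_test_split

# hypothetical imbalanced labels: 450 five-star (class 4) and 50 one-star (class 0)
labels = [4] * 450 + [0] * 50
texts = [f"review {i}" for i in range(500)]

# stratify=labels preserves the class proportions in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)
```

Because the split is stratified, the 20% test set keeps the same class ratio as the full dataset instead of over- or under-sampling the minority class by chance.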
To demonstrate the model's explainability, five representative reviews, one from each rating class (1 to 5), were selected for generating LIME-based visualizations. These highlight the specific words that contributed most to the predicted star rating. As shown in Fig. 3, for a 1-star review, words like rude, terrible, and unhelpful contributed most strongly toward the predicted class. Such explanations serve as qualitative evidence of the model's ability to justify its predictions, addressing the need for transparency in clinical decision support systems.
Fig. 3. LIME explanation for a 1-star review showing word importance scores used by the model.
RESULTS AND EVALUATION
Standard classification metrics, namely accuracy, the confusion matrix, and the macro-averaged F1-score, were used to evaluate the model on a 20% held-out test set. These indicators were chosen to show how well the model performs across all five rating classes (1–5 stars), including minority classes like 1-star and 2-star ratings.
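All three metrics can be computed with scikit-learn; the toy labels below are illustrative, not the paper's actual predictions:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# hypothetical ground truth and predictions over three rating classes
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]

acc = accuracy_score(y_true, y_pred)            # fraction of exact matches
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
cm = confusion_matrix(y_true, y_pred)           # rows: true class, cols: predicted
```

The macro average weights every class equally, which is why it is the fairer headline number for an imbalanced rating distribution.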
The model achieved an overall accuracy of 81.24% after four training epochs using standard cross-entropy loss, demonstrating a strong correspondence between predicted and actual patient ratings. This performance is comparable to, and sometimes better than, findings from related recent studies. For example, Hani et al. [1] used BERT to predict ratings with 74% accuracy and without an explainability component.
The classification report (Fig. 5) and confusion matrix (Fig. 4) display the full evaluation results.
The model performs well in forecasting 1-star and 5-star reviews, which predominate the dataset, as seen in Fig. 4.
However, it shows reduced performance in distinguishing between mid-range ratings (2–4 stars), particularly where sentiment is more ambiguous or neutral. This behavior is typical in ordinal multi-class classification, where confusion between adjacent classes is more likely.
The precision, recall, and F1-score for each class are shown in Fig. 5. The model performed reasonably well even on underrepresented classes such as 2 and 3, achieving a weighted-average F1-score of 0.79 and a macro-averaged F1-score of 0.62.
Fig. 4. Confusion Matrix
Fig. 5. Classification Report
To verify the model's interpretability, LIME explanations were produced for one review from each rating category. An example for a 5-star review is shown in Fig. 7, which highlights the key terms that influenced the model's decision. These explanations shed light on the model's reasoning and confirm that it is consistent with human intuition, which is particularly important in clinical feedback systems.
Fig. 6. Feedback Submission
Fig. 7. Explanation using LIME
To analyze the training behavior and monitor potential overfitting, a line plot of training and
validation loss was generated, as displayed in Fig. 8. The validation loss increases slightly after epoch 3, suggesting minor overfitting, but the model maintains stable performance overall.
Fig. 8. Training and validation loss across epochs.
FUTURE WORK
Although the current approach shows encouraging results in accurately and interpretably predicting patient satisfaction ratings, several areas remain for improvement. Aspect-based sentiment analysis could break reviews down into more manageable categories, such as clinical care, staff behavior, or appointment availability, yielding a more detailed assessment of the patient experience.
From a modeling perspective, more complex architectures such as BERT-BiLSTM hybrids or domain-specific transformers like BioBERT could be evaluated for gains in healthcare-specific language comprehension. In terms of interpretability, combining LIME with complementary explanatory frameworks such as SHAP could enhance clinical trust by providing multi-perspective insights into the logic behind predictions.
For deployment, the model can be encapsulated within a RESTful API or served through a Streamlit or FastAPI application. This would allow healthcare providers to upload reviews in real time and receive predicted ratings along with interpretability visualizations, enabling actionable feedback analytics at scale.
CONCLUSION
This research presents an explainable AI-based approach for automated patient rating prediction using free-text reviews from NHS GP feedback. By fine-tuning a pretrained DistilBERT model and integrating LIME for interpretability, the system achieves a high prediction accuracy of 81.24% while also providing transparency in decision-making, an essential requirement for real-world healthcare applications.
The methodology preserves data integrity, and outputs interpretable insights through local token- level explanations. The results demonstrate that transformer-based models, when paired with explainable AI tools, can effectively support clinical feedback systems by reducing manual workload and enabling data-driven evaluation of healthcare services.
This work, implemented within the AyushCare platform, sets a foundation for the deployment of intelligent, interpretable NLP systems in the healthcare sector, facilitating real-time review analytics and supporting continuous improvement in patient care delivery.
REFERENCES
[1] A. Hani, I. Cviti, and A. Kirbi, "Comparing machine learning models for sentiment analysis and rating prediction of vegan and vegetarian restaurant reviews," Computers, vol. 13, no. 10, p. 248, Oct. 2024, doi: 10.3390/computers13100248.
[2] R. Garapati and M. Chakraborty, "Enhancing Sentiment Analysis and Rating Prediction Using the Review Text Granularity (RTG) Model," IEEE Access, vol. 13, pp. 20071–20084, Jan. 2025, doi: 10.1109/ACCESS.2025.3534261.
[3] A. Jain, A. Vijay, and S. Bhargava, "Products Review Rating Prediction from Users' Text Reviews," International Journal of Engineering Research & Technology (IJERT), vol. 11, no. 6, 2022.
[4] A. N. M. Anmigha and J. Devis, "Text Classification for Feedback Analysis: Automating Rating System with NLP," in Proceedings of the National Conference on Emerging Computer Applications (NCECA), vol. 1.6, issue 1, Amal Jyothi College of Engineering, Kottayam, Kerala, India, 2025, doi: 10.5281/zenodo.15429681.
[5] B. Priyanka, Y. Amaraiah, and D. D. D. S. Suribabu, "Rating and Analysis of Customer Reviews Using Natural Language Processing," International Journal of Creative Research Thoughts (IJCRT), vol. 10, no. 12, pp. 199–204, Dec. 2022. [Online]. Available: https://www.ijcrt.org
