
Employee Attrition Prediction

DOI : 10.17577/IJERTCONV14IS010080



C Aswathi, Mr Gururaja S
Department of Computer Applications

St Joseph Engineering College, Mangalore, Karnataka, India

Abstract – Employee turnover is a significant challenge for modern organizations. It impacts productivity, team cohesion, employee morale, and long-term workforce strategies. Leveraging machine learning to forecast attrition enables HR to implement proactive measures to retain key employees and reduce replacement expenses. This study applies a Random Forest classifier to the IBM HR dataset, attaining an accuracy of 99%. We compare Logistic Regression, Extra Trees Classifier, Naive Bayes, and Gradient Boosting models and reference recent research from India and around the world. The findings indicate that ensemble methods excel at capturing intricate feature interactions and non-linear relationships, and that implementing predictive analytics within HR functions can significantly enhance talent management, bolster organizational stability, and facilitate data-driven decision-making for retaining employees. This research highlights the practical importance of HR analytics in accurately forecasting attrition and informing strategic planning efforts.

Keywords: Employee turnover, Machine Learning, Random Forest Classifier, HR Analytics, Predictive Analytics, IBM HR Dataset.

  1. INTRODUCTION

    In the current competitive business landscape, it is essential to keep talented employees. Employee turnover includes resignations, terminations, and retirements. It results in high costs due to increased recruitment and training expenses. It also causes decreased productivity, loss of knowledge within the organization, and lower team morale. Furthermore, high turnover can tarnish an organization's reputation, making it harder to attract new talent.

    By using data analytics and machine learning (ML), companies can now look at historical employee data to find patterns of attrition. ML algorithms can predict which employees are likely to leave. This allows HR to step in early with customized retention strategies that tackle issues and improve job satisfaction. This shift takes HR from a reactive approach to a more proactive way of managing the workforce.

    This research presents a Random Forest-based model for predicting attrition, evaluating its effectiveness against well-established algorithms from five academic studies, specifically: Logistic Regression (IIMA)[1], Extra Trees Classifier (MDPI)[2], Naïve Bayes (Computers journal)[3], and Gradient Boosting (IJDM)[4]. The goal is to assess the model's performance, identify key factors like job role, satisfaction, salary advancement, and workplace environment, and produce useful insights for HR. Additionally, the study investigates the practical use of ML in HR analytics, tackling issues related to model implementation, data quality, and ethical considerations in data usage. The outcomes are intended to strengthen employee retention strategies within organizations, promoting sustainable growth and fostering a positive workplace culture in the global economy.

    Fig 1: Architecture Diagram

  2. LITERATURE SURVEY

    Khare et al. [1] used Logistic Regression to predict employee turnover in a large IT firm in India. The model focused on demographic factors like age, gender, marital status, and job satisfaction ratings. Although it attained a reasonable accuracy of around 87%, its linear assumptions made it less effective at capturing non-linear relationships among features. Furthermore, the study did not incorporate sophisticated preprocessing techniques or ensemble learning models, which limited its relevance in dynamic workplace datasets.

    Raza et al. [2] employed an Extra Trees Classifier on the IBM HR dataset to investigate employee turnover through an ensemble methodology. This approach achieved 93% accuracy and addressed class imbalance using the Synthetic Minority Over-sampling Technique (SMOTE). The authors also conducted correlation analysis and assessed feature importance to pinpoint key predictors, such as monthly income, job role, and overtime. The research showed that ensemble learning techniques can identify complex patterns better than linear models.

    Fallucchi et al. [3] evaluated different classification models and found that the Naïve Bayes classifier performed best in recall at 54%. Its overall accuracy was about 81%. The simple design of Naïve Bayes made it useful for smaller datasets and for quick testing. However, its independence assumptions limited its effectiveness with datasets that have interdependent features. Their research highlighted the importance of choosing models that fit the structure of HR data.

    Antony and Haritha [4] performed a comparative analysis of Logistic Regression and Gradient Boosting classifiers utilizing the IBM HR dataset. Logistic Regression attained an accuracy of 87%, whereas Gradient Boosting slightly surpassed this with an accuracy of 88%. Their research also identified significant factors such as job satisfaction, monthly income, and performance rating as key features. They concluded that Gradient Boosting offers superior precision and interpretability for HR predictions.

    Alduayj and Rajpoot [5] looked into several machine learning algorithms, such as K-Nearest Neighbors (KNN) and Random Forest. They used a synthetic dataset based on IBM. Their experimentation with data balancing methods like Adaptive Synthetic Sampling (ADASYN) enhanced the performance of the classifications. KNN recorded an impressive F1-score of 0.93, while Random Forest achieved a score of 0.91. Despite KNN's strong performance, the authors pointed out its computational inefficiencies and challenges with large-scale datasets, thereby favoring Random Forest for practical applications.

  3. METHODOLOGY

    1. Dataset

      The dataset used in this research is the IBM HR Analytics Employee Attrition dataset, imported in CSV format for analysis and modeling. It includes HR records for 1,470 employees. This dataset is often used in academic and industry research to examine employee behavior, identify attrition trends, and create predictive models. It contains a variety of employee attributes such as satisfaction level, last evaluation, number of projects, average monthly hours, tenure at the company, work accident history, promotion status over the past five years, department, and salary level. The main target variable is 'left', a binary indicator where 1 means an employee has left and 0 means an employee has stayed.
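As a concrete illustration of the record layout described above, the sketch below builds a tiny in-memory stand-in for the CSV; the column names and values are illustrative assumptions, not the full dataset:

```python
import pandas as pd

# In practice the full CSV would be loaded, e.g.:
# df = pd.read_csv("hr_attrition.csv")  # file name is a hypothetical placeholder

# Tiny illustrative stand-in mirroring the attributes described in the text
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "average_monthly_hours": [157, 262, 272],
    "department": ["sales", "sales", "hr"],
    "salary": ["low", "medium", "low"],
    "left": [1, 0, 1],  # binary target: 1 = employee left, 0 = stayed
})

# Inspecting the target distribution is the natural first step
print(df["left"].value_counts())
```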

    2. Data Inspection

      An initial analysis was performed to grasp feature distributions, data types, and to identify any anomalies or missing values. The overview of the dataset indicated a well-organized format with no significant missing values, encompassing both numerical and categorical data. Examining categorical variables like 'department' and 'salary', as well as the distribution of the 'left' variable, revealed possible class imbalance. We gathered descriptive statistics for both groups (attrited and non-attrited) to compare their means and feature distributions. This analysis informed the next steps in preprocessing.
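The group-wise comparison of means described above can be sketched with a pandas groupby; the toy values and column names here are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical sample: two attrited (left=1) and two retained (left=0) employees
df = pd.DataFrame({
    "satisfaction_level": [0.4, 0.9, 0.3, 0.8],
    "average_monthly_hours": [250, 160, 270, 150],
    "left": [1, 0, 1, 0],
})

# Compare feature means between the attrited and non-attrited groups
group_means = df.groupby("left").mean()
print(group_means)
```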

    3. Class Balance

      Class imbalance was examined by looking into the distribution of the target variable 'left'. To alleviate potential bias stemming from imbalanced classes, strategies such as oversampling the minority class using techniques like SMOTE (Synthetic Minority Over-sampling Technique) or ADASYN (Adaptive Synthetic Sampling) were contemplated to ensure robust model training.
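SMOTE and ADASYN are typically provided by the imbalanced-learn package; as a dependency-free sketch of the same rebalancing idea, plain random oversampling of the minority class can be shown with pandas alone (the 80/20 split below is an illustrative assumption):

```python
import pandas as pd

# Hypothetical imbalanced target: 8 stayed vs. 2 left
df = pd.DataFrame({
    "feature": range(10),
    "left": [0] * 8 + [1] * 2,
})

minority = df[df["left"] == 1]
majority = df[df["left"] == 0]

# Resample the minority class with replacement until the classes match in size
oversampled = minority.sample(len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, oversampled], ignore_index=True)
print(balanced["left"].value_counts())  # both classes now have 8 rows
```

SMOTE goes further by interpolating new synthetic minority points rather than duplicating existing ones, which reduces overfitting to repeated rows.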

    4. Data Preprocessing

      Cleaning the dataset and ensuring it is correctly formatted are essential prerequisites for dependable model performance.

    5. Categorical Encoding

      Categorical features such as Department and Salary were converted to numerical values, enabling algorithms to process these attributes. Techniques like label encoding and mapping (low, medium, high salaries as 1, 2, 3) transformed string values into integers. One-hot encoding was also applied to nominal categorical data to prevent the introduction of misleading ordinal relationships.
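The two encodings described above can be sketched as follows; the column names are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "department": ["sales", "hr", "technical"],
    "salary": ["low", "medium", "high"],
})

# Ordinal mapping for salary, which has a natural order
df["salary"] = df["salary"].map({"low": 1, "medium": 2, "high": 3})

# One-hot encoding for department, which is nominal (no meaningful order)
df = pd.get_dummies(df, columns=["department"])
print(df.columns.tolist())
# ['salary', 'department_hr', 'department_sales', 'department_technical']
```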

    6. Feature Engineering and Selection

      The dataset was examined for duplicate or strongly related features that could create noise or cause multicollinearity. Features exhibiting high collinearity or minimal variance were eliminated when necessary. Strategies such as correlation heatmaps and expert knowledge guided the decision about which variables to keep.
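One common way to operationalize the correlation check described above is to flag one feature from each highly correlated pair; the 0.9 threshold and the synthetic columns below are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic frame with a deliberately redundant feature pair
rng = np.random.default_rng(0)
base = rng.normal(size=100)
df = pd.DataFrame({
    "tenure": base,
    "tenure_months": base * 12 + rng.normal(scale=0.1, size=100),  # near-duplicate
    "satisfaction": rng.normal(size=100),
})

# Keep only the upper triangle of |corr| so each pair is inspected once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # ['tenure_months'] — the redundant duplicate is flagged
```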

    7. Model Building

      • Algorithm:

        The Random Forest Classifier was chosen for its strength, ability to handle different data types, and resistance to overfitting. The model was set up with parameters including the number of estimators (trees), with ten trees employed for demonstration.

      • Cross-Validation:

        To ensure the model's applicability and to prevent overfitting, k-fold cross-validation (usually with k=5 or 10) was conducted during training. This method divides the training data into multiple segments, training and validating the model iteratively to produce a more reliable evaluation of its performance.

      • Hyperparameter Tuning:

        Hyperparameters like the number of estimators, maximum tree depth, and minimum samples per leaf were tuned using grid search or randomized search techniques. This approach aimed to find the best setup that maximizes model accuracy and robustness.

      • Implementation:

        Python libraries like scikit-learn for modeling, pandas for data manipulation, numpy for numerical processing, and seaborn or matplotlib for visualization were used.

    8. Model Evaluation

      • Metrics:

        The performance of the trained model was evaluated on the reserved test set. We used the accuracy score, a classification report that includes precision, recall, and F1-score, and the confusion matrix. These metrics give a detailed look at predictive performance, especially for imbalanced classes.

    9. Result Visualization:

      Additional graphs and tables were employed to interpret the confusion matrix and feature importance, yielding actionable insights for HR decision-makers.

  4. RESULTS

    In this research, we compared the effectiveness of the Random Forest model with several other models mentioned in the literature. The model achieved an accuracy of 99.52% and an F1-score of 0.9866, clearly demonstrating its outstanding performance. For comparison:

    • The Logistic Regression model used in the IIMA study [1] achieved an approximate accuracy of 87% and an F1-score of 0.78.

    • The Extra Trees Classifier, employed in the MDPI study [2], reached an accuracy of 93% and an F1-score of 0.86.

    • The Naïve Bayes model discussed in the Computers Journal [3] yielded a lower accuracy of 81% and an F1-score of 0.65, due to its simplifying assumptions.

    • The IJDM study [4] reported Gradient Boosting with an accuracy of 88% and an F1-score of 0.82, showing improved performance over basic linear models.

    • In the IEEE study [5], the K-Nearest Neighbors (KNN) classifier achieved a commendable accuracy of 92% and an F1-score of 0.93 on a synthetic dataset.

    When contrasted with the other models, the Random Forest classifier used in this study showed a significant advantage, positioning it as a reliable and robust choice for predicting employee attrition.
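The model-building steps described in the Methodology (ten trees, k-fold cross-validation, hold-out evaluation) can be sketched end to end; the synthetic dataset, the 80/20 split, and the 5-fold setting below are illustrative assumptions standing in for the HR data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic imbalanced stand-in for the attrition data (80% stayed, 20% left)
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Ten trees, as used for demonstration in the paper
model = RandomForestClassifier(n_estimators=10, random_state=42)

# 5-fold cross-validation on the training split guards against overfitting
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy:", round(cv_scores.mean(), 3))

# Final fit and evaluation on the reserved test set
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```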

    Fig 2: Feature Importance Chart

    Fig 3: Confusion Matrix

    Among 1450 total cases, just 7 were misclassified, resulting in an overall accuracy rate of 99.52%. The model showcased outstanding performance in distinguishing between retention and attrition outcomes, with very low error rates. The low incidence of false positives and negatives emphasizes the model's trustworthiness in predicting employee behavior, making it a crucial asset for workforce planning and proactive retention efforts.
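The quoted figure follows directly from the confusion-matrix counts:

```python
# Arithmetic check of the reported accuracy: 7 errors out of 1450 cases
total_cases = 1450
misclassified = 7
accuracy = (total_cases - misclassified) / total_cases
print(round(accuracy * 100, 2))  # 99.52
```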

    Fig 4: Model Comparison Chart

    Fig 5: Evaluation Table

  5. DISCUSSION

    The Random Forest model showed better performance than other algorithms, achieving higher accuracy and F1- score, which underscores its effectiveness in predicting employee attrition. Its ensemble method, which combines several decision trees, along with its ability to capture non-linear relationships and complex feature interactions, gave it an edge over linear models like Logistic Regression and simpler probabilistic approaches such as Naïve Bayes.

    Using suitable feature encoding techniques, including One-Hot Encoding for categorical variables and StandardScaler for numerical features, ensured that the data was properly prepared for model training. Moreover, optimizing parameters (like the number of estimators and maximum depth) and maintaining a balanced train-test division contributed to the model's impressive performance. Key factors identified include Job Satisfaction, Overtime, and Monthly Income. These results align with previous research conducted both in India and internationally, thereby reinforcing the validity of this analysis.

    Even so, in spite of its outstanding predictive ability, the Random Forest model is less interpretable than linear models. HR professionals typically prefer models that offer straightforward insights into the decision factors influencing employee turnover. To mitigate this issue, future efforts could incorporate Explainable AI tools such as SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) to deliver visualizations of feature contributions, explanations of individual employee-level risk, and actionable insights for targeted retention strategies.

    Additionally, investigating advanced algorithms such as XGBoost, or combining deep learning models with suitable explainability techniques, could further boost predictive precision and practical usefulness. Expanding the dataset to include real-time organizational data and external influences such as industry trends, market conditions, and employee feedback can also enhance the model's generalizability for application within enterprise HR analytics platforms.

  6. CONCLUSION

    This study confirms the effectiveness of Random Forest in predicting employee turnover, achieving an accuracy of 99% on the IBM HR Analytics dataset. A comparison of Random Forest with other algorithms, such as Logistic Regression, Extra Trees Classifier, Naïve Bayes, and Gradient Boosting, drawn from five academic studies, highlights the superior performance of ensemble models in managing complex, real-world HR datasets. The ability of Random Forest to uncover non-linear relationships and interactions among features enhances its capability to accurately predict attrition outcomes. Identifying high-risk employees at an early stage allows organizations to implement targeted HR initiatives, improve job satisfaction, and reduce employee turnover rates. This study highlights important elements affecting employee turnover, including job satisfaction, overtime hours, and monthly earnings, offering useful insights for creating targeted retention strategies. By leveraging machine learning-based predictive analytics, HR departments can make data-driven decisions, enhance workforce stability, and lower costs associated with frequent hiring and training.

    However, in spite of its outstanding performance, this model has a number of notable practical limitations. First, the interpretability of Random Forest is comparatively limited when set against simpler models, making it challenging for HR professionals to grasp the precise reasoning behind each prediction. This lack of transparency could hinder its adoption in sensitive decision-making situations. Second, the model's performance might deteriorate when applied to real-time or changing datasets, particularly if organizational dynamics shift over time. Furthermore, crucial qualitative factors such as employee motivation, interpersonal relations, and external opportunities are not included in the dataset, which might restrict the model's predictive thoroughness.

    To mitigate these issues, future efforts should focus on testing with larger, more varied, and real-time HR datasets, alongside employing Explainable AI (XAI) tools such as SHAP or LIME to offer clearer, human-comprehensible explanations for model predictions. This will enhance trust, ensure ethical implementation, and promote more transparent, equitable, and efficient decision-making in employee retention strategies.

  7. REFERENCES

  1. S. Khare, A. Dey, and M. Mishra, "Employee attrition prediction using logistic regression," presented at the IIMA International Conference on Advanced Data Analysis, Business Analytics and Intelligence, Ahmedabad, India, Apr. 2011. [Online]. Available: https://www.academia.edu/download/56793153/Employee-Attrition-Risk-Assessment-using-Logistic-Regression-Analysis.pdf

  2. S. M. Raza, M. Z. Iqbal, R. A. Khan, and S. Khan, "Comparative analysis of machine learning models for predicting employee attrition," Applied Sciences, vol. 12, no. 13, p. 6424, Jul. 2022. [Online]. Available: https://www.mdpi.com/2076-3417/12/13/6424

  3. Fallucchi, Coladangelo, and Esposito, "A machine learning approach for predicting employee attrition using HR analytics," Computers, vol. 9, no. 4, p. 86, Dec. 2020. [Online]. Available: https://www.mdpi.com/2073-431X/9/4/86

  4. J. Antony and Haritha, "Comparative Analysis of Logistic Regression and Gradient Boosting for Employee Attrition Prediction," Indian Journal of Data Mining, vol. 6, no. 1, pp. 24–30, 2024. [Online]. Available: https://www.ijdm.latticescipub.com/wp-content/uploads/papers/v4i1/A163604010524.pdf

  5. S. S. Alduayj and K. Rajpoot, "Predicting Employee Attrition using Machine Learning," in Proc. 13th Int. Conf. Innovations in Information Technology (IIT 2018), UAE, 2018, pp. 93–98. [Online]. Available: https://ieeexplore.ieee.org/document/8605976

  6. IBM HR Analytics Employee Attrition & Performance Dataset. [Online]. Available: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset

  7. Scikit-learn documentation. [Online]. Available: https://scikit-learn.org/stable/

  8. S. Ponnuru, G. Merugumala, S. Padigala, R. Vanga, and B. Kantapalli, "Employee Attrition Prediction using Logistic Regression," Int. J. Res. Appl. Sci. Eng. Technol., vol. 8, pp. 2871–2875, May 2020, doi: 10.22214/ijraset.2020.5481. [Online]. Available: https://www.ijraset.com/fileserve.php?FID=29109

  9. R. S. Shankar, J. Rajanikanth, V. V. Sivaramaraju, and K. V. S. S. R. Murthy, "Prediction of employee attrition using data mining," in Proc. 2018 IEEE Int. Conf. System, Computation, Automation and Networking (ICSCAN), Pondicherry, India, Jul. 6–7, 2018. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8541242

  10. T. P. Salunkhe, "Improving employee retention by predicting employee attrition using machine learning techniques," M.S. thesis, Dublin Business School, Dublin, Ireland, 2018. [Online]. Available: https://esource.dbs.ie/server/api/core/bitstreams/97ac58d6-acbc-4f6f-ac03-eae8d8fad8ce/content

  11. V. V. Saradhi and G. K. Palshikar, "Employee churn prediction," Expert Systems with Applications, vol. 38, no. 3, pp. 1999–2006, Mar. 2011, doi: 10.1016/j.eswa.2010.07.134. [Online]. Available: https://www.sciencedirect.com/science/article/abs/pii/S0957417410007621