Predicting Lung Cancer Among Smokers

Canvil Joyal Lobo; Sumangala N

doi:10.17577/IJERTCONV14IS010014

Techprints 9.0 - 2026 (Volume 14 - Issue 01)


Open Access
Authors : Canvil Joyal Lobo, Sumangala N
Paper ID : IJERTCONV14IS010014
Volume & Issue : Volume 14, Issue 01, Techprints 9.0
Published (First Online) : 01-03-2026
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


Canvil Joyal Lobo

Department of MCA

St. Joseph Engineering College Mangalore, India

Sumangala N

Assistant Professor

St Joseph Engineering College Mangalore, India

Abstract – Lung cancer disproportionately affects individuals with a smoking history. Timely diagnosis significantly improves recovery outcomes, but conventional diagnostic tools are often invasive and expensive. This research presents a machine learning-based system designed to predict lung cancer risk among smokers using structured health survey data. The dataset comprises attributes related to symptoms, lifestyle habits, and medical background. After data preprocessing and exploratory analysis of the dataset, logistic regression was applied because of its efficiency in binary health classification. The model was evaluated using accuracy, F1-score, and ROC-AUC, and showed strong predictive performance. This study emphasizes how lightweight, transparent algorithmic systems can support proactive screening and guide clinical decision-making in resource-limited healthcare environments.

Index Terms – Lung Cancer Prediction, Machine Learning, Logistic Regression, Smokers, Health Survey Data

INTRODUCTION

Lung cancer remains one of the deadliest forms of cancer globally, with a particularly high incidence among long-term smokers. The disease often goes undetected in its early stages owing to the lack of noticeable symptoms, leading to delayed diagnosis and reduced chances of survival. Traditional diagnostic approaches such as CT scans and biopsies, while effective, are often costly, time-consuming, and inaccessible in many regions.

Contemporary advances in computational analytics have gained significant traction in the healthcare domain because of their capacity to recognize patterns in data and assist in predictive diagnostics. These algorithms can process structured patient information, including lifestyle factors, medical history, and symptoms, to produce a probabilistic evaluation of various conditions, including cancer. By integrating such approaches into preliminary screening systems, healthcare providers can identify high-risk individuals early and recommend timely interventions.

This study focuses on developing a predictive system that uses machine learning techniques to estimate lung cancer susceptibility among smokers. Unlike imaging-based systems that require large-scale infrastructure and medical personnel, the proposed model utilizes survey-based attributes such as coughing, shortness of breath, fatigue, and anxiety. The aim is to show that, with careful pre-processing and algorithm selection, even basic input data can yield meaningful predictions that support early detection efforts.

Employing an algorithmic prediction model not only enhances the speed and efficiency of screening but also has the potential to extend medical insights to underserved populations. Through this research, we aim to bridge the gap between clinical diagnostics and technology-driven prevention strategies, highlighting the importance of accessible and interpretable AI tools in healthcare.

RELATED WORK

Scholarly work has extensively examined the application of algorithmic techniques to the early detection and diagnosis of lung cancer. These studies emphasize the growing importance of computational models in supporting clinical decision-making, especially for high-risk groups such as smokers. In previous work, classification algorithms such as Decision Trees, maximum-margin classifiers such as Support Vector Machines (SVM), and ensemble methods like Random Forests have been used to evaluate structured patient data, demonstrating promising levels of accuracy in identifying cancer presence. Earlier researchers employed supervised learning on datasets comprising attributes like age, coughing, chest pain, and fatigue to train binary classifiers for cancer detection. These studies showed that such computational approaches, when properly trained and validated, can provide meaningful insights based on symptom-level data alone. Another notable direction has involved the use of logistic regression, which has shown consistent reliability and interpretability for binary classification tasks in healthcare. Compared to complex black-box models, logistic regression offers the advantage of clearly associating input variables with the output class, a trait especially valuable in medical applications where transparency is essential.

Furthermore, studies using ensemble methods and hybrid models have reported improvements in prediction accuracy by combining multiple classifiers. Multilayer perceptron models have also been explored, particularly with image-based data such as cross-sectional imaging, but they are constrained by intensive resource requirements and the need for vast annotated datasets, making them less practical in low-resource or survey-based settings. Most models tend to specialize in either imaging data or high-end clinical attributes, which may not be accessible to all populations. The approach presented in this paper addresses this limitation by using simple, structured survey inputs to build an accessible and lightweight prediction tool. This aligns with the broader goal of democratizing AI in healthcare and extending early diagnostic support to underserved communities.

METHODOLOGY

The proposed system employs a structured machine learning pipeline to predict lung cancer risk among individuals, particularly smokers, using survey-based health data. The methodology is organized into several key phases: data acquisition, preprocessing, exploratory analysis, feature selection, model training, and evaluation.

Data Collection and Preprocessing

This work used a publicly available health survey dataset, survey lung cancer.csv, which contains responses from 309 individuals. The dataset includes a mix of behavioural, physical, and psychological indicators related to lung health, particularly in relation to smoking habits. Each record corresponds to a unique individual and captures whether or not that person is suspected of having lung cancer based on their symptoms and lifestyle.

The dataset comprises 16 features, including:

Demographic data such as GENDER and AGE
Lifestyle factors such as SMOKING and ALCOHOL CONSUMING
Medical symptoms including FATIGUE, COUGHING, SHORTNESS OF BREATH, WHEEZING, CHEST PAIN, and SWALLOWING DIFFICULTY
Psychological or environmental triggers such as PEER PRESSURE and ANXIETY

The target variable is LUNG_CANCER, which indicates whether the individual is likely to have lung cancer (YES) or not (NO). Most features are categorical or ordinal in nature, with values typically coded as 1, 2, or binary (e.g., YES/NO, M/F). A few columns, such as AGE, contain numerical data.

During preliminary analysis, the dataset was systematically screened for consistency issues. Column names were standardized, missing or duplicate records were handled (if found), and categorical fields were label-encoded to prepare the data for modelling. The dataset was found to be balanced, with a reasonable distribution of both YES and NO classes in the LUNG_CANCER label, making it suitable for binary classification.

The advantage of this dataset lies in its survey-based nature, which eliminates the requirement for expensive diagnostic imaging procedures or lab results. Models developed using this dataset have the potential to be deployed in community health screening, especially in remote or resource-limited settings where accessible screening solutions are needed for early detection.

Data Preprocessing and Cleaning

Before feeding the dataset into any machine learning algorithm, a thorough preprocessing and cleaning phase was conducted to enhance data quality and ensure compatibility with the model. This step is essential for removing inconsistencies and preparing the features so that the algorithm can learn effectively.

Initially, the dataset was checked for missing values, duplicate entries, and inconsistent formatting. All 309 records were intact, and no null or duplicate values were detected, which eliminated the need for imputation or record removal.

Next, categorical fields such as GENDER and LUNG_CANCER were converted using label encoding. For example, the values M and F in the GENDER column were changed into binary values 1 and 0, respectively. Similarly, the LUNG_CANCER column, which initially contained YES and NO responses, was encoded into 1 and 0 to align with binary classification requirements.

Column names that included spaces or inconsistent formatting, such as ALCOHOL CONSUMING or CHRONIC DISEASE, were renamed using underscores (e.g., ALCOHOL_CONSUMING, CHRONIC_DISEASE) to ensure clean and readable syntax during code execution.

Finally, the value ranges of the features were reviewed. Since most columns consisted of categorical or ordinal values ranging from 1 to 2, no standard normalization or scaling was required. Continuous variables such as AGE were kept in their original form, given the model's tolerance for small-range numerical features.

Through these careful preprocessing and cleaning steps, the dataset was transformed into a well-structured, consistent format suitable for training and evaluating classification models. This process plays a crucial role in improving model accuracy and reducing noise that might otherwise lead to biased or unreliable predictions.
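The cleaning steps described in this section can be sketched as follows. The paper does not state which tools were used, so this is a minimal illustration using only the Python standard library. The three-row sample, the helper names (clean_column, preprocess, count_duplicates), and the choice to raise an error on missing values are assumptions made for the example; only the column renaming and the M/F and YES/NO encodings come from the text.

```python
import csv
import io

# Hypothetical three-row sample in the layout the paper describes; the real
# file has 309 rows and 16 columns.
SAMPLE_CSV = """GENDER,AGE,SMOKING,ALCOHOL CONSUMING,CHRONIC DISEASE,LUNG_CANCER
M,69,1,2,1,YES
F,55,2,1,2,NO
M,61,2,2,1,YES
"""

# Encodings stated in the paper: M/F -> 1/0 and YES/NO -> 1/0.
GENDER_CODES = {"M": 1, "F": 0}
TARGET_CODES = {"YES": 1, "NO": 0}


def clean_column(name):
    """Trim whitespace and join the words of a column name with underscores."""
    return "_".join(name.strip().split())


def preprocess(csv_text):
    """Return a list of dict rows with cleaned column names and integer values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for raw in reader:
        row = {clean_column(key): (value or "").strip() for key, value in raw.items()}
        if any(value == "" for value in row.values()):
            # Assumption: stop and let the analyst decide how to impute.
            raise ValueError("missing value found")
        row["GENDER"] = GENDER_CODES[row["GENDER"]]
        row["LUNG_CANCER"] = TARGET_CODES[row["LUNG_CANCER"]]
        # The remaining columns already hold numeric codes (1/2) or AGE.
        rows.append({key: int(value) for key, value in row.items()})
    return rows


def count_duplicates(rows):
    """Count rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)
    return duplicates


rows = preprocess(SAMPLE_CSV)
print(rows[0])
print("duplicates:", count_duplicates(rows))
```

In practice the same function would be pointed at the full survey file instead of the in-memory sample.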

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) was carried out to gain meaningful insights from the dataset and to understand how each feature behaves in relation to the target variable, LUNG_CANCER. This step helps in identifying patterns, correlations, and potential anomalies capable of affecting the behaviour of the predictive model.

The process began with analyzing the statistical spread of essential attributes, particularly AGE, SMOKING, and YELLOW_FINGERS. Visualizations like histograms and bar plots were used to examine how these attributes are distributed across the dataset. For instance, most individuals in the dataset were above 50 years of age, reinforcing the observation that susceptibility to lung cancer tends to increase with age.

Categorical variables such as COUGHING, WHEEZING, and SHORTNESS_OF_BREATH were explored using count plots to evaluate their frequency in relation to cancer status. Symptoms such as coughing and wheezing appeared more frequently among individuals labelled as positive for lung cancer, suggesting a strong association. A correlation heatmap was also generated to examine the linear relationships between features. Variables such as ALCOHOL_CONSUMING, ALLERGY, and SWALLOWING_DIFFICULTY showed notable correlation with the target, which provided further guidance during feature selection.
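The numbers behind a correlation heatmap are pairwise Pearson coefficients. The sketch below computes the coefficient between each feature and the target on a small, made-up sample. The eight toy records and the reading of the codes (1 = no, 2 = yes) are assumptions for illustration and are not taken from the survey data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy columns (assumed coding: 1 = no, 2 = yes) and a 0/1 target.
columns = {
    "COUGHING": [2, 2, 1, 2, 1, 1, 2, 2],
    "PEER_PRESSURE": [1, 2, 1, 2, 1, 2, 1, 2],
}
target = [1, 1, 0, 1, 0, 0, 1, 0]

correlations = {name: pearson(values, target) for name, values in columns.items()}

# List features from strongest to weakest absolute correlation with the target.
for name, r in sorted(correlations.items(), key=lambda item: -abs(item[1])):
    print(f"{name}: {r:+.2f}")
```

Pearson correlation only captures linear association, so a low value here does not by itself rule a feature out.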

Pie charts were used to show the proportion of lung cancer cases by gender and smoking behaviour. These visual tools revealed that a higher percentage of males and active smokers were categorized as having lung cancer, aligning with known clinical findings.

Through this exploratory phase, the most informative features were identified and potential multicollinearity issues were checked. The resulting understanding became pivotal to developing a reliable and interpretable classification model in subsequent stages.

Fig 1: Pie chart of Chronic Disease vs Yellow Fingers combination

Model Evaluation

After the training phase, evaluation techniques were applied to determine how well the model identifies lung cancer risk. The evaluation framework included accuracy, F1-score, and ROC-AUC, each of which offers a different perspective on how the algorithm performs in a binary classification setting.

Our logistic classifier exhibited robust predictive capability, achieving an accuracy of approximately 86%. The F1-score, which balances precision and sensitivity (recall), also indicated solid classification performance. Additionally, the ROC-AUC score approached 0.89, showing that the model was effective at distinguishing between positive and negative classes across different threshold values.

To gain further insight, a confusion matrix was generated to break down the predictive results into true positives, true negatives, and their false counterparts. This matrix revealed a high rate of correct predictions, with very few instances misclassified. This held particular significance for minimizing false negatives, which in medical diagnostics could lead to missed cancer detections.
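To make the relationship between the confusion matrix and the reported metrics concrete, the sketch below derives accuracy, precision, recall, and F1-score from the four confusion-matrix counts. The ten labels and predictions are invented for illustration and do not reproduce the paper's test set; the function names are likewise hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 computed from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

print(confusion_counts(y_true, y_pred))
print(classification_metrics(y_true, y_pred))
```

The fn count is the quantity to watch in a screening context, since each false negative is a missed case.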

Furthermore, the Receiver Operating Characteristic (ROC) curve was plotted to visualize the trade-off between sensitivity and specificity. A curve closer to the top-left corner of the plot confirmed that the logistic regression model performed reliably across various classification thresholds.
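As an illustration of what the ROC curve and its area measure, the sketch below builds the curve points by sweeping the decision threshold and computes the AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The six scores are made up for the example, and the function names are hypothetical.

```python
def roc_points(y_true, scores):
    """(false positive rate, true positive rate) pairs, one per threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required to compute AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(roc_points(y_true, scores))
print(round(roc_auc(y_true, scores), 3))
```

The pairwise form is quadratic in the number of records, which is acceptable for a dataset of a few hundred rows.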

To ensure fairness in comparison, two additional models, K-Nearest Neighbours (KNN) and Decision Tree, were also evaluated using the same metrics. While both alternatives produced acceptable results, logistic regression consistently outperformed them when weighing positive predictive value against generalization. This made it the most reliable candidate in this study for estimating the likelihood of lung cancer in its initial stages.

The evaluation results validated the proposed model's ability to provide quick, non-invasive risk assessments based on survey data, with potential applications in preventive health screening and decision support systems.

RESULTS AND DISCUSSIONS

Model Performance Results

The predictive models developed in this study were evaluated on their effectiveness in identifying individuals at risk for lung cancer. The primary model, Logistic Regression, demonstrated stronger performance than the alternative classifiers used.

On the testing dataset, Logistic Regression achieved an accuracy of 86%, indicating that it was able to correctly categorize the majority of the samples. The F1-score, which combines precision and recall, also reached 86%, showing that the model performed well across both positive and negative classes. Additionally, the model recorded a ROC-AUC score of 89%, which signifies a strong ability to distinguish between the two classes regardless of the decision threshold.

To provide a benchmark, the same evaluation was applied to the K-Nearest Neighbours (KNN) and Decision Tree models. The KNN classifier reached an accuracy of 82%, while the Decision Tree achieved 79%. Though these models produced reasonable results, they remained less consistent than the regression model with respect to recall and AUC. The empirical evidence indicates that logistic regression offers a more balanced and generalizable solution for the classification problem in this dataset.

The overall results indicate that a well-tuned, interpretable model like Logistic Regression is capable of delivering reliable outcomes using a compact feature set derived from survey responses. These findings validate the feasibility of using basic symptom and lifestyle data to develop screening tools that support early detection of lung cancer, particularly in settings where access to medical imaging is limited.

The performance metrics suggest that the model is suitable for real-world deployment in preliminary health checkups or decision support systems. Its consistent performance across key indicators highlights its practical utility in medical risk prediction applications.

Fig 2: Model performance results

Feature Importance and Correlation Insights

Understanding which features contribute significantly to the outcome helps improve both model transparency and clinical relevance. In this study, feature relevance was assessed through two approaches: statistical correlation analysis and model-driven importance scores.

A correlation heatmap was generated to explore the relationships between independent variables and the target label. Several features such as COUGHING, WHEEZING, SWALLOWING_DIFFICULTY, and ALCOHOL_CONSUMING showed strong associations with the presence of lung cancer. These findings align with medical observations, where symptoms related to respiratory distress or lifestyle risks are known to be linked to cancer.

To complement this, the Logistic Regression model provided interpretable coefficients for each feature. These coefficients indicate the degree to which each input influences the likelihood of a positive prediction. Higher values suggest a stronger positive association with lung cancer cases, while lower or negative values indicate minimal or inverse impact.
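The way logistic regression coefficients are read can be illustrated with a small from-scratch fit. The sketch below trains a two-feature model by batch gradient descent and converts each coefficient into an odds ratio with exp(coefficient). The eight toy records, the 0/1 recoding of the features, the learning rate, and the epoch count are all assumptions for the example; the fitted values say nothing about the coefficients obtained in the study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, learning_rate=0.5, epochs=5000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n_samples = len(X)
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            z = bias + sum(w * x for w, x in zip(weights, row))
            error = sigmoid(z) - label
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


# Hypothetical records: [COUGHING, PEER_PRESSURE], recoded as 0 = no, 1 = yes.
feature_names = ["COUGHING", "PEER_PRESSURE"]
X = [[1, 0], [1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0], [0, 1]]
y = [1, 1, 1, 0, 0, 0, 0, 1]

weights, bias = fit_logistic(X, y)
for name, weight in zip(feature_names, weights):
    # exp(weight) is the multiplicative change in the odds when the feature is present.
    print(f"{name}: coefficient={weight:+.3f}, odds ratio={math.exp(weight):.2f}")
```

In this toy sample the coughing coefficient is large and positive while the peer-pressure coefficient is close to zero, which mirrors the kind of contrast the paragraph above describes.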

Notably, features such as YELLOW_FINGERS, CHEST_PAIN, and SHORTNESS_OF_BREATH carried considerable weight in the model, implying that they contribute heavily to the decision-making process. Conversely, variables like PEER_PRESSURE or FATIGUE, though included in the dataset, had relatively lower impact.

By combining statistical correlation and model-based interpretation, the analysis provided a clear view of which features are most influential in predicting lung cancer risk. This insight is valuable not only for improving model performance but also for guiding future screening strategies that rely on minimal yet meaningful inputs.

Model Comparison and Discussion

To assess the suitability of different classifiers for predicting lung cancer risk, three models were examined: Logistic Regression, K-Nearest Neighbours (KNN), and Decision Tree. These models were selected for their contrasting characteristics in terms of simplicity, interpretability, and adaptability to small medical datasets.

Logistic Regression consistently delivered superior performance, making it a strong candidate for applications that require clarity and reliability. With an accuracy of 86%, F1-score of 86%, and ROC-AUC of 89%, it demonstrated the ability to produce balanced results across both classes. Its linear nature and coefficient interpretability further reinforce its practicality in health-related decision-making.

The KNN algorithm achieved a slightly lower accuracy of 82%. While KNN is useful in many classification tasks, its performance in this study was slightly hindered, possibly due to its sensitivity to feature scaling and the curse of dimensionality. It also requires more computational resources during the prediction phase, making it less efficient for deployment.
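The sensitivity of KNN to feature scaling can be seen in a few lines. In the sketch below, unscaled Euclidean distance is dominated by AGE, so the nearest neighbour of the query is a record with opposite symptoms; after min-max scaling, the record with matching symptoms becomes the nearest. The four records are invented for illustration, and min-max scaling is one possible choice that the paper does not report using.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max_scale(rows):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def nearest_index(query, rows):
    """Index of the row closest to the query."""
    return min(range(len(rows)), key=lambda i: euclidean(query, rows[i]))


# Hypothetical records: [AGE, SMOKING, COUGHING], with symptoms coded 1 or 2.
records = [
    [70, 2, 2],  # query individual
    [69, 1, 1],  # close in age, opposite symptom profile
    [62, 2, 2],  # same symptom profile, eight years younger
    [40, 1, 2],
]

raw_nearest = nearest_index(records[0], records[1:])
scaled = min_max_scale(records)
scaled_nearest = nearest_index(scaled[0], scaled[1:])

print("nearest without scaling:", records[1:][raw_nearest])
print("nearest with scaling:   ", records[1:][scaled_nearest])
```

Because the neighbour changes with the scaling choice, any KNN baseline on this kind of data should state how AGE was treated.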

The Decision Tree model provided moderate results, with an accuracy of 79%. Although it is valued for its transparency and logic-based structure, the model showed signs of overfitting, particularly when handling smaller subsets of the data. This impacted its generalization ability, especially when applied to unseen records.

Overall, the analysis highlighted Logistic Regression as the most balanced and dependable model for the task at hand. It maintained high scores across all evaluation metrics while offering transparency in feature influence, a critical aspect in medical applications. The findings suggest that simpler, well-optimized models can perform competitively without the complexity of deeper algorithms, especially when working with structured, symptom-based datasets.

Real-World Implications

The use of predictive machine learning models in medical prediction tasks offers practical benefits beyond theoretical accuracy. In the context of this research, the predictive model developed for identifying high-risk lung cancer cases among smokers presents meaningful applications in primary care and early screening programs.

Since the model relies on structured survey inputs, such as symptoms, behavioural habits, and demographic data, it can be deployed in non-clinical environments without the need for specialized equipment or costly diagnostic tools. This makes it useful in underserved communities where access to radiological imaging or advanced lab facilities is limited.

Healthcare workers or field practitioners can use such models as part of preliminary health assessments. For example, a basic digital interface powered by the model could guide practitioners in identifying individuals who should be referred for further testing, such as imaging or biopsy. This approach allows for more efficient allocation of medical resources by focusing attention on high-risk cases.
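A referral aid of the kind described above could, in its simplest form, turn survey answers into a probability and compare it with a threshold. The sketch below shows that logic only. Every number in it (the weights, the bias, and the 0.5 threshold) is a hypothetical placeholder chosen for the example, not a value learned or reported in this study, and the function names are likewise invented.

```python
import math

# HYPOTHETICAL coefficients for illustration only. They are not the values
# learned in the paper and would have to come from a trained, validated model.
HYPOTHETICAL_WEIGHTS = {
    "COUGHING": 1.2,
    "SHORTNESS_OF_BREATH": 0.9,
    "YELLOW_FINGERS": 0.8,
    "CHEST_PAIN": 0.7,
}
HYPOTHETICAL_BIAS = -2.0

# Assumed cut-off; a screening tool might lower it to reduce false negatives.
REFERRAL_THRESHOLD = 0.5


def risk_probability(answers):
    """Logistic-model probability from answers coded 0 (no) or 1 (yes)."""
    z = HYPOTHETICAL_BIAS + sum(
        weight * answers.get(name, 0) for name, weight in HYPOTHETICAL_WEIGHTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def should_refer(answers, threshold=REFERRAL_THRESHOLD):
    """True when the estimated risk reaches the referral threshold."""
    return risk_probability(answers) >= threshold


symptomatic = {"COUGHING": 1, "SHORTNESS_OF_BREATH": 1, "YELLOW_FINGERS": 1, "CHEST_PAIN": 1}
asymptomatic = {"COUGHING": 0, "SHORTNESS_OF_BREATH": 0, "YELLOW_FINGERS": 0, "CHEST_PAIN": 0}

for label, answers in [("symptomatic", symptomatic), ("asymptomatic", asymptomatic)]:
    print(label, round(risk_probability(answers), 3), should_refer(answers))
```

The output of such a tool is a prompt for further testing, not a diagnosis, and the threshold would need to be set together with clinicians.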

Furthermore, the model's interpretability supports informed decision-making. Because features contributing to risk are clearly identifiable, healthcare providers can explain predictions to patients in understandable terms, building trust in technology-assisted recommendations.

This approach aligns with broader public health goals by enabling low-cost, scalable solutions for early detection. Integrating this model into mobile apps, community health platforms, or electronic health record systems can extend its impact, helping reduce diagnostic delays and improving the chances of early intervention.

The current approach focuses solely on structured data and does not incorporate advanced techniques such as ensemble learning or deep learning, which could offer additional performance improvements. However, this trade-off was intentional to keep the model lightweight and interpretable.

Finally, the system has not yet been validated in a real-world clinical setting. Until it is tested in collaboration with medical professionals or integrated into healthcare platforms, its practical effectiveness and trustworthiness remain theoretical.

Despite these limitations, the project serves as a meaningful foundation for developing accessible and affordable screening tools in public health.

CONCLUSION

This study presents a predictive framework designed to quantify the likelihood of lung cancer in individuals with a history of smoking, using machine learning techniques. By focusing on accessible survey data rather than expensive or invasive medical procedures, the study demonstrates how computational models can support early-stage health screening in resource-constrained settings.

Logistic Regression was identified as the most efficient classifier in this case, offering a good balance between performance and interpretability. It outperformed other approaches such as K-Nearest Neighbours and Decision Tree in key evaluation measures, notably accuracy, F1-score, and ROC-AUC. Leveraging interpretable features like coughing, yellow-stained fingers, shortness of breath, and wheezing added clinical relevance to the model's predictions.

Through careful data preparation, exploratory analysis, and validation, the model showed its potential to serve as a practical decision-support tool for preliminary lung cancer screening. The framework is lightweight, easy to implement, and adaptable to digital health platforms, making it suitable for community outreach programs, mobile screening units, or integration into electronic medical systems.

While the study highlights encouraging results, it also acknowledges the need for broader clinical validation and more comprehensive datasets. Nevertheless, this work lays the foundation for building scalable, low-cost diagnostic tools that could improve health outcomes by facilitating earlier detection and intervention.

Fig 3: Real-World Implications

Limitations

Although the proposed model achieved promising results in predicting lung cancer among smokers, there are several areas that present constraints and opportunities for further enhancement.

One major limitation is the size and scope of the dataset used. With just over 300 records, the data may not adequately represent the full range of demographic or clinical variability present in diverse groups of individuals. As a result, the model's ability to generalize to broader or unseen data remains limited. A more diverse and larger dataset would be beneficial for improving robustness.

Another concern is the nature of the features, many of which are self-reported and based on subjective experiences, such as fatigue, anxiety, or perceived symptoms. These kinds of inputs can vary significantly from person to person and may introduce noise or bias into the predictions. Including medically verified or sensor-based data could strengthen the model's reliability.