International Academic Platform
Serving Researchers Since 2012

Gene Expression Based Cancer Detection using Biological Attention Hybrid Framework

DOI : https://doi.org/10.5281/zenodo.19950166

Dr. Kakoli Banerjee

Computer Science and Engineering JSS Academy Of Technical Education Noida, India

Sakshi Srivastava

Computer Science and Engineering JSS Academy Of Technical Education Noida, India

Anushka Kandwal

Computer Science and Engineering JSS Academy Of Technical Education Noida, India

Khushi Sharma

Computer Science and Engineering JSS Academy Of Technical Education Noida, India

Abstract – Early and accurate detection of cancer from gene expression data is an essential problem in bioinformatics research and development. Gene expression data are high-dimensional and noisy, which poses several challenges for existing machine-learning classification methods. In this paper, we propose the Biological Attention Hybrid (BAH) model, a dual-path framework that combines attention-based feature selection with both classical machine learning and deep learning. The model first uses an attention mechanism to focus on the most important genes and then processes the data along two paths: one applies PCA followed by an SVM, while the other uses a CNN-LSTM to capture both short- and long-range patterns in the data. Finally, a fusion layer combines the outputs of both paths to improve overall performance. We evaluated the model on the TcgaTargetGtex dataset, where it achieved high accuracy with consistent cross-validation results. The model outperforms traditional methods and also highlights important genes, so this approach can support more accurate and reliable cancer detection systems in the future.

Keywords – gene expression, cancer classification, attention-based feature selection, PCA, SVM, CNN-LSTM, ensemble learning, dimensionality reduction, machine learning, bioinformatics.

  1. Introduction

    According to the World Health Organization, cancer remains one of the leading causes of death globally, responsible for almost one out of every six fatalities (WHO, 2024). Improving survival rates, enabling targeted therapies, and lowering healthcare costs all depend on early and accurate detection of cancer. The rapid growth of genomic technologies in recent years has made it possible for researchers to examine gene expression profiles. These profiles capture the activity of thousands of genes at once and offer important insights into the molecular processes behind cancer development. Gene expression data are high-dimensional, comprising only a small number of samples but tens of thousands of features, so classical classification algorithms suffer from overfitting, noise, and redundancy. Dimensionality reduction is therefore crucial for retaining the most informative features while removing irrelevant genes.

    Among dimensionality reduction techniques, PCA has proven to be one of the most effective unsupervised methods. PCA projects high-dimensional data onto a reduced set of components that maximize the variance of the original dataset. The Support Vector Machine (SVM) is among the most widely used supervised learning algorithms for classification in bioinformatics. SVMs handle non-linear, high-dimensional datasets by constructing optimal decision boundaries between classes. Combining PCA with an SVM classifier improves noise reduction and generalization, which leads to higher accuracy and computational efficiency.

    Earlier approaches to cancer classification from gene expression have mostly followed two methodologies: (i) applying PCA with SVM, and (ii) deep learning models. While the PCA-SVM approach is efficient, deep learning models are better at discovering complex non-linear gene patterns. Neither approach alone, however, fully captures the biological structure of gene data.

    An important but under-explored challenge in this domain is identifying biologically important genes among the many genes in a dataset. Genes do not contribute equally to cancer classification; some have far more impact than others. Identifying the important ones can increase both the performance and the reliability of a model. An attention mechanism can find such genes by learning gene-importance weights during training.

    PCA-SVM based frameworks have shown strong performance, but several essential limitations remain. Gene relevance for the classification task relies on unsupervised dimensionality reduction, which does not explicitly account for relevance to the class labels. As a result, biologically important genes may not be given the desired importance.

    PCA transforms the data into linear combinations of genes, which limits interpretability at the gene level. Existing algorithms also fail to capture the extensive non-linear interactions and long-range dependencies present in gene expression data.

    Deep learning methodologies address some of these challenges by modelling non-linear relationships, but they often overfit because of the small sample sizes and high dimensionality of gene data. Moreover, most existing methods treat machine learning and deep learning as separate models and do not combine them.

    There is therefore a need for a new framework that (i) finds biologically important genes, (ii) reduces the dimensionality of the data, (iii) captures patterns in gene data, and (iv) combines multiple learning methods into one.

    For gene expression based cancer detection, this paper introduces a novel framework called the Biological Attention Hybrid (BAH) model, which combines an attention mechanism with both machine learning and deep learning approaches.

    The model consists of three layers:

    1. Attention layer which assigns importance weights to genes using attention mechanism.

    2. A dual path layer where one path is PCA followed by an SVM and the second is CNN-LSTM to capture both short and long range patterns in gene expression.

    3. A fusion layer which combines the outputs from both paths.

    This study improves on older methods (such as PCA and SVM) by adding newer techniques such as an attention mechanism and deep learning models. Because of this, the model becomes more accurate and easier to interpret: it can show which genes matter for cancer detection and give the results real medical meaning.

  2. Related Work/ Literature Review

    1. Foundational and survey studies

      Prior research showed that high-dimensional gene expression data can identify disease categories and support predictive models, thereby establishing the experimental framework of feature selection combined with classification for molecular diagnostics. Detailed reviews of computational learning for gene expression data and of support vector machines in medicine cover methodologies and practical considerations, including feature selection, normalization, and cross-validation. These surveys highlight that SVM remains a reliable option for small-sample, high-dimensional problems, particularly when combined with careful preprocessing and dimensionality reduction.

    2. PCA as a fundamental tool for dimensionality reduction

      PCA is extensively utilized in gene expression research for noise reduction, visualization, and as a preprocessing step prior to classification. Several recent papers and reviews confirm that PCA is computationally efficient, scales well to genome-scale datasets, and often outperforms or complements more complex hidden-variable methods for large transcriptomic datasets. Notably, Zhou et al. (2022) showed PCA's practical advantage in large genomics analyses, arguing that it is fast and effective for removing major confounders and summarizing variance.

    3. SVM: strengths and modern variants

      Because of their margin-based generalization and their capacity to operate in extremely high dimensions, Support Vector Machines (SVMs) have been repeatedly validated for microarray and RNA-seq classification tasks. Numerous SVM variations and kernel strategies (kernel selection, scalable solvers, ensemble SVMs) have been adapted for biomedical applications, according to recent methodological reviews. For instance, current summaries of SVM applications in medicine give practical advice on handling class imbalance and hyperparameter tuning, both of which are crucial in tumor-versus-normal tasks.

    4. Empirical PCA+SVM papers and case studies

      Several applied studies from 2018 to 2025 have implemented PCA in combination with SVM for cancer classification using microarray and RNA-seq datasets. These works evaluate different preprocessing strategies, dimensionality reduction pipelines, and classifier optimizations. A concise summary of notable recent studies is presented below.

      • 2024 S. Al Azani et al.

        Dataset: RNA-seq. (TCGA)

        Method: Data filtering – PCA – machine-learning models including SVM

        Performance: Addresses the curse of dimensionality and demonstrates that PCA combined with SVM achieves strong accuracy on benchmark datasets.

      • 2024 R. Van et al.

        Dataset: Multiple RNA-seq. datasets

        Method: Comparison of preprocessing pipelines followed by machine-learning classification

        Performance: Shows that normalization and preprocessing choices significantly influence PCA outputs and SVM performance.

      • 2024 A. Razzaque et al.

        Dataset: Leukemia, colon, and prostate microarray datasets

        Method: PCA – Particle Swarm Optimization (PSO) for component selection – SVM

        Performance: PCA reduces computational time, and PSO helps optimize principal-component selection. The combined PCA-PSO-SVM framework achieves strong classification accuracy.

      • 2023 F. Alharbi et al.

        Dataset: Survey of multiple gene-expression studies

        Method: Machine learning approaches for gene-expression classification

        Performance: Identifies SVM combined with dimensionality reduction or feature selection techniques as a consistently strong baseline for cancer classification.

      • 2022 H. J. Zhou et al.

        Dataset: Genomics datasets

        Method: Evaluation of PCA versus hidden-variable inference methods.

        Performance: Shows that PCA is efficient and often outperforms more complex hidden-variable methods.

    5. Comparative insights and methodological lessons

      Many conclusions can be drawn from the recent literature:

      1. Batch correction, low-variance gene filtering, and RNA-seq normalization (TPM/RPKM/log transforms) have a significant impact on the PCA components and on the performance of the resulting SVM. Studies comparing RNA-seq preprocessing pipelines highlight this sensitivity and advise clear disclosure of preprocessing decisions.

      2. Using PCA to compress the feature space before SVM reduces computational cost and mitigates overfitting. Several empirical studies report faster training and improved or at least competitive accuracy. PCA frequently eliminates technical variation and noise from very high dimensional data that would otherwise confound classifiers.

      3. The Trade-off between interpretability and variance capture. Since PCA components are linear combinations of numerous genes, biological interpretation of them may be challenging because gene level contributions can be ambiguous. Therefore, in order to maintain interpretability, a number of papers combine PCA with supervised feature selection or post hoc analysis of loadings.

      4. Hybrid and optimized pipelines outperform naïve baselines. In general, works that optimize component selection via heuristic search (PSO, genetic algorithms), combine PCA with wrapper methods (e.g., SVM-RFE), or tune the number of principal components report better classification than PCA-SVM with an arbitrary k.

      5. Emerging trend: deep methods and multi-omics, but SVM is still a solid foundation. Although stacked/ensemble models, deep learning, and multi-omics integration are becoming more popular, numerous reviews caution that traditional pipelines like PCA+SVM remain competitive, simpler to interpret, and less resource-hungry for small sample sizes.

    6. Limitations and gaps in the literature

      • Heterogeneous reporting: A direct comparison of reported accuracies is unreliable because many studies do not report the same preprocessing procedures, cross-validation techniques, or sample counts.

      • Limited biological interpretability: Without further analysis (loadings, enrichment), it is difficult to link PCA components to particular biological pathways. After PCA, few studies systematically incorporate pathway analysis.

      • Lack of standardized benchmarks: In contrast to certain machine-learning domains, genomics has no widely accepted benchmark splits; numerous studies employ disparate cancer types and datasets, which makes meta-analysis challenging.

      • Underexplored supervised dimensionality reduction: Although unsupervised PCA is widely used, supervised techniques (PLS, LDA, supervised-PCA variants) and kernel-PCA variants should be more carefully compared with PCA+SVM for RNA-seq data.

    7. Takeaways for this review

    We draw the conclusion from the reviewed literature that PCA + SVM remains a viable, competitive and interpretable baseline for gene expression-based cancer classification, particularly when sample sizes and computational resources are constrained. Standardized reporting for reproducibility, thorough preprocessing, methodical hyperparameter tuning, and measures to enhance biological interpretability (gene loading analysis, pathway enrichment of components) should be the focus of future research.

  3. Methodological Framework

  1. Problem Formulation

    Let X ∈ ℝ^(N×G) be a gene expression matrix where:

    • N = number of samples

    • G = number of genes

      xi ∈ ℝ^G denotes the gene expression profile of patient i, and yi ∈ {0, 1} denotes the class label (0 = Normal, 1 = Cancer).

      The aim is to learn a classifier

      f : ℝ^G → {0, 1}

      that minimizes the classification error while identifying biologically important genes.

  2. Workflow Overview

    The general workflow for gene expression-based cancer detection using PCA and SVM involves several stages:

    1. Data Acquisition:

      Gene expression datasets are collected from public repositories such as the Gene Expression Omnibus (GEO) or The Cancer Genome Atlas (TCGA).

      These datasets typically contain samples from both cancerous and non-cancerous tissues.

    2. Data Preprocessing:

      Before applying PCA, raw gene expression data undergo:

      • Normalization

      • Handling missing values

      • Noise reduction

      • Feature scaling (important for PCA)

      • Low-expression genes are removed

      • Top high-variance genes are selected

    3. Stage 1: Attention-Based Feature Weighting

      Unlike traditional PCA-SVM pipelines, the BAH model introduces an attention mechanism to identify important genes.

      Each gene j is assigned a weight:

      Aj = softmax(zj / τ) × n_genes

      where zj is the learned score for gene j and τ is a temperature parameter. The attention-weighted gene vector is:

      x̃i = xi ⊙ A

      This step:

      • gives priority to biologically important genes

      • reduces noise and irrelevant features
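As an illustration, the Stage 1 weighting can be sketched in a few lines of NumPy. This is a toy sketch: the scores z and the temperature τ here are made-up values, whereas in the BAH model z is learned by the attention network during training.

```python
import numpy as np

def attention_weights(z, tau=1.0):
    """A_j = softmax(z_j / tau) * n_genes: temperature-scaled softmax
    over raw gene scores, rescaled so the average weight is 1."""
    z = np.asarray(z, dtype=float)
    e = np.exp((z - z.max()) / tau)   # numerically stable softmax
    a = e / e.sum()
    return a * z.size

z = np.array([0.1, 2.0, -1.0, 0.5])   # toy per-gene scores (normally learned)
A = attention_weights(z, tau=0.5)

x = np.ones(4)                        # toy expression vector x_i
x_tilde = x * A                       # element-wise re-weighting: x̃_i = x_i ⊙ A
```

Because the weights sum to n_genes, genes with above-average scores are amplified and the rest are suppressed, which is how the mechanism de-emphasizes irrelevant features.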

    4. Stage 2: Dual-Path Feature Learning

      After attention weighting, the data flows through two parallel paths.

      Path 1: PCA + SVM

      Dimensionality reduction using PCA:

      Z = X̃W

      where:

      • X̃ = attention-weighted gene matrix

      • W = matrix of principal-component eigenvectors

      Classification using SVM:

      f(x) = sign( Σi αi yi K(xi, x) + b )

      This path:

      • reduces dimensionality

      • provides a compact, uniform representation

      Path 2: CNN-LSTM (New Addition)

      The attention-weighted gene sequence is treated as a 1-D signal:

      • CNN layers detect local gene interactions

      • An LSTM captures long-range dependencies

      This allows the model to learn:

      • non-linear relationships

      • sequential genomic patterns

    5. Stage 3: Fusion via Meta-Learning

      Outputs from both paths are combined using a stacking approach:

      F = [pSVM, pCNN]

      Final prediction:

      ŷ = σ(w1 · pSVM + w2 · pCNN + b)

      This step:

      • integrates complementary information

      • improves overall accuracy

      • reduces model bias

    6. Model Evaluation:

      Performance is assessed using metrics such as:

      • Accuracy: overall correctness of the classifier,

        Accuracy = (TP + TN) / (TP + TN + FP + FN)

      • Precision: proportion of positive (cancer) predictions that are correct,

        Precision = TP / (TP + FP)

      • Recall (Sensitivity): proportion of actual positives (cancer samples) correctly identified,

        Recall = TP / (TP + FN)

      • F1 Score: harmonic mean of precision and recall,

        F1 = 2 · (Precision · Recall) / (Precision + Recall)

    7. Interpretation & Biological Validation:

      Unlike traditional PCA-based approaches:

      • Attention weights directly highlight important genes

      • Top-weighted genes can be analysed for biological relevance

        This improves:

      • interpretability

      • clinical usefulness
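The four evaluation metrics above follow directly from confusion-matrix counts; a minimal sketch with toy counts (the numbers are illustrative, not results from the paper):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from
    confusion-matrix counts, as defined in the formulas above."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Toy counts: 90 detected cancers, 5 missed, 8 false alarms, 97 true normals
acc, prec, rec, f1 = classification_metrics(tp=90, tn=97, fp=8, fn=5)
```

Note that with imbalanced classes (common in TCGA-style cohorts), accuracy alone is misleading; precision, recall, and F1 are the more informative quantities.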

  4. Implementation

      A. Data Acquisition

        1. The gene expression dataset TcgaTargetGtex_rsem_gene_tpm.gz was stored on Google Drive.

        2. It was accessed through Google Colab using the mounted drive.

        3. To maintain computational feasibility while preserving biological representation:

          • 11,000 genes (features) and

          • 5,758 labeled samples (after filtering) were used.

    B. Data Preparation

      1. The original column containing gene IDs was renamed to gene-id.

      2. The dataset was transposed, resulting in:

        • Samples → rows

        • Genes → columns

      3. The transposed index was reset and labeled as Sample.

      4. Class labels were assigned using sample naming conventions:

        • Samples starting with GTEX → Normal

        • All others → Cancer

      5. The final dataset included:

        • Sample (ID),

        • gene expression features,

        • Label (Cancer/Normal).

    C. Label Encoding

      1. Class labels (Cancer, Normal) were converted into numeric form using a label encoder.

      2. This enabled training with machine learning and deep learning models, which require numeric input.

    D. Feature Engineering and Preprocessing

      1. Gene expression values were standardised using a standard scaler: mean = 0, standard deviation = 1.

      2. Low-expression genes were removed (genes with negligible signal in >90% of samples).

      3. The top 3,000 genes were selected by variance from the filtered genes.

      4. Data split: 80% training, 20% testing.
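The preprocessing steps above can be sketched with scikit-learn on toy data; the matrix sizes and the top-k cutoff here are illustrative stand-ins, not the paper's actual values (which use 3,000 genes and an 80/20 split):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))    # toy matrix: 200 samples x 500 "genes"
y = rng.integers(0, 2, size=200)   # toy Cancer/Normal labels

# Keep the top-k highest-variance genes (the paper keeps the top 3,000)
k = 100
top = np.argsort(X.var(axis=0))[-k:]
X = X[:, top]

# Standardise each gene to mean 0, std 1, then an 80/20 stratified split
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```

In practice the scaler should be fit on the training split only and applied to the test split, to avoid information leaking from test to train; the one-shot version above is a simplification.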

    E. Attention-Based Feature Weighting

      1. An attention mechanism was used to learn importance weights for each gene.

      2. The attention network includes:

        • Dense layer (512 units)

        • Dropout (0.2–0.3)

        • Output layer (3,000 weights)

      3. Softmax normalization with temperature scaling was used to produce the attention scores.

      4. The weighted gene matrix was formed by element-wise multiplication:

        x̃ = x ⊙ A

      5. Hyperparameter tuning was performed over 27 configurations, and the best model was selected based on validation accuracy.

    F. Dimensionality Reduction with PCA (Path 1)

      1. PCA was applied to the attention-weighted gene matrix.

      2. Components were selected so as to preserve 90% of the total variance.

      3. Eight principal components were required, reducing the dimensionality from 3,000 to 8.

      4. PCA yielded a small, low-noise representation of the dataset.

    G. Model Training Using SVM (Path 1)

      1. An SVM classifier was implemented with:

        • Kernel: Radial Basis Function (RBF)

        • C = 1

        • gamma = scale

      2. The model was trained on the PCA-reduced features from the training set.
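Path 1 can be sketched as a scikit-learn pipeline. The synthetic data below stands in for the attention-weighted gene matrix; the 90% variance threshold, RBF kernel, C = 1, and gamma = 'scale' follow the description above:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-in for the attention-weighted gene matrix
X, y = make_classification(n_samples=300, n_features=200,
                           n_informative=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# PCA with a float n_components keeps just enough components for 90% variance;
# probability=True so the SVM can later feed probabilities to a fusion layer.
clf = make_pipeline(
    PCA(n_components=0.90),
    SVC(kernel="rbf", C=1.0, gamma="scale", probability=True),
)
clf.fit(X_tr, y_tr)
score = clf.score(X_te, y_te)
```

Wrapping PCA and the SVM in one pipeline ensures the PCA projection is fit only on training data, which matters for honest cross-validation.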

    H. CNN-LSTM Model (Path 2)

      1. The attention-weighted gene vectors were reshaped to sequence format: (samples, 3000, 1).

      2. CNN layers captured local gene-interaction patterns:

        • Conv1D (64 filters) → BatchNorm → MaxPool

        • Conv1D (32 filters) → BatchNorm → MaxPool

      3. A Bidirectional LSTM captured long-range dependencies in gene expression.

      4. A final dense layer produced classification probabilities.

      5. Training used:

    • Optimizer: Adam

    • Loss: Binary cross-entropy

    • Early stopping to prevent overfitting
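A plausible Keras sketch of the architecture described above. The filter counts (64 and 32), optimizer, and loss follow the text; the kernel sizes, pooling factors, and LSTM width are assumptions, since the paper does not state them:

```python
from tensorflow import keras
from tensorflow.keras import layers

n_genes = 3000  # each sample is treated as a 1-D "sequence" of genes

model = keras.Sequential([
    keras.Input(shape=(n_genes, 1)),
    # Conv blocks capture local gene-interaction patterns
    layers.Conv1D(64, 7, padding="same", activation="relu"),  # kernel size 7 assumed
    layers.BatchNormalization(),
    layers.MaxPooling1D(4),                                   # pool factor assumed
    layers.Conv1D(32, 7, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling1D(4),
    # Bidirectional LSTM captures long-range dependencies
    layers.Bidirectional(layers.LSTM(32)),                    # LSTM width assumed
    # Sigmoid output: probability of Cancer
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

Training would then use `model.fit(..., callbacks=[keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)])` to implement the early stopping mentioned above.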

    I. Fusion via Meta-Learning

    1. Predictions from the SVM and CNN-LSTM were combined using stacking.

    2. Out-of-fold cross-validation predictions were used to avoid data leakage.

    3. A logistic regression model was used as the meta-learner.

    4. The final prediction was obtained as:

    ŷ = σ(w1 · pSVM + w2 · pCNN + b)
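The stacking procedure can be sketched with scikit-learn. Here a logistic regression stands in for the CNN-LSTM path (only its out-of-fold probabilities matter to the meta-learner), so this is a structural sketch rather than the paper's exact models:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Base models; cross_val_predict gives OUT-OF-FOLD probabilities on the
# training set, so the meta-learner never sees predictions made on data
# the base model was fit on (this is what prevents leakage).
svm = SVC(probability=True, random_state=0)
second = LogisticRegression(max_iter=1000)  # stand-in for the CNN-LSTM path
p1 = cross_val_predict(svm, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
p2 = cross_val_predict(second, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]

# Meta-learner implements y_hat = sigmoid(w1*p_svm + w2*p_cnn + b)
meta = LogisticRegression().fit(np.column_stack([p1, p2]), y_tr)

# At test time the base models are refit on all training data
svm.fit(X_tr, y_tr)
second.fit(X_tr, y_tr)
F_te = np.column_stack([svm.predict_proba(X_te)[:, 1],
                        second.predict_proba(X_te)[:, 1]])
acc = meta.score(F_te, y_te)
```

The learned coefficients of `meta` play the role of w1 and w2 in the fusion equation, weighting each path by how reliable its probabilities are.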

    J. Visualization

    1. PCA Variance Plot (Displays cumulative explained variance)

    2. Confusion Matrix (Displays classification performance)

    3. ROC Curve (Analyses model discrimination ability)

    4. Attention Weight Distribution (Highlights important genes)

    K. Model Evaluation

    Performance was evaluated using:

    • Accuracy = (TP + TN) / (TP + TN + FP + FN)

    • Precision = TP / (TP + FP)

    • Recall = TP / (TP + FN)

    • F1 Score = 2 · (Precision · Recall) / (Precision + Recall)

    • AUC-ROC

  3. Typical Workflow Diagram

    A flow of the process is shown in the workflow figure (not reproduced in this text-only version).

L. Why BAH Works Better

    Challenge | Attention Role | PCA + SVM Role | CNN-LSTM Role | Combined Effect
    High-dimensional data | Selects important genes | Reduces dimensionality | Learns patterns | Efficient learning
    Noisy features | Suppresses noise | Removes redundancy | Learns robust features | Better generalization
    Small sample size | Focuses on key genes | Works well with low data | Regularized learning | Reduced overfitting
    Non-linear relations | Highlights gene importance | Linear compression | Captures non-linearity | Improved accuracy

  1. Challenges

    The Biological Attention Hybrid (BAH) framework achieves strong results in cancer detection on this dataset, but several technical and practical challenges limit its full potential.

    1. High-Dimensional and Noisy Data:

      • The datasets contain thousands of genes but only hundreds of samples, which can cause overfitting and poor generalization.

      • PCA reduces dimensionality, but it may discard biologically important variation if it is not tuned accordingly.

    2. RAM Crash on Full Dataset:

      Loading all 30,000 genes and 11,000 samples crashes Google Colab.

    3. High Sensitivity to Preprocessing:

      Minimal changes in normalization, batch creation, or filtering can affect the PCA components and SVM performance, making results unstable.

    4. Imbalanced Datasets:

    • In many TCGA datasets, the number of cancerous samples is far higher than the number of non-cancerous ones.

    • This imbalance biases the classifier toward the more frequently occurring classes.
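One common mitigation for the imbalance issue is cost-sensitive training; the sketch below uses scikit-learn's class_weight='balanced', which reweights errors inversely to class frequency. This is an illustrative remedy, not something the paper itself applies:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy 9:1 imbalanced data, mimicking cancer-heavy TCGA-style cohorts
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# class_weight='balanced' scales the penalty C for each class by
# n_samples / (n_classes * count(class)), so mistakes on the rare
# class cost proportionally more during training.
clf = SVC(kernel="rbf", class_weight="balanced").fit(X, y)
```

Alternatives include stratified resampling (e.g. SMOTE) or threshold tuning on the fused probability ŷ.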

  5. Conclusion

    The analysis in this paper demonstrates the importance of machine learning in gene expression-based analysis for early cancer detection. Conventional methodologies such as PCA combined with SVM have already shown good performance for binary as well as multiclass classification, but they still have limitations when it comes to finding complex biological patterns and understanding which genes actually matter.

    In this work, we proposed the Biological Attention Hybrid (BAH) model, which enhances the conventional PCA-SVM approach by adding an attention mechanism and a dual-path structure: one path uses PCA and SVM, while the other uses a CNN-LSTM model. This combination helps the model handle high-dimensional, noisy gene expression data more effectively and efficiently.

    We tested our model on the TcgaTargetGtex dataset, which contains 5,758 samples and 11,000 genes. The model achieved high accuracy with consistent performance under cross-validation, showing that combining attention with hybrid learning models can outperform conventional PCA-SVM methodologies.

    With the help of the attention mechanism, we identified 217 important genes, which are promising candidates for further biological analysis.

    However, some challenges remain, such as class imbalance, computational cost, and validating the identified genes. In the future, this work can be extended to multi-class cancer classification and to explainable-AI techniques for more trustworthy clinical use.

    The BAH model shows that combining attention, machine learning, and deep learning in a single framework can deliver both high accuracy and interpretability, making it a strong step toward more accurate and reliable cancer diagnosis systems in medicine.

  6. References

  1. T. R. Golub, D. K. Slonim, P. Tamayo, et al., "Molecular classification of cancer: class discovery and class prediction by gene expression monitoring," *Science*, vol. 286, no. 5439, pp. 531–537, 1999.

  2. F. Alharbi and A. Vakanski, "Machine learning methods for cancer classification using gene expression data: A review," *Bioengineering (Basel)*, vol. 10, no. 2, Art. 173, Jan. 2023. doi: 10.3390/bioengineering10020173

  3. S. Al Azani, O. S. Alkhnbashi, E. Ramadan and M. Alfarraj, "Gene expression-based cancer classification for handling the class imbalance problem and curse of dimensionality," *International Journal of Molecular Sciences*, vol. 25, no. 4, Art. 2102, Feb. 2024. doi: 10.3390/ijms25042102

  4. Mani and H. Rajaguru, "A framework for performance enhancement of classifiers in detection of prostate cancer from microarray gene expression tasks," *Heliyon*, vol. 10, no. 9, e29630, 2024. doi: 10.1016/j.heliyon.2024.e29630

  5. N. Tabassum, S. Islam, S. Rizwan, M. Sobhan, T. A. Chowdhury and S. Ahmed, "Cancer classification from gene expression using ensemble learning and dimensionality reduction," *Genes*, vol. 15, no. 2, Art. 405, 2024. doi: 10.3390/genes15020405

  6. E. Elhaik, "Principal component analyses (PCA) based findings in genetics & genomics," *Scientific Reports*, vol. 12, Art. 2369, 2022. doi: 10.1038/s41598-022-14395-4

  7. M. Greenacre, P. J. F. Groenen and T. Hastie, "Principal component analysis," *Nature Reviews Methods Primers*, vol. 2, Art. 184, 2022. doi: 10.1038/s43586-022-00184-w

  8. E. H. Houssein, Z. Abohashima, M. Elhoseny and W. M. Mohamed, "An efficient binary Harris Hawks Optimization based on quantum SVM for cancer classification tasks," arXiv:2202.11899 [cs.LG], Feb. 2022.

  9. J. Brown, et al., "Support vector machine classification and validation of cancer tissue microarray gene expression data," *Bioinformatics*, vol. 16, no. 10, pp. 906–914, 2000.

  10. A. Vaswani et al., "Attention is all you need," in *Advances in Neural Information Processing Systems (NeurIPS)*, 2017, pp. 5998–6008.

  11. S. Hochreiter and J. Schmidhuber, "Long short-term memory," *Neural Computation*, vol. 9, no. 8, pp. 1735–1780, 1997.