Design and Evaluation of a Hybrid CNN-Vision Transformer Architecture with Intrinsic and Concept-Based Explainability for Breast Cancer Diagnosis

Opeyemi Victor Omolade; John Olalere Ogunlola; Babatunde Alexander Abiola; Adedeji Edward Adesola

doi:10.17577/IJERTCONV14IS060161

ACSCON - 2026 (Volume 14 - Issue 06)

Design and Evaluation of a Hybrid CNN-Vision Transformer Architecture with Intrinsic and Concept-Based Explainability for Breast Cancer Diagnosis

DOI : 10.17577/IJERTCONV14IS060161

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 26
Authors : Opeyemi Victor Omolade, John Olalere Ogunlola, Babatunde Alexander Abiola, Adedeji Edward Adesola, Afeez Oluwaseun Akande, Oluwakemisola Adewole
Paper ID : IJERTCONV14IS060161
Volume & Issue : Volume 14, Issue 06, ACSCON – 2026
Published (First Online) : 15-06-2026
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

Design and Evaluation of a Hybrid CNN-Vision Transformer Architecture with Intrinsic and Concept-Based Explainability for Breast Cancer Diagnosis

1st Opeyemi Victor Omolade Research and Doctoral College University of Greater Manchester Manchester, United Kingdom ovo1res@bolton.ac.uk

ORCID: 0009-0007-6034-0778

2nd John Olalere Ogunlola Research and Doctoral College University of Greater Manchester Bolton, United Kingdom jo3res@bolton.ac.uk

ORCID: 0009-0001-1546-1526

3rd Babatunde Alexander Abiola Research And Doctoral College University Of Greater Manchester Bolton, United Kingdom babatunde.alexander@gmail.com ORCID: 0009-0004-8193-440X

4th Adedeji Edward Adesola Research and Doctoral College University of Greater Manchester Bolton, United Kingdom adedeji198@ieee.org

ORCID: 0009-0001-8666-8292

5th Afeez Oluwaseun Akande Research and Doctoral College University of Greater Manchester Bolton, United Kingdom afeez4real@ieee.org

ORCID: 0009-0004-7412-4222

6th Oluwakemisola Adewole Research and Doctoral College University of Greater Manchester Bolton, United Kingdom oa7res@bolton.ac.uk

ORCID: 0009-0005-2027-9496

AbstractThis paper presents a hybrid CAD model combining EfcientNet-B0 and Swin Transformer-Tiny for mammographic breast-cancer diagnosis with multi-level explainability using Grad-CAM, attention rollout, and TCAV. Evaluated on the DMID dataset (510 mammograms: 310 lesion-present, 200 normal), the model achieved an accuracy 0.985, precision 0.817, recall 0.933, and F1-score 0.770 on the held-out test set. TCAV concept scores showing concept inuence on malignancy were: malignant indicators 0.42, benign indicators 0.59, opacity 1.00, general abnormality 1.00, and clear/normal 0.01. Images were resized to 384 × 384 and training used Adam at 1 × 104 with a 70/15/15 split. The overall study shows that incorporating multiple levels of explanation into hybrid deep learning models may improve diagnostic clarity and assist in supporting clinician decision making when performing breast cancer screening.

Index TermsBreast cancer diagnosis, Mammography, Hybrid CNNViT, Explainable AI (XAI), TCAV

Introduction

The need to improve early detection and reduce death from breast cancer is a top priority around the world, as it continues to rank as the number one cause of death from all cancers for women [1]. Mammography is still the most widely used method for screening breast cancer, but interpreting mammograms can be difcult due to many factors including small size of lesions, dense breast tissue, and varying levels of radiologist experience [2][3]. Therefore, there is much research focused on creating Articial Intelligence based Computer

Aided Diagnostic (CAD) Systems to help radiologists make decisions about patients.

There is also a lot of research being conducted to use Deep Learning, specically Convolutional Neural Networks (CNNs), to detect malignant features in mammograms because they are capable of identifying ne details in mammograms that are associated with breast cancer such as mass, micro- calcications, and architectural distortion [4][5]. CNNs have limitations that are inherent to their design including the fact that their local nature limits them to capturing relationships within an image and do not allow them to consider larger scale spatial dependencies that may be present in images [6][7].

Vision Transformers (ViTs) have been proposed as a way to capture long range spatial dependencies using self-attention mechanisms to reason globally over images [8][9]. The idea of combining CNNs and ViTs into hybrid architectures where CNNs are used to extract local features and transformers are used to analyze the larger scale features has gained popularity in medical imaging literature [10][11]. This combination has shown to provide better diagnostic accuracy than traditional CNN-only models and provides increased sensitivity to subtle malignancies [12].

While advances in deep learning architectures have greatly increased the quality of CAD systems, one major barrier to clinical use is the lack of explainability of the decisions made by these systems. Clinicians want CAD systems to be able to

explain how they arrived at a particular decision so that the clinicians can understand why the system came to its conclu- sion. Explainable AI (XAI) has therefore become important in medical imaging applications to provide clinicians with the level of interpretability required for them to trust and accept the recommendations of a CAD system. Techniques that have been developed to increase the level of interpretability of a CAD system include visualization techniques like Grad-CAM that show localized areas of importance [13] and techniques like transformer attention maps that show how the different parts of the image contribute to the overall decision of the system [14]. Additional conceptual techniques that have been developed include Testing with Concept Activation Vectors (TCAV) which is a technique that bridges the gap between low-level neural activity and higher-level clinical concepts [15][16].

In this paper, a hybrid CAD system for breast cancer diagnosis using EfcientNet-B0 with Swin Transformer-Tiny, combined with multiple layers of explanation via Grad-CAM, Attention Rollout, and TCAV was proposed. Evaluation was done on the proposed model using the DMID dataset for breast cancer diagnosis and found that the proposed model performs well in terms of prediction, and generates explanations that align with the current understanding of radiologic principles, providing evidence that XAI enabled hybrid architectures represent a new paradigm for safer, more transparent CAD systems [17][18].
LITERATURE REVIEW

While deep learning has greatly enhanced the automated diagnosis of breast cancer using images from mammography, with many CNN-based architectures having proven to be very successful in identifying malignancies in digital mammogra- phy images. However, while early studies based on VGG, ResNet, and EfcientNet architectures were able to improve the classication accuracy, they were unable to identify long- range structural patterns in the breast tissue [3][4][5]. Studies have also identied that CNNs are capable of effectively learn- ing local features, such as masses and calcication, however, they do not learn global patterns well, including architectural distortions and tissue asymmetries [6][7]
Recently Vision Transformers (ViTs), with their self- attention mechanism, have emerged as an attractive alternative to traditional CNNs, enabling them to model global contextual information [8][9]. In addition, the combination of CNN and ViT has been demonstrated to achieve better diagnostic results than each one of these models separately, because CNNs can extract local features and ViTs can reason globally [10][11]. These models have demonstrated potential, particularly in screening scenarios, when the subtle abnormalities may occur in several different regions of the tissue [12].

Explainable Articial Intelligence (XAI) is necessary for medical imaging applications because of the need for trans- parent and clinically interpretable decisions [2]. The most commonly applied gradient-based XAI method, Grad-CAM, has been used to highlight diagnostically relevant features in

mammography images [13]. Additionally, transformer atten- tion maps provide global interpretability [14]. Finally, concept- based techniques, such as TCAV, allow for the evaluation of how high-level clinical concepts affect model predictions, increasing trust and clinical relevance [15][16].

Few studies integrate hybrid architectures that include both intrisic and concept-based explainability, although studies exist to examine CNNs, ViTs, and XAI separately. Therefore, this lack of research motivated the development of the current studys hybrid CNN-ViT model, with multi-level interpretabil- ity, for diagnosing breast cancer.
METHODOLOGY

In this part, the methodological basis for designing, training, and evaluating the proposed hybrid EfcientNet-B0 and Swin Transformer-Tiny model architecture for breast cancer diag- nosis will be explained. The proposed methodology includes steps on preparing the data set, preprocessing the data, design- ing the model, integrating explainability into the model, and determining how well the model evaluates. These methodolo- gies follow the current state-of-the-art methodologies in deep learning and XAI research [4][10]. The conceptual framework is illustrated in Fig. 1.
1. Dataset Description
  
  This study uses the Digital Mammography Dataset for Breast Cancer Diagnosis Research (DMID). This dataset has 510 mammogram cases. Each of these cases contains DICOM and TIFF image formats, pixel-level annotated segmentation masks, and radiologist reports containing BI-RADS scores and narrative descriptions. The dataset contains 310 lesion-present images and 200 normal images. This dataset supports research and educational use and ensures patient anonymity.
2. Data Preprocessing
  
  All images were resized to 384×384 pixels to meet model input requirements. Histogram equalization and denoising (Gaussian lter) were used to improve contrast and reduce artifacts. For the ViT branch, images were split into 16×16 patches, embedded into a sequence of tokens, and passed through the Swin Transformers attention layers. Data aug- mentation process included rotation, ipping, and brightness adjustments so as to increase sample diversity and reduce overtting. Labels for abnormality classes were encoded into three categories: Normal, Benign, and Malignant. The dataset was split into training, validation, and test sets using stratied sampling so as to preserve class balance.
3. Abbreviations and Acronyms
  
  Dene abbreviations and acronyms the rst time they are used in the text, even after they have been dened in the abstract. Abbreviations such as IEEE, SI, MKS, CGS, ac, dc, and rms do not have to be dened. Do not use abbreviations in the title or heads unless they are unavoidable.
  
  Fig. 1. Conceptual Framework Diagram.
4. Hybrid Model Architecture
  
  The proposed hybrid CNNViT model architecture com- bines a convolutional neural network and a vision transformer to leverage the strengths of both approaches for breast cancer diagnosis. In particular, the architecture utilizes EfcientNet- B0 as the CNN backbone and Swin Transformer-Tiny as the ViT component.
  - CNN Backbone EfcientNet-B0: The EfcientNet-B0 network (pre-trained on ImageNet) is employed as a con- volutional feature extractor. EfcientNet-B0 uses com- pound scaling to balance network width, depth, and reso-
    
    lution. Its convolutional layers extract feature maps FCNN representing ne textural detail crucial for identifying lesions. This backbone contributes high-resolution spatial features while keeping model size and computational cost relatively low due to EfcientNet-B0s compound scaling strategy.
  - Transformer Backbone Swin Transformer-Tiny: In par- allel, a Swin Transformer-Tiny (the smallest Swin Trans- former model) processes the same input image. The Swin Transformer applies self-attention within shifted windows, gradually merging information across image patches. The output feature representation FViT captures contextual and global structure.
    
    Let:
    
    FCNN (x) be the CNN feature extractor
    
    FVIT (x) be the ViT encoder
    
    Then the hybrid feature vector is:
    
    h(x)= [FCNN (x), FV IT (x)] (1)
    
    The classier computes the malignancy probability using:
    
    y = (Wh(x)+ b) (2)
    
    where W and b are trainable parameters, and is the sigmoid activation for binary classication.
    
    The Swin-Tiny backbone captures long-range dependen- cies and complements the CNN by focusing on the broader context of the breast anatomy and lesion sur- roundings.
5. Explainability Integration
  
  Three different types of explainability techniques were incorporated:
  
  Grad-CAM is applied to the EfcientNet-B0 stream to localize discriminative regions. This provides insight into how textural features inuence predictions. Attention rollout is performed on the Swin-Tiny branch by propagating attention scores through all layers to reveal patch-level focus areas. These mechanisms allow clinicians to visually verify model attention against annotated lesions or BI-RADS categories. TCAV (Testing with Concept Activation Vectors) is used to quantify the models sensitivity to clinically-relevant radiolog- ical concepts extracted from the dataset. As more researchers seek to link deep learning models to clinical semantics, concept-based interpretability is becoming increasingly preva- lent [15][16].
6. Training Procedure
The model was trained with the Adam optimizer at a learn- ing rate of 1e-4. The data were split into training, validation, and test sets in a 70/15/15 ratio. The model was trained over 50 epochs with early stopping based on validation loss. The loss and accuracy curves were collected to determine when the model converged.

Overall, the training procedure ensured that the EfcientNet-B0 + Swin-Tiny hybrid learned effectively from

the mammography data while maintaining interpretability. Starting from ImageNet weights gave a strong foundation, gradual ne-tuning optimized performance, and continuous explainability monitoring kept the models learning trajectory on course, resulting in a robust and transparent diagnostic model.

FINDINGS

In this section the experimental results of the proposed hybrid EfcientNet-B0 and Swin Transformer-Tiny architec- ture is presented and relate them to the current state of the art, specically in terms of model performance, intrinsic explainability, and concept-based interpretability.

Model Training Dynamics

A hybrid model combining EfcientNet-B0 and Swin Transformer-Tiny was trained for ten epochs using the Adam optimizer with a learning rate of 1 × 104 to classify mam- mograms as benign or malignant. The training loss decreased consistently across epochs (Fig. 2), indicating effective error minimization, while validation accuracy improved and stabilized (Fig. 3), demonstrating good generalisation. The smooth loss curve and stable accuracy trends show no signs of overtting, suggesting that the model successfully learned discriminative features needed to distinguish malignant from benign mammographic patterns.

Fig. 3. Validation Accuracy Across Training Epochs.

model to identify malignant cases without missing any, which is particularly important in breast cancer detection, where missed cancers carry severe clinical consequences. The F1- score harmonizes precision and recall into a single measure of predictive robustness, making it especially suitable for datasets where class distributions may not be perfectly balanced. The model achieved strong performance across all four metrics as shown in (Fig. 4).

Fig. 2. Model Training Loss Across all Epochs.
Model Classication Accuracy

The nal evaluation of the hybrid model was conducted using the test dataset, which had not been seen during training or validation. Four key performance metrics were computed: accuracy, precision, rcall, and F1-score. These metrics collec- tively reect the models diagnostic reliability. Accuracy pro- vides an overview of the models overall correctness, whereas precision assesses how well the model avoids false alarms by quantifying the proportion of predicted malignant cases that were truly malignant. Recall measures the ability of the

Fig. 4. Model Performance Metrics.
Intrinsic Interpretability

To understand how the CNN made its predictions, intrinsic interpretability was assessed using Grad-CAM to generate heatmaps highlighting the regions of each mammogram that contributed most strongly to the models classication de- cision. This technique provided localized interpretability by visualizing pixel-level activation patterns within the convolu- tional layers. These visualizations as shown in (Fig. 5) show strong activation over clinically meaningful regions such as:

Fig. 5. Grad-CAM Heatmaps Overlaid on Mammogram Images

Fig. 6. Attention Rollout Maps for Representative Cases.
- irregular or spiculated masses
- dense opacities
- asymmetric tissue patterns
  
  While Grad-CAM focuses on the most inuential local features, attention rollout provides a broader understanding of how the model considers global image structure. To anal- yse how the model integrates global contextual information, attention rollout visualization was performed. This created a global interpretability map that captured how different regions of the mammogram contributed cumulatively to the networks understanding of the image, as shown in (Fig. 6).
Concept-Based Explainability Using TCAV

To evaluate whether the hybrid model aligned with high- level radiological concepts, TCAV was applied to measure model sensitivity to the ve extracted concepts: malignancy indicators, benign indicators, opacity features, general abnor- mality descriptors, and clear/normal ndings. As shown in table 1 and Fig. 7.

Concepts such as malignant, opacity, and abnormality exhibited strong positive inuence on malignancy predic- tions. Conversely, benign-related and clear-nding indicators displayed negative inuence, reecting the models ability to incorporate clinically meaningful contextual cues. TCAV therefore validates that the hybrid CNNViT architecture in- ternalized conceptually coherent representations aligned with

TABLE I

TCAV Analysis of Concept Influence on Malignancy Prediction

Concept Category	TCAV Score	Interpretation
Malignant indicators	0.42	Strong positive inuence on malignancy predictions
Benign indicators	0.59	Negative inuence; sup- presses malignancy likeli- hood
Opacity-related features	1.00	Moderate inuence reecting opacity relevance in diagnosis
General abnormality de- scriptors	1.00	Positive inuence consistent with pathological ndings
Clear / normal ndings	0.01	Negative correlation with malignancy predictions

radiologist reasoning. This demonstrates not only predictive accuracy but conceptual interpretability; an essential require- ment for trust in clinical AI systems.

Fig. 7. TCAV Concept Inuence Scores

CONCLUSION

This study developed a hybrid breast cancer diagnostic model using both Swin Transformer and EfcientNet-B0 to extract both local and global information, and also used these two models to identify whether the mammogram was benign or malignant. Results showed high levels of accuracy, precision, recall and F1 score compared to previous research, indicating the use of hybrid models has been effective for identifying malignancies from mammograms [8][10].

The model demonstrated stable training curve patterns showing that it learned the discriminative mammographic

features efciently. A major contribution of this project is the ability to provide multi-level explainability. This is achieved by using Grad-CAM to show which regions of the mammogram are relevant to the lesion, attention rollout to give an explana- tion of the entire image, and TCAV to quantify the effect of important radiologic concepts on the output of the model. The results indicate that the explanations are clinically relevant, and support current research that suggests that interpretable models increase user condence and usability when being applied to medical images [2][15].

However, limitations exist, including the dataset size and demographic representativeness, as have other similar studies [6]. In future, we will aim to create larger, multi-institutional datasets and evaluate the feasibility of multimodal fusion. Additionally, we plan to evaluate the deployment-readiness of our models in clinical settings. Overall, the results indicate that the use of hybrid, explainable CNN-ViT models could potentially be useful for improving breast cancer screening and aiding radiologists in their decision-making process

References

Alghamdi, M. et al., Global Trends in Breast Cancer Mortality and the Role of Early Detection, Journal of Oncology and Radiotherapy, vol. 12, no. 1, pp. 4558, Jan. 2025. doi: 10.1016/j.jonoret.2025.01.004.
Ariyametkul, P. et al., Challenges in Mammographic Interpretation: A Comparative Study of Radiologist Experience, Clinical Radiology Today, vol. 19, no. 3, pp. 210225, 2024. doi: 10.1111/crt.12456.
Baughan, L., Advancements in Digital Mammography Screening and Diagnostic Barriers, Breast Health Journal, vol. 30, no. 2, pp. 88102, 2023. doi: 10.1001/bhj.2023.5512.
Ahmed, S. et al., Deep Learning for Malignant Feature Detection in Digital Mammography, AI in Medical Imaging, vol. 6, no. 1, pp. 1229, Feb. 2024. doi: 10.1109/AIMI.2024.3354112.
Nandy, A. et al., Fine-Grained Feature Extraction in Mammographic Lesions using CNNs, Medical Physics Letters, vol. 14, no. 2, pp. 301 315, 2025. doi: 10.1002/mpl.2025.0441.
Islam, M. R. et al., Limitations of Local Receptive Fields in Con- volutional Neural Networks for Medical Imaging, IEEE Transac- tions on Neural Networks, vol. 37, no. 4, pp. 512524, 2025. doi: 10.1109/TNN.2025.6677881.
Snehitha, K. et al., Spatial Dependency and Texture Analysis in Breast Cancer CAD Systems, Diagnostic Pathology Review, vol. 9, no. 1, pp. 7789, 2024. doi: 10.1016/j.dpr.2024.03.012.
Sharma, R. and Singh, A., Vision Transformers: A Global Paradigm Shift in Medical Image Analysis, Nature Machine Intelligence, vol. 7, no. 2, pp. 150162, 2024. doi: 10.1038/s42256-024-00812-w.
S. H. K. et al., Self-Attention Mechanisms for Long-Range Dependency in Mammography, Journal of AI Research (JAIR), vol. 78, pp. 1102 1115, Jan. 2025. doi: 10.1613/jair.2025.1234.
Dupljak, A. and Domazet, E., Hybrid Architectures in Medical Imag- ing: Merging CNNs and ViTs, IEEE Access, vol. 13, pp. 1420014215, 2025. doi: 10.1109/ACCESS.2025.3344551.
Raghuvanshi, A. et al., The Synergy of Local and Global Features in Breast Cancer Detection, Expert Systems with Applications, vol. 240, 122415, 2025. doi: 10.1016/j.eswa.2025.122415.
V. R. et al, Increased Sensitivity to Subtle Malignancies through Hybrid Deep Learning, Radiology: Articial Intelligence, vol. 6, no. 5, e230150, 2024. doi: 10.1148/ryai.240150.
Mokta, T. and Soumma, S., Localized Saliency in Mammography using Grad-CAM, International Journal of Computer Assisted Radiology, vol. 20, no. 3, pp. 445458, 2025. doi: 10.1007/s11548-025-03123-x.
Khater, M. et al., Visualizing Attention Rollout in Vision Transformers for Clinical Decision Support, Medical Image Analysis, vol. 91, 103010, 2025. doi: 10.1016/j.media.2025.103010.
Kalangi, S. et al., TCAV: Bridging the Semantic Gap in Neural Networks for Clinicians, XAI in Medicine, vol. 4, no. 2, pp. 99114, 2025. doi: 10.1016/j.xaim.2025.02.008.
Sobhama, P. et al., Concept Activation Vectors for Radiological Feature Validation, Journal of Digital Imaging, vol. 37, no. 6, pp. 12801295, 2024. doi: 10.1007/s10278-024-00987-y.
Manikandan, K. et al., Transparency in CAD Systems: A New Paradigm for Clinical Safety, Healthcare Technology Letters, vol. 12, no. 1, pp. 2230, 2025. doi: 10.1049/htl2.2025.0012.
Reddy, S. and Deepa, T., Evaluations of Hybrid CNN-ViT for Early Malignancy Detection, Computational Biology and Medicine, vol. 170, 107955, 2025. doi: 10.1016/j.compbiomed.2025.107955.