Visual Segmentation with Explanations for Medical Decision Making: A CNN Architecture with a Transformer-Attention Extension
| dc.contributor.author | Khalaf, Mahmoud | |
| dc.date.accessioned | 2026-05-19T17:46:05Z | |
| dc.date.available | 2026-05-19T17:46:05Z | |
| dc.date.issued | 2026-05-19 | |
| dc.date.submitted | 2026-05-14 | |
| dc.description.abstract | Medical image segmentation is essential for assisting medical professionals in locating anomalies in images. However, the lack of explainability in current segmentation frameworks limits clinical trust and adoption. This thesis presents two complementary frameworks for explainable medical image segmentation, demonstrating that accuracy and interpretability can be achieved simultaneously through innovative architectural design. The first framework introduces an attention-based CNN architecture that generates model-specific visual explanations through spatial attention gates integrated directly into the network. The proposed model achieves a Dice score of 0.8621 on the Kvasir-SEG polyp segmentation dataset, outperforming all evaluated explainable models, while producing attention heatmaps that faithfully reflect the model's decision-making process. The second framework builds on this by introducing a dual-encoder architecture that combines a pretrained ResNet-34 CNN encoder with a pretrained Swin Transformer encoder, fused through a learned directional attention-gated mechanism at multiple scales. The dual encoder achieves a Dice score of 0.907 on Kvasir-SEG, outperforming all single-encoder baselines, while generating richer multi-scale visual explanations that reflect the complementary contributions of both encoders. Finally, this thesis outlines in detail a pathway towards a fully multimodal explainability system that integrates textual explanations, produced through SigLIP, Retrieval-Augmented Generation, and a Large Language Model, alongside the visual heatmaps. We also explore future directions in breast cancer and echocardiogram segmentation, specifically applications to ejection fraction computation, and discuss insights gained from the trial and error that led to the innovative designs of the thesis. | |
| dc.identifier.uri | https://hdl.handle.net/10012/23338 | |
| dc.language.iso | en | |
| dc.pending | false | |
| dc.publisher | University of Waterloo | en |
| dc.relation.uri | https://github.com/MaudDK/MedSeg-XAI-AGFusion | |
| dc.relation.uri | https://datasets.simula.no/kvasir-seg/ | |
| dc.subject | Medical image segmentation | |
| dc.subject | Explainable AI | |
| dc.subject | Convolutional Neural Networks | |
| dc.subject | Vision Transformers | |
| dc.subject | Deep Learning | |
| dc.title | Visual Segmentation with Explanations for Medical Decision Making: A CNN Architecture with a Transformer-Attention Extension | |
| dc.type | Master Thesis | |
| uws-etd.degree | Master of Mathematics | |
| uws-etd.degree.department | David R. Cheriton School of Computer Science | |
| uws-etd.degree.discipline | Computer Science | |
| uws-etd.degree.grantor | University of Waterloo | en |
| uws-etd.embargo.terms | 0 | |
| uws.contributor.advisor | Cohen, Robin | |
| uws.contributor.advisor | Bentahar, Jamal | |
| uws.contributor.affiliation1 | Faculty of Mathematics | |
| uws.peerReviewStatus | Unreviewed | en |
| uws.published.city | Waterloo | en |
| uws.published.country | Canada | en |
| uws.published.province | Ontario | en |
| uws.scholarLevel | Graduate | en |
| uws.typeOfResource | Text | en |