HATN: hierarchical adaptive transformer network for real-time medical image segmentation using hybrid CNN-ViT architecture with multi-scale attention and uncertainty-aware loss functions

Dr. Veerpratap Meena

doi:10.55529/jaimlnn.61.75.87

Authors

Dr. Veerpratap Meena, Assistant Professor, Department of Electrical Engineering, National Institute of Technology (NIT) Jamshedpur, India.

Keywords:

Medical Image Segmentation, Hybrid CNN-ViT, Swin Transformer, Multi-Scale Attention, Uncertainty Quantification, Deep Learning.

Abstract

Medical image segmentation is a cornerstone of modern clinical medicine: it enables accurate volumetric delineation of anatomical structures in CT, MRI, and endoscopic imagery, supporting diagnosis and treatment planning. Despite the substantial improvements brought by U-Net and its successors, three bottlenecks still limit real-world adoption: (i) global context modeling remains insufficient to capture long-range anatomical relationships; (ii) feature selection in skip connections is weak, so uninformative low-level signals can degrade decoder representations; and (iii) reliable uncertainty quantification, a prerequisite for clinical acceptance of AI-driven diagnostic systems, is lacking. In this work, we propose HATN (Hierarchical Adaptive Transformer Network), a hybrid CNN-ViT segmentation architecture. It uses a Swin Transformer-style hierarchical backbone and applies Multi-Scale Deformable Attention (MSDA) at the bottleneck. In the skip connections, Multi-Scale Channel Attention (MSCA) retains relevant details while suppressing the rest. Training uses a compound uncertainty-aware objective, L_HATN = 0.50*L_CE + 0.35*L_Dice + 0.15*L_UC. We evaluate HATN on five established benchmark datasets: Synapse Multi-Organ CT, ACDC Cardiac MRI, Polyp Segmentation, ISIC Skin Lesion, and NIH Pancreas-CT. Hyperparameters are selected by Bayesian optimization with Optuna over 120 trials, using 5-fold cross-validation to account for variation across splits. Epistemic uncertainty is estimated with Monte Carlo Dropout using T = 20 forward passes, yielding uncertainty estimates that can be inspected downstream. On Synapse, HATN reaches 92.38% Dice and 4.9 mm HD95, exceeding the closest competitor (89.16% Dice) by 3.22 Dice points and reducing HD95 by 3.7 mm.
For cross-dataset generalization, HATN obtains 91.74%, 88.62%, and 90.44% Dice on the ACDC, Polyp, and ISIC benchmarks, respectively, all without fine-tuning. At inference, the method runs at 48 FPS on an NVIDIA RTX 3090 with TensorRT FP16 optimization, meeting real-time clinical thresholds. All nine baseline comparisons are statistically significant (p < 0.001, Bonferroni-corrected). Ablation studies support the contribution of each HATN component, and SHAP and Grad-CAM analyses show attention maps that are anatomically plausible and consistent. The full codebase and pre-trained weights are openly released to support community research.
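
To make the weighting in the compound objective concrete, the following is a minimal sketch for a binary mask in plain Python. Only the three weights (0.50, 0.35, 0.15) come from the abstract. The abstract does not define L_UC, so the `entropy_penalty` term is an assumed stand-in (mean predictive entropy). All function names are hypothetical and are not taken from the released codebase.

```python
import math

# Weights quoted in the abstract: L_HATN = 0.50*L_CE + 0.35*L_Dice + 0.15*L_UC
W_CE, W_DICE, W_UC = 0.50, 0.35, 0.15


def cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over pixels (L_CE)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def soft_dice_loss(probs, labels, eps=1e-7):
    """One minus the soft Dice coefficient (L_Dice)."""
    intersection = sum(p * y for p, y in zip(probs, labels))
    denom = sum(probs) + sum(labels)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


def entropy_penalty(probs, eps=1e-7):
    """ASSUMPTION: stand-in for L_UC, taken here as mean predictive entropy.
    The abstract does not specify the actual uncertainty term."""
    total = 0.0
    for p in probs:
        p = min(max(p, eps), 1.0 - eps)
        total -= p * math.log(p) + (1.0 - p) * math.log(1.0 - p)
    return total / len(probs)


def hatn_loss(probs, labels):
    """Weighted sum of the three terms, using the abstract's coefficients."""
    return (W_CE * cross_entropy(probs, labels)
            + W_DICE * soft_dice_loss(probs, labels)
            + W_UC * entropy_penalty(probs))


if __name__ == "__main__":
    labels = [1, 1, 0, 0]                      # toy ground-truth mask
    confident = [0.99, 0.99, 0.01, 0.01]       # toy foreground probabilities
    hesitant = [0.40, 0.60, 0.60, 0.40]
    print(hatn_loss(confident, labels), hatn_loss(hesitant, labels))
```

Under these assumptions, a confident and correct prediction yields a smaller loss than a hesitant one, because all three terms shrink together. A multi-class version would replace the binary cross-entropy and Dice terms with their per-class counterparts.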
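
The Monte Carlo Dropout procedure mentioned in the abstract (T = 20 stochastic forward passes) can be illustrated with a toy example. The sketch below keeps dropout active at inference on a hypothetical one-layer logistic model, then reports the mean prediction and the variance across passes as a simple proxy for epistemic uncertainty. The model, the dropout rate of 0.2, and the function names are assumptions made for illustration; only T = 20 comes from the abstract.

```python
import math
import random

T = 20  # number of stochastic forward passes, as stated in the abstract


def stochastic_forward(features, weights, drop_prob, rng):
    """One forward pass of a toy logistic model with dropout left active."""
    keep = 1.0 - drop_prob
    z = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep:        # this input unit survives dropout
            z += (x / keep) * w        # inverted-dropout rescaling
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid foreground probability


def mc_dropout_predict(features, weights, drop_prob=0.2, passes=T, seed=0):
    """Return (mean probability, variance across passes)."""
    rng = random.Random(seed)          # seeded for reproducibility
    samples = [stochastic_forward(features, weights, drop_prob, rng)
               for _ in range(passes)]
    mean = sum(samples) / passes
    variance = sum((s - mean) ** 2 for s in samples) / passes
    return mean, variance


if __name__ == "__main__":
    # Hypothetical per-pixel features and weights, for illustration only.
    mean, variance = mc_dropout_predict([1.0, -2.0, 0.5], [0.8, 0.3, -1.2])
    print(f"mean={mean:.4f} variance={variance:.6f}")
```

In a real segmentation network the same idea would be applied per pixel: the dropout layers stay in training mode during inference, the T probability maps are averaged, and the per-pixel spread gives an uncertainty map.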
