Medvlmoe: sparse mixture of experts vision language model for multi-pathology medical image diagnosis

Dr. Inam Ullah Khan

Authors

Dr. Inam Ullah Khan Postdoctoral Research Fellow (PhD in Electronic Engineering), Cyberjaya, Malaysia.

Keywords:

Vision Language Model, Mixture of Experts, Medical Image Diagnosis, Cross-Modal Fusion, Radiological AI, Multi Pathology Classification.

Abstract

Existing Vision Language Models use a single shared representation space where the visually complex signatures of various disease classes cannot be captured. Existing Vision Language Models have a single shared representation space, and are unable to represent the visually complex signatures of different disease classes. To address the above challenges, this paper introduces MedVLMoE, a sparse Mixture of Experts (MoE) Vision Language Model to route fused image-text representations to disease experts' networks. It is a dual-stream encoder consisting of ViT-L/14 for radiological image feature and BioGPT for clinical text, with modality-specific position encodings to encode the position, and learn a cross-modal fusion attention module that connects the two streams. The sparse MoE module (K=8 experts, Top-2 routing) allows to specialize on diseases without the need of explicit supervision by experts, which is enforced by a load-balancing auxiliary loss. Experiments are shown to outperform the best baseline (CLIP-Med) by 3.7–6.3% in terms of the AUC-ROC metric on five medical imaging benchmarks (ChestX-ray14, CheXpert, MIMIC-CXR, PathMNIST, RetinaMNIST). The activation analysis was used by experts to interpretively explain the behaviour of the MoE modules in clinical AI systems, and emergent disease-specific routing specialization was identified.

Medvlmoe: sparse mixture of experts vision language model for multi-pathology medical image diagnosis

Authors

Keywords:

Abstract

Published

How to Cite

Issue

Section

Similar Articles

SidebarMenu

Downloads

Current Issue

Information

Make a Submission

Keywords