Cross-Modal Adaptation of Medical Vision-Language Models for Few-Shot Classification

Authors: Jingyi Wu and S Kevin Zhou
Conference: ICIC 2025 Posters, Ningbo, China, July 26-29, 2025
Pages: 624-634
Keywords: Vision-Language Model · Cross-Modal Adaptation · Few-Shot Learning

Abstract

Medical Vision-Language Models (VLMs) show significant potential for auxiliary diagnosis, especially given the continuous growth of medical image data. However, adapting these models effectively with limited labeled data remains a challenge. This paper proposes a cross-modal adaptation method for few-shot medical image classification based on pre-trained VLMs. Our approach leverages both image features and corresponding text features extracted from the pre-trained models to train a classifier head. Furthermore, we employ the SHAP interpretability analysis method to select the most informative text features, thereby enhancing classification performance. We evaluated our method on the CheXpert5x200 dataset using MedCLIP and KAD as foundation models, comparing it against zero-shot classification and unimodal adaptation using only image features. Results demonstrate that our approach significantly improves few-shot classification performance over the baselines. The SHAP-based feature selection provides additional gains. Ultimately, we present a general, simple, and efficient cross-modal adaptation strategy that enhances medical VLM performance using only a small number of image samples, contributing to more reliable AI-powered diagnostic tools.
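The recipe described in the abstract, training one classifier head on pooled image and text features, then ranking text features with SHAP, can be sketched as below. This is a minimal illustration rather than the authors' released code: the shapes, hyperparameters, and the closed-form linear SHAP criterion are assumptions (the abstract does not specify them), and in practice the embeddings would come from a frozen MedCLIP or KAD encoder instead of the random tensors used here as stand-ins.

```python
import torch
import torch.nn as nn

# Stand-ins for features from a frozen VLM encoder (e.g., MedCLIP or KAD).
# 5-way 4-shot with 512-d embeddings; all values here are illustrative.
num_classes, shots, dim = 5, 4, 512
img_feats = torch.randn(num_classes * shots, dim)   # few-shot image features
img_labels = torch.arange(num_classes).repeat_interleave(shots)
txt_feats = torch.randn(num_classes, dim)           # one prompt embedding per class
txt_labels = torch.arange(num_classes)

# Work in the VLM's shared embedding space: L2-normalize both modalities.
img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)

# Cross-modal adaptation: treat text embeddings as extra labeled samples
# and train a single linear head on the union of both modalities.
X = torch.cat([img_feats, txt_feats])
y = torch.cat([img_labels, txt_labels])

head = nn.Linear(dim, num_classes)
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):                    # few-shot: a short training loop suffices
    opt.zero_grad()
    loss = loss_fn(head(X), y)
    loss.backward()
    opt.step()

# Inference in this sketch uses image features alone through the same head.
logits = head(img_feats)

# SHAP-based selection (assumed criterion): for a linear head with features
# treated as independent, SHAP values have the closed form
# w[c, i] * (x[i] - E[x[i]]); rank text-feature dimensions by mean |SHAP|.
W = head.weight.detach()                            # (num_classes, dim)
shap_txt = W[txt_labels] * (txt_feats - X.mean(dim=0))
importance = shap_txt.abs().mean(dim=0)             # per-dimension importance
top_dims = importance.argsort(descending=True)[:256]  # keep most informative dims
```

In this sketch the text embeddings act purely as additional training samples in the shared embedding space, which is what lets the head improve without extra labeled images; the SHAP ranking then identifies which text-feature dimensions contribute most to the head's decisions.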