Cross-Modal Adaptation of Medical Vision-Language Model for Few-Shot Classification
Authors:
Jingyi Wu and S. Kevin Zhou
Conference:
ICIC 2025 Posters, Ningbo, China, July 26-29, 2025
Pages:
624-634
Keywords:
Vision-Language Model · Cross-Modal Adaptation · Few-Shot Learning
Abstract
Medical Vision-Language Models (VLMs) show significant potential for auxiliary diagnosis, especially given the continuous growth of medical image data. However, adapting these models effectively with limited labeled data remains a challenge. This paper proposes a cross-modal adaptation method for few-shot medical image classification based on pre-trained VLMs. Our approach leverages both image features and corresponding text features extracted from the pre-trained models to train a classifier head. Furthermore, we employ the SHAP interpretability analysis method to select the most informative text features, thereby enhancing classification performance. We evaluated our method on the CheXpert5x200 dataset using MedCLIP and KAD as foundation models, comparing it against zero-shot classification and unimodal adaptation using only image features. Results demonstrate that our approach significantly improves few-shot classification performance over the baselines, and the SHAP-based feature selection provides additional gains. Ultimately, we present a general, simple, and efficient cross-modal adaptation strategy that enhances medical VLM performance using only a small number of image samples, contributing to more reliable AI-powered diagnostic tools.
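The sketch below illustrates the core idea in the abstract: a linear classifier head trained on the union of image and text embeddings from a frozen VLM, followed by SHAP-based ranking of candidate text prompts. It is a minimal illustration under stated assumptions, not the authors' implementation: the feature tensors are assumed to come from a frozen encoder such as MedCLIP or KAD, the logistic probe and keep_k threshold are hypothetical, and the SHAP step is one plausible reading of the selection procedure described above.

    # Minimal sketch of the cross-modal adaptation idea in the abstract.
    # Hypothetical throughout: feature tensors are assumed to come from a
    # frozen medical VLM (e.g., MedCLIP or KAD); the probe classifier and
    # SHAP ranking are illustrative, not the authors' exact procedure.
    import numpy as np
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def train_cross_modal_head(image_feats, image_labels,
                               text_feats, text_labels,
                               num_classes, epochs=200, lr=1e-3):
        """Fit a linear classifier head on pooled image and text
        embeddings, treating class-prompt text features as extra
        few-shot training samples in the shared embedding space."""
        feats = F.normalize(torch.cat([image_feats, text_feats]), dim=-1)
        labels = torch.cat([image_labels, text_labels])
        head = nn.Linear(feats.shape[1], num_classes)
        opt = torch.optim.AdamW(head.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            F.cross_entropy(head(feats), labels).backward()
            opt.step()
        return head

    def rank_prompts_with_shap(image_feats, image_labels,
                               prompt_feats, keep_k=5):
        """Rank candidate text prompts by mean |SHAP| importance of their
        image-text similarity scores under a logistic probe, keeping the
        top-k as the 'most informative' text features."""
        import shap
        from sklearn.linear_model import LogisticRegression
        sims = image_feats.numpy() @ prompt_feats.numpy().T  # (n_imgs, n_prompts)
        probe = LogisticRegression(max_iter=1000).fit(sims, image_labels.numpy())
        sv = shap.Explainer(probe.predict_proba, sims)(sims)
        importance = np.abs(sv.values).mean(axis=0)          # per prompt (x class)
        if importance.ndim > 1:
            importance = importance.mean(axis=-1)
        return np.argsort(importance)[::-1][:keep_k]

In this reading, the embeddings of the selected prompts become the text_feats passed to the classifier head; the abstract's key point is that text embeddings act as additional free training examples when labeled images are scarce.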
BibTeX Citation:
@inproceedings{ICIC2025,
  author    = {Jingyi Wu and S. Kevin Zhou},
  title     = {Cross-Modal Adaptation of Medical Vision-Language Model for Few-Shot Classification},
  booktitle = {Proceedings of the 21st International Conference on Intelligent Computing (ICIC 2025)},
  month     = {July},
  year      = {2025},
  address   = {Ningbo, China},
  pages     = {624--634},
}