Multi-Scale Contrastive Adapter for Vision-Language Model Group Robustness

Authors: Yue Cai, Wenqiong Zhang, and Yikai Wang
Conference: ICIC 2025 Posters, Ningbo, China, July 26-29, 2025
Pages: 104-118
Keywords: Group Robustness, Vision-Language Models, Multi-Scale

Abstract

While vision-language models (VLMs) such as CLIP demonstrate strong zero-shot classification capabilities, their robustness to group shifts remains a critical challenge: classification accuracy degrades significantly on minority groups. Existing methods for improving group robustness often require costly full-model retraining or rely on single-scale feature representations, which may inadequately capture diverse group characteristics. We propose the Multi-Scale Contrastive Adapter (MSCA), a novel framework designed to improve group robustness in VLMs at low computational cost. MSCA employs a multi-scale feature representation strategy, applying contrastive learning across feature spaces of several dimensionalities to mitigate the model's group shift in each of these spaces. A feature voting mechanism dynamically selects the most relevant feature dimensions at inference time, further improving group robustness. Experiments on benchmarks including Waterbirds, CelebA, and CIFAR-10.02 show that MSCA significantly improves worst-group accuracy to 86.1% and reduces the robustness gap (GAP) from 55.2% to 4.1%, outperforming recent methods such as FairerCLIP. Our findings highlight that MSCA offers a practical pathway toward more robust vision-language models.
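To make the abstract's architecture concrete, the following is a minimal sketch of how a multi-scale adapter over frozen VLM features with inference-time feature voting might be wired up. It is not the authors' implementation: the class and function names (MultiScaleAdapter, feature_vote), the choice of scales, and the confidence-weighted voting rule are all illustrative assumptions.

```python
# Illustrative sketch only: module names, scale sizes, and the voting rule
# are hypothetical and not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleAdapter(nn.Module):
    """Projects frozen VLM image features into several lower-dimensional
    spaces; each scale would be trained with its own contrastive objective."""

    def __init__(self, in_dim: int = 512, scales=(64, 128, 256)):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, d), nn.ReLU(), nn.Linear(d, d))
            for d in scales
        )

    def forward(self, feats: torch.Tensor):
        # One L2-normalised embedding per scale.
        return [F.normalize(head(feats), dim=-1) for head in self.heads]


def feature_vote(scale_logits):
    """Toy voting rule: weight each scale by the confidence (max softmax
    probability) of its prediction, then average the class probabilities."""
    probs = [logits.softmax(dim=-1) for logits in scale_logits]
    weights = torch.stack([p.max(dim=-1).values for p in probs], dim=0)  # (S, B)
    weights = weights / weights.sum(dim=0, keepdim=True)
    stacked = torch.stack(probs, dim=0)                                  # (S, B, C)
    return (weights.unsqueeze(-1) * stacked).sum(dim=0)                  # (B, C)


if __name__ == "__main__":
    batch, in_dim, num_classes = 4, 512, 2
    adapter = MultiScaleAdapter(in_dim)
    image_feats = torch.randn(batch, in_dim)  # stand-in for frozen CLIP image features
    class_protos = [torch.randn(num_classes, d) for d in (64, 128, 256)]  # per-scale class prototypes
    embeddings = adapter(image_feats)
    logits = [emb @ F.normalize(p, dim=-1).t() for emb, p in zip(embeddings, class_protos)]
    print(feature_vote(logits).shape)  # torch.Size([4, 2])
```

Under these assumptions, the frozen backbone is queried once per image, and only the small per-scale heads are trained, which is consistent with the abstract's claim of low computational cost relative to full-model retraining.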