Piculet: Specialized Model-Guided Hallucination Alleviation for MultiModal Large Language Models

Authors: Kohou Wang\inst{1}\orcidID{0009-0007-5863-2288} \and Xiang Liu\inst{1}\orcidID{0009-0003-2492-403X} \and ZeZhou Chen\inst{1}\orcidID{0009-0000-6796-6043} \and Zhaoxiang Liu*\inst{1,2}\orcidID{0000-0002-1267-0277} \and Shiguo Lian*\inst{1,2}\orcidID{0000-0003-4308-7049} \and Kai Wang\inst{1,2}\orcidID{0000-0002-1171-0281}
Conference: ICIC 2024 Posters, Tianjin, China, August 5-8, 2024
Pages: 1-15
Keywords: {Multimodal Large Language Models \and hallucinations \and training-free}

Abstract

Multimodal Large Language Models (MLLMs) have made significant progress in bridging
the gap between the visual and language modalities. However, hallucinations in MLLMs,
where the generated text does not align with the image content,
remain a major challenge. Existing methods for addressing hallucinations often rely on instruction-tuning,
which requires retraining the model on specific data and further increases the cost of using MLLMs. In this paper,
we introduce a novel training-free method, named Piculet, for enhancing the input representation of MLLMs. Piculet leverages multiple specialized auxiliary models
to extract descriptions of visual information from the input image, and combines these descriptions with the original image as the input to the MLLM.
We evaluate our method both quantitatively and qualitatively, and the results demonstrate that Piculet greatly decreases hallucinations in MLLMs.
Our method is universal and can be easily extended to different MLLMs.
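The training-free pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names (`build_augmented_prompt`, `answer_with_piculet`) and the prompt wording are hypothetical, and the auxiliary models and MLLM are stand-in callables.

```python
# Hypothetical sketch of the training-free pipeline: specialized
# auxiliary models describe the image in text, and those descriptions
# are combined with the original image as input to the MLLM.

def build_augmented_prompt(question, descriptions):
    """Combine auxiliary-model descriptions with the user question."""
    context = "\n".join(f"- {d}" for d in descriptions)
    return (
        "The following facts were extracted from the image by "
        "specialized models:\n" + context +
        "\nUsing both the image and these facts, answer:\n" + question
    )

def answer_with_piculet(image, question, auxiliary_models, mllm):
    # Each auxiliary model returns a textual description of one aspect
    # of the image (e.g. detected objects, recognized text).
    descriptions = [model(image) for model in auxiliary_models]
    prompt = build_augmented_prompt(question, descriptions)
    # The MLLM still receives the original image alongside the
    # description-augmented prompt; no retraining is involved.
    return mllm(image, prompt)
```

Because the auxiliary models only contribute extra text to the prompt, the same wrapper can sit in front of any MLLM that accepts an image and a text prompt, which is what makes the approach training-free and model-agnostic.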