An Enhanced Method of Multimodal Information Retrieval Based on Document Segmentation

Authors: Jintao Liu, Chen Feng, Guang Jin, and Jun Fan
Conference: ICIC 2025 Posters, Ningbo, China, July 26-29, 2025
Pages: 3648-3659
Keywords: Large language model, Multimodality, Vector matching, Retrieval-augmented generation

Abstract

With the rapid development of information technology, emerging technologies, typified by large language models (LLMs), are driving profound transformations across many industries. However, LLMs can exhibit hallucination, making it difficult for them to accurately understand and effectively apply relevant industry knowledge in certain specialized fields. To address this issue, this paper proposes DocColQwen, a multimodal information retrieval enhancement method based on document segmentation. First, the large model analyzes the user's task, and the ColPali approach (Contextualized Late Interaction over PaliGemma) is used to segment the multimodal experimental document into page images. The images and user questions are then encoded into vectors for matching, and the retrieved documents are passed to the Qwen2-VL model to generate the response. Finally, the method is validated on multimodal experimental documents, demonstrating its effectiveness and offering a solution for analyzing and processing multimodal test documents.
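The matching step the abstract describes follows ColPali's late-interaction scheme: each query token embedding is compared against every page-image patch embedding, the best match per token is kept (MaxSim), and the per-token maxima are summed to score the page. The sketch below illustrates that scoring with NumPy on toy vectors; the function names and the toy embeddings are illustrative assumptions, not the paper's implementation, and a real pipeline would obtain the embeddings from the retrieval model itself.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """ColPali-style late-interaction score.

    query_emb: (num_query_tokens, dim) query token embeddings
    page_emb:  (num_patches, dim) page-image patch embeddings
    For each query token, take the maximum similarity over all
    patches, then sum across query tokens.
    """
    sims = query_emb @ page_emb.T  # (num_query_tokens, num_patches)
    return float(sims.max(axis=1).sum())

def retrieve(query_emb: np.ndarray, page_embs: list, top_k: int = 1) -> list:
    """Rank segmented page images by MaxSim score; return top-k indices."""
    scores = [maxsim_score(query_emb, p) for p in page_embs]
    return sorted(range(len(page_embs)), key=lambda i: -scores[i])[:top_k]

# Toy example: two query tokens, two candidate pages (hypothetical data).
query = np.eye(2)                               # two orthogonal query tokens
page_a = np.array([[1.0, 0.0], [0.0, 0.0]])     # matches only the first token
page_b = np.eye(2)                              # matches both tokens
best = retrieve(query, [page_a, page_b])        # page_b scores higher
```

The retrieved pages would then be handed, as images, to the vision-language model (Qwen2-VL in the paper) together with the user question for answer generation.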