DocHQ: Towards Multi-modal Document Understanding via Hybrid Feature Queries

Authors: Jin Wang, Yingying Liu, and Yahong Han
Conference: ICIC 2025 Posters, Ningbo, China, July 26-29, 2025
Pages: 327-339
Keywords: Document Image Understanding, Multi-modal Feature Alignment, Document Pretrain Model.

Abstract

Significant progress has been made on general multi-modal tasks by leveraging pre-trained visual and language models. However, in visual document understanding tasks, improving performance with these existing models is difficult because of the fundamental differences between natural and document images. In this paper, we introduce DocHQ, a multi-modal document image understanding model built on pre-trained visual and language models that employs hybrid feature queries to align document visual information with language text. Our approach combines learnable queries and fixed task-oriented queries within a cross-attention visual-language alignment module to extract more fine-grained information from document images. Moreover, we use large-scale document images for alignment training between the pre-trained image encoder and the language model. Experimental results demonstrate that our method achieves outstanding performance across three different types of document image understanding tasks compared to existing approaches.
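To make the hybrid-query idea concrete, the sketch below shows one plausible form of the cross-attention alignment module described in the abstract: learnable queries are concatenated with fixed task-oriented queries, and the combined set cross-attends to visual features from a pre-trained image encoder. This is not the authors' released code; all module names, dimensions, and query counts are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's implementation) of a hybrid-query
# cross-attention alignment module: learnable queries + fixed task-oriented
# queries attend to document visual tokens from a pre-trained image encoder.
import torch
import torch.nn as nn


class HybridQueryAligner(nn.Module):
    def __init__(self, dim=768, num_learnable=32, num_task=16, num_heads=8):
        super().__init__()
        # Learnable queries, optimized during alignment training.
        self.learnable_queries = nn.Parameter(torch.randn(num_learnable, dim) * 0.02)
        # Fixed task-oriented queries (e.g. embeddings of task prompts), kept frozen.
        self.register_buffer("task_queries", torch.randn(num_task, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feats):
        # visual_feats: (batch, num_patches, dim) from the pre-trained image encoder.
        b = visual_feats.size(0)
        queries = torch.cat([self.learnable_queries, self.task_queries], dim=0)
        queries = queries.unsqueeze(0).expand(b, -1, -1)
        # Hybrid queries cross-attend to the document's visual tokens.
        attended, _ = self.cross_attn(queries, visual_feats, visual_feats)
        # The aligned tokens would then be fed to the language model.
        return self.norm(attended)


if __name__ == "__main__":
    aligner = HybridQueryAligner()
    patches = torch.randn(2, 196, 768)   # dummy ViT patch features
    print(aligner(patches).shape)        # torch.Size([2, 48, 768])
```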