DocHQ: Towards Multi-modal Document Understanding via Hybrid Feature Queries

Authors: Jin Wang, Yingying Liu, and Yahong Han
Conference: ICIC 2025 Posters, Ningbo, China, July 26-29, 2025
Pages: 327-339
Keywords: Document Image Understanding, Multi-modal Feature Alignment, Document Pretrain Model.

Abstract

Significant progress has been made on general multi-modal tasks by leveraging pre-trained visual and language models. However, in visual document understanding tasks, improving performance with these existing models is difficult because of the fundamental differences between natural and document images. In this paper, we introduce DocHQ, a multi-modal document image understanding model built on pre-trained visual and language models that employs hybrid feature queries to align document visual information with language text. Our approach combines learnable queries and fixed task-oriented queries within a cross-attention visual-language alignment module to extract more fine-grained information from document images. Moreover, we use large-scale document images for alignment training between the pre-trained image encoder and the language model. Experimental results demonstrate that our method achieves outstanding performance across three different types of document image understanding tasks compared to existing approaches.
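To make the hybrid-query idea concrete, the sketch below shows one plausible form of the cross-attention alignment module described in the abstract: learnable queries are concatenated with fixed task-oriented queries, and the combined set cross-attends to visual features from a pre-trained image encoder. This is not the authors' released code; all module names, dimensions, and query counts are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's implementation) of a hybrid-query
# cross-attention alignment module: learnable queries + fixed task-oriented
# queries attend to document visual tokens from a pre-trained image encoder.
import torch
import torch.nn as nn


class HybridQueryAligner(nn.Module):
    def __init__(self, dim=768, num_learnable=32, num_task=16, num_heads=8):
        super().__init__()
        # Learnable queries, optimized during alignment training.
        self.learnable_queries = nn.Parameter(torch.randn(num_learnable, dim) * 0.02)
        # Fixed task-oriented queries (e.g. embeddings of task prompts), kept frozen.
        self.register_buffer("task_queries", torch.randn(num_task, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feats):
        # visual_feats: (batch, num_patches, dim) from the pre-trained image encoder.
        b = visual_feats.size(0)
        queries = torch.cat([self.learnable_queries, self.task_queries], dim=0)
        queries = queries.unsqueeze(0).expand(b, -1, -1)
        # Hybrid queries cross-attend to the document's visual tokens.
        attended, _ = self.cross_attn(queries, visual_feats, visual_feats)
        # The aligned tokens would then be fed to the language model.
        return self.norm(attended)


if __name__ == "__main__":
    aligner = HybridQueryAligner()
    patches = torch.randn(2, 196, 768)   # dummy ViT patch features
    print(aligner(patches).shape)        # torch.Size([2, 48, 768])
```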