A dual cross-modal interactive guided common representation method for fine-grained cross-modal retrieval

Authors: Hongchun Lu, Min Han, Xue Li, Le An
Conference: ICIC 2024 Posters, Tianjin, China, August 5-8, 2024
Pages: 686-700
Keywords: Fine-grained Cross-media Retrieval, Cross-Modal Spatial Interaction, Cross-Modal Channel Interaction

Abstract

In fine-grained cross-modal retrieval, the large heterogeneity gap between modalities is a key cause of low retrieval performance. Bridging the media divide, i.e., the inconsistent representation of different media types, is therefore an important way to improve retrieval accuracy. Although previous research has yielded promising results, standard models still have two shortcomings. First, the information interaction between modalities is ignored when learning common representations of different media data. Second, discriminative fine-grained features are not fully exploited. To address these issues, we propose a dual cross-modal interaction-guided common representation network (DCINet) that enhances the information interaction between modalities while mining discriminative features in media data. Specifically, we construct a common representation network and train it with two sets of multimodal inputs: features before interaction and features after interaction. These two training strategies guide the learning of the common representation network through a max-min game, effectively enhancing cross-media semantic consistency and improving retrieval accuracy. Extensive experiments and ablation studies on public datasets demonstrate the effectiveness of the proposed method.
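The abstract describes the method only at a high level. The sketch below (PyTorch) illustrates one plausible way to wire together cross-modal channel and spatial interaction modules, a shared common-representation projector, and a discriminator for the max-min guidance between pre-interaction and post-interaction inputs. All module names, dimensions, and loss choices here are illustrative assumptions, not the authors' actual DCINet architecture.

```python
# Hypothetical sketch of a dual cross-modal interaction-guided common
# representation pipeline. Names, shapes, and losses are assumptions.
import torch
import torch.nn as nn


class CrossModalChannelInteraction(nn.Module):
    """Re-weights each modality's channels with a gate computed from the other modality."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate_for_a = nn.Linear(dim, dim)  # gate for modality A, conditioned on B
        self.gate_for_b = nn.Linear(dim, dim)  # gate for modality B, conditioned on A

    def forward(self, tokens_a, tokens_b):
        # tokens_*: (batch, num_tokens, dim); each gate uses the other modality's pooled feature.
        gate_a = torch.sigmoid(self.gate_for_a(tokens_b.mean(dim=1))).unsqueeze(1)
        gate_b = torch.sigmoid(self.gate_for_b(tokens_a.mean(dim=1))).unsqueeze(1)
        return tokens_a * gate_a, tokens_b * gate_b


class CrossModalSpatialInteraction(nn.Module):
    """Exchanges information across spatial/token positions via residual cross-attention."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_b = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens_a, tokens_b):
        a_from_b, _ = self.attn_a(tokens_a, tokens_b, tokens_b)  # A queries B
        b_from_a, _ = self.attn_b(tokens_b, tokens_a, tokens_a)  # B queries A
        return tokens_a + a_from_b, tokens_b + b_from_a


class CommonRepresentation(nn.Module):
    """Maps pooled features of either modality into a shared, normalized embedding space."""

    def __init__(self, dim: int, common_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, common_dim), nn.ReLU(),
                                  nn.Linear(common_dim, common_dim))

    def forward(self, tokens):
        return nn.functional.normalize(self.proj(tokens.mean(dim=1)), dim=-1)


class Discriminator(nn.Module):
    """Tries to tell pre-interaction from post-interaction common representations."""

    def __init__(self, common_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(common_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, z):
        return self.net(z)


if __name__ == "__main__":
    dim = 512
    img_tokens = torch.randn(8, 49, dim)   # e.g. 7x7 visual patch features
    txt_tokens = torch.randn(8, 20, dim)   # e.g. 20 word embeddings

    channel = CrossModalChannelInteraction(dim)
    spatial = CrossModalSpatialInteraction(dim)
    common = CommonRepresentation(dim)
    disc = Discriminator()

    # Pre-interaction stream: raw modality features go straight into the common network.
    z_img_pre, z_txt_pre = common(img_tokens), common(txt_tokens)

    # Post-interaction stream: features pass through both interaction modules first.
    a, b = channel(img_tokens, txt_tokens)
    a, b = spatial(a, b)
    z_img_post, z_txt_post = common(a), common(b)

    # Alignment loss: paired image/text embeddings should agree in the common space.
    align_loss = (1 - (z_img_post * z_txt_post).sum(-1)).mean()

    # Discriminator loss for the max-min game: it maximizes its ability to separate
    # the two streams; in a full training loop the common network would be updated
    # with flipped labels so that the two streams become indistinguishable.
    logits_pre = disc(torch.cat([z_img_pre, z_txt_pre]))
    logits_post = disc(torch.cat([z_img_post, z_txt_post]))
    bce = nn.functional.binary_cross_entropy_with_logits
    disc_loss = bce(logits_post, torch.ones_like(logits_post)) + \
                bce(logits_pre, torch.zeros_like(logits_pre))

    print(f"alignment: {align_loss.item():.4f}  discriminator: {disc_loss.item():.4f}")
```

In this reading, the adversarial game pushes the common representations learned from raw (pre-interaction) and interaction-enhanced (post-interaction) features toward the same distribution, which is one way the two training strategies could reinforce cross-media semantic consistency.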