Efficient Multimodal Sentiment Recognition with Dual Cross-Attention for Multi-Scale Features

Authors: Xinyu Ye, Tingsong Ma, Yi Feng, and Yiming Zhai
Conference: ICIC 2025 Posters, Ningbo, China, July 26-29, 2025
Pages: 1990-2002
Keywords: Multimodal Sentiment Analysis, Cross-Attention Mechanism, Multi-Scale Features.

Abstract

Multimodal sentiment analysis has emerged as a key research field, particularly for decoding emotions conveyed through text and images on social media platforms. However, many approaches encounter difficulties when integrating textual and visual features across diverse dimensions, often leading to suboptimal performance. To address this, we propose a novel approach to multimodal sentiment recognition: a simple yet efficient network inspired by feature pyramids. In this model, feature vectors are split into high-dimensional and low-dimensional representations, which are then processed through distinct cross-attention mechanisms tailored to their scales, followed by a fusion step that captures comprehensive cross-modal interactions. This strategy enhances the network’s ability to model relationships between modalities effectively. We evaluated our approach on the well-established MVSA-Single and MVSA-Multiple datasets, where it consistently surpasses existing techniques. Specifically, it achieves an accuracy of 78.27 and an F1 score of 77.95 on MVSA-Single, and an accuracy of 71.18 and an F1 score of 68.92 on MVSA-Multiple. These results demonstrate the potential of combining high- and low-dimensional features with dual cross-attention for social media sentiment analysis.
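The abstract does not specify the architecture in detail, but the core idea (split each modality's features into low- and high-dimensional slices, run a separate cross-attention per scale, then fuse) can be illustrated with a minimal NumPy sketch. All names here (`dual_cross_attention`, the `split` point of 64, token/patch counts) are illustrative assumptions, not the authors' implementation, and learned projection weights are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats):
    """Scaled dot-product cross-attention: queries from one modality,
    keys/values from the other (projection matrices omitted)."""
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)   # (Nq, Nk) affinities
    return softmax(scores, axis=-1) @ kv_feats   # (Nq, d) attended output

def dual_cross_attention(text, image, split=64):
    """Hypothetical sketch of the paper's idea: slice each modality's
    feature vectors into a low-dimensional and a high-dimensional part,
    apply a separate cross-attention per scale, then fuse by concatenation."""
    t_lo, t_hi = text[:, :split], text[:, split:]
    v_lo, v_hi = image[:, :split], image[:, split:]
    fused_lo = cross_attention(t_lo, v_lo)       # low-scale branch
    fused_hi = cross_attention(t_hi, v_hi)       # high-scale branch
    return np.concatenate([fused_lo, fused_hi], axis=-1)

rng = np.random.default_rng(0)
text_feats = rng.standard_normal((8, 256))    # e.g. 8 text tokens, dim 256
image_feats = rng.standard_normal((16, 256))  # e.g. 16 image patches, dim 256
out = dual_cross_attention(text_feats, image_feats)
print(out.shape)  # (8, 256): one fused vector per text token
```

The fused output preserves the text-side token count while mixing in image information at both scales; in a full model each branch would have its own learned query/key/value projections before the dot product.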