MAVSA-DI: Mongolian Audio-Visual Sentiment Analysis Based on Deep Residual Shrinkage Network and Improved 3D-DenseNet

Authors: Ren Qing-Dao-Er-Ji, Qian Bo, Ying Lu, Yatu Ji, and Nier Wu
Conference: ICIC 2025 Posters, Ningbo, China, July 26-29, 2025
Pages: 1155-1167
Keywords: Deep Residual Shrinkage Network, Improved 3D-DenseNet, SPD-Conv, Feature Fusion, Mongolian Audio-Visual Sentiment Analysis.

Abstract

To address the issue of inaccurate extraction of key emotional features in Mongolian audio and video data, which leads to suboptimal sentiment classification performance, this paper proposes a Mongolian Audio-Visual Sentiment Analysis model based on Deep Residual Shrinkage Network and Improved 3D-DenseNet (MAVSA-DI). Specifically, the audio branch adopts a Deep Residual Shrinkage Network (DRSN) to suppress noise interference through a soft-thresholding mechanism and enhance the extraction of emotion-relevant acoustic features. The video branch employs an Improved 3D-DenseNet (I3DD) by integrating the SPD-Conv module, which combines the deep feature extraction capability of SPD-Conv with the dense connectivity of 3D-DenseNet to improve spatiotemporal feature learning from low-resolution facial expressions. Furthermore, Intra-Modal Attention (IMA) mechanisms are applied to both branches to highlight intra-modal key information, followed by Cross-Modal Attention (CMA) to facilitate effective feature fusion. Experimental results demonstrate that the proposed model significantly outperforms several advanced baselines in terms of classification accuracy for Mongolian Audio-Visual Sentiment Analysis (MAVSA).
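The soft-thresholding operation mentioned for the audio branch can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: in a Deep Residual Shrinkage Network the threshold is typically learned per channel by a small sub-network, whereas here `tau` is a fixed value chosen only for illustration, and the function name `soft_threshold` and the sample `features` array are hypothetical.

```python
import numpy as np


def soft_threshold(x, tau):
    """Shrink every value toward zero by tau; values with |x| <= tau become 0.

    tau is a fixed, hand-picked threshold here (an assumption for illustration);
    a DRSN would normally learn it from the feature map itself.
    """
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


# Hypothetical feature values: small-magnitude entries stand in for noise.
features = np.array([-1.5, -0.2, 0.0, 0.3, 2.0])
shrunk = soft_threshold(features, tau=0.5)
```

Entries whose magnitude is at most `tau` are zeroed, while larger entries keep their sign and lose `tau` in magnitude, which is how the operation suppresses low-amplitude noise while retaining stronger responses.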