Gated Cross-modal Attention and Multimodal Homogeneous Feature Discrepancy Learning for Speech Emotion Recognition

Authors: Feng Li, Jiusong Luo
Conference: ICIC 2024 Posters, Tianjin, China, August 5-8, 2024
Pages: 329-338
Keywords: Speech emotion recognition, Multimodal, Wav2vec 2.0, Cross-modal attention mechanism

Abstract

Understanding human emotions from speech is crucial for computers to comprehend human intentions. Human emotions are expressed in a wide variety of forms, including speech, text, and facial expressions. However, most speech emotion recognition methods fail to consider the interactions between these different information sources. We therefore propose a multimodal speech emotion recognition framework that integrates information from different modalities via a gated cross-modal attention mechanism and multimodal homogeneous feature discrepancy learning. Specifically, we first extract acoustic, visual, and textual features using different pre-trained models. Then, an A-GRU-LVC (Auxiliary Gated Recurrent Unit with Learnable Vision Center) and an A-GRU (Auxiliary Gated Recurrent Unit) further extract emotion-related information from the visual and textual features, respectively. Additionally, we design a gated cross-modal attention mechanism to dynamically fuse the multimodal features. Finally, we introduce multimodal homogeneous feature discrepancy learning to better capture differences among samples of different emotions. Evaluation results show that the proposed model achieves better recognition performance than previous methods on the IEMOCAP dataset.
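The abstract names a gated cross-modal attention mechanism but does not specify its formulation. Below is a minimal PyTorch sketch of how such a block could be realized, assuming standard multi-head cross-attention from one modality (e.g. acoustic) onto another (e.g. textual or visual) followed by a sigmoid gate on the attended features; the class name, dimensions, and gating formulation are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GatedCrossModalAttention(nn.Module):
    """Illustrative gated cross-modal attention block (assumed design).

    A query modality attends over a context modality; a learned sigmoid
    gate then controls how much of the attended context is mixed back
    into the query stream via a residual connection.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate is conditioned on both the query features and the attended context.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # query:   (batch, T_q, dim), e.g. wav2vec 2.0 acoustic features
        # context: (batch, T_c, dim), e.g. textual or visual features
        attended, _ = self.attn(query, context, context)
        g = self.gate(torch.cat([query, attended], dim=-1))  # element-wise gate in (0, 1)
        fused = query + g * attended                          # gated residual fusion
        return self.norm(fused)

if __name__ == "__main__":
    acoustic = torch.randn(2, 100, 256)  # hypothetical acoustic sequence
    textual = torch.randn(2, 40, 256)    # hypothetical textual sequence
    fusion = GatedCrossModalAttention(dim=256)
    print(fusion(acoustic, textual).shape)  # torch.Size([2, 100, 256])
```

In this sketch the gate lets the model attenuate cross-modal information when the context modality is uninformative for a given frame, which matches the abstract's goal of dynamically fusing multimodal features; the actual paper may differ in where the gate is applied and how the modalities are paired.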