HMoE-SiMBA: Heterogeneous Mixture-of-Experts with SiMBA Attention for Robust Chinese Speech Emotion Recognition

Authors: Lu Wang and Xinyue Duan
Conference: ICIC 2025 Posters, Ningbo, China, July 26-29, 2025
Pages: 1323-1339
Keywords: Chinese speech emotion recognition, State Space Models, Heterogeneous Mixture-of-Experts

Abstract

Speech Emotion Recognition (SER) for Mandarin Chinese is crucial for human-computer interaction, yet it faces challenges in real-world applications because of the language's distinctive tonal and prosodic features. Existing methods are limited in feature extraction, model generalization, and computational efficiency. To address these issues, we propose HMoE-SiMBA, a novel framework that combines Heterogeneous Mixture-of-Experts (HMoE) with Simplified Mamba-Based Architecture (SiMBA) attention to improve stability and generalization in Chinese SER. Our approach employs a multi-modal feature representation layer to comprehensively capture emotional cues, uses heterogeneous feature extractors with dynamic routing to enhance feature adaptability, and combines EinFFT and Mamba for efficient sequence modeling. Experiments on the CASIA dataset demonstrate that HMoE-SiMBA achieves 92.2% accuracy, significantly outperforming existing methods while remaining robust in complex acoustic environments.