MMCFusionNet: A Multimodal Mixture of Experts and Collaborative Attention Fusion Network for Abnormal Emotion Recognition

Authors: Haitao Xiong, Xin Zhou, Wei Jiao, and Yuanyuan Cai
Conference: ICIC 2025 Posters, Ningbo, China, July 26-29, 2025
Pages: 2982-2994
Keywords: Emotion recognition · Multi-modal learning · Multi-modal fusion · Mixture of Experts · Collaborative Attention

Abstract

The growing popularity of short videos on social media has introduced new challenges for content moderation, particularly in detecting abnormal emotions such as hate and sarcasm. These emotions typically exhibit greater concealment and multimodal inconsistency than conventional ones. While prior studies have focused primarily on conventional emotion recognition, research on abnormal emotions remains limited. Moreover, existing models often fail to fully leverage the complementary nature of multimodal data and lack robust intermodal interactions. This study proposes MMCFusionNet, a novel multimodal fusion framework designed for abnormal emotion recognition in short videos. The model extracts and aligns features from four modalities (text, visual, audio, and facial) through a dedicated feature encoder and alignment module to improve the recognition of hate and sarcasm. At its core, the model integrates two key mechanisms: (1) Mixture of Experts (MoE) modules that enhance intramodal representations across temporal frames to identify concealed emotional cues, and (2) dual-channel collaborative attention (Co-Attention) modules that facilitate intermodal complementarity to resolve multimodal contradictions. Experimental results on the HateMM and MUStARD datasets show that MMCFusionNet outperforms baseline models across various evaluation metrics, with ablation studies confirming the effectiveness and robustness of each module.
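To make the two core mechanisms concrete, the PyTorch sketch below illustrates one plausible reading of the abstract: a soft Mixture-of-Experts block applied per modality over temporal frames, and a dual-channel co-attention block in which each modality attends to the other. All class names, dimensions, the gating scheme, and the residual fusion are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEBlock(nn.Module):
    """Soft mixture of feed-forward experts over per-frame features
    (a sketch of the intramodal MoE enhancement the abstract describes)."""

    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)  # per-frame soft gating (assumed)

    def forward(self, x):  # x: (batch, frames, dim)
        weights = F.softmax(self.gate(x), dim=-1)                        # (B, T, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-2)   # (B, T, E, D)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=-2)          # (B, T, D)


class CoAttention(nn.Module):
    """Dual-channel cross-attention: each modality queries the other,
    a common reading of 'dual-channel collaborative attention'."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.a_to_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b_to_a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, a, b):  # a, b: (batch, frames, dim), assumed temporally aligned
        a_enh, _ = self.a_to_b(query=a, key=b, value=b)  # a attends to b
        b_enh, _ = self.b_to_a(query=b, key=a, value=a)  # b attends to a
        return a + a_enh, b + b_enh                      # residual fusion (assumed)


if __name__ == "__main__":
    # Toy aligned per-frame features for two of the four modalities.
    text = torch.randn(2, 16, 256)
    video = torch.randn(2, 16, 256)
    moe, coattn = MoEBlock(dim=256), CoAttention(dim=256)
    text, video = moe(text), moe(video)    # intramodal enhancement
    text, video = coattn(text, video)      # intermodal complementarity
    fused = torch.cat([text, video], dim=-1).mean(dim=1)  # clip-level vector
    print(fused.shape)                     # torch.Size([2, 512])
```

In the full model the abstract describes, such blocks would presumably be applied across all four aligned modality streams before a classification head; the pairing and pooling shown here are placeholders for that design.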