CDSS: Innovating Cross Differential Attention for Robust Monaural Multi-Speaker Audio-Visual Speech Separation
Authors:
Yinlong Zhang, Jinjiang Liu, Jiawei Jin, Jiuxin Lin, and Zhiyong Wu
Conference:
ICIC 2025 Posters, Ningbo, China, July 26-29, 2025
Pages:
1866-1883
Keywords:
Speech Separation, Audio-Visual, Cross Differential Attention, Multi-Speaker Scenarios
Abstract:
Target Speaker Extraction (TSE) based on visual cues has been widely adopted and further extended to Audio-Visual Multi-Speaker Speech Separation (AV-MSS) through either simultaneous multi-speaker processing or recursive approaches. However, in real-world scenarios, obtaining complete visual information for all speakers is often impractical due to data collection constraints. Existing methods mostly use basic self-attention mechanisms to model correlations between separated speech streams to mitigate missing visual cues. Nevertheless, these approaches overlook the critical distinction between speech signals with auxiliary visual information and those without, resulting in performance degradation when modalities are incomplete. To address this, we propose a novel Cross Differential Attention (CDA) mechanism that performs cross-modal differentiation, effectively highlighting the salient disparities between modalities. This design enables the model to adaptively emphasize informative, modality-specific features, thereby significantly improving robustness and effectiveness in both complete and missing-visual scenarios. Extensive experiments validate our method’s superiority, demonstrating state-of-the-art performance on both two-speaker and three-speaker mixture tasks.
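The abstract does not give the CDA formulation, so the following is only an illustrative sketch of one possible reading of "cross-modal differentiation": two softmax attention maps are computed from the same queries against two different key sets, and the second map is subtracted from the first with a weight before the values are mixed. The function names, the time-aligned key sets `keys_a` and `keys_b`, and the fixed scalar `lam` are all assumptions made for this example and are not taken from the paper.

```python
import math


def softmax(row):
    # Numerically stable softmax over a list of floats.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    # Scaled dot-product attention weights, one row per query.
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]


def cross_differential_attention(queries, keys_a, keys_b, values, lam=0.5):
    # Hypothetical reading: subtract the attention map over keys_b from the
    # map over keys_a, then use the difference to mix the values.
    # keys_a, keys_b, and values are assumed to share the same length.
    map_a = attention_map(queries, keys_a)
    map_b = attention_map(queries, keys_b)
    dim = len(values[0])
    out = []
    for row_a, row_b in zip(map_a, map_b):
        weights = [a - lam * b for a, b in zip(row_a, row_b)]
        out.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
        )
    return out
```

In this sketch, positions that both key sets attend to equally are suppressed, while positions where the two maps disagree keep a larger weight. Setting `lam=0` reduces it to ordinary cross-attention over `keys_a`.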
BibTeX Citation:
@inproceedings{ICIC2025,
author = {Yinlong Zhang and Jinjiang Liu and Jiawei Jin and Jiuxin Lin and Zhiyong Wu},
title = {CDSS: Innovating Cross Differential Attention for Robust Monaural Multi-Speaker Audio-Visual Speech Separation},
booktitle = {Proceedings of the 21st International Conference on Intelligent Computing (ICIC 2025)},
month = {July},
date = {26-29},
year = {2025},
address = {Ningbo, China},
pages = {1866--1883},
note = {Poster Volume Ⅱ}
}