TM-SPEECH: END-TO-END TEXT TO SPEECH BASED ON INTEGRATING TRANSFORMER AND MAMBA

Authors: Long Wang, Zichao Deng, Haoke Hou, Song Shen, Wancheng He, and Haohan Ding
Conference: ICIC 2025 Posters, Ningbo, China, July 26-29, 2025
Pages: 1835-1846
Keywords: text-to-speech, speech synthesis, Transformer, Mamba

Abstract

Text-to-speech synthesis is the process of converting natural language text into speech. In recent years, deep learning has made significant strides in this field. Although the Transformer model is effective at capturing dependencies, the quadratic complexity of its attention mechanism results in longer training times and increased costs. Recent advances in state-space models (SSMs) have demonstrated impressive performance in modeling long-range dependencies owing to their sub-quadratic complexity. Mamba, a notable example of an SSM, exhibits linear time complexity and excels in long-sequence tasks such as those found in natural language. In this paper, we propose TM-Speech, which integrates Mamba for modeling long-range dependencies and the Transformer for capturing short-range dependencies, thereby reducing model training costs. Comparative experiments show that TM-Speech is almost 2× smaller and 3× faster than FastSpeech2 during training, while also achieving superior quality of the synthesized audio. The code is available at https://github.com/Apolarity886/TMSpeech.
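
To make the hybrid idea concrete, the following is a minimal PyTorch sketch, not the authors' code: it routes hidden states through a linear-time recurrent module for long-range context and a standard Transformer encoder layer for short-range context. The GatedLinearRecurrence class here is a deliberately simplified, hypothetical stand-in for a real Mamba layer (the actual selective SSM is more involved), and all names, dimensions, and the block layout are illustrative assumptions rather than the TM-Speech architecture.

```python
import torch
import torch.nn as nn


class GatedLinearRecurrence(nn.Module):
    """Hypothetical stand-in for Mamba: a gated recurrence with O(L) time in sequence length."""

    def __init__(self, d_model: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.decay = nn.Parameter(torch.full((d_model,), -2.0))  # learned per-channel decay logit
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model)
        u, g = self.in_proj(x).chunk(2, dim=-1)
        a = torch.sigmoid(self.decay)            # decay factor in (0, 1)
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.size(1)):               # linear-time scan over the sequence
            h = a * h + (1 - a) * u[:, t]
            outs.append(h)
        y = torch.stack(outs, dim=1) * torch.sigmoid(g)  # gate the recurrent output
        return self.out_proj(y)


class HybridTMBlock(nn.Module):
    """One hybrid block: recurrence for long-range context, self-attention for short-range."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.long_range = GatedLinearRecurrence(d_model)
        self.short_range = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm(x + self.long_range(x))    # residual long-range pass
        return self.short_range(x)               # local attention pass


if __name__ == "__main__":
    phoneme_hidden = torch.randn(2, 120, 256)    # (batch, frames, features), shapes are illustrative
    print(HybridTMBlock()(phoneme_hidden).shape)  # torch.Size([2, 120, 256])
```

The point of the split is cost: the recurrent pass scales linearly with sequence length, so only the attention layer pays the quadratic price, which is the kind of trade-off the abstract attributes to combining Mamba with the Transformer.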