DMP-TTS: Disentangled Multi-modal Prompting for Controllable Text-to-Speech with Chained Guidance

Kang Yin1, Chunyu Qiang2, Sirui Zhao1, Xiaopeng Wang2, Yuzhe Liang2, Pengfei Cai1, Tong Xu1, Chen Zhang2, Enhong Chen1

1 University of Science and Technology of China, Hefei, China

2 Kling AI, Kuaishou Technology, Beijing, China

Abstract

Controllable text-to-speech (TTS) systems face significant challenges in achieving independent manipulation of speaker timbre and speaking style, often suffering from entanglement between these attributes. We present DMP-TTS, a latent Diffusion Transformer (DiT) framework with explicit disentanglement and multi-modal prompting. A CLAP-based style encoder (Style-CLAP) aligns cues from reference audio and descriptive text in a shared space and is trained with contrastive learning plus multi-task supervision on style attributes. For fine-grained control during inference, we introduce chained classifier-free guidance (cCFG) trained with hierarchical condition dropout, enabling independent adjustment of content, timbre, and style guidance strengths. Additionally, we employ Representation Alignment (REPA) to distill acoustic-semantic features from a pretrained Whisper model into intermediate DiT representations, stabilizing training and accelerating convergence. Experiments show that DMP-TTS delivers stronger style controllability than open-source baselines while maintaining competitive intelligibility and naturalness.

Model Architecture

Fig. 1: Overview of the DMP-TTS framework. It employs a unified multi-modal style encoder (Style-CLAP) to disentangle style from speaker timbre. The style encoder is trained with a combination of contrastive loss and multi-task supervision to learn discriminative style representations.
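The contrastive and multi-task objectives mentioned in the caption can be sketched as follows. This is a minimal illustration, assuming a symmetric InfoNCE-style loss between paired audio and text style embeddings plus per-attribute classification heads (e.g. emotion, pace, energy); the function and argument names are illustrative, not the released implementation.

```python
import torch
import torch.nn.functional as F

def style_clap_loss(audio_emb, text_emb, attr_logits, attr_labels, temperature=0.07):
    """Contrastive alignment of audio/text style embeddings plus multi-task
    supervision on style attributes.

    audio_emb, text_emb: (B, D) style embeddings for matched audio/text pairs.
    attr_logits: dict of attribute name -> (B, C_k) logits from auxiliary heads.
    attr_labels: dict of attribute name -> (B,) integer class labels.
    """
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = a @ t.T / temperature                          # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # matched pairs lie on the diagonal
    contrastive = 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.T, targets))
    multitask = sum(F.cross_entropy(attr_logits[k], attr_labels[k]) for k in attr_logits)
    return contrastive + multitask
```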

The architecture of DMP-TTS, shown in Fig. 1, is built upon stacked Diffusion Transformer (DiT) blocks. The core idea is to progressively transform random Gaussian noise into the target mel-spectrogram representation. This generation process is guided by three distinct inputs:

  • Content: Provided by a text encoder based on the input text.
  • Timbre: Extracted from a reference audio clip by a speaker encoder.
  • Style: Generated by our proposed Style-CLAP encoder, which can take either reference audio or descriptive text as a style prompt.
A duration predictor, conditioned on both text and style, is also used to control the rhythm and phoneme-level alignment of the synthesized speech.
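A minimal sketch of how the three conditioning streams and the duration predictor could be wired around the DiT stack is given below. The encoders, the DiT blocks, and the keyword-argument conditioning interface are placeholders rather than the actual DMP-TTS modules, and length regulation using the predicted durations is omitted for brevity.

```python
import torch.nn as nn

class ConditioningSketch(nn.Module):
    """Illustrative wiring of content, timbre, and style conditions into a DiT stack."""

    def __init__(self, text_encoder, speaker_encoder, style_clap, dit_blocks, duration_predictor):
        super().__init__()
        self.text_encoder = text_encoder             # content from the input text
        self.speaker_encoder = speaker_encoder       # timbre from a reference audio clip
        self.style_clap = style_clap                 # style from reference audio OR descriptive text
        self.dit_blocks = nn.ModuleList(dit_blocks)  # stacked DiT blocks
        self.duration_predictor = duration_predictor # rhythm, conditioned on text + style

    def forward(self, noisy_mel, timestep, text, timbre_ref, style_prompt):
        content = self.text_encoder(text)                    # content condition
        timbre = self.speaker_encoder(timbre_ref)            # timbre condition
        style = self.style_clap(style_prompt)                # style condition
        durations = self.duration_predictor(content, style)  # phoneme-level alignment
        h = noisy_mel
        for block in self.dit_blocks:                        # progressive denoising of the mel latent
            h = block(h, timestep, content=content, timbre=timbre, style=style)
        return h, durations
```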

Experiments

The audio samples on this page demonstrate the performance of our model. The model was trained on a high-quality, internal Chinese speech dataset of approximately 300 hours, featuring around 1,000 speakers.

All examples presented below are from our carefully constructed test set. This means the speakers and sentences were completely unseen by the model during training, ensuring a fair and robust evaluation of its generalization capabilities. The test set was specifically designed for cross-speaker style transfer tasks.

For a comprehensive comparison, we evaluate DMP-TTS against recent state-of-the-art models: CosyVoice and CosyVoice2, which leverage text-based style prompting, and IndexTTS2, which employs an audio-based style prompt.

Comparison with Text-based Style Prompting

In this section, we compare DMP-TTS with the text-prompted models CosyVoice and CosyVoice2 using identical inputs. As the samples show, DMP-TTS provides superior control over stylistic attributes while maintaining a competitive level of speech naturalness.

Content Text | Style Prompt | Timbre Prompt | CosyVoice | CosyVoice2 | DMP-TTS (Ours)
我在现场已经完全兴奋起来了呢! (I'm already completely excited here at the scene!) | This speech is delivered in a happy tone, at a fast pace, and with moderate energy.
要是被他发现我也有理由的喽。 (Even if he finds me out, I still have an excuse.) | This speech is delivered in a sad tone, at a moderate pace, and with low energy.
你没时间陪我! (You don't have time for me!) | This speech is delivered in an angry tone, at a moderate pace, and with high energy.
他保证入会期间,不抽烟,不嚼烟,不渎神。 (He promised that during his membership he would not smoke, chew tobacco, or blaspheme.) | This speech is delivered in a neutral tone, at a slow pace, and with moderate energy.

Comparison with Audio-based Style Prompting

Here, we compare DMP-TTS with IndexTTS2, a model that utilizes a reference audio clip as the style prompt. Both models are tasked with synthesizing speech using a timbre prompt and a style prompt. Our approach appears to offer improved performance in timbre similarity and the preservation of fine-grained acoustic details.

Content Text | Timbre Prompt | Style Prompt | IndexTTS2 | DMP-TTS (Ours)
购票款两百。 (The ticket payment is two hundred.)
手机也可以拍照啊。 (Phones can take photos too.)
这是多么古老的一张照片啊! (What an old photograph this is!)
应该是不能和你一起当然伤心了哈。 (I suppose I can't be with you, so of course I'm sad.)

Effect of Representation Alignment (REPA)

Our model incorporates a Representation Alignment (REPA) strategy to guide the training process. The table below illustrates the impact of REPA by comparing synthesis quality at identical training steps for models trained with and without it. Models trained with REPA reach noticeably lower word error rates and higher perceived naturalness earlier in training, highlighting REPA's contribution to faster convergence and more stable training.
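For reference, a REPA-style alignment term can be written as a small auxiliary loss on an intermediate DiT layer. The choice of layer, the projector, and the negative-cosine form below are illustrative assumptions; the pretrained Whisper encoder is kept frozen and only provides targets.

```python
import torch.nn.functional as F

def repa_loss(dit_hidden, whisper_feats, projector):
    """Align intermediate DiT hidden states with frozen Whisper encoder features.

    dit_hidden:    (B, T, D_dit) hidden states from a chosen DiT block.
    whisper_feats: (B, T, D_whisper) features from a frozen, pretrained Whisper
                   encoder, assumed already resampled to the mel frame rate.
    projector:     small trainable module mapping D_dit -> D_whisper.
    """
    pred = F.normalize(projector(dit_hidden), dim=-1)
    target = F.normalize(whisper_feats.detach(), dim=-1)  # no gradients into Whisper
    return 1.0 - (pred * target).sum(dim=-1).mean()       # 1 - cosine similarity per frame
```

Such a term would be added to the diffusion training objective with a modest weight.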

Training Step | Ground Truth | DMP-TTS (With REPA) | Without REPA
Step 10000
Step 12000
Step 15000
Step 20000
Step 30000

Effect of Guidance Scale: Text Style Prompt

The table below demonstrates the effect of varying the style guidance scale while keeping the content and timbre guidance scales fixed (Content=9, Timbre=12). A higher style scale strengthens the stylistic attributes described in the text prompt, ranging from subtle (Scale=6) to highly expressive (Scale=18).
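The scale values in this section and the next are the per-condition weights of chained classifier-free guidance (cCFG). One plausible form of the chained combination, assuming the model accepts None for a dropped condition (matching the hierarchical condition dropout used in training), is sketched below; this is an illustrative reading, not the exact released implementation.

```python
def chained_cfg(model, x_t, t, content, timbre, style,
                w_content=9.0, w_timbre=12.0, w_style=12.0):
    """One denoising step with chained classifier-free guidance.

    Each term adds one condition on top of the previous ones, so the content,
    timbre, and style guidance strengths can be tuned independently.
    `model(x, t, ...)` returning a noise/velocity prediction is an assumed interface.
    """
    eps_null = model(x_t, t, content=None,    timbre=None,   style=None)
    eps_c    = model(x_t, t, content=content, timbre=None,   style=None)
    eps_ct   = model(x_t, t, content=content, timbre=timbre, style=None)
    eps_cts  = model(x_t, t, content=content, timbre=timbre, style=style)
    return (eps_null
            + w_content * (eps_c   - eps_null)   # content guidance
            + w_timbre  * (eps_ct  - eps_c)      # timbre guidance
            + w_style   * (eps_cts - eps_ct))    # style guidance
```

Under this reading, the style scale varied in the table below corresponds to w_style, and the timbre scale examined in the next section corresponds to w_timbre.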

Content Text | Style Text | Timbre Prompt | Style Scale=6 | Style Scale=12 | Style Scale=18
不要和我说这么多,就是不行。 (Stop arguing with me; the answer is simply no.) | This speech is delivered in an angry tone, at a fast pace, and with high energy.
赢的机会是多少。 (What are the chances of winning?) | This speech is delivered in a sad tone, at a moderate pace, and with high energy.

Effect of Guidance Scale: Timbre Guidance

This table illustrates the impact of the timbre guidance scale, with the content and style guidance scales fixed (Content=9, Style=12). A higher value brings the synthesized voice closer to the timbre prompt, with the effect becoming more pronounced from Scale=6 to Scale=18.

Content Text | Style Text | Timbre Prompt | Timbre Scale=6 | Timbre Scale=12 | Timbre Scale=18
我每天晚上都睡得很香嘞! (I sleep very soundly every night!) | This speech is delivered in a happy tone, at a moderate pace, and with high energy.
好来,我能帮你的就这么多了。 (Alright, that's all I can do to help you.) | This speech is delivered in a sad tone, at a moderate pace, and with moderate energy.