
IndexTTS2: Breakthrough Emotional Expression & Duration Control TTS

The first autoregressive TTS model with precise duration control and emotion-speaker decoupling

Bilibili's open-source, industrial-grade zero-shot TTS system with precise duration control, emotional synthesis, and Chinese & English voice cloning, designed for professional dubbing, game development, podcast creation, and more.

🎬 Video Demo Showcase

Demonstrating IndexTTS2's core technical capabilities through real-world examples

🚀 Core Features

IndexTTS2's four major innovative technological breakthroughs

Precise Duration Control

Control speech duration precisely by specifying the number of generated tokens, enabling tight synchronization with video (a small arithmetic sketch follows the audio samples below)

Normal Duration:

Compressed Duration:
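
The duration control above ultimately comes down to budgeting how many codec tokens the model may emit. As a rough illustration, the mapping from a target duration to a token budget is simple arithmetic; the codec frame rate used below is an assumed value, not a figure published for IndexTTS2.

```python
# Illustrative only: the codec frame rate is an assumption, not an
# IndexTTS2 specification.
CODEC_TOKENS_PER_SECOND = 25  # assumed: one codec token per 40 ms frame

def duration_to_token_budget(target_seconds: float) -> int:
    """Convert a desired speech duration into a codec-token budget."""
    return round(target_seconds * CODEC_TOKENS_PER_SECOND)

# A 3.2 s dialogue line maps to an 80-token budget under the assumed rate.
print(duration_to_token_budget(3.2))  # 80
```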

Emotion Decoupling

Separates emotional features from speaker characteristics, so timbre and emotion can be controlled independently

Stern Style:

Gentle Style:

Zero-shot Synthesis

Clone a speaker's voice without any model training, using only a short reference clip (zero-/few-shot)

Reference Audio:

Cloning Result:

Cross-language Capability

Supports Chinese and English voice cloning, preserving speaker characteristics in cross-language synthesis

Chinese Reference:

English Synthesis:

🎮 Try IndexTTS2

Experience the powerful capabilities of zero-shot speech synthesis


Usage Instructions

📝 Text Input

Enter the text you want to synthesize; mixed Chinese and English input is supported

🎤 Reference Audio

Upload a reference audio file; the model will clone its timbre and style

⏱️ Duration Control

Adjust the duration parameter to control the length of the generated speech precisely

😊 Emotion Control

Select emotion types to generate expressive speech

🔬 Technical Deep Dive

In-depth understanding of IndexTTS2's core technical architecture and innovative breakthroughs

Model Architecture

IndexTTS2 adopts a Transformer-based autoregressive architecture, modeling the TTS task as a language modeling problem. The model achieves more natural speech synthesis by generating discrete audio codec tokens rather than traditional spectrograms; a schematic sketch of this generation loop follows the list below.

  • Encoder-Decoder Structure: Supports conditional generation and context learning
  • Multimodal Fusion: Unified modeling of text, audio, and emotional information
  • Attention Mechanism: Precise duration control and emotional alignment
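
To make the description above concrete, here is a deliberately tiny, self-contained sketch of an autoregressive decoder emitting discrete audio-codec tokens. It is not the IndexTTS2 implementation: the model size, the conditioning prompt, and greedy sampling are all stand-ins, and a neural codec decoder would still be needed to turn the tokens back into a waveform.

```python
# Schematic sketch (not IndexTTS2 code): a decoder-only transformer predicts
# the next discrete audio token given the tokens generated so far.
import torch
import torch.nn as nn

VOCAB = 1024   # assumed codec codebook size
D_MODEL = 256  # assumed model width

class TinyARDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens):  # tokens: (batch, seq) of token ids
        x = self.embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.backbone(x, mask=mask))  # next-token logits

@torch.no_grad()
def generate(model, prompt, max_new_tokens):
    """Greedy autoregressive sampling of discrete audio tokens.

    `max_new_tokens` is where a token budget would enforce the duration
    control described earlier on this page.
    """
    tokens = prompt
    for _ in range(max_new_tokens):
        logits = model(tokens)[:, -1]              # logits for the next token
        next_tok = logits.argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens

model = TinyARDecoder().eval()
prompt = torch.randint(0, VOCAB, (1, 8))  # stand-in for text/speaker conditioning
codec_tokens = generate(model, prompt, max_new_tokens=32)
print(codec_tokens.shape)                 # torch.Size([1, 40])
```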

Training Data

The model is trained on large-scale multilingual and multi-emotional datasets, ensuring excellent generalization and emotional expressiveness.

  • Data Scale: Over 100,000 hours of high-quality speech data
  • Language Coverage: Chinese, English and other multilingual support
  • Emotion Annotation: 7 basic emotions + fine-grained emotion control
  • Speaker Diversity: 1000+ different speaker characteristics

📊 Performance Metrics

IndexTTS2 achieves industry-leading performance across multiple key metrics

Word Error Rate (WER)

2.1%

40% reduction compared to baseline models

Speaker Similarity

0.92

Cosine similarity between speaker embeddings (a short computation sketch follows these metrics)

Emotion Fidelity

89.3%

Emotion classification accuracy
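
The speaker-similarity figure above is the standard cosine similarity between speaker embeddings extracted from the reference audio and the synthesized audio. A minimal computation, with the speaker-embedding extractor itself assumed and out of scope:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker-embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy example with random 256-dim vectors; in practice `a` and `b` would come
# from a speaker-verification model applied to the reference and the output.
rng = np.random.default_rng(0)
a, b = rng.normal(size=256), rng.normal(size=256)
print(round(cosine_similarity(a, b), 3))
```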

⚡ Model Comparison Analysis

| Model | Duration Control | Emotion Decoupling | Zero-shot Capability | Naturalness Score |
|---|---|---|---|---|
| IndexTTS2 | Precise | Supported | Excellent | 4.6/5.0 |
| MaskGCT | Not Supported | Limited | Good | 4.2/5.0 |
| F5-TTS | Basic | Not Supported | Good | 4.0/5.0 |
| CosyVoice2 | Not Supported | Limited | Average | 3.8/5.0 |

* Scores based on public benchmarks and user evaluations, out of 5.0

🎯 Application Scenarios

Empowering your projects

Video Dubbing & Post-production

Provide professional-grade dubbing with precise lip-sync and consistent emotions for movies, series, advertisements, and short videos

Movie Dubbing · Advertisement Production · Short Videos

Game Development & Virtual Characters

Generate dynamic, emotionally expressive dialogue speech for game NPCs, virtual streamers, and AI assistants

Game NPCs · Virtual Streamers · AI Assistants

Audiobooks & Podcasts

Easily generate multi-character, multi-emotional audiobooks and podcast content using minimal reference audio

Audiobooks · Podcast Production · Multi-character

Education & Training

Create engaging educational materials and multilingual courses to enhance learning experiences

Online Education · Language Learning · Training Courses

🚀 Quick Start

Get started with IndexTTS2 immediately
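
A hedged sketch of what a first synthesis call can look like is shown below. The import path, class name, and argument names are assumptions modeled on the project's published examples, not a guaranteed API; consult the official README (which recommends uv for dependencies and git lfs pull for the weights) for the exact installation steps and current usage.

```python
# Hedged sketch of a single IndexTTS2 synthesis call. The IndexTTS2 class,
# its import path, and the infer() arguments are assumptions and may differ
# from the current codebase -- check the official README.
from indextts.infer_v2 import IndexTTS2  # assumed import path

tts = IndexTTS2(
    cfg_path="checkpoints/config.yaml",  # assumed config location
    model_dir="checkpoints",             # directory holding the downloaded weights
)

tts.infer(
    spk_audio_prompt="examples/speaker.wav",  # timbre (speaker) reference
    emo_audio_prompt="examples/emotion.wav",  # assumed optional emotion reference
    text="IndexTTS2 decouples who is speaking from how it is spoken.",
    output_path="output.wav",
)
```

Under the emotion-speaker decoupling described above, the timbre reference and the (assumed) emotion reference would typically be allowed to come from different recordings.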

❓ Frequently Asked Questions

Questions you might be interested in

Q: What does "zero-shot" mean in IndexTTS2?

A: Officially, "zero-shot" means the model can clone the voice of an unseen target speaker and transfer emotion using only a short reference clip, with no additional training. In practice, zero-shot quality depends on reference audio quality, language, emotional intensity, and the synthesis parameters.

Q: What hardware does IndexTTS2 need, and can it run in real time?

A: According to the official documentation, IndexTTS2 requires at least 8 GB of VRAM for inference. Real-time performance depends on hardware configuration, text length, and model parameters; on high-end GPUs (RTX 4090, A100), near real-time synthesis is achievable.

Q: Can emotion be controlled with text alone, without an emotion reference audio?

A: Text-based emotion control is convenient but may not reach the precision and naturalness of an emotion reference audio. For professional applications requiring high emotional fidelity, an emotion reference audio is still recommended.

Q: Which languages does IndexTTS2 support?

A: The official examples cover Chinese and English, and the model supports bilingual Chinese/English output; the final acoustics and accent depend heavily on the training data and the quality of the reference audio.

Q: Can IndexTTS2 be used commercially?

A: Yes. IndexTTS2 is released under the Apache 2.0 open-source license, which permits commercial use. Please read and comply with the full license terms before use.

Q: How do I install IndexTTS2 and download the model weights?

A: The official README recommends managing dependencies with uv and running git lfs pull to fetch the weights. If downloads are slow, you can use a mirror or the huggingface-cli tool to download the models locally first and then load them (a short download sketch follows).
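
For the slow-download case above, here is a small sketch of fetching the weights through huggingface_hub with an optional mirror endpoint. snapshot_download and the HF_ENDPOINT variable are real huggingface_hub features; the repository id and mirror URL are assumptions to be replaced with the values given in the official README.

```python
# Hedged sketch: pre-download the weights, optionally through a mirror.
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # assumed mirror; set BEFORE the import below

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="IndexTeam/IndexTTS-2",  # assumed repo id; verify on Hugging Face
    local_dir="checkpoints",
)
print("weights downloaded to", local_dir)
```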

Q: Why does the synthesized audio sometimes contain artifacts, and how can I fix them?

A: Artifacts are usually caused by: 1) poor reference audio quality (noise, clipping); 2) unsuitable model parameters; 3) text content outside the model's training domain. Try a cleaner reference recording, adjust the synthesis parameters, or simplify the text.