IndexTTS2: Breakthrough Emotional Expression & Duration Control TTS
First autoregressive TTS model with precise duration control, supporting emotion-speaker decoupling
Bilibili's open-source, industrial-grade zero-shot TTS system supports precise duration control, emotion-controllable synthesis, and Chinese & English voice cloning, and is designed for professional dubbing, game development, podcast creation, and more.
🎬 Video Demo Showcase
Demonstrating IndexTTS2's core technical capabilities through real-world examples
🚀 Core Features
IndexTTS2's four major innovative technological breakthroughs
Precise Duration Control
Control speech duration precisely by specifying the number of generated tokens, keeping speech in sync with video (a small sketch follows the samples below)
Normal Duration:
Compressed Duration:
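For intuition only, assuming a fixed number of codec tokens per second (the 25 below is an illustrative figure, not an official IndexTTS2 constant), a token budget for a target duration can be computed like this:

```python
# Illustrative only: maps a target duration to a codec-token budget.
# TOKENS_PER_SECOND is an assumed value, not an official IndexTTS2 constant.
TOKENS_PER_SECOND = 25

def duration_to_token_count(duration_seconds: float) -> int:
    """Return the number of speech tokens to request for a target duration."""
    return round(duration_seconds * TOKENS_PER_SECOND)

# e.g. a 3.2 s video slot -> 80 tokens
print(duration_to_token_count(3.2))
```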
Emotion Decoupling
Separate emotional features from speaker characteristics so that timbre and emotion can be controlled independently
Stern Style:
Gentle Style:
Zero-shot Synthesis
Clone any speaker's voice without additional training, with support for few-shot learning
Reference Audio:
Cloning Result:
Cross-language Capability
Clone voices across Chinese and English, preserving speaker characteristics in cross-language synthesis
Chinese Reference:
English Synthesis:
🎮 Try IndexTTS2
Experience the powerful capabilities of zero-shot speech synthesis
Usage Instructions
📝 Text Input
Enter the text you want to synthesize; mixed Chinese and English input is supported
🎤 Reference Audio
Upload a reference audio file; the model will clone its timbre and style
⏱️ Duration Control
Adjust speech duration parameters to achieve precise duration control
😊 Emotion Control
Select emotion types to generate expressive speech
🔬 Technical Deep Dive
In-depth understanding of IndexTTS2's core technical architecture and innovative breakthroughs
Model Architecture
IndexTTS2 adopts a Transformer-based autoregressive architecture that frames TTS as a language-modeling problem. Instead of predicting traditional spectrograms, the model generates discrete audio codec tokens, which yields more natural speech synthesis (a minimal sketch of the decoding loop follows the list below).
- Encoder-Decoder Structure: Supports conditional generation and in-context learning
- Multimodal Fusion: Unified modeling of text, audio, and emotional information
- Attention Mechanism: Enables precise duration control and emotion alignment
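As a rough illustration of what "language modeling over codec tokens" means, here is a minimal, self-contained sketch of an autoregressive decoding loop; the names (toy_next_token_logits, VOCAB_SIZE, MAX_TOKENS) are stand-ins, not the actual IndexTTS2 components:

```python
# Minimal sketch of autoregressive codec-token generation (illustrative,
# not the actual IndexTTS2 code): conditioning is encoded once, then
# discrete speech tokens are sampled one at a time until a budget is reached.
import torch

VOCAB_SIZE = 1024   # assumed codec codebook size
MAX_TOKENS = 80     # duration control: stop after a fixed token budget

def toy_next_token_logits(cond: torch.Tensor, prefix: torch.Tensor) -> torch.Tensor:
    # Stand-in for the Transformer decoder's next-token prediction.
    return torch.randn(VOCAB_SIZE)

cond = torch.randn(16, 256)  # stand-in for encoded text/speaker/emotion conditioning
tokens = torch.empty(0, dtype=torch.long)
for _ in range(MAX_TOKENS):
    logits = toy_next_token_logits(cond, tokens)
    next_tok = torch.multinomial(torch.softmax(logits, dim=-1), 1)
    tokens = torch.cat([tokens, next_tok])

print(tokens.shape)  # this token sequence would then be decoded to a waveform by the codec
```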
Training Data
The model is trained on large-scale multilingual and multi-emotional datasets, ensuring excellent generalization and emotional expressiveness.
- Data Scale: Over 100,000 hours of high-quality speech data
- Language Coverage: Chinese, English, and other languages
- Emotion Annotation: 7 basic emotions plus fine-grained emotion control
- Speaker Diversity: 1000+ distinct speakers
📊 Performance Metrics
IndexTTS2 achieves industry-leading performance across multiple key metrics
Word Error Rate (WER)
2.1%
40% reduction compared to baseline models
Speaker Similarity
0.92
COSINE similarity score
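This score is the cosine similarity between speaker embeddings of the reference and the synthesized audio; a minimal sketch of the metric (with random stand-in embeddings, not the extractor used for the reported number):

```python
# Cosine similarity between two speaker embeddings (illustrative values only).
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ref_embedding = np.random.randn(192)                           # stand-in reference speaker embedding
synth_embedding = ref_embedding + 0.3 * np.random.randn(192)   # stand-in embedding of the cloned speech
print(round(cosine_similarity(ref_embedding, synth_embedding), 2))
```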
Emotion Fidelity
89.3%
Emotion classification accuracy
⚡ Model Comparison Analysis
| Model | Duration Control | Emotion Decoupling | Zero-shot Capability | Naturalness Score |
|---|---|---|---|---|
| IndexTTS2 | Precise | Supported | Excellent | 4.6/5.0 |
| MaskGCT | Not Supported | Limited | Good | 4.2/5.0 |
| F5-TTS | Basic | Not Supported | Good | 4.0/5.0 |
| CosyVoice2 | Not Supported | Limited | Average | 3.8/5.0 |
* Scores based on public benchmarks and user evaluations, out of 5.0
🎯 Application Scenarios
Empowering your projects
Video Dubbing & Post-production
Provide professional-grade dubbing with precise lip-sync and consistent emotions for movies, series, advertisements, and short videos
Game Development & Virtual Characters
Generate dynamic, emotionally expressive dialogue speech for game NPCs, virtual streamers, and AI assistants
Audiobooks & Podcasts
Easily generate multi-character, multi-emotional audiobooks and podcast content using minimal reference audio
Education & Training
Create engaging educational materials and multilingual courses to enhance learning experiences
🚀 Quick Start
Get started with IndexTTS2 immediately
Clone the Repository
First, clone the IndexTTS2 repository from GitHub to your local machine
git clone https://github.com/index-tts/index-tts.git
cd index-tts
Install Dependencies
Install the required Python dependencies using pip or conda
pip install -r requirements.txt
Or use conda: conda env create -f environment.yml
Download Model
Download the pre-trained IndexTTS2 model from Hugging Face
huggingface-cli download IndexTeam/IndexTTS-2
Make sure you have git-lfs installed: git lfs install
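Alternatively, the same repository can be fetched with the huggingface_hub Python API; the local_dir used here is an arbitrary choice, not an official layout:

```python
# Download the pre-trained weights with the huggingface_hub Python API.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="IndexTeam/IndexTTS-2", local_dir="checkpoints")
```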
Run Inference
Run your first IndexTTS2 inference
python inference.py --text "Hello, this is IndexTTS2!" --reference_audio path/to/reference.wav
Check the output directory for your generated audio file
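To synthesize several lines in one go, a small wrapper around the same command works; the script name and flags simply mirror the example above, so verify them against the current repository:

```python
# Batch-synthesis sketch that shells out to the inference command shown above.
# Script name and flags are taken from the Quick Start example; verify against the repo.
import subprocess

lines = [
    "Hello, this is IndexTTS2!",
    "Precise duration control makes dubbing much easier.",
]

for text in lines:
    subprocess.run(
        [
            "python", "inference.py",
            "--text", text,
            "--reference_audio", "path/to/reference.wav",
        ],
        check=True,
    )
```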
❓ Frequently Asked Questions
Common questions about IndexTTS2
Q: What does "zero-shot" actually mean here?
A: Officially, "zero-shot" means the model can perform voice cloning and emotion transfer for an unseen target speaker from only minimal reference audio (even a short clip). In practice, zero-shot performance is affected by reference audio quality, language, emotional intensity, and model parameters.
Q: What hardware does IndexTTS2 need, and can it run in real time?
A: According to the official documentation, IndexTTS2 requires at least 8GB of VRAM for inference. Real-time performance depends on hardware configuration, text length, and model parameters; on high-end GPUs (RTX 4090, A100), near real-time synthesis is achievable.
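A quick, illustrative way to check your GPU against the 8GB guideline with PyTorch (not part of IndexTTS2 itself):

```python
# Check available GPU memory against the ~8 GB inference guideline.
import torch

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU: {torch.cuda.get_device_name(0)}, {total_gb:.1f} GB VRAM")
    print("Meets 8 GB guideline" if total_gb >= 8 else "Below 8 GB guideline")
else:
    print("No CUDA GPU detected; inference will be slow on CPU.")
```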
Q: Can emotions be controlled through text alone?
A: Text-based emotion control is convenient but may not match the precision and naturalness of an emotion reference audio clip. For professional applications requiring high emotional fidelity, emotion reference audio is still recommended.
Q: Which languages are supported?
A: Official examples cover Chinese and English, and the model supports bilingual Chinese/English output; the final acoustic quality and accent depend heavily on the training data and the quality of the reference audio.
Q: Can IndexTTS2 be used commercially?
A: Yes. IndexTTS2 is released under the Apache 2.0 open-source license, which allows commercial use. However, please read and comply with the complete license terms before use.
Q: What is the recommended way to install dependencies and download the model weights?
A: The official README recommends using uv to manage dependencies and making sure git lfs pull is run so the weights are downloaded; if downloads are slow, use a mirror or the huggingface-cli tool to fetch the models locally first, then load them.
Q: Why does the synthesized audio contain artifacts or noise?
A: Artifacts are usually caused by: 1) poor reference audio quality (noise, clipping); 2) unsuitable model parameters; 3) text content outside the model's training domain. Try a high-quality reference clip, adjust the synthesis parameters, or simplify the text.
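Since poor reference audio is the most common culprit, a quick sanity check for sample rate and clipping can help before synthesis; this sketch assumes the numpy and soundfile packages, and the thresholds are rough rules of thumb rather than official requirements:

```python
# Rough sanity check of a reference clip before using it for voice cloning.
import numpy as np
import soundfile as sf

audio, sr = sf.read("path/to/reference.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # downmix to mono for the checks below

peak = float(np.abs(audio).max())
clipped_ratio = float(np.mean(np.abs(audio) > 0.999))

print(f"sample rate: {sr} Hz, duration: {len(audio) / sr:.1f} s, peak: {peak:.3f}")
if sr < 16000:
    print("Warning: low sample rate; resample or use a higher-quality recording.")
if clipped_ratio > 0.001:
    print("Warning: clipping detected; choose a cleaner reference clip.")
```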