Advanced audio language models with few-shot learning capabilities
Featuring MiMo-Audio-7B-Instruct, which integrates thinking mechanisms and achieves open-source SOTA performance in audio understanding, dialogue generation, and instruction-based TTS.
Experience MiMo-Audio's capabilities with our live interactive demo
Discover the advanced capabilities of MiMo-Audio's audio language models
Instruction-tuned model with integrated thinking mechanisms. Achieves open-source SOTA performance on audio understanding, spoken dialogue, and instruct-TTS evaluations.
Emergent few-shot learning capabilities across diverse audio tasks. Trained on over 100 million hours of audio data for superior generalization.
1.2B-parameter Transformer operating at 25 Hz with an 8-layer residual vector quantization (RVQ) stack. Optimized for superior reconstruction quality and downstream modeling.
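A quick back-of-envelope check of what those figures imply for sequence length (the constants below come directly from the numbers above):

```python
# Sequence lengths implied by the tokenizer's published figures:
# 25 frames per second, 8 RVQ codebooks per frame.
FRAME_RATE_HZ = 25
NUM_RVQ_LAYERS = 8

def rvq_token_count(duration_s: float) -> int:
    """Total discrete tokens emitted for a clip of the given length."""
    return int(duration_s * FRAME_RATE_HZ) * NUM_RVQ_LAYERS

print(rvq_token_count(10.0))  # 10 s of audio -> 250 frames x 8 layers = 2000 tokens
```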
Advanced multimodal architecture combining patch encoder, LLM, and patch decoder
Aggregates four consecutive 25 Hz RVQ frames into a single patch, downsampling to 6.25 Hz for efficient LLM processing (see the patchification sketch after this overview)
7B parameter model with instruction tuning and thinking mechanisms for superior performance
Autoregressively generates the full 25 Hz RVQ token sequence via a delayed-generation scheme
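The 4x patch ratio follows directly from the stated rates (25 Hz / 6.25 Hz). Below is a minimal sketch of that aggregation step; the embedding dimension and the simple reshape-and-concatenate grouping are illustrative assumptions, not the actual MiMo-Audio encoder:

```python
import torch

PATCH_SIZE = 4  # 25 Hz / 6.25 Hz

def patchify(rvq_frames: torch.Tensor) -> torch.Tensor:
    """Group consecutive per-frame embeddings into patches.

    rvq_frames: (T, D) frame embeddings at 25 Hz, T divisible by 4.
    returns:    (T // 4, 4 * D) patch sequence at 6.25 Hz.
    """
    T, D = rvq_frames.shape
    assert T % PATCH_SIZE == 0, "pad the sequence to a multiple of the patch size"
    return rvq_frames.reshape(T // PATCH_SIZE, PATCH_SIZE * D)

frames = torch.randn(100, 512)  # 4 s of audio at 25 Hz (dimension is illustrative)
patches = patchify(frames)      # (25, 2048): 25 patches, i.e. 6.25 patches/second
print(patches.shape)
```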
Versatile capabilities for diverse audio processing tasks
Deep comprehension of complex audio content and context
High-quality conversational audio synthesis
Text-to-speech with instruction-based control
Advanced voice transformation and style transfer
Common questions about MiMo-Audio's capabilities and usage
MiMo-Audio is a series of audio language models developed by Xiaomi's MiMo team. It features few-shot learning capabilities and includes MiMo-Audio-7B-Instruct, which integrates thinking mechanisms for superior performance in audio understanding, dialogue generation, and instruction-based text-to-speech.
MiMo-Audio-7B-Instruct excels in multiple audio tasks including audio understanding (comprehending complex audio content), dialogue generation (creating natural conversational audio), instruction TTS (text-to-speech with instruction-based control), and voice conversion (advanced voice transformation and style transfer).
The MiMo-Audio-Tokenizer is a 1.2B-parameter Transformer that operates at 25 Hz with an 8-layer RVQ (Residual Vector Quantization) stack. It's optimized for superior reconstruction quality and serves as the foundation for downstream audio modeling tasks.
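For readers unfamiliar with RVQ, the sketch below shows the general idea: each codebook quantizes the residual left over by the previous one. Codebook sizes and dimensions are illustrative; this is a generic RVQ encoder, not the MiMo-Audio-Tokenizer itself:

```python
import torch

def rvq_encode(x: torch.Tensor, codebooks: list[torch.Tensor]) -> list[torch.Tensor]:
    """Quantize each frame with a stack of codebooks, layer by layer.

    x:         (T, D) frame embeddings.
    codebooks: list of (K, D) codebooks, one per RVQ layer (8 in MiMo-Audio).
    returns:   list of (T,) index tensors, one per layer.
    """
    residual, indices = x, []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)  # (T, K) distances to codewords
        idx = dists.argmin(dim=-1)         # nearest codeword per frame
        indices.append(idx)
        residual = residual - cb[idx]      # next layer quantizes what's left
    return indices

codebooks = [torch.randn(1024, 256) for _ in range(8)]  # 8 layers, as in the tokenizer
codes = rvq_encode(torch.randn(250, 256), codebooks)    # 10 s at 25 Hz
print(len(codes), codes[0].shape)                       # 8 layers x 250 frames
```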
MiMo-Audio demonstrates emergent few-shot learning capabilities across diverse audio tasks. Trained on over 100 million hours of audio data, the model can generalize to new tasks with minimal examples, making it highly adaptable for various audio processing applications.
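In practice, few-shot use typically means interleaving labeled audio/text exemplars before an unlabeled query. The segment structure below is a hypothetical stand-in to illustrate the pattern, not MiMo-Audio's actual prompt format:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    kind: str   # "audio" or "text"
    value: str  # file path for audio, string for text

def build_few_shot_prompt(examples: list[tuple[str, str]], query_wav: str) -> list[Segment]:
    """Interleave (audio, label) exemplars, then append the unlabeled query."""
    prompt: list[Segment] = []
    for wav, text in examples:
        prompt.append(Segment("audio", wav))
        prompt.append(Segment("text", text))
    prompt.append(Segment("audio", query_wav))  # model continues with the answer
    return prompt

prompt = build_few_shot_prompt(
    [("ex1.wav", "a dog barking"), ("ex2.wav", "rain on a window")],
    "query.wav",
)
```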
MiMo-Audio uses a multimodal architecture combining three key components: a patch encoder that aggregates RVQ tokens into patches (downsampling to 6.25 Hz), a 7B parameter large language model with instruction tuning and thinking mechanisms, and a patch decoder that autoregressively generates full 25 Hz RVQ token sequences.
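Delayed generation over RVQ layers is commonly implemented by staggering the token grid so that layer k lags k steps, letting each decoding step condition only on tokens already generated. The offsets and padding value below are assumptions used for illustration, not MiMo-Audio's exact scheme:

```python
import torch

PAD = -1  # illustrative placeholder token for positions not yet generated

def apply_delay(codes: torch.Tensor) -> torch.Tensor:
    """codes: (L, T) token grid, one row per RVQ layer. Returns (L, T + L - 1)."""
    L, T = codes.shape
    out = torch.full((L, T + L - 1), PAD, dtype=codes.dtype)
    for k in range(L):
        out[k, k:k + T] = codes[k]  # layer k starts k steps later
    return out

grid = torch.arange(8 * 5).reshape(8, 5)  # 8 layers x 5 frames
print(apply_delay(grid))
```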
MiMo-Audio models are available on Hugging Face, including MiMo-Audio-7B-Instruct, MiMo-Audio-7B-Base, and MiMo-Audio-Tokenizer. You can also find the source code, documentation, and evaluation toolkit on the official GitHub repository. Try the interactive demo at VibeVoice.info.
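Downloading a checkpoint typically goes through the standard huggingface_hub client. The repo id below is an assumption inferred from the model name, so verify it against the official Hugging Face organization page:

```python
from huggingface_hub import snapshot_download

# Assumed repo id; confirm the exact identifier on Hugging Face before use.
local_dir = snapshot_download(repo_id="XiaoMi/MiMo-Audio-7B-Instruct")
print(local_dir)
```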
Explore the capabilities of advanced audio language models with few-shot learning