MiMo-V2-TTS | Xiaomi

Beyond the Monotone

The agentic era is here. AI agents can see, hear, act, and call tools to solve real problems. But a truly intelligent partner shouldn't just execute — it should express.

MiMo-V2-TTS is built for this moment: full-modal interaction in the age of agents. It gives agents more than understanding — it gives them a voice with warmth, emotion, and soul. Not a passive text-to-speech engine, but a natural extension of how an agent communicates and connects:

Contextual emotion awareness — picks up emotional cues from text and automatically matches the most natural tone and delivery
Universal style adaptability — from formal announcements to casual conversation, the output stays natural and on-register
Real-time, seamless interaction — keeps pace with the agent's reasoning, making conversation feel fluid and effortless

Introduction

Xiaomi MiMo-V2-TTS is Xiaomi's self-developed large-scale speech synthesis model. Built on a proprietary Audio Tokenizer and a multi-codebook joint speech-text modeling architecture, MiMo-V2-TTS is pretrained on over 100 million hours of speech data and further refined through multi-dimensional reinforcement learning.

What it delivers:

Multi-granularity style control: from setting the overall tone of an utterance to fine-tuning local emotional nuances, including mid-sentence shifts in mood and gradual emotional transitions within a single phrase
Natural prosody: faithfully reproduces the rhythm and cadence of human speech
Singing capability: accurately captures pitch and rhythm when singing, without sounding artificial

How it's built:

Large-scale speech-text joint pretraining: pretraining on over 100 million hours of speech data establishes strong cross-modal alignment and unified understanding-generation capabilities
High-quality supervised fine-tuning: a small amount of high-quality supervised data gives the model generalizable style control at arbitrary granularity that responds to free-form instructions
Multi-dimensional reinforcement learning: to unlock the expressive potential accumulated during pretraining, we apply multi-dimensional RL on top of supervised learning during speech generation training. It balances stability with expressiveness and optimizes across several dimensions:
- More natural prosody
- More stable audio quality
- More accurate word-level articulation
- Higher-fidelity voice cloning
- Contextually appropriate tone and delivery across diverse scenarios
Multi-layer codebook architecture: the MiMo-Audio multi-layer codebook gives the model a high-fidelity discrete token space that preserves the rich information in the original speech signal. Speech-specific reward signals can therefore guide optimization directly during the reinforcement learning phase, making multi-dimensional evaluation metrics more effective in shaping the generation process.
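
One way to picture how several reward dimensions can feed a single training signal is a weighted average. The sketch below is illustrative only: the dimension names mirror the list above, but the weights, the score scale, and the `combine_rewards` function are assumptions, not a description of how MiMo-V2-TTS is actually trained.

```python
from typing import Dict, Mapping

# Hypothetical weights: the source does not say how the dimensions are balanced.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "prosody": 0.25,
    "audio_quality": 0.25,
    "articulation": 0.20,
    "voice_similarity": 0.15,
    "context_fit": 0.15,
}


def combine_rewards(
    scores: Mapping[str, float],
    weights: Mapping[str, float] = DEFAULT_WEIGHTS,
) -> float:
    """Return the weighted average of per-dimension scores."""
    missing = sorted(set(weights) - set(scores))
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight
```

Normalizing by the total weight keeps the combined value on the same scale as the individual scores, so changing one weight does not silently rescale the signal.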

Natural Language Style Instruction and Voice Control

Flexible Text-Driven Customization of Delivery Style

Most TTS systems give you a dropdown menu of emotions: happy, sad, angry, neutral. Pick one. MiMo-V2-TTS gives you a text box instead.

Describe the voice you want in plain language — any language, any level of detail — and the model generates speech that matches your description. No predefined tags. No fixed vocabulary. Just natural language.

"sleepy, just woke up, slightly hoarse"
"cutesy baby voice, a bit whiny"
"deeply affectionate, speaking slowly, almost whispering"
"impassioned and emphatic, like giving a speech to a crowd"

生气：跟你说多少次了！换下来的鞋子不要乱踢，吃完的外卖盒马上扔掉！你当这是住酒店是不是？你看这满屋子乱的，你要是再不收拾，我直接全给你扔出去！

温柔：乖，先把这杯温水喝了。今天在外面累坏了吧？去泡个热水澡，睡衣都在床上放好了。有什么心烦的事儿，等你睡醒了，明天我们慢慢再说，好不好？

悄悄话：嘘，轻点儿关门，宝宝刚才好不容易才哄睡着。你饿不饿？我去厨房给你下碗面……动作小点儿啊。

fast: Oh my gosh, I literally just missed the bus by three seconds, and now I'm gonna be super late for the meeting, I really need to run, bye!

slow: It's... so late. My whole body aches. I'm just gonna lay here on the couch and stare at the ceiling for a while.

happy: Are you kidding me? We got the front row tickets! This is gonna be the best weekend ever, let's go celebrate right now!

angry: Are you serious right now? Stop leaving your dirty dishes on my desk! I am so sick of cleaning up after you every single day!

sad: I just don't understand why he would say that. It really hurts, you know? I thought we were actually good friends.

The model also handles dialects and character personas with the same interface:

Dialects: Northeastern Mandarin, Sichuan dialect, Cantonese, Taiwanese Mandarin

哎呀妈呀，这外头风刮得，跟小刀刮脸似的！你赶紧进去把你那件长款羽绒服套上，别搁那儿穷得瑟了，冻感冒了我可不管你啊。

哎哟喂，你还在磨蹭个啥子嘛！锅里头的红油都烧开了，毛肚再不烫就要老得咬不动咯。搞快点搞快点，吃完我们还要去搓两把麻将哦。

乖乖咧，你这究竟是弄啥咧？昨天跟你说得好好的，今儿一扭脸你又给忘了。赶紧把那碗胡辣汤趁热怼了去，别搁那儿一直看手机了，中不中？

Character voices: Sun Wukong, Lin Daiyu

师父莫怕！俺老孙刚才翻到那山头看过了，前面树林子里透着股妖气。你们先在这石头上歇着，且容俺去打个头阵，探探什么来路！

我就知道，别人不挑剩下的也不给我。早知他今日来，我就不来了，倒显得我在这儿多余。罢了罢了，横竖是我不懂事，净惹人嫌弃。

This isn't keyword matching: the model parses the semantic content of your style description and maps it to the corresponding acoustic features during generation. Descriptions also compose naturally: "angry but trying to stay calm" produces output that differs from either "angry" or "calm" alone.
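
To make the free-form interface concrete, here is a minimal sketch of how a client might pair a style description with the text to speak, using the "style: text" layout of the samples above. The `SynthesisRequest` type, its field names, and `parse_styled_line` are hypothetical. The source does not document the actual API, so treat this as a way to organize inputs rather than a real request format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SynthesisRequest:
    """Hypothetical request shape: text to speak plus an optional style description."""
    text: str
    style: Optional[str] = None


def parse_styled_line(line: str) -> SynthesisRequest:
    """Split 'style: text' at the first ASCII or full-width colon.

    Assumes the caller follows the 'style: text' convention. A line whose
    spoken text itself contains a colon would be split at that colon.
    """
    positions = [p for p in (line.find(":"), line.find("：")) if p != -1]
    if not positions:
        return SynthesisRequest(text=line.strip())
    cut = min(positions)
    style = line[:cut].strip()
    text = line[cut + 1:].strip()
    return SynthesisRequest(text=text, style=style or None)
```

For example, `parse_styled_line("sad: It really hurts, you know?")` yields a request with style `"sad"`, while a line with no colon is passed through with no style so the model can infer delivery from the text.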

Fine-Grained Control of Acoustic and Non-Verbal Sound Events

Real human speech is not a clean sequence of words. It's full of coughs, sighs, hesitations, sharp intakes of breath, and nervous laughter — the paralinguistic events that carry as much meaning as the words themselves.

MiMo-V2-TTS generates these events as naturally integrated components of the speech output, not as audio clips spliced in after the fact. The model understands where these events belong in context and how they should sound given the surrounding text.

Supported events include:

Coughing — natural throat-clearing and cough sounds
Pauses — meaningful silences for deliberation or rhetorical emphasis
Hesitation fillers — "um...", "uh...", reflecting real-time cognitive processing
Sighing — exhalations and deep breaths conveying emotional state
Laughter — from subtle chuckles to sudden bursts

Hear it in action:

Achoo! Ahem. I—I really [cough] think I am coming down with a terrible [cough] terrible cold.

[heavy breathing] Just... give me... a second. I ran... all the way... from the station.

I just feel... long sigh... like I'm constantly treading water, you know?

It's just so stupid! (sobbing) We spent all that money on the cake and the dog just... (sudden laugh) he just ate the whole thing in one bite!

Hold on… [heavy breathing] I… I need a minute… [soft cough] I ran all the way here and I can barely catch my breath… [nervous sigh] did I make it in time?

（虚弱，气若游丝）水……给我点水……（剧烈咳嗽）咳咳咳！嗓子里像是有火在烧……（喘息）我是不是……熬不过去这个冬天了？

（紧张，深呼吸）呼……冷静，冷静。不就是一个面试吗……（语速加快，碎碎念）自我介绍已经背了五十遍了，应该没问题的。加油，你可以的……（小声）哎呀，领带歪没歪？

（极其疲惫，有气无力）师傅……到地方了叫我一声……（长叹一口气）我先眯一会儿，这班加得我魂儿都要散了。

如果我当时……（沉默片刻）哪怕再坚持一秒钟，结果是不是就不一样了？（苦笑）呵，没如果了。

（寒冷导致的急促呼吸）呼——呼——这、这大兴安岭的雪……（咳嗽）简直能把人骨头冻透了……别、别停下，走，快走。
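
The samples above mark non-verbal events inline with square brackets, ASCII parentheses, or full-width parentheses. The model takes such text directly, so no preprocessing is implied. The sketch below only makes those marker conventions explicit by splitting a line into speech and event segments, which can be handy for inspecting or validating inputs. The `split_events` helper and the segment labels are assumptions for illustration.

```python
import re
from typing import List, Tuple

# Matches [event], (event), or full-width （event） markers.
MARKER = re.compile(r"\[([^\]]+)\]|\(([^)]+)\)|（([^）]+)）")


def split_events(text: str) -> List[Tuple[str, str]]:
    """Split text into ("speech", ...) and ("event", ...) segments, in order."""
    segments: List[Tuple[str, str]] = []
    pos = 0
    for match in MARKER.finditer(text):
        spoken = text[pos:match.start()].strip()
        if spoken:
            segments.append(("speech", spoken))
        # Exactly one of the three groups is set for any match.
        label = next(g for g in match.groups() if g is not None)
        segments.append(("event", label.strip()))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("speech", tail))
    return segments
```

Note that this treats every parenthetical as an event, so an ordinary aside in parentheses would be mislabeled, and unbracketed cues written as plain words are not detected.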

Advanced Text Comprehension

Rich Text Understanding

When a human reads "You are UN-BE-LIEVABLE!" out loud, they don't pronounce each syllable the same way. The capitalization, the hyphens, the exclamation mark — all of these are instructions for how to speak. Most TTS systems ignore them.

MiMo-V2-TTS interprets typographic and formatting cues as prosodic signals:

ALL CAPS → automatic stress and emphasis ("THIS IS IMPORTANT" sounds emphasized, not just louder)
Character repetition → mapped to speech rhythm and emotional intensity ("不不不不不" becomes a rapid, emphatic refusal)
Punctuation → shapes intonation contours (questions rise, exclamations punch, ellipses trail off)

Example inputs:

Ugh... pffft... oh, pleeeease! You actually think I care? I am SO. TOTALLY. OVER. THIS. ENTIRE. THING.

Wait, w-what do you mean? The final exam is scheduled for TODAY?

You are UN-BE-LIEVABLE! I am sooooo done with your constant lies. GET. OUT!

Ahem... uh, ex-CUSE me! HEY! Could I... could I PLEASE have ev-ery-one's at-TEN-TION?

I—I've prepared a lot... sorry, my hands are shaking... my experience really aligns here. Ugh, tch, l-look... I. SAID. NO! Gah, d-do. Not. Ask about that previous project... EVER... again!

你也太离谱了！！我真的是受——够——了你那些谎话。给我，出去！

我我我不是那个意思，我就是、就是有点没想好……

这个……嘶我想想啊……我好像在哪见过。

不是？？？你在逗我呢兄弟！！！你说你搞了这么久，结果就这？？？？？？
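
The typographic cues in these examples can be made concrete with a small rule-based detector. This is only a sketch of what the cues look like in raw text. MiMo-V2-TTS is described as interpreting them through learned text understanding, not through rules like these, and the `find_cues` function and its category names are assumptions.

```python
import re
from typing import Dict, List

CAPS_WORD = re.compile(r"\b[A-Z]{2,}\b")        # e.g. GET, OUT
REPEATED_CHAR = re.compile(r"(\w)\1{2,}")       # same character 3+ times in a row
REPEATED_PUNCT = re.compile(r"[!?！？]{2,}")     # stacked ! or ? (ASCII or full-width)
ELLIPSIS = re.compile(r"\.{3,}|…+")             # "..." or the ellipsis character


def find_cues(text: str) -> Dict[str, List[str]]:
    """Collect typographic cues that could signal emphasis, rhythm, or trailing off."""
    return {
        "caps": CAPS_WORD.findall(text),
        "repeated_char": [m.group(0) for m in REPEATED_CHAR.finditer(text)],
        "repeated_punct": REPEATED_PUNCT.findall(text),
        "ellipsis": ELLIPSIS.findall(text),
    }
```

A detector like this can help when auditing input text, for instance to check that an upstream normalizer is not lowercasing words or collapsing repeated punctuation before the text reaches the model.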

Inferring Speaking Style from Text Context

Perhaps the most consequential capability: MiMo-V2-TTS can infer the appropriate speaking style directly from the text content itself, without any explicit style prompt.

A question is spoken with rising intonation — automatically.
An angry outburst shifts to a sharp, clipped delivery — automatically.
A tender confession softens in pace and volume — automatically.

This emerges from the language model backbone's deep semantic understanding. The model doesn't just convert text to speech — it reads the text, understands the emotional arc, and adapts its delivery sentence by sentence.

Guys, watch this— I'm gonna sneak around the back, take him by surprise, three two one go, oh my god I got him! That was such a clean shot, we're totally dominating this game right now, let's keep this momentum and win this match together!

She pulled me back quickly and breathed urgently into my ear, "We have to leave right now! They're coming closer, and if we stay here, we'll never get away!"

She grabbed my hand and urged me excitedly, "Hurry hurry hurry! The surprise is just around the corner—you're gonna love it more than anything!"

She buried her face in my shoulder and cried quietly, "I tried so hard… why can't anyone see how much I'm hurting inside?"

Her tone turned sharp and serious, but her voice stayed low, "Don't trust anyone here. Not a single person. They're all lying to you."

我真的已经拼尽全力去做了，可是不管怎么努力都没用，我真的好难过，为什么没有人能看见我的委屈，能体谅我一下。

你怎么可以这样对我！我把你当成最信任的人，什么心里话都跟你说，你却在背后背叛我，你真的太让我伤心和失望了！

请你慢慢闭上眼睛，把注意力带到你的呼吸上，轻轻吸气，再缓缓呼气。让身体一点点放松下来，肩膀放下，眉头舒展，心里的紧张也跟着一起离开。就在这里，安安静静地和自己待一会儿，不用着急，不用追赶，只需要感受此刻的安稳与平静。

他挺直身体，语气坚定、掷地有声："对方辩友，这不是妥协，是逃避！我们要的是真相，是公平，是绝不退让的原则！我坚决反对这种不负责任的观点！"

朋友在旁边兴奋地小声喊："快看快看！他过来了！真的是本人！比照片还好看！"

Singing Capability

MiMo-V2-TTS supports singing voice synthesis — within the same unified model that handles speech.

To our knowledge, this makes it the only commercially available TTS API that natively supports both speaking and singing generation. No separate model. No mode switching. The same architecture that delivers a whispered confession can belt out a pop chorus.

我怎么变这样，变得这样倔强？每一步的地方，每一站都不会忘。舞台上远远的光，落在我的肩膀，想起第一次那个模样。我怎么变这样，变得这样疯狂？用这灿烂时光，绽放不一样的光。就算黑夜太漫长，风景全被遮挡，抬头就有一片星光。

Oh，听说天不够黑，别着急入睡，那些想见的人会慢慢出现。星星那么美，我不想浪费，借个时间和自己聊一聊天。听说天不够黑，别着急入睡，那些恨过的人会变得完美。我还不会累，我还不想睡，只要轰轰烈烈度过这迷人的夜。

当你的眼睛眯着笑，当你喝可乐，当你吵，我想对你好，你从来不知道，想你想你，也能成为嗜好。当你说今天的烦恼，当你说夜深你睡不着，我想对你说，却害怕都说错，好喜欢你，知不知道？

我从前相信，这世上有一个温暖的人，只为我悲喜，为我阻挡着人间的锋利。为了找到你，从未放过任何蛛丝马迹，而事到如今，终于明白我命里没你。

What's Next

MiMo-V2-TTS is a milestone in our voice technology roadmap, but it is not the destination.

On our roadmap: expanded language coverage beyond Chinese and English, and tighter integration with MiMo-V2-Omni's multimodal understanding capabilities, enabling agents that not only see and understand the world but also speak about it with the full expressiveness of a human voice.

The voice agent era needs voices that are more than intelligible. It needs voices that are alive.

We're building them.