MiMo-V2.5-TTS Series

Vocal expression becomes something language can steer directly.

A family of speech synthesis models built for the agent era


Speech synthesis is evolving from "reading naturally" to "expressing freely." Audio dramas, animated games, agent dialogues, virtual streamers — more and more creative scenarios now demand that a voice do far more than recite text aloud. It has to inhabit characters, carry emotion, and bring scenes to life. What creators really want is a voice that responds to language itself as a control surface: write a few lines of director's notes, drop in a handful of inline tags, or write nothing at all — and the model still delivers the performance you had in mind.

To that end, we're releasing the MiMo-V2.5-TTS Series — a family of speech synthesis models built for the agent era, where vocal expression becomes something language can steer directly. The series is built around three core capabilities: precise adherence to style instructions, flexible control via audio tags, and rich text comprehension — so that however creators choose to write, the output remains finely controllable speech.


Model Series

Built on this shared control layer, the series ships three models — each targeting a distinct flavor of TTS need: premium stock voices, voice design, and voice cloning.

MiMo-V2.5-TTS

Ships with a curated library of high-quality voices, paired with strong style-instruction following. Offers fine-grained control over pace, emotion, and tone to fit a wide range of expressive scenarios.

MiMo-V2.5-TTS-VoiceDesign

Define and generate a brand-new voice from a single sentence, making voice creation intuitive and fast.

MiMo-V2.5-TTS-VoiceClone

Faithfully reproduces a target voice from just a few audio samples, preserving timbre identity while keeping style-instruction following and audio-tag control fully intact.
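
As a rough illustration of how the three models divide the work, the sketch below picks a model name from whatever the caller has on hand. The helper `pick_model` and its parameters are hypothetical and not part of any MiMo API; only the three model names are taken from this page.

```python
from typing import Optional

# Model names as listed on this page.
STOCK = "MiMo-V2.5-TTS"
VOICE_DESIGN = "MiMo-V2.5-TTS-VoiceDesign"
VOICE_CLONE = "MiMo-V2.5-TTS-VoiceClone"


def pick_model(voice_description: Optional[str] = None,
               reference_clip: Optional[str] = None) -> str:
    """Hypothetical helper: choose a model from the inputs available.

    - a reference audio clip   -> voice cloning
    - a text voice description -> voice design
    - neither                  -> built-in voice library
    """
    if voice_description and reference_clip:
        raise ValueError("pass a description or a reference clip, not both")
    if reference_clip:
        return VOICE_CLONE
    if voice_description:
        return VOICE_DESIGN
    return STOCK


# Example: a caller with only a text description of the voice.
model = pick_model(voice_description="gruff middle-aged male")
```

Rejecting the both-inputs case is a design choice of this sketch; an application could just as reasonably let the reference clip take priority.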


Core Capabilities

Precise Style-Instruction Following

From a terse one-line cue to a full page of director's notes, the model reads and follows instructions reliably across emotion, tone, pacing, vocal delivery, speaking style, and more. You don't need to shape your instructions into structured parameters: describe the feel the way you'd direct an actor on set, and the model will land the performance accordingly.

For scenarios that demand tighter consistency — audio dramas, game NPCs, character-driven dialogue — the model also accepts screenplay-style structured input. Describe the character, the scene, and the specific stage direction in separate layers; each layer updates on its own cadence and recombines freely. This layering keeps a character's vocal identity anchored across the whole performance while leaving every individual line open to precise direction.

Instruct

声音低沉沙哑一点,像个历经沧桑的老前辈在讲述传奇人物。语气里带点由衷的敬佩,娓娓道来。

Text

街口那个老周啊,媳妇走得早,一个人拉扯俩娃,白天蹬三轮,晚上还去夜市摆摊修鞋。现在俩孩子都有出息喽,想接他去城里享福——他不去,就守着那间小铺子。哎,人哪,骨头硬,心里头就踏实。

Instruct

Read this like a hyper-caffeinated radio DJ doing a fast-paced sponsor plug. Punch the product name and aggressively emphasize every single item in the list of benefits.

Text

The VortexBlend Pro is the ultimate kitchen companion trusted by home chefs everywhere. With one-touch smoothie presets, ice-crushing titanium blades, whisper-quiet motor technology and self-cleaning cycles, it handles everything from morning shakes to nut butters. Order today and get free overnight shipping plus a lifetime warranty.

Character

曾是守护九天的神祇,见证了凡人的无药可救后,决定以灭世来完成最终的净化。他的心中装满悲悯,但手段是绝对的屠戮。

Scene

悬浮于崩塌的祭坛之上,俯视下方在火海中哀嚎、曾奉他为信仰的信徒。他在降下最后的毁灭前,发出神圣却残忍的叹息。

Direction

发声机制与共鸣:充分打开胸腔共鸣,制造一种神圣的回音感。声音位置靠后,音色如古钟般低沉且带有金属质感的磁性。
声调与韵律:四声(去声)的下落要极其平缓,不要砸实,带有一种吟诵古籍般的从容与宏大。字句之间的停顿拉长,展现出视万物为刍狗的威压。
气声与实声的较量:在说前两句时,实声饱满,高高在上;但在说出"闭上眼吧"时,声音突然混入大量疲惫的气息,神性开始出现裂痕,流露出勉强的残忍。
咬字细节:古风词汇(如"垂怜"、"沉疴"、"剔骨刮毒")咬字要深,声母起音圆润而不尖锐。结尾的最后半句,几乎全部转化为气声,像是在哄睡一个婴儿,将残酷包裹在极致的悲哀之中。

你们求我垂怜,求我降下甘霖洗净这浊世。可这世间的沉疴,唯有烈火能剔骨刮毒。闭上眼吧。这业火烧起来的时候,一点也不疼。

Character

The grizzled, veteran shot-caller of a professional esports team. He's ten years older than the kids he's commanding. He doesn't have their lightning-fast reflexes anymore, so he survives purely on astronomical game-sense and psychological warfare. He's a human supercomputer wrapped in a deeply cynical, exhausted shell.

Scene

Match point in the grand finals. Deafening arena crowd bleeding through his noise-canceling headset. It's a 2-on-1 clutch situation, and he has to micromanage his nervous rookie teammate.

Direction

Microphone-close, intensely compressed, and raspy. The voice of a man who has damaged his vocal cords shouting over LAN tournament setups for a decade.

Mechanics:
- Breathe deeply into the belly, but keep the chest completely still.
- Speak in rapid, staccato bursts. Clip the end of every sentence—do not let the vowel ring out. He speaks to leave dead air for his teammates to hear footsteps.
- Drive the pitch down into a gritty vocal fry.
- When he says "Swing him," the tempo should slam into a brick wall. The last two sentences drop to an icy, sub-vocal whisper that carries absolute, terrifying authority.

Smoke catwalk and drop. Don't peek it, don't peek it! He's holding the angle with a heavy, just jiggle and bait the shot. There goes the reload. Swing him! Nice, now freeze. One left, the objective is dropped, we have the clock.
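
The layered screenplay input shown in the examples above can be sketched as a small prompt builder. This is a minimal illustration under stated assumptions: the `Line` type, the `build_prompt` function, and the label-plus-blank-line layout are all hypothetical and mirror only how the samples are laid out on this page, not a documented input format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    direction: str  # per-line stage direction; changes most often
    text: str       # the words to be spoken


def build_prompt(character: str, scene: str, line: Line) -> str:
    """Join the three layers under the labels used in the examples above.

    The layout is an assumption made for illustration only.
    """
    sections = [
        ("Character", character),
        ("Scene", scene),
        ("Direction", line.direction),
    ]
    header = "\n\n".join(f"{label}\n{body.strip()}" for label, body in sections)
    return f"{header}\n\n{line.text.strip()}"


# The character layer stays fixed; direction is swapped per line.
character = "The grizzled, veteran shot-caller of a professional esports team."
scene = "Match point in the grand finals."
lines = [
    Line("Rapid, staccato bursts.", "Smoke catwalk and drop."),
    Line("Icy, sub-vocal whisper.", "One left, we have the clock."),
]
prompts = [build_prompt(character, scene, ln) for ln in lines]
```

Keeping the character text in one variable and reusing it for every line is what anchors the vocal identity across the performance, while each `Line` carries its own direction.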

Flexible Audio-Tag Control

On top of paragraph-level natural-language instructions, the model supports inline audio tags for pinpoint control over emotion, state, or style at any specific spot in the text. Tags work in both Chinese and English and accept free-form descriptions, and they can be mixed freely within a single passage. From a simple emotion marker to densely stacked, finely arranged multi-tag choreography, the model renders it all reliably — delivering both expressive range per tag and consistency across complex combinations.

[crying] She's gone... she's really gone...[pause] but you know what's funny? [sniffles] She always said she'd outlive us all. [crying] God, I miss her so much.

Order! Order in the court! [sternly] The defendant will rise. [clears throat] [commanding] How do you plead? [trembling] N-not guilty, Your Honor. [Angry] Silence! [sighs] [wearily] Very well. Let the record show...the trial begins Monday at nine AM sharp.

(调侃) 老张你当时不是说这条航线稳得很吗……
(模仿自信,提高音量) "系统全绿,放心走。"
(突然停顿) ……现在呢?
(爆发,愤怒压不住) 现在整艘船都在报警!你管这叫"放心"?!
(声音变轻) 不过……你看那外面,裂开的星云像在呼吸一样。
(急促|呼喊) 别断通讯!喂!再撑十秒!十秒!!
(低声|情绪塌陷般平静) ……算了。
(轻笑|带点释然) 也挺好,至少是一起看的。

上!上!上!他没血了!打他打他打他——[吸气]进塔了!他进塔了![停顿]等等等等,有人绕后 [停顿][大叫]啊!![急促]一换二!一换二!兄弟们看到了吗?!这就是今晚的 M——V——P!
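
For callers who want to inspect or validate tagged text before sending it, the inline tags in the samples above can be separated from the spoken text with a small tokenizer. This is a sketch, not the model's own parsing: the function `split_tags` is hypothetical, and it assumes tags are written in ASCII square brackets or parentheses with `|` separating stacked labels, as in the samples.

```python
import re
from typing import List, Tuple

# Matches [tag] or (tag). Treating every parenthesised span as a tag is a
# simplification: ordinary parenthetical text would be misread as a tag.
TAG_RE = re.compile(r"\[([^\[\]]+)\]|\(([^()]+)\)")


def split_tags(text: str) -> List[Tuple[str, object]]:
    """Split text into ("tag", [labels]) and ("text", str) segments, in order."""
    segments: List[Tuple[str, object]] = []
    pos = 0
    for m in TAG_RE.finditer(text):
        chunk = text[pos:m.start()].strip()
        if chunk:
            segments.append(("text", chunk))
        raw = m.group(1) or m.group(2)
        # "a|b" inside one tag becomes two stacked labels.
        labels = [part.strip() for part in raw.split("|") if part.strip()]
        segments.append(("tag", labels))
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


segments = split_tags("[sighs] [wearily] Very well. (急促|呼喊) 别断通讯")
# segments == [("tag", ["sighs"]), ("tag", ["wearily"]), ("text", "Very well."),
#              ("tag", ["急促", "呼喊"]), ("text", "别断通讯")]
```

Because tags accept free-form descriptions, the sketch does not check labels against a fixed vocabulary; it only recovers where each tag sits relative to the text.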


Rich Text Comprehension

Even with no prompt at all, no tags of any kind — just a plain stretch of text — the model picks up on the rhythm and emotion already living in the words. Punctuation-driven pauses and sentence-level rises and falls come through naturally. The emotional arc hiding inside the text — from calm narration to sharp turns of intensity — gets picked up on its own. Even the speaker identity written between the lines — age, temperament, the kind of character implied — settles into the voice automatically.

Put another way: hand the model the plainest text you've got, and what you get back is still a performance with flesh and bone.

Ten... nine... eight... seven... six... five... four... three... TWO... ONE... ZERO! LAUNCH! LAUNCH! WE HAVE LIFTOFF! GO GO GO! SHE'S CLIMBING! ALTITUDE 1,000... 5,000... 10,000 FEET AND CLIMBING! BEAUTIFUL! AB-SO-LUTE-LY BEAUTIFUL!

The five-year-old squealed, "Look, Grandpa! A PUPPY!" The old man squinted and grumbled, "That ain't a puppy, that's a raccoon." The teenager rolled her eyes: "It's OBVIOUSLY a cat, you're both blind." The police officer stepped forward: "Ma'am, sir, I'm going to need everyone to step back slowly." The little boy whimpered, "Is it gonna bite me?"


1. MiMo-V2.5-TTS

Ships with a curated library of premium voices covering a wide range of use cases. Every voice has been professionally tuned for natural delivery and emotional fit — high-quality speech synthesis, right out of the box.

冰糖
茉莉
苏打
白桦
Mia
Chloe
Milo
Dean

2. MiMo-V2.5-TTS-VoiceDesign

Voice design is built for the "I can hear this voice in my head, but it doesn't exist anywhere yet" kind of scenario: game NPCs, animated characters, virtual streamers, brand IPs, the unconventional voices you'd find in an audio drama — all of these are hard to source from a stock voice library and aren't a good fit for cloning a real person either.

This model generates a brand-new voice from scratch using nothing but a natural-language description — no reference audio required. You're free to pull from any dimension that fits: age, gender, accent, timbre, vocal delivery, temperament, and more. Try something like "an aging Eastern European scholar, low and slightly raspy, with a slow, measured cadence," or "a high-spirited young woman, bright and crisp, with a subtle upward lilt at the end of each line" — and the model synthesizes the voice to match.

Thanks to large-scale pretraining, the model handles complex and fuzzy descriptions sensibly, rather than getting boxed into coarse labels like "male/female/young/old." This lets voice design deliver both the unique voices that real humans can't easily provide and the precise reproduction of well-established character archetypes.

Instruct

Heavy Russian accent, gruff middle-aged male, blunt and matter-of-fact.

Text

You want my opinion? Fine. This plan will not work. I have seen many plan like this before, all fail. You think you are special? No. You are not. But you never listen, so go, try. When everything fall apart, I will be here, drinking tea. I already told you.

Instruct

Young female, extreme close-up with a binaural, ear-to-ear ASMR feel. Audible breathing, subtle swallowing, and soft natural lip sounds. She speaks very slowly, creating a deeply relaxing and immersive experience.

Text

[Whispering in your ear] Shhh... just relax, come a little closer. I'm right here beside you now. Breathe slowly and gently, and let your mind drift, as if you're sinking into warm water.

Instruct

一位中年男性,说标准普通话,嗓音低沉有磁性,带有轻微的沙哑质感,像纪录片旁白解说员,沉稳而有感染力。

Text

当最后一缕阳光消失在地平线之下,这片沉睡了亿万年的大地开始显露它真正的面貌。在这寂静的荒野中,每一块岩石都记录着时间的流逝,每一阵风都在诉说着古老的故事。

Instruct

一位年迈的老先生,说带北方口音的普通话,语速缓慢而沉稳,嗓音略带沙哑和沧桑感,仿佛一位饱经风霜的老爷爷在讲故事,充满岁月的智慧。

Text

我这辈子啊,走南闯北六十多年。见过最热闹的集市,也见过最安静的戈壁。到头来才明白一个道理——这人哪,不在走了多远的路,在于记住了多少风景。年轻人,别光顾着赶路,偶尔也停下来看看天。
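
Free-form text remains the intended input for voice design, but applications that store character attributes as separate fields may want to fold them into one description first. The sketch below does that; `describe_voice` and its parameter names are hypothetical, chosen to match the dimensions listed above (age, gender, accent, timbre, vocal delivery, temperament).

```python
from typing import Optional


def describe_voice(age: Optional[str] = None,
                   gender: Optional[str] = None,
                   accent: Optional[str] = None,
                   timbre: Optional[str] = None,
                   delivery: Optional[str] = None,
                   temperament: Optional[str] = None) -> str:
    """Hypothetical helper: fold whichever dimensions are given into one sentence."""
    who = " ".join(p for p in (age, gender) if p) or "speaker"
    accent_phrase = f"{accent} accent" if accent else None
    traits = [p for p in (accent_phrase, timbre, delivery, temperament) if p]
    sentence = who[0].upper() + who[1:]
    if traits:
        sentence += ", " + ", ".join(traits)
    return sentence + "."


description = describe_voice(
    age="middle-aged",
    gender="male",
    accent="heavy Russian",
    timbre="gruff",
    temperament="blunt and matter-of-fact",
)
# description == "Middle-aged male, heavy Russian accent, gruff, blunt and matter-of-fact."
```

Every field is optional, so a sparse character sheet still yields a usable description and a richer one simply produces a longer sentence.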


3. MiMo-V2.5-TTS-VoiceClone

Voice cloning is for getting the model to speak in a voice you specify — a real podcaster, a voice actor, a brand ambassador, or the user's own voice.

Feed it a short reference clip — seconds are enough — and with no additional training, labeling, or fine-tuning, the model reproduces the speaker's voice and makes it usable immediately. The cloned voice preserves not just the speaker's vocal identity, but also the breath, the pacing, the habitual pauses that make a voice recognizably theirs.

Once cloned, the voice inherits the full control stack of the series — natural-language instructions, audio tags, and screenplay-style scripts can all still be layered on top. The result isn't just a voice that "sounds like the original person" — it's one that can also perform in the style and emotion you direct it to.
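
Before submitting a reference clip, it can help to confirm the clip really is a few seconds long. The sketch below reads a PCM WAV file's duration with Python's standard `wave` module. The bounds `MIN_SECONDS` and `MAX_SECONDS` are assumptions for illustration only, since this page says only that "seconds are enough," and both function names are hypothetical.

```python
import wave

# Assumed bounds for illustration only; no exact limits are given on this page.
MIN_SECONDS = 1.0
MAX_SECONDS = 30.0


def clip_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def is_usable_reference(path: str) -> bool:
    """True if the clip length falls inside the assumed bounds."""
    return MIN_SECONDS <= clip_seconds(path) <= MAX_SECONDS
```

The check covers only WAV input and only length; it says nothing about recording quality or background noise, which a real pipeline would likely need to consider separately.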

Text

风轻轻地吹过,带来了远方的花香,和记忆里那个夏天的味道。

[Audio: Reference, Cloned Output]

Instruct

用尖锐刻薄的嗓音,带着狐假虎威的得意感说话,在提到大人物的身份时故意放慢语速并加重语气,营造压迫感。

Text

你以为我是谁,也敢在这儿跟我耍横?我告诉你,站在我身后的那个人,说出来吓死你——是当今的——万岁爷!你今天要是不给我个说法,我让你这铺子明天就开不了门。

[Audio: Reference, Cloned Output]

Text

Ignore the sirens, ignore the neon bleeding through your eyelids, and just breathe with me. In, and out. They sit up there in their glass towers, thinking they've engineered peace out of algorithms and sterile air. But true stillness isn't manufactured, little bird.

[Audio: Reference, Cloned Output]

Instruct

Broadcast this like a blistering post-match pundit tearing into a disastrous performance. Voice is fast and openly fed up, smashing down on loaded words to drive home how badly things fell apart.

Text

No shape, no urgency, no clue what they're trying to do out there. The Ironhawks were top of the league six weeks ago—SIX WEEKS—and now they can't string two passes together without handing it back.

[Audio: Reference, Cloned Output]

Getting Started

To help developers explore what's possible, MiMo-V2.5-TTS, MiMo-V2.5-TTS-VoiceDesign, and MiMo-V2.5-TTS-VoiceClone are all available free of charge for a limited time on the Xiaomi MiMo API platform. See the API docs, or try the models hands-on in MiMo Studio.

Agent Tool-Call Support

To make it easy to plug these voice capabilities into agent applications, we've open-sourced the integration Skills for the MiMo-V2.5-TTS model family. Pull them from the GitHub repo XiaomiMiMo/MiMo-Skills and get going.

Next Steps

1. Scaling speech pretraining and RL post-training

The MiMo-V2.5-TTS Series shows that large-scale pretraining and post-training pay off in speech. We're scaling both, with more data, larger models, and more compute, so that stronger speech intelligence can emerge from scale itself. On top of that, more refined reward modeling and RL algorithms will push the model toward higher-order expressive intelligence in speech.

2. Universal audio generation

Speech is only the first step. We're expanding the model's reach to the broader territory of audio generation: ambient sound effects, action sounds, atmospheric beds, even short musical phrases and melodic fragments — step by step, modeling an entire sonic world. We believe a truly universal audio model isn't just a bolt-together of speech, SFX, and music — it's one where all three understand each other and co-create within the same space.

3. Contextual understanding

Expressive speech has never been a line-by-line game. The reason a human "reads it right" is that they understand context: what came before, and where the current line sits in the larger narrative. Contextual understanding means the model stops being a sentence-by-sentence execution tool and becomes a storyteller that grasps narrative context. This is the key step on our path toward genuinely general-purpose speech intelligence.