MiMo-V2.5-ASR | Xiaomi

MiMo-V2.5-ASR is an open-source speech recognition model developed by Xiaomi MiMo. It handles bilingual Chinese–English recognition, a wide range of Chinese dialects, code-switching, lyrics transcription, and knowledge-intensive content, and it remains robust in noisy, multi-speaker, and other challenging acoustic conditions.

Through large-scale mid-training, high-quality supervised fine-tuning (SFT), and novel reinforcement learning (RL) algorithms, MiMo-V2.5-ASR achieves state-of-the-art performance across multiple authoritative benchmarks.

Key Features

🗣️ Chinese Dialects: Native support for Wu, Cantonese, Hokkien, Sichuanese, and more
🔀 Code-Switching: Seamless Chinese–English code-switching transcription with no language tags required
🎵 Song Recognition: High-precision lyrics transcription for Chinese and English songs, even with mixed accompaniment and vocals
🔊 Noisy Environments: Robust recognition under heavy noise, far-field capture, and other adverse acoustic conditions
👥 Multi-Speaker: Accurate transcription of overlapping, multi-party conversations such as meetings
🇬🇧 Complex English Scenarios: Lowest average WER among the compared models on the Open ASR Leaderboard, with competitive results on challenging benchmarks such as AMI
📚 Knowledge-Intensive Content: Precise recognition of classical poetry, technical terminology, personal names, place names, and other knowledge-dense material
📝 Native Punctuation: Punctuation generated natively from prosody and semantics, delivering ready-to-use transcripts with no post-processing needed

Performance

MiMo-V2.5-ASR achieves leading performance across public and internal benchmarks spanning general Chinese and English recognition, Chinese dialects, code-switching, and lyrics transcription. It has the lowest error rate on every lyrics and internal test set and the lowest average WER on the Open ASR Leaderboard, and it is first or close to first on most of the remaining sets.

Representative evaluation results are shown below.

Detailed benchmark results follow. All figures are word error rates (WER); lower is better.
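
For readers who want to reproduce this kind of score, the following is a minimal sketch of how an error rate is typically computed: the edit distance between reference and hypothesis, divided by the reference length. It is not the evaluation code behind these tables. The function names, the punctuation-stripping normalization, the percent scale, and the choice to score Chinese per character are all assumptions made for illustration.

```python
import unicodedata


def tokenize(text: str, unit: str = "word") -> list[str]:
    """Split text into scoring units after a simple normalization (assumed)."""
    # Replace punctuation (Unicode categories P*) with spaces and lowercase,
    # so natively generated punctuation is not counted as an error.
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text.lower()
    )
    if unit == "char":
        # Character-level units, commonly used for Chinese.
        return [ch for ch in cleaned if not ch.isspace()]
    return cleaned.split()


def error_rate(reference: str, hypothesis: str, unit: str = "word") -> float:
    """Return (substitutions + deletions + insertions) / reference length, in percent."""
    ref = tokenize(reference, unit)
    hyp = tokenize(hypothesis, unit)
    if not ref:
        raise ValueError("reference must contain at least one scoring unit")

    # Levenshtein distance, keeping only the previous row of the DP table.
    prev = list(range(len(hyp) + 1))
    for i, ref_unit in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_unit in enumerate(hyp, start=1):
            cost = 0 if ref_unit == hyp_unit else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    print(error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(error_rate("今天天气很好", "今天天气真好。", unit="char"))
```

In the first call, one substitution and one deletion against six reference words give roughly 33.3. Real evaluations usually apply a more careful text normalizer than the one sketched here, and the choice of normalizer can shift the reported numbers.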

General Chinese Recognition

Model	AiShell-2	Fleurs-Zh	Wenet Meeting	Wenet Net	CommonVoice-Zh
Qwen3-ASR-1.7B	2.67	3.21	5.9	4.94	5.6
Seed-ASR 2.0	2.63	3.31	7.22	4.89	5.08
FunASR-1.5	2.57	2.75	5.95	5.28	4.57
Gemini-3.1-Pro	4.52	3.3	12.09	9.69	7.74
MiMo-V2.5-ASR	2.52	2.41	5.92	5.26	4.90

General English Recognition (Open ASR Leaderboard)

Model	Average WER	AMI	Earnings22	Gigaspeech	LS Clean	LS Other	SPGISpeech	Tedlium	Voxpopuli
Qwen3-ASR-1.7B	5.76	10.56	10.25	8.74	1.63	3.4	2.84	2.28	6.35
FunASR-1.5	5.88	10.65	11.64	8.71	1.5	3.16	1.87	3.48	5.99
Qwen3-ASR-0.6B	6.42	11.66	11.06	9.14	2.13	4.45	3.03	2.85	7.07
Whisper-large-v3	7.44	15.95	11.29	10.02	2.01	3.91	2.94	3.86	9.54
VibeVoice-ASR-HF	7.77	17.2	13.17	9.67	2.2	5.51	3.8	2.57	8.01
Seed-ASR 2.0	8.09	15.78	13.37	9.63	2.81	5.83	4.65	3.13	9.53
MiMo-V2.5-ASR	5.73	10.63	11.17	8.87	1.45	3.49	1.85	2.4	6.01
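
The Average WER column is consistent with an unweighted mean of the eight per-dataset scores. The sketch below checks that reading for three rows copied from the table above. The helper names and the rounding tolerance are illustrative assumptions, and the page itself does not state how the average is defined.

```python
from statistics import fmean

# Per-dataset WER copied from the table above, in column order:
# AMI, Earnings22, Gigaspeech, LS Clean, LS Other, SPGISpeech, Tedlium, Voxpopuli
PER_DATASET_WER = {
    "Qwen3-ASR-1.7B": [10.56, 10.25, 8.74, 1.63, 3.4, 2.84, 2.28, 6.35],
    "Whisper-large-v3": [15.95, 11.29, 10.02, 2.01, 3.91, 2.94, 3.86, 9.54],
    "MiMo-V2.5-ASR": [10.63, 11.17, 8.87, 1.45, 3.49, 1.85, 2.4, 6.01],
}

# "Average WER" column from the same table.
REPORTED_AVERAGE = {
    "Qwen3-ASR-1.7B": 5.76,
    "Whisper-large-v3": 7.44,
    "MiMo-V2.5-ASR": 5.73,
}


def macro_average(wers: list[float]) -> float:
    """Unweighted mean across datasets: each test set counts equally."""
    return fmean(wers)


def matches_reported(model: str, tol: float = 0.005) -> bool:
    """True if the recomputed mean agrees with the table to two decimals."""
    return abs(macro_average(PER_DATASET_WER[model]) - REPORTED_AVERAGE[model]) <= tol


if __name__ == "__main__":
    for model, wers in PER_DATASET_WER.items():
        print(f"{model}: mean={macro_average(wers):.4f}, "
              f"reported={REPORTED_AVERAGE[model]}, ok={matches_reported(model)}")
```

Because each dataset counts equally regardless of size, a model can lead on the average without leading on every individual test set, which is the case for AMI in the table above.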

Chinese Dialects

Model	CommonVoice-Taiwan	WeNet-Yue	WeNet-Wu	WeNet-Chuan	Fleurs-Yue
Qwen3-ASR-1.7B	3.7	6.54	24.29	11.99	3.53
Seed-ASR 2.0	4.18	11.33	32.59	11.61	5.63
FunASR-1.5	6.78	6.6	29.08	13.21	31.16
Gemini-3.1-Pro	6.19	22.11	64.87	30.11	4.78
MiMo-V2.5-ASR	3.65	7.21	19.55	11.99	3.28

Lyrics Recognition

Model	m4singer	opencpop	MIR-1K-Vocals	Sing-Chinese
Qwen3-ASR-1.7B	4.82	3.68	5.56	12.05
Seed-ASR 2.0	9.46	N/A	17.92	N/A
FunASR-1.5	5.58	17.36	N/A	12.8
Gemini-3.1-Pro	4.25	4.42	8.24	17.67
MiMo-V2.5-ASR	3.95	2.93	4.91	9.06

Internal Business Scenarios

Model	Code-Switch	In-house-1	In-house-2
Qwen3-ASR-1.7B	18.3	7.55	16.29
Seed-ASR 2.0	16.55	9.08	14.43
FunASR-1.5	16.99	9.78	16.9
Gemini-3.1-Pro	18.45	16.47	30.9
MiMo-V2.5-ASR	14.07	6.91	13.46

Scenario Showcases

The following examples illustrate MiMo-V2.5-ASR's recognition capabilities across diverse scenarios, offering a direct look at the model's performance along multiple dimensions.

Part 1: Chinese Dialect Recognition

Hundreds of millions of people in China speak regional dialects, yet most speech recognition systems offer limited support beyond Mandarin. MiMo-V2.5-ASR natively supports Wu, Cantonese, Hokkien, Sichuanese, Henan dialect, Northeastern Mandarin, and more.

Dialect	Speech	Recognition Result
Wu		再讲呢，爷娘又勿辣身边，勿懂个呀，有吃就拼命吃，就像个狼一样个。
Cantonese		听佢讲，无论自己出咩难题，都能够尽力办到。唉，咁真系天外飞嚟嘅大横财。
Sichuanese		幺二八筒都成对，有的是机会。这把我看你们哪个跑得脱。
Hokkien		十点，哇，迄日头搁真炎，热啊要死诶。
Taiwanese Mandarin		一山还有一山高，萝卜还有萝卜糕。
Shandong		盘中人民币汇率却逆势上涨。
Northeastern		干啥呢？你搁那嘎达杵着干啥呢？你瞅啥？
Henan		就是童年记忆中的鲜美味道。
Shaanxi		所以，中原文化的底蕴，自然不言而喻。

Part 2: Robust Recognition in Complex Acoustic Scenarios

Real-world speech recognition frequently contends with environmental noise and overlapping speakers. Through large-scale data augmentation and targeted training, MiMo-V2.5-ASR delivers robust performance in live-streaming, noisy, and multi-speaker overlap scenarios.

Scenario	Speech	Recognition Result
Live-stream sales		这一千块钱啊，只有两百个，只有两百个，只有两百个人可以抢，快快快快快，手慢了就没有。
Live-stream commentary		比从上海飞我们青海还便宜呢，啊，吉祥航空，人民币两千二，人民币两千二，牛逼吧？这就是中国航空公司把价格都打下来了。不是，现在我没否认，两千二，真两千二，啊，真两千二。我跟你说一下什么情况，现在中国的航空公司抢生意呢，你像像吉祥啊、的小航航空公司都飞欧洲，快把那，为什么？一是疫情加上俄呃呃俄乌冲突开始以后，欧洲的航空公司遇到了两个问题。
Esports commentary		漂亮，漂亮，漂亮，就是这么打。因为我们确实也，RNG赢需要三十、三十五分钟，T1赢就要需要二十五分钟。狐狸给你，然后我用丽桑卓去打你的一个狐狸。是的。然后我看谁还敢质疑我！对，塔姆这个点把我们比较依赖的小明的这个开团克制得太死了。
English multi-speaker		Okay, that's what I'll do for you. You live in the Chicago area? Yes, I do. I'll bring you back on the show, and I'm gonna find two guys perfect for you that you can choose.

Part 3: Code-Switching and Complex English Scenarios

In an increasingly globalized world, Chinese–English code-switching is a common feature of everyday communication. MiMo-V2.5-ASR naturally supports mixed-language speech within a single model, with no need for pre-set language tags.

Scenario	Speech	Recognition Result
Cantonese–English		大家有兴趣睇试食放题片嘅，可以支持下我哋。钟意睇嘅你，希望可以畀个 comment，like，share 畀朋友。订阅埋 Channel，㩒埋个钟仔，下次有新片就会通知，下次再同大家去食其他嘢。
Chinese–English		Stack Overflow是一个科技Q&A的平台，GitHub则是全球最大的源代码托管服务商。
Accented English		There should be a signal, uh, something like a radio wave or a infrared light or a LED, which can be used to change the different functionalities in the television. If the user wants to, uh, change the channels or increase the volume, he can change it.

Part 4: Knowledge-Intensive Scenarios

In specialized domains, the accurate recognition of personal names, place names, and technical terms is critical to transcript usability. Through knowledge-enhanced training strategies, MiMo-V2.5-ASR substantially improves recognition accuracy for easily confused homophones and domain-specific expressions, producing transcripts that are genuinely actionable.

Scenario	Speech	Recognition Result
Wordplay & idioms		你在我身边就像红袖添香，小鸟依人。好，我晚上去你家睡觉。趁虚而入。第1个字是重的反义词，轻，然后开的，轻车熟路。好。这晚在街中偶遇心中的她。我我就问1问，你什么时候来过我家了？走了走了走了。我妈看这节目，我怎么跟她交代？
Historical drama dialogue		你既说熹贵妃私通，那奸夫是谁呀？太医温实初。温实初是熹贵妃的心腹，日日都要把脉的，若说日久生情也是难怪。更何况我听说熹贵妃初入宫时卧病许久，当时就是温太医诊治。康常在好记性，原来孽情深重始于当日。两位妹妹怎么能如此揣测？熹贵妃入宫病重，由温太医诊治，乃是情理之中的事。温太医医术高明不说，与姐姐母家素日也有交情，入宫之后互相照应也是应当的，怎么会有私通一说？如此说来，竟是青梅竹马了。
English sports commentary		Mane, Thiago, Luis Diaz. He's going to have a goal. It's deflected and it's in! Redemption, a route back for Liverpool from Luis Diaz with the help of a touch on the way through. And now it's time to believe again for those inside Anfield. And here comes another threat for Liverpool. Mohamed Salah's trying to get back. Round the corner goes Davis, and he's gone for goal and he's gone over the top. Simicass, one more burst. Naby Keita, Mohamed Salah. Still had a lot to do and the defender Sanchez got a touch.
Classical Chinese poetry		丈夫生居天地之间，岂能郁郁久居人下。天子呼来不上船，自称臣是酒中仙。君不见黄河之水天上来，奔流到海不复回。山外青山楼外楼，西湖歌舞几时休。草衣家住断桥东，好句轻如湖上风。杨柳春风一杯酒，江湖夜雨十年灯。莫等闲，白了少年头，空悲切。暖风吹得游人醉，直把杭州作汴州。对影闻声已可怜，玉池荷叶正田田。本是后山人，偶作前堂客。

Part 5: Lyrics Recognition

From polished studio tracks to free-tempo live performances in noisy venues, singing voice recognition has long been one of the hardest problems in ASR. MiMo-V2.5-ASR can accurately convert the sung word into text even under heavy accompaniment, extreme pitch variation, and artistic rhythmic phrasing.

Scenario	Speech	Recognition Result
Chinese pop		兰亭临帖，行书如行云流水。月下门推，心细如你脚步碎。忙不迭，千年碑易拓，却难拓你的美。真迹绝，真心能给谁？牧笛横吹，黄酒小菜又几碟。
Chinese ballad		烟雨入江南，山水如墨染，宛若丹青未干，提笔然点欲穿，行舟临秀川，画鹢推清澜，缱绻怡然。
English song		It's never over. It's never over. The ribbon was crimson, the color of the night.

Conclusion and Outlook

Through staged training and targeted optimization for complex scenarios, MiMo-V2.5-ASR achieves state-of-the-art performance across multiple dimensions: general bilingual Chinese–English recognition, Chinese dialects, code-switching, lyrics transcription, noisy and multi-speaker environments, and knowledge-intensive content.

Going forward, we will continue to expand dialect coverage and deepen the model's contextual awareness.