Xiaomi MiMo-V2.5

A leap in agency and multimodality.

Introducing MiMo-V2.5

Today, we are releasing MiMo-V2.5, a major step forward in agentic capability and multimodal understanding. With native visual and audio understanding, MiMo-V2.5 reasons seamlessly across modalities, surpasses MiMo-V2-Pro in agentic performance, and supports up to 1 million tokens of context.

With our optimized training pipelines, MiMo-V2.5 is trained from the start to see, hear, and act on what it perceives, leading to a single model that understands everything and gets things done.


Best-in-Class Agency

MiMo-V2.5 builds on our strong LLM backbone, extended with dedicated visual and audio encoders and an optimized post-training pipeline that jointly aligns perception, reasoning, and tool use.

On the agentic benchmarks that matter most for real-world deployment, MiMo-V2.5 delivers best-in-class performance:

In our internal MiMo Coding Bench, MiMo-V2.5 delivers strong results on everyday coding tasks, closing the gap with frontier models and matching MiMo-V2-Pro at half the cost.

On Claw-Eval, a benchmark for daily agentic tasks, MiMo-V2.5 achieves a 62.3 on the general subset, placing it at the Pareto frontier of performance and efficiency.

These results highlight what makes MiMo-V2.5 unique: frontier-level agentic capability with high token efficiency. You no longer need to choose between a model that understands everything and one that gets things done.

Sharper Perception, Longer Horizon

MiMo-V2.5 delivers sharper perception for precise visual reasoning, complex chart analysis, and deep multimodal understanding, with native support for up to 1 million tokens of context.

On multimodal agentic tasks, MiMo-V2.5 reaches 23.8 on Claw-Eval Multimodal, matching Claude Sonnet 4.6, leading MiMo-V2-Omni by eight points, and trailing Claude Opus 4.6 by a single point.

On video understanding, MiMo-V2.5 scores 87.7 on Video-MME, effectively tied with Gemini 3 Pro (88.4) and well ahead of Gemini 3 Flash. Long-horizon video comprehension — scene tracking, temporal reasoning, visual grounding over minutes of footage — is now in frontier territory.

On image understanding, MiMo-V2.5 lands at 81.0 on CharXiv RQ and 77.9 on MMMU-Pro, closing in on Gemini 3 Pro.

Together, these results show MiMo-V2.5 is a single model that perceives, reasons, and acts across every modality at the frontier.

Token Plan Update

Alongside stronger models, your Token Plan gets better too. Rates are now simpler and lower.

From today onward, Token Plans no longer charge a multiplier for the 1M-token context window. Order your Token Plan now.

What's Next

MiMo-V2.5 brings frontier agency and native multimodality into the same model, at a price point that makes both practical for production. We are already training the next generation with deeper reasoning, tighter tool integration, and richer real-world grounding. In the meantime, try it in AI Studio or access the API — we cannot wait to see what you build.
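For developers heading straight to the API, here is a minimal sketch of what a multimodal request body might look like, assuming an OpenAI-style chat-completions interface. The endpoint URL, model identifier, and payload field names below are illustrative placeholders, not the documented MiMo API.

```python
import json

# Placeholder endpoint -- substitute the real API URL from the docs.
API_URL = "https://api.example.com/v1/chat/completions"

def build_request(prompt: str, image_url: str, model: str = "mimo-v2.5") -> str:
    """Build a JSON body mixing text and an image in a single user turn.

    The content-part schema ("type": "text" / "image_url") follows the
    common OpenAI-style convention and is assumed, not confirmed.
    """
    body = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": 1024,
    }
    return json.dumps(body)

payload = build_request(
    "What trend does this chart show?",
    "https://example.com/chart.png",
)
print(json.loads(payload)["model"])  # mimo-v2.5
```

Sending the payload is then a single POST with your API key in the Authorization header; with the 1M-token context window, long documents and video transcripts can go in the same messages array without chunking.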