Introducing MiMo-V2-Flash
Today, we are releasing and open-sourcing MiMo-V2-Flash, a powerful, efficient, and ultra-fast foundation language model that excels in reasoning, coding, and agentic scenarios while also serving as an excellent general-purpose assistant for everyday tasks.
Benchmark Comparison
Reasoning, Coding, and Agentic
MiMo-V2-Flash is a Mixture-of-Experts model with 309B total parameters and 15B active parameters. It adopts a hybrid attention architecture that interleaves sliding-window and full attention, using an aggressive 128-token sliding window at a 5:1 ratio of sliding-window to full-attention layers. With such a lightweight architecture, we deliver superior intelligence.
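For concreteness, here is a minimal sketch of what the 5:1 interleaving implies for the layer layout; the total layer count, the naming, and the position of the full-attention layer within each group are illustrative assumptions, not the released configuration.

```python
# Illustrative sketch of a 5:1 sliding-window / full-attention interleaving.
# Layer count and the position of the full-attention layer within each group
# are assumptions for illustration, not the released config.
SLIDING_WINDOW = 128   # tokens visible to each sliding-window attention layer
HYBRID_PERIOD = 6      # 5 sliding-window layers per 1 full-attention layer

def attention_layout(num_layers: int) -> list[str]:
    """Return the attention type used by each transformer layer."""
    return [
        "full" if (i + 1) % HYBRID_PERIOD == 0 else f"swa({SLIDING_WINDOW})"
        for i in range(num_layers)
    ]

print(attention_layout(12))
# Layers 6 and 12 use full attention; the other ten use a 128-token sliding window.
```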
On the AIME 2025 math competition and the GPQA-Diamond scientific knowledge benchmark, MiMo-V2-Flash ranks among the top two open-source models, demonstrating strong reasoning ability. On the SWE-bench Verified and Multilingual benchmarks for software engineering, it takes the #1 spot among all open-source models and is on par with the world's top closed-source models.

This model is built for reasoning, coding, and agentic scenarios. It supports a hybrid thinking mode, letting users toggle whether the model "thinks" or answers instantly; it can generate functional HTML webpages with one click and works seamlessly with vibe-coding scaffolds such as Claude Code, Cursor, and Cline; and it offers an ultra-long 256K context window, enabling it to complete tasks spanning hundreds of rounds of agent interactions and tool calls.
Let MiMo-V2-Flash integrate into your workflow and build things for you.
Web Dev Showcases
Meet MiMo
MiMo-V2-Flash is not just a specialist in code and math: it can also be your assistant for everyday tasks and a friend to exchange ideas with, sparking your inspiration.
Push Efficiency to the Limit
MiMo-V2-Flash is engineered for maximum efficiency. It delivers blazing-fast inference at 150 tokens per second while maintaining an ultra-low cost of $0.1 per million input tokens and $0.3 per million output tokens—making it one of the most cost-effective high-performance models available.
Price vs Speed
Price is calculated as a blend of input and output prices in a 3:1 ratio. Data from Artificial Analysis.
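Under that 3:1 blend, the quoted prices work out to (3 × $0.1 + 1 × $0.3) / 4 = $0.15 per million tokens.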
Efficiency comes from innovative architectural advances designed for high-throughput inference. MiMo-V2-Flash adopts a 1:5 hybrid of Global Attention (GA) and Sliding Window Attention (SWA). Our extensive empirical results show that SWA is simple, efficient, and easy to use, delivering better overall performance than Linear Attention across general tasks, long-context tasks, and reasoning. It also provides a fixed-size KV cache, making it easy to integrate with existing training and inference infrastructure. We also redefine parallel decoding to achieve extremely high output-token throughput: Multi-Token Prediction (MTP) training boosts the base model's capabilities, and during inference the MTP draft tokens are verified in parallel.
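The fixed-size KV cache matters most at long context. As a rough illustration, the sketch below compares per-layer KV-cache memory for a full-attention layer and a 128-token sliding-window layer; the KV head count, head dimension, and FP16 storage are placeholder assumptions, not the actual model dimensions.

```python
# Rough per-layer KV-cache comparison. KV head count, head dim, and FP16
# storage are placeholder assumptions, not the actual MiMo-V2-Flash dims.
def kv_cache_bytes(seq_len: int, window: int | None,
                   num_kv_heads: int = 8, head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    """KV-cache size for one layer: full attention grows with seq_len,
    sliding-window attention is capped at the window size."""
    cached = seq_len if window is None else min(seq_len, window)
    return 2 * cached * num_kv_heads * head_dim * bytes_per_elem  # 2 = keys + values

ctx = 256_000  # the advertised 256K context
print(kv_cache_bytes(ctx, window=None))  # full attention: ~1 GB per layer
print(kv_cache_bytes(ctx, window=128))   # 128-token SWA:  ~0.5 MB per layer
```

With five of every six layers capped this way, the cache for those layers stops growing with conversation length altogether.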
Multi-Token Prediction
MiMo-V2-Flash leverages MTP as a native draft model for self-speculative decoding, delivering real deployment speedups. LLM decoding is inherently memory-bound due to low arithmetic intensity. Batch-level parallelism is commonly used to increase FFN arithmetic intensity but does not benefit attention computation, as each request maintains its own KV cache. In contrast, MTP lifts the arithmetic intensity of both FFN and attention by generating multiple draft tokens, which the main model then verifies in parallel. This approach enables token-level parallelism without increasing KV cache I/O.

In MiMo-V2-Flash, the MTP block is deliberately kept lightweight to prevent it from becoming a new inference bottleneck. It uses a dense FFN (not MoE) to limit parameter count and SWA (instead of GA) to reduce KV cache and attention computation costs. Despite this lean design, the MTP module achieves a high acceptance rate. In our measurements with a 3-layer MTP, it attains an accepted length of 2.8–3.6 tokens and an effective speedup of 2.0–2.6×.
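To make the verification step concrete, below is a minimal greedy-acceptance sketch of self-speculative decoding: the MTP head drafts a few tokens, the main model scores the whole draft in one parallel forward pass, and the longest matching prefix is accepted. The callables and the greedy acceptance rule are simplifying assumptions, not the released inference implementation.

```python
from typing import Callable

# Minimal greedy-acceptance sketch of self-speculative decoding with an MTP draft head.
# `draft_fn` and `score_fn` stand in for the MTP head and the main model; both are
# placeholders, and real systems also support sampling-based acceptance rules.
def speculative_step(context: list[int],
                     draft_fn: Callable[[list[int], int], list[int]],
                     score_fn: Callable[[list[int]], list[int]],
                     num_draft: int = 3) -> list[int]:
    # 1) The lightweight MTP head proposes `num_draft` tokens after the context.
    draft = draft_fn(context, num_draft)

    # 2) The main model scores context + draft in a single forward pass, returning
    #    its greedy next-token prediction at each of the num_draft + 1 positions.
    main_preds = score_fn(context + draft)

    # 3) Accept the longest draft prefix that matches the main model, then take one
    #    extra token from the main model, so every step emits at least one token.
    accepted: list[int] = []
    for i, tok in enumerate(draft):
        if tok != main_preds[i]:
            break
        accepted.append(tok)
    accepted.append(main_preds[len(accepted)])
    return accepted  # 1 to num_draft + 1 new tokens per main-model forward pass
```

With the reported accepted length of 2.8–3.6 tokens, each main-model forward pass makes roughly three tokens of progress, which translates into the 2.0–2.6× effective speedup once the draft and verification overhead is accounted for.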
MOPD: A New Post-Training Paradigm
During the post-training phase, to efficiently scale reinforcement learning (RL) computation and enhance the model's reasoning and agentic capabilities, we propose the Multi-Teacher Online Policy Distillation (MOPD) paradigm. At its core lies an efficient on-policy learning mechanism: after obtaining domain-specific expert teachers via SFT/RL, the student model samples (rollouts) from its own policy distribution and optimizes using dense, token-level rewards provided by multiple teachers.
MOPD training is stable and remarkably efficient—requiring less than 1/50 of the computational resources of traditional SFT+RL pipelines to match the peak performance of teacher models.
Furthermore, MOPD employs a decoupled design that supports flexible integration of new teachers and outcome reward models (ORMs), and it naturally enables a "teach and learn" closed-loop iteration: distilled student models can evolve into stronger teachers, enabling continuous self-improvement.
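As a minimal sketch of the dense, token-level learning signal described above, the snippet below scores a student rollout against one teacher's distribution with a per-token KL term; the reverse-KL form, the single-teacher setup, and the tensor shapes are illustrative assumptions, not the exact MOPD objective or its multi-teacher weighting.

```python
import torch
import torch.nn.functional as F

def token_level_distill_loss(student_logits: torch.Tensor,
                             teacher_logits: torch.Tensor) -> torch.Tensor:
    """Dense per-token distillation loss on a student rollout.

    Both tensors have shape [batch, seq_len, vocab]. The teacher scores the
    student's own sampled trajectory (on-policy), so every generated token
    receives a learning signal instead of a single sequence-level reward.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # Reverse KL(student || teacher) at every token position (an illustrative choice).
    per_token_kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    return per_token_kl.mean()
```

In a full MOPD step, several domain teachers would each contribute such token-level signals over the student's rollouts; how those signals are routed and combined is part of the decoupled design described above and is not shown here.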
Evaluation
We evaluate MiMo-V2-Flash across a broad suite of benchmarks:
| Benchmark | MiMo-V2-Flash | Kimi-K2 Thinking | DeepSeek-V3.2 Thinking | Gemini-3.0 Pro | Claude Sonnet 4.5 | GPT-5 High |
|---|---|---|---|---|---|---|
MiMo-V2-Flash achieves performance comparable to K2 Thinking and DeepSeek V3.2 Thinking on most reasoning benchmarks, while maintaining competitive general writing capabilities for high-quality open-ended responses. In long-context evaluations, our model surpasses K2 Thinking, a significantly larger model that uses full global attention throughout, highlighting the strong long-context capabilities of our hybrid SWA architecture.
On agentic tasks, MiMo-V2-Flash scores 73.4% on SWE-Bench Verified, outperforming all open-source competitors and approaching the performance of GPT-5-High. On SWE-Bench Multilingual, the model resolves 71.7% of issues, establishing it as the most capable open-source LLM for software engineering. For general tool use on τ²-Bench, it achieves category scores of 95.3 (Telecom), 79.5 (Retail), and 66.0 (Airline), and it also posts a competitive score on Terminal Bench. For search-agent evaluation, MiMo-V2-Flash scores 45.4 on BrowseComp, further boosted to 58.3 with context management.
Open-source
As usual, we are open-sourcing everything. Read our technical report for full model details.
Model weights, including MiMo-V2-Flash-Base, are available on Hugging Face under the MIT license.
On Day 0, we contributed all inference code to SGLang. We worked closely with their team and shared insights on MiMo-V2-Flash inference on the LMSYS blog.
Let's build with MiMo. Start here: API Platform (limited-time free!)