Xiaomi MiMo-V2-Omni
See, Hear, Act in the Agentic Era
One model. Every modality. Ready to act.
Beyond the Text Box
Language models have mastered text, but the real world is multimodal and driven by action.
Whether it's a robotic arm following voice commands, a digital agent executing multi-step software workflows, or an autonomous vehicle fusing real-time sensor data, true automation requires an intelligence that does more than just "talk."
To step beyond text, the next generation of agents will operate across screens, software, and the physical world. They require an intelligence capable of processing reality in all its richness—images, video, audio, and text in perfect concert—and translating that perception into reliable, real-time action within complex systems. MiMo-V2-Omni was built for that future: an omni foundation model uniting frontier multimodal understanding with strong agentic capability.
Perception and Agency in One Model
MiMo-V2-Omni fuses dedicated image, video, and audio encoders into a single shared backbone — not as separate capabilities bolted together, but as a unified perceptual stream. The model sees, hears, and reads simultaneously, the way an agent operating in the real world must.
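To make the idea concrete, here is a minimal PyTorch sketch of what a shared perceptual stream can look like: per-modality features are projected into one token space and attended over jointly. This is our own simplified illustration under assumed dimensions and layer counts, not MiMo-V2-Omni's actual architecture.

```python
import torch
import torch.nn as nn

class UnifiedOmniBackbone(nn.Module):
    """Toy sketch: per-modality encoders feed one shared token stream."""

    def __init__(self, d_model=1024):
        super().__init__()
        # Stand-in projections; a real system would use pretrained vision/audio encoders.
        self.image_proj = nn.Linear(768, d_model)   # image patch features -> shared space
        self.audio_proj = nn.Linear(512, d_model)   # audio frame features -> shared space
        self.text_embed = nn.Embedding(32000, d_model)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=16, batch_first=True),
            num_layers=4,
        )

    def forward(self, image_feats, audio_feats, text_ids):
        # Concatenate all modalities into a single interleaved sequence so that
        # attention operates over one perceptual stream, not separate towers.
        tokens = torch.cat(
            [self.image_proj(image_feats),
             self.audio_proj(audio_feats),
             self.text_embed(text_ids)],
            dim=1,
        )
        return self.backbone(tokens)
```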
But perception alone is not enough. An agent must connect what it perceives to what it does next. Our approach is direct: we train the model to anticipate the future, not just describe the present. What is in the scene, what will happen next, what should be done now — the model learns all three from the very first training step. Perception and action are never separate stages; they emerge as one continuous reasoning process.
At the output end, this means MiMo-V2-Omni natively supports structured tool calling, function execution, and UI grounding — ready to plug into real agent frameworks and orchestration systems without additional adaptation layers.
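As a rough illustration of what structured tool calling looks like from an orchestrator's side, here is a hedged sketch assuming an OpenAI-style function-calling schema; the `ui_click` tool and its fields are invented for this example and are not MiMo-V2-Omni's actual interface.

```python
# A hypothetical tool schema an orchestrator might register with the model.
click_tool = {
    "name": "ui_click",
    "description": "Click an on-screen element identified by pixel coordinates.",
    "parameters": {
        "type": "object",
        "properties": {
            "x": {"type": "integer", "description": "Pixel x of the click target"},
            "y": {"type": "integer", "description": "Pixel y of the click target"},
            "reason": {"type": "string", "description": "Why this element was chosen"},
        },
        "required": ["x", "y"],
    },
}

# A structured call the model might emit after grounding a button in a screenshot.
example_call = {
    "tool": "ui_click",
    "arguments": {"x": 412, "y": 930, "reason": "'Add to cart' button in the product panel"},
}
```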
Perception: Image, Video, Audio, and Beyond
An agent that can't perceive well can't act well. We evaluated MiMo-V2-Omni against leading models across every perceptual modality — not to chase benchmarks, but to verify that the foundation is solid before building agency on top of it.
On audio understanding (MMAU-Pro, BigBench-Audio), MiMo-V2-Omni goes well beyond transcription into environmental sound classification, multi-speaker disentanglement, audio-visual joint reasoning, and deep comprehension of continuous audio exceeding 10 hours in length. Overall, it surpasses Gemini 3 Pro, making it one of the strongest audio understanding foundation models available today.
On image understanding (MMMU-Pro, CharXiv RQ), MiMo-V2-Omni demonstrates powerful multi-discipline visual reasoning and complex chart analysis, exceeding Claude Opus 4.6 and approaching the level of top closed-source models such as Gemini 3.
On video understanding (VideoMME, FutureOmni), MiMo-V2-Omni supports native audio-video joint input, enabling true multimodal video comprehension. Through innovative video pretraining, it develops strong situational awareness and future reasoning capability — not just perceiving what is happening, but anticipating what comes next, grounded in the full sensory context.
True autonomous driving does not need more static bounding boxes. It requires a model that understands the physical world well enough to anticipate danger before it materializes.
We gave MiMo-V2-Omni raw dashcam footage from a drive through a coastal town and a single instruction: act as the visual brain of an autonomous driving system and identify potential safety risks in real time. The unedited output is below.
As the visual processing unit for this autonomous vehicle, I have analyzed the video feed to identify potential hazards, traffic dynamics, and safety risks. Here is a real-time risk assessment of the drive through the coastal town.
00:00 – 01:00: Roundabout & Coastal Road Entry
- 00:02 [Yield Hazard]: A dark green sedan is entering the roundabout from the left. Our system must verify if it will enter our lane or continue circulating.
- 00:07 – 00:13 [Vulnerable Road User]: A cyclist is crossing the road from right to left in the distance. The system tracks their vector to ensure they do not swerve into our path.
- 00:22 [Critical Intersection Risk]: A silver SUV enters the roadway from the left side. This is a high-risk merge point; the vehicle has poor visibility of our lane until it pulls into traffic. Immediate deceleration and tracking are required to prevent a collision.
- 00:31 [Pedestrian Hazard]: A pedestrian is walking on the very edge of the road on the right side, near the cliff wall path.
- 00:47 [Lane Narrowing]: A large white van is approaching in the oncoming lane while our lane is lined with parked cars. The drivable space is significantly constricted, requiring precise lateral positioning to avoid side swipes.
01:00 – 02:15: Seafront Boulevard (Construction Zone)
- 01:00 – 01:30 [Occlusion & Boundaries]: The right side of the road is blocked by high stone walls and hoarding; the left by parked cars. This creates "tunnel vision." Any pedestrian or vehicle emerging from the parked cars on the left would have zero warning time.
- 01:09 [Pedestrian on Wall]: A pedestrian is walking along the top of the wall on the right. While not immediately in the road, erratic movement could lead to a fall into traffic.
- 02:12 [Intersection/Right Turn]: The vehicle approaches a junction. Cross-traffic has right-of-way. We must scan deeply left and right past the building corners which create severe blind spots.
02:15 – 04:00: Town Center & High Street (High Density)
- 02:26 [Slow Moving Obstacle]: A street cleaning vehicle (yellow/orange) is moving slowly ahead. This obstructs forward visibility and may make sudden, unpredictable lateral movements to clean the gutter.
- 02:59 [Jaywalking Hazard]: A pedestrian in a pink shirt crosses the street mid-block against traffic flow. This is a classic anomalous behavior detection required for safety.
- 03:00 – 03:30 [Urban Canyon/Sidewalk Incursion]: We are in a narrow high street with shops. Pedestrian density is extremely high.
  - Risk: Pedestrians stepping off the curb (curb jaywalking) to avoid crowds on the narrow sidewalk.
  - Risk: Delivery vans (03:35) are double-parked, forcing their doors to open into traffic or requiring oncoming traffic to cross our center line.
- 03:59 [Jaywalker at Intersection]: A pedestrian crosses directly in front of the ego vehicle at a junction where the white van is stationary. The pedestrian assumes the van being stopped means the lane is clear.
04:00 – 05:30: Navigating Residential Turns
- 04:13 [Blind Crossing]: Two pedestrians are crossing the road ahead at a designated or informal crossing point. System braking is active.
- 04:51 [High Vulnerability Pedestrian]: A woman is standing on the street corner pushing a stroller (pram). She is positioned right at the apex of the corner, partially in the road.
  - Risk: If she moves just a few feet, she is in our direct collision path. The stroller increases the stopping distance required should she slip or drop an item.
- 05:10 [Blind Corner]: The road curves sharply past old buildings. Sight distance to the next junction is low. Speed must be modulated to the road curvature.
05:30 – 06:37: Bus Station & Roundabouts
- 05:45 [Roundabout Merge]: Approaching a complex roundabout. We must yield to traffic currently circulating. A blue Ford Fusion is ahead, dictating our pace.
- 05:56 [Occlusion by Heavy Vehicle]: Large buses are parked at the station on the right.
  - Risk: Pedestrians crossing from the bus station might step out from behind a bus (where our sensors cannot see them) directly into our lane.
- 06:18 [Traffic Island Crossing]: Pedestrians are standing on a refuge island in the middle of the road. They may step off the island unexpectedly to beat a light change or cross quickly.
- 06:34 [Closing Intersection]: Navigating a busier multi-lane intersection near a gas station. Multiple entry/exit points (cars entering/exiting the Shell station) create chaotic conflict points on the right side.
Mapping sounds to images mechanically is labeling. True cross-modal understanding means grasping what the sounds mean — not just what produced them.
We gave MiMo-V2-Omni a clip from the film The Good Thing in which everyday household sounds are paired with unexpected imagery, and asked it to analyze what it sees and hears. The model's unedited output, translated here from the original Chinese, is below.
This is a very delicate clip built on highly montage-driven thinking. The video shows a mother (or a female caregiver) playing a sound game with her daughter in a recording studio: using "auditory cues" as transitions, it reframes household scenes that are otherwise trivial, heavy, even somewhat harsh, into poetic, imaginative artistic moments.
1. Core Narrative Structure: The Dialectic of Sound and Image
The film adopts a correspondence montage structure that pairs the auditory with the visual.
- Auditory layer (the daughter's perspective): Wearing noise-cancelling headphones in the recording studio, the daughter listens to the various ambient sounds (Foley) her mother plays. Passed through her imagination, these sounds are described as grand or adorable images: "a typhoon", "a river", "a panda eating bamboo", "an elephant drinking water".
- Visual layer (the mother's perspective): The picture then cuts to the real-world source of each sound, usually the mother's actions amid household chores: cooking, mopping, washing dishes, fixing things, even an accidental fall.
This structure establishes a kind of poeticized realism: the hardship of the mother's labor (reality) is transformed, through the daughter's ears, into something almost mythical (art). It is not only a game of tacit understanding between mother and daughter, but also the film's aesthetic reframing of everyday domestic labor.
2. Decoding the Imagery of Key Scenes
- 00:58 - 01:02: A panda eating bamboo
  - Sound/image: crisp chewing sounds.
  - Picture: the mother forcefully snapping celery/vegetables.
  - Analysis: a humorous contrast. The mother's repetitive food-prep motions are playfully likened by the daughter to a panda chewing bamboo. This hints at the repetitiveness and almost animal-like weariness of maternal labor, while also showing the endearing side the daughter sees.
- 01:13 - 01:21: A hippo drinking water
  - Picture: the mother tips her head back and gulps from an enormous blender cup (possibly a juicer).
  - Analysis: a vivid, self-deprecating image. It depicts the mother's thirst and exhaustion with stark realism: she has no time to drink gracefully and can only gulp down the water she needs to keep going, like an animal. In the daughter's telling, however, it becomes a scene from the animal kingdom, dissolving the mother's dishevelment and giving it vitality.
- 01:43 - 01:50: A mudslide
  - Event: the mother carries a bag of oranges/persimmons down the stairs; the bag breaks, the fruit rolls everywhere, and she scrambles to pick it up.
  - Image: the daughter calls it "a mudslide".
  - Analysis: a symbolically charged climax. The oranges tumbling down the stairs stand for a momentary loss of control over life, a small collapse. The orange fruit looks jarringly bright in the dim stairwell, just like the details of a life slipping out of control. Calling the collapse a "mudslide" fits the motion of the rolling fruit while also serving as a metaphor for the uncontrollable, catastrophic pressure that could erupt in the mother's life at any moment.
- 01:51 - 01:55: A dolphin jumping in the ocean
  - Picture: tap water rushing over broccoli and tomatoes.
  - Analysis: a montage of stark contrast. The murky runoff of washing vegetables is imagined as a dolphin leaping freely. It expresses a longing for freedom amid trivial housework, or simply a child's innocent projection of the ocean onto every sound of water.
3. Visual Style and Lighting Analysis
- The recording studio (the haven): warm, dim, enveloping light. This is the private space where mother and daughter connect, and the source of their sense of safety. Posters and clutter on the walls show it is a "workshop" that belongs to them.
- Kitchen and housework scenes (the battlefield): mostly natural light or cool tones, emphasizing the texture and roughness of daily life. At 01:56, for instance, as the mother stirs hot soup, light falls through the window onto the steam, looking at once sacred and weary.
- Street and stairwell (loneliness): at 02:18, on a rainy night street, neon lights reflect off the wet pavement as the mother walks alone under an umbrella, in a cold blue-green palette. This reinforces the loneliness and weight adults carry in urban life.
4. Character Dynamics and Thematic Summary
- The mother (The Observer/The Creator): she is the maker of these sounds and the one who bears the weight of daily life. In the kitchen, the stairwell, and under the late-night lights, she handles the chores that keep the household running. At the end she walks alone in the rain and tidies the room, showing remarkable resilience.
- The daughter (The Listener/The Interpreter): she is the witness to her mother's toil and its only "translator". Through her free-wheeling descriptions, the mother's thankless labor is turned into a masterpiece.
Summary: Through sound association, this clip fuses domestic labor (the material) with artistic imagination (the imaginary). It is a story not merely about cooking and cleaning but about seeing and hearing: life is full of noise and chaos, but if you are willing to put on the "headphones of imagination" (or have a child who loves you enough to interpret it for you), the noise that once felt suffocating might become a symphony about survival.
Long-form audio is a blind spot for most multimodal foundation models today. MiMo-V2-Omni changes that.
We fed the model a single, uncut 7-hour podcast episode. No chunking. No intermediate summaries. One pass. The model produced a structured, faithful summary that captured not just isolated topics, but the logical threads connecting arguments across hours.
MiMo-V2-Omni natively supports over 10 hours of continuous audio understanding — to our knowledge, the first omni model to operate at this scale without degradation. If it matters enough to record, it matters enough to understand in full.
7-hour marathon interview with Xie Saining — World Model, escaping Silicon Valley, AMI Labs (on Apple Podcasts)
Based on the interview transcript, here is a summary of the key highlights regarding Xie Saining's career, his research philosophy, and his new venture with Yann LeCun:
Transition from Academia to Entrepreneurship
- New Venture: Xie Saining has co-founded a startup with Turing Award winner Yann LeCun. The company aims to build a "World Model" (or a "Predictive Brain").
- Mission: The goal is to create a foundational model that understands the physical world, spatial intelligence, and physics, rather than just processing text. He describes this as a "reverse OpenAI" approach—building the foundational intelligence first before focusing on consumer applications.
- Scale: The startup aims to build a team of about 25 people initially, with offices in Paris, New York, Montreal, and Singapore.
The "World Model" vs. "Word Model" Thesis
- Critique of LLMs: Xie argues that Large Language Models (LLMs) are essentially "Word Models." While they are powerful, they function by tokenizing continuous signals (like video or audio) into discrete text, losing vital information about the physical world in the process.
- The Vision: He believes true intelligence requires a "World Model" that can predict the consequences of actions in a physical environment (spatial intelligence). He supports Yann LeCun's JEPA (Joint Embedding Predictive Architecture) as a path forward, which focuses on learning representations rather than just predicting pixels or tokens.
Research Philosophy and "Research Taste"
- Representation Learning: Throughout his career (at UCSD, FAIR, and NYU), Xie's core focus has been "Representation Learning"—teaching machines how to see and understand data.
- "Research Taste": He emphasizes the importance of "Research Taste"—the ability to distinguish between fundamental, impactful problems and incremental work. He believes the best research is non-linear and often involves pivoting based on new findings.
- The "Infinite Game": He views research as an "infinite game" where the goal is not just to win (publish papers) but to keep playing and exploring.
Career Highlights at FAIR (Meta)
- Key Works: During his four years at Meta's FAIR lab, Xie contributed to seminal works in self-supervised learning, including MoCo (Momentum Contrast), MAE (Masked Autoencoders), and DINO.
- Collaboration: He spoke highly of his collaboration with Kaiming He, describing their dynamic as highly productive and influential in the field of computer vision.
Personal Outlook and Leadership
- "The Normal One": Xie identifies with the quote "I am not the special one, I am the normal one" (referencing football coach Jurgen Klopp). He sees his role as a "battery"—providing energy and empowerment to his team rather than being a solitary genius.
- Courage: He stresses that in both research and entrepreneurship, courage is essential. He believes in "following your heart" and taking risks to solve hard problems, even if the path isn't clear.
Views on the AI Industry
- Skepticism of Hype: He is critical of the current "LLM is all you need" narrative, arguing that while LLMs are useful tools, they are not the complete solution for Artificial General Intelligence (AGI), especially regarding robotics and physical interaction.
- Openness: He advocates for a more open research culture, contrasting his startup's approach with the increasingly closed nature of some major AI labs.
Agentic Capabilities: From Anticipation to Action
True intelligence requires translating perception into action. We put MiMo-V2-Omni through a broad spectrum of omni-modal agentic tasks that demand sustained and precise execution.
To test its proficiency in real-world digital environments, we evaluated MiMo-V2-Omni across benchmarks such as MM-BrowserComp, OmniGAIA, and Claw Eval. Spanning web browsers, mobile interfaces, and complex application workflows, these tasks demand precise visual grounding, adaptive planning, and end-to-end execution. As the results demonstrate, MiMo-V2-Omni achieves exceptional performance, outperforming Gemini 3 Pro and GPT 5.2 across multiple core metrics.
To assess its competence in daily tasks, software engineering, and professional productivity, we evaluated MiMo-V2-Omni on a suite of rigorous benchmarks, including PinchBench, SWE-Bench Verified, and GDPVal. Here, too, the results are strong: MiMo-V2-Omni performs on par with the industry's most powerful reasoning models.
These results show that our natively unified architecture pays no tax for its omni-modal capabilities — delivering frontier-level performance on agentic tasks alongside vision and audio understanding.
Browser use is the litmus test for agentic capability. The environment changes dynamically, interaction patterns are heterogeneous across platforms, and sites actively resist automation. The perception–decision–action loop must run continuously in an open environment until the task is complete — and when the same capability transfers to smart devices or robotics, it becomes the prototype of a general-purpose agent.
For the demonstrations that follow, MiMo-V2-Omni is integrated with OpenClaw, a widely adopted open-source agent scaffold. OpenClaw handles environment orchestration — browser control, file system access, terminal interaction — while MiMo-V2-Omni provides perception, reasoning, and decision-making. The model sees what is on screen, decides what to do, and issues actions through OpenClaw; OpenClaw executes them and returns the updated state. This division makes the results a clean test of the model's own agentic intelligence rather than hard-coded automation.
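Schematically, that division of labor is a simple closed loop: the scaffold observes and executes, the model decides. The sketch below is illustrative only; `scaffold.reset`, `scaffold.execute`, and `model.decide` are placeholder names, not OpenClaw's or MiMo-V2-Omni's real APIs.

```python
def run_agent(task, model, scaffold, max_steps=50):
    """Schematic perception-decision-action loop: the model decides, the scaffold acts."""
    observation = scaffold.reset(task)          # e.g. screenshot + DOM + audio, if any
    for _ in range(max_steps):
        # The model receives the task and the full multimodal observation,
        # and returns either a structured action or a final answer.
        decision = model.decide(task=task, observation=observation)
        if decision["type"] == "finish":
            return decision["answer"]
        # The scaffold executes the action (browser click, shell command, file edit)
        # and returns the updated environment state for the next round.
        observation = scaffold.execute(decision["action"])
    raise TimeoutError("Task did not complete within the step budget")
```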
We gave the system a single instruction: I'm a student who enjoys travel photography. How should I choose among the three Xiaomi 17 models? Browse Xiaohongshu for advice, make a recommendation, then go to JD.com to place the order — and try to negotiate a better price.
The model controlled a browser end-to-end. It began by scanning over a dozen posts on Xiaohongshu to gather community opinions and form a purchase recommendation. It then switched to JD.com, compared prices across multiple sellers, and engaged with customer service through natural-language chat to negotiate discounts. Finally, it completed the add-to-cart and checkout flow. Throughout the process, the model autonomously handled non-standard DOM structures, multi-tab context management, and recovery from platform anti-automation interruptions — with no human intervention at any step.
The previous case showed the model operating within existing platforms. This one asks it to create something from nothing, then deliver it to a live audience — spanning content design, multimedia production, code debugging, and browser automation in a single unbroken workflow. Again, MiMo-V2-Omni drives the decisions; OpenClaw provides the hands.
We gave the system one instruction: Make a 15-second introduction video for MiMo-V2-Omni with tech-style sound effects, then upload it to TikTok.
The model began by designing four scenes: a particle-convergence opener with a glowing title, an omni-modal capability showcase with animated cards, a feature highlight sequence, and a closing with particle burst and slogan. It then programmatically synthesized all audio from scratch — low-frequency bass, electronic tones for a tech aesthetic, whoosh effects for transitions, digital glitch textures for detail. No external assets were used at any stage.
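For a flavor of what programmatic sound design can mean in practice, here is a minimal NumPy/SciPy sketch that generates a rising whoosh as an exponential sine sweep with a smooth envelope. It is our own illustration of the general technique, not the code the model actually wrote.

```python
import numpy as np
from scipy.io import wavfile

SR = 44100  # sample rate in Hz

def whoosh(duration=0.8, f_start=200.0, f_end=2000.0):
    """Simplified whoosh: an exponential sine sweep with a smooth fade in and out."""
    t = np.linspace(0.0, duration, int(SR * duration), endpoint=False)
    freq = f_start * (f_end / f_start) ** (t / duration)  # exponential frequency sweep
    phase = 2 * np.pi * np.cumsum(freq) / SR              # integrate frequency to get phase
    envelope = np.sin(np.pi * t / duration) ** 2          # fade in, fade out
    return (np.sin(phase) * envelope).astype(np.float32)

wavfile.write("whoosh.wav", SR, whoosh())
```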
During rendering, the model encountered a Chinese font error, diagnosed the root cause, applied a fix, and continued execution. It produced a 15-second 1080p video in roughly 30 seconds.
Then came the hardest part: browser use on a live platform. The model opened TikTok's upload page, uploaded the video file, and tracked upload progress to 100%. When filling in the description, it discovered that TikTok's input field is not a standard text element — conventional input methods failed. The model analyzed the page's DOM structure, identified the correct interaction pattern, and successfully entered the caption. It clicked publish, waited for the video to enter review, then liked the post and left a self-introduction comment.
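The caption-field recovery reflects a general pattern: when a field is not a plain input element, an agent has to interact the way a user would. Below is a hypothetical Playwright sketch of that fallback; the URL and selector are placeholders, and Playwright itself is an assumption rather than the tool stack used in this demo.

```python
from playwright.sync_api import sync_playwright

CAPTION = "Meet MiMo-V2-Omni: one model that sees, hears, and acts."

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://example.com/upload")  # placeholder URL, not TikTok's real upload page

    caption_box = page.locator("[contenteditable='true']").first
    try:
        # fill() covers <input>, <textarea>, and simple contenteditable elements.
        caption_box.fill(CAPTION)
    except Exception:
        # Some rich-text widgets only react to real key events, so fall back to
        # focusing the element and typing the caption the way a user would.
        caption_box.click()
        page.keyboard.type(CAPTION, delay=50)

    browser.close()
```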
Minutes later, the model checked back. The video had passed review and was publicly live.
Ten steps, spanning content creation, multimedia production, and browser automation, with zero human intervention. This is the difference between a model that generates content and a model that completes tasks. It does not follow a script — it understands page semantics, adapts to dynamic changes, and recovers from unexpected failures, operating like a real user in a real environment. The same closed-loop capability transfers directly to e-commerce operations, content management, data collection, and any workflow that demands sustained, adaptive interaction with live digital environments.
What's Next
MiMo-V2-Omni is a milestone, not a destination.
It is the first model in our roadmap that truly unifies perception and agency at the foundation level — a model that can see the full richness of the world, hear its complexity, and act within it. But every capability we have demonstrated here also reveals the next frontier we have not yet reached.
Today, the model plans over minutes. We want it to plan over hours and days — sustaining coherent intent across tasks that unfold on human timescales. Today, it processes clips. We want it to perceive in real-time streams, the way an embodied system must to operate in the physical world. Today, it works alone. We want multiple MiMo-V2-Omni instances to collaborate, decompose complex goals, and coordinate their actions. And today, it acts through browsers and terminals. We want it to act through hands — robots that don't just understand their environment, but reshape it.
We believe the path to general intelligence runs through the real world. A model that only reads text lives in a library. A model that sees, hears, reasons, and acts lives in the world. MiMo-V2-Omni is our first serious step into that world, and every step from here goes deeper.