May 03, 2026

MiMo-V2.5: Xiaomi’s Omnimodal 1M‑Token Agent Model Takes Center Stage

MiMo-V2.5, released by Xiaomi's MiMo team, is a native omnimodal model that unifies text, image, video, and audio understanding within a single architecture. Built on the MiMo-V2-Flash backbone, it employs a hybrid attention scheme that interleaves Sliding Window Attention with Global Attention at a 5:1 ratio and supports a context window of up to 1 million tokens, enabling long-context reasoning while keeping KV-cache memory low. The model comprises a 310 B-parameter mixture-of-experts LLM (15 B parameters activated per token), a 729 M-parameter Vision Transformer, and a 261 M-parameter audio transformer, all trained on roughly 48 T tokens using FP8 mixed precision.
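To make the layer layout concrete, the sketch below builds a per-layer attention schedule that interleaves five sliding-window layers with one global layer; the layer count, window size, and helper names are illustrative assumptions rather than values taken from the release.

```python
# Hypothetical sketch of a 5:1 sliding-window / global attention schedule.
# Layer count, window size, and naming are illustrative assumptions, not
# values from the MiMo-V2.5 release.
from dataclasses import dataclass

@dataclass
class LayerAttentionConfig:
    layer_idx: int
    attention_type: str       # "sliding_window" or "global"
    window_size: int | None   # token window for local layers, None for global

def build_hybrid_schedule(num_layers: int = 48, window_size: int = 4096,
                          local_per_global: int = 5) -> list[LayerAttentionConfig]:
    """Interleave local_per_global sliding-window layers with one global layer."""
    schedule = []
    for i in range(num_layers):
        # Every (local_per_global + 1)-th layer attends globally; the rest use
        # a fixed-size sliding window, which keeps the KV cache small.
        if (i + 1) % (local_per_global + 1) == 0:
            schedule.append(LayerAttentionConfig(i, "global", None))
        else:
            schedule.append(LayerAttentionConfig(i, "sliding_window", window_size))
    return schedule

if __name__ == "__main__":
    sched = build_hybrid_schedule()
    n_global = sum(1 for c in sched if c.attention_type == "global")
    print(f"{len(sched)} layers: {n_global} global, {len(sched) - n_global} sliding-window")
```

Keeping global layers sparse in this way is what lets the KV-cache footprint stay close to that of a pure sliding-window model even as the context grows toward a million tokens.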

Beyond raw perception, MiMo-V2.5 incorporates extensive post‑training for agentic behavior, including supervised fine‑tuning, large‑scale reinforcement learning, and Multi‑Teacher On‑Policy Distillation. This yields strong performance on multimodal benchmarks, coding and agent tasks, and long‑context evaluations. The model is distributed in two variants—MiMo‑V2.5‑Base (256 K context) and the full MiMo‑V2.5 (1 M context)—and can be deployed via SGLang or vLLM with detailed recipes provided in the repository.
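For deployment, the SGLang and vLLM recipes in the repository are the authoritative reference; as a rough sketch, an OpenAI-compatible vLLM endpoint could be queried as below, where the model ID, port, and parallelism settings are assumptions for illustration.

```python
# Hypothetical client-side sketch against a locally served MiMo-V2.5 endpoint.
# The model ID, port, and server flags below are illustrative assumptions;
# follow the deployment recipes in the repository for the actual setup.
#
# Example server launch (vLLM's OpenAI-compatible server):
#   vllm serve XiaomiMiMo/MiMo-V2.5 --tensor-parallel-size 8 --max-model-len 262144
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="XiaomiMiMo/MiMo-V2.5",  # assumed repo/model name
    messages=[{"role": "user", "content": "Summarize the key findings of the attached report."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)
```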

The release includes updated config and tokenizer files, and the team encourages users to pull the latest versions to avoid degraded performance. Community support is available through WeChat, Discord, Telegram, Reddit, and an official API platform, making MiMo-V2.5 a readily accessible foundation for building advanced multimodal agents and applications.
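Because the updated config and tokenizer matter for correct behavior, it is worth re-syncing the local snapshot before serving; a minimal sketch with huggingface_hub follows, assuming a hypothetical XiaomiMiMo/MiMo-V2.5 repository ID.

```python
# Minimal sketch: refresh the local snapshot so updated config and tokenizer
# files are picked up. The repo ID is a hypothetical placeholder.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="XiaomiMiMo/MiMo-V2.5",           # assumed repository ID
    allow_patterns=["*.json", "tokenizer*"],  # config and tokenizer files only
)
print("Refreshed files in:", local_dir)
```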

Project Ideas

  1. Create a multimodal assistant that can answer questions by jointly analyzing text, images, and short video clips, leveraging the model's 1M‑token context for deep reasoning (a minimal request sketch follows this list).
  2. Build an audio‑driven transcription and summarization tool that ingests long recordings, extracts key points, and references any embedded visual slides using the audio and vision encoders.
  3. Develop an interactive content‑moderation pipeline that scans livestream video with accompanying audio, detecting policy violations across visual, spoken, and textual cues.
  4. Design a long‑document summarizer that preserves embedded figures and diagrams, using the vision transformer to interpret images while the LLM condenses the surrounding text.
  5. Implement a home‑automation chatbot that interprets spoken commands, visual cues from security cameras, and contextual history to perform complex, multi‑step actions.
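As a starting point for the first idea, the sketch below sends a joint text-and-image question to an OpenAI-compatible endpoint serving MiMo-V2.5; the endpoint, model ID, and image file are illustrative assumptions.

```python
# Hypothetical sketch for project idea 1: a joint text + image question routed
# through an OpenAI-compatible endpoint serving MiMo-V2.5. The endpoint, model
# ID, and image path are illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("circuit_photo.jpg", "rb") as f:  # assumed local test image
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="XiaomiMiMo/MiMo-V2.5",  # assumed model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Which component in this photo looks damaged, and what would you check first?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=400,
)
print(resp.choices[0].message.content)
```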