Kimi K2.5: Visual Agentic Intelligence
Kimi K2.5 addresses the need for efficient and capable AI agents in research and applications, although the contribution appears incremental in that it builds on existing multimodal foundations.
The paper aims to advance general agentic intelligence by introducing Kimi K2.5, an open-source multimodal model that reports state-of-the-art results across domains including coding, vision, reasoning, and agentic tasks, with Agent Swarm reducing latency by up to 4.5 times over single-agent baselines.
We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that the two modalities enhance each other. This involves a series of techniques, including joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5 introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently. Extensive evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains, including coding, vision, reasoning, and agentic tasks. Agent Swarm also reduces latency by up to $4.5\times$ over single-agent baselines. We release the post-trained Kimi K2.5 model checkpoint to facilitate future research and real-world applications of agentic intelligence.
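To make the idea of concurrent sub-problem execution concrete, the following is a minimal illustrative sketch, not the Agent Swarm implementation described in the paper. It assumes sub-problems are already decomposed and independent, and it uses `asyncio.sleep` as a stand-in for model and tool latency. The names `run_subagent`, `orchestrate`, and `ideal_speedup` are hypothetical and do not come from the source.

```python
from __future__ import annotations

import asyncio


async def run_subagent(name: str, subtask: str, delay: float) -> str:
    """Hypothetical sub-agent: the sleep stands in for model/tool latency."""
    await asyncio.sleep(delay)
    return f"{name} finished: {subtask}"


async def orchestrate(subtasks: dict[str, str], delay: float = 0.01) -> list[str]:
    """Run all sub-agents concurrently and collect results in input order.

    Assumption: the sub-problems are independent, so none has to wait
    for another's output.
    """
    coros = [run_subagent(name, task, delay) for name, task in subtasks.items()]
    return await asyncio.gather(*coros)


def ideal_speedup(durations: list[float]) -> float:
    """Upper bound on speedup for fully independent sub-tasks.

    A single agent pays the sum of all durations; an unbounded pool of
    parallel agents pays only the longest one. Orchestration overhead
    and dependencies between sub-tasks are ignored here.
    """
    return sum(durations) / max(durations)


if __name__ == "__main__":
    plan = {
        "agent-a": "search documentation",
        "agent-b": "write unit tests",
        "agent-c": "inspect screenshots",
    }
    for line in asyncio.run(orchestrate(plan)):
        print(line)
    print(ideal_speedup([1.0, 2.0, 1.5]))
```

Under these assumptions, the achievable speedup is bounded by the longest sub-task, so a decomposition into sub-problems of similar size benefits most from parallel execution. Real workloads with dependencies or coordination overhead would see less than this bound.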