Despite recent advances in multimodal agentic systems, most approaches treat image manipulation and web search as isolated skills, rely on expensive reinforcement learning, and lack planning grounded in real tool-execution traces. These models often fall into one of two camps: (1) reasoning-centric systems (e.g., o1 or earlier Skywork-R1V variants) that perform deep but static "brain-in-a-vat" inference without real-world interaction, or (2) agentic RL frameworks that act in the world but struggle with sustained, reflective planning.
We present Skywork-R1V4, a multimodal agentic model that resolves this dichotomy. R1V4 integrates multimodal planning, active image manipulation ("thinking with images"), deep multimodal search, and—critically—interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30K high-quality, planning-execution-consistent trajectories—and validated through step-wise consistency filtering—R1V4 achieves state-of-the-art results: 66.1 on MMSearch, 38.4 on BrowseComp-VL, 88.0 on V*, and 71.4 on MME-RealWorld, outperforming Gemini 2.5 Flash. Remarkably, this was accomplished in under 12 hours on 64 A800 GPUs at low cost—without any reinforcement learning.
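The step-wise consistency filter mentioned above can be pictured as keeping only trajectories whose executed tool calls match the plan step by step. The sketch below is a minimal illustration under that assumption; the `Trajectory` schema and function names are hypothetical, not Skywork's actual data pipeline.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    planned_tools: list   # tool names the plan intended, in order
    executed_tools: list  # tool names actually executed, in order

def is_consistent(traj: Trajectory) -> bool:
    """Keep a trajectory only if execution followed the plan step by step."""
    return (len(traj.planned_tools) == len(traj.executed_tools)
            and all(p == e for p, e in zip(traj.planned_tools,
                                           traj.executed_tools)))

def filter_trajectories(trajs):
    # Planning-execution-consistent trajectories survive; the rest are dropped.
    return [t for t in trajs if is_consistent(t)]
```

Filtering at the step level, rather than only checking final-answer correctness, is what makes the resulting SFT data "planning-execution consistent."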
See how Skywork R1V4-Lite performs visual reasoning in real-world scenarios
When faced with limited perspectives or insufficient information, the model automatically performs operations such as cropping, zooming, rotating, and region localization to construct a transparent and traceable "visual action chain."
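A traceable chain of image operations can be sketched as a log of named steps that can later be replayed or audited. The operation names below mirror the ones described above (crop, zoom, rotate); the chain format itself is an assumption for illustration, not R1V4's internal representation.

```python
class VisualActionChain:
    """Records each image operation with its arguments for traceability."""

    def __init__(self):
        self.steps = []

    def _log(self, op, **kwargs):
        self.steps.append({"op": op, **kwargs})

    def crop(self, box):
        self._log("crop", box=box)      # box = (left, top, right, bottom)
        return self

    def zoom(self, factor):
        self._log("zoom", factor=factor)
        return self

    def rotate(self, degrees):
        self._log("rotate", degrees=degrees)
        return self

    def trace(self):
        # Human-readable summary of the action chain.
        return " -> ".join(s["op"] for s in self.steps)

chain = VisualActionChain().crop((10, 10, 200, 160)).zoom(2.0).rotate(90)
print(chain.trace())  # crop -> zoom -> rotate
```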
When tackling complex and challenging problems, R1V4-Lite demonstrates exceptional multi-turn reasoning and information synthesis capabilities. It proactively initiates multiple rounds of search and verification, extracts and aggregates key information from diverse sources, and ultimately arrives at reliable conclusions.
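The search-and-verify behavior described above can be sketched as a loop that keeps retrieving until the evidence agrees, then synthesizes an answer. Everything here is a stand-in stub (the `search` and `verify` functions and their agreement criterion are assumptions), shown only to illustrate the control flow.

```python
def search(query, round_idx):
    # Stub retriever: pretend each round surfaces one new snippet.
    corpus = {0: "Eiffel Tower height 330 m",
              1: "330 metres including antennas",
              2: "unrelated snippet"}
    return corpus.get(round_idx, "")

def verify(snippets, keyword):
    # Toy criterion: accept once at least two snippets agree on a keyword.
    return sum(keyword in s for s in snippets) >= 2

def multi_round_search(query, keyword, max_rounds=3):
    evidence = []
    for r in range(max_rounds):
        evidence.append(search(query, r))
        if verify(evidence, keyword):
            # Aggregation succeeded: return the verified conclusion.
            return {"answer": keyword, "rounds": r + 1, "evidence": evidence}
    return {"answer": None, "rounds": max_rounds, "evidence": evidence}

result = multi_round_search("eiffel tower height", "330")
```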
The model transitions from merely "passively viewing images" to proactively manipulating them in complex scenes. It explores, verifies, refines, and integrates information based on visual cues, achieving a true cycle of "Observing, Thinking, and Acting."
R1V4-Planner-Lite enables proactive multi-modal agentic planning. Starting from a single image, the Planner can automatically construct an executable multi-turn task chain.
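An executable multi-turn task chain can be pictured as an ordered list of tool calls emitted from the initial image and question. The tool names and plan schema below are assumptions for illustration, not R1V4-Planner-Lite's actual output format.

```python
def plan_from_image(question):
    """Emit a toy multi-turn task chain for an image-grounded question."""
    steps = [{"tool": "locate_region", "args": {"target": "subject"}}]
    if "where" in question.lower() or "identify" in question.lower():
        # Visual evidence alone may not suffice: zoom in, then search the web.
        steps.append({"tool": "crop_zoom", "args": {"factor": 2.0}})
        steps.append({"tool": "web_search",
                      "args": {"query": "<derived from cropped region>"}})
    steps.append({"tool": "synthesize_answer", "args": {}})
    return steps

plan = plan_from_image("Where was this photo taken?")
```

The key property is that every step names a concrete, executable tool, so the plan can be run and its execution trace checked against the plan itself.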
Exceptional proficiency across multiple multimodal tasks
Integrated intelligence (Image operations × Task planning × Deep research): capable of rapid image analysis, lightweight planning, and deep reasoning.
Advanced image understanding through image operations (Crop / Zoom / Rotate) integrated with deep reasoning systems.
Ultra-compact model size with ultra-fast inference: a new benchmark for efficient deployment with high performance.
Industry-leading cost advantages and inference efficiency, making advanced multimodal AI accessible to more users.
@article{skywork2025r1v,
}