Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch

Abstract

Despite recent advances in multimodal agentic systems, most approaches treat image manipulation and web search as isolated skills, rely on expensive reinforcement learning, and lack planning grounded in real tool-execution traces. These models often fall into one of two camps: (1) reasoning-centric systems (e.g., o1 or earlier Skywork-R1V variants) that perform deep but static "brain-in-a-vat" inference without real-world interaction, or (2) agentic RL frameworks that act in the world but struggle with sustained, reflective planning.

We present Skywork-R1V4, a multimodal agentic model that bridges these two camps. R1V4 integrates multimodal planning, active image manipulation ("thinking with images"), deep multimodal search, and, critically, interleaved reasoning that dynamically alternates between visual operations and external knowledge retrieval. Trained solely via supervised fine-tuning on fewer than 30K high-quality, planning-execution-consistent trajectories that were validated through step-wise consistency filtering, R1V4 achieves state-of-the-art results: 66.1 on MMSearch, 38.4 on BrowseComp-VL, 88.0 on V*, and 71.4 on MME-RealWorld, outperforming Gemini 2.5 Flash. Remarkably, training took under 12 hours on 64 A800 GPUs, at low cost and without any reinforcement learning.
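
To make the idea of step-wise consistency filtering concrete, here is a minimal sketch in Python. It is not the Skywork pipeline: the `Step` and `Trajectory` types, the field names, and the rule "keep a trajectory only if every executed tool call matches the planned one" are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    planned_tool: str   # tool the plan said to call at this step (hypothetical field)
    executed_tool: str  # tool actually called in the recorded trace (hypothetical field)


@dataclass
class Trajectory:
    task_id: str
    steps: List[Step]


def is_consistent(traj: Trajectory) -> bool:
    """Return True only for non-empty trajectories whose every step matches its plan."""
    return bool(traj.steps) and all(
        s.planned_tool == s.executed_tool for s in traj.steps
    )


def filter_trajectories(trajs: List[Trajectory]) -> List[Trajectory]:
    """Keep only planning-execution-consistent trajectories."""
    return [t for t in trajs if is_consistent(t)]


example = [
    Trajectory("a", [Step("crop", "crop"), Step("web_search", "web_search")]),
    Trajectory("b", [Step("crop", "zoom")]),  # plan and execution diverge
    Trajectory("c", []),                      # empty trace, nothing to verify
]
kept = filter_trajectories(example)
print([t.task_id for t in kept])  # prints ['a']
```

A real filter would likely also compare tool arguments and intermediate observations; matching tool names alone is the simplest check that fits the description above.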

Showcase of Examples

See how Skywork R1V4-Lite performs visual reasoning in real-world scenarios

All-in-One Reasoning

When faced with limited perspectives or insufficient information, the model automatically performs operations such as cropping, zooming, rotating, and region localization to construct a transparent and traceable "visual action chain."
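
A "visual action chain" can be pictured as an ordered list of image operations whose results are logged step by step. The sketch below uses a toy grayscale image (a list of rows) and pure-Python versions of crop, zoom, and rotate. The function names, the trace format, and the `run_chain` helper are illustrative assumptions, not the model's actual tool interface.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values


def crop(img: Image, top: int, left: int, height: int, width: int) -> Image:
    """Cut out a height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]


def zoom(img: Image, factor: int) -> Image:
    """Nearest-neighbour upscaling by an integer factor."""
    out: Image = []
    for row in img:
        wide = [p for p in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def rotate90(img: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(r) for r in zip(*img[::-1])]


def run_chain(
    img: Image, actions: List[Tuple[str, Callable[[Image], Image]]]
) -> Tuple[Image, List[str]]:
    """Apply each action in order and record a readable trace (rows x cols)."""
    trace = []
    for name, fn in actions:
        img = fn(img)
        cols = len(img[0]) if img else 0
        trace.append(f"{name} -> {len(img)}x{cols}")
    return img, trace


image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
result, trace = run_chain(image, [
    ("crop", lambda im: crop(im, 1, 1, 2, 2)),
    ("zoom", lambda im: zoom(im, 2)),
    ("rotate90", rotate90),
])
print(trace)  # prints ['crop -> 2x2', 'zoom -> 4x4', 'rotate90 -> 4x4']
```

Keeping the trace alongside the final image is what makes the chain traceable: each intermediate operation can be inspected or replayed.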

Super-class Image Understanding

When tackling complex and challenging problems, R1V4-Lite demonstrates exceptional multi-turn reasoning and information synthesis capabilities. It proactively initiates multiple rounds of search and verification, extracts and aggregates key information from diverse sources, and ultimately arrives at reliable conclusions.

Small & Fast Inference

The model transitions from merely "passively viewing images" to proactively manipulating them in complex scenes. It explores, verifies, refines, and integrates information based on visual cues, achieving a true cycle of "Observing, Thinking, and Acting."

Planning

R1V4-Planner-Lite enables proactive multi-modal agentic planning. Starting from a single image, the Planner can automatically construct an executable multi-turn task chain.
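
As a rough illustration of what an executable multi-turn task chain might look like, the sketch below represents a plan as an ordered list of tool calls and runs it against a small tool registry. The `PlanStep` type, the tool names (`crop`, `image_search`, `answer`), and the stub tools are hypothetical and are not taken from R1V4-Planner-Lite.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PlanStep:
    tool: str                                        # name of the tool to call (hypothetical)
    args: Dict[str, str] = field(default_factory=dict)  # keyword arguments for that tool


def execute_plan(
    plan: List[PlanStep], tools: Dict[str, Callable[..., str]]
) -> List[str]:
    """Run each step in order and collect one observation per step."""
    observations = []
    for step in plan:
        if step.tool not in tools:
            raise KeyError(f"unknown tool: {step.tool}")
        observations.append(tools[step.tool](**step.args))
    return observations


# Stub tools standing in for real image and search backends.
tools = {
    "crop": lambda region: f"cropped {region}",
    "image_search": lambda query: f"results for {query}",
    "answer": lambda text: text,
}

plan = [
    PlanStep("crop", {"region": "storefront sign"}),
    PlanStep("image_search", {"query": "storefront sign"}),
    PlanStep("answer", {"text": "done"}),
]
observations = execute_plan(plan, tools)
print(observations)
```

Representing the plan as data rather than free text is one way to make it executable: each step can be validated against the registry before anything runs.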

Benchmark Results

Skywork-R1V4 Performance

Exceptional proficiency across multiple multimodal tasks

State of the Art Results

MMSearch: 66.1
BrowseComp-VL: 38.4
V*: 88.0
MME-RealWorld: 71.4

Key Capabilities

1. All-in-One Intelligence

Integrated intelligence (image operations × task planning × deep research), capable of rapid image analysis, lightweight planning, and deep reasoning.

2. Super-class Image Understanding

Advanced image understanding through image operations (crop, zoom, rotate) integrated with deep reasoning.

3. Small & Fast

Compact model size with fast inference, aimed at efficient deployment without sacrificing performance.

4. Cost-Effective

Low cost and high inference efficiency, making advanced multimodal AI accessible to more users.

Citation

BibTeX Citation

@article{skywork2025r1v,

}

Access Code Repository