InternVL-U: Understanding + Generation + Editing in One 4B Model -- A New Standard for Unified Multimodal AI

Shanghai AI Lab's InternVL-U: a single 4B-parameter model that handles image understanding, generation, editing, and reasoning-based generation. Its decoupled visual representations outperform the 14B BAGEL on GenEval and DPG-Bench.

There's been a long-standing goal in multimodal AI: a single model that can understand, generate, and edit images. Previously, each task required its own model -- InternVL for understanding, Stable Diffusion for generation, InstructPix2Pix for editing. Pipelines grew complex, and knowledge sharing between the models was impossible.

InternVL-U, released by Shanghai AI Lab in March 2026, tackles this problem head-on. With just 4B parameters in a single model, it handles multimodal understanding, text-to-image generation, image editing, and reasoning-based generation. It outperforms the 14B-parameter BAGEL on GenEval (0.85 vs 0.82) and DPG-Bench (85.18 vs 85.07).

The secret lies in an architectural design called Decoupled Visual Representation.
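
To make the idea concrete, here is a minimal PyTorch sketch of what decoupled visual representations generally look like in this line of work: a semantic branch for understanding and a separate generative branch for pixel-level synthesis, each projected into a shared LLM token space. All names and dimensions here (DecoupledVisualModel, semantic_proj, generative_proj) are hypothetical illustrations of the general pattern, not InternVL-U's actual architecture or API.

```python
import torch
import torch.nn as nn


class DecoupledVisualModel(nn.Module):
    """Toy sketch: two decoupled visual branches feeding one backbone."""

    def __init__(self, llm_dim: int = 2048, sem_dim: int = 1024, gen_dim: int = 16):
        super().__init__()
        # Understanding branch: projects high-level semantic features
        # (stand-in for ViT outputs) into the LLM's token space.
        self.semantic_proj = nn.Linear(sem_dim, llm_dim)
        # Generation branch: projects low-level latents (stand-in for
        # VAE-style latents) used when the model must produce pixels.
        self.generative_proj = nn.Linear(gen_dim, llm_dim)
        # Shared backbone (stand-in for the 4B language model) attends
        # over all token types jointly.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, sem_tokens, gen_tokens, text_tokens):
        # Each modality is mapped into the same token space, then the
        # backbone processes the concatenated sequence.
        seq = torch.cat(
            [
                self.semantic_proj(sem_tokens),
                self.generative_proj(gen_tokens),
                text_tokens,
            ],
            dim=1,
        )
        return self.backbone(seq)


# Usage with dummy tensors: one sample, a few tokens per branch.
model = DecoupledVisualModel()
out = model(
    sem_tokens=torch.randn(1, 64, 1024),   # semantic visual features
    gen_tokens=torch.randn(1, 32, 16),     # generative latents
    text_tokens=torch.randn(1, 20, 2048),  # text embeddings
)
print(out.shape)  # torch.Size([1, 116, 2048])
```

The intuition behind decoupling: understanding benefits from high-level semantic features, while generation needs low-level detail, so giving each its own representation lets both specialize without compromising the other.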

The Unified Multimodal Dilemma: Limits of a Single Representation
