CL, AI, CV | Aug 13, 2025

VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models

arXiv:2508.09945v1 | 5 citations | h-index: 41 | Has Code
Originality: Highly original
AI Analysis

This paper addresses multimodal code generation for developers and AI researchers. It proposes a novel method for a known bottleneck rather than an incremental improvement.

The paper tackles the limited ability of multimodal large language models to generate code from multimodal inputs. It introduces VisCodex, a unified framework that merges vision and coding models; VisCodex achieves state-of-the-art performance among open-source models and approaches proprietary models such as GPT-4o.

Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. In this work, we introduce VisCodex, a unified framework that seamlessly merges vision and coding language models to empower MLLMs with strong multimodal code generation abilities. Leveraging a task vector-based model merging technique, we integrate a state-of-the-art coding LLM into a strong vision-language backbone, while preserving both visual comprehension and advanced coding skills. To support training and evaluation, we introduce the Multimodal Coding Dataset (MCD), a large-scale and diverse collection of 598k samples, including high-quality HTML code, chart image-code pairs, image-augmented StackOverflow QA, and algorithmic problems. Furthermore, we propose InfiBench-V, a novel and challenging benchmark specifically designed to assess models on visually-rich, real-world programming questions that demand a nuanced understanding of both textual and visual contexts. Extensive experiments show that VisCodex achieves state-of-the-art performance among open-source MLLMs and approaches proprietary models like GPT-4o, highlighting the effectiveness of our model merging strategy and new datasets.
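
The abstract names a task vector-based merging technique but does not spell out the formula. The sketch below illustrates generic task-vector merging under stated assumptions: both models are fine-tuned from the same base, a task vector is the fine-tuned weights minus the base weights, and the two task vectors are combined with a single weight `lam`. The function names, the `lam` parameter, and the toy weights are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Per-parameter difference between a fine-tuned model and its shared base."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge(base: Weights, vision: Weights, coder: Weights, lam: float = 0.5) -> Weights:
    """Add a weighted mix of the two task vectors back onto the base weights.

    Assumption: merged = base + lam * tau_vision + (1 - lam) * tau_coder.
    """
    tau_v = task_vector(vision, base)
    tau_c = task_vector(coder, base)
    return {
        name: [
            b + lam * v + (1.0 - lam) * c
            for b, v, c in zip(base[name], tau_v[name], tau_c[name])
        ]
        for name in base
    }


if __name__ == "__main__":
    base = {"w": [1.0, 2.0]}
    vision = {"w": [2.0, 2.0]}   # hypothetical vision-language fine-tune
    coder = {"w": [1.0, 4.0]}    # hypothetical coding fine-tune
    print(merge(base, vision, coder, lam=0.5))
```

In a real setting the same arithmetic would be applied tensor by tensor to the language-model parameters that the two checkpoints share, with any vision-specific components left untouched; that split is also an assumption here.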
