CV, Mar 15

Unlocking the Latent Canvas: Eliciting and Benchmarking Symbolic Visual Expression in LLMs

arXiv:2603.14505 | 69.5 | h-index: 2
AI Analysis

This work targets researchers and developers in multimodal AI who want LLMs to express visual content natively. It is incremental: it builds on existing LLM capabilities and applies them to one specific format, ASCII art.

The authors set out to unlock latent visual representation capabilities in Large Language Models (LLMs), using ASCII art as a text-native visual format. They introduce the SVE-ASCII framework and the ASCIIArt-7K dataset, and show that generative training enhances visual comprehension, which supports a mutually reinforcing cycle in symbolic visual processing.

Current multimodal approaches predominantly treat visual generation as an external process, relying on pixel rendering or code execution, thereby overlooking the native visual representation capabilities latent within Large Language Models (LLMs). In this work, we unlock this potential through ASCII art, a compact, efficient, and text-native visual format. We introduce SVE-ASCII, a unified framework designed to elicit and benchmark Symbolic Visual Expression directly within the pure text space. To address the scarcity of systematic resources, we construct ASCIIArt-7K, a high-quality dataset synthesized via a novel "Seed-and-Evolve" pipeline that augments human-curated anchors through in-context stylistic editing. We further implement a unified instruction-tuning strategy that jointly optimizes for both Generation (Text-to-ASCII) and Understanding (ASCII-to-Text). Crucially, our experiments reveal a critical phenomenon regarding task duality: while it is established that perception aids generation, we provide compelling evidence that generative training significantly enhances visual comprehension. This confirms a mutually reinforcing cycle in symbolic visual processing, a relationship previously hypothesized but rarely empirically demonstrated in the visual domain. We release our dataset, the ASCIIArt-Bench benchmark, and the SVE-ASCII model, establishing a robust baseline for native text-based visual intelligence.

