PixelWorld: How Far Are We from Perceiving Everything as Pixels?
This work addresses cross-modal misalignment, a problem relevant to researchers and developers in multimodal AI. Its contribution is incremental, since it builds on existing vision-language models.
The paper tackles the need for a unified perception paradigm for agentic language models interacting with real-world environments by exploring Perceive Everything as Pixels (PEAP) and introducing the PixelWorld benchmark, which renders diverse inputs into a shared pixel space. Experiments show PEAP achieves comparable performance to token-based methods on semantic tasks but degrades on reasoning tasks like mathematics and code, with Chain-of-Thought prompting helping to mitigate this gap.
Recent agentic language models increasingly need to interact with real-world environments that contain tightly intertwined visual and textual information, often through raw camera pixels rather than separately processed images and tokenized text. This shift highlights the need for a unified perception paradigm. To investigate this idea, we explore Perceive Everything as Pixels (PEAP) and introduce PixelWorld, a benchmark that renders natural-language, tabular, mathematical, and diagrammatic inputs into a shared pixel space. Experiments across multiple benchmarks show that PEAP achieves comparable performance to token-based approaches on semantic understanding tasks, suggesting that vision transformers can partially capture global textual semantics without explicit tokenization. In contrast, reasoning-intensive tasks such as mathematics and code show notable performance degradation, although Chain-of-Thought prompting helps mitigate this gap by compensating for missing symbolic structure. We further find that when visual and textual information are closely integrated, representing everything as pixels simplifies preprocessing and avoids cross-modal misalignment. PixelWorld thus provides a systematic and practical framework for evaluating unified vision-language models and facilitates further exploration of pixel-based multimodal learning.
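To make the idea of rendering an input into pixel space concrete, here is a minimal sketch. It is not the PixelWorld rendering pipeline, whose details are not given here. It assumes a hypothetical 3x5 bitmap font covering only a handful of glyphs, and the names `FONT`, `render_text`, and `to_ascii` are invented for illustration. The point is only that a symbolic string such as an equation becomes a grid of pixel values that a vision model would have to read without tokenization.

```python
# Toy illustration of "text as pixels". The font, function names, and layout
# are assumptions made for this sketch, not part of PixelWorld.

# Hypothetical 3x5 bitmap font; covers only the glyphs used in the example.
FONT = {
    "1": ["010", "110", "010", "010", "111"],
    "2": ["111", "001", "111", "100", "111"],
    "3": ["111", "001", "111", "001", "111"],
    "+": ["000", "010", "111", "010", "000"],
    "=": ["000", "111", "000", "111", "000"],
    " ": ["000", "000", "000", "000", "000"],
}
GLYPH_W, GLYPH_H = 3, 5


def render_text(text, pad=1):
    """Render `text` into a 2D list of 0/1 pixels (1 = ink, 0 = background)."""
    n = len(text)
    # Each glyph is GLYPH_W wide, with one blank column between glyphs.
    width = 2 * pad + max(0, n * (GLYPH_W + 1) - 1)
    height = 2 * pad + GLYPH_H
    grid = [[0] * width for _ in range(height)]
    for i, ch in enumerate(text):
        glyph = FONT[ch]  # raises KeyError for characters outside the toy font
        x0 = pad + i * (GLYPH_W + 1)
        for dy, row in enumerate(glyph):
            for dx, bit in enumerate(row):
                if bit == "1":
                    grid[pad + dy][x0 + dx] = 1
    return grid


def to_ascii(grid):
    """Show the pixel grid as text, using '#' for ink and '.' for background."""
    return "\n".join("".join("#" if p else "." for p in row) for row in grid)


pixels = render_text("1+2=3")
print(to_ascii(pixels))
```

A real setup would rasterize with a proper font and feed the resulting image to a vision encoder. The sketch only shows that the operators and digits lose their discrete symbolic identity once rendered, which is one plausible reading of why reasoning-heavy tasks degrade under PEAP.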