CV | Nov 24, 2024

PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs

arXiv:2411.15867v3 | 0.125 citations | h-index: 2 | Has Code
AI Analysis | 85

This work addresses panoramic image generation, with applications such as layout control and multi-scale synthesis. It moves away from diffusion-based methods toward an autoregressive paradigm, though the contribution is incremental in that it adapts existing autoregressive models to this domain.

The paper tackles the problem of generating coherent panoramic images of arbitrary length by proposing PanoLlama, an autoregressive framework that reports state-of-the-art results in coherence (47.50%), fidelity (28.16%), and aesthetics (15%). It introduces a training-free token redirection strategy that enables next-crop prediction, overcoming the fixed-size limitation of existing visual autoregressive models.

Panoramic Image Generation (PIG) aims to create coherent images of arbitrary lengths. Most existing methods fall within the joint diffusion paradigm, but their complex and heuristic crop connection designs often limit their ability to achieve multilevel coherence. By deconstructing this challenge into its core components, we find it naturally aligns with next-token prediction, leading us to adopt an autoregressive (AR) paradigm for PIG modeling. However, existing visual AR (VAR) models are limited to fixed-size generation, lacking the capability to produce panoramic images. In this paper, we propose PanoLlama, a novel framework that achieves endless and coherent panorama generation with the autoregressive paradigm. Our approach develops a training-free strategy that utilizes token redirection to overcome the size limitations of existing VAR models, enabling next-crop prediction in both horizontal and vertical directions. This refreshes the PIG pipeline while achieving SOTA performance in coherence (47.50%), fidelity (28.16%), and aesthetics (15%). Additionally, PanoLlama supports applications other PIG methods cannot achieve, including mask-free layout control and multi-scale and multi-guidance synthesis. To facilitate standardized evaluation, we also establish a dataset with 1,000 prompts spanning 100+ themes, providing a new testing benchmark for PIG research. The code is available at https://github.com/0606zt/PanoLlama.
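To make the idea of next-crop prediction concrete, here is a toy sketch of how a fixed-size autoregressive generator might be reused to extend a token grid horizontally. It is not the PanoLlama implementation, and it does not reproduce the paper's token redirection; it uses a plain sliding window in which the last few columns of the existing panorama are copied in as the prefix of the next crop. The names `toy_next_token`, `generate_crop`, `generate_panorama`, and the `overlap` parameter are all hypothetical, and `toy_next_token` is a deterministic stand-in for a real model's sampler.

```python
from typing import Callable, List

Token = int
Grid = List[List[Token]]  # rows of token ids


def toy_next_token(context: List[Token]) -> Token:
    """Hypothetical stand-in for a fixed-size visual AR model's sampler."""
    return (sum(context) + len(context)) % 16


def generate_crop(
    prefix_cols: Grid,
    height: int,
    width: int,
    next_token: Callable[[List[Token]], Token],
) -> Grid:
    """Fill one height x width crop in raster order.

    Columns covered by prefix_cols are copied rather than predicted, so the
    model always sees them as context when predicting the remaining tokens.
    """
    overlap = len(prefix_cols[0]) if prefix_cols else 0
    crop: Grid = []
    context: List[Token] = []
    for r in range(height):
        row: List[Token] = []
        for c in range(width):
            tok = prefix_cols[r][c] if c < overlap else next_token(context)
            row.append(tok)
            context.append(tok)
        crop.append(row)
    return crop


def generate_panorama(
    num_crops: int,
    height: int,
    width: int,
    overlap: int,
    next_token: Callable[[List[Token]], Token] = toy_next_token,
) -> Grid:
    """Extend a token grid crop by crop (assumed sliding-window scheme)."""
    if num_crops < 1:
        raise ValueError("num_crops must be at least 1")
    if not 0 <= overlap < width:
        raise ValueError("overlap must satisfy 0 <= overlap < width")
    panorama = generate_crop([], height, width, next_token)
    for _ in range(num_crops - 1):
        # The trailing columns of the panorama become the next crop's prefix.
        prefix = [row[-overlap:] for row in panorama] if overlap else []
        crop = generate_crop(prefix, height, width, next_token)
        for r in range(height):
            panorama[r].extend(crop[r][overlap:])
    return panorama
```

Under these assumptions, each additional crop contributes `width - overlap` new columns, so `generate_panorama(3, 4, 8, 2)` yields a grid 4 rows high and 20 tokens wide. In a real system the token grid would then be decoded to pixels by the model's image tokenizer, a step this sketch omits.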
