PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset
For researchers in text-to-image generation, this work addresses the lack of high-quality UHR datasets and training methods, enabling native generation at 100MP resolution.
The paper introduces PixVerve-95K, a high-quality dataset of 95K ultra-high-resolution images (at least 100MP each) with seven-dimensional annotations, extends T2I models to native 100MP generation using three training schemes, and establishes PixVerve-Bench, a comprehensive evaluation benchmark. Results show improved visual quality and semantic alignment for UHR generation.
Text-to-Image (T2I) models have recently made notable progress at around 1K and 2K resolution. With rising expectations for visual experience and the rapid development of imaging technology, the demand for Ultra-High-Resolution (UHR) image generation has grown significantly. However, UHR image generation remains highly challenging because high-resolution content is both scarce and complex. In this paper, we first introduce PixVerve-95K, a high-quality, open-source UHR T2I dataset curated with a carefully designed data pipeline. It contains 95K images across diverse scenarios, each with a minimum pixel count of 100M, together with seven-dimensional annotations. Building on this large-scale image-text dataset, we take a pioneering step toward extending various T2I foundation models to native 100MP generation with three training schemes. Finally, leveraging both conventional metrics and multimodal large language model-based assessments, our proposed PixVerve-Bench benchmark establishes a comprehensive evaluation protocol for UHR images that covers visual quality and semantic alignment. Extensive experimental results on our benchmark, together with our exploration of training strategies, provide valuable insights for future breakthroughs.
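To give a sense of scale for the 100M-pixel minimum stated above, the following is a minimal sketch of a pixel-count check and a comparison against a 1K image. It is not the dataset's actual filtering code: the function names, the reading of "100M" as 10^8 pixels, and the choice of 1024 x 1024 as the "1K" reference are all assumptions made for illustration.

```python
# Illustrative sketch only; names and thresholds are assumptions,
# not taken from the PixVerve data pipeline.

MIN_PIXELS = 100_000_000  # assumed reading of "100M" as 10^8 pixels


def pixel_count(width: int, height: int) -> int:
    """Return the total number of pixels for the given dimensions."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return width * height


def meets_uhr_threshold(width: int, height: int,
                        min_pixels: int = MIN_PIXELS) -> bool:
    """Check whether an image reaches the assumed minimum pixel count."""
    return pixel_count(width, height) >= min_pixels


def scale_vs_square(width: int, height: int, ref_side: int = 1024) -> float:
    """Ratio of this image's pixel count to a ref_side x ref_side image."""
    return pixel_count(width, height) / (ref_side * ref_side)


if __name__ == "__main__":
    for w, h in [(10000, 10000), (12288, 8192), (8192, 8192)]:
        print(w, h, meets_uhr_threshold(w, h), round(scale_vs_square(w, h), 1))
```

Under these assumptions, a 10000 x 10000 image holds roughly 95 times as many pixels as a 1024 x 1024 image, while an 8192 x 8192 image falls short of the threshold.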