CVMar 27

TaxaAdapter: Vision Taxonomy Models are Key to Fine-grained Image Generation over the Tree of Life

arXiv:2603.2612888.01 citationsh-index: 31
AI Analysis

This addresses the challenge of scalable and accurate species image generation for biodiversity research and applications, representing an incremental improvement by adapting existing models with taxonomy guidance.

The paper tackles the problem of generating fine-grained images across the Tree of Life, where existing models often fail to capture subtle species-specific traits, by proposing TaxaAdapter, which incorporates Vision Taxonomy Models to guide generation, resulting in improved species-level fidelity and accuracy with strong generalization to few-shot and unseen species.

Accurately generating images across the Tree of Life is difficult: there are over 10M distinct species on Earth, many of which differ only by subtle visual traits. Despite the remarkable progress in text-to-image synthesis, existing models often fail to capture the fine-grained visual cues that define species identity, even when their outputs appear photo-realistic. To this end, we propose TaxaAdapter, a simple and lightweight approach that incorporates Vision Taxonomy Models (VTMs) such as BioCLIP to guide fine-grained species generation. Our method injects VTM embeddings into a frozen text-to-image diffusion model, improving species-level fidelity while preserving flexible text control over attributes such as pose, style, and background. Extensive experiments demonstrate that TaxaAdapter consistently improves morphology fidelity and species-identity accuracy over strong baselines, with a cleaner architecture and training recipe. To better evaluate these improvements, we also introduce a multimodal Large Language Model-based metric that summarizes trait-level descriptions from generated and real images, providing a more interpretable measure of morphological consistency. Beyond this, we observe that TaxaAdapter exhibits strong generalization capabilities, enabling species synthesis in challenging regimes such as few-shot species with only a handful of training images and even species unseen during training. Overall, our results highlight that VTMs are a key ingredient for scalable, fine-grained species generation.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes