CV · AI · Dec 21, 2025

PTTA: A Pure Text-to-Animation Framework for High-Quality Creation

arXiv:2512.18614v1 · h-index: 12
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of automating animation creation for content creators, but it is incremental as it builds on existing text-to-video models.

The authors tackled the problem of generating high-quality animations from text, which traditionally requires complex pipelines and manual labor, by presenting PTTA, a pure text-to-animation framework that outperforms comparable baselines in visual evaluations.

Traditional animation production involves complex pipelines and significant manual labor costs. While recent video generation models such as Sora, Kling, and CogVideoX achieve impressive results on natural video synthesis, they exhibit notable limitations when applied to animation generation. Recent efforts, such as AniSora, demonstrate promising performance by fine-tuning image-to-video models for animation styles, yet analogous exploration in the text-to-video setting remains limited. In this work, we present PTTA, a pure text-to-animation framework for high-quality animation creation. We first construct a small-scale but high-quality paired dataset of animation videos and textual descriptions. Building upon the pretrained text-to-video model HunyuanVideo, we perform fine-tuning to adapt it to animation-style generation. Extensive visual evaluations across multiple dimensions show that the proposed approach consistently outperforms comparable baselines in animation video synthesis.

