TimeColor: Flexible Reference Colorization via Temporal Concatenation
This work addresses the need for more adaptable and consistent colorization in animation and video production. The contribution is incremental: it builds on existing diffusion-based methods and adds novel mechanisms on top of them.
The paper tackles sketch-based video colorization by enabling flexible use of multiple heterogeneous references, such as character sheets or arbitrary colorized frames, and reports improvements in color fidelity, identity consistency, and temporal stability over prior baselines.
Most colorization models condition only on a single reference, typically the first frame of the scene. However, this approach ignores other sources of conditional data, such as character sheets, background images, or arbitrary colorized frames. We propose TimeColor, a sketch-based video colorization model that supports heterogeneous, variable-count references through explicit per-reference region assignment. TimeColor encodes references as additional latent frames that are concatenated temporally, permitting them to be processed concurrently in each diffusion step while keeping the model's parameter count fixed. TimeColor also uses spatiotemporal correspondence-masked attention to enforce subject-reference binding in addition to modality-disjoint RoPE indexing. These mechanisms mitigate shortcutting and cross-identity palette leakage. Experiments on SAKUGA-42M under both single- and multi-reference protocols show that TimeColor improves color fidelity, identity consistency, and temporal stability over prior baselines.
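The three mechanisms named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify tensor layouts, masking rules, or index ranges, so the function names (`concat_references`, `disjoint_temporal_indices`, `correspondence_mask`), the `ref_offset` value, and the choice to block reference-to-video attention are all assumptions made for illustration.

```python
import numpy as np


def concat_references(video_latents, ref_latents):
    """Append reference latents as extra frames along the time axis.

    video_latents: array of shape (T, H, W, C).
    ref_latents:   list of R arrays, each of shape (H, W, C).
    Returns an array of shape (T + R, H, W, C). No new parameters are
    needed; only the sequence length grows with the reference count.
    """
    refs = np.stack(ref_latents, axis=0)
    return np.concatenate([video_latents, refs], axis=0)


def disjoint_temporal_indices(num_video_frames, num_refs, ref_offset=1000):
    """Give video and reference frames non-overlapping position indices.

    ref_offset is a hypothetical constant; it only has to place the
    reference indices outside the range used by the video frames.
    """
    if ref_offset < num_video_frames:
        raise ValueError("ref_offset must not overlap the video index range")
    video_idx = np.arange(num_video_frames)
    ref_idx = ref_offset + np.arange(num_refs)
    return np.concatenate([video_idx, ref_idx])


def correspondence_mask(video_region_ids, ref_token_ids):
    """Build a boolean attention mask (True = attention allowed).

    video_region_ids: per video token, the id of its assigned reference
                      (-1 means no reference is assigned).
    ref_token_ids:    per reference token, the id of the reference it
                      belongs to.
    Assumed rules: video tokens attend to all video tokens and only to
    tokens of their assigned reference; reference tokens attend only
    within their own reference.
    """
    v = np.asarray(video_region_ids).reshape(-1)
    r = np.asarray(ref_token_ids).reshape(-1)
    nv, nr = v.size, r.size
    mask = np.zeros((nv + nr, nv + nr), dtype=bool)
    mask[:nv, :nv] = True                       # video -> video
    mask[:nv, nv:] = v[:, None] == r[None, :]   # video -> assigned reference
    mask[nv:, nv:] = r[:, None] == r[None, :]   # reference -> same reference
    return mask
```

For example, with four video frames and two references, `disjoint_temporal_indices(4, 2)` yields `[0, 1, 2, 3, 1000, 1001]`, and a video token whose region id is `-1` receives no reference columns in the mask, so it cannot pick up the palette of an unrelated identity. In a real model the mask would be applied to the attention logits before the softmax, and the indices would feed the temporal component of RoPE.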