CV · Oct 1, 2025

JEPA-T: Joint-Embedding Predictive Architecture with Text Fusion for Image Generation

Siheng Wan, Zhengtao Yao, Zhengdao Li, Junhao Dong, Yanshu Li, Yikai Li, Linshan Li, Haoyan Xu, Yijiang Li, Zhikang Dong, Huacan Wang, Jifeng Shen

arXiv:2510.00974v1 · h-index: 4 · Has Code

Originality: Incremental advance

AI Analysis

The paper aims to improve text-to-image generation by offering a more effective method for fusing text with visual tokens. The advance appears incremental because it builds on existing token-based architectures.

The paper tackles the challenge of effectively fusing text with visual tokens in text-to-image generation by proposing JEPA-T, a unified multimodal framework that encodes images and captions into discrete tokens and uses a joint-embedding predictive Transformer with cross-attention and text embedding injection. On ImageNet-1K, JEPA-T achieves strong data efficiency and open-vocabulary generalization, and it consistently outperforms non-fusion and late-fusion baselines.

Modern Text-to-Image (T2I) generation increasingly relies on token-centric architectures that are trained with self-supervision, yet effectively fusing text with visual tokens remains a challenge. We propose JEPA-T, a unified multimodal framework that encodes images and captions into discrete visual and textual tokens, processed by a joint-embedding predictive Transformer. To enhance fusion, we incorporate cross-attention after the feature predictor for conditional denoising while maintaining a task-agnostic backbone. Additionally, raw text embeddings are injected prior to the flow matching loss to improve alignment during training. During inference, the same network performs both class-conditional and free-text image generation by iteratively denoising visual tokens conditioned on text. Evaluations on ImageNet-1K demonstrate that JEPA-T achieves strong data efficiency and open-vocabulary generalization, and consistently outperforms non-fusion and late-fusion baselines. Our approach shows that late architectural fusion combined with objective-level alignment offers an effective balance between conditioning strength and backbone generality in token-based T2I. The code is available at https://github.com/justin-herry/JEPA-T.git
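
The abstract names three mechanisms: cross-attention from visual tokens to text, a flow matching loss, and iterative denoising at inference. The sketch below illustrates each in plain Python. It is not taken from the JEPA-T repository. The function names are hypothetical, the attention has no learned projections, and the straight-line (rectified-flow style) interpolation path is an assumption, since the abstract does not state which flow matching formulation is used.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(visual_tokens, text_tokens):
    """Single-head cross-attention: each visual token (query) attends over
    the text tokens (keys and values). Hypothetical helper, no learned weights."""
    dim = len(text_tokens[0])
    fused = []
    for query in visual_tokens:
        # Scaled dot-product score between this visual token and every text token.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in text_tokens]
        weights = softmax(scores)
        context = [sum(w * tok[d] for w, tok in zip(weights, text_tokens))
                   for d in range(dim)]
        # Residual connection: keep the backbone feature, add the text context.
        fused.append([q + c for q, c in zip(query, context)])
    return fused

def flow_matching_loss(velocity_fn, noise, target, t):
    """MSE between the predicted velocity and the straight-line velocity
    (target - noise), evaluated at x_t = (1 - t) * noise + t * target.
    The linear path is an assumption, not confirmed by the abstract."""
    x_t = [(1 - t) * n + t * x for n, x in zip(noise, target)]
    predicted = velocity_fn(x_t, t)
    errors = [(p - (x - n)) ** 2 for p, n, x in zip(predicted, noise, target)]
    return sum(errors) / len(errors)

def euler_denoise(velocity_fn, noise, steps):
    """Iterative denoising: integrate dx/dt = v(x, t) from t=0 to t=1
    with fixed-size Euler steps."""
    x = list(noise)
    for i in range(steps):
        t = i / steps
        v = velocity_fn(x, t)
        x = [xi + vi / steps for xi, vi in zip(x, v)]
    return x

# Toy check with an oracle velocity field that points straight at a known target.
target = [1.0, -2.0, 0.5]
noise = [0.3, 0.1, -0.7]
oracle = lambda x, t: [(g - xi) / (1 - t) for g, xi in zip(target, x)]

loss = flow_matching_loss(oracle, noise, target, 0.25)
sample = euler_denoise(oracle, noise, steps=8)
fused = cross_attention([[1.0, 0.0]], [[2.0, 0.0]])
```

In a trained model the oracle would be replaced by a network that predicts the velocity from the noisy visual tokens and the text-conditioned features, so the cross-attention output would feed the velocity predictor rather than stand alone as it does here.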
