CL · AI · May 27, 2023

Exploring Better Text Image Translation with Multimodal Codebook

Zhibin Lan, Jiawei Yu, Xiang Li, Wen Zhang, Jian Luan, Bin Wang, Degen Huang, Jinsong Su

arXiv:2305.17415v2

Originality: Incremental advance

AI Analysis

This work addresses two obstacles in text image translation: the lack of public datasets and the error propagation that affects cascaded models. The task matters for applications such as document translation, but the contribution is incremental, since it builds on existing multimodal and translation techniques.

The authors create a new Chinese-English dataset (OCRMT30K) and propose a multimodal codebook model trained with a multi-stage framework. The experiments and analyses reported in the paper support the effectiveness of both the model and the training framework.

Text image translation (TIT) aims to translate the source texts embedded in the image to target translations, which has a wide range of applications and thus has important research value. However, current studies on TIT are confronted with two main bottlenecks: 1) this task lacks a publicly available TIT dataset, 2) dominant models are constructed in a cascaded manner, which tends to suffer from the error propagation of optical character recognition (OCR). In this work, we first annotate a Chinese-English TIT dataset named OCRMT30K, providing convenience for subsequent studies. Then, we propose a TIT model with a multimodal codebook, which is able to associate the image with relevant texts, providing useful supplementary information for translation. Moreover, we present a multi-stage training framework involving text machine translation, image-text alignment, and TIT tasks, which fully exploits additional bilingual texts, OCR dataset and our OCRMT30K dataset to train our model. Extensive experiments and in-depth analyses strongly demonstrate the effectiveness of our proposed model and training framework.
