Classifying the Unknown: In-Context Learning for Open-Vocabulary Text and Symbol Recognition
The work addresses the need for flexible text and symbol recognition across new languages and scripts, enabling applications such as recognizing new alphabets. The contribution is incremental, however, because it builds on in-context learning and tokenizer enhancements.
The paper tackles the problem of classifying novel script patterns in documents without retraining. It introduces Rosetta, a multimodal model that combines Multimodal In-Context Learning with a Context-Aware Tokenizer for open-vocabulary classification. In experiments, Rosetta successfully classifies out-of-distribution visual patterns and diverse alphabets such as Chinese, Greek, and Japanese.
We introduce Rosetta, a multimodal model that uses Multimodal In-Context Learning (MICL) to classify sequences of novel script patterns in documents from a minimal number of examples, eliminating the need for explicit retraining. To enhance contextual learning, we design a dataset generation process that produces contexts with varying degrees of informativeness, which improves the model's ability to exploit context across different scenarios. A key strength of our method is the Context-Aware Tokenizer (CAT), which enables open-vocabulary classification. CAT allows the model to classify text and symbol patterns across an unlimited range of classes, extending its classification capabilities beyond its training alphabet of patterns. As a result, it unlocks applications such as the recognition of new alphabets and languages. Experiments on synthetic datasets demonstrate the potential of Rosetta to classify Out-Of-Distribution visual patterns and diverse sets of alphabets and scripts, including Chinese, Greek, Russian, French, Spanish, and Japanese.
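To make the idea of open-vocabulary classification from in-context examples concrete, the following is a minimal sketch, not Rosetta's architecture or its Context-Aware Tokenizer. It assumes each pattern has already been reduced to a feature vector, and it substitutes a plain nearest-neighbor rule for the learned model. The function names (`build_context_vocab`, `classify_in_context`) and the toy vectors are hypothetical. The point it illustrates is that the label set is built from the context at inference time, so classes never seen in training can still be predicted.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_context_vocab(support):
    """Assign each label in the context a context-local class id.

    Hypothetical stand-in for a context-dependent label vocabulary:
    ids are defined by the context, not by a fixed training alphabet.
    """
    vocab = {}
    for _, label in support:
        vocab.setdefault(label, len(vocab))
    return vocab


def classify_in_context(support, queries):
    """Label each query with the label of its most similar context example.

    Assumption: nearest-neighbor matching replaces the learned model.
    """
    vocab = build_context_vocab(support)
    id_to_label = {idx: label for label, idx in vocab.items()}
    predictions = []
    for query in queries:
        _, best_label = max(support, key=lambda ex: cosine(ex[0], query))
        predictions.append(id_to_label[vocab[best_label]])
    return predictions


# Toy context: three labelled examples of symbols unseen in "training".
support = [
    ([1.0, 0.0, 0.1], "alpha"),
    ([0.0, 1.0, 0.1], "beta"),
    ([0.1, 0.1, 1.0], "gamma"),
]
queries = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]
predictions = classify_in_context(support, queries)
print(predictions)  # ['alpha', 'gamma']
```

Swapping the `support` list for examples from a different alphabet changes the label set without touching the classifier, which is the property the abstract attributes to open-vocabulary classification.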