RuCLIP -- new models and experiments: a technical report
This work provides improved models for Russian-language vision-language tasks, but it is incremental as it builds on existing CLIP architectures.
The authors introduce six new implementations of the ruCLIP model trained on 240M pairs; the best of them outperform the CLIP + OPUS-MT translation baseline on most of 16 datasets in few-shot and zero-shot tasks.
In this report we propose six new implementations of the ruCLIP model trained on our 240M pairs. Accuracy is compared with the original CLIP model combined with Ru-En translation (OPUS-MT) on 16 datasets from different domains. Our best implementations outperform the CLIP + OPUS-MT solution on most of the datasets in few-shot and zero-shot tasks. We briefly describe the implementations and concentrate on the experiments conducted. A comparison of inference execution time is also presented.
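For readers unfamiliar with the zero-shot setting mentioned above, the following is a minimal sketch of CLIP-style zero-shot classification: an image is assigned to the class whose text-prompt embedding has the highest cosine similarity with the image embedding. It is not the report's evaluation code. The function names (`l2_normalize`, `zero_shot_predict`, `accuracy`) and the toy embeddings are assumptions made for illustration; in practice the embeddings would come from the image and text encoders, and in the CLIP + OPUS-MT baseline the Russian class prompts would first be translated into English before text encoding.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_predict(image_emb, class_text_embs):
    """Return (best_class_index, cosine_scores) for one image embedding.

    class_text_embs holds one text embedding per class prompt.
    """
    img = l2_normalize(image_emb)
    scores = []
    for text_emb in class_text_embs:
        txt = l2_normalize(text_emb)
        # Dot product of unit vectors equals cosine similarity.
        scores.append(sum(a * b for a, b in zip(img, txt)))
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


def accuracy(image_embs, labels, class_text_embs):
    """Fraction of images whose predicted class matches the label."""
    correct = sum(
        zero_shot_predict(emb, class_text_embs)[0] == label
        for emb, label in zip(image_embs, labels)
    )
    return correct / len(labels)


# Toy stand-ins for encoder outputs (hypothetical values, not real data).
class_text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image_embs = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.6, 0.7, 0.0]]
labels = [0, 1, 0]

print(accuracy(image_embs, labels, class_text_embs))
```

Because no labeled examples of the target classes are used to fit anything, the same routine applies unchanged to any dataset once class prompts are written; only the encoders producing the embeddings differ between the compared models.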