CL · Dec 13, 2022

TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities

Zhe Zhao, Yudong Li, Cheng Hou, Jing Zhao, Rong Tian, Weijie Liu, Yiren Chen, Ningyuan Sun, Haoyan Liu, Weiquan Mao, Han Guo, Weigang Guo

arXiv:2212.06385v222.1232 citations · h-index: 36 · Has Code

Originality: Synthesis-oriented

AI Analysis

TencentPretrain gives researchers and practitioners a flexible toolkit for efficiently reproducing or building pre-training models across modalities. The contribution is incremental, since it builds on existing modular design trends.

The authors address the problem of implementing diverse pre-training models across text, vision, and audio modalities by developing TencentPretrain, a modular toolkit that unifies model components, and show that it matches the performance of the original implementations on benchmarks.

Recently, the success of pre-training in the text domain has been fully extended to vision, audio, and cross-modal scenarios. The proposed pre-training models of different modalities show a rising trend of homogeneity in their model structures, which creates the opportunity to implement different pre-training models within a uniform framework. In this paper, we present TencentPretrain, a toolkit supporting pre-training models of different modalities. The core feature of TencentPretrain is its modular design. The toolkit uniformly divides pre-training models into 5 components: embedding, encoder, target embedding, decoder, and target. As almost all common modules are provided in each component, users can choose the desired modules from different components to build a complete pre-training model. The modular design enables users to efficiently reproduce existing pre-training models or build brand-new ones. We test the toolkit on text, vision, and audio benchmarks and show that it can match the performance of the original implementations.
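The five-component design described in the abstract can be illustrated with a minimal sketch. This is not TencentPretrain's actual API: the registry, the module names (`word_pos`, `patch`, `mlm`, and so on), and the `build_model` helper are hypothetical, and the modules are stand-ins that record their label rather than compute tensors. The sketch also assumes that a model may leave a component unset, for example an encoder-only model that uses no target embedding or decoder.

```python
from typing import Callable, Dict, List, Optional

# A stand-in "module": takes a trace of stage labels and returns an extended trace.
Module = Callable[[List[str]], List[str]]

# The five component slots named in the abstract, in data-flow order.
ORDER = ("embedding", "encoder", "tgt_embedding", "decoder", "target")


def make_stage(label: str) -> Module:
    """Return a placeholder module that appends its label instead of computing tensors."""
    def stage(trace: List[str]) -> List[str]:
        return trace + [label]
    return stage


# Hypothetical registry: component slot -> module name -> module.
# The module names are illustrative, not the toolkit's real option names.
REGISTRY: Dict[str, Dict[str, Module]] = {
    "embedding": {name: make_stage(f"embedding:{name}") for name in ("word_pos", "patch", "speech")},
    "encoder": {name: make_stage(f"encoder:{name}") for name in ("transformer", "lstm")},
    "tgt_embedding": {name: make_stage(f"tgt_embedding:{name}") for name in ("word",)},
    "decoder": {name: make_stage(f"decoder:{name}") for name in ("transformer",)},
    "target": {name: make_stage(f"target:{name}") for name in ("mlm", "lm", "cls")},
}


def build_model(config: Dict[str, Optional[str]]) -> Module:
    """Compose one module per configured component into a single pipeline."""
    stages: List[Module] = []
    for component in ORDER:
        name = config.get(component)
        if name is None:
            continue  # assumption: unused components are simply skipped
        if name not in REGISTRY[component]:
            raise ValueError(f"unknown {component} module: {name!r}")
        stages.append(REGISTRY[component][name])

    def model(trace: List[str]) -> List[str]:
        for stage in stages:
            trace = stage(trace)
        return trace

    return model


# An encoder-only, BERT-like combination and a ViT-like one differ only in their config.
bert_like = build_model({"embedding": "word_pos", "encoder": "transformer", "target": "mlm"})
vit_like = build_model({"embedding": "patch", "encoder": "transformer", "target": "cls"})
print(bert_like([]))
print(vit_like([]))
```

Under these assumptions, swapping one entry in the configuration changes the modality or the training objective without touching the other components, which is the kind of reuse the modular design is meant to enable.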

