Semiparametric Token-Sequence Co-Supervision
This work aims to improve language model training for natural language processing tasks. The contribution appears incremental, since it builds on existing supervision methods by combining them.
The paper proposes a semiparametric token-sequence co-supervision method for training language models. The method combines a next token prediction loss, computed over a parametric token embedding space, with a next sequence prediction loss, computed over a nonparametric sequence embedding space. Models trained with both losses consistently outperform models trained with either loss alone and show broader generalization capability.
In this work, we introduce a semiparametric token-sequence co-supervision training method. It trains a language model by simultaneously leveraging supervision from two losses: the traditional next token prediction loss, which is calculated over the parametric token embedding space, and the next sequence prediction loss, which is calculated over the nonparametric sequence embedding space. The nonparametric sequence embedding space is constructed by a separate language model tasked with condensing an input text into a single representative embedding. Our experiments demonstrate that a model trained via both supervisions consistently surpasses models trained via each supervision independently. Analysis suggests that this co-supervision encourages a broader generalization capability across the model. In particular, the robustness of the parametric token space, which is established during the pretraining step, tends to enhance the stability of the nonparametric sequence embedding space, a new space established by another language model.
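To make the combination of the two supervisions concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that both losses are softmax cross-entropies over dot-product scores, that the next sequence prediction loss is contrastive over a small set of candidate sequence embeddings, and that the two losses are combined by a weighted sum. The function names, the `seq_weight` parameter, and all array shapes are hypothetical and chosen only for illustration.

```python
import numpy as np

def log_softmax(scores):
    """Numerically stable log-softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cross_entropy(scores, targets):
    """Mean negative log-likelihood of integer targets under softmax(scores)."""
    logp = log_softmax(scores)
    return float(-logp[np.arange(len(targets)), targets].mean())

def co_supervision_loss(hidden, token_emb, token_targets,
                        query_hidden, seq_emb, seq_targets, seq_weight=1.0):
    """Weighted sum of a token-level and a sequence-level loss (assumed form)."""
    # Next token prediction: score each hidden state against every row of the
    # parametric token embedding table (one row per vocabulary item).
    ntp = cross_entropy(hidden @ token_emb.T, token_targets)
    # Next sequence prediction: score each query hidden state against candidate
    # sequence embeddings, which would come from a separate encoder model.
    nsp = cross_entropy(query_hidden @ seq_emb.T, seq_targets)
    return ntp + seq_weight * nsp, ntp, nsp

# Toy usage with random arrays standing in for real model outputs.
rng = np.random.default_rng(0)
dim, vocab, n_tokens, n_queries, n_candidates = 8, 20, 5, 2, 4
total, ntp, nsp = co_supervision_loss(
    hidden=rng.normal(size=(n_tokens, dim)),
    token_emb=rng.normal(size=(vocab, dim)),
    token_targets=rng.integers(0, vocab, size=n_tokens),
    query_hidden=rng.normal(size=(n_queries, dim)),
    seq_emb=rng.normal(size=(n_candidates, dim)),
    seq_targets=rng.integers(0, n_candidates, size=n_queries),
    seq_weight=0.5,
)
print(f"NTP={ntp:.3f}  NSP={nsp:.3f}  total={total:.3f}")
```

In this sketch the token embedding table is a fixed-size matrix indexed by vocabulary id, while the candidate sequence embeddings can be any set of texts encoded on the fly, which is one way to read the parametric versus nonparametric distinction drawn above.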