BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages
BPEmb provides a resource-efficient, tokenization-free option for multilingual NLP tasks, though the contribution is incremental because it builds on existing BPE methods.
The authors address the problem of creating multilingual subword embeddings without tokenization by introducing BPEmb, a collection of pre-trained embeddings in 275 languages based on Byte-Pair Encoding. In fine-grained entity typing evaluations, BPEmb performed competitively and sometimes outperformed alternative subword approaches while using fewer resources.
We present BPEmb, a collection of pre-trained subword unit embeddings in 275 languages, based on Byte-Pair Encoding (BPE). In an evaluation using fine-grained entity typing as a testbed, BPEmb performs competitively, and for some languages better than alternative subword approaches, while requiring vastly fewer resources and no tokenization. BPEmb is available at https://github.com/bheinzerling/bpemb
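
To make the idea of Byte-Pair Encoding on untokenized text concrete, the following is a minimal sketch of the generic BPE merge procedure. It is not the authors' training pipeline: the function names (`learn_bpe`, `merge_pair`, `apply_bpe`), the use of `_` as a whitespace marker, and the toy corpus are all assumptions made for illustration, and the sketch covers only the segmentation step, not the training of embeddings.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def to_symbols(text):
    """Split raw text into characters; spaces become '_' (an illustrative
    convention), so no word tokenizer is required."""
    return list(text.replace(" ", "_"))


def learn_bpe(text, num_merges):
    """Learn up to `num_merges` merge operations from raw text.

    Each step merges the most frequent adjacent symbol pair. This toy
    version recounts pairs from scratch on every iteration.
    """
    symbols = to_symbols(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing left that occurs more than once
        merges.append(best)
        symbols = merge_pair(symbols, best)
    return merges, symbols


def apply_bpe(text, merges):
    """Segment new text by replaying the learned merges in order."""
    symbols = to_symbols(text)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


if __name__ == "__main__":
    merges, segmented = learn_bpe("low lower lowest", num_merges=2)
    print(merges)
    print(segmented)
    print(apply_bpe("slow", merges))
```

In this sketch the number of merge operations controls the vocabulary size: fewer merges yield short, character-like units, while more merges yield longer, word-like units. Each resulting subword unit would then be mapped to an embedding vector.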