Learning Word Embeddings from the Portuguese Twitter Stream: A Study of some Practical Aspects
This incremental study addresses practical aspects for researchers working with Portuguese social media data, focusing on scalability and evaluation issues.
The paper tackles the challenge of generating word embeddings from the Portuguese Twitter stream by scaling up both vocabulary size and training data, reaching a stable validation loss with a vocabulary of 32,768 words and 10 million training examples on a single GPU.
This paper describes a preliminary study for producing and distributing a large-scale database of embeddings from the Portuguese Twitter stream. We start by experimenting with a relatively small sample and focus on three challenges: volume of training data, vocabulary size and intrinsic evaluation metrics. Using a single GPU, we were able to scale up from a vocabulary of 2048 embedded words and 500K training examples to a vocabulary of 32768 words and 10M training examples, while keeping a stable validation loss and an approximately linear trend in training time per epoch. We also observed that using fewer than 50% of the available training examples for each vocabulary size might result in overfitting. Results on intrinsic evaluation show promising performance for a vocabulary size of 32768 words. Nevertheless, intrinsic evaluation metrics are overly sensitive to their corresponding cosine similarity thresholds, indicating that a wider range of metrics needs to be developed to track progress.
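The threshold sensitivity mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not define the metrics, so the code below assumes a simple hypothetical one: the fraction of word pairs expected to be related whose cosine similarity reaches a given threshold. The toy two-dimensional vectors, the word pairs, and the function names are illustrative assumptions, not the paper's data or implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def fraction_above(pairs, embeddings, threshold):
    """Hypothetical intrinsic metric: share of pairs with similarity >= threshold."""
    hits = sum(
        1 for a, b in pairs
        if cosine(embeddings[a], embeddings[b]) >= threshold
    )
    return hits / len(pairs)

# Toy embeddings (assumed for illustration only, not learned from Twitter data).
embeddings = {
    "bom": [1.0, 0.0],
    "boa": [1.0, 0.1],
    "dia": [1.0, 1.0],
    "noite": [1.0, 0.2],
    "gato": [0.0, 1.0],
    "cao": [0.8, 0.6],
}

# Pairs that a human annotator might label as related (assumed).
pairs = [("bom", "boa"), ("dia", "noite"), ("gato", "cao")]

for threshold in (0.5, 0.7, 0.9):
    score = fraction_above(pairs, embeddings, threshold)
    print(f"threshold={threshold:.1f} score={score:.2f}")
```

With these toy vectors the same embeddings score as perfect at a threshold of 0.5 and as one pair in three at 0.9, which is the kind of swing that makes a single threshold-based number hard to interpret on its own.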