Distilling BERT for low complexity network training
This work addresses the need for efficient NLP models on resource-constrained devices such as mobile phones and the Raspberry Pi. The contribution is incremental, however, since it applies existing distillation techniques to specific models.
The paper tackles the problem of transferring BERT's knowledge to simpler models such as BiLSTMs and CNNs for sentiment analysis on SST-2, showing that the distilled models achieve competitive performance while reducing inference complexity for edge devices.
This paper studies how efficiently BERT's learned knowledge transfers to low-complexity models such as a BiLSTM, a BiLSTM with attention, and shallow CNNs, using sentiment analysis on the SST-2 dataset. It also compares the inference complexity of BERT with that of these lower-complexity models, and underlines the importance of these techniques in enabling high-performance NLP models on edge devices such as mobile phones, tablets, and MCU development boards like the Raspberry Pi, and in enabling exciting new applications.
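The abstract does not say which distillation objective is used. As a rough illustration only, the sketch below shows one common formulation: a temperature-softened KL term between teacher and student outputs, blended with the usual cross-entropy on the gold label. The function names, the `temperature` and `alpha` parameters, and the two-class logits are assumptions made for this example, not details taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Hypothetical blended loss: alpha * soft term + (1 - alpha) * hard term.

    The soft term is KL(teacher || student) on temperature-softened
    distributions, scaled by temperature**2 as is commonly done.
    The hard term is cross-entropy against the gold label.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(pt * math.log(pt / ps)
               for pt, ps in zip(p_teacher, p_student) if pt > 0)
    hard = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard

# Illustrative two-class (negative/positive) logits, as in SST-2.
teacher = [-1.5, 2.5]          # e.g. output of a fine-tuned BERT teacher
student_close = [-1.0, 2.0]    # student that roughly agrees with the teacher
student_far = [2.0, -1.0]      # student that disagrees
loss_close = distillation_loss(student_close, teacher, label=1)
loss_far = distillation_loss(student_far, teacher, label=1)
```

In this setup a student whose logits track the teacher's receives a smaller loss than one that disagrees, which is the signal a BiLSTM or CNN student would be trained on.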