Supplementary Material: Implementation and Experiments for GAU-based Model
This work provides an incremental improvement for Chinese NLP tasks by optimizing a recent Transformer variant for better efficiency and performance.
The paper re-examines the implementation of the GAU (Gated Attention Unit) layer from the FLASH Transformer variant and adapts it into a novel GAU-based model pre-trained on a Chinese corpus. The model achieves a dev average score of 75.02 on the CLUE benchmark, 1% higher than RoFormerV1, while being 45% faster.
In February this year, Google proposed a new Transformer variant called FLASH, which is faster, has a lower VRAM footprint, and performs better. It achieves this through a performant layer named GAU (Gated Attention Unit), which combines the Attention layer and the FFN. In this paper, we re-analyze some of its implementation details both theoretically and practically. We then propose a novel GAU-based model and pre-train it on a Chinese corpus. On the CLUE benchmark, our model achieves a dev average score of 75.02, 1% higher than RoFormerV1 while being 45% faster, and is also competitive with RoFormerV2.
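To make concrete what it means for GAU to combine the Attention layer and the FFN, the following is a minimal single-head sketch in plain Python. It assumes the formulation from the original FLASH paper (SiLU projections, queries and keys derived from one shared low-dimensional projection, squared-ReLU attention weights scaled by sequence length, and an elementwise gate on the attention output). It is not necessarily the variant used by the model in this paper, and it omits normalization, the residual connection, and positional information. All function and parameter names (`gau`, `init_params`, `w_u`, `gamma_q`, and so on) are hypothetical and chosen for illustration only.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]


def silu(m):
    """Elementwise SiLU (swish) activation: x * sigmoid(x)."""
    return [[x / (1.0 + math.exp(-x)) for x in row] for row in m]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Gaussian-initialized matrix of shape (rows, cols)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def init_params(d, e, s, rng):
    """Hypothetical parameter layout: model dim d, expanded dim e, shared q/k dim s."""
    return {
        "w_u": rand_matrix(d, e, rng),   # gate projection
        "w_v": rand_matrix(d, e, rng),   # value projection
        "w_z": rand_matrix(d, s, rng),   # shared base for queries and keys
        "w_o": rand_matrix(e, d, rng),   # output projection
        "gamma_q": [1.0] * s, "beta_q": [0.0] * s,  # per-dimension scale/offset for q
        "gamma_k": [1.0] * s, "beta_k": [0.0] * s,  # per-dimension scale/offset for k
    }


def scale_offset(z, gamma, beta):
    """Apply a per-dimension scale and offset to every row of z."""
    return [[zi * g + b for zi, g, b in zip(row, gamma, beta)] for row in z]


def gau(x, p):
    """Single-head GAU sketch on x of shape (n, d); returns shape (n, d)."""
    n = len(x)
    u = silu(matmul(x, p["w_u"]))  # (n, e) gate branch, plays the FFN-like role
    v = silu(matmul(x, p["w_v"]))  # (n, e) value branch
    z = silu(matmul(x, p["w_z"]))  # (n, s) shared low-dimensional representation
    q = scale_offset(z, p["gamma_q"], p["beta_q"])
    k = scale_offset(z, p["gamma_k"], p["beta_k"])
    scores = matmul(q, transpose(k))  # (n, n) raw attention scores
    # Squared-ReLU attention weights, scaled by sequence length instead of softmax.
    attn = [[max(sc / n, 0.0) ** 2 for sc in row] for row in scores]
    av = matmul(attn, v)  # (n, e) attended values
    # The gate u modulates the attended values elementwise before the output projection.
    gated = [[ui * ai for ui, ai in zip(u_row, a_row)] for u_row, a_row in zip(u, av)]
    return matmul(gated, p["w_o"])


rng = random.Random(0)
params = init_params(d=4, e=8, s=3, rng=rng)
tokens = rand_matrix(5, 4, rng, scale=1.0)  # 5 tokens, model dimension 4
output = gau(tokens, params)
print(len(output), len(output[0]))
```

In this sketch the gate `u` and the output projection `w_o` take over the role of a gated FFN, while the `attn` and `v` product supplies the token mixing, so a single layer covers both jobs that a standard Transformer block splits across two sublayers.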