Raghu Kiran Ganti

LGJul 12, 2024

Enhancing Training Efficiency Using Packing with Flash Attention

Achintya Kundu, Rhui Dih Lee, Laura Wynter et al.

Padding is often used in tuning LLM models by adding special tokens to shorter training examples to match the length of the longest sequence in each batch. While this ensures uniformity for batch processing, it introduces inefficiencies by including irrelevant padding tokens in the computation and wastes GPU resources. Hugging Face SFT trainer has always offered the option to use packing to combine multiple training examples, allowing for maximal utilization of GPU resources. However, up till now, it did not offer proper masking of each packed training example. This capability has been added to Hugging Face Transformers 4.44. We analyse this new feature and show the benefits across different variations of packing.

AIAug 30, 2024

Flexible and Effective Mixing of Large Language Models into a Mixture of Domain Experts

Rhui Dih Lee, Laura Wynter, Raghu Kiran Ganti

We present a toolkit for creating low-cost Mixture-of-Domain-Experts (MOE) from trained models. The toolkit can be used for creating a mixture from models or from adapters. We perform extensive tests and offer guidance on defining the architecture of the resulting MOE using the toolkit. A public repository is available.

Raghu Kiran Ganti

2 Papers