CL, AI, LG | Feb 14, 2024

API Pack: A Massive Multi-Programming Language Dataset for API Call Generation

Zhen Guo, Adriana Meza Soria, Wei Sun, Yikang Shen, Rameswar Panda

arXiv:2402.09615v65.55 citations | h-index: 40 | Has Code | ICLR

Originality: Incremental advance

AI Analysis

This work addresses API call generation, a task relevant to both developers and researchers. It offers a dataset and a fine-tuning method that improve model performance across multiple programming languages. The advance is incremental because it builds on existing fine-tuning techniques.

The authors address API call generation with large language models by introducing API Pack, a massive multi-programming language dataset. They show that fine-tuning on it enables open-source models to outperform GPT-3.5 and GPT-4 in generating code for new API calls; for example, they fine-tune CodeLlama-13B on 20,000 Python instances from API Pack.

We introduce API Pack, a massive multi-programming language dataset containing over one million instruction-API calls for improving the API call generation capabilities of large language models. Our evaluation highlights three key findings: First, fine-tuning on API Pack enables open-source models to outperform GPT-3.5 and GPT-4 in generating code for entirely new API calls. We show this by fine-tuning CodeLlama-13B on 20,000 Python instances from API Pack. Second, fine-tuning on a large dataset in one language, combined with smaller datasets from others, improves API generation accuracy across multiple languages. Third, we confirm the benefits of larger datasets for API generalization, as increasing fine-tuning data to one million instances enhances generalization to new APIs. To support further research, we open-source the API Pack dataset, trained model, and code at https://github.com/zguo0525/API-Pack.
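The second finding, a large dataset in one language combined with smaller datasets from others, can be illustrated with a minimal sketch. The field names (`instruction`, `language`, `api_call`), the prompt template, the sample sizes, and the helpers `build_mixture` and `to_prompt` are all hypothetical; they are not taken from the API Pack repository and may differ from the authors' actual schema and pipeline.

```python
import random


def build_mixture(pools, primary, primary_size, secondary_size, seed=0):
    """Sample a fine-tuning mixture: many instances from one primary
    language, fewer from each of the other languages."""
    rng = random.Random(seed)
    mixture = []
    for language, instances in pools.items():
        quota = primary_size if language == primary else secondary_size
        # Never request more instances than the pool holds.
        mixture.extend(rng.sample(instances, min(quota, len(instances))))
    rng.shuffle(mixture)
    return mixture


def to_prompt(instance):
    """Render one instruction/API-call pair as a single training string.
    The template is an assumption, not the authors' format."""
    return (
        f"### Instruction:\n{instance['instruction']}\n\n"
        f"### Language:\n{instance['language']}\n\n"
        f"### API call:\n{instance['api_call']}"
    )


# Toy pools with placeholder content; real data would come from API Pack.
pools = {
    lang: [
        {
            "language": lang,
            "instruction": f"List users, page {i}",
            "api_call": f"<{lang} call {i}>",
        }
        for i in range(n)
    ]
    for lang, n in {"python": 50, "java": 20, "go": 20}.items()
}

mixture = build_mixture(pools, primary="python", primary_size=40, secondary_size=5)
print(len(mixture))
print(to_prompt(mixture[0]))
```

With these toy pools the mixture holds 40 Python instances plus 5 each for Java and Go. Fixing the seed keeps the sample reproducible, which matters when comparing fine-tuning runs that differ only in mixture proportions.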
