SELG · Jan 8, 2024

An Exploratory Study on Automatic Identification of Assumptions in the Development of Deep Learning Frameworks

arXiv:2401.03653v6 · h-index: 11 · Sci Comput Program
Originality: Synthesis-oriented
AI Analysis

This work addresses the high cost of manually identifying assumptions for developers and users in deep learning framework projects, but it is incremental, as it applies existing models to a new dataset.

This study tackles the problem of automatically identifying assumptions in deep learning framework development by evaluating classification models on a new dataset built from the TensorFlow and Keras repositories. The ALBERT model achieved the best performance, with an f1-score of 0.9584, well ahead of the second-best model, Claude 3.5 Sonnet, at 0.8858.

Stakeholders constantly make assumptions in the development of deep learning (DL) frameworks. These assumptions are related to various types of software artifacts (e.g., requirements, design decisions, and technical debt) and can turn out to be invalid, leading to system failures. Existing approaches and tools for assumption management usually depend on manual identification of assumptions. However, assumptions are scattered across various sources (e.g., code comments, commits, pull requests, and issues) in DL framework development, and identifying them manually is costly. This study evaluates different classification models for identifying assumptions, from the point of view of developers and users, in the context of DL framework projects (i.e., issues, pull requests, and commits) on GitHub. First, we constructed AssuEval, a new dataset of assumptions and the largest of its kind, collected from the TensorFlow and Keras repositories on GitHub. Then we explored the performance of seven non-transformer-based models (e.g., Support Vector Machine, Classification and Regression Trees), the ALBERT model, and three decoder-only models (i.e., ChatGPT, Claude, and Gemini) for identifying assumptions on the AssuEval dataset. The results show that ALBERT achieves the best performance (f1-score: 0.9584) for identifying assumptions on the AssuEval dataset, which is much better than the other models (the second-best f1-score is 0.8858, achieved by the Claude 3.5 Sonnet model). Though ChatGPT, Claude, and Gemini are popular models, we do not recommend using them to identify assumptions in DL framework development because of their low performance. Fine-tuning ChatGPT, Claude, Gemini, or other language models (e.g., Llama3, Falcon, and BLOOM) specifically for assumptions might improve their performance for assumption identification.
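
The abstract compares models by f1-score. As a reminder of what that number measures, here is a minimal sketch of how an F1 score is computed for one positive class. It assumes a binary labeling (1 = assumption, 0 = non-assumption) purely for illustration; the study's actual label scheme and averaging method are not stated in this summary, and the toy labels below are hypothetical, not drawn from AssuEval.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Return the F1 score (harmonic mean of precision and recall) for one class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    # True positives: labeled positive and predicted positive.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: predicted positive but labeled otherwise.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: labeled positive but predicted otherwise.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    if tp == 0:
        # Precision or recall is zero (or undefined), so F1 is taken as 0.
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels (assumption for illustration only).
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 0, 0]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.4f}")
```

Because F1 combines precision and recall, a model that flags nearly every sentence as an assumption, or nearly none, scores poorly even if its raw accuracy looks acceptable on an imbalanced dataset.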

Code Implementations: 1 repo
