LLM Benchmarking with LLaMA2: Evaluating Code Development Performance Across Multiple Programming Languages
The paper addresses the problem of automating scientific computing workflows for developers and researchers, but its contribution is incremental: it benchmarks an existing model on new data without introducing novel methods.
This paper evaluates the Llama 2-70B model's ability to automate software development tasks such as code generation and translation between programming languages. It finds that the model often produces functional code for simple numerical tasks but struggles with complex parallelized computations, which require significant manual corrections.
The rapid evolution of large language models (LLMs) has opened new possibilities for automating various tasks in software development. This paper evaluates the capabilities of the Llama 2-70B model in automating these tasks for scientific applications written in commonly used programming languages. Using representative test problems, we assess the model's capacity to generate code, documentation, and unit tests, as well as its ability to translate existing code between these languages. Our comprehensive analysis evaluates the compilation, runtime behavior, and correctness of the generated and translated code. Additionally, we assess the quality of the automatically generated code, documentation, and unit tests. Our results indicate that while Llama 2-70B frequently generates syntactically correct and functional code for simpler numerical tasks, it encounters substantial difficulties with more complex, parallelized, or distributed computations, which require considerable manual corrections. We identify key limitations and suggest areas for future improvements to better leverage AI-driven automation in scientific computing workflows.
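As an illustration of the three evaluation stages named in the abstract (compilation, runtime behavior, correctness), the following is a minimal sketch and not the authors' harness. It assumes the generated code is a standalone Python script whose correctness can be judged by comparing its standard output with a known reference. The function name `evaluate_generated_code` and its parameters are hypothetical. For a compiled language, the syntax check would be replaced by a compiler invocation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def evaluate_generated_code(source: str, expected_stdout: str,
                            timeout_s: float = 10.0) -> dict:
    """Hypothetical helper: report which evaluation stages a script passes."""
    result = {"compiles": False, "runs": False, "correct": False}

    # Stage 1 (compilation): for Python, a syntax check stands in for a compiler.
    try:
        compile(source, "<generated>", "exec")
    except (SyntaxError, ValueError):
        return result
    result["compiles"] = True

    # Stage 2 (runtime behavior): run in a separate process so that crashes
    # and hangs in the generated code cannot take down the harness.
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "generated.py"
        script.write_text(source)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return result
    if proc.returncode != 0:
        return result
    result["runs"] = True

    # Stage 3 (correctness): compare against the reference output
    # (assumption: exact match after stripping surrounding whitespace).
    result["correct"] = proc.stdout.strip() == expected_stdout.strip()
    return result
```

Separating the stages matters because a script can pass one and fail the next: code that parses may still crash, and code that runs may still print the wrong answer. Numerical results would typically need a tolerance-based comparison rather than the exact string match assumed here.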