Evaluating Large Language Models in Code Generation: INFINITE Methodology for Defining the Inference Index
This work addresses the need for better evaluation metrics for LLMs in code generation, though its contribution is incremental because it builds on existing benchmarking efforts.
This study introduces the INFINITE methodology for evaluating Large Language Models (LLMs) in code generation, focusing on efficiency, consistency, and accuracy through an Inference Index (InI). When applied to GPT-4o, OpenAI-o1 pro, and OpenAI-o3 mini-high generating Python LSTM code for meteorological forecasting, GPT-4o outperformed OpenAI-o1 pro and performed comparably to OpenAI-o3 mini-high in accuracy and workflow efficiency.
This study introduces a new methodology for an Inference Index (InI), called INFerence INdex In Testing model Effectiveness methodology (INFINITE), which aims to evaluate the performance of Large Language Models (LLMs) in code generation tasks. The InI provides a comprehensive assessment built on three key components: efficiency, consistency, and accuracy. This approach captures time-based efficiency, response quality, and the stability of model outputs, giving a fuller picture of LLM performance than traditional accuracy metrics alone. We applied this methodology to compare OpenAI's GPT-4o (GPT), OpenAI-o1 pro (OAI1), and OpenAI-o3 mini-high (OAI3) in generating Python code for a Long Short-Term Memory (LSTM) model that forecasts meteorological variables such as temperature, relative humidity, and wind velocity. Our findings demonstrate that GPT outperforms OAI1 and performs comparably to OAI3 in accuracy and workflow efficiency. The study shows that, with effective prompting and refinement, LLM-assisted code generation can produce results similar to those of expert-designed models. GPT's performance advantage highlights the benefits of widespread use and user feedback.
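To make the idea of a three-component index concrete, the sketch below combines efficiency, consistency, and accuracy scores into a single number. It is only an illustration: the abstract does not state how the InI is computed, so the weighted mean, the equal default weights, the assumption that each component is already normalized to [0, 1], and the names `ComponentScores` and `inference_index` are all hypothetical rather than taken from the study.

```python
from dataclasses import dataclass


@dataclass
class ComponentScores:
    """Hypothetical per-model scores, each assumed to be normalized to [0, 1]."""
    efficiency: float
    consistency: float
    accuracy: float


def inference_index(scores, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted mean of the three components.

    This aggregation rule is an assumption made for illustration;
    it is not the formula defined by the INFINITE methodology.
    """
    values = (scores.efficiency, scores.consistency, scores.accuracy)
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("component scores must lie in [0, 1]")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    # Dividing by the weight total lets callers pass unnormalized weights.
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


if __name__ == "__main__":
    # Placeholder numbers for illustration only; not results from the study.
    example = ComponentScores(efficiency=0.8, consistency=0.6, accuracy=0.7)
    print(round(inference_index(example), 3))
```

A single scalar of this kind makes models easy to rank, but it hides trade-offs between components, so reporting the three scores alongside the aggregate is usually more informative.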