CL | Apr 25

Evaluating Large Language Models on Computer Science University Exams in Data Structures

Edan Gabay, Yael Maoz, Jonathan Stahl, Naama Maoz, Abdo Amer, Orr Eilat, Hanoch Levy, Michal Kleinbort, Amir Rubinstein, Adi Haviv

arXiv:2604.23347 | 75.7

AI Analysis

For educators and researchers in CS education, this work provides a benchmark for assessing LLM performance on data structures exams. The contribution is incremental, since it applies existing models to a new dataset.

The authors evaluated four LLMs (GPT-4o, Claude 3.5, Mathstral 7B, LLaMA 3 8B) on a new benchmark of CS Data Structures exam questions from Tel Aviv University. They found that GPT-4o and Claude 3.5 performed well, but no specific numbers are reported.

We present a comprehensive evaluation of Large Language Models (LLMs) on Computer Science (CS) Data Structures examination questions. Our work introduces a new benchmark dataset comprising exam questions from Tel Aviv University (TAU), curated to assess LLMs' abilities in handling closed and multiple-choice questions. We evaluated two popular LLMs, OpenAI's GPT-4o and Anthropic's Claude 3.5, alongside two smaller LLMs, Mathstral 7B and LLaMA 3 8B, on the TAU exams benchmark. Our findings provide insight into the current capabilities of LLMs in CS education.
