CL · Apr 25

Evaluating Large Language Models on Computer Science University Exams in Data Structures

arXiv:2604.23347 · 75.7
AI Analysis

For educators and researchers in CS education, this work provides a benchmark for assessing LLM performance on Data Structures exams, but it is incremental: it applies existing models to a new dataset.

The authors evaluated four LLMs (GPT-4o, Claude 3.5, Mathstral 7B, LLaMA 3 8B) on a new benchmark of CS Data Structures exam questions from Tel Aviv University. They find that GPT-4o and Claude 3.5 performed well, although no specific numbers are reported.

We present a comprehensive evaluation of Large Language Models (LLMs) on Computer Science (CS) Data Structure examination questions. Our work introduces a new benchmark dataset comprising exam questions from Tel Aviv University (TAU), curated to assess LLMs' abilities in handling closed and multiple-choice questions. We evaluated the performance of two popular LLMs, OpenAI's GPT-4o and Anthropic's Claude 3.5, alongside two smaller LLMs, Mathstral 7B and LLaMA 3 8B, on the TAU exams benchmark. Our findings provide insight into the current capabilities of LLMs in CS education.
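As an illustration only, scoring model responses on multiple-choice questions of this kind might look like the sketch below. It is not the authors' evaluation code: the function names, the four-option A to D format, and the convention that each response ends with a line such as "Answer: B" are all assumptions made for this example.

```python
import re
from typing import List, Optional

# Assumption: models are prompted to finish with "Answer: <letter>",
# and every question has options labelled A through D.
ANSWER_RE = re.compile(r"answer\s*:\s*\(?([A-D])\)?", re.IGNORECASE)


def extract_choice(response: str) -> Optional[str]:
    """Return the last 'Answer: X' letter in a response, or None if absent."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].upper() if matches else None


def accuracy(responses: List[str], gold: List[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


if __name__ == "__main__":
    # Hypothetical responses, not taken from the TAU benchmark.
    demo_responses = [
        "A binary heap supports insert in O(log n). Answer: B",
        "answer: (c)",
        "I am not sure.",
    ]
    demo_gold = ["B", "C", "A"]
    print(f"accuracy = {accuracy(demo_responses, demo_gold):.3f}")
```

A response with no parsable answer counts as incorrect here; a real evaluation might instead report unparsable responses separately.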
