SE CL PL | Jul 11, 2025

Semantic Source Code Segmentation using Small and Large Language Models

Abdelhalim Dahou, Ansgar Scherp, Sebastian Kurten, Brigitte Mathiak, Madhu Chauhan

arXiv:2507.08992v1 | 1 citation | h-index: 29

Originality: Incremental advance

AI Analysis

The paper addresses inefficient code navigation and maintenance for developers, especially in low-resource languages such as R, which is used in the social sciences and psychology. The contribution is incremental because it builds on existing language model techniques.

The paper tackles automated segmentation of research R code into functional segments. It introduces two novel approaches and a dataset, and finds that context-based line-by-line analysis with fine-tuned small language models outperforms both range-based methods and large language models, even though the small models were not pre-trained on R code.

Source code segmentation, dividing code into functionally coherent segments, is crucial for knowledge retrieval and maintenance in software development. It enables efficient navigation and comprehension of large codebases, but manual and syntactic analysis approaches have become impractical as repositories grow, especially for low-resource languages like R and their research domains (e.g., social sciences, psychology). This paper introduces an automated, domain-specific approach for research R code segmentation using Large and Small Language Models (LLMs/SLMs). It presents two novel approaches and a human-annotated dataset, StatCodeSeg. We explore two distinct approaches: line-by-line analysis with context and range-based segment determination. We experiment with LLMs and fine-tuned SLMs. To support the generalizability of our approaches, we also include experiments on Python code from the computer science domain. Our results show that context-based line-by-line analysis is superior to range-based segmentation. Smaller language models such as CodeBERT and an encoder-only version of CodeT5+ perform better than their LLM counterparts. Most notably, unlike the LLMs, these two best-performing models did not see R code during pre-training and were only fine-tuned on 4,130 lines of manually annotated code.
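The two framings named in the abstract, line-by-line analysis with context and range-based segment determination, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The label names (`setup`, `data`, `analysis`), the window size `k`, and all function names are hypothetical and chosen for illustration only. The sketch shows only how the two representations relate: a range annotation expands into per-line labels, each line is paired with its neighbouring lines as context for a classifier, and per-line predictions collapse back into ranges.

```python
from typing import List, Tuple

Segment = Tuple[int, int, str]  # (start_line, end_line, label), 1-indexed, inclusive


def ranges_to_line_labels(n_lines: int, segments: List[Segment]) -> List[str]:
    """Expand range annotations into one label per line ("O" = unlabeled)."""
    labels = ["O"] * n_lines
    for start, end, label in segments:
        for i in range(start - 1, end):
            labels[i] = label
    return labels


def context_windows(lines: List[str], k: int = 1):
    """Pair each line with up to k lines before and after it as context."""
    windows = []
    for i, line in enumerate(lines):
        before = lines[max(0, i - k):i]
        after = lines[i + 1:i + 1 + k]
        windows.append((before, line, after))
    return windows


def line_labels_to_ranges(labels: List[str]) -> List[Segment]:
    """Merge runs of identical per-line labels back into ranges."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start + 1, i, labels[start]))
            start = i
    return segments


# Hypothetical R script and annotation, for illustration only.
r_lines = [
    "library(dplyr)",
    'df <- read.csv("data.csv")',
    "df <- filter(df, age > 18)",
    "model <- lm(score ~ age, data = df)",
    "summary(model)",
]
annotation = [(1, 1, "setup"), (2, 3, "data"), (4, 5, "analysis")]

line_labels = ranges_to_line_labels(len(r_lines), annotation)
windows = context_windows(r_lines, k=1)
recovered = line_labels_to_ranges(line_labels)
```

In a line-by-line setup, each `(before, line, after)` window would be passed to a classifier that predicts the label of the middle line. A range-based setup would instead ask the model to output `(start, end, label)` tuples directly.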
