Thomas Lemberger

SEJul 16, 2021Code

Towards a Benchmark Set for Program Repair Based on Partial Fixes

Dirk Beyer, Lars Grunske, Thomas Lemberger et al.

Software bugs significantly contribute to software cost and increase the risk of system malfunctioning. In recent years, many automated program-repair approaches have been proposed to automatically fix undesired program behavior. Despite of their great success, specific problems such as fixing bugs with partial fixes still remain unresolved. A partial fix to a known software issue is a programmer's failed attempt to fix the issue the first time. Even though it fails, this fix attempt still conveys important information such as the suspicious software region and the bug type. In this work we do not propose an approach for program repair with partial fixes, but instead answer a preliminary question: Do partial fixes occur often enough, in general, to be relevant for the research area of automated program repair? We crawled 1500 open-source C repositories on GitHub for partial fixes. The result is a benchmark set of 2204 benchmark tasks for automated program repair based on partial fixes. The benchmark set is available open source and open to further contributions and improvement.

CLOct 31, 2023

Integrating curation into scientific publishing to train AI models

Jorge Abreu-Vicente, Hannah Sonntag, Thomas Eidens et al.

High throughput extraction and structured labeling of data from academic articles is critical to enable downstream machine learning applications and secondary analyses. We have embedded multimodal data curation into the academic publishing process to annotate segmented figure panels and captions. Natural language processing (NLP) was combined with human-in-the-loop feedback from the original authors to increase annotation accuracy. Annotation included eight classes of bioentities (small molecules, gene products, subcellular components, cell lines, cell types, tissues, organisms, and diseases) plus additional classes delineating the entities' roles in experiment designs and methodologies. The resultant dataset, SourceData-NLP, contains more than 620,000 annotated biomedical entities, curated from 18,689 figures in 3,223 articles in molecular and cell biology. We evaluate the utility of the dataset to train AI models using named-entity recognition, segmentation of figure captions into their constituent panels, and a novel context-dependent semantic task assessing whether an entity is a controlled intervention target or a measurement object. We also illustrate the use of our dataset in performing a multi-modal task for segmenting figures into panel images and their corresponding captions.

Thomas Lemberger

2 Papers