Martin Gaedke

h-index22

4papers

3,270citations

Novelty43%

AI Score33

Ranked #119,129 of 194,257 authors (top 61%)#1,020 in HC (top 41%)

4 Papers

5.9SEApr 1

Containing the Reproducibility Gap: Automated Repository-Level Containerization for Scholarly Jupyter Notebooks

Sheeba Samuel, Daniel Mietchen, Hemanta Lo et al.

Computational reproducibility is fundamental to trustworthy science, yet remains difficult to achieve in practice across various research workflows, including Jupyter notebooks published alongside scholarly articles. Environment drift, undocumented dependencies and implicit execution assumptions frequently prevent independent re-execution of published research. Despite existing reproducibility guidelines, scalable and systematic infrastructure for automated assessment remains limited. We present an automated, web-oriented reproducibility engineering pipeline that reconstructs and evaluates repository-level execution environments for scholarly notebooks. The system performs dependency inference, automated container generation, and isolated execution to approximate the notebook's original computational context. We evaluate the approach on 443 notebooks from 116 GitHub repositories referenced by publications in PubMed Central. Execution outcomes are classified into four categories: resolved environment failures, persistent logic or data errors, reproducibility drift, and container-induced regressions. Our results show that containerization resolves 66.7% of prior dependency-related failures and substantially improves execution robustness. However, a significant reproducibility gap remains: 53.7% of notebooks exhibit low output fidelity, largely due to persistent runtime failures and stochastic non-determinism. These findings indicate that standardized containerization is essential for computational stability but insufficient for full bit-wise reproducibility. The framework offers a scalable solution for researchers, editors, and archivists seeking systematic, automated assessment of computational artifacts.

CLJun 26

Learning Complementary Action Modeling from Automotive Maintenance Instructions

Jiaqi Wu, Bai Li, Jochen Hartmann et al.

A minute lexical variation can reverse the procedural meaning of an instruction even when the rest of the sentence remains unchanged. In automotive maintenance instructions, this pattern often appears when an action phrase turns an instruction into its procedural counterpart. The entities, modifiers, and surrounding context remain largely invariant, while the action phrase determines the procedural relation. We define this task as Complementary Action Modeling (CAM). Given a maintenance instruction, the goal is to identify or generate its procedural counterpart by modifying the action phrase while preserving the remaining sentence context. This task focuses on three aspects: distinguishing complementarity from surface similarity, controlling generation at the action-phrase level, and evaluating relational correctness using retrieval, overlap-based, and human evaluation. Using a German automotive maintenance dataset, we examine these questions through candidate matching and controlled Seq2Seq generation. The results show that complementary maintenance instructions are best modeled as procedural associations grounded in subtle lexical cues. They should therefore not be treated as ordinary cases of sentence similarity or synonym-based paraphrasing.

3.3HCDec 25, 2020

Distributional Ground Truth: Non-Redundant Crowdsourcing Data Quality Control in UI Labeling Tasks

Maxim Bakaev, Sebastian Heil, Martin Gaedke

HCI increasingly employs Machine Learning and Image Recognition, in particular for visual analysis of user interfaces (UIs). A popular way for obtaining human-labeled training data is Crowdsourcing, typically using the quality control methods ground truth and majority consensus, which necessitate redundancy in the outcome. In our paper we propose a non-redundant method for prediction of crowdworkers' output quality in web UI labeling tasks, based on homogeneity of distributions assessed with two-sample Kolmogorov-Smirnov test. Using a dataset of about 500 screenshots with over 74,000 UI elements located and classified by 11 trusted labelers and 298 Amazon Mechanical Turk crowdworkers, we demonstrate the advantage of our approach over the baseline model based on mean Time-on-Task. Exploring different dataset partitions, we show that with the trusted set size of 17-27% UIs our "distributional ground truth" model can achieve R2s of over 0.8 and help to obviate the ancillary work effort and expenses.

3.5HCOct 29, 2016

REFOCUS: Current & Future Search Interface Requirements for German-speaking Users

Maximilian Speicher, Andreas Both, Martin Gaedke

While smartphones are widely used for web browsing, also other novel devices like Smart TVs become increasingly popular. Yet, current interfaces do not cater for the newly available devices beyond touch and small screens, if at all for the latter. Particularly search engines -- today's entry points of the WWW -- must ensure their interfaces are easy to use on any web-enabled device. We report on a survey that investigated (1) users' perception and usage of current search interfaces, and (2) their expectations towards current and future search interfaces. Users are mostly satisfied with desktop and mobile search, but seem to be skeptical towards web search with novel devices and input modalities. Hence, we derive REFOCUS -- a novel set of requirements for current and future search interfaces, which shall address the demand for improvement of novel web search and has been validated by 12 dedicated experts.