Krzysztof Sierszecki

h-index: 10

3 papers

379 citations

3 Papers

2.1 SE Jun 10

Undefined Behavior in C and C++: An Experiment With Desktop Use Cases

Jukka Ruohonen, Krzysztof Sierszecki

Undefined behavior is idiomatic to C and C++ programming; it arises from the use of an erroneous program construct for which the languages impose no requirements, such as signed integer overflow. The paper presents an empirical experiment that probes the extent of undefined behavior executing underneath typical desktop use of a Linux distribution. The analysis is based on an undefined behavior sanitizer implemented in a compiler. According to the results, undefined behavior is common. Completing 59 simple experimental tasks generated nearly 11 thousand unique undefined behavior warnings from 32 unique programs and libraries written in C or C++. Of these warnings, most were associated with the Mesa graphics library and generated by interacting with graphical user interfaces. Merely logging into the GNOME desktop environment generated over 500 unique warnings. Of all warnings, the clear majority concerned virtual table pointers. The associated stack traces were also lengthy in general. With these and other results, the paper contributes to the empirical literature on C and C++.
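To illustrate what counting "unique" sanitizer warnings can involve, here is a minimal sketch that deduplicates reports by source location and message. It assumes the `file:line:column: runtime error: message` line format printed by the undefined behavior sanitizers in GCC and Clang. The sample log lines, the function name `unique_warnings`, and the choice of deduplication key are hypothetical and are not taken from the paper.

```python
import re
from collections import Counter

# Assumed report format: "<file>:<line>:<col>: runtime error: <message>"
UBSAN_LINE = re.compile(
    r"^(?P<file>[^:\s]+):(?P<line>\d+):(?P<col>\d+): runtime error: (?P<msg>.+)$"
)


def unique_warnings(log_lines):
    """Count sanitizer reports keyed by (file, line, column, message).

    Lines that do not match the assumed format are ignored.
    """
    counts = Counter()
    for raw in log_lines:
        m = UBSAN_LINE.match(raw.strip())
        if m:
            key = (m["file"], int(m["line"]), int(m["col"]), m["msg"])
            counts[key] += 1
    return counts


# Hypothetical log excerpt, invented for illustration only.
SAMPLE_LOG = [
    "src/render.c:120:17: runtime error: signed integer overflow: "
    "2147483647 + 1 cannot be represented in type 'int'",
    "src/render.c:120:17: runtime error: signed integer overflow: "
    "2147483647 + 1 cannot be represented in type 'int'",
    "src/widget.cpp:48:9: runtime error: member call on address 0x000001 "
    "which does not point to an object of type 'Widget'",
    "unrelated application output",
]

if __name__ == "__main__":
    warnings = unique_warnings(SAMPLE_LOG)
    print(len(warnings), "unique warnings from", sum(warnings.values()), "reports")
```

Keying on location plus message is only one possible definition of uniqueness; a study could equally deduplicate on the full stack trace, which would yield different counts.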

1.2 CY Feb 26

GenAI Integration into Engineering Education: A Case Study of an Introductory Undergraduate Engineering Course

Kadir Kozan, Ozgur Keles, Sihan Jian et al.

GenAI has the potential to enhance learning and teaching processes in engineering education. For instance, GenAI feedback on students' task performance can be effective depending on when such feedback is provided. However, little is known about how engineering faculty and instructors discover such potential within the scope of their instruction when they try out the technology for the first time. To this end, this study aimed to describe an engineering instructor's and seven teaching assistants' initial experiences of integrating GenAI into their undergraduate engineering course and the corresponding changes in students' formative exercise performance. An embedded descriptive single case study design was employed. The corresponding research data included four interviews conducted at the beginning, middle, and end of an academic semester, and students' formative exercise performance. Overall, after GenAI integration, students' formative exercise performance increased, and a critical and reflective practice of learning about how to integrate GenAI into instruction provided informative insights. Still, technology integration stayed at the level of replacing other instructional methods or increasing the efficiency of solving coding problems. Students found it exciting and surprising to be able to use GenAI in course work, even though their use of the technology weakened over time. Our findings suggest that engineering teaching staff's initial experimental experiences with GenAI integration can be informative and provide context-specific practical insights. Therefore, it is reasonable for higher education institutions to encourage such experiences, especially when much remains unknown about an emerging technology.

7.0 SE Apr 7

CAKE: Cloud Architecture Knowledge Evaluation of Large Language Models

Tim Lukas Adam, Phongsakon Mark Konrad, Riccardo Terrenzi et al.

In today's software architecture, large language models (LLMs) serve as software architecture co-pilots. However, no benchmark currently exists to evaluate large language models' actual understanding of cloud-native software architecture. For this reason, we present a benchmark called CAKE, which consists of 188 expert-validated questions covering four cognitive levels of Bloom's revised taxonomy (recall, analyze, design, and implement) and five cloud-native topics. Evaluation is conducted on 22 model configurations (0.5B to 70B parameters) across four LLM families, using three-run majority voting for multiple-choice questions (MCQs) and LLM-as-a-judge scoring for free-response (FR) questions. Based on this evaluation, four notable findings were identified. First, MCQ accuracy plateaus above 3B parameters, with the best model reaching 99.2%. Second, free-response scores scale steadily across all cognitive levels. Third, the two formats capture different facets of knowledge, as MCQ accuracy approaches a ceiling while free-response scores continue to differentiate models. Finally, reasoning augmentation (+think) improves free-response quality, while tool augmentation (+tool) degrades performance for small models. These results suggest that the evaluation format fundamentally shapes how we measure architectural knowledge in LLMs.
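The three-run majority voting mentioned above can be sketched as follows. This is a minimal illustration, not the benchmark's actual scoring code: the function names, the sample answers, and the rule that a three-way disagreement counts as incorrect are all assumptions, since the abstract does not say how ties are handled.

```python
from collections import Counter


def majority_vote(answers):
    """Return the answer given by a strict majority of runs, or None if no
    answer has one (assumed tie rule: no majority means no answer)."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count > len(answers) / 2 else None


def mcq_accuracy(runs_per_question, answer_key):
    """Fraction of questions whose majority-voted answer matches the key."""
    correct = sum(
        majority_vote(runs) == key
        for runs, key in zip(runs_per_question, answer_key)
    )
    return correct / len(answer_key)


# Hypothetical data: three runs per question, one option letter per run.
SAMPLE_RUNS = [
    ["A", "A", "B"],  # majority "A"
    ["C", "D", "C"],  # majority "C"
    ["A", "B", "C"],  # no majority, scored as incorrect under the assumed rule
]
SAMPLE_KEY = ["A", "D", "C"]

if __name__ == "__main__":
    print(mcq_accuracy(SAMPLE_RUNS, SAMPLE_KEY))
```

Voting over several runs dampens sampling noise in a single generation, so a model is credited only for answers it produces consistently.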