AI Dec 24, 2025
Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks
Abhranil Chandra, Ayush Agrawal, Arian Hosseini et al.
We present the surprising finding that a language model's reasoning capabilities can be improved by training on synthetic datasets of chain-of-thought (CoT) traces from more capable models, even when all of those traces lead to an incorrect final answer. Our experiments show this approach can yield better performance on reasoning tasks than training on human-annotated datasets. We hypothesize that two key factors explain this phenomenon. First, the distribution of synthetic data is inherently closer to the language model's own distribution, making it more amenable to learning. Second, these 'incorrect' traces are often only partially flawed and contain valid reasoning steps from which the model can learn. To further test the first hypothesis, we use a language model to paraphrase human-annotated traces (shifting their distribution closer to the model's own distribution) and show that this improves performance. For the second hypothesis, we introduce increasingly flawed CoT traces and study to what extent models are tolerant to these flaws. We demonstrate our findings across various reasoning domains such as math, algorithmic reasoning, and code generation, using the MATH, GSM8K, Countdown, and MBPP datasets on language models ranging from 1.5B to 9B parameters across the Qwen, Llama, and Gemma families. Our study shows that curating datasets that are closer to the model's distribution is a critical aspect to consider. We also show that a correct final answer is not always a reliable indicator of a faithful reasoning process.
AI Oct 14, 2024
VideoAgent: Self-Improving Video Generation
Achint Soni, Sreyas Venkataraman, Abhranil Chandra et al.
Video generation has been used to generate visual plans for controlling robotic systems. Given an image observation and a language instruction, previous work has generated video plans which are then converted to robot controls to be executed. However, a major bottleneck in leveraging video generation for control lies in the quality of the generated videos, which often suffer from hallucinatory content and unrealistic physics, resulting in low task success when control actions are extracted from the generated videos. While scaling up dataset and model size provides a partial solution, integrating external feedback is both natural and essential for grounding video generation in the real world. With this observation, we propose VideoAgent for self-improving generated video plans based on external feedback. Instead of directly executing the generated video plan, VideoAgent first refines the generated video plans using a novel procedure which we call self-conditioning consistency, allowing inference-time compute to be turned into better generated video plans. As the refined video plan is being executed, VideoAgent can collect additional data from the environment to further improve video plan generation. Experiments in simulated robotic manipulation from MetaWorld and iTHOR show that VideoAgent drastically reduces hallucination, thereby boosting success rate of downstream manipulation tasks. We further illustrate that VideoAgent can effectively refine real-robot videos, providing an early indicator that robots can be an effective tool in grounding video generation in the physical world. Video demos and code can be found at https://video-as-agent.github.io.
CR Sep 24, 2025
STAF: Leveraging LLMs for Automated Attack Tree-Based Security Test Generation
Tanmay Khule, Stefan Marksteiner, Jose Alguindigue et al.
In modern automotive development, security testing is critical for safeguarding systems against increasingly advanced threats. Attack trees are widely used to systematically represent potential attack vectors, but generating comprehensive test cases from these trees remains a labor-intensive, error-prone task that has seen limited automation in the context of testing vehicular systems. This paper introduces STAF (Security Test Automation Framework), a novel approach to automating security test case generation. Leveraging Large Language Models (LLMs) and a four-step self-corrective Retrieval-Augmented Generation (RAG) framework, STAF automates the generation of executable security test cases from attack trees, providing an end-to-end solution that encompasses the entire attack surface. In particular, we show the elements and processes needed to enable an LLM to actually produce sensible and executable automotive security test suites, along with the integration with an automated testing framework. We further compare our tailored approach with general-purpose (vanilla) LLMs, and compare the performance of different LLMs (namely GPT-4.1 and DeepSeek) using our approach. We also demonstrate the operation of our method step by step in a concrete case study. Our results show significant improvements in efficiency, accuracy, scalability, and ease of integration into any workflow, marking a substantial advancement in automating automotive security testing methodologies. Using TARAs as an input for verification tests, we create synergies by connecting two vital elements of a secure automotive development process.
CV Jun 16, 2020
A generalizable saliency map-based interpretation of model outcome
Shailja Thakur, Sebastian Fischmeister
One of the significant challenges of deep neural networks is that the complex nature of the network prevents human comprehension of the outcome of the network. Consequently, the applicability of complex machine learning models is limited in safety-critical domains, where failures put life and property at risk. To fully exploit the capabilities of complex neural networks, we propose a non-intrusive interpretability technique that uses the input and output of the model to generate a saliency map. The method works by empirically optimizing a randomly initialized input mask by localizing and weighing individual pixels according to their sensitivity towards the target class. Our experiments show that the proposed model interpretability approach performs better than existing saliency map-based methods at localizing the relevant input pixels. Furthermore, to obtain a global perspective on the target-specific explanation, we propose a saliency map reconstruction approach to generate acceptable variations of the salient inputs from the space of input data distribution for which the model outcome remains unaltered. Experiments show that our interpretability method can reconstruct the salient part of the input with a classification accuracy of 89%.
CR Jun 12, 2020
CANOA: CAN Origin Authentication Through Power Side-Channel Monitoring
Shailja Thakur, Carlos Moreno, Sebastian Fischmeister
The lack of any sender authentication mechanism makes CAN (Controller Area Network) vulnerable to security threats. For instance, an attacker can impersonate an ECU (Electronic Control Unit) on the bus and send spoofed messages unobtrusively with the identifier of the impersonated ECU. To address this problem, we propose a novel sender authentication technique that uses power consumption measurements of the ECU to authenticate the sender of a message. When an ECU is transmitting, its power requirement is affected, and a characteristic pattern appears in its power consumption. Our technique exploits the power consumption of each ECU during the transmission of a message to determine whether the message actually originated from the purported sender. We evaluate our approach in both a lab setup and a real vehicle. We also evaluate our approach against factors that can impact the power consumption measurement of the ECU. The results of the evaluation show that the proposed technique is applicable in a broad range of operating conditions, has reasonable computational power requirements, and attains good accuracy.
LG Apr 10, 2019
Deep Learning for System Trace Restoration
Ilia Sucholutsky, Apurva Narayan, Matthias Schonlau et al.
Most real-world datasets, and particularly those collected from physical systems, are full of noise, packet loss, and other imperfections. However, most specification mining, anomaly detection, and other such algorithms assume, or even require, perfect data quality to function properly. Such algorithms may work in lab conditions when given clean, controlled data, but will fail in the field when given imperfect data. We propose a method for accurately reconstructing discrete temporal or sequential system traces affected by data loss, using Long Short-Term Memory networks (LSTMs). The model works by learning to predict the next event in a sequence of events, and uses its own output as an input to continue predicting future events. As a result, this method can be used for data restoration even with streamed data. Such a method can reconstruct even long sequences of missing events, and can also help validate and improve data quality for noisy data. The output of the model will be a close reconstruction of the true data, and can be fed to algorithms that rely on clean data. We demonstrate our method by reconstructing automotive CAN traces consisting of long sequences of discrete events. We show that given even small parts of a CAN trace, our LSTM model can predict future events with an accuracy of almost 90%, and can successfully reconstruct large portions of the original trace, greatly outperforming a Markov model benchmark. We separately feed the original, lossy, and reconstructed traces into a specification mining framework to perform downstream analysis of the effect of our method on state-of-the-art models that use these traces for understanding the behavior of complex systems.
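The predict-then-feed-back loop described in this abstract can be illustrated with a minimal sketch. It is not the authors' LSTM: as a stand-in it uses a first-order Markov predictor (the kind of baseline the abstract compares against), and the names `fit_markov` and `restore`, as well as the use of `None` to mark lost events, are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def fit_markov(trace):
    """Count first-order transitions between consecutive known events."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(trace, trace[1:]):
        if prev is not None and nxt is not None:
            counts[prev][nxt] += 1
    return counts


def restore(trace, counts):
    """Fill gaps (None) left to right.

    Each predicted event is written back into the trace, so the next
    prediction is conditioned on the model's own earlier output.
    """
    restored = list(trace)
    for i in range(1, len(restored)):
        prev = restored[i - 1]
        if restored[i] is None and prev in counts:
            # Most frequent successor of the previous (possibly predicted) event.
            restored[i] = counts[prev].most_common(1)[0][0]
    return restored


# Hypothetical toy trace: a clean training trace and a lossy one with gaps.
clean = ["A", "B", "C"] * 4
lossy = ["A", None, None, "A", "B", None]
model = fit_markov(clean)
print(restore(lossy, model))
```

Because each prediction is written back before the next step, a gap of several consecutive events is filled one event at a time, which is also why the same loop works on streamed data.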
SE Apr 11, 2017
Debugging Behaviour of Embedded-Software Developers: An Exploratory Study
Pansy Arafa, Daniel Solomon, Samaneh Navabpour et al.
Many researchers have studied the behaviour of successful developers while debugging desktop software. In this paper, we investigate embedded-software debugging by intermediate programmers through an exploratory study. The bugs are semantic low-level errors, and the participants are students who completed a real-time operating systems course in addition to five other programming courses. We compare the behaviour involved in successful debugging attempts with that in unsuccessful ones. We describe some characteristics of smooth and successful debugging behaviour.
SE Mar 7, 2017
Redundancy Suppression In Time-Aware Dynamic Binary Instrumentation
Pansy Arafa, Hany Kashif, Sebastian Fischmeister
Software tracing techniques are well established and are used by instrumentation tools to extract run-time information for program analysis and debugging. Dynamic binary instrumentation is one such technique: it instruments program binaries to extract information. Unfortunately, instrumentation causes perturbation that is unacceptable for time-sensitive applications. Consequently, we developed DIME*, a tool for dynamic binary instrumentation that considers timing constraints. DIME* uses Pin and a rate-based server approach to extract information only as long as user-specified constraints are maintained. Due to the large amount of redundancy in program traces, DIME* reduces the instrumentation overhead by one to three orders of magnitude compared to native Pin while extracting up to 99% of the information. We instrument VLC and PostgreSQL to demonstrate the usability of DIME*.
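The two ideas named in this abstract, a rate-based server that bounds instrumentation time and the suppression of redundant trace entries, can be sketched roughly as follows. This is an illustrative model only, not DIME*'s implementation and not the Pin API: the class `BudgetServer`, the function `collect_trace`, the integer time units, and the fixed per-event cost are all assumptions.

```python
class BudgetServer:
    """Grants instrumentation time up to `budget` units per `period` (hypothetical model)."""

    def __init__(self, budget, period):
        self.budget = budget
        self.period = period
        self.remaining = budget
        self.period_start = 0

    def try_consume(self, now, cost):
        # Replenish the budget when one or more full periods have elapsed.
        elapsed = now - self.period_start
        if elapsed >= self.period:
            self.period_start += (elapsed // self.period) * self.period
            self.remaining = self.budget
        if cost <= self.remaining:
            self.remaining -= cost
            return True
        return False


def collect_trace(events, server, cost):
    """Log each distinct code block once, and only while budget remains."""
    seen = set()
    log = []
    for now, block in events:
        if block in seen:
            continue  # redundant entry: already extracted, costs nothing more
        if server.try_consume(now, cost):
            seen.add(block)
            log.append(block)
    return log


# Hypothetical run: 2 units of instrumentation budget per 10 time units.
events = [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (12, "c"), (13, "d")]
print(collect_trace(events, BudgetServer(budget=2, period=10), cost=1))
```

In this toy run, block "c" is skipped at time 2 because the budget is exhausted, but it is still captured at time 12 after replenishment. That is the intuition for how a budgeted tracer can recover most of the information when the same code executes repeatedly.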
DM Mar 3, 2015
DAG-width of Control Flow Graphs with Applications to Model Checking
Therese Biedl, Sebastian Fischmeister, Neeraj Kumar
The treewidth of control flow graphs arising from structured programs is known to be at most six. However, as a control flow graph is inherently directed, it makes sense to consider a measure of width for digraphs instead. We use the so-called DAG-width and show that the DAG-width of control flow graphs arising from structured (goto-free) programs is at most three. We also give a linear-time algorithm to compute the DAG decomposition of these control flow graphs. One consequence of this result is that parity games (and hence the $\mu$-calculus model checking problem), which are known to be tractable on graphs of bounded DAG-width, can be solved efficiently in practice on control flow graphs.