Donghwan Shin

SE
h-index9
13papers
357citations
Novelty47%
AI Score42

13 Papers

LGOct 27, 2022
Many-Objective Reinforcement Learning for Online Testing of DNN-Enabled Systems

Fitash Ul Haq, Donghwan Shin, Lionel Briand

Deep Neural Networks (DNNs) have been widely used to perform real-world tasks in cyber-physical systems such as Autonomous Driving Systems (ADS). Ensuring the correct behavior of such DNN-Enabled Systems (DES) is a crucial topic. Online testing is one of the promising modes for testing such systems with their application environments (simulated or real) in a closed loop taking into account the continuous interaction between the systems and their environments. However, the environmental variables (e.g., lighting conditions) that might change during the systems' operation in the real world, causing the DES to violate requirements (safety, functional), are often kept constant during the execution of an online test scenario due to the two major challenges: (1) the space of all possible scenarios to explore would become even larger if they changed and (2) there are typically many requirements to test simultaneously. In this paper, we present MORLOT (Many-Objective Reinforcement Learning for Online Testing), a novel online testing approach to address these challenges by combining Reinforcement Learning (RL) and many-objective search. MORLOT leverages RL to incrementally generate sequences of environmental changes while relying on many-objective search to determine the changes so that they are more likely to achieve any of the uncovered objectives. We empirically evaluate MORLOT using CARLA, a high-fidelity simulator widely used for autonomous driving research, integrated with Transfuser, a DNN-enabled ADS for end-to-end driving. The evaluation results show that MORLOT is significantly more effective and efficient than alternatives with a large effect size. In other words, MORLOT is a good option to test DES with dynamically changing environments while accounting for multiple safety requirements.

SEJan 31, 2023
Identifying the Hazard Boundary of ML-enabled Autonomous Systems Using Cooperative Co-Evolutionary Search

Sepehr Sharifi, Donghwan Shin, Lionel C. Briand et al.

In Machine Learning (ML)-enabled autonomous systems (MLASs), it is essential to identify the hazard boundary of ML Components (MLCs) in the MLAS under analysis. Given that such boundary captures the conditions in terms of MLC behavior and system context that can lead to hazards, it can then be used to, for example, build a safety monitor that can take any predefined fallback mechanisms at runtime when reaching the hazard boundary. However, determining such hazard boundary for an ML component is challenging. This is due to the problem space combining system contexts (i.e., scenarios) and MLC behaviors (i.e., inputs and outputs) being far too large for exhaustive exploration and even to handle using conventional metaheuristics, such as genetic algorithms. Additionally, the high computational cost of simulations required to determine any MLAS safety violations makes the problem even more challenging. Furthermore, it is unrealistic to consider a region in the problem space deterministically safe or unsafe due to the uncontrollable parameters in simulations and the non-linear behaviors of ML models (e.g., deep neural networks) in the MLAS under analysis. To address the challenges, we propose MLCSHE (ML Component Safety Hazard Envelope), a novel method based on a Cooperative Co-Evolutionary Algorithm (CCEA), which aims to tackle a high-dimensional problem by decomposing it into two lower-dimensional search subproblems. Moreover, we take a probabilistic view of safe and unsafe regions and define a novel fitness function to measure the distance from the probabilistic hazard boundary and thus drive the search effectively. We evaluate the effectiveness and efficiency of MLCSHE on a complex Autonomous Vehicle (AV) case study. Our evaluation results show that MLCSHE is significantly more effective and efficient compared to a standard genetic algorithm and random search.

86.7CVMay 29
Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction

Mungyeom Kim, Minkyeong Jeon, Honggyu An et al.

Dynamic scene reconstruction from monocular video remains a fundamental challenge in computer vision. Existing feed-forward methods predict 3D Gaussians pixel-wise for each frame, suffering from duplicated Gaussians and view-dependent biases that hinder effective learning of scene motion. We present C4G, a feed-forward 4D reconstruction framework built upon a compact set of timestamp-conditioned learnable Gaussian query tokens. Each token aggregates corresponding features across the full temporal context and decodes a 3D Gaussian whose position is modulated by the target timestamp, enabling globally coherent motion modeling without per-scene optimization. To capture fine-grained details, we further introduce a video diffusion model-based rendering enhancement module. Since our framework effectively aggregates features into Gaussians, we extend this capability to feature lifting, producing a 4D feature field that supports point tracking and dynamic scene understanding. C4G achieves strong novel-view synthesis performance using significantly fewer Gaussians and without requiring camera poses, while exhibiting stronger motion modeling and robustness to large temporal gaps.

SEJan 11, 2024
Mutation-based Consistency Testing for Evaluating the Code Understanding Capability of LLMs

Ziyu Li, Donghwan Shin

Large Language Models (LLMs) have shown remarkable capabilities in processing both natural and programming languages, which have enabled various applications in software engineering, such as requirement engineering, code generation, and software testing. However, existing code generation benchmarks do not necessarily assess the code understanding performance of LLMs, especially for the subtle inconsistencies that may arise between code and its semantics described in natural language. In this paper, we propose a novel method to systematically assess the code understanding performance of LLMs, particularly focusing on subtle differences between code and its descriptions, by introducing code mutations to existing code generation datasets. Code mutations are small changes that alter the semantics of the original code, creating a mismatch with the natural language description. We apply different types of code mutations, such as operator replacement and statement deletion, to generate inconsistent code-description pairs. We then use these pairs to test the ability of LLMs to correctly detect the inconsistencies. We propose a new LLM testing method, called Mutation-based Consistency Testing (MCT), and conduct a case study on the two popular LLMs, GPT-3.5 and GPT-4, using the state-of-the-art code generation benchmark, HumanEval-X, which consists of six programming languages (Python, C++, Java, Go, JavaScript, and Rust). We compare the performance of the LLMs across different types of code mutations and programming languages and analyze the results. We find that the LLMs show significant variation in their code understanding performance and that they have different strengths and weaknesses depending on the mutation type and language.

SEJun 3, 2021
PRINS: Scalable Model Inference for Component-based System Logs

Donghwan Shin, Domenico Bianculli, Lionel Briand

Behavioral software models play a key role in many software engineering tasks; unfortunately, these models either are not available during software development or, if available, quickly become outdated as implementations evolve. Model inference techniques have been proposed as a viable solution to extract finite state models from execution logs. However, existing techniques do not scale well when processing very large logs that can be commonly found in practice. In this paper, we address the scalability problem of inferring the model of a component-based system from large system logs, without requiring any extra information. Our model inference technique, called PRINS, follows a divide-and-conquer approach. The idea is to first infer a model of each system component from the corresponding logs; then, the individual component models are merged together taking into account the flow of events across components, as reflected in the logs. We evaluated PRINS in terms of scalability and accuracy, using nine datasets composed of logs extracted from publicly available benchmarks and a personal computer running desktop business applications. The results show that PRINS can process large logs much faster than a publicly available and well-known state-of-the-art tool, without significantly compromising the accuracy of inferred models.

SEJan 26, 2021
Can Offline Testing of Deep Neural Networks Replace Their Online Testing?

Fitash Ul Haq, Donghwan Shin, Shiva Nejati et al.

We distinguish two general modes of testing for Deep Neural Networks (DNNs): Offline testing where DNNs are tested as individual units based on test datasets obtained without involving the DNNs under test, and online testing where DNNs are embedded into a specific application environment and tested in a closed-loop mode in interaction with the application environment. Typically, DNNs are subjected to both types of testing during their development life cycle where offline testing is applied immediately after DNN training and online testing follows after offline testing and once a DNN is deployed within a specific application environment. In this paper, we study the relationship between offline and online testing. Our goal is to determine how offline testing and online testing differ or complement one another and if offline testing results can be used to help reduce the cost of online testing? Though these questions are generally relevant to all autonomous systems, we study them in the context of automated driving systems where, as study subjects, we use DNNs automating end-to-end controls of steering functions of self-driving vehicles. Our results show that offline testing is less effective than online testing as many safety violations identified by online testing could not be identified by offline testing, while large prediction errors generated by offline testing always led to severe safety violations detectable by online testing. Further, we cannot exploit offline testing results to reduce the cost of online testing in practice since we are not able to identify specific situations where offline testing could be as accurate as online testing in identifying safety requirement violations.

SEDec 12, 2020
Digital Twins Are Not Monozygotic -- Cross-Replicating ADAS Testing in Two Industry-Grade Automotive Simulators

Markus Borg, Raja Ben Abdessalem, Shiva Nejati et al.

The increasing levels of software- and data-intensive driving automation call for an evolution of automotive software testing. As a recommended practice of the Verification and Validation (V&V) process of ISO/PAS 21448, a candidate standard for safety of the intended functionality for road vehicles, simulation-based testing has the potential to reduce both risks and costs. There is a growing body of research on devising test automation techniques using simulators for Advanced Driver-Assistance Systems (ADAS). However, how similar are the results if the same test scenarios are executed in different simulators? We conduct a replication study of applying a Search-Based Software Testing (SBST) solution to a real-world ADAS (PeVi, a pedestrian vision detection system) using two different commercial simulators, namely, TASS/Siemens PreScan and ESI Pro-SiVIC. Based on a minimalistic scene, we compare critical test scenarios generated using our SBST solution in these two simulators. We show that SBST can be used to effectively and efficiently generate critical test scenarios in both simulators, and the test results obtained from the two simulators can reveal several weaknesses of the ADAS under test. However, executing the same test scenarios in the two simulators leads to notable differences in the details of the test outputs, in particular, related to (1) safety violations revealed by tests, and (2) dynamics of cars and pedestrians. Based on our findings, we recommend future V&V plans to include multiple simulators to support robust simulation-based testing and to base test objectives on measures that are less dependant on the internals of the simulators.

CVDec 11, 2020
Automatic Test Suite Generation for Key-Points Detection DNNs using Many-Objective Search (Experience Paper)

Fitash Ul Haq, Donghwan Shin, Lionel C. Briand et al.

Automatically detecting the positions of key-points (e.g., facial key-points or finger key-points) in an image is an essential problem in many applications, such as driver's gaze detection and drowsiness detection in automated driving systems. With the recent advances of Deep Neural Networks (DNNs), Key-Points detection DNNs (KP-DNNs) have been increasingly employed for that purpose. Nevertheless, KP-DNN testing and validation have remained a challenging problem because KP-DNNs predict many independent key-points at the same time -- where each individual key-point may be critical in the targeted application -- and images can vary a great deal according to many factors. In this paper, we present an approach to automatically generate test data for KP-DNNs using many-objective search. In our experiments, focused on facial key-points detection DNNs developed for an industrial automotive application, we show that our approach can generate test suites to severely mispredict, on average, more than 93% of all key-points. In comparison, random search-based test data generation can only severely mispredict 41% of them. Many of these mispredictions, however, are not avoidable and should not therefore be considered failures. We also empirically compare state-of-the-art, many-objective search algorithms and their variants, tailored for test suite generation. Furthermore, we investigate and demonstrate how to learn specific conditions, based on image characteristics (e.g., head posture and skin color), that lead to severe mispredictions. Such conditions serve as a basis for risk analysis or DNN retraining.

SEApr 15, 2020
Effective Removal of Operational Log Messages: an Application to Model Inference

Donghwan Shin, Domenico Bianculli, Lionel Briand

Model inference aims to extract accurate models from the execution logs of software systems. However, in reality, logs may contain some "noise" that could deteriorate the performance of model inference. One form of noise can commonly be found in system logs that contain not only transactional messages---logging the functional behavior of the system---but also operational messages---recording the operational state of the system (e.g., a periodic heartbeat to keep track of the memory usage). In low-quality logs, transactional and operational messages are randomly interleaved, leading to the erroneous inclusion of operational behaviors into a system model, that ideally should only reflect the functional behavior of the system. It is therefore important to remove operational messages in the logs before inferring models. In this paper, we propose LogCleaner, a novel technique for removing operational logs messages. LogCleaner first performs a periodicity analysis to filter out periodic messages, and then it performs a dependency analysis to calculate the degree of dependency for all log messages and to remove operational messages based on their dependencies. The experimental results on two proprietary and 11 publicly available log datasets show that LogCleaner, on average, can accurately remove 98% of the operational messages and preserve 81% of the transactional messages. Furthermore, using logs pre-processed with LogCleaner decreases the execution time of model inference (with a speed-up ranging from 1.5 to 946.7 depending on the characteristics of the system) and significantly improves the accuracy of the inferred models, by increasing their ability to accept correct system behaviors (+43.8 pp on average, with pp=percentage points) and to reject incorrect system behaviors (+15.0 pp on average).

LGNov 28, 2019
Comparing Offline and Online Testing of Deep Neural Networks: An Autonomous Car Case Study

Fitash Ul Haq, Donghwan Shin, Shiva Nejati et al.

There is a growing body of research on developing testing techniques for Deep Neural Networks (DNN). We distinguish two general modes of testing for DNNs: Offline testing where DNNs are tested as individual units based on test datasets obtained independently from the DNNs under test, and online testing where DNNs are embedded into a specific application and tested in a close-loop mode in interaction with the application environment. In addition, we identify two sources for generating test datasets for DNNs: Datasets obtained from real-life and datasets generated by simulators. While offline testing can be used with datasets obtained from either sources, online testing is largely confined to using simulators since online testing within real-life applications can be time-consuming, expensive and dangerous. In this paper, we study the following two important questions aiming to compare test datasets and testing modes for DNNs: First, can we use simulator-generated data as a reliable substitute to real-world data for the purpose of DNN testing? Second, how do online and offline testing results differ and complement each other? Though these questions are generally relevant to all autonomous systems, we study them in the context of automated driving systems where, as study subjects, we use DNNs automating end-to-end control of cars' steering actuators. Our results show that simulator-generated datasets are able to yield DNN prediction errors that are similar to those obtained by testing DNNs with real-life datasets. Further, offline testing is more optimistic than online testing as many safety violations identified by online testing could not be identified by offline testing, while large prediction errors generated by offline testing always led to severe safety violations detectable by online testing.

SEAug 6, 2019
Scalable Inference of System-level Models from Component Logs

Donghwan Shin, Salma Messaoudi, Domenico Bianculli et al.

Behavioral software models play a key role in many software engineering tasks; unfortunately, these models either are not available during software development or, if available, they quickly become outdated as the implementations evolve. Model inference techniques have been proposed as a viable solution to extract finite-state models from execution logs. However, existing techniques do not scale well when processing very large logs, such as system-level logs obtained by combining component-level logs. Furthermore, in the case of component-based systems, existing techniques assume to know the definitions of communication channels between components. However, this information is usually not available in the case of systems integrating 3rd-party components with limited documentation. In this paper, we address the scalability problem of inferring the model of a component-based system from the individual component-level logs, when the only available information about the system are high-level architecture dependencies among components and a (possibly incomplete) list of log message templates denoting communication events between components. Our model inference technique, called SCALER, follows a divide and conquer approach. The idea is to first infer a model of each system component from the corresponding logs; then, the individual component models are merged together taking into account the dependencies among components, as reflected in the logs. We evaluated SCALER in terms of scalability and accuracy, using a dataset of logs from an industrial system; the results show that SCALER can process much larger logs than a state-of-the-art tool, while yielding more accurate models.

SESep 14, 2017
Empirical Evaluation of Mutation-based Test Prioritization Techniques

Donghwan Shin, Shin Yoo, Mike Papadakis et al.

We propose a new test case prioritization technique that combines both mutation-based and diversity-based approaches. Our diversity-aware mutation-based technique relies on the notion of mutant distinguishment, which aims to distinguish one mutant's behavior from another, rather than from the original program. We empirically investigate the relative cost and effectiveness of the mutation-based prioritization techniques (i.e., using both the traditional mutant kill and the proposed mutant distinguishment) with 352 real faults and 553,477 developer-written test cases. The empirical evaluation considers both the traditional and the diversity-aware mutation criteria in various settings: single-objective greedy, hybrid, and multi-objective optimization. The results show that there is no single dominant technique across all the studied faults. To this end, \rev{we we show when and the reason why each one of the mutation-based prioritization criteria performs poorly, using a graphical model called Mutant Distinguishment Graph (MDG) that demonstrates the distribution of the fault detecting test cases with respect to mutant kills and distinguishment.

SEJan 25, 2016
A Theoretical Framework for Understanding Mutation-Based Testing Methods

Donghwan Shin, Doo-Hwan Bae

In the field of mutation analysis, mutation is the systematic generation of mutated programs (i.e., mutants) from an original program. The concept of mutation has been widely applied to various testing problems, including test set selection, fault localization, and program repair. However, surprisingly little focus has been given to the theoretical foundation of mutation-based testing methods, making it difficult to understand, organize, and describe various mutation-based testing methods. This paper aims to consider a theoretical framework for understanding mutation-based testing methods. While there is a solid testing framework for general testing, this is incongruent with mutation-based testing methods, because it focuses on the correctness of a program for a test, while the essence of mutation-based testing concerns the differences between programs (including mutants) for a test. In this paper, we begin the construction of our framework by defining a novel testing factor, called a test differentiator, to transform the paradigm of testing from the notion of correctness to the notion of difference. We formally define behavioral differences of programs for a set of tests as a mathematical vector, called a d-vector. We explore the multi-dimensional space represented by d-vectors, and provide a graphical model for describing the space. Based on our framework and formalization, we interpret existing mutation-based fault localization methods and mutant set minimization as applications, and identify novel implications for future work.