Shaohua Wang

SE
h-index3
15papers
699citations
Novelty41%
AI Score42

15 Papers

SEAug 21, 2023
When Less is Enough: Positive and Unlabeled Learning Model for Vulnerability Detection

Xin-Cheng Wen, Xinchen Wang, Cuiyun Gao et al.

Automated code vulnerability detection has gained increasing attention in recent years. The deep learning (DL)-based methods, which implicitly learn vulnerable code patterns, have proven effective in vulnerability detection. The performance of DL-based methods usually relies on the quantity and quality of labeled data. However, the current labeled data are generally automatically collected, such as crawled from human-generated commits, making it hard to ensure the quality of the labels. Prior studies have demonstrated that the non-vulnerable code (i.e., negative labels) tends to be unreliable in commonly-used datasets, while vulnerable code (i.e., positive labels) is more determined. Considering the large numbers of unlabeled data in practice, it is necessary and worth exploring to leverage the positive data and large numbers of unlabeled data for more accurate vulnerability detection. In this paper, we focus on the Positive and Unlabeled (PU) learning problem for vulnerability detection and propose a novel model named PILOT, i.e., PositIve and unlabeled Learning mOdel for vulnerability deTection. PILOT only learns from positive and unlabeled data for vulnerability detection. It mainly contains two modules: (1) A distance-aware label selection module, aiming at generating pseudo-labels for selected unlabeled data, which involves the inter-class distance prototype and progressive fine-tuning; (2) A mixed-supervision representation learning module to further alleviate the influence of noise and enhance the discrimination of representations.

CVAug 28, 2024Code
A Comprehensive Review of 3D Object Detection in Autonomous Driving: Technological Advances and Future Directions

Yu Wang, Shaohua Wang, Yicheng Li et al.

In recent years, 3D object perception has become a crucial component in the development of autonomous driving systems, providing essential environmental awareness. However, as perception tasks in autonomous driving evolve, their variants have increased, leading to diverse insights from industry and academia. Currently, there is a lack of comprehensive surveys that collect and summarize these perception tasks and their developments from a broader perspective. This review extensively summarizes traditional 3D object detection methods, focusing on camera-based, LiDAR-based, and fusion detection techniques. We provide a comprehensive analysis of the strengths and limitations of each approach, highlighting advancements in accuracy and robustness. Furthermore, we discuss future directions, including methods to improve accuracy such as temporal perception, occupancy grids, and end-to-end learning frameworks. We also explore cooperative perception methods that extend the perception range through collaborative communication. By providing a holistic view of the current state and future developments in 3D object perception, we aim to offer a more comprehensive understanding of perception tasks for autonomous driving. Additionally, we have established an active repository to provide continuous updates on the latest advancements in this field, accessible at: https://github.com/Fishsoup0/Autonomous-Driving-Perception.

SEJul 10, 2024
Rectifier: Code Translation with Corrector via LLMs

Xin Yin, Chao Ni, Tien N. Nguyen et al.

Software migration is garnering increasing attention with the evolution of software and society. Early studies mainly relied on handcrafted translation rules to translate between two languages, the translation process is error-prone and time-consuming. In recent years, researchers have begun to explore the use of pre-trained large language models (LLMs) in code translation. However, code translation is a complex task that LLMs would generate mistakes during code translation, they all produce certain types of errors when performing code translation tasks, which include (1) compilation error, (2) runtime error, (3) functional error, and (4) non-terminating execution. We found that the root causes of these errors are very similar (e.g. failure to import packages, errors in loop boundaries, operator errors, and more). In this paper, we propose a general corrector, namely Rectifier, which is a micro and universal model for repairing translation errors. It learns from errors generated by existing LLMs and can be widely applied to correct errors generated by any LLM. The experimental results on translation tasks between C++, Java, and Python show that our model has effective repair ability, and cross experiments also demonstrate the robustness of our method.

SEAug 14, 2024
Learning-based Models for Vulnerability Detection: An Extensive Study

Chao Ni, Liyu Shen, Xiaodan Xu et al.

Though many deep learning-based models have made great progress in vulnerability detection, we have no good understanding of these models, which limits the further advancement of model capability, understanding of the mechanism of model detection, and efficiency and safety of practical application of models. In this paper, we extensively and comprehensively investigate two types of state-of-the-art learning-based approaches (sequence-based and graph-based) by conducting experiments on a recently built large-scale dataset. We investigate seven research questions from five dimensions, namely model capabilities, model interpretation, model stability, ease of use of model, and model economy. We experimentally demonstrate the priority of sequence-based models and the limited abilities of both LLM (ChatGPT) and graph-based models. We explore the types of vulnerability that learning-based models skilled in and reveal the instability of the models though the input is subtlely semantical-equivalently changed. We empirically explain what the models have learned. We summarize the pre-processing as well as requirements for easily using the models. Finally, we initially induce the vital information for economically and safely practical usage of these models.

CRApr 5
Triggering and Detecting Exploitable Library Vulnerability from the Client by Directed Greybox Fuzzing

Yukai Zhao, Menghan Wu, Xing Hu et al.

Developers utilize third-party libraries to improve productivity, which also introduces potential security risks. Existing approaches generate tests for public functions to trigger library vulnerabilities from client programs, yet they depend on proof-of-concepts (PoCs), which are often unavailable. In this paper, we propose a new approach, LiveFuzz, based on directed greybox fuzzing (DGF) to detect the exploitability of library vulnerabilities from client programs without PoCs. LiveFuzz exploits a target tuple to extend existing DGF techniques to cross-program scenarios. Based on the target tuple, LiveFuzz introduces a novel Abstract Path Mapping mechanism to project execution paths, mitigating the preference for shorter paths. LiveFuzz also proposes a risk-based adaptive mutation to mitigate excessive mutation. To evaluate LiveFuzz, we construct a new dataset including 61 cases of library vulnerabilities exploited from client programs. Results show that LiveFuzz increases the number of target-reachable paths compared with all baselines and improves the average speed of vulnerability exposure. Three vulnerabilities are triggered exclusively by LiveFuzz.

SEMay 20, 2025
LEANCODE: Understanding Models Better for Code Simplification of Pre-trained Large Language Models

Yan Wang, Ling Ding, Tien N Nguyen et al.

Large Language Models for code often entail significant computational complexity, which grows significantly with the length of the input code sequence. We propose LeanCode for code simplification to reduce training and prediction time, leveraging code contexts in utilizing attention scores to represent the tokens' importance. We advocate for the selective removal of tokens based on the average context-aware attention scores rather than average scores across all inputs. LeanCode uses the attention scores of `CLS' tokens within the encoder for classification tasks, such as code search. It also employs the encoder-decoder attention scores to determine token significance for sequence-to-sequence tasks like code summarization. Our evaluation shows LeanCode's superiority over the SOTAs DietCode and Slimcode, with improvements of 60% and 16% for code search, and 29% and 27% for code summarization, respectively.

SEMar 24, 2025
Toward building next-generation Geocoding systems: a systematic review

Zhengcong Yin, Daniel W. Goldberg, Binbin Lin et al.

Geocoding systems are widely used in both scientific research for spatial analysis and everyday life through location-based services. The quality of geocoded data significantly impacts subsequent processes and applications, underscoring the need for next-generation systems. In response to this demand, this review first examines the evolving requirements for geocoding inputs and outputs across various scenarios these systems must address. It then provides a detailed analysis of how to construct such systems by breaking them down into key functional components and reviewing a broad spectrum of existing approaches, from traditional rule-based methods to advanced techniques in information retrieval, natural language processing, and large language models. Finally, we identify opportunities to improve next-generation geocoding systems in light of recent technological advances.

CRJun 19, 2021
Vulnerability Detection with Fine-grained Interpretations

Yi Li, Shaohua Wang, Tien N. Nguyen

Despite the successes of machine learning (ML) and deep learning (DL) based vulnerability detectors (VD), they are limited to providing only the decision on whether a given code is vulnerable or not, without details on what part of the code is relevant to the detected vulnerability. We present IVDetect an interpretable vulnerability detector with the philosophy of using Artificial Intelligence (AI) to detect vulnerabilities, while using Intelligence Assistant (IA) via providing VD interpretations in terms of vulnerable statements. For vulnerability detection, we separately consider the vulnerable statements and their surrounding contexts via data and control dependencies. This allows our model better discriminate vulnerable statements than using the mixture of vulnerable code and~contextual code as in existing approaches. In addition to the coarse-grained vulnerability detection result, we leverage interpretable AI to provide users with fine-grained interpretations that include the sub-graph in the Program Dependency Graph (PDG) with the crucial statements that are relevant to the detected vulnerability. Our empirical evaluation on vulnerability databases shows that IVDetect outperforms the existing DL-based approaches by 43%--84% and 105%--255% in top-10 nDCG and MAP ranking scores. IVDetect correctly points out the vulnerable statements relevant to the vulnerability via its interpretation~in 67% of the cases with a top-5 ranked list. It improves over baseline interpretation models by 12.3%--400% and 9%--400% in accuracy.

CVApr 21, 2021
Machine vision detection to daily facial fatigue with a nonlocal 3D attention network

Zeyu Chen, Xinhang Zhang, Juan Li et al.

Fatigue detection is valued for people to keep mental health and prevent safety accidents. However, detecting facial fatigue, especially mild fatigue in the real world via machine vision is still a challenging issue due to lack of non-lab dataset and well-defined algorithms. In order to improve the detection capability on facial fatigue that can be used widely in daily life, this paper provided an audiovisual dataset named DLFD (daily-life fatigue dataset) which reflected people's facial fatigue state in the wild. A framework using 3D-ResNet along with non-local attention mechanism was training for extraction of local and long-range features in spatial and temporal dimensions. Then, a compacted loss function combining mean squared error and cross-entropy was designed to predict both continuous and categorical fatigue degrees. Our proposed framework has reached an average accuracy of 90.8% on validation set and 72.5% on test set for binary classification, standing a good position compared to other state-of-the-art methods. The analysis of feature map visualization revealed that our framework captured facial dynamics and attempted to build a connection with fatigue state. Our experimental results in multiple metrics proved that our framework captured some typical, micro and dynamic facial features along spatiotemporal dimensions, contributing to the mild fatigue detection in the wild.

SEFeb 27, 2021
Fault Localization with Code Coverage Representation Learning

Yi Li, Shaohua Wang, Tien N. Nguyen

In this paper, we propose DeepRL4FL, a deep learning fault localization (FL) approach that locates the buggy code at the statement and method levels by treating FL as an image pattern recognition problem. DeepRL4FL does so via novel code coverage representation learning (RL) and data dependencies RL for program statements. Those two types of RL on the dynamic information in a code coverage matrix are also combined with the code representation learning on the static information of the usual suspicious source code. This combination is inspired by crime scene investigation in which investigators analyze the crime scene (failed test cases and statements) and related persons (statements with dependencies), and at the same time, examine the usual suspects who have committed a similar crime in the past (similar buggy code in the training data). For the code coverage information, DeepRL4FL first orders the test cases and marks error-exhibiting code statements, expecting that a model can recognize the patterns discriminating between faulty and non-faulty statements/methods. For dependencies among statements, the suspiciousness of a statement is seen taking into account the data dependencies to other statements in execution and data flows, in addition to the statement by itself. Finally, the vector representations for code coverage matrix, data dependencies among statements, and source code are combined and used as the input of a classifier built from a Convolution Neural Network to detect buggy statements/methods. Our empirical evaluation shows that DeepRL4FL improves the top-1 results over the state-of-the-art statement-level FL baselines from 173.1% to 491.7%. It also improves the top-1 results over the existing method-level FL baselines from 15.0% to 206.3%.

SEFeb 27, 2021
A Context-based Automated Approach for Method Name Consistency Checking and Suggestion

Yi Li, Shaohua Wang, Tien N. Nguyen

Misleading method names in software projects can confuse developers, which may lead to software defects and affect code understandability. In this paper, we present DeepName, a context-based, deep learning approach to detect method name inconsistencies and suggest a proper name for a method. The key departure point is the philosophy of "Show Me Your Friends, I'll Tell You Who You Are". Unlike the state-of-the-art approaches, in addition to the method's body, we also consider the interactions of the current method under study with the other ones including the caller and callee methods, and the sibling methods in the same enclosing class. The sequences of sub-tokens in the program entities' names in the contexts are extracted and used as the input for an RNN-based encoder-decoder to produce the representations for the current method. We modify that RNN model to integrate the copy mechanism and our newly developed component, called the non-copy mechanism, to emphasize on the possibility of a certain sub-token not to be copied to follow the current sub-token in the currently generated method name. We conducted several experiments to evaluate DeepName on large datasets with +14M methods. For consistency checking, DeepName improves the state-of-the-art approach by 2.1%, 19.6%, and 11.9% relatively in recall, precision, and F-score, respectively. For name suggestion, DeepName improves relatively over the state-of-the-art approaches in precision (1.8%--30.5%), recall (8.8%--46.1%), and F-score (5.2%--38.2%). To assess DeepName's usefulness, we detected inconsistent methods and suggested new method names in active projects. Among 50 pull requests, 12 were merged into the main branch. In total, in 30/50 cases, the team members agree that our suggested method names are more meaningful than the current names.

CVNov 16, 2020
Zero Cost Improvements for General Object Detection Network

Shaohua Wang, Yaping Dai

Modern object detection networks pursuit higher precision on general object detection datasets, at the same time the computation burden is also increasing along with the improvement of precision. Nevertheless, the inference time and precision are both critical to object detection system which needs to be real-time. It is necessary to research precision improvement without extra computation cost. In this work, two modules are proposed to improve detection precision with zero cost, which are focus on FPN and detection head improvement for general object detection networks. We employ the scale attention mechanism to efficiently fuse multi-level feature maps with less parameters, which is called SA-FPN module. Considering the correlation of classification head and regression head, we use sequential head to take the place of widely-used parallel head, which is called Seq-HEAD module. To evaluate the effectiveness, we apply the two modules to some modern state-of-art object detection networks, including anchor-based and anchor-free. Experiment results on coco dataset show that the networks with the two modules can surpass original networks by 1.1 AP and 0.8 AP with zero cost for anchor-based and anchor-free networks, respectively. Code will be available at https://git.io/JTFGl.

SENov 18, 2019
Combining Program Analysis and Statistical Language Model for Code Statement Completion

Son Nguyen, Tien N. Nguyen, Yi Li et al.

Automatic code completion helps improve developers' productivity in their programming tasks. A program contains instructions expressed via code statements, which are considered as the basic units of program execution. In this paper, we introduce AutoSC, which combines program analysis and the principle of software naturalness to fill in a partially completed statement. AutoSC benefits from the strengths of both directions, in which the completed code statement is both frequent and valid. AutoSC is first trained on a large code corpus to derive the templates of candidate statements. Then, it uses program analysis to validate and concretize the templates into syntactically and type-valid candidate statements. Finally, these candidates are ranked by using a language model trained on the lexical form of the source code in the code corpus. Our empirical evaluation on the large datasets of real-world projects shows that AutoSC achieves 38.9-41.3% top-1 accuracy and 48.2-50.1% top-5 accuracy in statement completion. It also outperforms a state-of-the-art approach from 9X-69X in top-1 accuracy.

SESep 5, 2019
An Empirical Study on the Characteristics of Question-Answering Process on Developer Forums

Yi Li, Shaohua Wang, Tien N. Nguyen et al.

Developer forums are one of the most popular and useful Q&A websites on API usages. The analysis of API forums can be a critical resource for the automated question and answer approaches. In this paper, we empirically study three API forums including Twitter, eBay, and AdWords, to investigate the characteristics of question-answering process. We observe that +60% of the posts on all three forums were answered by providing API method names or documentation. +85% of the questions were answered by API development teams and the answers from API development teams drew fewer follow-up questions. Our results provide empirical evidences for us in a future work to build automated solutions to answer developer questions on API forums.

AIDec 10, 2018
A context-based geoprocessing framework for optimizing meetup location of multiple moving objects along road networks

Shaohua Wang, Song Gao, Xin Feng et al.

Given different types of constraints on human life, people must make decisions that satisfy social activity needs. Minimizing costs (i.e., distance, time, or money) associated with travel plays an important role in perceived and realized social quality of life. Identifying optimal interaction locations on road networks when there are multiple moving objects (MMO) with space-time constraints remains a challenge. In this research, we formalize the problem of finding dynamic ideal interaction locations for MMO as a spatial optimization model and introduce a context-based geoprocessing heuristic framework to address this problem. As a proof of concept, a case study involving identification of a meetup location for multiple people under traffic conditions is used to validate the proposed geoprocessing framework. Five heuristic methods with regard to efficient shortest-path search space have been tested. We find that the R* tree-based algorithm performs the best with high quality solutions and low computation time. This framework is implemented in a GIS environment to facilitate integration with external geographic contextual information, e.g., temporary road barriers, points of interest (POI), and real-time traffic information, when dynamically searching for ideal meetup sites. The proposed method can be applied in trip planning, carpooling services, collaborative interaction, and logistics management.