Hong Sun

CV
h-index: 7
Papers: 22
Citations: 249
Novelty: 49%
AI Score: 56

22 Papers

CL · Jul 13, 2023
AutoHint: Automatic Prompt Optimization with Hint Generation

Hong Sun, Xue Li, Yinchuan Xu et al. · microsoft-research

This paper presents AutoHint, a novel framework for automatic prompt engineering and optimization for Large Language Models (LLMs). While LLMs have demonstrated remarkable ability in achieving high-quality annotation in various tasks, the key to applying this ability to specific tasks lies in developing high-quality prompts. Thus we propose a framework to inherit the merits of both in-context learning and zero-shot learning by incorporating enriched instructions derived from input-output demonstrations to optimize the original prompt. We refer to the enrichment as the hint and propose a framework to automatically generate the hint from labeled data. More concretely, starting from an initial prompt, our method first instructs an LLM to deduce new hints for selected samples from incorrect predictions, and then summarizes from per-sample hints and adds the results back to the initial prompt to form a new, enriched instruction. The proposed method is evaluated on the BIG-Bench Instruction Induction dataset for both zero-shot and few-shot prompts, where experiments demonstrate our method is able to significantly boost accuracy for multiple tasks.
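The loop described in this abstract (predict, collect errors, deduce per-sample hints, summarize, append) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `enrich_prompt`, the `LLM` callable type, the prompt wording, and the choice to take the first `max_samples` errors are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical type: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def enrich_prompt(
    initial_prompt: str,
    labeled: List[Tuple[str, str]],
    llm: LLM,
    max_samples: int = 4,
) -> str:
    """One round of hint generation; returns the prompt with a summarized hint appended."""
    # 1. Run the initial prompt on the labeled data and keep the incorrect predictions.
    errors = []
    for text, label in labeled:
        prediction = llm(f"{initial_prompt}\nInput: {text}\nOutput:").strip()
        if prediction != label:
            errors.append((text, label, prediction))
    if not errors:
        return initial_prompt  # nothing to learn from in this round

    # 2. Ask for one hint per selected error (assumption: simply the first few errors).
    hints = [
        llm(
            f"Instruction: {initial_prompt}\nInput: {text}\n"
            f"Expected: {label}\nPredicted: {prediction}\n"
            "Give a short hint that would have led to the expected output."
        ).strip()
        for text, label, prediction in errors[:max_samples]
    ]

    # 3. Summarize the per-sample hints into a single hint.
    summary = llm("Summarize these hints into one short hint:\n" + "\n".join(hints)).strip()

    # 4. Add the summary back to the initial prompt to form the enriched instruction.
    return f"{initial_prompt}\nHint: {summary}"
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client; a real run would also need a sampling strategy for the error set, which the abstract does not specify.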

CR · Jun 14, 2023 · Code
Efficient Backdoor Attacks for Deep Neural Networks in Real-world Scenarios

Ziqiang Li, Hong Sun, Pengfei Xia et al.

Recent deep neural networks (DNNs) have come to rely on vast amounts of training data, providing an opportunity for malicious attackers to exploit and contaminate the data to carry out backdoor attacks. However, existing backdoor attack methods make unrealistic assumptions, assuming that all training data comes from a single source and that attackers have full access to the training data. In this paper, we introduce a more realistic attack scenario where victims collect data from multiple sources, and attackers cannot access the complete training data. We refer to this scenario as data-constrained backdoor attacks. In such cases, previous attack methods suffer from severe efficiency degradation due to the entanglement between benign and poisoning features during the backdoor injection process. To tackle this problem, we introduce three CLIP-based technologies from two distinct streams, Clean Feature Suppression and Poisoning Feature Augmentation, as an effective solution for data-constrained backdoor attacks. The results demonstrate remarkable improvements, with some settings achieving over 100% improvement compared to existing attacks in data-constrained scenarios. Code is available at https://github.com/sunh1113/Efficient-backdoor-attacks-for-deep-neural-networks-in-real-world-scenarios

81.6 · CR · May 30
The Invitation Trap: Proactive Availability Backdoor in LLMs via Conversational Induction

He Wang, Jun Feng, Hong Sun et al.

Current backdoor attacks against LLMs are typically manipulated by the attacker and remain passive. In this paper, we introduce the Proactive Availability Backdoor (PAB), a novel paradigm that shifts the attack vector from passive waiting to active social engineering. By weaponizing the inherent helpfulness of aligned LLMs, PAB proactively traps users into executing trigger-implanted queries by offering suggestions, achieving high aggressiveness, precision and stealthiness. To rigorously evaluate its threat in a real-life context, we introduce a dual-agent ecological simulation framework based on selected dimensions of the Five-Factor Model, and deploy PAB with few-shot prompts. Validated on different models and domains, PAB performs remarkably, and its effective attack success rate, which calculates the joint probability of attack incidence rate and attack success rate, reaches 73.1%. We also introduce Anti-PAB, a defense method tailored for PAB. Our findings reveal that the helpfulness of LLMs can be weaponized to compromise availability, exposing a serious hidden threat to LLM users. We release all the scripts and datasets in the experiments at https://anonymous.4open.science/r/PAB-ANONYMOUS/.

CR · Jun 14, 2023
A Proxy Attack-Free Strategy for Practically Improving the Poisoning Efficiency in Backdoor Attacks

Ziqiang Li, Hong Sun, Pengfei Xia et al.

Poisoning efficiency is crucial in poisoning-based backdoor attacks, as attackers aim to minimize the number of poisoning samples while maximizing attack efficacy. Recent studies have sought to enhance poisoning efficiency by selecting effective samples. However, these studies typically rely on a proxy backdoor injection task to identify an efficient set of poisoning samples. This proxy attack-based approach can lead to performance degradation if the proxy attack settings differ from those of the actual victims, due to the shortcut nature of backdoor learning. Furthermore, proxy attack-based methods are extremely time-consuming, as they require numerous complete backdoor injection processes for sample selection. To address these concerns, we present a Proxy attack-Free Strategy (PFS) designed to identify efficient poisoning samples based on the similarity between clean samples and their corresponding poisoning samples, as well as the diversity of the poisoning set. The proposed PFS is motivated by the observation that selecting samples with high similarity between clean and corresponding poisoning samples results in significantly higher attack success rates compared to using samples with low similarity. Additionally, we provide theoretical foundations to explain the proposed PFS. We comprehensively evaluate the proposed strategy across various datasets, triggers, poisoning rates, architectures, and training hyperparameters. Our experimental results demonstrate that PFS enhances backdoor attack efficiency while also offering a remarkable speed advantage over previous proxy attack-based selection methodologies.

CR · Oct 15, 2023
Explore the Effect of Data Selection on Poison Efficiency in Backdoor Attacks

Ziqiang Li, Pengfei Xia, Hong Sun et al.

As the number of parameters in Deep Neural Networks (DNNs) scales, the thirst for training data also increases. To save costs, it has become common for users and enterprises to delegate time-consuming data collection to third parties. Unfortunately, recent research has shown that this practice raises the risk of DNNs being exposed to backdoor attacks. Specifically, an attacker can maliciously control the behavior of a trained model by poisoning a small portion of the training data. In this study, we focus on improving the poisoning efficiency of backdoor attacks from the sample selection perspective. The existing attack methods construct such poisoned samples by randomly selecting some clean data from the benign set and then embedding a trigger into them. However, this random selection strategy ignores that each sample may contribute differently to the backdoor injection, thereby reducing the poisoning efficiency. To address the above problem, a new selection strategy named Improved Filtering and Updating Strategy (FUS++) is proposed. Specifically, we adopt the forgetting events of the samples to indicate the contribution of different poisoned samples and use the curvature of the loss surface to analyze the effectiveness of this phenomenon. Accordingly, we combine forgetting events and curvature of different samples to conduct a simple yet efficient sample selection strategy. The experimental results on image classification (CIFAR-10, CIFAR-100, ImageNet-10), text classification (AG News), audio classification (ESC-50), and age regression (Facial Age) consistently demonstrate the effectiveness of the proposed strategy: the attack performance using FUS++ is significantly higher than that using random selection for the same poisoning ratio.

MTRL-SCI · Nov 13, 2023
Novel models for fatigue life prediction under wideband random loads based on machine learning

Hong Sun, Yuanying Qiu, Jing Li et al.

Machine learning as a data-driven solution has been widely applied in the field of fatigue lifetime prediction. In this paper, three models for wideband fatigue life prediction are built based on three machine learning models, i.e. support vector machine (SVM), Gaussian process regression (GPR) and artificial neural network (ANN). The generalization ability of the models is enhanced by employing numerous power spectra samples with different bandwidth parameters and a variety of material properties related to fatigue life. Sufficient Monte Carlo numerical simulations demonstrate that the newly developed machine learning models are superior to the traditional frequency-domain models in terms of life prediction accuracy and the ANN model has the best overall performance among the three developed machine learning models.

CV · Sep 12, 2025 · Code
USCTNet: A deep unfolding nuclear-norm optimization solver for physically consistent HSI reconstruction

Xiaoyang Ma, Yiyang Chai, Xinran Qu et al.

Reconstructing hyperspectral images (HSIs) from a single RGB image is ill-posed and can become physically inconsistent when the camera spectral sensitivity (CSS) and scene illumination are misspecified. We formulate RGB-to-HSI reconstruction as a physics-grounded inverse problem regularized by a nuclear norm in a learnable transform domain, and we explicitly estimate CSS and illumination to define the forward operator embedded in each iteration, ensuring colorimetric consistency. To avoid the cost and instability of full singular-value decompositions (SVDs) required by singular-value thresholding (SVT), we introduce a data-adaptive low-rank subspace SVT operator. Building on these components, we develop USCTNet, a deep unfolding solver tailored to HSI that couples a parameter estimation module with learnable proximal updates. Extensive experiments on standard benchmarks show consistent improvements over state-of-the-art RGB-based methods in reconstruction accuracy. Code: https://github.com/psykheXX/USCTNet-Code-Implementation.git
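For reference, the standard full-SVD singular-value thresholding operator mentioned in this abstract can be written as below. This is the textbook operator, not the paper's data-adaptive low-rank subspace variant, which is not reproduced here; the symbols X, U, V, the singular values and the threshold are notation chosen for this note.

```latex
% Singular-value thresholding of X with threshold \tau > 0,
% where X = U \,\mathrm{diag}(\sigma_1, \dots, \sigma_r)\, V^\top is an SVD of X.
\mathcal{D}_\tau(X) = U \,\mathrm{diag}\big(\max(\sigma_i - \tau,\, 0)\big)\, V^\top

% It is the proximal operator of the nuclear norm \|Z\|_* = \sum_i \sigma_i(Z):
\mathcal{D}_\tau(X) = \arg\min_{Z}\; \tfrac{1}{2}\,\|Z - X\|_F^2 + \tau\,\|Z\|_*
```

Each application needs an SVD of X, which is the per-iteration cost the abstract says its subspace operator is meant to avoid.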

AI · Oct 15, 2013 · Code
Validation Rules for Assessing and Improving SKOS Mapping Quality

Hong Sun, Jos De Roo, Marc Twagirumukiza et al.

The Simple Knowledge Organization System (SKOS) is popular for expressing controlled vocabularies, such as taxonomies, classifications, etc., for their use in Semantic Web applications. Using SKOS, concepts can be linked to other concepts and organized into hierarchies inside a single terminology system. Meanwhile, expressing mappings between concepts in different terminology systems is also possible. This paper discusses potential quality issues in using SKOS to express these terminology mappings. Problematic patterns are defined and corresponding rules are developed to automatically detect situations where the mappings either result in 'SKOS Vocabulary Hijacking' to the source vocabularies or cause conflicts. An example of using the rules to validate sample mappings between two clinical terminologies is given. The validation rules, expressed in N3 format, are available as open source.
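One kind of mapping conflict can be illustrated with a small check. The SKOS Reference states that skos:exactMatch is disjoint with skos:broadMatch and skos:relatedMatch; because exactMatch is symmetric and narrowMatch is the inverse of broadMatch, the same clash arises for narrowMatch on the same pair. The sketch below is a plain Python approximation of such a check, not the paper's N3 rules; the function name and the concept identifiers in the usage example are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object) as prefixed names

# Mapping properties that must not hold together with skos:exactMatch on one concept pair.
CONFLICTS_WITH_EXACT = {"skos:broadMatch", "skos:narrowMatch", "skos:relatedMatch"}


def find_exact_match_conflicts(triples: Iterable[Triple]) -> List[Tuple[Tuple[str, ...], List[str]]]:
    """Return concept pairs mapped with exactMatch and also with a clashing mapping property."""
    # Group mapping properties by unordered concept pair, so that direction does not matter.
    props_by_pair: Dict[FrozenSet[str], Set[str]] = defaultdict(set)
    for subject, prop, obj in triples:
        props_by_pair[frozenset((subject, obj))].add(prop)

    conflicts = []
    for pair, props in props_by_pair.items():
        clashing = props & CONFLICTS_WITH_EXACT
        if "skos:exactMatch" in props and clashing:
            conflicts.append((tuple(sorted(pair)), sorted(clashing)))
    return sorted(conflicts)
```

For example, with the hypothetical mappings `("a:Flu", "skos:exactMatch", "b:Influenza")` and `("b:Influenza", "skos:broadMatch", "a:Flu")`, the pair `a:Flu` / `b:Influenza` is reported as conflicting, while a lone `skos:closeMatch` is not.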

LG · Nov 12, 2025
Preference is More Than Comparisons: Rethinking Dueling Bandits with Augmented Human Feedback

Shengbo Wang, Hong Sun, Ke Li

Interactive preference elicitation (IPE) aims to substantially reduce human effort while acquiring human preferences in wide personalization systems. Dueling bandit (DB) algorithms enable optimal decision-making in IPE building on pairwise comparisons. However, they remain inefficient when human feedback is sparse. Existing methods address sparsity by heavily relying on parametric reward models, whose rigid assumptions are vulnerable to misspecification. In contrast, we explore an alternative perspective based on feedback augmentation, and introduce critical improvements to the model-free DB framework. Specifically, we introduce augmented confidence bounds to integrate augmented human feedback under generalized concentration properties, and analyze the multi-factored performance trade-off via regret analysis. Our prototype algorithm achieves competitive performance across several IPE benchmarks, including recommendation, multi-objective optimization, and response optimization for large language models, demonstrating the potential of our approach for provably efficient IPE in broader applications.

LG · Sep 15, 2025
Unsupervised Atomic Data Mining via Multi-Kernel Graph Autoencoders for Machine Learning Force Fields

Hong Sun, Joshua A. Vita, Amit Samanta et al.

Constructing a chemically diverse dataset while avoiding sampling bias is critical to training efficient and generalizable force fields. However, in computational chemistry and materials science, many common dataset generation techniques are prone to oversampling regions of the potential energy surface. Furthermore, these regions can be difficult to identify and isolate from each other or may not align well with human intuition, making it challenging to systematically remove bias in the dataset. While traditional clustering and pruning (down-sampling) approaches can be useful for this, they can often lead to information loss or a failure to properly identify distinct regions of the potential energy surface due to difficulties associated with the high dimensionality of atomic descriptors. In this work, we introduce the Multi-kernel Edge Attention-based Graph Autoencoder (MEAGraph) model, an unsupervised approach for analyzing atomic datasets. MEAGraph combines multiple linear kernel transformations with attention-based message passing to capture geometric sensitivity and enable effective dataset pruning without relying on labels or extensive training. Demonstrated applications on niobium, tantalum, and iron datasets show that MEAGraph efficiently groups similar atomic environments, allowing for the use of basic pruning techniques for removing sampling bias. This approach provides an effective method for representation learning and clustering that can be used for data analysis, outlier detection, and dataset optimization.

CV · Jul 22, 2025
Beyond Label Semantics: Language-Guided Action Anatomy for Few-shot Action Recognition

Zefeng Qian, Xincheng Yao, Yifei Huang et al.

Few-shot action recognition (FSAR) aims to classify human actions in videos with only a small number of labeled samples per category. The scarcity of training data has driven recent efforts to incorporate additional modalities, particularly text. However, the subtle variations in human posture, motion dynamics, and the object interactions that occur during different phases are critical inherent knowledge of actions that cannot be fully exploited by action labels alone. In this work, we propose Language-Guided Action Anatomy (LGA), a novel framework that goes beyond label semantics by leveraging Large Language Models (LLMs) to dissect the essential representational characteristics hidden beneath action labels. Guided by the prior knowledge encoded in LLMs, LGA effectively captures rich spatiotemporal cues in few-shot scenarios. Specifically, for text, we prompt an off-the-shelf LLM to anatomize labels into sequences of atomic action descriptions, focusing on the three core elements of action (subject, motion, object). For videos, a Visual Anatomy Module segments actions into atomic video phases to capture the sequential structure of actions. A fine-grained fusion strategy then integrates textual and visual features at the atomic level, resulting in more generalizable prototypes. Finally, we introduce a Multimodal Matching mechanism, comprising both video-video and video-text matching, to ensure robust few-shot classification. Experimental results demonstrate that LGA achieves state-of-the-art performance across multiple FSAR benchmarks.

HC · Feb 25, 2021
Perspectives and solutions towards intelligent ambient assisted living systems

Hong Sun, Vincenzo De Florio

The population of the elderly people has kept increasing rapidly over the world in the past decades. Solutions that are able to effectively support the elderly people to live independently at their home are thus urgently needed. Ambient assisted living (AAL) aims to provide products and services with ambient intelligence to build a safe environment around people in need. With the high prevalence of multiple chronic diseases, the elderly people often need different levels of care management to prolong independent living at home. An effective AAL system should provide the required clinical support as an extension to the services provided in hospitals. Following the rapid growth of available data, together with the wide application of machine learning technologies, we are now able to build intelligent ambient assisted systems to fulfil such a request. This paper discusses different levels of intelligence in AAL. We also introduce our solution for building an intelligent AAL system with the discussed technologies. Taking semantic web technology as its backbone, such an AAL system is able to aggregate information from different sources, solve the semantic gap between different data sources, and perform adaptive and personalized carepath management based on the ambient environment.

LG · Jan 21, 2021
A scalable approach for developing clinical risk prediction applications in different hospitals

Hong Sun, Kristof Depraetere, Laurent Meesseman et al.

Objective: Machine learning algorithms are now widely used in predicting acute events for clinical applications. While most such prediction applications are developed to predict the risk of a particular acute event at one hospital, few efforts have been made to extend the developed solutions to other events or to different hospitals. We provide a scalable solution to extend the process of clinical risk prediction model development to multiple diseases and their deployment in different Electronic Health Records (EHR) systems. Materials and Methods: We defined a generic process for clinical risk prediction model development. A calibration tool has been created to automate the model generation process. We applied the model calibration process at four hospitals, and generated risk prediction models for delirium, sepsis and acute kidney injury (AKI) respectively at each of these hospitals. Results: The delirium risk prediction models achieved area under the receiver-operating characteristic curve (AUROC) ranging from 0.82 to 0.95 over different stages of a hospital stay on the test datasets of the four hospitals. The sepsis models achieved AUROC ranging from 0.88 to 0.95, and the AKI models achieved AUROC ranging from 0.85 to 0.92. Discussion: The scalability discussed in this paper is based on building common data representations (syntactic interoperability) between EHRs stored in different hospitals. Semantic interoperability, a more challenging requirement that different EHRs share the same meaning of data, e.g. the same lab coding system, is not mandated with our approach. Conclusions: Our study describes a method to develop and deploy clinical risk prediction models in a scalable way. We demonstrate its feasibility by developing risk prediction models for three diseases across four hospitals.
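The AUROC values reported in this abstract measure how well a model ranks positive cases above negative ones. A minimal sketch of the metric, assuming binary labels (1 for the event, 0 otherwise) and using the pairwise-ranking definition, is given below; the function name and the toy data are illustrative and have no connection to the paper's calibration tool.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")

    # Compare every positive score with every negative score.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ranked correctly.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The quadratic pairwise loop is fine for illustration; on real test sets a sort-based implementation from an established library would be the usual choice.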

CL · Oct 14, 2020
AutoADR: Automatic Model Design for Ad Relevance

Yiren Chen, Yaming Yang, Hong Sun et al.

Large-scale pre-trained models have attracted extensive attention in the research community and shown promising results on various tasks of natural language processing. However, these pre-trained models are memory and computation intensive, hindering their deployment into industrial online systems like Ad Relevance. Meanwhile, how to design an effective yet efficient model architecture is another challenging problem in online Ad Relevance. Recently, AutoML has shed new light on architecture design, but how to integrate it with pre-trained language models remains unsettled. In this paper, we propose AutoADR (Automatic model design for AD Relevance) -- a novel end-to-end framework to address this challenge, and share our experience in shipping these cutting-edge techniques into the online Ad Relevance system at Microsoft Bing. Specifically, AutoADR leverages a one-shot neural architecture search algorithm to find a tailored network architecture for Ad Relevance. The search process is simultaneously guided by knowledge distillation from a large pre-trained teacher model (e.g. BERT), while taking the online serving constraints (e.g. memory and latency) into consideration. We add the model designed by AutoADR as a sub-model into the production Ad Relevance model. This additional sub-model improves the Precision-Recall AUC (PR AUC) on top of the original Ad Relevance model by 2.65X of the normalized shipping bar. More importantly, adding this automatically designed sub-model leads to a statistically significant 4.6% Bad-Ad ratio reduction in online A/B testing. This model has been shipped into the Microsoft Bing Ad Relevance production model.

IV · Aug 27, 2020
Mixed Noise Removal with Pareto Prior

Zhou Liu, Lei Yu, Gui-Song Xia et al.

Denoising images contaminated by a mixture of additive white Gaussian noise (AWGN) and impulse noise (IN) is an essential but challenging problem. The presence of impulsive disturbances inevitably affects the distribution of noise and thus largely degrades the performance of traditional AWGN denoisers. Existing methods aim to compensate for the effects of IN by introducing a weighting matrix, which, however, lacks a proper prior and is thus hard to estimate accurately. To address this problem, we exploit the Pareto distribution as the prior of the weighting matrix, based on which an accurate and robust weight estimator is proposed for mixed noise removal. In particular, a relatively small portion of pixels are assumed to be contaminated with IN, which should have weights with small values and then be penalized out. This phenomenon can be properly described by the Pareto distribution of type 1. Therefore, armed with the Pareto distribution, we formulate the problem of mixed noise removal in the Bayesian framework, where a nonlocal self-similarity prior is further exploited by adopting nonlocal low-rank approximation. Compared to existing methods, the proposed method can estimate the weighting matrix adaptively, accurately, and robustly for different noise levels, and can thus boost the denoising performance. Experimental results on widely used image datasets demonstrate the superiority of our proposed method over the state of the art.

CV · Apr 24, 2017
A Dual Sparse Decomposition Method for Image Denoising

Hong Sun, Chen-guang Liu, Cheng-wei Sang

This article addresses the image denoising problem in situations of strong noise. We propose a dual sparse decomposition method. This method performs a sub-dictionary decomposition on the over-complete dictionary in the sparse decomposition. The sub-dictionary decomposition makes use of a novel criterion based on the occurrence frequency of atoms of the over-complete dictionary over the data set. The experimental results demonstrate that the dual-sparse-decomposition method surpasses state-of-the-art denoising performance in terms of both peak signal-to-noise ratio and structural similarity index metric, as well as in subjective visual quality.

CV · Nov 22, 2016
SAR image despeckling based on nonlocal similarity sparse decomposition

Chengwei Sang, Hong Sun, Quisong Xia

This letter presents a method of synthetic aperture radar (SAR) image despeckling aimed to preserve the detail information while suppressing speckle noise. This method combines the nonlocal self-similarity partition and a proposed modified sparse decomposition. The nonlocal partition method groups a series of structure-similarity data sets. Each data set has a good sparsity for learning an over-complete dictionary in sparse representation. In the sparse decomposition, we propose a novel method to identify principal atoms from over-complete dictionary to form a principal dictionary. Despeckling is performed on each data set over the principal dictionary with principal atoms. Experimental results demonstrate that the proposed method can achieve high performances in terms of both speckle noise reduction and structure details preservation.

ML · Oct 27, 2016
Sparse Signal Subspace Decomposition Based on Adaptive Over-complete Dictionary

Hong Sun, Chengwei Sang, Didier Le Ruyet

This paper proposes a subspace decomposition method based on an over-complete dictionary in sparse representation, called the "Sparse Signal Subspace Decomposition" (or 3SD) method. This method makes use of a novel criterion based on the occurrence frequency of atoms of the dictionary over the data set. This criterion, well adapted to subspace decomposition over a dependent basis set, adequately reflects the intrinsic characteristic of regularity of the signal. The 3SD method combines variance, sparsity and component frequency criteria into a unified framework. It benefits from using an over-complete dictionary, which preserves details, and from subspace decomposition, which rejects strong noise. The 3SD method is very simple, with a linear retrieval operation. It does not require any prior knowledge of distributions or parameters. When applied to image denoising, it demonstrates high performance both at preserving fine details and at suppressing strong noise.
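The occurrence-frequency criterion mentioned in this abstract can be illustrated by counting how often each dictionary atom is used across the sparse codes of a data set. The sketch below covers only that counting step, with sparse codes assumed to be given as dense coefficient lists; it is not the 3SD method itself, which also combines variance and sparsity criteria. The function names and the top-k selection rule are hypothetical.

```python
from typing import List, Sequence


def atom_occurrence_frequency(codes: Sequence[Sequence[float]], tol: float = 1e-12) -> List[float]:
    """Fraction of samples whose sparse code uses each atom (coefficient magnitude above tol)."""
    n_atoms = len(codes[0])
    counts = [0] * n_atoms
    for code in codes:
        for j, coefficient in enumerate(code):
            if abs(coefficient) > tol:
                counts[j] += 1
    return [count / len(codes) for count in counts]


def principal_atoms(codes: Sequence[Sequence[float]], k: int) -> List[int]:
    """Indices of the k most frequently used atoms (assumption: simple top-k selection)."""
    freq = atom_occurrence_frequency(codes)
    # Sort by descending frequency, breaking ties by atom index for determinism.
    order = sorted(range(len(freq)), key=lambda j: (-freq[j], j))
    return sorted(order[:k])


# Four samples coded over a four-atom dictionary; atom 0 is used in 3 of 4 samples.
codes = [
    [0.9, 0.0, 0.0, 0.1],
    [1.2, 0.0, 0.3, 0.0],
    [0.7, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0],
]
print(atom_occurrence_frequency(codes))  # [0.75, 0.0, 0.5, 0.25]
print(principal_atoms(codes, k=2))       # [0, 2]
```

The intuition, as the abstract puts it, is that atoms recurring across the data set reflect signal regularity, whereas atoms that appear rarely are more likely to be fitting noise.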

CV · Nov 25, 2015
Principal Basis Analysis in Sparse Representation

Hong Sun, Cheng-Wei Sang, Chen-Guang Liu

This article introduces a new signal analysis method, which can be interpreted as a principal component analysis in sparse decomposition of the signal. The method, called principal basis analysis, is based on a novel criterion: reproducibility of component which is an intrinsic characteristic of regularity in natural signals. We show how to measure reproducibility. Then we present the principal basis analysis method, which chooses, in a sparse representation of the signal, the components optimizing the reproducibility degree to build the so-called principal basis. With this principal basis, we show that the underlying signal pattern could be effectively extracted from corrupted data. As illustration, we apply the principal basis analysis to image denoising corrupted by Gaussian and non-Gaussian noises, showing better performances than some reference methods at suppressing strong noise and at preserving signal details.

DB · Nov 10, 2015
Semantic processing of EHR data for clinical research

Hong Sun, Kristof Depraetere, Jos De Roo et al.

There is a growing need to semantically process and integrate clinical data from different sources for clinical research. This paper presents an approach to integrate EHRs from heterogeneous resources and generate integrated data in different data formats or semantics to support various clinical research applications. The proposed approach builds semantic data virtualization layers on top of data sources, which generate data in the requested semantics or formats on demand. This approach avoids upfront dumping to and synchronizing of the data with various representations. Data from different EHR systems are first mapped to RDF data with source semantics, and then converted to representations with harmonized domain semantics where domain ontologies and terminologies are used to improve reusability. It is also possible to further convert data to application semantics and store the converted results in clinical research databases, e.g. i2b2, OMOP, to support different clinical research settings. Semantic conversions between different representations are explicitly expressed using N3 rules and executed by an N3 Reasoner (EYE), which can also generate proofs of the conversion processes. The solution presented in this paper has been applied to real-world applications that process large scale EHR data.

SE · Aug 22, 2015
A framework for adaptive real-time applications: the declarative real-time OSGi component model

Ning Gui, Vincenzo De Florio, Hong Sun et al.

Nowadays, more and more applications require OSGi to have some form of real-time support, which is currently very limited. The resulting closed-system solutions lack a standard management scheme, which precludes standard, system-wide policies for real-time systems' deployment, adaptation, and reconfiguration. In order to tackle this problem, this paper proposes a declarative real-time component model. In this model, the distinguishing real-time contract of each component is declaratively described, and a general component real-time management interface is designed. They are used to maintain an accurate view of existing real-time components' promised contracts. A real-time component runtime service is designed to control the whole lifecycle of the components. By using global information and a general control interface, it can adjust the system so that it continues to operate without impairing the deployed components' real-time contracts in the face of run-time changes. The system can easily be extended with other constraint-resolving policies to fit different contexts. The prototype has been tested in a simulated control system. The results show that this framework can provide good real-time performance while still providing support for real-time component dynamicity. To the best of our knowledge, this is the first comprehensive solution providing explicit real-time support from design to execution in the OSGi framework.

CY · Jan 12, 2014
The Missing Ones: Key Ingredients Towards Effective Ambient Assisted Living Systems

Hong Sun, Vincenzo De Florio, Ning Gui et al.

The population of elderly people keeps increasing rapidly and has become a predominant aspect of our societies. As such, solutions that are both efficacious and cost-effective need to be sought. Ambient Assisted Living (AAL) is a new approach which promises to address the needs of elderly people. In this paper, we claim that human participation is a key ingredient towards effective AAL systems, which not only saves social resources, but also has positive effects on the psychological health of the elderly people. Challenges in increasing human participation in ambient assisted living are discussed in this paper, and solutions to meet those challenges are also proposed. We use our proposed mutual assistance community, which is built with a service-oriented approach, as an example to demonstrate how to integrate human tasks in AAL systems. Our preliminary simulation results are presented, which support the effectiveness of human participation.