LG | Jul 21, 2024
Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts
Yi Liu, Chengjun Cai, Xiaoli Zhang et al.
Large Vision Language Models (VLMs) extend and enhance the perceptual abilities of Large Language Models (LLMs). Despite offering new possibilities for LLM applications, these advancements raise significant security and ethical concerns, particularly regarding the generation of harmful content. While LLMs have undergone extensive security evaluations with the aid of red teaming frameworks, VLMs currently lack a well-developed one. To fill this gap, we introduce Arondight, a standardized red team framework tailored specifically for VLMs. Arondight is dedicated to resolving issues related to the absence of visual modality and inadequate diversity encountered when transitioning existing red teaming methodologies from LLMs to VLMs. Our framework features an automated multi-modal jailbreak attack, wherein visual jailbreak prompts are produced by a red team VLM, and textual prompts are generated by a red team LLM guided by a reinforcement learning agent. To enhance the comprehensiveness of VLM security evaluation, we integrate entropy bonuses and novelty reward metrics. These elements incentivize the RL agent to guide the red team LLM in creating a wider array of diverse and previously unseen test cases. Our evaluation of ten cutting-edge VLMs exposes significant security vulnerabilities, particularly in generating toxic images and aligning multi-modal prompts. In particular, Arondight achieves an average attack success rate of 84.5% on GPT-4 in all fourteen prohibited scenarios defined by OpenAI in terms of generating toxic text. For a clearer comparison, we also categorize existing VLMs based on their safety levels and provide corresponding reinforcement recommendations. Our multimodal prompt dataset and red team code will be released after ethics committee approval. CONTENT WARNING: THIS PAPER CONTAINS HARMFUL MODEL RESPONSES.
RO | May 26, 2022
Multi-Phase Multi-Objective Dexterous Manipulation with Adaptive Hierarchical Curriculum
Lingfeng Tao, Jiucai Zhang, Xiaoli Zhang
Dexterous manipulation tasks usually have multiple objectives, and the priorities of these objectives may vary at different phases of a manipulation task. Varying priorities make it difficult, or even impossible, for a robot to learn an optimal policy with a deep reinforcement learning (DRL) method. To solve this problem, we develop a novel Adaptive Hierarchical Reward Mechanism (AHRM) to guide the DRL agent to learn manipulation tasks with multiple prioritized objectives. The AHRM can determine the objective priorities during the learning process and update the reward hierarchy to adapt to the changing objective priorities at different phases. The proposed method is validated in a multi-objective manipulation task with a JACO robot arm in which the robot needs to manipulate a target surrounded by obstacles. The simulation and physical experiment results show that the proposed method improves both task performance and learning efficiency.
ST | May 23, 2024 | Code
FinRobot: An Open-Source AI Agent Platform for Financial Applications using Large Language Models
Hongyang Yang, Boyu Zhang, Neng Wang et al.
As financial institutions and professionals increasingly incorporate Large Language Models (LLMs) into their workflows, substantial barriers, including proprietary data and specialized knowledge, persist between the finance sector and the AI community. These challenges impede the AI community's ability to enhance financial tasks effectively. Acknowledging financial analysis's critical role, we aim to devise financial-specialized LLM-based toolchains and democratize access to them through open-source initiatives, promoting wider AI adoption in financial decision-making. In this paper, we introduce FinRobot, a novel open-source AI agent platform supporting multiple financially specialized AI agents, each powered by an LLM. Specifically, the platform consists of four major layers: 1) the Financial AI Agents layer, which formulates Financial Chain-of-Thought (CoT) by breaking sophisticated financial problems down into logical sequences; 2) the Financial LLM Algorithms layer, which dynamically configures appropriate model application strategies for specific tasks; 3) the LLMOps and DataOps layer, which produces accurate models by applying training/fine-tuning techniques and using task-relevant data; 4) the Multi-source LLM Foundation Models layer, which integrates various LLMs and enables the above layers to access them directly. Finally, FinRobot provides a hands-on way for both professional-grade analysts and laypersons to use powerful AI techniques for advanced financial analysis. We open-source FinRobot at https://github.com/AI4Finance-Foundation/FinRobot.
TR | Mar 22 | Code
FinRL-X: An AI-Native Modular Infrastructure for Quantitative Trading
Hongyang Yang, Boyu Zhang, Yang She et al.
We present FinRL-X, a modular and deployment-consistent trading architecture that unifies data processing, strategy construction, backtesting, and broker execution under a weight-centric interface. While existing open-source platforms are often backtesting- or model-centric, they rarely provide system-level consistency between research evaluation and live deployment. FinRL-X addresses this gap through a composable strategy pipeline that integrates stock selection, portfolio allocation, timing, and portfolio-level risk overlays within a unified protocol. The framework supports both rule-based and AI-driven components, including reinforcement learning allocators and LLM-based sentiment signals, without altering downstream execution semantics. FinRL-X provides an extensible foundation for reproducible, end-to-end quantitative trading research and deployment. The official FinRL-X implementation is available at https://github.com/AI4Finance-Foundation/FinRL-Trading.
RO | May 26, 2022
Physics-Guided Hierarchical Reward Mechanism for Learning-Based Robotic Grasping
Yunsik Jung, Lingfeng Tao, Michael Bowman et al.
Learning-based grasping can afford real-time grasp motion planning of multi-fingered robotic hands thanks to its high computational efficiency. However, learning-based methods are required to explore large search spaces during the learning process. The large search space causes low learning efficiency, which has been the main barrier to practical adoption. In addition, the trained policy generalizes poorly unless the objects are identical to those used in training. In this work, we develop a novel Physics-Guided Deep Reinforcement Learning with a Hierarchical Reward Mechanism to improve learning efficiency and generalizability for learning-based autonomous grasping. Unlike conventional observation-based grasp learning, physics-informed metrics are utilized to convey correlations between features associated with hand structures and objects to improve learning efficiency and outcomes. Further, the hierarchical reward mechanism enables the robot to learn prioritized components of the grasping tasks. Our method is validated in robotic grasping tasks with a 3-finger MICO robot arm. The results show that our method outperformed the standard Deep Reinforcement Learning methods in various robotic grasping tasks.
CV | Nov 2, 2025
In-Context-Learning-Assisted Quality Assessment Vision-Language Models for Metal Additive Manufacturing
Qiaojie Zheng, Jiucai Zhang, Xiaoli Zhang
Vision-based quality assessment in additive manufacturing often requires dedicated machine learning models and application-specific datasets. However, data collection and model training can be expensive and time-consuming. In this paper, we leverage vision-language models' (VLMs') reasoning capabilities to assess the quality of printed parts and introduce in-context learning (ICL) to provide VLMs with necessary application-specific knowledge and demonstration samples. This method eliminates the requirement for large application-specific datasets for training models. We explored different sampling strategies for ICL to search for the optimal configuration that makes use of limited samples. We evaluated these strategies on two VLMs, Gemini-2.5-flash and Gemma3:27b, with quality assessment tasks in wire-laser direct energy deposition processes. The results show that ICL-assisted VLMs can reach quality classification accuracies similar to those of traditional machine learning models while requiring only a minimal number of samples. In addition, unlike traditional classification models that lack transparency, VLMs can generate human-interpretable rationales to enhance trust. Since there are no metrics to evaluate their interpretability in manufacturing applications, we propose two metrics, knowledge relevance and rationale validity, to evaluate the quality of VLMs' supporting rationales. Our results show that ICL-assisted VLMs can address application-specific tasks with limited data, achieving relatively high accuracy while also providing valid supporting rationales for improved decision transparency.
CV | Mar 17, 2023
Confidence-aware 3D Gaze Estimation and Evaluation Metric
Qiaojie Zheng, Xiaoli Zhang
Deep learning appearance-based 3D gaze estimation is gaining popularity due to its minimal hardware requirements and freedom from physical constraints. Unreliable and overconfident inferences, however, still limit the adoption of this gaze estimation method. To address unreliability and overconfidence, we introduce a confidence-aware model that predicts uncertainties together with gaze angle estimations. To assess the uncertainty estimation, we also introduce a novel effectiveness evaluation method based on the causality between eye feature degradation and the rise in inference uncertainty. Our confidence-aware model demonstrates reliable uncertainty estimations while providing angular estimation accuracies on par with the state-of-the-art. Compared with the existing statistical uncertainty-angular-error evaluation metric, the proposed effectiveness evaluation approach can more effectively judge the quality of the inferred uncertainty at each prediction.
CV | Jul 31, 2025 | Code
ST-SAM: SAM-Driven Self-Training Framework for Semi-Supervised Camouflaged Object Detection
Xihang Hu, Fuming Sun, Jiazhe Liu et al.
Semi-supervised Camouflaged Object Detection (SSCOD) aims to reduce reliance on costly pixel-level annotations by leveraging limited annotated data and abundant unlabeled data. However, existing SSCOD methods based on Teacher-Student frameworks suffer from severe prediction bias and error propagation under scarce supervision, while their multi-network architectures incur high computational overhead and limited scalability. To overcome these limitations, we propose ST-SAM, a highly annotation-efficient yet concise framework that breaks away from conventional SSCOD constraints. Specifically, ST-SAM employs a Self-Training strategy that dynamically filters and expands high-confidence pseudo-labels to enhance a single-model architecture, thereby fundamentally circumventing inter-model prediction bias. Furthermore, by transforming pseudo-labels into hybrid prompts containing domain-specific knowledge, ST-SAM effectively harnesses the Segment Anything Model's potential for specialized tasks to mitigate error accumulation in self-training. Experiments on COD benchmark datasets demonstrate that ST-SAM achieves state-of-the-art performance with only 1% labeled data, outperforming existing SSCOD methods and even matching fully supervised methods. Remarkably, ST-SAM requires training only a single network, without relying on specific models or loss functions. This work establishes a new paradigm for annotation-efficient SSCOD. Code will be available at https://github.com/hu-xh/ST-SAM.
RO | Mar 12
Decision-Aware Uncertainty Evaluation of Vision-Language Model-Based Early Action Anticipation for Human-Robot Interaction
Zhaoda Du, Michael Bowman, Qiaojie Zheng et al.
Robots in shared workspaces must interpret human actions from partial, ambiguous observations, where overconfident early predictions can lead to unsafe or disruptive interaction. This challenge is amplified in egocentric views, where viewpoint changes and occlusions increase perceptual noise and ambiguity. As a result, downstream human-robot interaction modules require not only an action hypothesis but also a trustworthy estimate of confidence under partial observation. Recent vision-language model-based approaches have been proposed for short-term action recognition due to their open-vocabulary and context-aware reasoning, but their uncertainty reliability in the temporal-prefix regime is largely uncharacterized. We present the first systematic evaluation of uncertainty in vision-language model-based short-term action recognition for human-robot interaction. We introduce a temporal-prefix evaluation protocol and metrics for calibration and selective prediction. We also characterize miscalibration patterns and failure modes under partial observations. Our study provides the missing reliability evidence needed to use vision-language model predictions in confidence-gated human-robot interaction modules.
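The abstract above mentions metrics for calibration and selective prediction without naming them. As a rough illustration only, the sketch below uses two common choices, expected calibration error (ECE) with equal-width bins and accuracy at a confidence threshold; these are assumptions, not necessarily the metrics used in the paper, and the function names are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sample-weighted gap between accuracy and mean confidence."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(hit for _, hit in members) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


def selective_prediction(confidences, correct, threshold):
    """Return (coverage, accuracy) when predictions below `threshold` are rejected."""
    kept = [hit for conf, hit in zip(confidences, correct) if conf >= threshold]
    coverage = len(kept) / len(confidences)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return coverage, accuracy
```

In a confidence-gated module, the threshold would trade coverage (how often the robot acts on a prediction) against the accuracy of the predictions it acts on.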
CV | Sep 15, 2025 | Code
MAFS: Masked Autoencoder for Infrared-Visible Image Fusion and Semantic Segmentation
Liying Wang, Xiaoli Zhang, Chuanmin Jia et al.
Infrared-visible image fusion methods aim to generate fused images with good visual quality while also improving the performance of high-level tasks. Existing semantic-driven methods have considered semantic information injection for downstream applications. However, none of them investigates the potential for reciprocal promotion between pixel-wise image fusion and cross-modal feature fusion perception tasks from a macroscopic task-level perspective. To address this limitation, we propose a unified network for image fusion and semantic segmentation. MAFS is a parallel structure, containing a fusion sub-network and a segmentation sub-network. On the one hand, we devise a heterogeneous feature fusion strategy to enhance semantic-aware capabilities for image fusion. On the other hand, by cascading the fusion sub-network and a segmentation backbone, segmentation-related knowledge is transferred to promote feature-level fusion-based segmentation. Within the framework, we design a novel multi-stage Transformer decoder to aggregate fine-grained multi-scale fused features efficiently. Additionally, a dynamic factor based on the max-min fairness allocation principle is introduced to generate adaptive weights for the two tasks and guarantee smooth training in a multi-task manner. Extensive experiments demonstrate that our approach achieves competitive results compared with state-of-the-art methods. The code is available at https://github.com/Abraham-Einstein/MAFS/.
CV | Jun 11, 2024 | Code
A Semantic-Aware and Multi-Guided Network for Infrared-Visible Image Fusion
Xiaoli Zhang, Liying Wang, Libo Zhao et al.
Multi-modality image fusion aims at fusing modality-specific (complementarity) and modality-shared (correlation) information from multiple source images. To tackle the neglect of inter-feature relationships, high-frequency information loss, and the limited attention to downstream tasks, this paper focuses on how to model correlation-driven decomposing features and reason high-level graph representation by efficiently extracting complementary information and aggregating multi-guided features. We propose a three-branch encoder-decoder architecture along with corresponding fusion layers as the fusion strategy. First, shallow features from individual modalities are extracted by a depthwise convolution layer combined with the transformer block. In the three parallel branches of the encoder, the Cross Attention and Invertible Block (CAI) extracts local features and preserves high-frequency texture details. The Base Feature Extraction Module (BFE) captures long-range dependencies and enhances modality-shared information. The Graph Reasoning Module (GR) is introduced to reason high-level cross-modality relations and simultaneously extract low-level detail features as CAI's modality-specific complementary information. Experiments demonstrate competitive results compared with state-of-the-art methods in visible/infrared image fusion and medical image fusion tasks. Moreover, the proposed algorithm surpasses the state-of-the-art methods on subsequent tasks, scoring on average 8.27% mAP@0.5 higher in object detection and 5.85% mIoU higher in semantic segmentation. The code is available at https://github.com/Abraham-Einstein/SMFNet/.
CV | Nov 7, 2025
Deep learning models are vulnerable, but adversarial examples are even more vulnerable
Jun Li, Yanwei Xu, Keran Li et al.
Understanding intrinsic differences between adversarial examples and clean samples is key to enhancing DNN robustness and detection against adversarial attacks. This study first empirically finds that image-based adversarial examples are notably sensitive to occlusion. Controlled experiments on CIFAR-10 used nine canonical attacks (e.g., FGSM, PGD) to generate adversarial examples, paired with original samples for evaluation. We introduce Sliding Mask Confidence Entropy (SMCE) to quantify model confidence fluctuation under occlusion. Using 1800+ test images, SMCE calculations supported by Mask Entropy Field Maps and statistical distributions show adversarial examples have significantly higher confidence volatility under occlusion than originals. Based on this, we propose Sliding Window Mask-based Adversarial Example Detection (SWM-AED), which avoids catastrophic overfitting of conventional adversarial training. Evaluations across classifiers and attacks on CIFAR-10 demonstrate robust performance, with accuracy over 62% in most cases and up to 96.5%.
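The abstract does not give the formula for SMCE, so the following is only one plausible reading, written as a minimal sketch: slide a square mask over the image, query the classifier on each occluded copy, and average the Shannon entropy of the returned class probabilities. The averaging rule, the mask size and stride, the zero fill value, and all function names are assumptions for illustration; `predict_proba` stands in for any classifier that maps an image to a probability vector.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sliding_masks(image, mask_size, stride, fill=0.0):
    """Yield copies of a 2D image (list of rows) with one square patch occluded."""
    height, width = len(image), len(image[0])
    for top in range(0, height - mask_size + 1, stride):
        for left in range(0, width - mask_size + 1, stride):
            masked = [row[:] for row in image]
            for y in range(top, top + mask_size):
                for x in range(left, left + mask_size):
                    masked[y][x] = fill
            yield masked


def mean_masked_entropy(image, predict_proba, mask_size=2, stride=2):
    """Average prediction entropy over all mask positions (assumed SMCE-like score)."""
    entropies = [
        shannon_entropy(predict_proba(masked))
        for masked in sliding_masks(image, mask_size, stride)
    ]
    return sum(entropies) / len(entropies)
```

Under this reading, a detector would flag an input as adversarial when its score exceeds a threshold chosen on clean validation images; that decision rule is likewise an assumption.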
CV | Jul 25, 2025
Cross Spatial Temporal Fusion Attention for Remote Sensing Object Detection via Image Feature Matching
Abu Sadat Mohammad Salehin Amit, Xiaoli Zhang, Md Masum Billa Shagar et al.
Effectively describing features for cross-modal remote sensing image matching remains a challenging task due to the significant geometric and radiometric differences between multimodal images. Existing methods primarily extract features at the fully connected layer but often fail to capture cross-modal similarities effectively. We propose a Cross Spatial Temporal Fusion (CSTF) mechanism that enhances feature representation by integrating scale-invariant keypoints detected independently in both reference and query images. Our approach improves feature matching in two ways: first, by creating correspondence maps that leverage information from multiple image regions simultaneously, and second, by reformulating the similarity matching process as a classification task using SoftMax and Fully Convolutional Network (FCN) layers. This dual approach enables CSTF to maintain sensitivity to distinctive local features while incorporating broader contextual information, resulting in robust matching across diverse remote sensing modalities. To demonstrate the practical utility of improved feature matching, we evaluate CSTF on object detection tasks using the HRSC2016 and DOTA benchmark datasets. Our method achieves state-of-the-art performance with an average mAP of 90.99% on HRSC2016 and 90.86% on DOTA, outperforming existing models. The CSTF model maintains computational efficiency with an inference speed of 12.5 FPS. These results validate that our approach to cross-modal feature matching directly enhances downstream remote sensing applications such as object detection.
CV | Nov 11, 2025
KPLM-STA: Physically-Accurate Shadow Synthesis for Human Relighting via Keypoint-Based Light Modeling
Xinhui Yin, Qifei Li, Yilin Guo et al.
Image composition aims to seamlessly integrate a foreground object into a background, where generating realistic and geometrically accurate shadows remains a persistent challenge. While recent diffusion-based methods have outperformed GAN-based approaches, existing techniques, such as the diffusion-based relighting framework IC-Light, still fall short in producing shadows with both high appearance realism and geometric precision, especially in composite images. To address these limitations, we propose a novel shadow generation framework based on a Keypoints Linear Model (KPLM) and a Shadow Triangle Algorithm (STA). KPLM models articulated human bodies using nine keypoints and one bounding block, enabling physically plausible shadow projection and dynamic shading across joints, thereby enhancing visual realism. STA further improves geometric accuracy by computing shadow angles, lengths, and spatial positions through explicit geometric formulations. Extensive experiments demonstrate that our method achieves state-of-the-art performance on shadow realism benchmarks, particularly under complex human poses, and generalizes effectively to multi-directional relighting scenarios such as those supported by IC-Light.
CV | Aug 20, 2025
QA-VLM: Providing human-interpretable quality assessment for wire-feed laser additive manufacturing parts with Vision Language Models
Qiaojie Zheng, Jiucai Zhang, Joy Gockel et al.
Image-based quality assessment (QA) in additive manufacturing (AM) often relies heavily on the expertise and constant attention of skilled human operators. While machine learning and deep learning methods have been introduced to assist in this task, they typically provide black-box outputs without interpretable justifications, limiting their trust and adoption in real-world settings. In this work, we introduce a novel QA-VLM framework that leverages the attention mechanisms and reasoning capabilities of vision-language models (VLMs), enriched with application-specific knowledge distilled from peer-reviewed journal articles, to generate human-interpretable quality assessments. Evaluated on 24 single-bead samples produced by laser wire direct energy deposition (DED-LW), our framework demonstrates higher validity and consistency in explanation quality than off-the-shelf VLMs. These results highlight the potential of our approach to enable trustworthy, interpretable quality assessment in AM applications.
CV | Jan 24, 2025
Enhancing accuracy of uncertainty estimation in appearance-based gaze tracking with probabilistic evaluation and calibration
Qiaojie Zheng, Jiucai Zhang, Xiaoli Zhang
Accurately knowing uncertainties in appearance-based gaze tracking is critical for ensuring reliable downstream applications. Due to the lack of individual uncertainty labels, current uncertainty-aware approaches adopt probabilistic models to acquire uncertainties by following distributions in the training dataset. Without regularization, this approach lets the uncertainty model build biases and overfit the training data, leading to poor performance when deployed. We first present a strictly proper evaluation metric from the probabilistic perspective, based on comparing the coverage probability between prediction and observation, to provide a quantitative assessment of the inferred uncertainties. We then propose a correction strategy based on probability calibration to mitigate biases in the estimated uncertainties of the trained models. Finally, we demonstrate the effectiveness of the correction strategy with experiments performed on two popular gaze estimation datasets with distinctive image characteristics caused by data collection settings.
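The idea of comparing coverage probability between prediction and observation can be sketched as follows. This is a simplified illustration, not the paper's metric: it assumes each prediction is a one-dimensional Gaussian with a predicted standard deviation, and it checks how often the observed error falls inside the central interval of a given nominal level. The function names are hypothetical.

```python
from statistics import NormalDist


def empirical_coverage(errors, sigmas, level):
    """Fraction of errors inside the central `level` interval of N(0, sigma^2)."""
    # Half-width of the central interval in units of sigma.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    inside = sum(1 for err, sigma in zip(errors, sigmas) if abs(err) <= z * sigma)
    return inside / len(errors)


def coverage_gap(errors, sigmas, level):
    """Observed minus nominal coverage; negative values suggest overconfidence."""
    return empirical_coverage(errors, sigmas, level) - level
```

A well-calibrated model would show a gap near zero at every nominal level; a consistently negative gap indicates that the predicted standard deviations are too small.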
CV | Oct 8, 2021
Bounding-box deep calibration for high performance face detection
Shi Luo, Xiongfei Li, Xiaoli Zhang
Modern convolutional neural network (CNN)-based face detectors have made tremendous strides thanks to large annotated datasets. However, misaligned results with high detection confidence but low localization accuracy restrict further improvement of detection performance. In this paper, the authors first predict high-confidence detection results on the training set itself. Surprisingly, a considerable part of them exhibit the same misalignment problem. Then, the authors carefully examine these cases and point out that annotation misalignment is the main reason. Later, a comprehensive discussion is given of the rationale for replacing annotated bounding boxes with predicted ones. Finally, the authors propose a novel Bounding-Box Deep Calibration (BDC) method to reasonably replace misaligned annotations with model-predicted bounding boxes and offer calibrated annotations for the training set. Extensive experiments on multiple detectors and two popular benchmark datasets show the effectiveness of BDC in improving models' precision and recall rate, without adding extra inference time or memory consumption. Our simple and effective method provides a general strategy for improving face detection, especially for light-weight detectors in real-time situations.
MTRL-SCI | Mar 22, 2021
Comprehensive process-molten pool relations modeling using CNN for wire-feed laser additive manufacturing
Noopur Jamnikar, Sen Liu, Craig Brice et al.
Wire-feed laser additive manufacturing (WLAM) is gaining wide interest due to its high level of automation, high deposition rates, and good quality of printed parts. In-process monitoring and feedback controls that would reduce the uncertainty in the quality of the material are in the early stages of development. Machine learning promises the ability to accelerate the adoption of new processes and property design in additive manufacturing by making process-structure-property connections between process setting inputs and material quality outcomes. The molten pool dimensional information and temperature are the indicators for achieving the high quality of the build, which can be directly controlled by processing parameters. For the purpose of in situ quality control, the process parameters should be controlled in real-time based on sensed information from the process, in particular the molten pool. Thus, the molten pool-process relations are of preliminary importance. This paper analyzes experimentally collected in situ sensing data from the molten pool under a set of controlled process parameters in a WLAM system. The variations in the steady-state and transient state of the molten pool are presented with respect to the change of independent process parameters. A multi-modality convolutional neural network (CNN) architecture is proposed for predicting the control parameter directly from the measurable molten pool sensor data for achieving desired geometric and microstructural properties. Dropout and regularization are applied to the CNN architecture to avoid the problem of overfitting. The results highlighted that the multi-modal CNN, which receives temperature profile as an external feature to the features extracted from the image data, has improved prediction performance compared to the image-based uni-modality CNN approach.
MTRL-SCI | Mar 21, 2021
Machine learning based in situ quality estimation by molten pool condition-quality relations modeling using experimental data
Noopur Jamnikar, Sen Liu, Craig Brice et al.
The advancement of machine learning promises the ability to accelerate the adoption of new processes and property designs for metal additive manufacturing. The molten pool geometry and molten pool temperature are significant indicators of the final part's geometric shape and microstructural properties in the wire-feed laser direct energy deposition process. Thus, the molten pool condition-property relations are of preliminary importance for in situ quality assurance. To enable in situ quality monitoring of bead geometry and characterization properties, we need to continuously monitor the sensor data for molten pool dimensions and temperature in the wire-feed laser additive manufacturing (WLAM) system. We first develop a convolutional neural network (CNN) model for establishing the correlations from the measurable molten pool image and temperature data directly to the geometric shape and microstructural properties. The multi-modality network receives both the camera image and temperature measurement as inputs, yielding the corresponding characterization properties of the final build part (e.g., fusion zone depth, alpha lath thickness). The performance of the CNN model is compared with a regression model as a baseline. The developed models enable molten pool condition-quality relations mapping for building a quantitative and collaborative in situ quality estimation and assurance framework.
CV | Mar 10, 2021
Wide Aspect Ratio Matching for Robust Face Detection
Shi Luo, Xiongfei Li, Xiaoli Zhang
Recently, anchor-based methods have achieved great progress in face detection. Once the anchor design and anchor matching strategy are determined, plenty of positive anchors will be sampled. However, faces with extreme aspect ratios consistently fail to be sampled under the standard anchor matching strategy, because the max IoUs between anchors and extreme aspect ratio faces remain lower than the fixed sampling threshold. In this paper, we first explore in theory the factors that affect the max IoU of each face. Then, an anchor matching simulation is performed to evaluate the sampling range of face aspect ratios. In addition, we propose a Wide Aspect Ratio Matching (WARM) strategy to collect more representative positive anchors from ground-truth faces across a wide range of aspect ratios. Finally, we present a novel feature enhancement module, named the Receptive Field Diversity (RFD) module, to provide diverse receptive fields corresponding to different aspect ratios. Extensive experiments show that our method can help detectors better capture extreme aspect ratio faces and achieve promising detection performance on challenging face detection benchmarks, including the WIDER FACE and FDDB datasets.
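The claim that extreme aspect ratio faces cannot reach the sampling threshold can be checked with a small calculation. The sketch below is an illustration, not the paper's simulation: it assumes a single square anchor centered on the face with the same area as the face, which is the best case for a centered square anchor. In that setting the IoU works out to 1 / (2·sqrt(r) − 1) for aspect ratio r ≥ 1. The threshold of 0.35 mentioned afterwards is an assumed, typical value.

```python
import math


def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def centered_box(width, height):
    """Box of the given size centered at the origin."""
    return (-width / 2.0, -height / 2.0, width / 2.0, height / 2.0)


def best_case_iou(aspect_ratio, anchor_side=16.0):
    """IoU between a square anchor and a same-area, same-center face box."""
    face_w = anchor_side * math.sqrt(aspect_ratio)
    face_h = anchor_side / math.sqrt(aspect_ratio)
    return iou(centered_box(anchor_side, anchor_side), centered_box(face_w, face_h))
```

For example, `best_case_iou(4.0)` evaluates to 1/3, so a face four times wider than it is tall would miss an IoU threshold of 0.35 even with a perfectly placed square anchor.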
CV | Feb 10, 2021
Detecting Localized Adversarial Examples: A Generic Approach using Critical Region Analysis
Fengting Li, Xuankai Liu, Xiaoli Zhang et al.
Deep neural networks (DNNs) have been applied in a wide range of applications, e.g., face recognition and image classification; however, they are vulnerable to adversarial examples. By adding a small amount of imperceptible perturbations, an attacker can easily manipulate the outputs of a DNN. Particularly, the localized adversarial examples only perturb a small and contiguous region of the target object, so that they are robust and effective in both digital and physical worlds. Although the localized adversarial examples have more severe real-world impacts than traditional pixel attacks, they have not been well addressed in the literature. In this paper, we propose a generic defense system called TaintRadar to accurately detect localized adversarial examples via analyzing critical regions that have been manipulated by attackers. The main idea is that when removing critical regions from input images, the ranking changes of adversarial labels will be larger than those of benign labels. Compared with existing defense solutions, TaintRadar can effectively capture sophisticated localized partial attacks, e.g., the eye-glasses attack, while not requiring additional training or fine-tuning of the original model's structure. Comprehensive experiments have been conducted in both digital and physical worlds to verify the effectiveness and robustness of our defense.
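The ranking-change idea can be illustrated with a minimal sketch. It assumes the class scores before and after removing the critical region are already available; how TaintRadar locates critical regions and what decision threshold it uses are not described in the abstract, so the function names and the comparison below are hypothetical.

```python
def label_rank(scores, label):
    """0-based rank of `label` when classes are sorted by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(label)


def rank_change(scores_before, scores_after):
    """How far the originally predicted label drops after the region is removed."""
    predicted = max(range(len(scores_before)), key=lambda i: scores_before[i])
    return label_rank(scores_after, predicted) - label_rank(scores_before, predicted)
```

With a benign input the predicted label would typically stay near the top after masking (a change near 0), whereas a label induced by a localized patch would be expected to fall sharply once the patch is removed.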
LG | Jan 13, 2021
A Physics-Informed Machine Learning Model for Porosity Analysis in Laser Powder Bed Fusion Additive Manufacturing
Rui Liu, Sen Liu, Xiaoli Zhang
To control part quality, it is critical to analyze pore generation mechanisms, laying a theoretical foundation for future porosity control. Current porosity analysis models use machine setting parameters, such as laser angle and part pose. However, these setting-based models are machine dependent, so they often do not transfer to porosity analysis on a different machine. To address this problem, a physics-informed, data-driven model (PIM) is proposed that, instead of directly using machine setting parameters to predict the porosity levels of printed parts, first interprets machine settings as physical effects, such as laser energy density and laser radiation pressure. These physical, machine-independent effects are then used to predict porosity levels in pass, flag, and fail categories rather than to predict pore size quantitatively. Evaluated with six learning methods, PIM achieved good performance, with prediction errors of 10% to 26%. Finally, pore-encouraging and pore-suppressing influences were analyzed for quality analysis.
RO Dec 19, 2020
Forming Real-World Human-Robot Cooperation for Tasks With General Goal
Lingfeng Tao, Michael Bowman, Jiucai Zhang et al.
In human-robot cooperation, the robot and the human accomplish a task together. Existing approaches assume the human has a specific goal during the cooperation, which the robot infers and acts toward. However, in real-world environments, a human usually has only a general goal (e.g., a general direction or area in motion planning) at the beginning of the cooperation, which needs to be clarified into a specific goal (i.e., an exact position) during cooperation. The specification process is interactive and dynamic, and depends on the environment and the partner's behavior. A robot that does not consider the goal specification process may frustrate the human partner, lengthen the time needed to reach an agreement, and compromise team performance. This work presents the Evolutionary Value Learning approach, which models the dynamics of the goal specification process with State-based Multivariate Bayesian Inference and goal specificity-related features. This model enables the robot to actively enhance the human's goal specification process and find a cooperative policy using Deep Reinforcement Learning. Our method outperforms existing methods with faster goal specification and better team performance in a dynamic ball balancing task with real human subjects.
RO May 19, 2020
Robust Robot-assisted Tele-grasping Through Intent-Uncertainty-Aware Planning
Michael Bowman, Songpo Li, Xiaoli Zhang
In teleoperation, research has mainly focused on target approaching, whereas we address the more challenging object manipulation task by advancing the shared control technique. Appropriately manipulating an object is challenging because of the fine motion constraints required for a specific manipulation task. Although these motion constraints are critical for task success, they are often subtle when inferred from ambiguous human motion. The disembodiment problem and the physical discrepancy between human and robot hands bring additional uncertainty, further complicating the object manipulation task. Moreover, there is a lack of planning and modeling techniques that can effectively combine the motion inputs of the human and robot agents while considering the ambiguity of the human intent. To overcome this challenge, we built a multi-task robot grasping model and developed an intent-uncertainty-aware grasp planner that generates robust grasp poses from ambiguous human intent inference inputs. With these validated modeling and planning techniques, we expect to extend teleoperated robots' functionality and adoption in practical telemanipulation scenarios.
RO Mar 11, 2020
A General Arbitration Model for Robust Human-Robot Shared Control with Multi-Source Uncertainty Modeling
Songpo Li, Michael Bowman, Xiaoli Zhang
Shared control in teleoperation leverages the strengths of both human and robot and has demonstrated great advantages in reducing the difficulty of teleoperating a robot and increasing task performance. One fundamental question in shared control is how to effectively allocate control power between the human and the robot. Researchers have been subjectively defining arbitration policies following conflicting principles, which has resulted in great inconsistency among the policies. We attribute this inconsistency to a failure to consider the multi-source uncertainty in the human-robot system. To fill this gap, we developed a multi-source uncertainty model applicable to various types of uncertainty in the real world, and then a general arbitration model that comprehensively fuses the uncertainty and regulates the arbitration weight assigned to the robotic agent. Besides traditional macro performance metrics, we introduced objective and quantitative metrics of robotic helpfulness and friendliness that evaluate the assistive robot's cooperation at micro and macro levels. Results from simulations and experiments showed that the new arbitration model was more effective and friendly than existing policies and was robust in coping with multi-source uncertainty. With this new arbitration model, we expect increased adoption of human-robot shared control in practical and complex teleoperation tasks.
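To make the notion of an arbitration weight concrete, here is a minimal sketch of linear command blending, a common form of shared control. It is not the paper's arbitration model: the product rule used to fuse uncertainties and both function names are illustrative assumptions.

```python
def arbitration_weight(uncertainties):
    """Hypothetical fusion rule: start from full robot authority and shrink it
    by each uncertainty source, with every value clamped to [0, 1]."""
    weight = 1.0
    for u in uncertainties:
        weight *= 1.0 - min(max(u, 0.0), 1.0)
    return weight

def blend_commands(human_cmd, robot_cmd, robot_weight):
    """Linear arbitration: (1 - w) * human + w * robot, element by element."""
    return [(1.0 - robot_weight) * h + robot_weight * r
            for h, r in zip(human_cmd, robot_cmd)]

# Two uncertainty sources (for example intent inference and sensing), each 0.5.
w = arbitration_weight([0.5, 0.5])
print(w, blend_commands([1.0, 0.0], [0.0, 1.0], w))
```

Under this assumed rule, higher uncertainty from any source hands more control back to the human operator, which is one simple way to regulate the weight assigned to the robotic agent.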
RO Mar 7, 2020
An Intent-based Task-aware Shared Control Framework for Intuitive Hands Free Telemanipulation
Michael Bowman, Jiucai Zhang, Xiaoli Zhang
Shared control in teleoperation that provides robot assistance for object manipulation, called telemanipulation, is a new, promising, yet challenging problem. On top of the general challenges of teleoperation, it poses unique challenges arising from the physical discrepancy between human hands and robot hands and from the fine motion constraints that constitute task success. We present an intuitive shared-control strategy that focuses on generating robotic grasp poses better suited to the human's perception of successful teleoperated object manipulation and feeling of being in control of the robot, rather than on developing objectively stable grasp configurations for task success or following the human motion. The former is achieved by understanding human intent and autonomously taking over control based on that inference. The latter is achieved by treating human inputs as hard motion constraints that the robot must abide by. Arbitrating between these two enables a trade-off in the subsequent robot motion that balances accomplishing the inferred task against the motion constraints imposed by the operator. The arbitration framework adapts to the level of physical discrepancy between the human and different robot structures, enabling the assistance to indicate and appear to intuitively follow the user. To understand how users perceive good arbitration in object telemanipulation, we conducted a user study with a hands-free telemanipulation setup to analyze the effect of factors including task predictability, perceived following, and user preference. The hands-free telemanipulation scene was chosen as the validation platform because of its more urgent need for intuitive robotic assistance for task success.
RO Mar 7, 2020
Learn and Transfer Knowledge of Preferred Assistance Strategies in Semi-autonomous Telemanipulation
Lingfeng Tao, Michael Bowman, Xu Zhou et al.
Enabling robots to provide effective assistance while still accommodating the operator's commands during telemanipulation of an object is very challenging, because a robot's assistive actions are not always intuitive for human operators, and human behaviors and preferences are sometimes too ambiguous for the robot to interpret. Although various assistance approaches are being developed to improve control quality from different optimization perspectives, the problem remains of determining the appropriate approach that satisfies both the fine motion constraints of the telemanipulation task and the preferences of the operator. To address these problems, we developed a novel preference-aware assistance knowledge learning approach. An assistance preference model learns what assistance is preferred by a human, and a stagewise model updating method ensures learning stability while dealing with the ambiguity of human preference data. Such preference-aware assistance knowledge enables a teleoperated robot hand to provide more active yet preferred assistance toward manipulation success. We also developed knowledge transfer methods that transfer the preference knowledge across different robot hand structures to avoid extensive robot-specific training. We conducted experiments in which a 3-finger hand and a 2-finger hand were each telemanipulated to use, move, and hand over a cup. Results demonstrated that the methods enabled the robots to effectively learn the preference knowledge and allowed knowledge transfer between robots with less training effort.
MTRL-SCI Mar 4, 2020
Physics-informed machine learning for composition-process-property alloy design: shape memory alloy demonstration
Sen Liu, Branden B. Kappes, Behnam Amin-ahmadi et al.
Machine learning (ML) is shown to predict new alloys and their performances in a high-dimensional, multiple-target-property design space that considers chemistry, multi-step processing routes, and characterization methodology variations. A physics-informed feature engineering approach is shown to enable otherwise poorly performing ML models to perform well with the same data. Specifically, previously engineered elemental features based on alloy chemistries are combined with newly engineered heat treatment process features. The new features result from first transforming the heat treatment parameter data, as previously recorded, using nonlinear mathematical relationships known to describe the thermodynamics and kinetics of phase transformations in alloys. The ability of the ML model to be used for predictive design is validated using blind predictions. Composition-process-property relationships for the thermal hysteresis of shape memory alloys (SMAs) with complex microstructures created via multiple melting-homogenization-solutionization-precipitation processing stage variations are captured, in addition to the mean transformation temperatures of the SMAs. The quantitative models of hysteresis exhibited by such highly processed alloys demonstrate the ability of ML models to design for physical complexities that have challenged physics-based modeling approaches for decades.
RO Mar 1, 2020
Learn Task First or Learn Human Partner First: A Hierarchical Task Decomposition Method for Human-Robot Cooperation
Lingfeng Tao, Michael Bowman, Jiucai Zhang et al.
Applying Deep Reinforcement Learning (DRL) to Human-Robot Cooperation (HRC) in dynamic control problems is promising yet challenging as the robot needs to learn the dynamics of the controlled system and dynamics of the human partner. In existing research, the robot powered by DRL adopts coupled observation of the environment and the human partner to learn both dynamics simultaneously. However, such a learning strategy is limited in terms of learning efficiency and team performance. This work proposes a novel task decomposition method with a hierarchical reward mechanism that enables the robot to learn the hierarchical dynamic control task separately from learning the human partner's behavior. The method is validated with a hierarchical control task in a simulated environment with human subject experiments. Our method also provides insight into the design of the learning strategy for HRC. The results show that the robot should learn the task first to achieve higher team performance and learn the human first to achieve higher learning efficiency.
CV Dec 20, 2018
SFA: Small Faces Attention Face Detector
Shi Luo, Xiongfei Li, Rui Zhu et al.
In recent years, tremendous strides have been made in face detection thanks to deep learning. However, the performance of most published face detectors deteriorates dramatically as faces become smaller. In this paper, we present the Small Faces Attention (SFA) face detector to better detect small-scale faces. First, we propose a new scale-invariant face detection architecture that pays more attention to small faces, including a 4-branch detection architecture and a small-face-sensitive anchor design. Second, a feature map fusion strategy is applied in SFA, partially combining high-level features into low-level features to further improve the ability to find hard faces. Third, we use a multi-scale training and testing strategy to enhance face detection performance in practice. Comprehensive experiments show that SFA significantly improves face detection performance, especially on small faces. Our real-time SFA face detector can run at 5 FPS on a single GPU while maintaining high performance. In addition, our final SFA face detector achieves state-of-the-art detection performance on challenging face detection benchmarks, including the WIDER FACE and FDDB datasets, with competitive runtime speed. Both our code and models will be available to the research community.
RO Jan 30, 2017
A Review of Methodologies for Natural-Language-Facilitated Human-Robot Cooperation
Rui Liu, Xiaoli Zhang
Natural-language-facilitated human-robot cooperation (NLC) refers to using natural language (NL) to facilitate interactive information sharing and task execution under a common goal constraint between robots and humans. Recently, NLC research has received increasing attention. Typical NLC scenarios include robotic daily assistance, robotic health caregiving, intelligent manufacturing, autonomous navigation, and robot social companionship. However, a thorough review that reveals the latest methodologies for using NL to facilitate human-robot cooperation is missing. In this review, a comprehensive summary of methodologies for NLC is presented. NLC research has three main focuses: NL instruction understanding, NL-based execution plan generation, and knowledge-world mapping. In-depth analyses are made of theoretical methods, applications, and model advantages and disadvantages. Based on our paper review and perspective, potential research directions for NLC are summarized.
RO Jan 28, 2017
Systems of natural-language-facilitated human-robot cooperation: A review
Rui Liu, Xiaoli Zhang
Natural-language-facilitated human-robot cooperation (NLC), in which natural language (NL) is used to share knowledge between a human and a robot for intuitive human-robot cooperation (HRC), has developed continuously over the past decade. Currently, NLC is used in several robotic domains such as manufacturing, daily assistance, and health caregiving. It is necessary to summarize current NLC-based robotic systems and discuss future development trends, providing helpful information for future NLC research. In this review, we first analyze the driving forces behind NLC research. According to the robot's cognition level during the cooperation, NLC implementations are then categorized into four types {NL-based control, NL-based robot training, NL-based task execution, NL-based social companion} for comparison and discussion. Finally, based on our perspective and a comprehensive paper review, future research trends are discussed.
RO Dec 13, 2016
Stabilization and Trajectory Control of a Quadrotor with Uncertain Suspended Load
Xu Zhou, Xiaoli Zhang, Jiucai Zhang et al.
Stabilization and trajectory control of a quadrotor carrying a suspended load with a fixed, known mass has been extensively studied in recent years. However, the load mass is not always known beforehand and may vary during practical transportation. This mass uncertainty introduces uncertain disturbances into the quadrotor system, degrading the stability and trajectory tracking performance of existing controllers. To improve quadrotor stability and trajectory tracking capability in this situation, we fully investigate the impacts of the uncertain load mass on the quadrotor. By comparing the performance of three different controllers -- the proportional-derivative (PD) controller, the sliding mode controller (SMC), and the model predictive controller (MPC) -- we show that load mass uncertainty mainly affects stabilization rather than trajectory tracking error. A critical motion mass exists for the quadrotor to maintain a desired transportation performance. Moreover, simulation results verify that a controller with strong robustness against disturbances is a good choice for practical applications.
AI Nov 20, 2016
Generating machine-executable plans from end-user's natural-language instructions
Rui Liu, Xiaoli Zhang
It is critical for advanced manufacturing machines to autonomously execute a task by following an end-user's natural language (NL) instructions. However, NL instructions are usually ambiguous and abstract, so the machines may misunderstand them and execute the task incorrectly. To address this NL-based human-machine communication problem and enable machines to appropriately execute tasks by following the end-user's NL instructions, we developed a Machine-Executable-Plan-Generation (exePlan) method. The exePlan method conducts task-centered semantic analysis to extract task-related information from ambiguous NL instructions. In addition, the method specifies machine execution parameters to generate a machine-executable plan by interpreting abstract NL instructions. To evaluate the exePlan method, an industrial robot, Baxter, was instructed by NL to perform three types of industrial tasks {'drill a hole', 'clean a spot', 'install a screw'}. The experimental results showed that the exePlan method was effective in generating machine-executable plans from the end-user's NL instructions. Such a method has the promise to endow a machine with the ability to execute NL-instructed tasks.