CVSep 20, 2023Code
Locate and Verify: A Two-Stream Network for Improved Deepfake DetectionChao Shuai, Jieming Zhong, Shuang Wu et al.
Deepfake has taken the world by storm, triggering a trust crisis. Current deepfake detection methods are typically inadequate in generalizability, with a tendency to overfit to image contents such as the background, which are frequently occurring but relatively unimportant in the training dataset. Furthermore, current methods heavily rely on a few dominant forgery regions and may ignore other equally important regions, leading to inadequate uncovering of forgery cues. In this paper, we strive to address these shortcomings from three aspects: (1) We propose an innovative two-stream network that effectively enlarges the potential regions from which the model extracts forgery evidence. (2) We devise three functional modules to handle the multi-stream and multi-scale features in a collaborative learning scheme. (3) Confronted with the challenge of obtaining forgery annotations, we propose a Semi-supervised Patch Similarity Learning strategy to estimate patch-level forged location annotations. Empirically, our method demonstrates significantly improved robustness and generalizability, outperforming previous methods on six benchmarks, and improving the frame-level AUC on Deepfake Detection Challenge preview dataset from 0.797 to 0.835 and video-level AUC on CelebDF$\_$v1 dataset from 0.811 to 0.847. Our implementation is available at https://github.com/sccsok/Locate-and-Verify.
CVSep 18, 2023Code
DFIL: Deepfake Incremental Learning by Exploiting Domain-invariant Forgery CluesKun Pan, Yin Yifang, Yao Wei et al.
The malicious use and widespread dissemination of deepfake pose a significant crisis of trust. Current deepfake detection models can generally recognize forgery images by training on a large dataset. However, the accuracy of detection models degrades significantly on images generated by new deepfake methods due to the difference in data distribution. To tackle this issue, we present a novel incremental learning framework that improves the generalization of deepfake detection models by continual learning from a small number of new samples. To cope with different data distributions, we propose to learn a domain-invariant representation based on supervised contrastive learning, preventing overfit to the insufficient new data. To mitigate catastrophic forgetting, we regularize our model in both feature-level and label-level based on a multi-perspective knowledge distillation approach. Finally, we propose to select both central and hard representative samples to update the replay set, which is beneficial for both domain-invariant representation learning and rehearsal-based knowledge preserving. We conduct extensive experiments on four benchmark datasets, obtaining the new state-of-the-art average forgetting rate of 7.01 and average accuracy of 85.49 on FF++, DFDC-P, DFD, and CDF2. Our code is released at https://github.com/DeepFakeIL/DFIL.
CVDec 19, 2022Code
Universal Object Detection with Large Vision ModelFeng Lin, Wenze Hu, Yaowei Wang et al.
Over the past few years, there has been growing interest in developing a broad, universal, and general-purpose computer vision system. Such systems have the potential to address a wide range of vision tasks simultaneously, without being limited to specific problems or data domains. This universality is crucial for practical, real-world computer vision applications. In this study, our focus is on a specific challenge: the large-scale, multi-domain universal object detection problem, which contributes to the broader goal of achieving a universal vision system. This problem presents several intricate challenges, including cross-dataset category label duplication, label conflicts, and the necessity to handle hierarchical taxonomies. To address these challenges, we introduce our approach to label handling, hierarchy-aware loss design, and resource-efficient model training utilizing a pre-trained large vision model. Our method has demonstrated remarkable performance, securing a prestigious second-place ranking in the object detection track of the Robust Vision Challenge 2022 (RVC 2022) on a million-scale cross-dataset object detection benchmark. We believe that our comprehensive study will serve as a valuable reference and offer an alternative approach for addressing similar challenges within the computer vision community. The source code for our work is openly available at https://github.com/linfeng93/Large-UniDet.
AINov 21, 2022
Intelligent Computing: The Latest Advances, Challenges and FutureShiqiang Zhu, Ting Yu, Tao Xu et al.
Computing is a critical driving force in the development of human civilization. In recent years, we have witnessed the emergence of intelligent computing, a new computing paradigm that is reshaping traditional computing and promoting digital revolution in the era of big data, artificial intelligence and internet-of-things with new computing theories, architectures, methods, systems, and applications. Intelligent computing has greatly broadened the scope of computing, extending it from traditional computing on data to increasingly diverse computing paradigms such as perceptual intelligence, cognitive intelligence, autonomous intelligence, and human-computer fusion intelligence. Intelligence and computing have undergone paths of different evolution and development for a long time but have become increasingly intertwined in recent years: intelligent computing is not only intelligence-oriented but also intelligence-driven. Such cross-fertilization has prompted the emergence and rapid advancement of intelligent computing. Intelligent computing is still in its infancy and an abundance of innovations in the theories, systems, and applications of intelligent computing are expected to occur soon. We present the first comprehensive survey of literature on intelligent computing, covering its theory fundamentals, the technological fusion of intelligence and computing, important applications, challenges, and future perspectives. We believe that this survey is highly timely and will provide a comprehensive reference and cast valuable insights into intelligent computing for academic and industrial researchers and practitioners.
SYOct 30, 2012
Distributed Control of Angle-constrained Circular Formations using Bearing-only MeasurementsShiyu Zhao, Feng Lin, Kemao Peng et al.
This paper studies distributed formation control of multiple agents in the plane using bearing-only measurements. It is assumed that each agent only measures the local bearings of their neighbor agents. The target formation considered in this paper is a circular formation, where each agent has exactly two neighbors. In the target formation, the angle subtended at each agent by their two neighbors is specified. We propose a distributed control law that stabilizes angle-constrained target formations merely using local bearing measurements. The stability of the target formation is analyzed based on Lyapunov approaches. We present a unified proof to show that our control law not only can ensure local exponential stability but also can give local finite-time stability. The exponential or finite-time stability can be easily switched by tuning a parameter in the control law.
CROct 20, 2023Code
FLTracer: Accurate Poisoning Attack Provenance in Federated LearningXinyu Zhang, Qingyu Liu, Zhongjie Ba et al.
Federated Learning (FL) is a promising distributed learning approach that enables multiple clients to collaboratively train a shared global model. However, recent studies show that FL is vulnerable to various poisoning attacks, which can degrade the performance of global models or introduce backdoors into them. In this paper, we first conduct a comprehensive study on prior FL attacks and detection methods. The results show that all existing detection methods are only effective against limited and specific attacks. Most detection methods suffer from high false positives, which lead to significant performance degradation, especially in not independent and identically distributed (non-IID) settings. To address these issues, we propose FLTracer, the first FL attack provenance framework to accurately detect various attacks and trace the attack time, objective, type, and poisoned location of updates. Different from existing methodologies that rely solely on cross-client anomaly detection, we propose a Kalman filter-based cross-round detection to identify adversaries by seeking the behavior changes before and after the attack. Thus, this makes it resilient to data heterogeneity and is effective even in non-IID settings. To further improve the accuracy of our detection method, we employ four novel features and capture their anomalies with the joint decisions. Extensive evaluations show that FLTracer achieves an average true positive rate of over $96.88\%$ at an average false positive rate of less than $2.67\%$, significantly outperforming SOTA detection methods. \footnote{Code is available at \url{https://github.com/Eyr3/FLTracer}.}
CVOct 16, 2023Code
SoybeanNet: Transformer-Based Convolutional Neural Network for Soybean Pod Counting from Unmanned Aerial Vehicle (UAV) ImagesJiajia Li, Raju Thada Magar, Dong Chen et al.
Soybeans are a critical source of food, protein and oil, and thus have received extensive research aimed at enhancing their yield, refining cultivation practices, and advancing soybean breeding techniques. Within this context, soybean pod counting plays an essential role in understanding and optimizing production. Despite recent advancements, the development of a robust pod-counting algorithm capable of performing effectively in real-field conditions remains a significant challenge This paper presents a pioneering work of accurate soybean pod counting utilizing unmanned aerial vehicle (UAV) images captured from actual soybean fields in Michigan, USA. Specifically, this paper presents SoybeanNet, a novel point-based counting network that harnesses powerful transformer backbones for simultaneous soybean pod counting and localization with high accuracy. In addition, a new dataset of UAV-acquired images for soybean pod counting was created and open-sourced, consisting of 113 drone images with more than 260k manually annotated soybean pods captured under natural lighting conditions. Through comprehensive evaluations, SoybeanNet demonstrated superior performance over five state-of-the-art approaches when tested on the collected images. Remarkably, SoybeanNet achieved a counting accuracy of $84.51\%$ when tested on the testing dataset, attesting to its efficacy in real-world scenarios. The publication also provides both the source code (\url{https://github.com/JiajiaLi04/Soybean-Pod-Counting-from-UAV-Images}) and the labeled soybean dataset (\url{https://www.kaggle.com/datasets/jiajiali/uav-based-soybean-pod-images}), offering a valuable resource for future research endeavors in soybean pod counting and related fields.
55.6SYMay 16
Coordination Control of Discrete Event Systems under Cyber AttacksFei Wang, Jan Komenda, Feng Lin
In this paper, coordination control of discrete event systems under joint sensor and actuator attacks is investigated. Sensor attacks are described by a set of attack languages using a proposed ALTER model. Several local supervisors are used to control the system. The goal is to design local supervisors to ensure safety of the system even under cyber attacks (CA). The necessary and sufficient conditions for the existence of such supervisors are derived in terms of conditional decomposability, CA-controllability and CA-observability. A method is developed to calculate local state estimates under sensor attacks. Two methods are also developed to design local supervisors, one for discrete event systems satisfying conditional decomposability, CA-controllability and CA-observability, and one for discrete event systems satisfying conditional decomposability only. The approach works for both stealthy and non-stealthy attacks. A practical example is given to illustrate the results.
SYMar 11, 2013
Finite-time Stabilization of Circular Formations using Bearing-only MeasurementsShiyu Zhao, Feng Lin, Kemao Peng et al.
This paper studies decentralized formation control of multiple vehicles when each vehicle can only measure the local bearings of their neighbors by using bearing-only sensors. Since the inter-vehicle distance cannot be measured, the target formation involves no distance constraints. More specifically, the target formation considered in this paper is an angle-constrained circular formation, where each vehicle has exactly two neighbors and the angle at each vehicle subtended by its two neighbors is pre-specified. To stabilize the target formation, we propose a discontinuous control law that only requires the sign information of the angle errors. Due to the discontinuity of the proposed control law, the stability of the closed-loop system is analyzed by employing a locally Lipschitz Lyapunov function and nonsmooth analysis tools. We prove that the target formation is locally finite-time stable with collision avoidance guaranteed. The evolution of the vehicle positions in the plane is also characterized.
CRNov 3, 2023
ERASER: Machine Unlearning in MLaaS via an Inference Serving-Aware ApproachYuke Hu, Jian Lou, Jiaqi Liu et al.
Over the past years, Machine Learning-as-a-Service (MLaaS) has received a surging demand for supporting Machine Learning-driven services to offer revolutionized user experience across diverse application areas. MLaaS provides inference service with low inference latency based on an ML model trained using a dataset collected from numerous individual data owners. Recently, for the sake of data owners' privacy and to comply with the "right to be forgotten (RTBF)" as enacted by data protection legislation, many machine unlearning methods have been proposed to remove data owners' data from trained models upon their unlearning requests. However, despite their promising efficiency, almost all existing machine unlearning methods handle unlearning requests independently from inference requests, which unfortunately introduces a new security issue of inference service obsolescence and a privacy vulnerability of undesirable exposure for machine unlearning in MLaaS. In this paper, we propose the ERASER framework for machinE unleaRning in MLaAS via an inferencE seRving-aware approach. ERASER strategically choose appropriate unlearning execution timing to address the inference service obsolescence issue. A novel inference consistency certification mechanism is proposed to avoid the violation of RTBF principle caused by postponed unlearning executions, thereby mitigating the undesirable exposure vulnerability. ERASER offers three groups of design choices to allow for tailor-made variants that best suit the specific environments and preferences of various MLaaS systems. Extensive empirical evaluations across various settings confirm ERASER's effectiveness, e.g., it can effectively save up to 99% of inference latency and 31% of computation overhead over the inference-oblivion baseline.
CRAug 3, 2024
ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms using Linguistic FeaturesPeng Cheng, Yuwei Wang, Peng Huang et al.
Extensive research has revealed that adversarial examples (AE) pose a significant threat to voice-controllable smart devices. Recent studies have proposed black-box adversarial attacks that require only the final transcription from an automatic speech recognition (ASR) system. However, these attacks typically involve many queries to the ASR, resulting in substantial costs. Moreover, AE-based adversarial audio samples are susceptible to ASR updates. In this paper, we identify the root cause of these limitations, namely the inability to construct AE attack samples directly around the decision boundary of deep learning (DL) models. Building on this observation, we propose ALIF, the first black-box adversarial linguistic feature-based attack pipeline. We leverage the reciprocal process of text-to-speech (TTS) and ASR models to generate perturbations in the linguistic embedding space where the decision boundary resides. Based on the ALIF pipeline, we present the ALIF-OTL and ALIF-OTA schemes for launching attacks in both the digital domain and the physical playback environment on four commercial ASRs and voice assistants. Extensive evaluations demonstrate that ALIF-OTL and -OTA significantly improve query efficiency by 97.7% and 73.3%, respectively, while achieving competitive performance compared to existing methods. Notably, ALIF-OTL can generate an attack sample with only one query. Furthermore, our test-of-time experiment validates the robustness of our approach against ASR updates.
71.9SYMay 25
Deterministic and Nonblocking Supervisory Control of Discrete Event Systems under Cyber AttacksFeng Lin, Caisheng Wang, Jun Chen et al.
We investigate deterministic and nonblocking supervisory control of discrete event systems under cyber-attacks using the ALTER (Attack Language for Transition-basEd Replacement) model. While prior works consider supervisory control that achieves either the large (upper bound) language or small (lower bound) language separately, deterministic supervisory control achieves both large language and small language at the same time to ensure that the language generated by the supervised system is unique and deterministic. We introduce two new concepts of CA-D-controllability and CA-D-observability and prove that they are necessary and sufficient for the existence of a deterministic supervisor. For nonblocking supervisory control, the objective is to ensure that the supervised system can always reach marked states under any attack scenario. We prove that relative closure, CA-D-controllability, and CA-D-observability together are necessary and sufficient for the existence of a nonblocking supervisor. We further develop methods to verify CA-D-controllability and CA-D-observability. We also illustrate our results using a robotic system example.
SDNov 10, 2022
Privacy-Utility Balanced Voice De-Identification Using Adversarial ExamplesMeng Chen, Li Lu, Jiadi Yu et al.
Faced with the threat of identity leakage during voice data publishing, users are engaged in a privacy-utility dilemma when enjoying convenient voice services. Existing studies employ direct modification or text-based re-synthesis to de-identify users' voices, but resulting in inconsistent audibility in the presence of human participants. In this paper, we propose a voice de-identification system, which uses adversarial examples to balance the privacy and utility of voice services. Instead of typical additive examples inducing perceivable distortions, we design a novel convolutional adversarial example that modulates perturbations into real-world room impulse responses. Benefit from this, our system could preserve user identity from exposure by Automatic Speaker Identification (ASI) while remaining the voice perceptual quality for non-intrusive de-identification. Moreover, our system learns a compact speaker distribution through a conditional variational auto-encoder to sample diverse target embeddings on demand. Combining diverse target generation and input-specific perturbation construction, our system enables any-to-any identify transformation for adaptive de-identification. Experimental results show that our system could achieve 98% and 79% successful de-identification on mainstream ASIs and commercial systems with an objective Mel cepstral distortion of 4.31dB and a subjective mean opinion score of 4.48.
CVMar 4, 2024Code
Exposing the Deception: Uncovering More Forgery Clues for Deepfake DetectionZhongjie Ba, Qingyu Liu, Zhenguang Liu et al.
Deepfake technology has given rise to a spectrum of novel and compelling applications. Unfortunately, the widespread proliferation of high-fidelity fake videos has led to pervasive confusion and deception, shattering our faith that seeing is believing. One aspect that has been overlooked so far is that current deepfake detection approaches may easily fall into the trap of overfitting, focusing only on forgery clues within one or a few local regions. Moreover, existing works heavily rely on neural networks to extract forgery features, lacking theoretical constraints guaranteeing that sufficient forgery clues are extracted and superfluous features are eliminated. These deficiencies culminate in unsatisfactory accuracy and limited generalizability in real-life scenarios. In this paper, we try to tackle these challenges through three designs: (1) We present a novel framework to capture broader forgery clues by extracting multiple non-overlapping local representations and fusing them into a global semantic-rich feature. (2) Based on the information bottleneck theory, we derive Local Information Loss to guarantee the orthogonality of local representations while preserving comprehensive task-relevant information. (3) Further, to fuse the local representations and remove task-irrelevant information, we arrive at a Global Information Loss through the theoretical analysis of mutual information. Empirically, our method achieves state-of-the-art performance on five benchmark datasets.Our code is available at \url{https://github.com/QingyuLiu/Exposing-the-Deception}, hoping to inspire researchers.
79.6CVApr 10
From UAV Imagery to Agronomic Reasoning: A Multimodal LLM Benchmark for Plant PhenotypingYu Wu, Guangzeng Han, Ibra Niang Niang et al.
To improve crop genetics, high-throughput, effective and comprehensive phenotyping is a critical prerequisite. While such tasks were traditionally performed manually, recent advances in multimodal foundation models, especially in vision-language models (VLMs), have enabled more automated and robust phenotypic analysis. However, plant science remains a particularly challenging domain for foundation models because it requires domain-specific knowledge, fine-grained visual interpretation, and complex biological and agronomic reasoning. To address this gap, we develop PlantXpert, an evidence-grounded multimodal reasoning benchmark for soybean and cotton phenotyping. Our benchmark provides a structured and reproducible framework for agronomic adaptation of VLMs, and enables controlled comparison between base models and their domain-adapted counterparts. We constructed a dataset comprising 385 digital images and more than 3,000 benchmark samples spanning key plant science domains including disease, pest control, weed management, and yield. The benchmark can assess diverse capabilities including visual expertise, quantitative reasoning, and multi-step agronomic reasoning. A total of 11 state-of-the-art VLMs were evaluated. The results indicate that task-specific fine-tuning leads to substantial improvement in accuracy, with models such as Qwen3-VL-4B and Qwen3-VL-30B achieving up to 78%. At the same time, gains from model scaling diminish beyond a certain capacity, generalization across soybean and cotton remains uneven, and quantitative as well as biologically grounded reasoning continue to pose substantial challenges. These findings suggest that PlantXpert can serve as a foundation for assessing evidence-grounded agronomic reasoning and for advancing multimodal model development in plant science.
AIJan 21, 2025
UI-TARS: Pioneering Automated GUI Interaction with Native AgentsYujia Qin, Yining Ye, Junjie Fang et al.
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc. (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.
LGAug 11, 2025Code
WeChat-YATT: A Scalable, Simple, Efficient, and Production Ready Training LibraryJunyu Wu, Weiming Chang, Xiaotao Liu et al.
Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent paradigm for training large language models and multimodal systems. Despite the notable advances enabled by existing RLHF training frameworks, significant challenges remain to scale to complex multimodal workflows and adapt to dynamic workloads. In particular, current systems often encounter limitations related to controller scalability when managing large models, as well as inefficiencies in orchestrating intricate RLHF pipelines, especially in scenarios that require dynamic sampling and resource allocation. In this paper, we introduce WeChat-YATT Yet Another Transformer Trainer in WeChat, a simple, scalable, and balanced RLHF training framework specifically designed to address these challenges. WeChat-YATT features a parallel controller programming model that enables flexible and efficient orchestration of complex RLHF workflows, effectively mitigating bottlenecks associated with centralized controller architectures and facilitating scalability in large-scale data scenarios. In addition, we propose a dynamic placement schema that adaptively partitions computational resources and schedules workloads, thereby significantly reducing hardware idle time and improving GPU utilization under variable training conditions. We evaluate WeChat-YATT across diverse experimental scenarios, demonstrating its substantial throughput improvements over state-of-the-art RLHF training frameworks. Furthermore, WeChat-YATT has been successfully deployed to train models that support WeChat product features for a large-scale user base, underscoring its effectiveness and robustness in real-world applications. We have made WeChat-YATT publicly available at https://www.github.com/tencent/WeChat-YATT.
CVJan 2
RePose: A Real-Time 3D Human Pose Estimation and Biomechanical Analysis Framework for RehabilitationJunxiao Xue, Pavel Smirnov, Ziao Li et al.
We propose a real-time 3D human pose estimation and motion analysis method termed RePose for rehabilitation training. It is capable of real-time monitoring and evaluation of patients'motion during rehabilitation, providing immediate feedback and guidance to assist patients in executing rehabilitation exercises correctly. Firstly, we introduce a unified pipeline for end-to-end real-time human pose estimation and motion analysis using RGB video input from multiple cameras which can be applied to the field of rehabilitation training. The pipeline can help to monitor and correct patients'actions, thus aiding them in regaining muscle strength and motor functions. Secondly, we propose a fast tracking method for medical rehabilitation scenarios with multiple-person interference, which requires less than 1ms for tracking for a single frame. Additionally, we modify SmoothNet for real-time posture estimation, effectively reducing pose estimation errors and restoring the patient's true motion state, making it visually smoother. Finally, we use Unity platform for real-time monitoring and evaluation of patients' motion during rehabilitation, and to display the muscle stress conditions to assist patients with their rehabilitation training.
56.0CVApr 18
Bridging Coarse and Fine Recognition: A Hybrid Approach for Open-Ended Multi-Granularity Object Recognition in Interactive Educational GamesHanling Yi, Feng Lin, Mao Luo et al.
Recent advances in Multimodal Large Language Models (MLLMs) have enabled open-ended object recognition, yet they struggle with fine-grained tasks. In contrast, CLIP-style models excel at fine-grained recognition but lack broad coverage of general object categories. To bridge this gap, we propose \textbf{HyMOR}, a \textbf{Hy}brid \textbf{M}ulti-granularity open-ended \textbf{O}bject \textbf{R}ecognition framework that integrates an MLLM with a CLIP model. In HyMOR, the MLLM performs open-ended and coarse-grained object recognition, while the CLIP model specializes in fine-grained identification of domain-specific objects such as animals and plants. This hybrid design enables accurate object understanding across multiple semantic granularities, serving as a robust perceptual foundation for downstream multi-modal content generation and interactive gameplay. To support evaluation in content-rich and educational scenarios, we introduce TBO (TextBook Objects), a dataset containing 20,942 images annotated with 8,816 object categories extracted from textbooks. Extensive experiments demonstrate that HyMOR narrows the fine-grained recognition gap with CLIP to 0.2\% while improving general object recognition by 2.5\% over a baseline MLLM, measured by average Sentence-BERT (SBert) similarity. Overall, HyMOR achieves a 23.2\% improvement in average SBert across all evaluated datasets, highlighting its effectiveness in enabling accurate perception for multi-modal game content generation and interactive learning applications.
ROJul 28, 2018Code
Accurate 3D Localization for MAV Swarms by UWB and IMU FusionJiaxin Li, Yingcai Bi, Kun Li et al.
Driven by applications like Micro Aerial Vehicles (MAVs), driver-less cars, etc, localization solution has become an active research topic in the past decade. In recent years, Ultra Wideband (UWB) emerged as a promising technology because of its impressive performance in both indoor and outdoor positioning. But algorithms relying only on UWB sensor usually result in high latency and low bandwidth, which is undesirable in some situations such as controlling a MAV. To alleviate this problem, an Extended Kalman Filter (EKF) based algorithm is proposed to fuse the Inertial Measurement Unit (IMU) and UWB, which achieved 80Hz 3D localization with significantly improved accuracy and almost no delay. To verify the effectiveness and reliability of the proposed approach, a swarm of 6 MAVs is set up to perform a light show in an indoor exhibition hall. Video and source codes are available at https://github.com/lijx10/uwb-localization
SEMar 23, 2024
SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model AgentsFeng Lin, Dong Jae Kim, Tse-Husn et al.
Software process models are essential to facilitate collaboration and communication among software teams to solve complex development tasks. Inspired by these software engineering practices, we present FlowGen - a code generation framework that emulates software process models based on multiple Large Language Model (LLM) agents. We emulate three process models, FlowGenWaterfall, FlowGenTDD, and FlowGenScrum, by assigning LLM agents to embody roles (i.e., requirement engineer, architect, developer, tester, and scrum master) that correspond to everyday development activities and organize their communication patterns. The agents work collaboratively using chain-of-thought and prompt composition with continuous self-refinement to improve the code quality. We use GPT3.5 as our underlying LLM and several baselines (RawGPT, CodeT, Reflexion) to evaluate code generation on four benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET. Our findings show that FlowGenScrum excels compared to other process models, achieving a Pass@1 of 75.2, 65.5, 82.5, and 56.7 in HumanEval, HumanEval-ET, MBPP, and MBPP-ET, respectively (an average of 15% improvement over RawGPT). Compared with other state-of-the-art techniques, FlowGenScrum achieves a higher Pass@1 in MBPP compared to CodeT, with both outperforming Reflexion. Notably, integrating CodeT into FlowGenScrum resulted in statistically significant improvements, achieving the highest Pass@1 scores. Our analysis also reveals that the development activities impacted code smell and exception handling differently, with design and code review adding more exception handling and reducing code smells. Finally, FlowGen models maintain stable Pass@1 scores across GPT3.5 versions and temperature values, highlighting the effectiveness of software process models in enhancing the quality and stability of LLM-generated code.
IRNov 13, 2025
Don't Waste It: Guiding Generative Recommenders with Structured Human Priors via Multi-head DecodingYunkai Zhang, Qiang Zhang, Feng Lin et al.
Optimizing recommender systems for objectives beyond accuracy, such as diversity, novelty, and personalization, is crucial for long-term user satisfaction. To this end, industrial practitioners have accumulated vast amounts of structured domain knowledge, which we term human priors (e.g., item taxonomies, temporal patterns). This knowledge is typically applied through post-hoc adjustments during ranking or post-ranking. However, this approach remains decoupled from the core model learning, which is particularly undesirable as the industry shifts to end-to-end generative recommendation foundation models. On the other hand, many methods targeting these beyond-accuracy objectives often require architecture-specific modifications and discard these valuable human priors by learning user intent in a fully unsupervised manner. Instead of discarding the human priors accumulated over years of practice, we introduce a backbone-agnostic framework that seamlessly integrates these human priors directly into the end-to-end training of generative recommenders. With lightweight, prior-conditioned adapter heads inspired by efficient LLM decoding strategies, our approach guides the model to disentangle user intent along human-understandable axes (e.g., interaction types, long- vs. short-term interests). We also introduce a hierarchical composition strategy for modeling complex interactions across different prior types. Extensive experiments on three large-scale datasets demonstrate that our method significantly enhances both accuracy and beyond-accuracy objectives. We also show that human priors allow the backbone model to more effectively leverage longer context lengths and larger model sizes.
AISep 2, 2025
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement LearningHaoming Wang, Haoyang Zou, Huatong Song et al. · pku
The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. In this technical report, we present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform for large-scale rollouts. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5. On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines such as Claude and OpenAI agents. In game environments, it attains a mean normalized score of 59.8 across a 15-game suite-roughly 60% of human-level performance-and remains competitive with frontier proprietary models (e.g., OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to long-horizon information-seeking tasks and software engineering benchmarks, highlighting its robustness across diverse agent tasks. Detailed analyses of training dynamics further provide insights into achieving stability and efficiency in large-scale agent RL. These results underscore UI-TARS-2's potential to advance the state of GUI agents and exhibit strong generalization to real-world interactive scenarios.
CLFeb 19, 2024
Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct DecodingHanling Yi, Feng Lin, Hongbin Li et al.
This research aims to accelerate the inference speed of large language models (LLMs) with billions of parameters. We propose \textbf{S}mart \textbf{P}arallel \textbf{A}uto-\textbf{C}orrect d\textbf{E}coding (SPACE), an innovative approach designed for achieving lossless acceleration of LLMs. By integrating semi-autoregressive inference and speculative decoding capabilities, SPACE uniquely enables autoregressive LLMs to parallelize token generation and verification. This is realized through a specialized semi-autoregressive supervised fine-tuning process that equips existing LLMs with the ability to simultaneously predict multiple tokens. Additionally, an auto-correct decoding algorithm facilitates the simultaneous generation and verification of token sequences within a single model invocation. Through extensive experiments on a range of LLMs, SPACE has demonstrated inference speedup ranging from 2.7x-4.0x on HumanEval-X while maintaining output quality.
CVFeb 24, 2024
CLIPose: Category-Level Object Pose Estimation with Pre-trained Vision-Language KnowledgeXiao Lin, Minghao Zhu, Ronghao Dang et al.
Most of existing category-level object pose estimation methods devote to learning the object category information from point cloud modality. However, the scale of 3D datasets is limited due to the high cost of 3D data collection and annotation. Consequently, the category features extracted from these limited point cloud samples may not be comprehensive. This motivates us to investigate whether we can draw on knowledge of other modalities to obtain category information. Inspired by this motivation, we propose CLIPose, a novel 6D pose framework that employs the pre-trained vision-language model to develop better learning of object category information, which can fully leverage abundant semantic knowledge in image and text modalities. To make the 3D encoder learn category-specific features more efficiently, we align representations of three modalities in feature space via multi-modal contrastive learning. In addition to exploiting the pre-trained knowledge of the CLIP's model, we also expect it to be more sensitive with pose parameters. Therefore, we introduce a prompt tuning approach to fine-tune image encoder while we incorporate rotations and translations information in the text descriptions. CLIPose achieves state-of-the-art performance on two mainstream benchmark datasets, REAL275 and CAMERA25, and runs in real-time during inference (40FPS).
CLJan 23, 2024
BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language ModelsFeng Lin, Hanling Yi, Hongbin Li et al.
Large language models (LLMs) commonly employ autoregressive generation during inference, leading to high memory bandwidth demand and consequently extended latency. To mitigate this inefficiency, we present Bi-directional Tuning for lossless Acceleration (BiTA), an innovative method expediting LLMs via streamlined semi-autoregressive generation and draft verification. Inspired by the concept of prompt tuning, we enhance LLMs with a parameter-efficient design called bi-directional tuning for the capability in semi-autoregressive generation. Employing efficient tree-based decoding, the models perform draft candidate generation and verification in parallel, ensuring outputs identical to their autoregressive counterparts under greedy sampling. BiTA serves as a lightweight plug-in module, seamlessly boosting the inference efficiency of existing LLMs without requiring additional assistance models or incurring significant extra memory costs. Applying the proposed BiTA, LLaMA-2-70B-Chat achieves a 2.7$\times$ speedup on the MT-Bench benchmark. Extensive experiments confirm our method surpasses state-of-the-art acceleration techniques.
CRNov 6, 2024
Eguard: Defending LLM Embeddings Against Inversion Attacks via Text Mutual Information OptimizationTiantian Liu, Hongwei Yao, Feng Lin et al.
Embeddings have become a cornerstone in the functionality of large language models (LLMs) due to their ability to transform text data into rich, dense numerical representations that capture semantic and syntactic properties. These embedding vector databases serve as the long-term memory of LLMs, enabling efficient handling of a wide range of natural language processing tasks. However, the surge in popularity of embedding vector databases in LLMs has been accompanied by significant concerns about privacy leakage. Embedding vector databases are particularly vulnerable to embedding inversion attacks, where adversaries can exploit the embeddings to reverse-engineer and extract sensitive information from the original text data. Existing defense mechanisms have shown limitations, often struggling to balance security with the performance of downstream tasks. To address these challenges, we introduce Eguard, a novel defense mechanism designed to mitigate embedding inversion attacks. Eguard employs a transformer-based projection network and text mutual information optimization to safeguard embeddings while preserving the utility of LLMs. Our approach significantly reduces privacy risks, protecting over 95% of tokens from inversion while maintaining high performance across downstream tasks consistent with original embeddings.
CVDec 21, 2024
REO-VLM: Transforming VLM to Meet Regression Challenges in Earth ObservationXizhe Xue, Guoting Wei, Hao Chen et al.
The rapid evolution of Vision Language Models (VLMs) has catalyzed significant advancements in artificial intelligence, expanding research across various disciplines, including Earth Observation (EO). While VLMs have enhanced image understanding and data processing within EO, their applications have predominantly focused on image content description. This limited focus overlooks their potential in geographic and scientific regression tasks, which are essential for diverse EO applications. To bridge this gap, this paper introduces a novel benchmark dataset, called \textbf{REO-Instruct} to unify regression and generation tasks specifically for the EO domain. Comprising 1.6 million multimodal EO imagery and language pairs, this dataset is designed to support both biomass regression and image content interpretation tasks. Leveraging this dataset, we develop \textbf{REO-VLM}, a groundbreaking model that seamlessly integrates regression capabilities with traditional generative functions. By utilizing language-driven reasoning to incorporate scientific domain knowledge, REO-VLM goes beyond solely relying on EO imagery, enabling comprehensive interpretation of complex scientific attributes from EO data. This approach establishes new performance benchmarks and significantly enhances the capabilities of environmental monitoring and resource management.
CVDec 16, 2024
FSFM: A Generalizable Face Security Foundation Model via Self-Supervised Facial Representation LearningGaojian Wang, Feng Lin, Tong Wu et al.
This work asks: with abundant, unlabeled real faces, how to learn a robust and transferable facial representation that boosts various face security tasks with respect to generalization performance? We make the first attempt and propose a self-supervised pretraining framework to learn fundamental representations of real face images, FSFM, that leverages the synergy between masked image modeling (MIM) and instance discrimination (ID). We explore various facial masking strategies for MIM and present a simple yet powerful CRFR-P masking, which explicitly forces the model to capture meaningful intra-region consistency and challenging inter-region coherency. Furthermore, we devise the ID network that naturally couples with MIM to establish underlying local-to-global correspondence via tailored self-distillation. These three learning objectives, namely 3C, empower encoding both local features and global semantics of real faces. After pretraining, a vanilla ViT serves as a universal vision foundation model for downstream face security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forgery detection. Extensive experiments on 10 public datasets demonstrate that our model transfers better than supervised pretraining, visual and facial self-supervised learning arts, and even outperforms task-specialized SOTA methods.
SEMar 28, 2025
RobuNFR: Evaluating the Robustness of Large Language Models on Non-Functional Requirements Aware Code GenerationFeng Lin, Dong Jae Kim, Zhenhao Li et al.
When using LLMs to address Non-Functional Requirements (NFRs), developers may behave differently (e.g., expressing the same NFR in different words). Robust LLMs should output consistent results across these variations; however, this aspect remains underexplored. We propose RobuNFR for evaluating the robustness of LLMs in NFR-aware code generation across four NFR dimensions: design, readability, reliability, and performance, using three methodologies: prompt variation, regression testing, and diverse workflows. Our experiments show that RobuNFR reveals robustness issues in the tested LLMs when considering NFRs in code generation. Specifically, under prompt variation, including NFRs leads to a decrease in Pass@1 by up to 39 percent and an increase in the standard deviation from 0.48 to 2.48 compared to the baseline without NFRs (i.e., Function-Only). While incorporating NFRs generally improves overall NFR metrics, it also results in higher prompt sensitivity. In regression settings, some LLMs exhibit differences across versions, with improvements in one aspect (e.g., reduced code smells) often accompanied by regressions in another (e.g., decreased correctness), revealing inconsistencies that challenge their robustness. When varying workflows, the tested LLMs show significantly different NFR-aware code generation capabilities between two workflows: (1) integrating NFRs and functional requirements into the initial prompt and (2) enhancing Function-Only-generated code with the same NFR.
CVMay 21, 2024
Gaussian Control with Hierarchical Semantic Graphs in 3D Human RecoveryHongsheng Wang, Weiyue Zhang, Sihao Liu et al.
Although 3D Gaussian Splatting (3DGS) has recently made progress in 3D human reconstruction, it primarily relies on 2D pixel-level supervision, overlooking the geometric complexity and topological relationships of different body parts. To address this gap, we introduce the Hierarchical Graph Human Gaussian Control (HUGS) framework for achieving high-fidelity 3D human reconstruction. Our approach involves leveraging explicitly semantic priors of body parts to ensure the consistency of geometric topology, thereby enabling the capture of the complex geometrical and topological associations among body parts. Additionally, we disentangle high-frequency features from global human features to refine surface details in body parts. Extensive experiments demonstrate that our method exhibits superior performance in human body reconstruction, particularly in enhancing surface details and accurately reconstructing body part junctions. Codes are available at https://wanghongsheng01.github.io/HUGS/.
LGJul 30, 2025
G-Core: A Simple, Scalable and Balanced RLHF TrainerJunyu Wu, Weiming Chang, Xiaotao Liu et al.
Reinforcement Learning from Human Feedback (RLHF) has become an increasingly popular paradigm for training large language models (LLMs) and diffusion models. While existing RLHF training systems have enabled significant progress, they often face challenges in scaling to multi-modal and diffusion workflows and adapting to dynamic workloads. In particular, current approaches may encounter limitations in controller scalability, flexible resource placement, and efficient orchestration when handling complex RLHF pipelines, especially in scenarios involving dynamic sampling or generative reward modeling. In this paper, we present \textbf{G-Core}, a simple, scalable, and balanced RLHF training framework designed to address these challenges. G-Core introduces a parallel controller programming model, enabling flexible and efficient orchestration of complex RLHF workflows without the bottlenecks of a single centralized controller. Furthermore, we propose a dynamic placement schema that adaptively partitions resources and schedules workloads, significantly reducing hardware idle time and improving utilization, even under highly variable training conditions. G-Core has successfully trained models that support WeChat product features serving a large-scale user base, demonstrating its effectiveness and robustness in real-world scenarios. Our results show that G-Core advances the state of the art in RLHF training, providing a solid foundation for future research and deployment of large-scale, human-aligned models.
CVMay 21, 2024
NOVA-3D: Non-overlapped Views for 3D Anime Character ReconstructionHongsheng Wang, Nanjie Yao, Xinrui Zhou et al.
In the animation industry, 3D modelers typically rely on front and back non-overlapped concept designs to guide the 3D modeling of anime characters. However, there is currently a lack of automated approaches for generating anime characters directly from these 2D designs. In light of this, we explore a novel task of reconstructing anime characters from non-overlapped views. This presents two main challenges: existing multi-view approaches cannot be directly applied due to the absence of overlapping regions, and there is a scarcity of full-body anime character data and standard benchmarks. To bridge the gap, we present Non-Overlapped Views for 3D \textbf{A}nime Character Reconstruction (NOVA-3D), a new framework that implements a method for view-aware feature fusion to learn 3D-consistent features effectively and synthesizes full-body anime characters from non-overlapped front and back views directly. To facilitate this line of research, we collected the NOVA-Human dataset, which comprises multi-view images and accurate camera parameters for 3D anime characters. Extensive experiments demonstrate that the proposed method outperforms baseline approaches, achieving superior reconstruction of anime characters with exceptional detail fidelity. In addition, to further verify the effectiveness of our method, we applied it to the animation head reconstruction task and improved the state-of-the-art baseline to 94.453 in SSIM, 7.726 in LPIPS, and 19.575 in PSNR on average. Codes and datasets are available at https://wanghongsheng01.github.io/NOVA-3D/.
CVMay 21, 2024
MOSS: Motion-based 3D Clothed Human Synthesis from Monocular VideoHongsheng Wang, Xiang Cai, Xi Sun et al.
Single-view clothed human reconstruction holds a central position in virtual reality applications, especially in contexts involving intricate human motions. It presents notable challenges in achieving realistic clothing deformation. Current methodologies often overlook the influence of motion on surface deformation, resulting in surfaces lacking the constraints imposed by global motion. To overcome these limitations, we introduce an innovative framework, Motion-Based 3D Clo}thed Humans Synthesis (MOSS), which employs kinematic information to achieve motion-aware Gaussian split on the human surface. Our framework consists of two modules: Kinematic Gaussian Locating Splatting (KGAS) and Surface Deformation Detector (UID). KGAS incorporates matrix-Fisher distribution to propagate global motion across the body surface. The density and rotation factors of this distribution explicitly control the Gaussians, thereby enhancing the realism of the reconstructed surface. Additionally, to address local occlusions in single-view, based on KGAS, UID identifies significant surfaces, and geometric reconstruction is performed to compensate for these deformations. Experimental results demonstrate that MOSS achieves state-of-the-art visual quality in 3D clothed human synthesis from monocular videos. Notably, we improve the Human NeRF and the Gaussian Splatting by 33.94% and 16.75% in LPIPS* respectively. Codes are available at https://wanghongsheng01.github.io/MOSS/.
CVMay 21, 2024
RemoCap: Disentangled Representation Learning for Motion CaptureHongsheng Wang, Lizao Zhang, Zhangnan Zhong et al.
Reconstructing 3D human bodies from realistic motion sequences remains a challenge due to pervasive and complex occlusions. Current methods struggle to capture the dynamics of occluded body parts, leading to model penetration and distorted motion. RemoCap leverages Spatial Disentanglement (SD) and Motion Disentanglement (MD) to overcome these limitations. SD addresses occlusion interference between the target human body and surrounding objects. It achieves this by disentangling target features along the dimension axis. By aligning features based on their spatial positions in each dimension, SD isolates the target object's response within a global window, enabling accurate capture despite occlusions. The MD module employs a channel-wise temporal shuffling strategy to simulate diverse scene dynamics. This process effectively disentangles motion features, allowing RemoCap to reconstruct occluded parts with greater fidelity. Furthermore, this paper introduces a sequence velocity loss that promotes temporal coherence. This loss constrains inter-frame velocity errors, ensuring the predicted motion exhibits realistic consistency. Extensive comparisons with state-of-the-art (SOTA) methods on benchmark datasets demonstrate RemoCap's superior performance in 3D human body reconstruction. On the 3DPW dataset, RemoCap surpasses all competitors, achieving the best results in MPVPE (81.9), MPJPE (72.7), and PA-MPJPE (44.1) metrics. Codes are available at https://wanghongsheng01.github.io/RemoCap/.
CVOct 12, 2025
Scalable Face Security Vision Foundation Model for Deepfake, Diffusion, and Spoofing DetectionGaojian Wang, Feng Lin, Tong Wu et al.
With abundant, unlabeled real faces, how can we learn robust and transferable facial representations to boost generalization across various face security tasks? We make the first attempt and propose FS-VFM, a scalable self-supervised pre-training framework, to learn fundamental representations of real face images. We introduce three learning objectives, namely 3C, that synergize masked image modeling (MIM) and instance discrimination (ID), empowering FS-VFM to encode both local patterns and global semantics of real faces. Specifically, we formulate various facial masking strategies for MIM and devise a simple yet effective CRFR-P masking, which explicitly prompts the model to pursue meaningful intra-region Consistency and challenging inter-region Coherency. We present a reliable self-distillation mechanism that seamlessly couples MIM with ID to establish underlying local-to-global Correspondence. After pre-training, vanilla vision transformers (ViTs) serve as universal Vision Foundation Models for downstream Face Security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forensics. To efficiently transfer the pre-trained FS-VFM, we further propose FS-Adapter, a lightweight plug-and-play bottleneck atop the frozen backbone with a novel real-anchor contrastive objective. Extensive experiments on 11 public benchmarks demonstrate that our FS-VFM consistently generalizes better than diverse VFMs, spanning natural and facial domains, fully, weakly, and self-supervised paradigms, small, base, and large ViT scales, and even outperforms SOTA task-specific methods, while FS-Adapter offers an excellent efficiency-performance trade-off. The code and models are available on https://fsfm-3c.github.io/fsvfm.html.
CVSep 17, 2025
Morphology-optimized Multi-Scale Fusion: Combining Local Artifacts and Mesoscopic Semantics for Deepfake Detection and LocalizationChao Shuai, Gaojian Wang, Kun Pan et al.
While the pursuit of higher accuracy in deepfake detection remains a central goal, there is an increasing demand for precise localization of manipulated regions. Despite the remarkable progress made in classification-based detection, accurately localizing forged areas remains a significant challenge. A common strategy is to incorporate forged region annotations during model training alongside manipulated images. However, such approaches often neglect the complementary nature of local detail and global semantic context, resulting in suboptimal localization performance. Moreover, an often-overlooked aspect is the fusion strategy between local and global predictions. Naively combining the outputs from both branches can amplify noise and errors, thereby undermining the effectiveness of the localization. To address these issues, we propose a novel approach that independently predicts manipulated regions using both local and global perspectives. We employ morphological operations to fuse the outputs, effectively suppressing noise while enhancing spatial coherence. Extensive experiments reveal the effectiveness of each module in improving the accuracy and robustness of forgery localization.
CVJul 1, 2025
Not All Attention Heads Are What You Need: Refining CLIP's Image Representation with Attention AblationFeng Lin, Marco Chen, Haokui Zhang et al.
This paper investigates the role of attention heads in CLIP's image encoder. Building on interpretability studies, we conduct an exhaustive analysis and find that certain heads, distributed across layers, are detrimental to the resulting representations. To mitigate their impact, we propose a simple yet effective Attention Ablation Technique (AAT) that suppresses selected heads by directly manipulating their attention weights. By incorporating two complementary strategies tailored to different application scenarios, AAT enables the systematic identification and ablation of harmful heads with minimal overhead. Experiments show that AAT consistently improves downstream performance across diverse domains, boosting recall by up to 11.1% on cross-modal retrieval benchmarks. These results highlight that AAT can effectively refine large-scale VLMs with virtually no extra inference cost, while yielding semantically meaningful patterns that align with existing interpretability findings.
IVApr 28, 2025
Dual Attention Driven Lumbar Magnetic Resonance Image Feature Enhancement and Automatic Diagnosis of HerniationLingrui Zhang, Liang Guo, Xiao An et al.
Lumbar disc herniation (LDH) is a common musculoskeletal disease that requires magnetic resonance imaging (MRI) for effective clinical management. However, the interpretation of MRI images heavily relies on the expertise of radiologists, leading to delayed diagnosis and high costs for training physicians. Therefore, this paper proposes an innovative automated LDH classification framework. To address these key issues, the framework utilizes T1-weighted and T2-weighted MRI images from 205 people. The framework extracts clinically actionable LDH features and generates standardized diagnostic outputs by leveraging data augmentation and channel and spatial attention mechanisms. These outputs can help physicians make confident and time-effective care decisions when needed. The proposed framework achieves an area under the receiver operating characteristic curve (AUC-ROC) of 0.969 and an accuracy of 0.9486 for LDH detection. The experimental results demonstrate the performance of the proposed framework. Our framework only requires a small number of datasets for training to demonstrate high diagnostic accuracy. This is expected to be a solution to enhance the LDH detection capabilities of primary hospitals.
AIApr 7, 2025
Transforming Future Data Center Operations and Management via Physical AIZhiwei Cao, Minghao Li, Feng Lin et al.
Data centers (DCs) as mission-critical infrastructures are pivotal in powering the growth of artificial intelligence (AI) and the digital economy. The evolution from Internet DC to AI DC has introduced new challenges in operating and managing data centers for improved business resilience and reduced total cost of ownership. As a result, new paradigms, beyond the traditional approaches based on best practices, must be in order for future data centers. In this research, we propose and develop a novel Physical AI (PhyAI) framework for advancing DC operations and management. Our system leverages the emerging capabilities of state-of-the-art industrial products and our in-house research and development. Specifically, it presents three core modules, namely: 1) an industry-grade in-house simulation engine to simulate DC operations in a highly accurate manner, 2) an AI engine built upon NVIDIA PhysicsNemo for the training and evaluation of physics-informed machine learning (PIML) models, and 3) a digital twin platform built upon NVIDIA Omniverse for our proposed 5-tier digital twin framework. This system presents a scalable and adaptable solution to digitalize, optimize, and automate future data center operations and management, by enabling real-time digital twins for future data centers. To illustrate its effectiveness, we present a compelling case study on building a surrogate model for predicting the thermal and airflow profiles of a large-scale DC in a real-time manner. Our results demonstrate its superior performance over traditional time-consuming Computational Fluid Dynamics/Heat Transfer (CFD/HT) simulation, with a median absolute temperature prediction error of 0.18 °C. This emerging approach would open doors to several potential research directions for advancing Physical AI in future DC operations.
HCFeb 23, 2025
Tool and Tutor? Experimental evidence from AI deployment in cancer diagnosisVivianna Fang He, Sihan Li, Phanish Puranam et al.
Numerous countries globally face shortages of medical experts, deepening inequalities in access to healthcare. Artificial Intelligence (AI)-based diagnostic tools hold considerable promise to tackle this challenge by enabling even novices to deliver expert-level medical services. However, reliance on AI for task completion may hinder the learning required for novices to develop expertise. We thus explore whether AI-based diagnostic tools can be used to enhance not only performance but also learning in the context of lung cancer diagnosis. We examine the distinct effects of AI input during training (i.e., learning how to diagnose) versus in practice (i.e., completing diagnostic tasks) on novice medical professionals' performance. In two field experiments, 576 medical students were randomly assigned across conditions, manipulating the access to AI input during their training, during a test of their diagnostic capabilities, or both. During practice, participants diagnosed potential lung cancer cases using chest CT scans, and their diagnoses were evaluated against the ground truth obtained through histopathological examinations. Study 1 (N = 336) revealed that AI input in training alone improved human diagnostic accuracy by 3.2 percentage points over the control, while AI input during practice alone increased human accuracy by 7.9 percentage points. Combined deployment in both training and practice yielded an improvement of 13.7 percentage points--significantly exceeding either approach alone. Study 2 (N = 240) showed that AI input in practice alone improved accuracy in subsequent practice, unaided by AI, by 9.9 percentage points over the control. Even minimally informative AI input in training improved diagnostic accuracy by 5.3 percentage points over the control. These results reveal AI's dual role: As a tool, it could rapidly improve novices' performance.
CVNov 7, 2024
ProGraph: Temporally-alignable Probability Guided Graph Topological Modeling for 3D Human ReconstructionHongsheng Wang, Zehui Feng, Tong Xiao et al.
Current 3D human motion reconstruction methods from monocular videos rely on features within the current reconstruction window, leading to distortion and deformations in the human structure under local occlusions or blurriness in video frames. To estimate realistic 3D human mesh sequences based on incomplete features, we propose Temporally-alignable Probability Guided Graph Topological Modeling for 3D Human Reconstruction (ProGraph). For missing parts recovery, we exploit the explicit topological-aware probability distribution across the entire motion sequence. To restore the complete human, Graph Topological Modeling (GTM) learns the underlying topological structure, focusing on the relationships inherent in the individual parts. Next, to generate blurred motion parts, Temporal-alignable Probability Distribution (TPDist) utilizes the GTM to predict features based on distribution. This interactive mechanism facilitates motion consistency, allowing the restoration of human parts. Furthermore, Hierarchical Human Loss (HHLoss) constrains the probability distribution errors of inter-frame features during topological structure variation. Our Method achieves superior results than other SOTA methods in addressing occlusions and blurriness on 3DPW.
CVMay 21, 2024
NieR: Normal-Based Lighting Scene RenderingHongsheng Wang, Yang Wang, Yalan Liu et al.
In real-world road scenes, diverse material properties lead to complex light reflection phenomena, making accurate color reproduction crucial for enhancing the realism and safety of simulated driving environments. However, existing methods often struggle to capture the full spectrum of lighting effects, particularly in dynamic scenarios where viewpoint changes induce significant material color variations. To address this challenge, we introduce NieR (Normal-Based Lighting Scene Rendering), a novel framework that takes into account the nuances of light reflection on diverse material surfaces, leading to more precise rendering. To simulate the lighting synthesis process, we present the LD (Light Decomposition) module, which captures the lighting reflection characteristics on surfaces. Furthermore, to address dynamic lighting scenes, we propose the HNGD (Hierarchical Normal Gradient Densification) module to overcome the limitations of sparse Gaussian representation. Specifically, we dynamically adjust the Gaussian density based on normal gradients. Experimental evaluations demonstrate that our method outperforms state-of-the-art (SOTA) methods in terms of visual quality and exhibits significant advantages in performance indicators. Codes are available at https://wanghongsheng01.github.io/NieR/.
NEFeb 8, 2022
Convergence of a New Learning AlgorithmFeng Lin
A new learning algorithm proposed by Brandt and Lin for neural network [1], [2] has been shown to be mathematically equivalent to the conventional back-propagation learning algorithm, but has several advantages over the backpropagation algorithm, including feedback-network-free implementation and biological plausibility. In this paper, we investigate the convergence of the new algorithm. A necessary and sufficient condition for the algorithm to converge is derived. A convergence measure is proposed to measure the convergence rate of the new algorithm. Simulation studies are conducted to investigate the convergence of the algorithm with respect to the number of neurons, the connection distance, the connection density, the ratio of excitatory/inhibitory synapses, the membrane potentials, and the synapse strengths.
CVOct 12, 2021
Joint Learning On The Hierarchy Representation for Fine-Grained Human Action RecognitionMei Chee Leong, Hui Li Tan, Haosong Zhang et al.
Fine-grained human action recognition is a core research topic in computer vision. Inspired by the recently proposed hierarchy representation of fine-grained actions in FineGym and SlowFast network for action recognition, we propose a novel multi-task network which exploits the FineGym hierarchy representation to achieve effective joint learning and prediction for fine-grained human action recognition. The multi-task network consists of three pathways of SlowOnly networks with gradually increased frame rates for events, sets and elements of fine-grained actions, followed by our proposed integration layers for joint learning and prediction. It is a two-stage approach, where it first learns deep feature representation at each hierarchical level, and is followed by feature encoding and fusion for multi-task learning. Our empirical results on the FineGym dataset achieve a new state-of-the-art performance, with 91.80% Top-1 accuracy and 88.46% mean accuracy for element actions, which are 3.40% and 7.26% higher than the previous best results.
OCJun 13, 2021
Stochastic Alternating Direction Method of Multipliers for Byzantine-Robust Distributed LearningFeng Lin, Weiyu Li, Qing Ling
This paper aims to solve a distributed learning problem under Byzantine attacks. In the underlying distributed system, a number of unknown but malicious workers (termed as Byzantine workers) can send arbitrary messages to the master and bias the learning process, due to data corruptions, computation errors or malicious attacks. Prior work has considered a total variation (TV) norm-penalized approximation formulation to handle the Byzantine attacks, where the TV norm penalty forces the regular workers' local variables to be close, and meanwhile, tolerates the outliers sent by the Byzantine workers. To solve the TV norm-penalized approximation formulation, we propose a Byzantine-robust stochastic alternating direction method of multipliers (ADMM) that fully utilizes the separable problem structure. Theoretically, we prove that the proposed method converges to a bounded neighborhood of the optimal solution at a rate of O(1/k) under mild assumptions, where k is the number of iterations and the size of neighborhood is determined by the number of Byzantine workers. Numerical experiments on the MNIST and COVERTYPE datasets demonstrate the effectiveness of the proposed method to various Byzantine attacks.
ASMar 2, 2021
Incorporating VAD into ASR System by Multi-task LearningMeng Li, Xia Yan, Feng Lin
When we use End-to-end automatic speech recognition (E2E-ASR) system for real-world applications, a voice activity detection (VAD) system is usually needed to improve the performance and to reduce the computational cost by discarding non-speech parts in the audio. Usually ASR and VAD systems are trained and utilized independently to each other. In this paper, we present a novel multi-task learning (MTL) framework that incorporates VAD into the ASR system. The proposed system learns ASR and VAD jointly in the training stage. With the assistance of VAD, the ASR performance improves as its connectionist temporal classification (CTC) loss function can leverage the VAD alignment information. In the inference stage, the proposed system removes non-speech parts at low computational cost and recognizes speech parts with high robustness. Experimental results on segmented speech data show that by utilizing VAD information, the proposed method outperforms the baseline ASR system on both English and Chinese datasets. On unsegmented speech data, we find that the system outperforms the ASR systems that build an extra GMM-based or DNN-based voice activity detector.
CVDec 17, 2020
FG-Net: Fast Large-Scale LiDAR Point Clouds Understanding Network Leveraging Correlated Feature Mining and Geometric-Aware ModellingKangcheng Liu, Zhi Gao, Feng Lin et al.
This work presents FG-Net, a general deep learning framework for large-scale point clouds understanding without voxelizations, which achieves accurate and real-time performance with a single NVIDIA GTX 1080 GPU. First, a novel noise and outlier filtering method is designed to facilitate subsequent high-level tasks. For effective understanding purpose, we propose a deep convolutional neural network leveraging correlated feature mining and deformable convolution based geometric-aware modelling, in which the local feature relationships and geometric patterns can be fully exploited. For the efficiency issue, we put forward an inverse density sampling operation and a feature pyramid based residual learning strategy to save the computational cost and memory consumption respectively. Extensive experiments on real-world challenging datasets demonstrated that our approaches outperform state-of-the-art approaches in terms of accuracy and efficiency. Moreover, weakly supervised transfer learning is also conducted to demonstrate the generalization capacity of our method.
SPJun 12, 2020
Injecting Reliable Radio Frequency Fingerprints Using Metasurface for The Internet of ThingsSekhar Rajendran, Zhi Sun, Feng Lin et al.
In Internet of Things, where billions of devices with limited resources are communicating with each other, security has become a major stumbling block affecting the progress of this technology. Existing authentication schemes-based on digital signatures have overhead costs associated with them in terms of computation time, battery power, bandwidth, memory, and related hardware costs. Radio frequency fingerprint (RFF), utilizing the unique device-based information, can be a promising solution for IoT. However, traditional RFFs have become obsolete because of low reliability and reduced user capability. Our proposed solution, Metasurface RF-Fingerprinting Injection (MeRFFI), is to inject a carefully-designed radio frequency fingerprint into the wireless physical layer that can increase the security of a stationary IoT device with minimal overhead. The injection of fingerprint is implemented using a low cost metasurface developed and fabricated in our lab, which is designed to make small but detectable perturbations in the specific frequency band in which the IoT devices are communicating. We have conducted comprehensive system evaluations including distance, orientation, multiple channels where the feasibility, effectiveness, and reliability of these fingerprints are validated. The proposed MeRFFI system can be easily integrated into the existing authentication schemes. The security vulnerabilities are analyzed for some of the most threatening wireless physical layer-based attacks.
CLMay 31, 2020
Recognizing Chinese Judicial Named Entity using BiLSTM-CRFPin Tang, Pinli Yang, Yuang Shi et al.
Named entity recognition (NER) plays an essential role in natural language processing systems. Judicial NER is a fundamental component of judicial information retrieval, entity relation extraction, and knowledge map building. However, Chinese judicial NER remains to be more challenging due to the characteristics of Chinese and high accuracy requirements in the judicial filed. Thus, in this paper, we propose a deep learning-based method named BiLSTM-CRF which consists of bi-directional long short-term memory (BiLSTM) and conditional random fields (CRF). For further accuracy promotion, we propose to use Adaptive moment estimation (Adam) for optimization of the model. To validate our method, we perform experiments on judgment documents including commutation, parole and temporary service outside prison, which is acquired from China Judgments Online. Experimental results achieve the accuracy of 0.876, recall of 0.856 and F1 score of 0.855, which suggests the superiority of the proposed BiLSTM-CRF with Adam optimizer.