AIJul 31, 2024
The Llama 3 Herd of ModelsAaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri et al. · allen-ai, berkeley
Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.
97.3DLJun 2
A Double Bind: Gendered Funding, Research Topics, and Academic Performance in The Social SciencesYang Ding, Ning Zhang, Helen Bao et al.
While female representation in social sciences is increasing, systemic gender disparities may persist in research funding and academic performance. Some argue that female scholars now receive equal opportunities, yet evidence suggests that gender imbalances remain, particularly in specific research areas. This study examines 12,945 National Science Foundation (NSF)-funded principal investigators in social sciences from 2000 to 2019 to assess gender disparities in grant allocation, research topics, and post-award academic performance. Findings reveal a dual imbalance. First, despite similar overall funding success rates, female scholars remain underrepresented in high-impact and traditionally male-dominated research topics. Males dominate most funded topics, especially STEM-related ones, while female-led topics align with traditional gender stereotypes. Second, post-award performance patterns suggest that females outperform males in male-dominated fields, whereas males excel in female-dominated ones, undermining any presumed advantage of female scholars in their own research areas. These disparities contribute to the risk of both genders prematurely exiting the science pipeline. Furthermore, early-career experiences shape these outcomes asymmetrically: postdoctoral experience benefits both genders in female-dominated fields, with stronger effects for males, but disadvantages females in male-dominated fields by reducing their output and citation impact. Longer postdoctoral tenure enhances male researchers' citation impact across all fields but has mixed effects for females depending on field gender composition. These findings underscore the need for policies that address not just overall funding equality, but also gendered disparities across research topics and career trajectories.
CVMar 10, 2022
LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text RetrievalJie Lei, Xinlei Chen, Ning Zhang et al. · meta-ai
Dual encoders and cross encoders have been widely used for image-text retrieval. Between the two, the dual encoder encodes the image and text independently followed by a dot product, while the cross encoder jointly feeds image and text as the input and performs dense multi-modal fusion. These two architectures are typically modeled separately without interaction. In this work, we propose LoopITR, which combines them in the same network for joint learning. Specifically, we let the dual encoder provide hard negatives to the cross encoder, and use the more discriminative cross encoder to distill its predictions back to the dual encoder. Both steps are efficiently performed together in the same model. Our work centers on empirical analyses of this combined architecture, putting the main focus on the design of the distillation objective. Our experimental results highlight the benefits of training the two encoders in the same network, and demonstrate that distillation can be quite effective with just a few hard negative examples. Experiments on two standard datasets (Flickr30K and COCO) show our approach achieves state-of-the-art dual encoder performance when compared with approaches using a similar amount of data.
CVJan 18, 2023Code
HSTFormer: Hierarchical Spatial-Temporal Transformers for 3D Human Pose EstimationXiaoye Qian, Youbao Tang, Ning Zhang et al.
Transformer-based approaches have been successfully proposed for 3D human pose estimation (HPE) from 2D pose sequence and achieved state-of-the-art (SOTA) performance. However, current SOTAs have difficulties in modeling spatial-temporal correlations of joints at different levels simultaneously. This is due to the poses' spatial-temporal complexity. Poses move at various speeds temporarily with various joints and body-parts movement spatially. Hence, a cookie-cutter transformer is non-adaptable and can hardly meet the "in-the-wild" requirement. To mitigate this issue, we propose Hierarchical Spatial-Temporal transFormers (HSTFormer) to capture multi-level joints' spatial-temporal correlations from local to global gradually for accurate 3D HPE. HSTFormer consists of four transformer encoders (TEs) and a fusion module. To the best of our knowledge, HSTFormer is the first to study hierarchical TEs with multi-level fusion. Extensive experiments on three datasets (i.e., Human3.6M, MPI-INF-3DHP, and HumanEva) demonstrate that HSTFormer achieves competitive and consistent performance on benchmarks with various scales and difficulties. Specifically, it surpasses recent SOTAs on the challenging MPI-INF-3DHP dataset and small-scale HumanEva dataset, with a highly generalized systematic approach. The code is available at: https://github.com/qianxiaoye825/HSTFormer.
CVSep 20, 2024
Imagine yourself: Tuning-Free Personalized Image GenerationZecheng He, Bo Sun, Felix Juefei-Xu et al. · meta-ai
Diffusion models have demonstrated remarkable efficacy across various image-to-image tasks. In this research, we introduce Imagine yourself, a state-of-the-art model designed for personalized image generation. Unlike conventional tuning-based personalization techniques, Imagine yourself operates as a tuning-free model, enabling all users to leverage a shared framework without individualized adjustments. Moreover, previous work met challenges balancing identity preservation, following complex prompts and preserving good visual quality, resulting in models having strong copy-paste effect of the reference images. Thus, they can hardly generate images following prompts that require significant changes to the reference image, \eg, changing facial expression, head and body poses, and the diversity of the generated images is low. To address these limitations, our proposed method introduces 1) a new synthetic paired data generation mechanism to encourage image diversity, 2) a fully parallel attention architecture with three text encoders and a fully trainable vision encoder to improve the text faithfulness, and 3) a novel coarse-to-fine multi-stage finetuning methodology that gradually pushes the boundary of visual quality. Our study demonstrates that Imagine yourself surpasses the state-of-the-art personalization model, exhibiting superior capabilities in identity preservation, visual quality, and text alignment. This model establishes a robust foundation for various personalization applications. Human evaluation results validate the model's SOTA superiority across all aspects (identity preservation, text faithfulness, and visual appeal) compared to the previous personalization models.
SYOct 29, 2017
Data-Driven Power Flow Linearization: A Regression ApproachYuxiao Liu, Ning Zhang, Yi Wang et al.
The linearization of a power flow (PF) model is an important approach for simplifying and accelerating the calculation of a power system's control, operation, and optimization. Traditional model-based methods derive linearized PF models by making approximations in the analytical PF model according to the physical characteristics of the power system. Today, more measurements of the power system are available and thus facilitate data-driven approaches beyond model-driven approaches. This work studies a linearized PF model through a data-driven approach. Both a forward regression model ((P, Q) as a function of (theta, V)) and an inverse regression model ((theta, V) as a function of (P, Q)) are proposed. Partial least square (PLS)- and Bayesian linear regression (BLR)-based algorithms are designed to address data collinearity and avoid overfitting. The proposed approach is tested on a series of IEEE standard cases, which include both meshed transmission grids and radial distribution grids, with both Monte Carlo simulated data and public testing data. The results show that the proposed approach can realize a higher calculation accuracy than model-based approaches can. The results also demonstrate that the obtained regression parameter matrices of data-driven models reflect power system physics by demonstrating similar patterns with some power system matrices (e.g., the admittance matrix).
CVNov 23, 2022
Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth EstimationNing Zhang, Francesco Nex, George Vosselman et al.
Self-supervised monocular depth estimation that does not require ground truth for training has attracted attention in recent years. It is of high interest to design lightweight but effective models so that they can be deployed on edge devices. Many existing architectures benefit from using heavier backbones at the expense of model sizes. This paper achieves comparable results with a lightweight architecture. Specifically, the efficient combination of CNNs and Transformers is investigated, and a hybrid architecture called Lite-Mono is presented. A Consecutive Dilated Convolutions (CDC) module and a Local-Global Features Interaction (LGFI) module are proposed. The former is used to extract rich multi-scale local features, and the latter takes advantage of the self-attention mechanism to encode long-range global information into the features. Experiments demonstrate that Lite-Mono outperforms Monodepth2 by a large margin in accuracy, with about 80% fewer trainable parameters.
CVMar 1, 2022
Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular AlignmentMingyang Zhou, Licheng Yu, Amanpreet Singh et al. · meta-ai
Vision-and-Language (V+L) pre-training models have achieved tremendous success in recent years on various multi-modal benchmarks. However, the majority of existing models require pre-training on a large set of parallel image-text data, which is costly to collect, compared to image-only or text-only data. In this paper, we explore unsupervised Vision-and-Language pre-training (UVLP) to learn the cross-modal representation from non-parallel image and text datasets. We found two key factors that lead to good unsupervised V+L pre-training without parallel data: (i) joint image-and-text input (ii) overall image-text alignment (even for non-parallel data). Accordingly, we propose a novel unsupervised V+L pre-training curriculum for non-parallel texts and images. We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks, including region-to-tag, region-to-phrase, and image-to-sentence alignment, to bridge the gap between the two modalities. A comprehensive ablation study shows each granularity is helpful to learn a stronger pre-trained model. We adapt our pre-trained model to a set of V+L downstream tasks, including VQA, NLVR2, Visual Entailment, and RefCOCO+. Our model achieves the state-of-art performance in all these tasks under the unsupervised setting.
IRFeb 21, 2023
Que2Engage: Embedding-based Retrieval for Relevant and Engaging Products at Facebook MarketplaceYunzhong He, Yuxin Tian, Mengjiao Wang et al. · meta-ai
Embedding-based Retrieval (EBR) in e-commerce search is a powerful search retrieval technique to address semantic matches between search queries and products. However, commercial search engines like Facebook Marketplace Search are complex multi-stage systems optimized for multiple business objectives. At Facebook Marketplace, search retrieval focuses on matching search queries with relevant products, while search ranking puts more emphasis on contextual signals to up-rank the more engaging products. As a result, the end-to-end searcher experience is a function of both relevance and engagement, and the interaction between different stages of the system. This presents challenges to EBR systems in order to optimize for better searcher experiences. In this paper we presents Que2Engage, a search EBR system built towards bridging the gap between retrieval and ranking for end-to-end optimizations. Que2Engage takes a multimodal & multitask approach to infuse contextual information into the retrieval stage and to balance different business objectives. We show the effectiveness of our approach via a multitask evaluation framework and thorough baseline comparisons and ablation studies. Que2Engage is deployed on Facebook Marketplace Search and shows significant improvements in searcher engagement in two weeks of A/B testing.
CVNov 23, 2022
CGOF++: Controllable 3D Face Synthesis with Conditional Generative Occupancy FieldsKeqiang Sun, Shangzhe Wu, Ning Zhang et al.
Capitalizing on the recent advances in image generation models, existing controllable face image synthesis methods are able to generate high-fidelity images with some levels of controllability, e.g., controlling the shapes, expressions, textures, and poses of the generated face images. However, previous methods focus on controllable 2D image generative models, which are prone to producing inconsistent face images under large expression and pose changes. In this paper, we propose a new NeRF-based conditional 3D face synthesis framework, which enables 3D controllability over the generated face images by imposing explicit 3D conditions from 3D face priors. At its core is a conditional Generative Occupancy Field (cGOF++) that effectively enforces the shape of the generated face to conform to a given 3D Morphable Model (3DMM) mesh, built on top of EG3D [1], a recent tri-plane-based generative model. To achieve accurate control over fine-grained 3D face shapes of the synthesized images, we additionally incorporate a 3D landmark loss as well as a volume warping loss into our synthesis framework. Experiments validate the effectiveness of the proposed method, which is able to generate high-fidelity face images and shows more precise 3D controllability than state-of-the-art 2D-based controllable face synthesis methods.
CVJun 16, 2022
Controllable 3D Face Synthesis with Conditional Generative Occupancy FieldsKeqiang Sun, Shangzhe Wu, Zhaoyang Huang et al.
Capitalizing on the recent advances in image generation models, existing controllable face image synthesis methods are able to generate high-fidelity images with some levels of controllability, e.g., controlling the shapes, expressions, textures, and poses of the generated face images. However, these methods focus on 2D image generative models, which are prone to producing inconsistent face images under large expression and pose changes. In this paper, we propose a new NeRF-based conditional 3D face synthesis framework, which enables 3D controllability over the generated face images by imposing explicit 3D conditions from 3D face priors. At its core is a conditional Generative Occupancy Field (cGOF) that effectively enforces the shape of the generated face to commit to a given 3D Morphable Model (3DMM) mesh. To achieve accurate control over fine-grained 3D face shapes of the synthesized image, we additionally incorporate a 3D landmark loss as well as a volume warping loss into our synthesis algorithm. Experiments validate the effectiveness of the proposed method, which is able to generate high-fidelity face images and shows more precise 3D controllability than state-of-the-art 2D-based controllable face synthesis methods. Find code and demo at https://keqiangsun.github.io/projects/cgof.
CVOct 26, 2022
FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and CaptioningSuvir Mirchandani, Licheng Yu, Mengjiao Wang et al. · meta-ai, stanford
Multimodal tasks in the fashion domain have significant potential for e-commerce, but involve challenging vision-and-language learning problems - e.g., retrieving a fashion item given a reference image plus text feedback from a user. Prior works on multimodal fashion tasks have either been limited by the data in individual benchmarks, or have leveraged generic vision-and-language pre-training but have not taken advantage of the characteristics of fashion data. Additionally, these works have mainly been restricted to multimodal understanding tasks. To address these gaps, we make two key contributions. First, we propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs. We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks. Second, we propose a flexible decoder-based model architecture capable of both fashion retrieval and captioning tasks. Together, our model design and pre-training approach are competitive on a diverse set of fashion tasks, including cross-modal retrieval, image retrieval with text feedback, image captioning, relative image captioning, and multimodal categorization.
CVMar 23, 2023
Learning and Verification of Task Structure in Instructional VideosMedhini Narasimhan, Licheng Yu, Sean Bell et al.
Given the enormous number of instructional videos available online, learning a diverse array of multi-step task models from videos is an appealing goal. We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos. We pre-train VideoTaskformer using a simple and effective objective: predicting weakly supervised textual labels for steps that are randomly masked out from an instructional video (masked step modeling). Compared to prior work which learns step representations locally, our approach involves learning them globally, leveraging video of the entire surrounding task as context. From these learned representations, we can verify if an unseen video correctly executes a given task, as well as forecast which steps are likely to be taken after a given step. We introduce two new benchmarks for detecting mistakes in instructional videos, to verify if there is an anomalous step and if steps are executed in the right order. We also introduce a long-term forecasting benchmark, where the goal is to predict long-range future steps from a given step. Our method outperforms previous baselines on these tasks, and we believe the tasks will be a valuable way for the community to measure the quality of step representations. Additionally, we evaluate VideoTaskformer on 3 existing benchmarks -- procedural activity recognition, step classification, and step forecasting -- and demonstrate on each that our method outperforms existing baselines and achieves new state-of-the-art performance.
CVJul 31, 2024Code
EUDA: An Efficient Unsupervised Domain Adaptation via Self-Supervised Vision TransformerAli Abedi, Q. M. Jonathan Wu, Ning Zhang et al.
Unsupervised domain adaptation (UDA) aims to mitigate the domain shift issue, where the distribution of training (source) data differs from that of testing (target) data. Many models have been developed to tackle this problem, and recently vision transformers (ViTs) have shown promising results. However, the complexity and large number of trainable parameters of ViTs restrict their deployment in practical applications. This underscores the need for an efficient model that not only reduces trainable parameters but also allows for adjustable complexity based on specific needs while delivering comparable performance. To achieve this, in this paper we introduce an Efficient Unsupervised Domain Adaptation (EUDA) framework. EUDA employs the DINOv2, which is a self-supervised ViT, as a feature extractor followed by a simplified bottleneck of fully connected layers to refine features for enhanced domain adaptation. Additionally, EUDA employs the synergistic domain alignment loss (SDAL), which integrates cross-entropy (CE) and maximum mean discrepancy (MMD) losses, to balance adaptation by minimizing classification errors in the source domain while aligning the source and target domain distributions. The experimental results indicate the effectiveness of EUDA in producing comparable results as compared with other state-of-the-art methods in domain adaptation with significantly fewer trainable parameters, between 42% to 99.7% fewer. This showcases the ability to train the model in a resource-limited environment. The code of the model is available at: https://github.com/A-Abedi/EUDA.
59.6CRMay 7Code
AgentDyn: Are Your Agent Security Defenses Deployable in Real-World Dynamic Environments?Hao Li, Ruoyao Wen, Shanghao Shi et al.
AI agents that autonomously interact with external tools and environments have shown great promise across real-world applications. However, their reliance on external data exposes them to serious indirect prompt injection attacks, where malicious instructions embedded in third-party content hijack agent behaviors. To mitigate this threat, a growing number of defenses have been proposed and evaluated under existing agent security benchmarks. These benchmarks provide structured environments for comparing attacks and defenses, and have become a key driver for defense design and optimization. However, as agents move toward more complex and open-ended real-world deployments, there is a pressing need for benchmarks to become more adaptive and better reflect the dynamic environments faced by real-world agentic systems. In this work, we reveal three fundamental flaws in the current benchmarks and push the frontier along these dimensions: (i) lack of dynamic open-ended tasks, (ii) lack of helpful instructions, and (iii) simplistic user tasks. To bridge this gap, we introduce AgentDyn, a manually designed benchmark featuring 60 challenging open-ended tasks and 560 injection test cases across Shopping, GitHub, and Daily Life. Unlike prior static benchmarks, AgentDyn requires dynamic planning and incorporates helpful third-party instructions. Our evaluation of ten state-of-the-art defenses suggests that almost all existing defenses are either not secure enough or suffer from significant over-defense, revealing that existing defenses are still far from real-world deployment. Our benchmark is available at https://github.com/leolee99/AgentDyn.
CVNov 23, 2022
Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video GenerationTsu-Jui Fu, Licheng Yu, Ning Zhang et al.
Generating a video given the first several static frames is challenging as it anticipates reasonable future frames with temporal coherence. Besides video prediction, the ability to rewind from the last frame or infilling between the head and tail is also crucial, but they have rarely been explored for video completion. Since there could be different outcomes from the hints of just a few frames, a system that can follow natural language to perform video completion may significantly improve controllability. Inspired by this, we introduce a novel task, text-guided video completion (TVC), which requests the model to generate a video from partial frames guided by an instruction. We then propose Multimodal Masked Video Generation (MMVG) to address this TVC task. During training, MMVG discretizes the video frames into visual tokens and masks most of them to perform video completion from any time point. At inference time, a single MMVG model can address all 3 cases of TVC, including video prediction, rewind, and infilling, by applying corresponding masking conditions. We evaluate MMVG in various video scenarios, including egocentric, animation, and gaming. Extensive experimental results indicate that MMVG is effective in generating high-quality visual appearances with text guidance for TVC.
CVOct 2, 2023Code
You Only Look at Once for Real-time and Generic Multi-TaskJiayuan Wang, Q. M. Jonathan Wu, Ning Zhang
High precision, lightweight, and real-time responsiveness are three essential requirements for implementing autonomous driving. In this study, we incorporate A-YOLOM, an adaptive, real-time, and lightweight multi-task model designed to concurrently address object detection, drivable area segmentation, and lane line segmentation tasks. Specifically, we develop an end-to-end multi-task model with a unified and streamlined segmentation structure. We introduce a learnable parameter that adaptively concatenates features between necks and backbone in segmentation tasks, using the same loss function for all segmentation tasks. This eliminates the need for customizations and enhances the model's generalization capabilities. We also introduce a segmentation head composed only of a series of convolutional layers, which reduces the number of parameters and inference time. We achieve competitive results on the BDD100k dataset, particularly in visualization outcomes. The performance results show a mAP50 of 81.1% for object detection, a mIoU of 91.0% for drivable area segmentation, and an IoU of 28.8% for lane line segmentation. Additionally, we introduce real-world scenarios to evaluate our model's performance in a real scene, which significantly outperforms competitors. This demonstrates that our model not only exhibits competitive performance but is also more flexible and faster than existing multi-task models. The source codes and pre-trained models are released at https://github.com/JiayuanWang-JW/YOLOv8-multi-task
NIDec 31, 2022
Cost-Effective Two-Stage Network Slicing for Edge-Cloud Orchestrated Vehicular NetworksWen Wu, Kaige Qu, Peng Yang et al.
In this paper, we study a network slicing problem for edge-cloud orchestrated vehicular networks, in which the edge and cloud servers are orchestrated to process computation tasks for reducing network slicing cost while satisfying the quality of service requirements. We propose a two-stage network slicing framework, which consists of 1) network planning stage in a large timescale to perform slice deployment, edge resource provisioning, and cloud resource provisioning, and 2) network operation stage in a small timescale to perform resource allocation and task dispatching. Particularly, we formulate the network slicing problem as a two-timescale stochastic optimization problem to minimize the network slicing cost. Since the problem is NP-hard due to coupled network planning and network operation stages, we develop a Two timescAle netWork Slicing (TAWS) algorithm by collaboratively integrating reinforcement learning (RL) and optimization methods, which can jointly make network planning and operation decisions. Specifically, by leveraging the timescale separation property of decisions, we decouple the problem into a large-timescale network planning subproblem and a small-timescale network operation subproblem. The former is solved by an RL method, and the latter is solved by an optimization method. Simulation results based on real-world vehicle traffic traces show that the TAWS can effectively reduce the network slicing cost as compared to the benchmark scheme.
LGSep 8, 2022
Reward Delay Attacks on Deep Reinforcement LearningAnindya Sarkar, Jiarui Feng, Yevgeniy Vorobeychik et al.
Most reinforcement learning algorithms implicitly assume strong synchrony. We present novel attacks targeting Q-learning that exploit a vulnerability entailed by this assumption by delaying the reward signal for a limited time period. We consider two types of attack goals: targeted attacks, which aim to cause a target policy to be learned, and untargeted attacks, which simply aim to induce a policy with a low reward. We evaluate the efficacy of the proposed attacks through a series of experiments. Our first observation is that reward-delay attacks are extremely effective when the goal is simply to minimize reward. Indeed, we find that even naive baseline reward-delay attacks are also highly successful in minimizing the reward. Targeted attacks, on the other hand, are more challenging, although we nevertheless demonstrate that the proposed approaches remain highly effective at achieving the attacker's targets. In addition, we introduce a second threat model that captures a minimal mitigation that ensures that rewards cannot be used out of sequence. We find that this mitigation remains insufficient to ensure robustness to attacks that delay, but preserve the order, of rewards.
CRDec 12, 2025Code
Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive ScoringPeichun Hua, Hao Li, Shanghao Shi et al.
Large Vision-Language Models (LVLMs) are vulnerable to a growing array of multimodal jailbreak attacks, necessitating defenses that are both generalizable to novel threats and efficient for practical deployment. Many current strategies fall short, either targeting specific attack patterns, which limits generalization, or imposing high computational overhead. While lightweight anomaly-detection methods offer a promising direction, we find that their common one-class design tends to confuse novel benign inputs with malicious ones, leading to unreliable over-rejection. To address this, we propose Representational Contrastive Scoring (RCS), a framework built on a key insight: the most potent safety signals reside within the LVLM's own internal representations. Our approach inspects the internal geometry of these representations, learning a lightweight projection to maximally separate benign and malicious inputs in safety-critical layers. This enables a simple yet powerful contrastive score that differentiates true malicious intent from mere novelty. Our instantiations, MCD (Mahalanobis Contrastive Detection) and KCD (K-nearest Contrastive Detection), achieve state-of-the-art performance on a challenging evaluation protocol designed to test generalization to unseen attack types. This work demonstrates that effective jailbreak detection can be achieved by applying simple, interpretable statistical methods to the appropriate internal representations, offering a practical path towards safer LVLM deployment. Our code is available on Github https://github.com/sarendis56/Jailbreak_Detection_RCS.
CLMar 11, 2022
Hierarchical BERT for Medical Document UnderstandingNing Zhang, Maciej Jankowski
Medical document understanding has gained much attention recently. One representative task is the International Classification of Disease (ICD) diagnosis code assignment. Existing work adopts either RNN or CNN as the backbone network because the vanilla BERT cannot handle well long documents (>2000 to kens). One issue shared across all these approaches is that they are over specific to the ICD code assignment task, losing generality to give the whole document-level and sentence-level embedding. As a result, it is not straight-forward to direct them to other downstream NLU tasks. Motivated by these observations, we propose Medical Document BERT (MDBERT) for long medical document understanding tasks. MDBERT is not only effective in learning representations at different levels of semantics but efficient in encoding long documents by leveraging a bottom-up hierarchical architecture. Compared to vanilla BERT solutions: 1, MDBERT boosts the performance up to relatively 20% on the MIMIC-III dataset, making it comparable to current SOTA solutions; 2, it cuts the computational complexity on self-attention modules to less than 1/100. Other than the ICD code assignment, we conduct a variety of other NLU tasks on a large commercial dataset named as TrialTrove, to showcase MDBERT's strength in delivering different levels of semantics.
ROMar 18, 2023
Channel-Aware Distillation Transformer for Depth Estimation on Nano DronesNing Zhang, Francesco Nex, George Vosselman et al.
Autonomous navigation of drones using computer vision has achieved promising performance. Nano-sized drones based on edge computing platforms are lightweight, flexible, and cheap, thus suitable for exploring narrow spaces. However, due to their extremely limited computing power and storage, vision algorithms designed for high-performance GPU platforms cannot be used for nano drones. To address this issue this paper presents a lightweight CNN depth estimation network deployed on nano drones for obstacle avoidance. Inspired by Knowledge Distillation (KD), a Channel-Aware Distillation Transformer (CADiT) is proposed to facilitate the small network to learn knowledge from a larger network. The proposed method is validated on the KITTI dataset and tested on a nano drone Crazyflie, with an ultra-low power microprocessor GAP8.
NESep 13, 2023
Event-Driven Imaging in Turbid Media: A Confluence of Optoelectronics and Neuromorphic ComputationNing Zhang, Timothy Shea, Arto Nurmikko
In this paper a new optical-computational method is introduced to unveil images of targets whose visibility is severely obscured by light scattering in dense, turbid media. The targets of interest are taken to be dynamic in that their optical properties are time-varying whether stationary in space or moving. The scheme, to our knowledge the first of its kind, is human vision inspired whereby diffuse photons collected from the turbid medium are first transformed to spike trains by a dynamic vision sensor as in the retina, and image reconstruction is then performed by a neuromorphic computing approach mimicking the brain. We combine benchtop experimental data in both reflection (backscattering) and transmission geometries with support from physics-based simulations to develop a neuromorphic computational model and then apply this for image reconstruction of different MNIST characters and image sets by a dedicated deep spiking neural network algorithm. Image reconstruction is achieved under conditions of turbidity where an original image is unintelligible to the human eye or a digital video camera, yet clearly and quantifiable identifiable when using the new neuromorphic computational approach.
CRJan 15Code
ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection AttackHao Li, Yankai Yang, G. Edward Suh et al.
Large Language Models (LLMs) have enabled the development of powerful agentic systems capable of automating complex workflows across various fields. However, these systems are highly vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external data can hijack agent behavior. In this work, we present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks. The core idea of ReasAlign is to incorporate structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks to defend against indirect injection attacks. To further ensure reasoning logic and accuracy, we introduce a test-time scaling mechanism with a preference-optimized judge model that scores reasoning steps and selects the best trajectory. Comprehensive evaluations across various benchmarks show that ReasAlign maintains utility comparable to an undefended model while consistently outperforming Meta SecAlign, the strongest prior guardrail. On the representative open-ended CyberSecEval2 benchmark, which includes multiple prompt-injected tasks, ReasAlign achieves 94.6% utility and only 3.6% ASR, far surpassing the state-of-the-art defensive model of Meta SecAlign (56.4% utility and 74.4% ASR). These results demonstrate that ReasAlign achieves the best trade-off between security and utility, establishing a robust and practical defense against prompt injection attacks in real-world agentic systems. Our code and experimental results could be found at https://github.com/leolee99/ReasAlign.
NADec 11, 2017
An Efficient Multigrid Method for Ground State Solution of Bose-Einstein CondensatesHehu Xie, Fei Xu, Ning Zhang
An efficient multigrid method is proposed to compute the ground state solution of Bose-Einstein condensations by the finite element method based on the combination of the multigrid method for nonlinear eigenvalue problem and an efficient implementation for the nonlinear iteration. The proposed numerical method not only has the optimal convergence rate, but also has the asymptotically optimal computational work which is independent from the nonlinearity of the problem. The independence from the nonlinearity means that the asymptotic estimate of the computational work can reach almost the same as that of solving the corresponding linear boundary value problem by the multigrid method. Some numerical experiments are provided to validate the efficiency of the proposed method.
LGMay 27, 2022
FadMan: Federated Anomaly Detection across Multiple Attributed NetworksNannan Wu, Ning Zhang, Wenjun Wang et al.
Anomaly subgraph detection has been widely used in various applications, ranging from cyber attack in computer networks to malicious activities in social networks. Despite an increasing need for federated anomaly detection across multiple attributed networks, only a limited number of approaches are available for this problem. Federated anomaly detection faces two major challenges. One is that isolated data in most industries are restricted share with others for data privacy and security. The other is most of the centralized approaches training based on data integration. The main idea of federated anomaly detection is aligning private anomalies from local data owners on the public anomalies from the attributed network in the server through public anomalies to federate local anomalies. In each private attributed network, the detected anomaly subgraph is aligned with an anomaly subgraph in the public attributed network. The significant public anomaly subgraphs are selected for federated private anomalies while preventing local private data leakage. The proposed algorithm FadMan is a vertical federated learning framework for public node aligned with many private nodes of different features, and is validated on two tasks correlated anomaly detection on multiple attributed networks and anomaly detection on an attributeless network using five real-world datasets. In the first scenario, FadMan outperforms competitive methods by at least 12% accuracy at 10% noise level. In the second scenario, by analyzing the distribution of abnormal nodes, we find that the nodes of traffic anomalies are associated with the event of postgraduate entrance examination on the same day.
NAJun 19, 2016
Fully Computable Error Bounds for Eigenvalue ProblemHehu Xie, Meiling Yue, Ning Zhang
This paper is concerned with the computable error estimates for the eigenvalue problem which is solved by the general conforming finite element methods on the general meshes. Based on the computable error estimate, we can give an asymptotically lower bound of the general eigenvalues. Furthermore, we also give a guaranteed upper bound of the error estimates for the first eigenfunction approximation and a guaranteed lower bound of the first eigenvalue based on computable error estimator. Some numerical examples are presented to validate the theoretical results deduced in this paper.
IVAug 28, 2022
Accurate and Robust Lesion RECIST Diameter Prediction and Segmentation with TransformersYoubao Tang, Ning Zhang, Yirui Wang et al.
Automatically measuring lesion/tumor size with RECIST (Response Evaluation Criteria In Solid Tumors) diameters and segmentation is important for computer-aided diagnosis. Although it has been studied in recent years, there is still space to improve its accuracy and robustness, such as (1) enhancing features by incorporating rich contextual information while keeping a high spatial resolution and (2) involving new tasks and losses for joint optimization. To reach this goal, this paper proposes a transformer-based network (MeaFormer, Measurement transFormer) for lesion RECIST diameter prediction and segmentation (LRDPS). It is formulated as three correlative and complementary tasks: lesion segmentation, heatmap prediction, and keypoint regression. To the best of our knowledge, it is the first time to use keypoint regression for RECIST diameter prediction. MeaFormer can enhance high-resolution features by employing transformers to capture their long-range dependencies. Two consistency losses are introduced to explicitly build relationships among these tasks for better optimization. Experiments show that MeaFormer achieves the state-of-the-art performance of LRDPS on the large-scale DeepLesion dataset and produces promising results of two downstream clinic-relevant tasks, i.e., 3D lesion segmentation and RECIST assessment in longitudinal studies.
LGAug 17, 2023
Joint Power Control and Data Size Selection for Over-the-Air Computation Aided Federated LearningXuming An, Rongfei Fan, Shiyuan Zuo et al.
Federated learning (FL) has emerged as an appealing machine learning approach to deal with massive raw data generated at multiple mobile devices, {which needs to aggregate the training model parameter of every mobile device at one base station (BS) iteratively}. For parameter aggregating in FL, over-the-air computation is a spectrum-efficient solution, which allows all mobile devices to transmit their parameter-mapped signals concurrently to a BS. Due to heterogeneous channel fading and noise, there exists difference between the BS's received signal and its desired signal, measured as the mean-squared error (MSE). To minimize the MSE, we propose to jointly optimize the signal amplification factors at the BS and the mobile devices as well as the data size (the number of data samples involved in local training) at every mobile device. The formulated problem is challenging to solve due to its non-convexity. To find the optimal solution, with some simplification on cost function and variable replacement, which still preserves equivalence, we transform the changed problem to be a bi-level problem equivalently. For the lower-level problem, optimal solution is found by enumerating every candidate solution from the Karush-Kuhn-Tucker (KKT) condition. For the upper-level problem, the optimal solution is found by exploring its piecewise convexity. Numerical results show that our proposed method can greatly reduce the MSE and can help to improve the training performance of FL compared with benchmark methods.
AIFeb 16
Protecting Language Models Against Unauthorized Distillation through Trace RewritingXinhang Ma, William Yeoh, Ning Zhang et al.
Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models. However, unauthorized use of knowledge distillation takes unfair advantage of the considerable effort and cost put into developing frontier models. We investigate methods for modifying teacher-generated reasoning traces to achieve two objectives that deter unauthorized distillation: (1) \emph{anti-distillation}, or degrading the training usefulness of query responses, and (2) \emph{API watermarking}, which embeds verifiable signatures in student models. We introduce several approaches for dynamically rewriting a teacher's reasoning outputs while preserving answer correctness and semantic coherence. Two of these leverage the rewriting capabilities of LLMs, while others use gradient-based techniques. Our experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining or even improving teacher performance. Furthermore, we show that our rewriting approach also enables highly reliable watermark detection with essentially no false alarms.
90.7CVMay 12Code
UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMsShuo Ni, Tong Wang, Jing Zhang et al.
Vision-Language Models (VLMs) increasingly operate on ultra-high-resolution (UHR) Earth observation imagery, yet they remain vulnerable to a severe scale mismatch between large-scale scene context and micro-scale targets. We refer to this empirical gap as a "resolution illusion": higher input resolution provides the appearance of richer visual detail, but does not necessarily yield reliable perception of spatially small, task-relevant evidence. To benchmark this challenge, we introduce UHR-Micro, a benchmark comprising 11,253 instructions grounded in 1,212 UHR images, designed to evaluate VLMs at the spatial limits of native Earth observation imagery. UHR-Micro spans diverse micro-target scales, context requirements, task families, and visual conditions, and provides diagnostic annotations that support controlled evaluation and fine-grained error attribution. Experiments with representative high-resolution VLMs show substantial failures in spatial grounding and evidence parsing, despite access to high-resolution inputs. Further analysis suggests that these failures are not fully resolved by increasing model capacity, but are closely tied to insufficient guidance in locating and using task-relevant micro-evidence. Motivated by this finding, we propose Micro-evidence Active Perception (MAP), a reference agent that decomposes queries into evidence-seeking steps, actively inspects candidate regions, and grounds its answers in localized observations. MAP-Agent improves micro-level perception by making high-resolution reasoning evidence-centered rather than image-centered. Together, UHR-Micro and MAP-Agent provide a diagnostic platform for evaluating, understanding, and advancing high-resolution reasoning in Earth observation VLMs. Datasets and source code were released at https://github.com/MiliLab/UHR-Micro.
CVJul 22, 2022
PieTrack: An MOT solution based on synthetic data training and self-supervised domain adaptationYirui Wang, Shenghua He, Youbao Tang et al.
In order to cope with the increasing demand for labeling data and privacy issues with human detection, synthetic data has been used as a substitute and showing promising results in human detection and tracking tasks. We participate in the 7th Workshop on Benchmarking Multi-Target Tracking (BMTT), themed on "How Far Can Synthetic Data Take us"? Our solution, PieTrack, is developed based on synthetic data without using any pre-trained weights. We propose a self-supervised domain adaptation method that enables mitigating the domain shift issue between the synthetic (e.g., MOTSynth) and real data (e.g., MOT17) without involving extra human labels. By leveraging the proposed multi-scale ensemble inference, we achieved a final HOTA score of 58.7 on the MOT17 testing set, ranked third place in the challenge.
LGAug 21, 2023
Federated Learning Robust to Byzantine Attacks: Achieving Zero Optimality GapShiyuan Zuo, Rongfei Fan, Han Hu et al.
In this paper, we propose a robust aggregation method for federated learning (FL) that can effectively tackle malicious Byzantine attacks. At each user, model parameter is firstly updated by multiple steps, which is adjustable over iterations, and then pushed to the aggregation center directly. This decreases the number of interactions between the aggregation center and users, allows each user to set training parameter in a flexible way, and reduces computation burden compared with existing works that need to combine multiple historical model parameters. At the aggregation center, geometric median is leveraged to combine the received model parameters from each user. Rigorous proof shows that zero optimality gap is achieved by our proposed method with linear convergence, as long as the fraction of Byzantine attackers is below half. Numerical results verify the effectiveness of our proposed method.
CVNov 10, 2022
Prior-enhanced Temporal Action Localization using Subject-aware Spatial AttentionYifan Liu, Youbao Tang, Ning Zhang et al.
Temporal action localization (TAL) aims to detect the boundary and identify the class of each action instance in a long untrimmed video. Current approaches treat video frames homogeneously, and tend to give background and key objects excessive attention. This limits their sensitivity to localize action boundaries. To this end, we propose a prior-enhanced temporal action localization method (PETAL), which only takes in RGB input and incorporates action subjects as priors. This proposal leverages action subjects' information with a plug-and-play subject-aware spatial attention module (SA-SAM) to generate an aggregated and subject-prioritized representation. Experimental results on THUMOS-14 and ActivityNet-1.3 datasets demonstrate that the proposed PETAL achieves competitive performance using only RGB features, e.g., boosting mAP by 2.41% or 0.25% over the state-of-the-art approach that uses RGB features or with additional optical flow features on the THUMOS-14 dataset.
47.6DCMay 7
EdgeServing: Deadline-Aware Multi-DNN Serving at the EdgeJiahe Cao, Xiaomeng Li, Qiang Liu et al.
As edge computing expands, serving multiple deep neural network (DNN) models on a single shared GPU has become a common yet challenging scenario, where each scheduling decision affects the tail latency of all concurrent queues. Existing schedulers rely on local heuristics and fail to capture this global impact, while GPU spatial-sharing approaches sacrifice latency predictability. In this paper, we propose EdgeServing, a deadline-aware multi-DNN serving system for edge devices. EdgeServing adopts time-division GPU sharing with early-exit inference for high inference predictability, and introduces a stability score to quantify how each candidate scheduling decision impacts the future queue status. At runtime, it cohesively selects the model, exit point, and batch size to minimize predicted system-wide SLO impact. Experimental results on multiple hardware platforms show that EdgeServing consistently outperforms representative baselines in both SLO violation ratio and P95 latency, enabled by early-exit mechanism, which expands the scheduling action space under tight latency constraints.
CVNov 8, 2025Code
MoEGCL: Mixture of Ego-Graphs Contrastive Representation Learning for Multi-View ClusteringJian Zhu, Xin Zou, Jun Sun et al.
In recent years, the advancement of Graph Neural Networks (GNNs) has significantly propelled progress in Multi-View Clustering (MVC). However, existing methods face the problem of coarse-grained graph fusion. Specifically, current approaches typically generate a separate graph structure for each view and then perform weighted fusion of graph structures at the view level, which is a relatively rough strategy. To address this limitation, we present a novel Mixture of Ego-Graphs Contrastive Representation Learning (MoEGCL). It mainly consists of two modules. In particular, we propose an innovative Mixture of Ego-Graphs Fusion (MoEGF), which constructs ego graphs and utilizes a Mixture-of-Experts network to implement fine-grained fusion of ego graphs at the sample level, rather than the conventional view-level fusion. Additionally, we present the Ego Graph Contrastive Learning (EGCL) module to align the fused representation with the view-specific representation. The EGCL module enhances the representation similarity of samples from the same cluster, not merely from the same sample, further boosting fine-grained graph representation. Extensive experiments demonstrate that MoEGCL achieves state-of-the-art results in deep multi-view clustering tasks. The source code is publicly available at https://github.com/HackerHyper/MoEGCL.
CRAug 23, 2024
An In-Depth Investigation of Data Collection in LLM App EcosystemsYuhao Wu, Evin Jaff, Ke Yang et al.
LLM app (tool) ecosystems are rapidly evolving to support sophisticated use cases that often require extensive user data collection. Given that LLM apps are developed by third parties and anecdotal evidence indicating inconsistent enforcement of policies by LLM platforms, sharing user data with these apps presents significant privacy risks. In this paper, we aim to bring transparency in data practices of LLM app ecosystems. We examine OpenAI's GPT app ecosystem as a case study. We propose an LLM-based framework to analyze the natural language specifications of GPT Actions (custom tools) and assess their data collection practices. Our analysis reveals that Actions collect excessive data across 24 categories and 145 data types, with third-party Actions collecting 6.03% more data on average. We find that several Actions violate OpenAI's policies by collecting sensitive information, such as passwords, which is explicitly prohibited by OpenAI. Lastly, we develop an LLM-based privacy policy analysis framework to automatically check the consistency of data collection by Actions with disclosures in their privacy policies. Our measurements indicate that the disclosures for most of the collected data types are omitted, with only 5.8% of Actions clearly disclosing their data collection practices.
CVSep 2, 2024
Self-Supervised Multi-Scale Network for Blind Image Deblurring via Alternating OptimizationLening Guo, Jing Yu, Ning Zhang et al.
Blind image deblurring is a challenging low-level vision task that involves estimating the unblurred image when the blur kernel is unknown. In this paper, we present a self-supervised multi-scale blind image deblurring method to jointly estimate the latent image and the blur kernel via alternating optimization. In the image estimation step, we construct a multi-scale generator network with multiple inputs and multiple outputs to collaboratively estimate latent images at various scales, supervised by an image pyramid constructed from only the blurred image. This generator places architectural constraints on the network and avoids the need for mathematical expression of image priors. In the blur kernel estimation step, the blur kernel at each scale is independently estimated with a direct solution to a quadratic regularized least-squares model for its flexible adaptation to the proposed multi-scale generator for image estimation. Thanks to the collaborative estimation across multiple scales, our method avoids the computationally intensive coarse-to-fine propagation and additional image deblurring processes used in traditional mathematical optimization-based methods. Quantitative and qualitative experimental results on synthetic and realistic datasets demonstrate the superior performance of our method, especially for handling large and real-world blurs.
CVNov 3, 2025
Compressing Multi-Task Model for Autonomous Driving via Pruning and Knowledge DistillationJiayuan Wang, Q. M. Jonathan Wu, Ning Zhang et al.
Autonomous driving systems rely on panoptic perception to jointly handle object detection, drivable area segmentation, and lane line segmentation. Although multi-task learning is an effective way to integrate these tasks, its increasing model parameters and complexity make deployment on on-board devices difficult. To address this challenge, we propose a multi-task model compression framework that combines task-aware safe pruning with feature-level knowledge distillation. Our safe pruning strategy integrates Taylor-based channel importance with gradient conflict penalty to keep important channels while removing redundant and conflicting channels. To mitigate performance degradation after pruning, we further design a task head-agnostic distillation method that transfers intermediate backbone and encoder features from a teacher to a student model as guidance. Experiments on the BDD100K dataset demonstrate that our compressed model achieves a 32.7% reduction in parameters while segmentation performance shows negligible accuracy loss and only a minor decrease in detection (-1.2% for Recall and -1.8% for mAP50) compared to the teacher. The compressed model still runs at 32.7 FPS in real-time. These results show that combining pruning and knowledge distillation provides an effective compression solution for multi-task panoptic perception.
IVJun 16, 2023
Fusing Structural and Functional Connectivities using Disentangled VAE for Detecting MCIQiankun Zuo, Yanfei Zhu, Libin Lu et al.
Brain network analysis is a useful approach to studying human brain disorders because it can distinguish patients from healthy people by detecting abnormal connections. Due to the complementary information from multiple modal neuroimages, multimodal fusion technology has a lot of potential for improving prediction performance. However, effective fusion of multimodal medical images to achieve complementarity is still a challenging problem. In this paper, a novel hierarchical structural-functional connectivity fusing (HSCF) model is proposed to construct brain structural-functional connectivity matrices and predict abnormal brain connections based on functional magnetic resonance imaging (fMRI) and diffusion tensor imaging (DTI). Specifically, the prior knowledge is incorporated into the separators for disentangling each modality of information by the graph convolutional networks (GCN). And a disentangled cosine distance loss is devised to ensure the disentanglement's effectiveness. Moreover, the hierarchical representation fusion module is designed to effectively maximize the combination of relevant and effective features between modalities, which makes the generated structural-functional connectivity more robust and discriminative in the cognitive disease analysis. Results from a wide range of tests performed on the public Alzheimer's Disease Neuroimaging Initiative (ADNI) database show that the proposed model performs better than competing approaches in terms of classification evaluation. In general, the proposed HSCF model is a promising model for generating brain structural-functional connectivities and identifying abnormal brain connections as cognitive disease progresses.
CLSep 26, 2024
ZALM3: Zero-Shot Enhancement of Vision-Language Alignment via In-Context Information in Multi-Turn Multimodal Medical DialogueZhangpu Li, Changhong Zou, Suxue Ma et al.
The rocketing prosperity of large language models (LLMs) in recent years has boosted the prevalence of vision-language models (VLMs) in the medical sector. In our online medical consultation scenario, a doctor responds to the texts and images provided by a patient in multiple rounds to diagnose her/his health condition, forming a multi-turn multimodal medical dialogue format. Unlike high-quality images captured by professional equipment in traditional medical visual question answering (Med-VQA), the images in our case are taken by patients' mobile phones. These images have poor quality control, with issues such as excessive background elements and the lesion area being significantly off-center, leading to degradation of vision-language alignment in the model training phase. In this paper, we propose ZALM3, a Zero-shot strategy to improve vision-language ALignment in Multi-turn Multimodal Medical dialogue. Since we observe that the preceding text conversations before an image can infer the regions of interest (RoIs) in the image, ZALM3 employs an LLM to summarize the keywords from the preceding context and a visual grounding model to extract the RoIs. The updated images eliminate unnecessary background noise and provide more effective vision-language alignment. To better evaluate our proposed method, we design a new subjective assessment metric for multi-turn unimodal/multimodal medical dialogue to provide a fine-grained performance comparison. Our experiments across three different clinical departments remarkably demonstrate the efficacy of ZALM3 with statistical significance.
CVAug 13, 2025Code
SpeechForensics: Audio-Visual Speech Representation Learning for Face Forgery DetectionYachao Liang, Min Yu, Gang Li et al.
Detection of face forgery videos remains a formidable challenge in the field of digital forensics, especially the generalization to unseen datasets and common perturbations. In this paper, we tackle this issue by leveraging the synergy between audio and visual speech elements, embarking on a novel approach through audio-visual speech representation learning. Our work is motivated by the finding that audio signals, enriched with speech content, can provide precise information effectively reflecting facial movements. To this end, we first learn precise audio-visual speech representations on real videos via a self-supervised masked prediction task, which encodes both local and global semantic information simultaneously. Then, the derived model is directly transferred to the forgery detection task. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods in terms of cross-dataset generalization and robustness, without the participation of any fake video in model training. Code is available at https://github.com/Eleven4AI/SpeechForensics.
CRJun 13, 2025Code
DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM AgentsHao Li, Xiaogeng Liu, Hung-Chun Chiu et al.
Large Language Models (LLMs) are increasingly central to agentic systems due to their strong reasoning and planning capabilities. By interacting with external environments through predefined tools, these agents can carry out complex user tasks. Nonetheless, this interaction also introduces the risk of prompt injection attacks, where malicious inputs from external sources can mislead the agent's behavior, potentially resulting in economic loss, privacy leakage, or system compromise. System-level defenses have recently shown promise by enforcing static or predefined policies, but they still face two key challenges: the ability to dynamically update security rules and the need for memory stream isolation. To address these challenges, we propose DRIFT, a Dynamic Rule-based Isolation Framework for Trustworthy agentic systems, which enforces both control- and data-level constraints. A Secure Planner first constructs a minimal function trajectory and a JSON-schema-style parameter checklist for each function node based on the user query. A Dynamic Validator then monitors deviations from the original plan, assessing whether changes comply with privilege limitations and the user's intent. Finally, an Injection Isolator detects and masks any instructions that may conflict with the user query from the memory stream to mitigate long-term risks. We empirically validate the effectiveness of DRIFT on the AgentDojo and ASB benchmark, demonstrating its strong security performance while maintaining high utility across diverse models, showcasing both its robustness and adaptability. The code is released at https://github.com/SaFoLab-WISC/DRIFT.
HCSep 16, 2024
Model-in-the-Loop (MILO): Accelerating Multimodal AI Data Annotation with LLMsYifan Wang, David Stevens, Pranay Shah et al.
The growing demand for AI training data has transformed data annotation into a global industry, but traditional approaches relying on human annotators are often time-consuming, labor-intensive, and prone to inconsistent quality. We propose the Model-in-the-Loop (MILO) framework, which integrates AI/ML models into the annotation process. Our research introduces a collaborative paradigm that leverages the strengths of both professional human annotators and large language models (LLMs). By employing LLMs as pre-annotation and real-time assistants, and judges on annotator responses, MILO enables effective interaction patterns between human annotators and LLMs. Three empirical studies on multimodal data annotation demonstrate MILO's efficacy in reducing handling time, improving data quality, and enhancing annotator experiences. We also introduce quality rubrics for flexible evaluation and fine-grained feedback on open-ended annotations. The MILO framework has implications for accelerating AI/ML development, reducing reliance on human annotation alone, and promoting better alignment between human and machine values.
CVFeb 5
ShapeGaussian: High-Fidelity 4D Human Reconstruction in Monocular Videos via Vision PriorsZhenxiao Liang, Ning Zhang, Youbao Tang et al.
We introduce ShapeGaussian, a high-fidelity, template-free method for 4D human reconstruction from casual monocular videos. Generic reconstruction methods lacking robust vision priors, such as 4DGS, struggle to capture high-deformation human motion without multi-view cues. While template-based approaches, primarily relying on SMPL, such as HUGS, can produce photorealistic results, they are highly susceptible to errors in human pose estimation, often leading to unrealistic artifacts. In contrast, ShapeGaussian effectively integrates template-free vision priors to achieve both high-fidelity and robust scene reconstructions. Our method follows a two-step pipeline: first, we learn a coarse, deformable geometry using pretrained models that estimate data-driven priors, providing a foundation for reconstruction. Then, we refine this geometry using a neural deformation model to capture fine-grained dynamic details. By leveraging 2D vision priors, we mitigate artifacts from erroneous pose estimation in template-based methods and employ multiple reference frames to resolve the invisibility issue of 2D keypoints in a template-free manner. Extensive experiments demonstrate that ShapeGaussian surpasses template-based methods in reconstruction accuracy, achieving superior visual quality and robustness across diverse human motions in casual monocular videos.
LGAug 13, 2024
Class-aware and Augmentation-free Contrastive Learning from Label ProportionJialiang Wang, Ning Zhang, Shimin Di et al.
Learning from Label Proportion (LLP) is a weakly supervised learning scenario in which training data is organized into predefined bags of instances, disclosing only the class label proportions per bag. This paradigm is essential for user modeling and personalization, where user privacy is paramount, offering insights into user preferences without revealing individual data. LLP faces a unique difficulty: the misalignment between bag-level supervision and the objective of instance-level prediction, primarily due to the inherent ambiguity in label proportion matching. Previous studies have demonstrated deep representation learning can generate auxiliary signals to promote the supervision level in the image domain. However, applying these techniques to tabular data presents significant challenges: 1) they rely heavily on label-invariant augmentation to establish multi-view, which is not feasible with the heterogeneous nature of tabular datasets, and 2) tabular datasets often lack sufficient semantics for perfect class distinction, making them prone to suboptimality caused by the inherent ambiguity of label proportion matching. To address these challenges, we propose an augmentation-free contrastive framework TabLLP-BDC that introduces class-aware supervision (explicitly aware of class differences) at the instance level. Our solution features a two-stage Bag Difference Contrastive (BDC) learning mechanism that establishes robust class-aware instance-level supervision by disassembling the nuance between bag label proportions, without relying on augmentations. Concurrently, our model presents a pioneering multi-task pretraining pipeline tailored for tabular-based LLP, capturing intrinsic tabular feature correlations in alignment with label proportion distribution. Extensive experiments demonstrate that TabLLP-BDC achieves state-of-the-art performance for LLP in the tabular domain.
CVAug 2, 2025Code
RMT-PPAD: Real-time Multi-task Learning for Panoptic Perception in Autonomous DrivingJiayuan Wang, Q. M. Jonathan Wu, Katsuya Suto et al.
Autonomous driving systems rely on panoptic driving perception that requires both precision and real-time performance. In this work, we propose RMT-PPAD, a real-time, transformer-based multi-task model that jointly performs object detection, drivable area segmentation, and lane line segmentation. We introduce a lightweight module, a gate control with an adapter to adaptively fuse shared and task-specific features, effectively alleviating negative transfer between tasks. Additionally, we design an adaptive segmentation decoder to learn the weights over multi-scale features automatically during the training stage. This avoids the manual design of task-specific structures for different segmentation tasks. We also identify and resolve the inconsistency between training and testing labels in lane line segmentation. This allows fairer evaluation. Experiments on the BDD100K dataset demonstrate that RMT-PPAD achieves state-of-the-art results with mAP50 of 84.9% and Recall of 95.4% for object detection, mIoU of 92.6% for drivable area segmentation, and IoU of 56.8% and accuracy of 84.7% for lane line segmentation. The inference speed reaches 32.6 FPS. Moreover, we introduce real-world scenarios to evaluate RMT-PPAD performance in practice. The results show that RMT-PPAD consistently delivers stable performance. The source codes and pre-trained models are released at https://github.com/JiayuanWang-JW/RMT-PPAD.
IVAug 28, 2024
GlaLSTM: A Concurrent LSTM Stream Framework for Glaucoma Detection via Biomarker MiningCheng Huang, Weizheng Xie, Tsengdar Lee et al.
Glaucoma is a complex group of eye diseases marked by optic nerve damage, commonly linked to elevated intraocular pressure and biomarkers like retinal nerve fiber layer thickness. Understanding how these biomarkers interact is crucial for unraveling glaucoma's underlying mechanisms. In this paper, we propose GlaLSTM, a novel concurrent LSTM stream framework for glaucoma detection, leveraging latent biomarker relationships. Unlike traditional CNN-based models that primarily detect glaucoma from images, GlaLSTM provides deeper interpretability, revealing the key contributing factors and enhancing model transparency. This approach not only improves detection accuracy but also empowers clinicians with actionable insights, facilitating more informed decision-making. Experimental evaluations confirm that GlaLSTM surpasses existing state-of-the-art methods, demonstrating its potential for both advanced biomarker analysis and reliable glaucoma detection.
CLSep 16, 2024
MGSA: Multi-Granularity Graph Structure Attention for Knowledge Graph-to-Text GenerationShanshan Wang, Chun Zhang, Ning Zhang
The Knowledge Graph-to-Text Generation task aims to convert structured knowledge graphs into coherent and human-readable natural language text. Recent efforts in this field have focused on enhancing pre-trained language models (PLMs) by incorporating graph structure information to capture the intricate structure details of knowledge graphs. However, most of these approaches tend to capture only single-granularity structure information, concentrating either on the relationships between entities within the original graph or on the relationships between words within the same entity or across different entities. This narrow focus results in a significant limitation: models that concentrate solely on entity-level structure fail to capture the nuanced semantic relationships between words, while those that focus only on word-level structure overlook the broader relationships between original entire entities. To overcome these limitations, this paper introduces the Multi-granularity Graph Structure Attention (MGSA), which is based on PLMs. The encoder of the model architecture features an entity-level structure encoding module, a word-level structure encoding module, and an aggregation module that synthesizes information from both structure. This multi-granularity structure encoding approach allows the model to simultaneously capture both entity-level and word-level structure information, providing a more comprehensive understanding of the knowledge graph's structure information, thereby significantly improving the quality of the generated text. We conducted extensive evaluations of the MGSA model using two widely recognized KG-to-Text Generation benchmark datasets, WebNLG and EventNarrative, where it consistently outperformed models that rely solely on single-granularity structure information, demonstrating the effectiveness of our approach.
94.5DCMar 21
TrEnv-X: Transparently Share Serverless Execution Environments Across Different Functions and NodesJialiang Huang, Teng Ma, Zheng Liu et al.
Serverless computing is renowned for its computation elasticity, yet its full potential is often constrained by the requirement for functions to operate within local and dedicated background environments, resulting in limited memory elasticity. To address this limitation, this paper introduces TrEnv-X, a co-designed integration of the serverless platform with the operating system and CXL/RDMA-based remote memory pools. TrEnv-X's core innovations are repurposable sandboxes, which can be shared across different functions to decrease the associated creation overhead, and OS-level memory templates, which enable rapid state restoration from CXL/RDMA-based remote memory pools. To further demonstrate TrEnv-X's versatility, we generalize its design from traditional containers for microVM-based agent workloads and introduce new optimizations, including browser sharing and a page cache bypassing mechanism. Our evaluation shows that TrEnv-X achieves up to 7x reduction in P99 latency and 48% memory savings for container-based functions. When applied to LLM agents, it reduces the P99 latency by up to 58% and memory usage by 61% compared to state-of-the-art systems like E2B.