You Wu

CV
h-index31
39papers
3,530citations
Novelty49%
AI Score61

39 Papers

CLDec 20, 2022
Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters

Boshi Wang, Sewon Min, Xiang Deng et al. · deepmind, uw

Chain-of-Thought (CoT) prompting can dramatically improve the multi-step reasoning abilities of large language models (LLMs). CoT explicitly encourages the LLM to generate intermediate rationales for solving a problem, by providing a series of reasoning steps in the demonstrations. Despite its success, there is still little understanding of what makes CoT prompting effective and which aspects of the demonstrated reasoning steps contribute to its performance. In this paper, we show that CoT reasoning is possible even with invalid demonstrations - prompting with invalid reasoning steps can achieve over 80-90% of the performance obtained using CoT under various metrics, while still generating coherent lines of reasoning during inference. Further experiments show that other aspects of the rationales, such as being relevant to the query and correctly ordering the reasoning steps, are much more important for effective CoT reasoning. Overall, these findings both deepen our understanding of CoT prompting, and open up new questions regarding LLMs' capability to learn to reason in context.

CVApr 6, 2023Code
Zero-shot Generative Model Adaptation via Image-specific Prompt Learning

Jiayi Guo, Chaofei Wang, You Wu et al. · gatech

Recently, CLIP-guided image synthesis has shown appealing performance on adapting a pre-trained source-domain generator to an unseen target domain. It does not require any target-domain samples but only the textual domain labels. The training is highly efficient, e.g., a few minutes. However, existing methods still have some limitations in the quality of generated images and may suffer from the mode collapse issue. A key reason is that a fixed adaptation direction is applied for all cross-domain image pairs, which leads to identical supervision signals. To address this issue, we propose an Image-specific Prompt Learning (IPL) method, which learns specific prompt vectors for each source-domain image. This produces a more precise adaptation direction for every cross-domain image pair, endowing the target-domain generator with greatly enhanced flexibility. Qualitative and quantitative evaluations on various domains demonstrate that IPL effectively improves the quality and diversity of synthesized images and alleviates the mode collapse. Moreover, IPL is independent of the structure of the generative model, such as generative adversarial networks or diffusion models. Code is available at https://github.com/Picsart-AI-Research/IPL-Zero-Shot-Generative-Model-Adaptation.

LGOct 8, 2023Code
Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity

Lu Yin, You Wu, Zhenyu Zhang et al.

Large Language Models (LLMs), renowned for their remarkable performance across diverse domains, present a challenge when it comes to practical deployment due to their colossal model size. In response to this challenge, efforts have been directed toward the application of traditional network pruning techniques to LLMs, uncovering a massive number of parameters that can be pruned in one-shot without hurting performance. Prevailing LLM pruning strategies have consistently adhered to the practice of uniformly pruning all layers at equivalent sparsity, resulting in robust performance. However, this observation stands in contrast to the prevailing trends observed in the field of vision models, where non-uniform layerwise sparsity typically yields stronger results. To understand the underlying reasons for this disparity, we conduct a comprehensive study and discover a strong correlation with the emergence of activation outliers in LLMs. Inspired by this finding, we introduce a novel LLM pruning methodology that incorporates a tailored set of non-uniform layerwise sparsity ratios, termed as Outlier Weighed Layerwise sparsity (OWL). The sparsity ratio of OWL is proportional to the outlier ratio observed within each layer, facilitating a more effective alignment between layerwise weight sparsity and outlier ratios. Our empirical evaluation, conducted across the LLaMA-V1 family and OPT, spanning various benchmarks, demonstrates the distinct advantages offered by OWL over previous methods. For instance, OWL exhibits a remarkable performance gain, surpassing the state-of-the-art Wanda and SparseGPT by 61.22 and 6.80 perplexity at a high sparsity level of 70%, respectively, while delivering 2.6x end-to-end inference speed-up in the DeepSparse inference engine. Codes are available at https://github.com/luuyin/OWL.

23.9LGMay 27
AdaMemento: Adaptive Memory-Assisted Policy Optimization for Reinforcement Learning

Renye Yan, Yaozhong Gan, You Wu et al.

In sparse reward scenarios of reinforcement learning (RL), the memory mechanism provides promising shortcuts to policy optimization by reflecting on past experiences like humans. However, current memory-based RL methods simply store and reuse high-value policies, lacking a deeper refining and filtering of diverse past experiences and hence limiting the capability of memory. In this paper, we propose AdaMemento, an adaptive memory-enhanced RL framework. Instead of just memorizing positive past experiences, we design a memory-reflection module that exploits both positive and negative experiences by learning to predict known local optimal policies based on real-time states. To effectively gather informative trajectories for the memory, we further introduce a fine-grained intrinsic motivation paradigm, where nuances in similar states can be precisely distinguished to guide exploration. The exploitation of past experiences and exploration of new policies are then adaptively coordinated by ensemble learning to approach the global optimum. Furthermore, we theoretically prove the superiority of our new intrinsic motivation and ensemble mechanism. From 59 quantitative and visualization experiments, we confirm that AdaMemento can distinguish subtle states for better exploration and effectively exploiting past experiences in memory, achieving significant improvement over previous methods.

61.1CLMay 29
GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs

Junjie Peng, You Wu, Haoyi Wu et al.

Large language models (LLMs) with extended context lengths rely on the key-value (KV) cache to support attention over prior tokens. However, maintaining the KV cache incurs substantial memory overhead, motivating KV-cache compression methods that enforce a fixed budget through eviction and merging. Modern eviction methods increasingly adopt span-based retention because preserving contiguous spans is empirically effective and better preserves semantic coherence. Yet, when combined with post-eviction merging, span-based retention concentrates merges onto a small set of span-boundary carrier tokens, producing a highly imbalanced merge pattern that exacerbates over-merging and increases information loss. To address this imbalance, we propose GRKV (Global Regression for KV Cache), a training-free KV-cache merging method that directly minimizes the discrepancy between compressed-cache and full-cache attention outputs. GRKV uses ridge-regression-based merge steps to distribute information from evicted tokens across retained tokens, while regularizing the updates to prevent over-smoothing. Across the LongBench and RULER long-context benchmarks, GRKV is the only merging method that improves overall performance with minimal overhead.

CVJul 7, 2024Code
Learning Motion Blur Robust Vision Transformers for Real-Time UAV Tracking

You Wu, Xucheng Wang, Dan Zeng et al.

Unmanned aerial vehicle (UAV) tracking is critical for applications like surveillance, search-and-rescue, and autonomous navigation. However, the high-speed movement of UAVs and targets introduces unique challenges, including real-time processing demands and severe motion blur, which degrade the performance of existing generic trackers. While single-stream vision transformer (ViT) architectures have shown promise in visual tracking, their computational inefficiency and lack of UAV-specific optimizations limit their practicality in this domain. In this paper, we boost the efficiency of this framework by tailoring it into an adaptive computation framework that dynamically exits Transformer blocks for real-time UAV tracking. The motivation behind this is that tracking tasks with fewer challenges can be adequately addressed using low-level feature representations. Simpler tasks can often be handled with less demanding, lower-level features. This approach allows the model use computational resources more efficiently by focusing on complex tasks and conserving resources for easier ones. Another significant enhancement introduced in this paper is the improved effectiveness of ViTs in handling motion blur, a common issue in UAV tracking caused by the fast movements of either the UAV, the tracked objects, or both. This is achieved by acquiring motion blur robust representations through enforcing invariance in the feature representation of the target with respect to simulated motion blur. We refer to our proposed approach as BDTrack. Extensive experiments conducted on four tracking benchmarks validate the effectiveness and versatility of our approach, demonstrating its potential as a practical and effective approach for real-time UAV tracking. Code is released at: https://github.com/wuyou3474/BDTrack.

CVMar 8, 2022
AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant

Benita Wong, Joya Chen, You Wu et al.

A long-standing goal of intelligent assistants such as AR glasses/robots has been to assist users in affordance-centric real-world scenarios, such as "how can I run the microwave for 1 minute?". However, there is still no clear task definition and suitable benchmarks. In this paper, we define a new task called Affordance-centric Question-driven Task Completion, where the AI assistant should learn from instructional videos to provide step-by-step help in the user's view. To support the task, we constructed AssistQ, a new dataset comprising 531 question-answer samples from 100 newly filmed instructional videos. We also developed a novel Question-to-Actions (Q2A) model to address the AQTC task and validate it on the AssistQ dataset. The results show that our model significantly outperforms several VQA-related baselines while still having large room for improvement. We expect our task and dataset to advance Egocentric AI Assistant's development. Our project page is available at: https://showlab.github.io/assistq/.

CVJul 8, 2024Code
Towards Reflected Object Detection: A Benchmark

Yiquan Wu, Zhongtian Wang, You Wu et al.

Object detection has greatly improved over the past decade thanks to advances in deep learning and large-scale datasets. However, detecting objects reflected in surfaces remains an underexplored area. Reflective surfaces are ubiquitous in daily life, appearing in homes, offices, public spaces, and natural environments. Accurate detection and interpretation of reflected objects are essential for various applications. This paper addresses this gap by introducing a extensive benchmark specifically designed for Reflected Object Detection. Our Reflected Object Detection Dataset (RODD) features a diverse collection of images showcasing reflected objects in various contexts, providing standard annotations for both real and reflected objects. This distinguishes it from traditional object detection benchmarks. RODD encompasses 10 categories and includes 21,059 images of real and reflected objects across different backgrounds, complete with standard bounding box annotations and the classification of objects as real or reflected. Additionally, we present baseline results by adapting five state-of-the-art object detection models to address this challenging task. Experimental results underscore the limitations of existing methods when applied to reflected object detection, highlighting the need for specialized approaches. By releasing RODD, we aim to support and advance future research on detecting reflected objects. Dataset and code are available at: https://github.com/jirouvan/ROD.

CVMar 26, 2025Code
Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai et al.

This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. These contributions collectively enhance the model's performance and versatility. Specifically, Wan is characterized by four key features: Leading Performance: The 14B model of Wan, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. It consistently outperforms the existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks, demonstrating a clear and significant performance superiority. Comprehensiveness: Wan offers two capable models, i.e., 1.3B and 14B parameters, for efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personal video generation, encompassing up to eight tasks. Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource efficiency, requiring only 8.19 GB VRAM, making it compatible with a wide range of consumer-grade GPUs. Openness: We open-source the entire series of Wan, including source code and all models, with the goal of fostering the growth of the video generation community. This openness seeks to significantly expand the creative possibilities of video production in the industry and provide academia with high-quality video foundation models. All the code and models are available at https://github.com/Wan-Video/Wan2.1.

LGAug 19, 2024
The Exploration-Exploitation Dilemma Revisited: An Entropy Perspective

Renye Yan, Yaozhong Gan, You Wu et al.

The imbalance of exploration and exploitation has long been a significant challenge in reinforcement learning. In policy optimization, excessive reliance on exploration reduces learning efficiency, while over-dependence on exploitation might trap agents in local optima. This paper revisits the exploration-exploitation dilemma from the perspective of entropy by revealing the relationship between entropy and the dynamic adaptive process of exploration and exploitation. Based on this theoretical insight, we establish an end-to-end adaptive framework called AdaZero, which automatically determines whether to explore or to exploit as well as their balance of strength. Experiments show that AdaZero significantly outperforms baseline models across various Atari and MuJoCo environments with only a single setting. Especially in the challenging environment of Montezuma, AdaZero boosts the final returns by up to fifteen times. Moreover, we conduct a series of visualization analyses to reveal the dynamics of our self-adaptive mechanism, demonstrating how entropy reflects and changes with respect to the agent's performance and adaptive process.

AIJul 8, 2024
AI-driven multi-omics integration for multi-scale predictive modeling of causal genotype-environment-phenotype relationships

You Wu, Lei Xie

Despite the wealth of single-cell multi-omics data, it remains challenging to predict the consequences of novel genetic and chemical perturbations in the human body. It requires knowledge of molecular interactions at all biological levels, encompassing disease models and humans. Current machine learning methods primarily establish statistical correlations between genotypes and phenotypes but struggle to identify physiologically significant causal factors, limiting their predictive power. Key challenges in predictive modeling include scarcity of labeled data, generalization across different domains, and disentangling causation from correlation. In light of recent advances in multi-omics data integration, we propose a new artificial intelligence (AI)-powered biology-inspired multi-scale modeling framework to tackle these issues. This framework will integrate multi-omics data across biological levels, organism hierarchies, and species to predict causal genotype-environment-phenotype relationships under various conditions. AI models inspired by biology may identify novel molecular targets, biomarkers, pharmaceutical agents, and personalized medicines for presently unmet medical needs.

CVAug 16, 2024
Visual-Friendly Concept Protection via Selective Adversarial Perturbations

Xiaoyue Mi, Fan Tang, You Wu et al.

Personalized concept generation by tuning diffusion models with a few images raises potential legal and ethical concerns regarding privacy and intellectual property rights. Researchers attempt to prevent malicious personalization using adversarial perturbations. However, previous efforts have mainly focused on the effectiveness of protection while neglecting the visibility of perturbations. They utilize global adversarial perturbations, which introduce noticeable alterations to original images and significantly degrade visual quality. In this work, we propose the Visual-Friendly Concept Protection (VCPro) framework, which prioritizes the protection of key concepts chosen by the image owner through adversarial perturbations with lower perceptibility. To ensure these perturbations are as inconspicuous as possible, we introduce a relaxed optimization objective to identify the least perceptible yet effective adversarial perturbations, solved using the Lagrangian multiplier method. Qualitative and quantitative experiments validate that VCPro achieves a better trade-off between the visibility of perturbations and protection effectiveness, effectively prioritizing the protection of target concepts in images with less perceptible perturbations.

97.0ROMar 14
ST-VLA: Enabling 4D-Aware Spatiotemporal Understanding for General Robot Manipulation

You Wu, Zixuan Chen, Cunxu Ou et al.

Robotic manipulation in open-world environments requires reasoning across semantics, geometry, and long-horizon action dynamics. Existing hierarchical Vision-Language-Action (VLA) frameworks typically use 2D representations to connect high-level reasoning with low-level control, but lack depth awareness and temporal consistency, limiting robustness in complex 3D scenes. We propose ST-VLA, a hierarchical VLA framework using a unified 3D-4D representation to bridge perception and action. ST-VLA converts 2D guidance into 3D trajectories and generates smooth spatial masks that capture 4D spatio-temporal context, providing a stable interface between semantic reasoning and continuous control. To enable effective learning of such representations, we introduce ST-Human, a large-scale human manipulation dataset with 14 tasks and 300k episodes, annotated with 2D, 3D, and 4D supervision via a semi-automated pipeline. Using ST-Human, we train ST-VLM, a spatio-temporal vision-language model that generates spatially grounded and temporally coherent 3D representations to guide policy execution. The smooth spatial masks focus on task-relevant geometry and stabilize latent representations, enabling online replanning and long-horizon reasoning. Experiments on RLBench and real-world manipulation tasks show that \method significantly outperforms state-of-the-art baselines, improving zero-shot success rates by 44.6% and 30.3%. These results demonstrate that offloading spatio-temporal reasoning to VLMs with unified 3D-4D representations substantially improves robustness and generalization for open-world robotic manipulation. Project website: https://oucx117.github.io/ST-VLA/.

LGJul 9, 2024
Advanced Financial Fraud Detection Using GNN-CL Model

Yu Cheng, Junjie Guo, Shiqing Long et al.

The innovative GNN-CL model proposed in this paper marks a breakthrough in the field of financial fraud detection by synergistically combining the advantages of graph neural networks (gnn), convolutional neural networks (cnn) and long short-term memory (LSTM) networks. This convergence enables multifaceted analysis of complex transaction patterns, improving detection accuracy and resilience against complex fraudulent activities. A key novelty of this paper is the use of multilayer perceptrons (MLPS) to estimate node similarity, effectively filtering out neighborhood noise that can lead to false positives. This intelligent purification mechanism ensures that only the most relevant information is considered, thereby improving the model's understanding of the network structure. Feature weakening often plagues graph-based models due to the dilution of key signals. In order to further address the challenge of feature weakening, GNN-CL adopts reinforcement learning strategies. By dynamically adjusting the weights assigned to central nodes, it reinforces the importance of these influential entities to retain important clues of fraud even in less informative data. Experimental evaluations on Yelp datasets show that the results highlight the superior performance of GNN-CL compared to existing methods.

CVApr 12, 2025Code
Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking

You Wu, Xucheng Wang, Xiangyang Yang et al.

Single-stream architectures using Vision Transformer (ViT) backbones show great potential for real-time UAV tracking recently. However, frequent occlusions from obstacles like buildings and trees expose a major drawback: these models often lack strategies to handle occlusions effectively. New methods are needed to enhance the occlusion resilience of single-stream ViT models in aerial tracking. In this work, we propose to learn Occlusion-Robust Representations (ORR) based on ViTs for UAV tracking by enforcing an invariance of the feature representation of a target with respect to random masking operations modeled by a spatial Cox process. Hopefully, this random masking approximately simulates target occlusions, thereby enabling us to learn ViTs that are robust to target occlusion for UAV tracking. This framework is termed ORTrack. Additionally, to facilitate real-time applications, we propose an Adaptive Feature-Based Knowledge Distillation (AFKD) method to create a more compact tracker, which adaptively mimics the behavior of the teacher model ORTrack according to the task's difficulty. This student model, dubbed ORTrack-D, retains much of ORTrack's performance while offering higher efficiency. Extensive experiments on multiple benchmarks validate the effectiveness of our method, demonstrating its state-of-the-art performance. Codes is available at https://github.com/wuyou3474/ORTrack.

AIJun 12, 2025Code
Reasoning RAG via System 1 or System 2: A Survey on Reasoning Agentic Retrieval-Augmented Generation for Industry Challenges

Jintao Liang, Gang Su, Huifeng Lin et al.

Retrieval-Augmented Generation (RAG) has emerged as a powerful framework to overcome the knowledge limitations of Large Language Models (LLMs) by integrating external retrieval with language generation. While early RAG systems based on static pipelines have shown effectiveness in well-structured tasks, they struggle in real-world scenarios requiring complex reasoning, dynamic retrieval, and multi-modal integration. To address these challenges, the field has shifted toward Reasoning Agentic RAG, a paradigm that embeds decision-making and adaptive tool use directly into the retrieval process. In this paper, we present a comprehensive review of Reasoning Agentic RAG methods, categorizing them into two primary systems: predefined reasoning, which follows fixed modular pipelines to boost reasoning, and agentic reasoning, where the model autonomously orchestrates tool interaction during inference. We analyze representative techniques under both paradigms, covering architectural design, reasoning strategies, and tool coordination. Finally, we discuss key research challenges and propose future directions to advance the flexibility, robustness, and applicability of reasoning agentic RAG systems. Our collection of the relevant research has been organized into a https://github.com/ByebyeMonica/Reasoning-Agentic-RAG.

51.4CVMay 15
Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models?

Renye Yan, Jikang Cheng, Shikun Sun et al.

Despite strong image-generation performance, diffusion models' reconstruction objectives limit alignment with human preferences. RL enables such alignment through explicit rewards. However, most studies apply RL to the full denoising trajectory, making it computationally costly and weakening preference alignment, i.e., doing more but achieving less. We observe that the impact of RL fine-tuning varies significantly across denoising stages. In the early stage, image structures are unstable and distant from the final reward signal. Applying RL at this stage leads to delayed rewards and action-reward mismatching, resulting in high variance and inefficient updates. Conversely, in the later stage, reward gains saturate, and continued training tends to overfit local details, intensifying reward hacking. To tackle these challenges, we propose AdaScope, an RL-enhanced plug-in that improves generation quality while reducing computational cost. Specifically, AdaScope adaptively identifies the optimal intervention timing for RL by perceiving the structural evolution and semantic consistency during denoising, and dynamically terminates training once the denoising converges and reward gains saturate. As a result, it achieves a rare 'dual benefit': a reduction in computational costs alongside a significant performance improvement. We offer theoretical grounds for the design of AdaScope. Compared with state-of-the-art methods, AdaScope improves performance by 66% while cutting computational cost by 59%.

61.4CLApr 15
YOCO++: Enhancing YOCO with KV Residual Connections for Efficient LLM Inference

You Wu, Ziheng Chen, Yizhen Zhang et al.

Cross-layer key-value (KV) compression has been found to be effective in efficient inference of large language models (LLMs). Although they reduce the memory consumption of the KV cache, such methods usually introduce non-negligible performance degradation. In this work, we aim to enhance the performance of YOCO, a cross-layer KV compression method that shares the KVs of the middle layer with the top-half layers. We propose YOCO++, an enhanced YOCO that incorporates a weighted residual connection between the KVs of each bottom-half layer and the bottom layer. Compared to YOCO, YOCO++ increases model capacity while maintaining the same training and inference efficiency. Our experiments show that YOCO++ achieves state-of-the-art performance among the cross-layer KV compression methods at a 50% KV cache compression rate, outperforming the standard Transformer.

LGFeb 15, 2025Code
Superpose Task-specific Features for Model Merging

Haiquan Qiu, You Wu, Dong Li et al.

Model merging enables powerful capabilities in neural networks without requiring additional training. In this paper, we introduce a novel perspective on model merging by leveraging the fundamental mechanisms of neural network representation. Our approach is motivated by the linear representation hypothesis, which states that neural networks encode information through linear combinations of feature vectors. We propose a method that superposes task-specific features from individual models into a merged model. Our approach specifically targets linear transformation matrices, which are crucial for feature activation and extraction in deep networks. By formulating the merging process as a linear system, we can preserve task-specific features from individual models and create merged models that effectively maintain multi-task capabilities compared to existing methods. Extensive experiments across diverse benchmarks and models demonstrate that our method outperforms existing techniques. Code is available at https://github.com/LARS-research/STF.

CVDec 28, 2024Code
Learning an Adaptive and View-Invariant Vision Transformer for Real-Time UAV Tracking

You Wu, Yongxin Li, Mengyuan Liu et al.

Transformer-based models have improved visual tracking, but most still cannot run in real time on resource-limited devices, especially for unmanned aerial vehicle (UAV) tracking. To achieve a better balance between performance and efficiency, we propose AVTrack, an adaptive computation tracking framework that adaptively activates transformer blocks through an Activation Module (AM), which dynamically optimizes the ViT architecture by selectively engaging relevant components. To address extreme viewpoint variations, we propose to learn view-invariant representations via mutual information (MI) maximization. In addition, we propose AVTrack-MD, an enhanced tracker incorporating a novel MI maximization-based multi-teacher knowledge distillation framework. Leveraging multiple off-the-shelf AVTrack models as teachers, we maximize the MI between their aggregated softened features and the corresponding softened feature of the student model, improving the generalization and performance of the student, especially under noisy conditions. Extensive experiments show that AVTrack-MD achieves performance comparable to AVTrack's performance while reducing model complexity and boosting average tracking speed by over 17\%. Codes is available at: https://github.com/wuyou3474/AVTrack.

CVDec 1, 2024Code
MambaNUT: Nighttime UAV Tracking via Mamba-based Adaptive Curriculum Learning

You Wu, Xiangyang Yang, Xucheng Wang et al.

Harnessing low-light enhancement and domain adaptation, nighttime UAV tracking has made substantial strides. However, over-reliance on image enhancement, limited high-quality nighttime data, and a lack of integration between daytime and nighttime trackers hinder the development of an end-to-end trainable framework. Additionally, current ViT-based trackers demand heavy computational resources due to their reliance on the self-attention mechanism. In this paper, we propose a novel pure Mamba-based tracking framework (MambaNUT) that employs a state space model with linear complexity as its backbone, incorporating a single-stream architecture that integrates feature learning and template-search coupling within Vision Mamba. We introduce an adaptive curriculum learning (ACL) approach that dynamically adjusts sampling strategies and loss weights, thereby improving the model's ability of generalization. Our ACL is composed of two levels of curriculum schedulers: (1) sampling scheduler that transforms the data distribution from imbalanced to balanced, as well as from easier (daytime) to harder (nighttime) samples; (2) loss scheduler that dynamically assigns weights based on the size of the training set and IoU of individual instances. Exhaustive experiments on multiple nighttime UAV tracking benchmarks demonstrate that the proposed MambaNUT achieves state-of-the-art performance while requiring lower computational costs. The code will be available at https://github.com/wuyou3474/MambaNUT.

CVJun 12, 2024Code
Adaptively Bypassing Vision Transformer Blocks for Efficient Visual Tracking

Xiangyang Yang, Dan Zeng, Xucheng Wang et al.

Empowered by transformer-based models, visual tracking has advanced significantly. However, the slow speed of current trackers limits their applicability on devices with constrained computational resources. To address this challenge, we introduce ABTrack, an adaptive computation framework that adaptively bypassing transformer blocks for efficient visual tracking. The rationale behind ABTrack is rooted in the observation that semantic features or relations do not uniformly impact the tracking task across all abstraction levels. Instead, this impact varies based on the characteristics of the target and the scene it occupies. Consequently, disregarding insignificant semantic features or relations at certain abstraction levels may not significantly affect the tracking accuracy. We propose a Bypass Decision Module (BDM) to determine if a transformer block should be bypassed, which adaptively simplifies the architecture of ViTs and thus speeds up the inference process. To counteract the time cost incurred by the BDMs and further enhance the efficiency of ViTs, we introduce a novel ViT pruning method to reduce the dimension of the latent representation of tokens in each transformer block. Extensive experiments on multiple tracking benchmarks validate the effectiveness and generality of the proposed method and show that it achieves state-of-the-art performance. Code is released at: https://github.com/xyyang317/ABTrack.

LGFeb 12
SpiralFormer: Looped Transformers Can Learn Hierarchical Dependencies via Multi-Resolution Recursion

Chengting Yu, Xiaobo Shu, Yadao Wang et al.

Recursive (looped) Transformers decouple computational depth from parameter depth by repeatedly applying shared layers, providing an explicit architectural primitive for iterative refinement and latent reasoning. However, early looped Transformers often underperform non-recursive baselines of equal compute. While recent literature has introduced more effective recursion mechanisms to mitigate this gap, existing architectures still operate at a fixed, full-token resolution, neglecting the potential efficiency of computing over compressed latent representations. In this paper, we propose SpiralFormer, a looped Transformer that executes recurrence under a multi-resolution recursion schedule. We provide probing evidence that multi-resolution recursion enables the model to learn hierarchical dependencies by inducing iteration-wise functional specialization across different scales. Empirically, SpiralFormer achieves better parameter and compute efficiency than both looped and non-looped baselines across model scales from 160M to 1.4B, establishing sequence resolution as a potential axis for scaling recursive architectures.

68.0LGApr 29
Exploring the Potential of Probabilistic Transformer for Time Series Modeling: A Report on the ST-PT Framework

Zhangzhi Xiong, Haoyi Wu, You Wu et al.

The Probabilistic Transformer (PT) establishes that the Transformer's self-attention plus its feed-forward block is mathematically equivalent to Mean-Field Variational Inference (MFVI) on a Conditional Random Field (CRF). Under this equivalence the Transformer ceases to be a black-box neural network and becomes a programmable factor graph: graph topology, factor potentials, and the message-passing schedule are all explicit and inspectable primitives that can be engineered. PT was originally developed for natural language and in this report we investigate its potential for time series. We first lift PT into the Spatial-Temporal Probabilistic Transformer (ST-PT) to repair PT's missing channel axis and weak per-step semantics, and adopt ST-PT as a shared cornerstone backbone. We then identify three distinct properties that PT/ST-PT offers as a factor-graph model and derive three Research Questions, one per property, that probe how each property can be exploited in time series: RQ1. The graph topology and potentials are direct programmable primitives. Can this be used to inject symbolic time-series priors into ST-PT through structural graph modifications, especially under data scarcity and noise? RQ2. The CRF's factor matrices are the operator's potentials. Can an external condition program these factor matrices on a per-sample basis, so that conditional generation becomes structural rather than feature-level modulation of a fixed one? RQ3. Each MFVI iteration is a Bayesian posterior update on the factor graph. Can this turn the latent transition of latent-space AutoRegressive (AR) forecasting from an opaque MLP into a principled posterior update, and can a CRF teacher distill its latents into the AR student to counter cumulative error? We give one empirical study per question. Together, these three studies position ST-PT as a programmable framework for time-series modeling.

RMDec 4, 2024
Advanced Risk Prediction and Stability Assessment of Banks Using Time Series Transformer Models

Wenying Sun, Zhen Xu, Wenqing Zhang et al.

This paper aims to study the prediction of the bank stability index based on the Time Series Transformer model. The bank stability index is an important indicator to measure the health status and risk resistance of financial institutions. Traditional prediction methods are difficult to adapt to complex market changes because they rely on single-dimensional macroeconomic data. This paper proposes a prediction framework based on the Time Series Transformer, which uses the self-attention mechanism of the model to capture the complex temporal dependencies and nonlinear relationships in financial data. Through experiments, we compare the model with LSTM, GRU, CNN, TCN and RNN-Transformer models. The experimental results show that the Time Series Transformer model outperforms other models in both mean square error (MSE) and mean absolute error (MAE) evaluation indicators, showing strong prediction ability. This shows that the Time Series Transformer model can better handle multidimensional time series data in bank stability prediction, providing new technical approaches and solutions for financial risk management.

34.1CRApr 25
Branch Landing: Bloom Filter-Based Source Authorization for Forward-Edge CFI on RISC-V

You Wu, Peter Beerel

Jump-Oriented Programming (JOP) attacks exploit indirect control transfers to bypass backward-edge defenses, yet existing forward-edge CFI mechanisms lack precise source-domain authorization: type-based CFI admits all same-signature callers, while tag-based hardware CFI is limited by fixed-width register storage that caps the number of simultaneously authorized sources. We propose Branch Landing (BRL), a landing-based forward-edge CFI framework for RISC-V that replaces fixed-capacity checks with Bloom filter membership queries. Two lightweight ISA extensions, bld and brl, propagate a source Section Identifier (SID) through a dedicated BRState register and validate it at each landing site with fixed-probe latency that is independent of the number of authorized sources under a chosen filter configuration. Section granularity is configurable, supporting policies from type-based to CFG-derived authorization within a single mechanism. We implement Branch Landing in the LLVM RISC-V backend and evaluate it on 81 BEEBS benchmarks under two representative policy configurations: a function-level, type-based policy and a basic-block-level, CFG-derived policy. Under a 3-cycle brl latency model, the two configurations incur average runtime overheads of only 0.210% and 0.421%, with mean code size growth of 0.46% and 0.52% respectively. The CFG-derived policy reduces the average equivalence class size by 32.5% compared to the type-based policy, and all evaluated executions complete without BRL enforcement failures.

LGDec 13, 2024
Integrative Analysis of Financial Market Sentiment Using CNN and GRU for Risk Prediction and Alert Systems

You Wu, Mengfang Sun, Hongye Zheng et al.

This document presents an in-depth examination of stock market sentiment through the integration of Convolutional Neural Networks (CNN) and Gated Recurrent Units (GRU), enabling precise risk alerts. The robust feature extraction capability of CNN is utilized to preprocess and analyze extensive network text data, identifying local features and patterns. The extracted feature sequences are then input into the GRU model to understand the progression of emotional states over time and their potential impact on future market sentiment and risk. This approach addresses the order dependence and long-term dependencies inherent in time series data, resulting in a detailed analysis of stock market sentiment and effective early warnings of future risks.

CLOct 18, 2024
A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference

You Wu, Haoyi Wu, Kewei Tu

Recently, sharing key-value (KV) cache across layers has been found effective in efficient inference of large language models (LLMs). To systematically investigate different techniques of cross-layer KV sharing, we propose a unified framework that covers several recent methods and their novel variants. We conduct comprehensive experiments on all the configurations of the framework, evaluating their generation throughput and performance in language modeling and downstream tasks. We find that when reducing the size of the KV cache by 2$\times$, most configurations can achieve higher throughput than standard transformers while maintaining competitive performance. When further reducing the size of the KV cache, however, pairing queries of all layers with KVs of upper layers performs better, at the expense of additional training cost and prefilling latency. We hope that this work will help users make more informed choices of cross-layer KV sharing approaches and facilitate future research on efficient LLM inference.

CVMar 29, 2024
U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation

You Wu, Kean Liu, Xiaoyue Mi et al.

Concept personalization methods enable large text-to-image models to learn specific subjects (e.g., objects/poses/3D models) and synthesize renditions in new contexts. Given that the image references are highly biased towards visual attributes, state-of-the-art personalization models tend to overfit the whole subject and cannot disentangle visual characteristics in pixel space. In this study, we proposed a more challenging setting, namely fine-grained visual appearance personalization. Different from existing methods, we allow users to provide a sentence describing the desired attributes. A novel decoupled self-augmentation strategy is proposed to generate target-related and non-target samples to learn user-specified visual attributes. These augmented data allow for refining the model's understanding of the target attribute while mitigating the impact of unrelated attributes. At the inference stage, adjustments are conducted on semantic space through the learned target and non-target embeddings to further enhance the disentanglement of target attributes. Extensive experiments on various kinds of visual attributes with SOTA personalization methods show the ability of the proposed method to mimic target visual appearance in novel contexts, thus improving the controllability and flexibility of personalization.

CVApr 28, 2024
Tracking Transforming Objects: A Benchmark

You Wu, Yuelong Wang, Yaxin Liao et al.

Tracking transforming objects holds significant importance in various fields due to the dynamic nature of many real-world scenarios. By enabling systems accurately represent transforming objects over time, tracking transforming objects facilitates advancements in areas such as autonomous systems, human-computer interaction, and security applications. Moreover, understanding the behavior of transforming objects provides valuable insights into complex interactions or processes, contributing to the development of intelligent systems capable of robust and adaptive perception in dynamic environments. However, current research in the field mainly focuses on tracking generic objects. In this study, we bridge this gap by collecting a novel dedicated Dataset for Tracking Transforming Objects, called DTTO, which contains 100 sequences, amounting to approximately 9.3K frames. We provide carefully hand-annotated bounding boxes for each frame within these sequences, making DTTO the pioneering benchmark dedicated to tracking transforming objects. We thoroughly evaluate 20 state-of-the-art trackers on the benchmark, aiming to comprehend the performance of existing methods and provide a comparison for future research on DTTO. With the release of DTTO, our goal is to facilitate further research and applications related to tracking transforming objects.

AIApr 3, 2025
Affordable AI Assistants with Knowledge Graph of Thoughts

Maciej Besta, Lorenzo Paleari, Jia Hao Andrea Jiang et al.

Large Language Models (LLMs) are revolutionizing the development of AI assistants capable of performing diverse tasks across domains. However, current state-of-the-art LLM-driven agents face significant challenges, including high operational costs and limited success rates on complex benchmarks like GAIA. To address these issues, we propose Knowledge Graph of Thoughts (KGoT), an innovative AI assistant architecture that integrates LLM reasoning with dynamically constructed knowledge graphs (KGs). KGoT extracts and structures task-relevant knowledge into a dynamic KG representation, iteratively enhanced through external tools such as math solvers, web crawlers, and Python scripts. Such structured representation of task-relevant knowledge enables low-cost models to solve complex tasks effectively while also minimizing bias and noise. For example, KGoT achieves a 29% improvement in task success rates on the GAIA benchmark compared to Hugging Face Agents with GPT-4o mini. Moreover, harnessing a smaller model dramatically reduces operational costs by over 36x compared to GPT-4o. Improvements for other models (e.g., Qwen2.5-32B and Deepseek-R1-70B) and benchmarks (e.g., SimpleQA) are similar. KGoT offers a scalable, affordable, versatile, and high-performing solution for AI assistants.

81.1SYMar 13
From AI Weather Prediction to Infrastructure Resilience: A Correction-Downscaling Framework for Tropical Cyclone Impacts

You Wu, Zhenguo Wang, Naiyu Wang

This paper addresses a missing capability in infrastructure resilience: turning fast, global AI weather forecasts into asset-scale, actionable risk. We introduce the AI-based Correction-Downscaling Framework (ACDF), which transforms coarse AI weather prediction (AIWP) into 500-m, unbiased wind fields and transmission tower/line failure probabilities for tropical cyclones. ACDF separates storm-scale bias correction from terrain-aware downscaling, preventing error propagation while restoring sub-kilometer variability that governs structural loading. Tested on 11 typhoons affecting Zhejiang, China under leave-one-storm-out evaluation, ACDF reduces station-scale wind-speed MAE by 38.8% versus Pangu-Weather, matches observation-assimilated mesoscale analyses, yet runs in 25 s per 12-h cycle on a single GPU. In the Typhoon Hagupit case, ACDF reproduced observed high-wind tails, isolated a coastal high-risk corridor, and flagged the line that failed, demonstrating actionable guidance at tower and line scales. ACDF provides an end-to-end pathway from AI global forecasts to operational, impact-based early warning for critical infrastructure.

LGOct 5, 2025
Spectral Alignment as Predictor of Loss Explosion in Neural Network Training

Haiquan Qiu, You Wu, Yingjie Tan et al.

Loss explosions in training deep neural networks can nullify multi-million dollar training runs. Conventional monitoring metrics like weight and gradient norms are often lagging and ambiguous predictors, as their values vary dramatically across different models and even between layers of the same model, making it difficult to establish a unified standard for detecting impending failure. We introduce Spectral Alignment (SA), a novel, theoretically-grounded metric that monitors the distributional alignment between layer inputs and the principal singular vectors of weight matrices. We show that a collapse in the sign diversity of this alignment is a powerful early predictor of representational collapse and training divergence. Empirical results on language models demonstrate that monitoring the SA distribution provides a significantly earlier and clearer warning of loss explosions than traditional scalar metrics. SA's low computational overhead makes it a practical tool for safeguarding model training.

CVAug 27, 2025
Gradient Rectification for Robust Calibration under Distribution Shift

Yilin Zhang, Cai Xu, You Wu et al.

Deep neural networks often produce overconfident predictions, undermining their reliability in safety-critical applications. This miscalibration is further exacerbated under distribution shift, where test data deviates from the training distribution due to environmental or acquisition changes. While existing approaches improve calibration through training-time regularization or post-hoc adjustment, their reliance on access to or simulation of target domains limits their practicality in real-world scenarios. In this paper, we propose a novel calibration framework that operates without access to target domain information. From a frequency-domain perspective, we identify that distribution shifts often distort high-frequency visual cues exploited by deep models, and introduce a low-frequency filtering strategy to encourage reliance on domain-invariant features. However, such information loss may degrade In-Distribution (ID) calibration performance. Therefore, we further propose a gradient-based rectification mechanism that enforces ID calibration as a hard constraint during optimization. Experiments on synthetic and real-world shifted datasets, including CIFAR-10/100-C and WILDS, demonstrate that our method significantly improves calibration under distribution shift while maintaining strong in-distribution performance.

CYJul 25, 2025
Programmable Virtual Humans Toward Human Physiologically-Based Drug Discovery

You Wu, Philip E. Bourne, Lei Xie

Artificial intelligence (AI) has sparked immense interest in drug discovery, but most current approaches only digitize existing high-throughput experiments. They remain constrained by conventional pipelines. As a result, they do not address the fundamental challenges of predicting drug effects in humans. Similarly, biomedical digital twins, largely grounded in real-world data and mechanistic models, are tailored for late-phase drug development and lack the resolution to model molecular interactions or their systemic consequences, limiting their impact in early-stage discovery. This disconnect between early discovery and late development is one of the main drivers of high failure rates in drug discovery. The true promise of AI lies not in augmenting current experiments but in enabling virtual experiments that are impossible in the real world: testing novel compounds directly in silico in the human body. Recent advances in AI, high-throughput perturbation assays, and single-cell and spatial omics across species now make it possible to construct programmable virtual humans: dynamic, multiscale models that simulate drug actions from molecular to phenotypic levels. By bridging the translational gap, programmable virtual humans offer a transformative path to optimize therapeutic efficacy and safety earlier than ever before. This perspective introduces the concept of programmable virtual humans, explores their roles in a new paradigm of drug discovery centered on human physiology, and outlines key opportunities, challenges, and roadmaps for their realization.

CLSep 10, 2021
ReasonBERT: Pre-trained to Reason with Distant Supervision

Xiang Deng, Yu Su, Alyssa Lees et al.

We present ReasonBert, a pre-training method that augments language models with the ability to reason over long-range relations and multiple, possibly hybrid contexts. Unlike existing pre-training methods that only harvest learning signals from local contexts of naturally occurring texts, we propose a generalized notion of distant supervision to automatically connect multiple pieces of text and tables to create pre-training examples that require long-range reasoning. Different types of reasoning are simulated, including intersecting multiple pieces of evidence, bridging from one piece of evidence to another, and detecting unanswerable cases. We conduct a comprehensive evaluation on a variety of extractive question answering datasets ranging from single-hop to multi-hop and from text-only to table-only to hybrid that require various reasoning capabilities and show that ReasonBert achieves remarkable improvement over an array of strong baselines. Few-shot experiments further demonstrate that our pre-training method substantially improves sample efficiency.

ARJun 30, 2020
RCP: A Low-overhead Reversible Coherence Protocol

You Wu, Xuehai Qian

This paper proposes RCP, a new reversible coherence protocol that ensures invisible speculative load execution (ISLE) with low overhead. RCP can be combined with processor mechanisms that eliminate the effects of speculative instructions on other instructions to achieve low overhead invisible speculative execution (ISE). ISE provides protection that is at least as strong as speculative privacy tracking (SPT) and stronger than speculative taint tracking (STT). RCP is designed by systematically extending the existing coherence protocol to incorporate speculative loads and states. The protocol is implemented in Gem5 and verified with Murphi. The results show that RCP based ISE incurs lower overhead than STT/SDO/SPT while providing similar/stronger protection.

IRJun 26, 2020
TURL: Table Understanding through Representation Learning

Xiang Deng, Huan Sun, Alyssa Lees et al.

Relational tables on the Web store a vast amount of knowledge. Owing to the wealth of such tables, there has been tremendous progress on a variety of tasks in the area of table understanding. However, existing work generally relies on heavily-engineered task-specific features and model architectures. In this paper, we present TURL, a novel framework that introduces the pre-training/fine-tuning paradigm to relational Web tables. During pre-training, our framework learns deep contextualized representations on relational tables in an unsupervised manner. Its universal model design with pre-trained representations can be applied to a wide range of tasks with minimal task-specific fine-tuning. Specifically, we propose a structure-aware Transformer encoder to model the row-column structure of relational tables, and present a new Masked Entity Recovery (MER) objective for pre-training to capture the semantics and knowledge in large-scale unlabeled data. We systematically evaluate TURL with a benchmark consisting of 6 different tasks for table understanding (e.g., relation extraction, cell filling). We show that TURL generalizes well to all tasks and substantially outperforms existing methods in almost all instances.

CLJan 26, 2020
Generating Representative Headlines for News Stories

Xiaotao Gu, Yuning Mao, Jiawei Han et al.

Millions of news articles are published online every day, which can be overwhelming for readers to follow. Grouping articles that are reporting the same event into news stories is a common way of assisting readers in their news consumption. However, it remains a challenging research problem to efficiently and effectively generate a representative headline for each story. Automatic summarization of a document set has been studied for decades, while few studies have focused on generating representative headlines for a set of articles. Unlike summaries, which aim to capture most information with least redundancy, headlines aim to capture information jointly shared by the story articles in short length, and exclude information that is too specific to each individual article. In this work, we study the problem of generating representative headlines for news stories. We develop a distant supervision approach to train large-scale generation models without any human annotation. This approach centers on two technical components. First, we propose a multi-level pre-training framework that incorporates massive unlabeled corpus with different quality-vs.-quantity balance at different levels. We show that models trained within this framework outperform those trained with pure human curated corpus. Second, we propose a novel self-voting-based article attention layer to extract salient information shared by multiple articles. We show that models that incorporate this layer are robust to potential noises in news stories and outperform existing baselines with or without noises. We can further enhance our model by incorporating human labels, and we show our distant supervision approach significantly reduces the demand on labeled data.