Yulong Wang

CV
h-index117
39papers
4,144citations
Novelty55%
AI Score61

39 Papers

LGMar 11, 2023
Adversarial Attacks and Defenses in Machine Learning-Powered Networks: A Contemporary Survey

Yulong Wang, Tong Sun, Shenghong Li et al.

Adversarial attacks and defenses in machine learning and deep neural network have been gaining significant attention due to the rapidly growing applications of deep learning in the Internet and relevant scenarios. This survey provides a comprehensive overview of the recent advancements in the field of adversarial attack and defense techniques, with a focus on deep neural network-based classification models. Specifically, we conduct a comprehensive classification of recent adversarial attack methods and state-of-the-art adversarial defense techniques based on attack principles, and present them in visually appealing tables and tree diagrams. This is based on a rigorous evaluation of the existing works, including an analysis of their strengths and limitations. We also categorize the methods into counter-attack detection and robustness enhancement, with a specific focus on regularization-based methods for enhancing robustness. New avenues of attack are also explored, including search-based, decision-based, drop-based, and physical-world attacks, and a hierarchical classification of the latest defense methods is provided, highlighting the challenges of balancing training costs with performance, maintaining clean accuracy, overcoming the effect of gradient masking, and ensuring method transferability. At last, the lessons learned and open challenges are summarized with future research opportunities recommended.

CLMay 21, 2025
Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought

Tencent Hunyuan Team, Ao Liu, Botong Zhou et al. · tencent-ai

As Large Language Models (LLMs) rapidly advance, we introduce Hunyuan-TurboS, a novel large hybrid Transformer-Mamba Mixture of Experts (MoE) model. It synergistically combines Mamba's long-sequence processing efficiency with Transformer's superior contextual understanding. Hunyuan-TurboS features an adaptive long-short chain-of-thought (CoT) mechanism, dynamically switching between rapid responses for simple queries and deep "thinking" modes for complex problems, optimizing computational resources. Architecturally, this 56B activated (560B total) parameter model employs 128 layers (Mamba2, Attention, FFN) with an innovative AMF/MF block pattern. Faster Mamba2 ensures linear complexity, Grouped-Query Attention minimizes KV cache, and FFNs use an MoE structure. Pre-trained on 16T high-quality tokens, it supports a 256K context length and is the first industry-deployed large-scale Mamba model. Our comprehensive post-training strategy enhances capabilities via Supervised Fine-Tuning (3M instructions), a novel Adaptive Long-short CoT Fusion method, Multi-round Deliberation Learning for iterative improvement, and a two-stage Large-scale Reinforcement Learning process targeting STEM and general instruction-following. Evaluations show strong performance: overall top 7 rank on LMSYS Chatbot Arena with a score of 1356, outperforming leading models like Gemini-2.0-Flash-001 (1352) and o4-mini-2025-04-16 (1345). TurboS also achieves an average of 77.9% across 23 automated benchmarks. Hunyuan-TurboS balances high performance and efficiency, offering substantial capabilities at lower inference costs than many reasoning models, establishing a new paradigm for efficient large-scale pre-trained models.

CVOct 9, 2023
SocialCircle: Learning the Angle-based Social Interaction Representation for Pedestrian Trajectory Prediction

Conghao Wong, Beihao Xia, Ziqian Zou et al.

Analyzing and forecasting trajectories of agents like pedestrians and cars in complex scenes has become more and more significant in many intelligent systems and applications. The diversity and uncertainty in socially interactive behaviors among a rich variety of agents make this task more challenging than other deterministic computer vision tasks. Researchers have made a lot of efforts to quantify the effects of these interactions on future trajectories through different mathematical models and network structures, but this problem has not been well solved. Inspired by marine animals that localize the positions of their companions underwater through echoes, we build a new anglebased trainable social interaction representation, named SocialCircle, for continuously reflecting the context of social interactions at different angular orientations relative to the target agent. We validate the effect of the proposed SocialCircle by training it along with several newly released trajectory prediction models, and experiments show that the SocialCircle not only quantitatively improves the prediction performance, but also qualitatively helps better simulate social interactions when forecasting pedestrian trajectories in a way that is consistent with human intuitions.

CLJul 10, 2024
Flooding Spread of Manipulated Knowledge in LLM-Based Multi-Agent Communities

Tianjie Ju, Yiting Wang, Xinbei Ma et al.

The rapid adoption of large language models (LLMs) in multi-agent systems has highlighted their impressive capabilities in various applications, such as collaborative problem-solving and autonomous negotiation. However, the security implications of these LLM-based multi-agent systems have not been thoroughly investigated, particularly concerning the spread of manipulated knowledge. In this paper, we investigate this critical issue by constructing a detailed threat model and a comprehensive simulation environment that mirrors real-world multi-agent deployments in a trusted platform. Subsequently, we propose a novel two-stage attack method involving Persuasiveness Injection and Manipulated Knowledge Injection to systematically explore the potential for manipulated knowledge (i.e., counterfactual and toxic knowledge) spread without explicit prompt manipulation. Our method leverages the inherent vulnerabilities of LLMs in handling world knowledge, which can be exploited by attackers to unconsciously spread fabricated information. Through extensive experiments, we demonstrate that our attack method can successfully induce LLM-based agents to spread both counterfactual and toxic knowledge without degrading their foundational capabilities during agent communication. Furthermore, we show that these manipulations can persist through popular retrieval-augmented generation frameworks, where several benign agents store and retrieve manipulated chat histories for future interactions. This persistence indicates that even after the interaction has ended, the benign agents may continue to be influenced by manipulated knowledge. Our findings reveal significant security risks in LLM-based multi-agent systems, emphasizing the imperative need for robust defenses against manipulated knowledge spread, such as introducing ``guardian'' agents and advanced fact-checking tools.

CVAug 19, 2022
Dispersed Pixel Perturbation-based Imperceptible Backdoor Trigger for Image Classifier Models

Yulong Wang, Minghui Zhao, Shenghong Li et al.

Typical deep neural network (DNN) backdoor attacks are based on triggers embedded in inputs. Existing imperceptible triggers are computationally expensive or low in attack success. In this paper, we propose a new backdoor trigger, which is easy to generate, imperceptible, and highly effective. The new trigger is a uniformly randomly generated three-dimensional (3D) binary pattern that can be horizontally and/or vertically repeated and mirrored and superposed onto three-channel images for training a backdoored DNN model. Dispersed throughout an image, the new trigger produces weak perturbation to individual pixels, but collectively holds a strong recognizable pattern to train and activate the backdoor of the DNN. We also analytically reveal that the trigger is increasingly effective with the improving resolution of the images. Experiments are conducted using the ResNet-18 and MLP models on the MNIST, CIFAR-10, and BTSR datasets. In terms of imperceptibility, the new trigger outperforms existing triggers, such as BadNets, Trojaned NN, and Hidden Backdoor, by over an order of magnitude. The new trigger achieves an almost 100% attack success rate, only reduces the classification accuracy by less than 0.7%-2.4%, and invalidates the state-of-the-art defense techniques.

73.2CVMar 30Code
Hg-I2P: Bridging Modalities for Generalizable Image-to-Point-Cloud Registration via Heterogeneous Graphs

Pei An, Junfeng Ding, Jiaqi Yang et al.

Image-to-point-cloud (I2P) registration aims to align 2D images with 3D point clouds by establishing reliable 2D-3D correspondences. The drastic modality gap between images and point clouds makes it challenging to learn features that are both discriminative and generalizable, leading to severe performance drops in unseen scenarios. We address this challenge by introducing a heterogeneous graph that enables refining both cross-modal features and correspondences within a unified architecture. The proposed graph represents a mapping between segmented 2D and 3D regions, which enhances cross-modal feature interaction and thus improves feature discriminability. In addition, modeling the consistency among vertices and edges within the graph enables pruning of unreliable correspondences. Building on these insights, we propose a heterogeneous graph embedded I2P registration method, termed Hg-I2P. It learns a heterogeneous graph by mining multi-path feature relationships, adapts features under the guidance of heterogeneous edges, and prunes correspondences using graph-based projection consistency. Experiments on six indoor and outdoor benchmarks under cross-domain setups demonstrate that Hg-I2P significantly outperforms existing methods in both generalization and accuracy. Code is released on https://github.com/anpei96/hg-i2p-demo.

AIJul 15, 2024
Sibyl: Simple yet Effective Agent Framework for Complex Real-world Reasoning

Yulong Wang, Tianhao Shen, Lifeng Liu et al.

Existing agents based on large language models (LLMs) demonstrate robust problem-solving capabilities by integrating LLMs' inherent knowledge, strong in-context learning and zero-shot capabilities, and the use of tools combined with intricately designed LLM invocation workflows by humans. However, these agents still exhibit shortcomings in long-term reasoning and under-use the potential of existing tools, leading to noticeable deficiencies in complex real-world reasoning scenarios. To address these limitations, we introduce Sibyl, a simple yet powerful LLM-based agent framework designed to tackle complex reasoning tasks by efficiently leveraging a minimal set of tools. Drawing inspiration from Global Workspace Theory, Sibyl incorporates a global workspace to enhance the management and sharing of knowledge and conversation history throughout the system. Furthermore, guided by Society of Mind Theory, Sibyl implements a multi-agent debate-based jury to self-refine the final answers, ensuring a comprehensive and balanced approach. This approach aims to reduce system complexity while expanding the scope of problems solvable-from matters typically resolved by humans in minutes to those requiring hours or even days, thus facilitating a shift from System-1 to System-2 thinking. Sibyl has been designed with a focus on scalability and ease of debugging by incorporating the concept of reentrancy from functional programming from its inception, with the aim of seamless and low effort integration in other LLM applications to improve capabilities. Our experimental results on the GAIA benchmark test set reveal that the Sibyl agent instantiated with GPT-4 achieves state-of-the-art performance with an average score of 34.55%, compared to other agents based on GPT-4. We hope that Sibyl can inspire more reliable and reusable LLM-based agent solutions to address complex real-world reasoning tasks.

MEMar 9, 2022
Error-based Knockoffs Inference for Controlled Feature Selection

Xuebin Zhao, Hong Chen, Yingjie Wang et al.

Recently, the scheme of model-X knockoffs was proposed as a promising solution to address controlled feature selection under high-dimensional finite-sample settings. However, the procedure of model-X knockoffs depends heavily on the coefficient-based feature importance and only concerns the control of false discovery rate (FDR). To further improve its adaptivity and flexibility, in this paper, we propose an error-based knockoff inference method by integrating the knockoff features, the error-based feature importance statistics, and the stepdown procedure together. The proposed inference procedure does not require specifying a regression model and can handle feature selection with theoretical guarantees on controlling false discovery proportion (FDP), FDR, or k-familywise error rate (k-FWER). Empirical evaluations demonstrate the competitive performance of our approach on both simulated and real data.

CLFeb 8, 2024Code
On the Robustness of Editing Large Language Models

Xinbei Ma, Tianjie Ju, Jiyang Qiu et al.

Large language models (LLMs) have played a pivotal role in building communicative AI, yet they encounter the challenge of efficient updates. Model editing enables the manipulation of specific knowledge memories and the behavior of language generation without retraining. However, the robustness of model editing remains an open question. This work seeks to understand the strengths and limitations of editing methods, facilitating practical applications of communicative AI. We focus on three key research questions. RQ1: Can edited LLMs behave consistently resembling communicative AI in realistic situations? RQ2: To what extent does the rephrasing of prompts lead LLMs to deviate from the edited knowledge memory? RQ3: Which knowledge features are correlated with the performance and robustness of editing? Our empirical studies uncover a substantial disparity between existing editing methods and the practical application of LLMs. On rephrased prompts that are flexible but common in realistic applications, the performance of editing experiences a significant decline. Further analysis shows that more popular knowledge is memorized better, easier to recall, and more challenging to edit effectively. Code is publicly available at https://github.com/xbmxb/edit_analysis .

CLJul 7, 2025
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann et al. · amazon-science, baidu

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

LGSep 28, 2022
Global Weighted Tensor Nuclear Norm for Tensor Robust Principal Component Analysis

Libin Wang, Yulong Wang, Shiyuan Wang et al.

Tensor Robust Principal Component Analysis (TRPCA), which aims to recover a low-rank tensor corrupted by sparse noise, has attracted much attention in many real applications. This paper develops a new Global Weighted TRPCA method (GWTRPCA), which is the first approach simultaneously considers the significance of intra-frontal slice and inter-frontal slice singular values in the Fourier domain. Exploiting this global information, GWTRPCA penalizes the larger singular values less and assigns smaller weights to them. Hence, our method can recover the low-tubal-rank components more exactly. Moreover, we propose an effective adaptive weight learning strategy by a Modified Cauchy Estimator (MCE) since the weight setting plays a crucial role in the success of GWTRPCA. To implement the GWTRPCA method, we devise an optimization algorithm using an Alternating Direction Method of Multipliers (ADMM) method. Experiments on real-world datasets validate the effectiveness of our proposed method.

CVJan 15
FlowAct-R1: Towards Interactive Humanoid Video Generation

Lizhen Wang, Yongming Zhu, Zhipeng Ge et al.

Interactive humanoid video generation aims to synthesize lifelike visual agents that can engage with humans through continuous and responsive video. Despite recent advances in video synthesis, existing methods often grapple with the trade-off between high-fidelity synthesis and real-time interaction requirements. In this paper, we propose FlowAct-R1, a framework specifically designed for real-time interactive humanoid video generation. Built upon a MMDiT architecture, FlowAct-R1 enables the streaming synthesis of video with arbitrary durations while maintaining low-latency responsiveness. We introduce a chunkwise diffusion forcing strategy, complemented by a novel self-forcing variant, to alleviate error accumulation and ensure long-term temporal consistency during continuous interaction. By leveraging efficient distillation and system-level optimizations, our framework achieves a stable 25fps at 480p resolution with a time-to-first-frame (TTFF) of only around 1.5 seconds. The proposed method provides holistic and fine-grained full-body control, enabling the agent to transition naturally between diverse behavioral states in interactive scenarios. Experimental results demonstrate that FlowAct-R1 achieves exceptional behavioral vividness and perceptual realism, while maintaining robust generalization across diverse character styles.

CVMay 5, 2025Code
TeDA: Boosting Vision-Lanuage Models for Zero-Shot 3D Object Retrieval via Testing-time Distribution Alignment

Zhichuan Wang, Yang Zhou, Jinhai Xiang et al.

Learning discriminative 3D representations that generalize well to unknown testing categories is an emerging requirement for many real-world 3D applications. Existing well-established methods often struggle to attain this goal due to insufficient 3D training data from broader concepts. Meanwhile, pre-trained large vision-language models (e.g., CLIP) have shown remarkable zero-shot generalization capabilities. Yet, they are limited in extracting suitable 3D representations due to substantial gaps between their 2D training and 3D testing distributions. To address these challenges, we propose Testing-time Distribution Alignment (TeDA), a novel framework that adapts a pretrained 2D vision-language model CLIP for unknown 3D object retrieval at test time. To our knowledge, it is the first work that studies the test-time adaptation of a vision-language model for 3D feature learning. TeDA projects 3D objects into multi-view images, extracts features using CLIP, and refines 3D query embeddings with an iterative optimization strategy by confident query-target sample pairs in a self-boosting manner. Additionally, TeDA integrates textual descriptions generated by a multimodal language model (InternVL) to enhance 3D object understanding, leveraging CLIP's aligned feature space to fuse visual and textual cues. Extensive experiments on four open-set 3D object retrieval benchmarks demonstrate that TeDA greatly outperforms state-of-the-art methods, even those requiring extensive training. We also experimented with depth maps on Objaverse-LVIS, further validating its effectiveness. Code is available at https://github.com/wangzhichuan123/TeDA.

CVMar 27, 2025Code
Omni-AD: Learning to Reconstruct Global and Local Features for Multi-class Anomaly Detection

Jiajie Quan, Ao Tong, Yuxuan Cai et al.

In multi-class unsupervised anomaly detection(MUAD), reconstruction-based methods learn to map input images to normal patterns to identify anomalous pixels. However, this strategy easily falls into the well-known "learning shortcut" issue when decoders fail to capture normal patterns and reconstruct both normal and abnormal samples naively. To address that, we propose to learn the input features in global and local manners, forcing the network to memorize the normal patterns more comprehensively. Specifically, we design a two-branch decoder block, named Omni-block. One branch corresponds to global feature learning, where we serialize two self-attention blocks but replace the query and (key, value) with learnable tokens, respectively, thus capturing global features of normal patterns concisely and thoroughly. The local branch comprises depth-separable convolutions, whose locality enables effective and efficient learning of local features for normal patterns. By stacking Omni-blocks, we build a framework, Omni-AD, to learn normal patterns of different granularity and reconstruct them progressively. Comprehensive experiments on public anomaly detection benchmarks show that our method outperforms state-of-the-art approaches in MUAD. Code is available at https://github.com/easyoo/Omni-AD.git

CVJul 29, 2025Code
Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval

Zhichuan Wang, Yang Zhou, Zhe Liu et al.

Open-set 3D object retrieval (3DOR) is an emerging task aiming to retrieve 3D objects of unseen categories beyond the training set. Existing methods typically utilize all modalities (i.e., voxels, point clouds, multi-view images) and train specific backbones before fusion. However, they still struggle to produce generalized representations due to insufficient 3D training data. Being contrastively pre-trained on web-scale image-text pairs, CLIP inherently produces generalized representations for a wide range of downstream tasks. Building upon it, we present a simple yet effective framework named Describe, Adapt and Combine (DAC) by taking only multi-view images for open-set 3DOR. DAC innovatively synergizes a CLIP model with a multi-modal large language model (MLLM) to learn generalized 3D representations, where the MLLM is used for dual purposes. First, it describes the seen category information to align with CLIP's training objective for adaptation during training. Second, it provides external hints about unknown objects complementary to visual cues during inference. To improve the synergy, we introduce an Additive-Bias Low-Rank adaptation (AB-LoRA), which alleviates overfitting and further enhances the generalization to unseen categories. With only multi-view images, DAC significantly surpasses prior arts by an average of +10.01\% mAP on four open-set 3DOR datasets. Moreover, its generalization is also validated on image-based and cross-dataset setups. Code is available at https://github.com/wangzhichuan123/DAC.

LGNov 10, 2021Code
Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters

Xiangru Lian, Binhang Yuan, Xuefeng Zhu et al.

Deep learning based models have dominated the current landscape of production recommender systems. Furthermore, recent years have witnessed an exponential growth of the model scale--from Google's 2016 model with 1 billion parameters to the latest Facebook's model with 12 trillion parameters. Significant quality boost has come with each jump of the model capacity, which makes us believe the era of 100 trillion parameters is around the corner. However, the training of such models is challenging even within industrial scale data centers. This difficulty is inherited from the staggering heterogeneity of the training computation--the model's embedding layer could include more than 99.99% of the total model size, which is extremely memory-intensive; while the rest neural network is increasingly computation-intensive. To support the training of such huge models, an efficient distributed training system is in urgent need. In this paper, we resolve this challenge by careful co-design of both the optimization algorithm and the distributed system architecture. Specifically, in order to ensure both the training efficiency and the training accuracy, we design a novel hybrid training algorithm, where the embedding layer and the dense neural network are handled by different synchronization mechanisms; then we build a system called Persia (short for parallel recommendation training system with hybrid acceleration) to support this hybrid training algorithm. Both theoretical demonstration and empirical study up to 100 trillion parameters have conducted to justified the system design and implementation of Persia. We make Persia publicly available (at https://github.com/PersiaML/Persia) so that anyone would be able to easily train a recommender model at the scale of 100 trillion parameters.

CLOct 13, 2021Code
Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese

Zhuosheng Zhang, Hanqing Zhang, Keming Chen et al.

Although pre-trained models (PLMs) have achieved remarkable improvements in a wide range of NLP tasks, they are expensive in terms of time and resources. This calls for the study of training more efficient models with less computation but still ensures impressive performance. Instead of pursuing a larger scale, we are committed to developing lightweight yet more powerful models trained with equal or less computation and friendly to rapid deployment. This technical report releases our pre-trained model called Mengzi, which stands for a family of discriminative, generative, domain-specific, and multimodal pre-trained model variants, capable of a wide range of language and vision tasks. Compared with public Chinese PLMs, Mengzi is simple but more powerful. Our lightweight model has achieved new state-of-the-art results on the widely-used CLUE benchmark with our optimized pre-training and fine-tuning techniques. Without modifying the model architecture, our model can be easily employed as an alternative to existing PLMs. Our sources are available at https://github.com/Langboat/Mengzi.

CVJan 24, 2020Code
6D Object Pose Regression via Supervised Learning on Point Clouds

Ge Gao, Mikko Lauri, Yulong Wang et al.

This paper addresses the task of estimating the 6 degrees of freedom pose of a known 3D object from depth information represented by a point cloud. Deep features learned by convolutional neural networks from color information have been the dominant features to be used for inferring object poses, while depth information receives much less attention. However, depth information contains rich geometric information of the object shape, which is important for inferring the object pose. We use depth information represented by point clouds as the input to both deep networks and geometry-based pose refinement and use separate networks for rotation and translation regression. We argue that the axis-angle representation is a suitable rotation representation for deep learning, and use a geodesic loss function for rotation regression. Ablation studies show that these design choices outperform alternatives such as the quaternion representation and L2 loss, or regressing translation and rotation with the same network. Our simple yet effective approach clearly outperforms state-of-the-art methods on the YCB-video dataset. The implementation and trained model are avaliable at: https://github.com/GeeeG/CloudPose.

LGSep 30, 2022
Double Graphs Regularized Multi-view Subspace Clustering

Longlong Chen, Yulong Wang, Youheng Liu et al.

Recent years have witnessed a growing academic interest in multi-view subspace clustering. In this paper, we propose a novel Double Graphs Regularized Multi-view Subspace Clustering (DGRMSC) method, which aims to harness both global and local structural information of multi-view data in a unified framework. Specifically, DGRMSC firstly learns a latent representation to exploit the global complementary information of multiple views. Based on the learned latent representation, we learn a self-representation to explore its global cluster structure. Further, Double Graphs Regularization (DGR) is performed on both latent representation and self-representation to take advantage of their local manifold structures simultaneously. Then, we design an iterative algorithm to solve the optimization problem effectively. Extensive experimental results on real-world datasets demonstrate the effectiveness of the proposed method.

63.5LGMay 7
SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask

Liu Hanzuo, Chaofan Lin, Weixuan Sun et al.

Semi-structured sparsity provides a practical path to accelerate large language models (LLMs) with native hardware support, but post-training semi-structured pruning often suffers from substantial quality degradation due to strong structural coupling. Existing methods rely on large-scale sparse retraining to recover accuracy, resulting in high computational cost. We propose SparseForge, a post-training framework that improves recovery efficiency by directly optimizing the sparsity mask rather than scaling up retraining tokens. SparseForge combines Hessian-aware importance estimation with progressive annealing of soft masks into hardware-executable structured sparsity, enabling stable and efficient sparse recovery. On LLaMA-2-7B under 2:4 sparsity, SparseForge achieves 57.27% average zero-shot accuracy with only $\textbf{5B}$ retraining tokens, surpassing the dense model's 56.43% accuracy and approaching the 57.52% result of a state-of-the-art method using $\textbf{40B}$ tokens. Such improvements on the accuracy-efficiency trade-off from SparseForge are shown to be consistent across model families.

CLMar 28, 2024
Interpreting Key Mechanisms of Factual Recall in Transformer-Based Language Models

Ang Lv, Yuhan Chen, Kaiyi Zhang et al.

In this paper, we delve into several mechanisms employed by Transformer-based language models (LLMs) for factual recall tasks. We outline a pipeline consisting of three major steps: (1) Given a prompt ``The capital of France is,'' task-specific attention heads extract the topic token, such as ``France,'' from the context and pass it to subsequent MLPs. (2) As attention heads' outputs are aggregated with equal weight and added to the residual stream, the subsequent MLP acts as an ``activation,'' which either erases or amplifies the information originating from individual heads. As a result, the topic token ``France'' stands out in the residual stream. (3) A deep MLP takes ``France'' and generates a component that redirects the residual stream towards the direction of the correct answer, i.e., ``Paris.'' This procedure is akin to applying an implicit function such as ``get\_capital($X$),'' and the argument $X$ is the topic token information passed by attention heads. To achieve the above quantitative and qualitative analysis for MLPs, we proposed a novel analytic method aimed at decomposing the outputs of the MLP into components understandable by humans. Additionally, we observed a universal anti-overconfidence mechanism in the final layer of models, which suppresses correct predictions. We mitigate this suppression by leveraging our interpretation to improve factual recall confidence. The above interpretations are evaluated across diverse tasks spanning various domains of factual knowledge, using various language models from the GPT-2 families, 1.3B OPT, up to 7B Llama-2, and in both zero- and few-shot setups.

CVMar 3, 2024
Unsigned Orthogonal Distance Fields: An Accurate Neural Implicit Representation for Diverse 3D Shapes

Yujie Lu, Long Wan, Nayu Ding et al.

Neural implicit representation of geometric shapes has witnessed considerable advancements in recent years. However, common distance field based implicit representations, specifically signed distance field (SDF) for watertight shapes or unsigned distance field (UDF) for arbitrary shapes, routinely suffer from degradation of reconstruction accuracy when converting to explicit surface points and meshes. In this paper, we introduce a novel neural implicit representation based on unsigned orthogonal distance fields (UODFs). In UODFs, the minimal unsigned distance from any spatial point to the shape surface is defined solely in one orthogonal direction, contrasting with the multi-directional determination made by SDF and UDF. Consequently, every point in the 3D UODFs can directly access its closest surface points along three orthogonal directions. This distinctive feature leverages the accurate reconstruction of surface points without interpolation errors. We verify the effectiveness of UODFs through a range of reconstruction examples, extending from simple watertight or non-watertight shapes to complex shapes that include hollows, internal or assembling structures.

LGMay 7, 2025
FilterTS: Comprehensive Frequency Filtering for Multivariate Time Series Forecasting

Yulong Wang, Yushuo Liu, Xiaoyi Duan et al.

Multivariate time series forecasting is crucial across various industries, where accurate extraction of complex periodic and trend components can significantly enhance prediction performance. However, existing models often struggle to capture these intricate patterns. To address these challenges, we propose FilterTS, a novel forecasting model that utilizes specialized filtering techniques based on the frequency domain. FilterTS introduces a Dynamic Cross-Variable Filtering Module, a key innovation that dynamically leverages other variables as filters to extract and reinforce shared variable frequency components across variables in multivariate time series. Additionally, a Static Global Filtering Module captures stable frequency components, identified throughout the entire training set. Moreover, the model is built in the frequency domain, converting time-domain convolutions into frequency-domain multiplicative operations to enhance computational efficiency. Extensive experimental results on eight real-world datasets have demonstrated that FilterTS significantly outperforms existing methods in terms of prediction accuracy and computational efficiency.

83.7CVApr 21
DINO Eats CLIP: Adapting Beyond Knowns for Open-set 3D Object Retrieval

Xinwei He, Yansong Zheng, Qianru Han et al.

Vision foundation models have shown great promise for open-set 3D object retrieval (3DOR) through efficient adaptation to multi-view images. Leveraging semantically aligned latent space, previous work typically adapts the CLIP encoder to build view-based 3D descriptors. Despite CLIP's strong generalization ability, its lack of fine-grainedness prompted us to explore the potential of a more recent self-supervised encoder-DINO. To address this, we propose DINO Eats CLIP (DEC), a novel framework for dynamic multi-view integration that is regularized by synthesizing data for unseen classes. We first find that simply mean-pooling over view features from a frozen DINO backbone gives decent performance. Yet, further adaptation causes severe overfitting on average view patterns of known classes. To combat it, we then design a module named Chunking and Adapting Module (CAM). It segments multi-view images into chunks and dynamically integrates local view relations, yielding more robust features than the standard pooling strategy. Finally, we propose Virtual Feature Synthesis (VFS) module to mitigate bias towards known categories explicitly. Under the hood, VFS leverages CLIP's broad, pre-aligned vision-language space to synthesize virtual features for unseen classes. By exposing DEC to these virtual features, we greatly enhance its open-set discrimination capacity. Extensive experiments on standard open-set 3DOR benchmarks demonstrate its superior efficacy.

CVFeb 25, 2025
HRR: Hierarchical Retrospection Refinement for Generated Image Detection

Peipei Yuan, Zijing Xie, Shuo Ye et al.

Generative artificial intelligence holds significant potential for abuse, and generative image detection has become a key focus of research. However, existing methods primarily focused on detecting a specific generative model and emphasizing the localization of synthetic regions, while neglecting the interference caused by image size and style on model learning. Our goal is to reach a fundamental conclusion: Is the image real or generated? To this end, we propose a diffusion model-based generative image detection framework termed Hierarchical Retrospection Refinement~(HRR). It designs a multi-scale style retrospection module that encourages the model to generate detailed and realistic multi-scale representations, while alleviating the learning biases introduced by dataset styles and generative models. Additionally, based on the principle of correntropy sparse additive machine, a feature refinement module is designed to reduce the impact of redundant features on learning and capture the intrinsic structure and patterns of the data, thereby improving the model's generalization ability. Extensive experiments demonstrate the HRR framework consistently delivers significant performance improvements, outperforming state-of-the-art methods in generated image detection task.

CVOct 14, 2024
big.LITTLE Vision Transformer for Efficient Visual Recognition

He Guo, Yulong Wang, Zixuan Ye et al.

In this paper, we introduce the big.LITTLE Vision Transformer, an innovative architecture aimed at achieving efficient visual recognition. This dual-transformer system is composed of two distinct blocks: the big performance block, characterized by its high capacity and substantial computational demands, and the LITTLE efficiency block, designed for speed with lower capacity. The key innovation of our approach lies in its dynamic inference mechanism. When processing an image, our system determines the importance of each token and allocates them accordingly: essential tokens are processed by the high-performance big model, while less critical tokens are handled by the more efficient little model. This selective processing significantly reduces computational load without sacrificing the overall performance of the model, as it ensures that detailed analysis is reserved for the most important information. To validate the effectiveness of our big.LITTLE Vision Transformer, we conducted comprehensive experiments on image classification and segment anything task. Our results demonstrate that the big.LITTLE architecture not only maintains high accuracy but also achieves substantial computational savings. Specifically, our approach enables the efficient handling of large-scale visual recognition tasks by dynamically balancing the trade-offs between performance and efficiency. The success of our method underscores the potential of hybrid models in optimizing both computation and performance in visual recognition tasks, paving the way for more practical and scalable deployment of advanced neural networks in real-world applications.

LGMay 7, 2025
STRGCN: Capturing Asynchronous Spatio-Temporal Dependencies for Irregular Multivariate Time Series Forecasting

Yulong Wang, Xiaofeng Hu, Xiaojian Cui et al.

Irregular multivariate time series (IMTS) are prevalent in real-world applications across many fields, where varying sensor frequencies and asynchronous measurements pose significant modeling challenges. Existing solutions often rely on a pre-alignment strategy to normalize data, which can distort intrinsic patterns and escalate computational and memory demands. Addressing these limitations, we introduce STRGCN, a Spatio-Temporal Relational Graph Convolutional Network that avoids pre-alignment and directly captures the complex interdependencies in IMTS by representing them as a fully connected graph. Each observation is represented as a node, allowing the model to effectively handle misaligned timestamps by mapping all inter-node relationships, thus faithfully preserving the asynchronous nature of the data. Moreover, we enhance this model with a hierarchical ``Sandwich'' structure that strategically aggregates nodes to optimize graph embeddings, reducing computational overhead while maintaining detailed local and global context. Extensive experiments on four public datasets demonstrate that STRGCN achieves state-of-the-art accuracy, competitive memory usage and training speed.

CLOct 29, 2024
Linear Chain Transformation: Expanding Optimization Dynamics for Fine-Tuning Large Language Models

Yulong Wang, Chang Zuo, Yin Xuan et al.

Fine-tuning large language models (LLMs) has become essential for adapting pretrained models to specific downstream tasks. In this paper, we propose Linear Chain Transformation (LinChain), a novel approach that introduces a sequence of linear transformations during fine-tuning to enrich optimization dynamics. By incorporating multiple linear transformations into the parameter update process, LinChain expands the effective rank of updates and enhances the model's ability to learn complex task-specific representations. We demonstrate that this method significantly improves the performance of LLM fine-tuning over state-of-the-art methods by providing more flexible optimization paths during training, while maintaining the inference efficiency of the resulting model. Our experiments on various benchmark tasks show that LinChain leads to better generalization, fewer learnable parameters, and improved task adaptation, making it a compelling strategy for LLM fine-tuning.

MLMar 2, 2024
High-Dimensional Tail Index Regression: with An Application to Text Analyses of Viral Posts in Social Media

Yuya Sasaki, Jing Tao, Yulong Wang

Motivated by the empirical observation of power-law distributions in the credits (e.g., "likes") of viral social media posts, we introduce a high-dimensional tail index regression model and propose methods for estimation and inference of its parameters. First, we present a regularized estimator, establish its consistency, and derive its convergence rate. Second, we introduce a debiasing technique for the regularized estimator to facilitate inference and prove its asymptotic normality. Third, we extend our approach to handle large-scale online streaming data using stochastic gradient descent. Simulation studies corroborate our theoretical findings. We apply these methods to the text analysis of viral posts on X (formerly Twitter) related to LGBTQ+ topics.

CRMay 3, 2023
New Adversarial Image Detection Based on Sentiment Analysis

Yulong Wang, Tianxiang Li, Shenghong Li et al.

Deep Neural Networks (DNNs) are vulnerable to adversarial examples, while adversarial attack models, e.g., DeepFool, are on the rise and outrunning adversarial example detection techniques. This paper presents a new adversarial example detector that outperforms state-of-the-art detectors in identifying the latest adversarial attacks on image datasets. Specifically, we propose to use sentiment analysis for adversarial example detection, qualified by the progressively manifesting impact of an adversarial perturbation on the hidden-layer feature maps of a DNN under attack. Accordingly, we design a modularized embedding layer with the minimum learnable parameters to embed the hidden-layer feature maps into word vectors and assemble sentences ready for sentiment analysis. Extensive experiments demonstrate that the new detector consistently surpasses the state-of-the-art detection algorithms in detecting the latest attacks launched against ResNet and Inception neutral networks on the CIFAR-10, CIFAR-100 and SVHN datasets. The detector only has about 2 million parameters, and takes shorter than 4.6 milliseconds to detect an adversarial example generated by the latest attack models using a Tesla K80 GPU card.

IRNov 3, 2021
Three-dimensional Cooperative Localization of Commercial-Off-The-Shelf Sensors

Yulong Wang, Shenghong Li, Wei Ni et al.

Many location-based services use Received Signal Strength (RSS) measurements due to their universal availability. In this paper, we study the association of a large number of low-cost Internet-of-Things (IoT) sensors and their possible installation locations, which can enable various sensing and automation-related applications. We propose an efficient approach to solve the corresponding permutation combinatorial optimization problem, which integrates continuous space cooperative localization and permutation space likelihood ascent search. A convex relaxation-based optimization is designed to estimate the coarse locations of blindfolded devices in continuous 3D spaces, which are then projected to the feasible permutation space. An efficient Cramér-Rao Lower Bound based likelihood ascent search algorithm is proposed to refine the solution. Extensive experiments were conducted to evaluate the performance of the proposed approach, which show that the proposed approach significantly outperforms state-of-the-art combinatorial optimization algorithms and achieves close-to-100% accuracy with affordable execution time.

CVJun 18, 2021
Analyzing Adversarial Robustness of Deep Neural Networks in Pixel Space: a Semantic Perspective

Lina Wang, Xingshu Chen, Yulong Wang et al.

The vulnerability of deep neural networks to adversarial examples, which are crafted maliciously by modifying the inputs with imperceptible perturbations to misled the network produce incorrect outputs, reveals the lack of robustness and poses security concerns. Previous works study the adversarial robustness of image classifiers on image level and use all the pixel information in an image indiscriminately, lacking of exploration of regions with different semantic meanings in the pixel space of an image. In this work, we fill this gap and explore the pixel space of the adversarial image by proposing an algorithm to looking for possible perturbations pixel by pixel in different regions of the segmented image. The extensive experimental results on CIFAR-10 and ImageNet verify that searching for the modified pixel in only some pixels of an image can successfully launch the one-pixel adversarial attacks without requiring all the pixels of the entire image, and there exist multiple vulnerable points scattered in different regions of an image. We also demonstrate that the adversarial robustness of different regions on the image varies with the amount of semantic information contained.

CVApr 20, 2021
Semantic Segmentation by Improved Generative Adversarial Networks

ZengShun Zhaoa, Yulong Wang, Ke Liu et al.

While most existing segmentation methods usually combined the powerful feature extraction capabilities of CNNs with Conditional Random Fields (CRFs) post-processing, the result always limited by the fault of CRFs . Due to the notoriously slow calculation speeds and poor efficiency of CRFs, in recent years, CRFs post-processing has been gradually eliminated. In this paper, an improved Generative Adversarial Networks (GANs) for image semantic segmentation task (semantic segmentation by GANs, Seg-GAN) is proposed to facilitate further segmentation research. In addition, we introduce Convolutional CRFs (ConvCRFs) as an effective improvement solution for the image semantic segmentation task. Towards the goal of differentiating the segmentation results from the ground truth distribution and improving the details of the output images, the proposed discriminator network is specially designed in a full convolutional manner combined with cascaded ConvCRFs. Besides, the adversarial loss aggressively encourages the output image to be close to the distribution of the ground truth. Our method not only learns an end-to-end mapping from input image to corresponding output image, but also learns a loss function to train this mapping. The experiments show that our method achieves better performance than state-of-the-art methods.

BIO-PHMar 15, 2021
Shape-induced obstacle attraction and repulsion during dynamic locomotion

Yuanfeng Han, Ratan Othayoth, Yulong Wang et al.

Robots still struggle to dynamically traverse complex 3-D terrain with many large obstacles, an ability required for many critical applications. Body-obstacle interaction is often inevitable and induces perturbation and uncertainty in motion that challenges closed-form dynamic modeling. Here, inspired by recent discovery of a terradynamic streamlined shape, we studied how two body shapes interacting with obstacles affect turning and pitching motions of an open-loop multi-legged robot and cockroaches during dynamic locomotion. With a common cuboidal body, the robot was attracted towards obstacles, resulting in pitching up and flipping-over. By contrast, with an elliptical body, the robot was repelled by obstacles and readily traversed. The animal displayed qualitatively similar turning and pitching motions induced by these two body shapes. However, unlike the cuboidal robot, the cuboidal animal was capable of escaping obstacle attraction and subsequent high pitching and flipping over, which inspired us to develop an empirical pitch-and-turn strategy for cuboidal robots. Considering the similarity of our self-propelled body-obstacle interaction with part-feeder interaction in robotic part manipulation, we developed a quasi-static potential energy landscape model to explain the dependence of dynamic locomotion on body shape. Our experimental and modeling results also demonstrated that obstacle attraction or repulsion is an inherent property of locomotor body shape and insensitive to obstacle geometry and size. Our study expanded the concept and usefulness of terradynamic shapes for passive control of robot locomotion to traverse large obstacles using physical interaction. Our study is also a step in establishing an energy landscape approach to locomotor transitions.

CVMar 3, 2020
Data-Free Adversarial Perturbations for Practical Black-Box Attack

ZhaoXin Huan, Yulong Wang, Xiaolu Zhang et al.

Neural networks are vulnerable to adversarial examples, which are malicious inputs crafted to fool pre-trained models. Adversarial examples often exhibit black-box attacking transferability, which allows that adversarial examples crafted for one model can fool another model. However, existing black-box attack methods require samples from the training data distribution to improve the transferability of adversarial examples across different models. Because of the data dependence, the fooling ability of adversarial perturbations is only applicable when training data are accessible. In this paper, we present a data-free method for crafting adversarial perturbations that can fool a target model without any knowledge about the training data distribution. In the practical setting of a black-box attack scenario where attackers do not have access to target models and training data, our method achieves high fooling rates on target models and outperforms other universal adversarial perturbation methods. Our method empirically shows that current deep learning models are still at risk even when the attackers do not have access to training data.

LGOct 7, 2019
Interpretable Disentanglement of Neural Networks by Extracting Class-Specific Subnetwork

Yulong Wang, Xiaolin Hu, Hang Su

We propose a novel perspective to understand deep neural networks in an interpretable disentanglement form. For each semantic class, we extract a class-specific functional subnetwork from the original full model, with compressed structure while maintaining comparable prediction performance. The structure representations of extracted subnetworks display a resemblance to their corresponding class semantic similarities. We also apply extracted subnetworks in visual explanation and adversarial example detection tasks by merely replacing the original full model with class-specific subnetworks. Experiments demonstrate that this intuitive operation can effectively improve explanation saliency accuracy for gradient-based explanation methods, and increase the detection rate for confidence score-based adversarial example detection methods.

CVSep 27, 2019
Pruning from Scratch

Yulong Wang, Xiaolu Zhang, Lingxi Xie et al.

Network pruning is an important research field aiming at reducing computational costs of neural networks. Conventional approaches follow a fixed paradigm which first trains a large and redundant network, and then determines which units (e.g., channels) are less important and thus can be removed. In this work, we find that pre-training an over-parameterized model is not necessary for obtaining the target pruned structure. In fact, a fully-trained over-parameterized model will reduce the search space for the pruned structure. We empirically show that more diverse pruned structures can be directly pruned from randomly initialized weights, including potential models with better performance. Therefore, we propose a novel network pruning pipeline which allows pruning from scratch. In the experiments for compressing classification models on CIFAR10 and ImageNet datasets, our approach not only greatly reduces the pre-training burden of traditional pruning methods, but also achieves similar or even higher accuracy under the same computation budgets. Our results facilitate the community to rethink the effectiveness of existing techniques used for network pruning.

SDDec 14, 2017
A Hierarchical Recurrent Neural Network for Symbolic Melody Generation

Jian Wu, Changran Hu, Yulong Wang et al.

In recent years, neural networks have been used to generate symbolic melodies. However, the long-term structure in the melody has posed great difficulty for designing a good model. In this paper, we present a hierarchical recurrent neural network for melody generation, which consists of three Long-Short-Term-Memory (LSTM) subnetworks working in a coarse-to-fine manner along time. Specifically, the three subnetworks generate bar profiles, beat profiles and notes in turn, and the output of the high-level subnetworks are fed into the low-level subnetworks, serving as guidance for generating the finer time-scale melody components in low-level subnetworks. Two human behavior experiments demonstrate the advantage of this structure over the single-layer LSTM which attempts to learn all hidden structures in melodies. Compared with the state-of-the-art models MidiNet and MusicVAE, the hierarchical recurrent neural network produces better melodies evaluated by humans.

CVNov 5, 2017
Modal Regression based Atomic Representation for Robust Face Recognition

Yulong Wang, Yuan Yan Tang, Luoqing Li et al.

Representation based classification (RC) methods such as sparse RC (SRC) have shown great potential in face recognition in recent years. Most previous RC methods are based on the conventional regression models, such as lasso regression, ridge regression or group lasso regression. These regression models essentially impose a predefined assumption on the distribution of the noise variable in the query sample, such as the Gaussian or Laplacian distribution. However, the complicated noises in practice may violate the assumptions and impede the performance of these RC methods. In this paper, we propose a modal regression based atomic representation and classification (MRARC) framework to alleviate such limitation. Unlike previous RC methods, the MRARC framework does not require the noise variable to follow any specific predefined distributions. This gives rise to the capability of MRARC in handling various complex noises in reality. Using MRARC as a general platform, we also develop four novel RC methods for unimodal and multimodal face recognition, respectively. In addition, we devise a general optimization algorithm for the unified MRARC framework based on the alternating direction method of multipliers (ADMM) and half-quadratic theory. The experiments on real-world data validate the efficacy of MRARC for robust face recognition.