Yinghao Wu

CV
h-index28
14papers
108citations
Novelty52%
AI Score55

14 Papers

50.1CVMay 25
VEN-VL: A Visual Ensemble MoE Framework for Effective and Efficient Multi-Modal Understanding

Yinghao Wu, Zhuoyan Luo, Yiyao Yu et al.

Despite the remarkable progress achieved by recent efficient methods in accelerating multimodal understanding, they still suffer from noticeable performance degradation. Their emphasis on the high compression ratio of a single visual clue and reliance on the heuristic pruning strategy with coarse attention alignment incurs a bottleneck on the information capacity and density of visual tokens. Addressing this limitation, we propose VEN-VL, a visual ensemble MoE framework for effective and efficient perception following the enrich then compact principle. Specifically, we first enrich the information capacity by unifying the visual representations of different perspectives, and then progressively compact it with adaptive routers in specialized visual experts to enhance the information density. Furthermore, we incorporate the reconstruction ability of vanilla structure via explicit visual supervision, facilitating crucial information preservation. Experimental results demonstrate our superiority in complex visual tasks with few information-condensed tokens, which effectively bridges the gap between performance and efficiency.

CVOct 14, 2022
InFIP: An Explainable DNN Intellectual Property Protection Method based on Intrinsic Features

Mingfu Xue, Xin Wang, Yinghao Wu et al.

Intellectual property (IP) protection for Deep Neural Networks (DNNs) has raised serious concerns in recent years. Most existing works embed watermarks in the DNN model for IP protection, which need to modify the model and lack of interpretability. In this paper, for the first time, we propose an interpretable intellectual property protection method for DNN based on explainable artificial intelligence. Compared with existing works, the proposed method does not modify the DNN model, and the decision of the ownership verification is interpretable. We extract the intrinsic features of the DNN model by using Deep Taylor Decomposition. Since the intrinsic feature is composed of unique interpretation of the model's decision, the intrinsic feature can be regarded as fingerprint of the model. If the fingerprint of a suspected model is the same as the original model, the suspected model is considered as a pirated model. Experimental results demonstrate that the fingerprints can be successfully used to verify the ownership of the model and the test accuracy of the model is not affected. Furthermore, the proposed method is robust to fine-tuning attack, pruning attack, watermark overwriting attack, and adaptive attack.

CVMay 24, 2024Code
CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation

Zhuoyan Luo, Yinghao Wu, Tianheng Cheng et al.

The newly proposed Generalized Referring Expression Segmentation (GRES) amplifies the formulation of classic RES by involving complex multiple/non-target scenarios. Recent approaches address GRES by directly extending the well-adopted RES frameworks with object-existence identification. However, these approaches tend to encode multi-granularity object information into a single representation, which makes it difficult to precisely represent comprehensive objects of different granularity. Moreover, the simple binary object-existence identification across all referent scenarios fails to specify their inherent differences, incurring ambiguity in object understanding. To tackle the above issues, we propose a \textbf{Co}unting-Aware \textbf{H}ierarchical \textbf{D}ecoding framework (CoHD) for GRES. By decoupling the intricate referring semantics into different granularity with a visual-linguistic hierarchy, and dynamic aggregating it with intra- and inter-selection, CoHD boosts multi-granularity comprehension with the reciprocal benefit of the hierarchical nature. Furthermore, we incorporate the counting ability by embodying multiple/single/non-target scenarios into count- and category-level supervision, facilitating comprehensive object perception. Experimental results on gRefCOCO, Ref-ZOM, R-RefCOCO, and RefCOCO benchmarks demonstrate the effectiveness and rationality of CoHD which outperforms state-of-the-art GRES methods by a remarkable margin. Code is available at \href{https://github.com/RobertLuo1/CoHD}{here}.

GROct 20, 2025Code
Shape-aware Inertial Poser: Motion Tracking for Humans with Diverse Shapes Using Sparse Inertial Sensors

Lu Yin, Ziying Shi, Yinghao Wu et al.

Human motion capture with sparse inertial sensors has gained significant attention recently. However, existing methods almost exclusively rely on a template adult body shape to model the training data, which poses challenges when generalizing to individuals with largely different body shapes (such as a child). This is primarily due to the variation in IMU-measured acceleration caused by changes in body shape. To fill this gap, we propose Shape-aware Inertial Poser (SAIP), the first solution considering body shape differences in sparse inertial-based motion capture. Specifically, we decompose the sensor measurements related to shape and pose in order to effectively model their joint correlations. Firstly, we train a regression model to transfer the IMU-measured accelerations of a real body to match the template adult body model, compensating for the shape-related sensor measurements. Then, we can easily follow the state-of-the-art methods to estimate the full body motions of the template-shaped body. Finally, we utilize a second regression model to map the joint velocities back to the real body, combined with a shape-aware physical optimization strategy to calculate global motions on the subject. Furthermore, our method relies on body shape awareness, introducing the first inertial shape estimation scheme. This is accomplished by modeling the shape-conditioned IMU-pose correlation using an MLP-based network. To validate the effectiveness of SAIP, we also present the first IMU motion capture dataset containing individuals of different body sizes. This dataset features 10 children and 10 adults, with heights ranging from 110 cm to 190 cm, and a total of 400 minutes of paired IMU-Motion samples. Extensive experimental results demonstrate that SAIP can effectively handle motion capture tasks for diverse body shapes. The code and dataset are available at https://github.com/yinlu5942/SAIP.

CLApr 1, 2025
Z1: Efficient Test-time Scaling with Code

Zhaojian Yu, Yinghao Wu, Yilun Zhao et al.

Large Language Models (LLMs) can achieve enhanced complex problem-solving through test-time computing scaling, yet this often entails longer contexts and numerous reasoning token costs. In this paper, we propose an efficient test-time scaling method that trains LLMs on code-related reasoning trajectories, facilitating their reduction of excess thinking tokens while maintaining performance. First, we create Z1-Code-Reasoning-107K, a curated dataset of simple and complex coding problems paired with their short and long solution trajectories. Second, we present a novel Shifted Thinking Window to mitigate overthinking overhead by removing context-delimiting tags (e.g., <think>. . . </think>) and capping reasoning tokens. Trained with long and short trajectory data and equipped with Shifted Thinking Window, our model, Z1-7B, demonstrates the ability to adjust its reasoning level as the complexity of problems and exhibits efficient test-time scaling across different reasoning tasks that matches R1-Distill-Qwen-7B performance with about 30% of its average thinking tokens. Notably, fine-tuned with only code trajectories, Z1-7B demonstrates generalization to broader reasoning tasks (47.5% on GPQA Diamond). Our analysis of efficient reasoning elicitation also provides valuable insights for future research.

CYMay 21, 2024
OpenCarbonEval: A Unified Carbon Emission Estimation Framework in Large-Scale AI Models

Zhaojian Yu, Yinghao Wu, Zhuotao Deng et al.

In recent years, large-scale auto-regressive models have made significant progress in various tasks, such as text or video generation. However, the environmental impact of these models has been largely overlooked, with a lack of assessment and analysis of their carbon footprint. To address this gap, we introduce OpenCarbonEval, a unified framework for integrating large-scale models across diverse modalities to predict carbon emissions, which could provide AI service providers and users with a means to estimate emissions beforehand and help mitigate the environmental pressure associated with these models. In OpenCarbonEval, we propose a dynamic throughput modeling approach that could capture workload and hardware fluctuations in the training process for more precise emissions estimates. Our evaluation results demonstrate that OpenCarbonEval can more accurately predict training emissions than previous methods, and can be seamlessly applied to different modal tasks. Specifically, we show that OpenCarbonEval achieves superior performance in predicting carbon emissions for both visual models and language models. By promoting sustainable AI development and deployment, OpenCarbonEval can help reduce the environmental impact of large-scale models and contribute to a more environmentally responsible future for the AI community.

CRJun 20, 2025
CUBA: Controlled Untargeted Backdoor Attack against Deep Neural Networks

Yinghao Wu, Liyan Zhang

Backdoor attacks have emerged as a critical security threat against deep neural networks in recent years. The majority of existing backdoor attacks focus on targeted backdoor attacks, where trigger is strongly associated to specific malicious behavior. Various backdoor detection methods depend on this inherent property and shows effective results in identifying and mitigating such targeted attacks. However, a purely untargeted attack in backdoor scenarios is, in some sense, self-weakening, since the target nature is what makes backdoor attacks so powerful. In light of this, we introduce a novel Constrained Untargeted Backdoor Attack (CUBA), which combines the flexibility of untargeted attacks with the intentionality of targeted attacks. The compromised model, when presented with backdoor images, will classify them into random classes within a constrained range of target classes selected by the attacker. This combination of randomness and determinedness enables the proposed untargeted backdoor attack to natively circumvent existing backdoor defense methods. To implement the untargeted backdoor attack under controlled flexibility, we propose to apply logit normalization on cross-entropy loss with flipped one-hot labels. By constraining the logit during training, the compromised model will show a uniform distribution across selected target classes, resulting in controlled untargeted attack. Extensive experiments demonstrate the effectiveness of the proposed CUBA on different datasets.

ROSep 20, 2025
TranTac: Leveraging Transient Tactile Signals for Contact-Rich Robotic Manipulation

Yinghao Wu, Shuhong Hou, Haowen Zheng et al.

Robotic manipulation tasks such as inserting a key into a lock or plugging a USB device into a port can fail when visual perception is insufficient to detect misalignment. In these situations, touch sensing is crucial for the robot to monitor the task's states and make precise, timely adjustments. Current touch sensing solutions are either insensitive to detect subtle changes or demand excessive sensor data. Here, we introduce TranTac, a data-efficient and low-cost tactile sensing and control framework that integrates a single contact-sensitive 6-axis inertial measurement unit within the elastomeric tips of a robotic gripper for completing fine insertion tasks. Our customized sensing system can detect dynamic translational and torsional deformations at the micrometer scale, enabling the tracking of visually imperceptible pose changes of the grasped object. By leveraging transformer-based encoders and diffusion policy, TranTac can imitate human insertion behaviors using transient tactile cues detected at the gripper's tip during insertion processes. These cues enable the robot to dynamically control and correct the 6-DoF pose of the grasped object. When combined with vision, TranTac achieves an average success rate of 79% on object grasping and insertion tasks, outperforming both vision-only policy and the one augmented with end-effector 6D force/torque sensing. Contact localization performance is also validated through tactile-only misaligned insertion tasks, achieving an average success rate of 88%. We assess the generalizability by training TranTac on a single prism-slot pair and testing it on unseen data, including a USB plug and a metal key, and find that the insertion tasks can still be completed with an average success rate of nearly 70%. The proposed framework may inspire new robotic tactile sensing systems for delicate manipulation tasks.

CRJul 1, 2025
BadViM: Backdoor Attack against Vision Mamba

Yinghao Wu, Liyan Zhang

Vision State Space Models (SSMs), particularly architectures like Vision Mamba (ViM), have emerged as promising alternatives to Vision Transformers (ViTs). However, the security implications of this novel architecture, especially their vulnerability to backdoor attacks, remain critically underexplored. Backdoor attacks aim to embed hidden triggers into victim models, causing the model to misclassify inputs containing these triggers while maintaining normal behavior on clean inputs. This paper investigates the susceptibility of ViM to backdoor attacks by introducing BadViM, a novel backdoor attack framework specifically designed for Vision Mamba. The proposed BadViM leverages a Resonant Frequency Trigger (RFT) that exploits the frequency sensitivity patterns of the victim model to create stealthy, distributed triggers. To maximize attack efficacy, we propose a Hidden State Alignment loss that strategically manipulates the internal representations of model by aligning the hidden states of backdoor images with those of target classes. Extensive experimental results demonstrate that BadViM achieves superior attack success rates while maintaining clean data accuracy. Meanwhile, BadViM exhibits remarkable resilience against common defensive measures, including PatchDrop, PatchShuffle and JPEG compression, which typically neutralize normal backdoor attacks.

SIJan 19, 2024
Fast Butterfly-Core Community Search For Large Labeled Graphs

JiaYi Du, Yinghao Wu, Wei Ai et al.

Community Search (CS) aims to identify densely interconnected subgraphs corresponding to query vertices within a graph. However, existing heterogeneous graph-based community search methods need help identifying cross-group communities and suffer from efficiency issues, making them unsuitable for large graphs. This paper presents a fast community search model based on the Butterfly-Core Community (BCC) structure for heterogeneous graphs. The Random Walk with Restart (RWR) algorithm and butterfly degree comprehensively evaluate the importance of vertices within communities, allowing leader vertices to be rapidly updated to maintain cross-group cohesion. Moreover, we devised a more efficient method for updating vertex distances, which minimizes vertex visits and enhances operational efficiency. Extensive experiments on several real-world temporal graphs demonstrate the effectiveness and efficiency of this solution.

SIJan 19, 2024
An Effective Index for Truss-based Community Search on Large Directed Graphs

Wei Ai, CanHao Xie, Tao Meng et al.

Community search is a derivative of community detection that enables online and personalized discovery of communities and has found extensive applications in massive real-world networks. Recently, there needs to be more focus on the community search issue within directed graphs, even though substantial research has been carried out on undirected graphs. The recently proposed D-truss model has achieved good results in the quality of retrieved communities. However, existing D-truss-based work cannot perform efficient community searches on large graphs because it consumes too many computing resources to retrieve the maximal D-truss. To overcome this issue, we introduce an innovative merge relation known as D-truss-connected to capture the inherent density and cohesiveness of edges within D-truss. This relation allows us to partition all the edges in the original graph into a series of D-truss-connected classes. Then, we construct a concise and compact index, ConDTruss, based on D-truss-connected. Using ConDTruss, the efficiency of maximum D-truss retrieval will be greatly improved, making it a theoretically optimal approach. Experimental evaluations conducted on large directed graph certificate the effectiveness of our proposed method.

CVJan 31, 2022
Imperceptible and Multi-channel Backdoor Attack against Deep Neural Networks

Mingfu Xue, Shifeng Ni, Yinghao Wu et al.

Recent researches demonstrate that Deep Neural Networks (DNN) models are vulnerable to backdoor attacks. The backdoored DNN model will behave maliciously when images containing backdoor triggers arrive. To date, existing backdoor attacks are single-trigger and single-target attacks, and the triggers of most existing backdoor attacks are obvious thus are easy to be detected or noticed. In this paper, we propose a novel imperceptible and multi-channel backdoor attack against Deep Neural Networks by exploiting Discrete Cosine Transform (DCT) steganography. Based on the proposed backdoor attack method, we implement two variants of backdoor attacks, i.e., N-to-N backdoor attack and N-to-One backdoor attack. Specifically, for a colored image, we utilize DCT steganography to construct the trigger on different channels of the image. As a result, the trigger is stealthy and natural. Based on the proposed method, we implement multi-target and multi-trigger backdoor attacks. Experimental results demonstrate that the average attack success rate of the N-to-N backdoor attack is 93.95% on CIFAR-10 dataset and 91.55% on TinyImageNet dataset, respectively. The average attack success rate of N-to-One attack is 90.22% and 89.53% on CIFAR-10 and TinyImageNet datasets, respectively. Meanwhile, the proposed backdoor attack does not affect the classification accuracy of the DNN model. Moreover, the proposed attack is demonstrated to be robust to the state-of-the-art backdoor defense (Neural Cleanse).

CRSep 16, 2021
Protect the Intellectual Property of Dataset against Unauthorized Use

Mingfu Xue, Yinghao Wu, Yushu Zhang et al.

Training high performance Deep Neural Networks (DNNs) models require large-scale and high-quality datasets. The expensive cost of collecting and annotating large-scale datasets make the valuable datasets can be considered as the Intellectual Property (IP) of the dataset owner. To date, almost all the copyright protection schemes for deep learning focus on the copyright protection of models, while the copyright protection of the dataset is rarely studied. In this paper, we propose a novel method to actively protect the dataset from being used to train DNN models without authorization. Experimental results on on CIFAR-10 and TinyImageNet datasets demonstrate the effectiveness of the proposed method. Compared with the model trained on clean dataset, the proposed method can effectively make the test accuracy of the unauthorized model trained on protected dataset drop from 86.21% to 38.23% and from 74.00% to 16.20% on CIFAR-10 and TinyImageNet datasets, respectively.

CVMay 29, 2021
Detecting Backdoor in Deep Neural Networks via Intentional Adversarial Perturbations

Mingfu Xue, Yinghao Wu, Zhiyu Wu et al.

Recent researches show that deep learning model is susceptible to backdoor attacks. Many defenses against backdoor attacks have been proposed. However, existing defense works require high computational overhead or backdoor attack information such as the trigger size, which is difficult to satisfy in realistic scenarios. In this paper, a novel backdoor detection method based on adversarial examples is proposed. The proposed method leverages intentional adversarial perturbations to detect whether an image contains a trigger, which can be applied in both the training stage and the inference stage (sanitize the training set in training stage and detect the backdoor instances in inference stage). Specifically, given an untrusted image, the adversarial perturbation is added to the image intentionally. If the prediction of the model on the perturbed image is consistent with that on the unperturbed image, the input image will be considered as a backdoor instance. Compared with most existing defense works, the proposed adversarial perturbation based method requires low computational resources and maintains the visual quality of the images. Experimental results show that, the backdoor detection rate of the proposed defense method is 99.63%, 99.76% and 99.91% on Fashion-MNIST, CIFAR-10 and GTSRB datasets, respectively. Besides, the proposed method maintains the visual quality of the image as the l2 norm of the added perturbation are as low as 2.8715, 3.0513 and 2.4362 on Fashion-MNIST, CIFAR-10 and GTSRB datasets, respectively. In addition, it is also demonstrated that the proposed method can achieve high defense performance against backdoor attacks under different attack settings (trigger transparency, trigger size and trigger pattern). Compared with the existing defense work (STRIP), the proposed method has better detection performance on all the three datasets, and is more efficient than STRIP.