CVApr 9, 2023
HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image GenerationXuan Ju, Ailing Zeng, Chenchen Zhao et al.
Controllable human image generation (HIG) has numerous real-life applications. State-of-the-art solutions, such as ControlNet and T2I-Adapter, introduce an additional learnable branch on top of the frozen pre-trained stable diffusion (SD) model, which can enforce various conditions, including skeleton guidance of HIG. While such a plug-and-play approach is appealing, the inevitable and uncertain conflicts between the original images produced from the frozen SD branch and the given condition incur significant challenges for the learnable branch, which essentially conducts image feature editing for condition enforcement. In this work, we propose a native skeleton-guided diffusion model for controllable HIG called HumanSD. Instead of performing image editing with dual-branch diffusion, we fine-tune the original SD model using a novel heatmap-guided denoising loss. This strategy effectively and efficiently strengthens the given skeleton condition during model training while mitigating the catastrophic forgetting effects. HumanSD is fine-tuned on the assembly of three large-scale human-centric datasets with text-image-pose information, two of which are established in this work. As shown in Figure 1, HumanSD outperforms ControlNet in terms of accurate pose control and image quality, particularly when the given skeleton guidance is sophisticated.
CVAug 15, 2023
DiffGuard: Semantic Mismatch-Guided Out-of-Distribution Detection using Pre-trained Diffusion ModelsRuiyuan Gao, Chenchen Zhao, Lanqing Hong et al.
Given a classifier, the inherent property of semantic Out-of-Distribution (OOD) samples is that their contents differ from all legal classes in terms of semantics, namely semantic mismatch. There is a recent work that directly applies it to OOD detection, which employs a conditional Generative Adversarial Network (cGAN) to enlarge semantic mismatch in the image space. While achieving remarkable OOD detection performance on small datasets, it is not applicable to ImageNet-scale datasets due to the difficulty in training cGANs with both input images and labels as conditions. As diffusion models are much easier to train and amenable to various conditions compared to cGANs, in this work, we propose to directly use pre-trained diffusion models for semantic mismatch-guided OOD detection, named DiffGuard. Specifically, given an OOD input image and the predicted label from the classifier, we try to enlarge the semantic difference between the reconstructed OOD image under these conditions and the original input image. We also present several test-time techniques to further strengthen such differences. Experimental results show that DiffGuard is effective on both Cifar-10 and hard cases of the large-scale ImageNet, and it can be easily combined with existing OOD detection techniques to achieve state-of-the-art OOD detection results.
LGJul 20, 2025Code
MMCircuitEval: A Comprehensive Multimodal Circuit-Focused Benchmark for Evaluating LLMsChenchen Zhao, Zhengyuan Shi, Xiangyu Wen et al.
The emergence of multimodal large language models (MLLMs) presents promising opportunities for automation and enhancement in Electronic Design Automation (EDA). However, comprehensively evaluating these models in circuit design remains challenging due to the narrow scope of existing benchmarks. To bridge this gap, we introduce MMCircuitEval, the first multimodal benchmark specifically designed to assess MLLM performance comprehensively across diverse EDA tasks. MMCircuitEval comprises 3614 meticulously curated question-answer (QA) pairs spanning digital and analog circuits across critical EDA stages - ranging from general knowledge and specifications to front-end and back-end design. Derived from textbooks, technical question banks, datasheets, and real-world documentation, each QA pair undergoes rigorous expert review for accuracy and relevance. Our benchmark uniquely categorizes questions by design stage, circuit type, tested abilities (knowledge, comprehension, reasoning, computation), and difficulty level, enabling detailed analysis of model capabilities and limitations. Extensive evaluations reveal significant performance gaps among existing LLMs, particularly in back-end design and complex computations, highlighting the critical need for targeted training datasets and modeling approaches. MMCircuitEval provides a foundational resource for advancing MLLMs in EDA, facilitating their integration into real-world circuit design workflows. Our benchmark is available at https://github.com/cure-lab/MMCircuitEval.
CVMar 12
FBCIR: Balancing Cross-Modal Focuses in Composed Image RetrievalChenchen Zhao, Jianhuan Zhuo, Muxi Chen et al.
Composed image retrieval (CIR) requires multi-modal models to jointly reason over visual content and semantic modifications presented in text-image input pairs. While current CIR models achieve strong performance on common benchmark cases, their accuracies often degrades in more challenging scenarios where negative candidates are semantically aligned with the query image or text. In this paper, we attribute this degradation to focus imbalances, where models disproportionately attend to one modality while neglecting the other. To validate this claim, we propose FBCIR, a multi-modal focus interpretation method that identifies the most crucial visual and textual input components to a model's retrieval decisions. Using FBCIR, we report that focus imbalances are prevalent in existing CIR models, especially under hard negative settings. Building on the analyses, we further propose a CIR data augmentation workflow that facilitates existing CIR datasets with curated hard negatives designed to encourage balanced cross-modal reasoning. Extensive experiments across multiple CIR models demonstrate that the proposed augmentation consistently improves performance in challenging cases, while maintaining their capabilities on standard benchmarks. Together, our interpretation method and data augmentation workflow provide a new perspective on CIR model diagnosis and robustness improvements.
LGOct 11, 2023
Multichannel consecutive data cross-extraction with 1DCNN-attention for diagnosis of power transformerWei Zheng, Guogang Zhang, Chenchen Zhao et al.
Power transformer plays a critical role in grid infrastructure, and its diagnosis is paramount for maintaining stable operation. However, the current methods for transformer diagnosis focus on discrete dissolved gas analysis, neglecting deep feature extraction of multichannel consecutive data. The unutilized sequential data contains the significant temporal information reflecting the transformer condition. In light of this, the structure of multichannel consecutive data cross-extraction (MCDC) is proposed in this article in order to comprehensively exploit the intrinsic characteristic and evaluate the states of transformer. Moreover, for the better accommodation in scenario of transformer diagnosis, one dimensional convolution neural network attention (1DCNN-attention) mechanism is introduced and offers a more efficient solution given the simplified spatial complexity. Finally, the effectiveness of MCDC and the superior generalization ability, compared with other algorithms, are validated in experiments conducted on a dataset collected from real operation cases of power transformer. Additionally, the better stability of 1DCNN-attention has also been certified.
CVSep 27, 2025Code
GRAPE: Let GPRO Supervise Query Rewriting by Ranking for RetrievalZhaohua Zhang, Jianhuan Zhuo, Muxi Chen et al.
The CLIP model has become a cornerstone of large-scale retrieval systems by aligning text and image data in a unified embedding space. Despite its simplicity and efficiency, CLIP struggles when applied to tasks whose input distributions diverge from its training corpus, such as queries with multilingual, long-form, or multimodal differences. To avoid costly retraining, existing methods mainly adopt query-rewriting strategies with large language models (LLMs), aiming to mitigate distribution gaps at the query level. However, due to the lack of supervision signals, LLMs fail to generate the optimal one that fits the training distribution. We address this challenge with GRAPE (Grouped Ranking-Aware Policy Optimization Enhancement), a plug-and-play enhancement approach that incorporates ranking signals into retrieval-guided query rewriting with LLMs. Intuitively, GRAPE proposes to leverage GRPO to bridge distributional differences -- including length, multilingual, and modality shifts -- by transforming queries into forms better aligned with the retriever's training distribution. However, our preliminary experiment finds that naively finetuning LLM with similarity scores can lead to score inflation, where nearly all candidates are assigned unexpectedly high scores regardless of their true relevance. To address score inflation, we propose a corpus-relative ranking-based reward, which explicitly aligns optimization with ranking metrics while suppressing spurious score inflation. Extensive experiments demonstrate that GRAPE consistently improves retrieval performance under distributional shifts -- including multilingual differences (Flickr30k-CN, CVLUE, XM3600), length differences (Wikipedia), and multimodal differences (CIRR) -- achieving an average improvement of 4.9\% in Recall\@10. The code is available at https://github.com/Chinese0123456/GRAPE.git
CVSep 26, 2025Code
FailureAtlas:Mapping the Failure Landscape of T2I Models via Active ExplorationMuxi Chen, Zhaohua Zhang, Chenchen Zhao et al.
Static benchmarks have provided a valuable foundation for comparing Text-to-Image (T2I) models. However, their passive design offers limited diagnostic power, struggling to uncover the full landscape of systematic failures or isolate their root causes. We argue for a complementary paradigm: active exploration. We introduce FailureAtlas, the first framework designed to autonomously explore and map the vast failure landscape of T2I models at scale. FailureAtlas frames error discovery as a structured search for minimal, failure-inducing concepts. While it is a computationally explosive problem, we make it tractable with novel acceleration techniques. When applied to Stable Diffusion models, our method uncovers hundreds of thousands of previously unknown error slices (over 247,000 in SD1.5 alone) and provides the first large-scale evidence linking these failures to data scarcity in the training set. By providing a principled and scalable engine for deep model auditing, FailureAtlas establishes a new, diagnostic-first methodology to guide the development of more robust generative AI. The code is available at https://github.com/cure-lab/FailureAtlas
CVJan 28, 2025
HiBug2: Efficient and Interpretable Error Slice Discovery for Comprehensive Model DebuggingMuxi Chen, Chenchen Zhao, Qiang Xu
Despite the significant success of deep learning models in computer vision, they often exhibit systematic failures on specific data subsets, known as error slices. Identifying and mitigating these error slices is crucial to enhancing model robustness and reliability in real-world scenarios. In this paper, we introduce HiBug2, an automated framework for error slice discovery and model repair. HiBug2 first generates task-specific visual attributes to highlight instances prone to errors through an interpretable and structured process. It then employs an efficient slice enumeration algorithm to systematically identify error slices, overcoming the combinatorial challenges that arise during slice exploration. Additionally, HiBug2 extends its capabilities by predicting error slices beyond the validation set, addressing a key limitation of prior approaches. Extensive experiments across multiple domains, including image classification, pose estimation, and object detection - show that HiBug2 not only improves the coherence and precision of identified error slices but also significantly enhances the model repair capabilities.
LGSep 26, 2025
Concept-SAE: Active Causal Probing of Visual Model BehaviorJianrong Ding, Muxi Chen, Chenchen Zhao et al.
Standard Sparse Autoencoders (SAEs) excel at discovering a dictionary of a model's learned features, offering a powerful observational lens. However, the ambiguous and ungrounded nature of these features makes them unreliable instruments for the active, causal probing of model behavior. To solve this, we introduce Concept-SAE, a framework that forges semantically grounded concept tokens through a novel hybrid disentanglement strategy. We first quantitatively demonstrate that our dual-supervision approach produces tokens that are remarkably faithful and spatially localized, outperforming alternative methods in disentanglement. This validated fidelity enables two critical applications: (1) we probe the causal link between internal concepts and predictions via direct intervention, and (2) we probe the model's failure modes by systematically localizing adversarial vulnerabilities to specific layers. Concept-SAE provides a validated blueprint for moving beyond correlational interpretation to the mechanistic, causal probing of model behavior.
CVDec 21, 2020
Blurring Fools the Network -- Adversarial Attacks by Feature Peak Suppression and Gaussian BlurringChenchen Zhao, Hao Li
Existing pixel-level adversarial attacks on neural networks may be deficient in real scenarios, since pixel-level changes on the data cannot be fully delivered to the neural network after camera capture and multiple image preprocessing steps. In contrast, in this paper, we argue from another perspective that gaussian blurring, a common technique of image preprocessing, can be aggressive itself in specific occasions, thus exposing the network to real-world adversarial attacks. We first propose an adversarial attack demo named peak suppression (PS) by suppressing the values of peak elements in the features of the data. Based on the blurring spirit of PS, we further apply gaussian blurring to the data, to investigate the potential influence and threats of gaussian blurring to performance of the network. Experiment results show that PS and well-designed gaussian blurring can form adversarial attacks that completely change classification results of a well-trained target network. With the strong physical significance and wide applications of gaussian blurring, the proposed approach will also be capable of conducting real world attacks.
CVDec 21, 2020
Amplifying the Anterior-Posterior Difference via Data Enhancement -- A More Robust Deep Monocular Orientation Estimation SolutionChenchen Zhao, Hao Li
Existing deep-learning based monocular orientation estimation algorithms faces the problem of confusion between the anterior and posterior parts of the objects, caused by the feature similarity of such parts in typical objects in traffic scenes such as cars and pedestrians. While difficult to solve, the problem may lead to serious orientation estimation errors, and pose threats to the upcoming decision making process of the ego vehicle, since the predicted tracks of objects may have directions opposite to ground truths. In this paper, we mitigate this problem by proposing a pretraining method. The method focuses on predicting the left/right semicircle in which the orientation of the object is located. The trained semicircle prediction model is then integrated into the orientation angle estimation model which predicts a value in range $[0, π]$. Experiment results show that the proposed semicircle prediction enhances the accuracy of orientation estimation, and mitigates the problem stated above. With the proposed method, a backbone achieves similar state-of-the-art orientation estimation performance to existing approaches with well-designed network structures.
CVDec 21, 2020
Exploiting Vulnerability of Pooling in Convolutional Neural Networks by Strict Layer-Output Manipulation for Adversarial AttacksChenchen Zhao, Hao Li
Convolutional neural networks (CNN) have been more and more applied in mobile robotics such as intelligent vehicles. Security of CNNs in robotics applications is an important issue, for which potential adversarial attacks on CNNs are worth research. Pooling is a typical step of dimension reduction and information discarding in CNNs. Such information discarding may result in mis-deletion and mis-preservation of data features which largely influence the output of the network. This may aggravate the vulnerability of CNNs to adversarial attacks. In this paper, we conduct adversarial attacks on CNNs from the perspective of network structure by investigating and exploiting the vulnerability of pooling. First, a novel adversarial attack methodology named Strict Layer-Output Manipulation (SLOM) is proposed. Then an attack method based on Strict Pooling Manipulation (SPM) which is an instantiation of the SLOM spirit is designed to effectively realize both type I and type II adversarial attacks on a target CNN. Performances of attacks based on SPM at different depths are also investigated and compared. Moreover, performances of attack methods designed by instantiating the SLOM spirit with different operation layers of CNNs are compared. Experiment results reflect that pooling tends to be more vulnerable to adversarial attacks than other operations in CNNs.
CVSep 24, 2019
Monocular Pedestrian Orientation Estimation Based on Deep 2D-3D FeedforwardChenchen Zhao, Yeqiang Qian, Ming Yang
Accurate pedestrian orientation estimation of autonomous driving helps the ego vehicle obtain the intentions of pedestrians in the related environment, which are the base of safety measures such as collision avoidance and prewarning. However, because of relatively small sizes and high-level deformation of pedestrians, common pedestrian orientation estimation models fail to extract sufficient and comprehensive information from them, thus having their performance restricted, especially monocular ones which fail to obtain depth information of objects and related environment. In this paper, a novel monocular pedestrian orientation estimation model, called FFNet, is proposed. Apart from camera captures, the model adds the 2D and 3D dimensions of pedestrians as two other inputs according to the logic relationship between orientation and them. The 2D and 3D dimensions of pedestrians are determined from the camera captures and further utilized through two feedforward links connected to the orientation estimator. The feedforward links strengthen the logicality and interpretability of the network structure of the proposed model. Experiments show that the proposed model has at least 1.72% AOS increase than most state-of-the-art models after identical training processes. The model also has competitive results in orientation estimation evaluation on KITTI dataset.