CVJun 30, 2023
Exploration and Exploitation of Unlabeled Data for Open-Set Semi-Supervised LearningGanlong Zhao, Guanbin Li, Yipeng Qin et al.
In this paper, we address a complex but practical scenario in semi-supervised learning (SSL) named open-set SSL, where unlabeled data contain both in-distribution (ID) and out-of-distribution (OOD) samples. Unlike previous methods that only consider ID samples to be useful and aim to filter out OOD ones completely during training, we argue that the exploration and exploitation of both ID and OOD samples can benefit SSL. To support our claim, i) we propose a prototype-based clustering and identification algorithm that explores the inherent similarity and difference among samples at feature level and effectively cluster them around several predefined ID and OOD prototypes, thereby enhancing feature learning and facilitating ID/OOD identification; ii) we propose an importance-based sampling method that exploits the difference in importance of each ID and OOD sample to SSL, thereby reducing the sampling bias and improving the training. Our proposed method achieves state-of-the-art in several challenging benchmarks, and improves upon existing SSL methods even when ID samples are totally absent in unlabeled data.
CVMay 20Code
Spatial Gram Alignment for Ultra-High-Resolution Image SynthesisJinjin Zhang, Xiefan Guo, Di Huang
Modern ultra-high-resolution image synthesis relies heavily on the robust generative capacity of large-scale pre-trained Latent Diffusion Models (LDMs). While recent representation alignment methods have proven effective by distilling visual priors from foundation models (e.g., SAM or DINO) into generative latent features, scaling these approaches to pre-trained LDMs at extreme resolutions exposes a critical learnability-fidelity conflict. Specifically, forcing direct patch-wise feature distillation inherently perturbs the pre-trained latent manifold, ultimately leading to generation degradation. To address this bottleneck, we propose Spatial Gram Alignment (SGA), a novel framework that explicitly leverages the representation priors of vision foundation models while preserving the native generative capacity of LDMs. Moving beyond restrictive direct alignment, SGA imposes a non-invasive spatial constraint by aligning the internal self-similarities of the generative features with those of the foundation priors. This spatial constraint effectively establishes macroscopic structural coherence, while the native generative objectives retain the microscopic pixel-level fidelity inherent to the original LDMs. Notably, this versatile strategy integrates seamlessly across both intermediate diffusion features and VAE latents within pre-trained LDMs. Extensive experiments demonstrate that SGA achieves state-of-the-art performance for ultra-high-resolution text-to-image synthesis, yielding an effective reconciliation between global structural integrity and fine-grained visual details. Code is available at https://github.com/zhang0jhon/SGA.
CVMay 19Code
What Makes Synthetic Data Effective in Image SegmentationJinjin Zhang, Xiefan Guo, Yizhou Jin et al.
Driven by rapid advances in large-scale generative models, synthetic data has emerged as a promising solution for visual understanding. While modern diffusion models achieve remarkable photorealistic image synthesis, their potential in complex visual segmentation tasks remains underexplored. In this work, we conduct a systematic analysis of synthetic images from state-of-the-art diffusion models to uncover the factors governing their utility. In particular, synthetic images characterized by dense scene composition and fine instance fidelity demonstrate distinctive benefits, yielding significantly more discriminative spatial representations. Building on these insights, we propose SENSE, a unified framework that leverages flexible and scalable synthetic data to substantially enhance segmentation performance. Notably, SENSE is model-agnostic, compatible with diverse architectures (e.g., DPT and Mask2Former), and scales effectively across models with varying parameter capacities. Extensive experiments on Cityscapes, COCO, and ADE20K validate the effectiveness and generalization capability of our approach. Code is available at https://github.com/zhang0jhon/SENSE.
CVMar 28Code
Reasoning-Driven Anomaly Detection and Localization with Image-Level SupervisionYizhou Jin, Yuezhu Feng, Jinjin Zhang et al.
Multimodal large language models (MLLMs) have recently demonstrated remarkable reasoning and perceptual abilities for anomaly detection. However, most approaches remain confined to image-level anomaly detection and textual reasoning, while pixel-level localization still relies on external vision modules and dense annotations. In this work, we activate the intrinsic reasoning potential of MLLMs to perform anomaly detection, pixel-level localization, and interpretable reasoning solely from image-level supervision, without any auxiliary components or pixel-wise labels. Specifically, we propose Reasoning-Driven Anomaly Localization (ReAL), which extracts anomaly-related tokens from the autoregressive reasoning process and aggregates their attention responses to produce pixel-level anomaly maps. We further introduce a Consistency-Guided Reasoning Optimization (CGRO) module that leverages reinforcement learning to align reasoning tokens with visual attentions, resulting in more coherent reasoning and accurate anomaly localization. Extensive experiments on four public benchmarks demonstrate that our method significantly improves anomaly detection, localization, and interpretability. Remarkably, despite relying solely on image-level supervision, our approach achieves performance competitive with MLLM-based methods trained under dense pixel-level supervision. Code is available at https://github.com/YizhouJin313/ReADL.
FLMay 11
The Similarity Control Problem with Required EventsYu Wang, Zhaohui Zhu, Rob van Glabbeek et al.
In order to guarantee that a supervised system satisfies safety requirements of the specification, as well as requirements saying that in certain states certain events must be enabled, this paper introduces required events for discrete event systems and reconsiders the similarity control problem while taking all requirements from the specification into account. The notion of a covariant-contravariant simulation, which is finer than the conventional notion of simulation, is adopted to act as the behavioral relation of supervisory control theory. A necessary and sufficient condition for the solvability of this problem is established and a method for synthesizing a maximally permissive supervisor is provided.
CVMar 24, 2025Code
Towards Training-free Anomaly Detection with Vision and Language Foundation ModelsJinjin Zhang, Guodong Wang, Yizhou Jin et al.
Anomaly detection is valuable for real-world applications, such as industrial quality inspection. However, most approaches focus on detecting local structural anomalies while neglecting compositional anomalies incorporating logical constraints. In this paper, we introduce LogSAD, a novel multi-modal framework that requires no training for both Logical and Structural Anomaly Detection. First, we propose a match-of-thought architecture that employs advanced large multi-modal models (i.e. GPT-4V) to generate matching proposals, formulating interests and compositional rules of thought for anomaly detection. Second, we elaborate on multi-granularity anomaly detection, consisting of patch tokens, sets of interests, and composition matching with vision and language foundation models. Subsequently, we present a calibration module to align anomaly scores from different detectors, followed by integration strategies for the final decision. Consequently, our approach addresses both logical and structural anomaly detection within a unified framework and achieves state-of-the-art results without the need for training, even when compared to supervised approaches, highlighting its robustness and effectiveness. Code is available at https://github.com/zhang0jhon/LogSAD.
CVJun 2, 2025Code
Ultra-High-Resolution Image Synthesis: Data, Method and EvaluationJinjin Zhang, Qiuyu Huang, Junjie Liu et al.
Ultra-high-resolution image synthesis holds significant potential, yet remains an underexplored challenge due to the absence of standardized benchmarks and computational constraints. In this paper, we establish Aesthetic-4K, a meticulously curated dataset containing dedicated training and evaluation subsets specifically designed for comprehensive research on ultra-high-resolution image synthesis. This dataset consists of high-quality 4K images accompanied by descriptive captions generated by GPT-4o. Furthermore, we propose Diffusion-4K, an innovative framework for the direct generation of ultra-high-resolution images. Our approach incorporates the Scale Consistent Variational Auto-Encoder (SC-VAE) and Wavelet-based Latent Fine-tuning (WLF), which are designed for efficient visual token compression and the capture of intricate details in ultra-high-resolution images, thereby facilitating direct training with photorealistic 4K data. This method is applicable to various latent diffusion models and demonstrates its efficacy in synthesizing highly detailed 4K images. Additionally, we propose novel metrics, namely the GLCM Score and Compression Ratio, to assess the texture richness and fine details in local patches, in conjunction with holistic measures such as FID, Aesthetics, and CLIPScore, enabling a thorough and multifaceted evaluation of ultra-high-resolution image synthesis. Consequently, Diffusion-4K achieves impressive performance in ultra-high-resolution image synthesis, particularly when powered by state-of-the-art large-scale diffusion models (eg, Flux-12B). The source code is publicly available at https://github.com/zhang0jhon/diffusion-4k.
CVFeb 6
LIBERO-X: Robustness Litmus for Vision-Language-Action ModelsGuodong Wang, Chenkai Zhang, Qingjie Liu et al.
Reliable benchmarking is critical for advancing Vision-Language-Action (VLA) models, as it reveals their generalization, robustness, and alignment of perception with language-driven manipulation tasks. However, existing benchmarks often provide limited or misleading assessments due to insufficient evaluation protocols that inadequately capture real-world distribution shifts. This work systematically rethinks VLA benchmarking from both evaluation and data perspectives, introducing LIBERO-X, a benchmark featuring: 1) A hierarchical evaluation protocol with progressive difficulty levels targeting three core capabilities: spatial generalization, object recognition, and task instruction understanding. This design enables fine-grained analysis of performance degradation under increasing environmental and task complexity; 2) A high-diversity training dataset collected via human teleoperation, where each scene supports multiple fine-grained manipulation objectives to bridge the train-evaluation distribution gap. Experiments with representative VLA models reveal significant performance drops under cumulative perturbations, exposing persistent limitations in scene comprehension and instruction grounding. By integrating hierarchical evaluation with diverse training data, LIBERO-X offers a more reliable foundation for assessing and advancing VLA development.
CVDec 10, 2019Code
A Feasible Framework for Arbitrary-Shaped Scene Text RecognitionJinjin Zhang, Wei Wang, Di Huang et al.
Deep learning based methods have achieved surprising progress in Scene Text Recognition (STR), one of classic problems in computer vision. In this paper, we propose a feasible framework for multi-lingual arbitrary-shaped STR, including instance segmentation based text detection and language model based attention mechanism for text recognition. Our STR algorithm not only recognizes Latin and Non-Latin characters, but also supports arbitrary-shaped text recognition. Our method wins the championship on Scene Text Spotting Task (Latin Only, Latin and Chinese) of ICDAR2019 Robust Reading Challenge on ArbitraryShaped Text Competition. Code is available at https://github.com/zhang0jhon/AttentionOCR.
CVMar 24, 2025
Diffusion-4K: Ultra-High-Resolution Image Synthesis with Latent Diffusion ModelsJinjin Zhang, Qiuyu Huang, Junjie Liu et al.
In this paper, we present Diffusion-4K, a novel framework for direct ultra-high-resolution image synthesis using text-to-image diffusion models. The core advancements include: (1) Aesthetic-4K Benchmark: addressing the absence of a publicly available 4K image synthesis dataset, we construct Aesthetic-4K, a comprehensive benchmark for ultra-high-resolution image generation. We curated a high-quality 4K dataset with carefully selected images and captions generated by GPT-4o. Additionally, we introduce GLCM Score and Compression Ratio metrics to evaluate fine details, combined with holistic measures such as FID, Aesthetics and CLIPScore for a comprehensive assessment of ultra-high-resolution images. (2) Wavelet-based Fine-tuning: we propose a wavelet-based fine-tuning approach for direct training with photorealistic 4K images, applicable to various latent diffusion models, demonstrating its effectiveness in synthesizing highly detailed 4K images. Consequently, Diffusion-4K achieves impressive performance in high-quality image synthesis and text prompt adherence, especially when powered by modern large-scale diffusion models (e.g., SD3-2B and Flux-12B). Extensive experimental results from our benchmark demonstrate the superiority of Diffusion-4K in ultra-high-resolution image synthesis.
CVDec 5, 2024
ONER: Online Experience Replay for Incremental Anomaly DetectionYizhou Jin, Jiahui Zhu, Guodong Wang et al.
Incremental anomaly detection aims to sequentially identify defects in industrial product lines but suffers from catastrophic forgetting, primarily due to knowledge overwriting during parameter updates and feature conflicts between tasks. In this work, We propose ONER (ONline Experience Replay), an end-to-end framework that addresses these issues by synergistically integrating two types of experience: (1) decomposed prompts, which dynamically generate image-conditioned prompts from reusable modules to retain prior knowledge thus prevent knowledge overwriting, and (2) semantic prototypes, which enforce separability in latent feature spaces at pixel and image levels to mitigate cross-task feature conflicts. Extensive experiments demonstrate the superiority of ONER, achieving state-of-the-art performance with +4.4% Pixel AUROC and +28.3% Pixel AUPR improvements on the MVTec AD dataset over prior methods. Remarkably, ONER achieves this with only 0.019M parameters and 5 training epochs per task, confirming its efficiency and stability for real-world industrial deployment.
LGOct 8, 2025
Targeted Digital Twin via Flow Map Learning and Its Application to Fluid DynamicsQifan Chen, Zhongshu Xu, Jinjin Zhang et al.
We present a numerical framework for constructing a targeted digital twin (tDT) that directly models the dynamics of quantities of interest (QoIs) in a full digital twin (DT). The proposed approach employs memory-based flow map learning (FML) to develop a data-driven model of the QoIs using short bursts of trajectory data generated through repeated executions of the full DT. This renders the construction of the FML-based tDT an entirely offline computational process. During online simulation, the learned tDT can efficiently predict and analyze the long-term dynamics of the QoIs without requiring simulations of the full DT system, thereby achieving substantial computational savings. After introducing the general numerical procedure, we demonstrate the construction and predictive capability of the tDT in a computational fluid dynamics (CFD) example: two-dimensional incompressible flow past a cylinder. The QoIs in this problem are the hydrodynamic forces exerted on the cylinder. The resulting tDTs are compact dynamical systems that evolve these forces without explicit knowledge of the underlying flow field. Numerical results show that the tDTs yield accurate long-term predictions of the forces while entirely bypassing full flow simulations.
CVJun 13, 2021
Representation and Correlation Enhanced Encoder-Decoder Framework for Scene Text RecognitionMengmeng Cui, Wei Wang, Jinjin Zhang et al.
Attention-based encoder-decoder framework is widely used in the scene text recognition task. However, for the current state-of-the-art(SOTA) methods, there is room for improvement in terms of the efficient usage of local visual and global context information of the input text image, as well as the robust correlation between the scene processing module(encoder) and the text processing module(decoder). In this paper, we propose a Representation and Correlation Enhanced Encoder-Decoder Framework(RCEED) to address these deficiencies and break performance bottleneck. In the encoder module, local visual feature, global context feature, and position information are aligned and fused to generate a small-size comprehensive feature map. In the decoder module, two methods are utilized to enhance the correlation between scene and text feature space. 1) The decoder initialization is guided by the holistic feature and global glimpse vector exported from the encoder. 2) The feature enriched glimpse vector produced by the Multi-Head General Attention is used to assist the RNN iteration and the character prediction at each time step. Meanwhile, we also design a Layernorm-Dropout LSTM cell to improve model's generalization towards changeable texts. Extensive experiments on the benchmarks demonstrate the advantageous performance of RCEED in scene text recognition tasks, especially the irregular ones.
AIDec 19, 2020
More on extension-based semantics of argumentationLixing Tan, Zhaohui Zhu, Jinjin Zhang
After a few decades of development, computational argumentation has become one of the active realms in AI. This paper considers extension-based concrete and abstract semantics of argumentation. For concrete ones, based on Grossi and Modgil's recent work, this paper considers some issues on graded extension-based semantics of abstract argumentation framework (AAF, for short). First, an alternative fundamental lemma is given, which generalizes the corresponding result due to Grossi and Modgil by relaxing the constraint on parameters. This lemma provides a new sufficient condition for preserving conflict-freeness and brings a Galois adjunction between admissible sets and complete extensions, which is of vital importance in constructing some special extensions in terms of iterations of the defense function. Applying such a lemma, some flaws in Grossi and Modgil's work are corrected, and the structural property and universal definability of various extension-based semantics are given. Second, an operator so-called reduced meet modulo an ultrafilter is presented, which is a simple but powerful tool in exploring infinite AAFs. The neutrality function and the defense function, which play central roles in Dung's abstract argumentation theory, are shown to be distributive over reduced meets modulo any ultrafilter. A variety of fundamental semantics of AAFs, including conflict-free, admissible, complete and stable semantics, etc, are shown to be closed under this operator. Based on this fact, a number of applications of such operators are considered. In particular, we provide a simple and uniform method to prove the universal definability of a family of range related semantics. Since all graded concrete semantics considered in this paper are generalizations of corresponding non-graded ones, all results about them obtained in this paper also hold in the traditional situation.
CVMay 22, 2018
Pose-Based Two-Stream Relational Networks for Action Recognition in VideosWei Wang, Jinjin Zhang, Chenyang Si et al.
Recently, pose-based action recognition has gained more and more attention due to the better performance compared with traditional appearance-based methods. However, there still exist two problems to be further solved. First, existing pose-based methods generally recognize human actions with captured 3D human poses which are very difficult to obtain in real scenarios. Second, few pose-based methods model the action-related objects in recognizing human-object interaction actions in which objects play an important role. To solve the problems above, we propose a pose-based two-stream relational network (PSRN) for action recognition. In PSRN, one stream models the temporal dynamics of the targeted 2D human pose sequences which are directly extracted from raw videos, and the other stream models the action-related objects from a randomly sampled video frame. Most importantly, instead of fusing two-streams in the class score layer as before, we propose a pose-object relational network to model the relationship between human poses and action-related objects. We evaluate the proposed PSRN on two challenging benchmarks, i.e., Sub-JHMDB and PennAction. Experimental results show that our PSRN obtains the state-the-of-art performance on Sub-JHMDB (80.2%) and PennAction (98.1%). Our work opens a new door to action recognition by combining 2D human pose extracted from raw video and image appearance.
LOJan 15, 2013
On Recursive Operations Over Logic LTSYan Zhang, Zhaohui Zhu, Jinjin Zhang
Recently, in order to mix algebraic and logic styles of specification in a uniform framework, the notion of a logic labelled transition system (Logic LTS or LLTS for short) has been introduced and explored. A variety of constructors over LLTS, including usual process-algebraic operators, logic connectives (conjunction and disjunction) and standard temporal operators (always and unless), have been given. However, no attempt has made so far to develop general theory concerning (nested) recursive operations over LLTSs and a few fundamental problems are still open. This paper intends to study this issue in pure process-algebraic style. A few fundamental properties, including precongruence and the uniqueness of consistent solutions for equations, will be established.