Jianxin Wang

CV
h-index86
20papers
134citations
Novelty53%
AI Score50

20 Papers

8.4CVJun 21, 2023Code
Inter-Instance Similarity Modeling for Contrastive Learning

Chengchao Shen, Dawei Liu, Hao Tang et al.

The existing contrastive learning methods widely adopt one-hot instance discrimination as pretext task for self-supervised learning, which inevitably neglects rich inter-instance similarities among natural images, then leading to potential representation degeneration. In this paper, we propose a novel image mix method, PatchMix, for contrastive learning in Vision Transformer (ViT), to model inter-instance similarities among images. Following the nature of ViT, we randomly mix multiple images from mini-batch in patch level to construct mixed image patch sequences for ViT. Compared to the existing sample mix methods, our PatchMix can flexibly and efficiently mix more than two images and simulate more complicated similarity relations among natural images. In this manner, our contrastive framework can significantly reduce the gap between contrastive objective and ground truth in reality. Experimental results demonstrate that our proposed method significantly outperforms the previous state-of-the-art on both ImageNet-1K and CIFAR datasets, e.g., 3.0% linear accuracy improvement on ImageNet-1K and 8.7% kNN accuracy improvement on CIFAR100. Moreover, our method achieves the leading transfer performance on downstream tasks, object detection and instance segmentation on COCO dataset. The code is available at https://github.com/visresearch/patchmix

3.7CVOct 9, 2022
Coded Residual Transform for Generalizable Deep Metric Learning

Shichao Kan, Yixiong Liang, Min Li et al.

A fundamental challenge in deep metric learning is the generalization capability of the feature embedding network model since the embedding network learned on training classes need to be evaluated on new test classes. To address this challenge, in this paper, we introduce a new method called coded residual transform (CRT) for deep metric learning to significantly improve its generalization capability. Specifically, we learn a set of diversified prototype features, project the feature map onto each prototype, and then encode its features using their projection residuals weighted by their correlation coefficients with each prototype. The proposed CRT method has the following two unique characteristics. First, it represents and encodes the feature map from a set of complimentary perspectives based on projections onto diversified prototypes. Second, unlike existing transformer-based feature representation approaches which encode the original values of features based on global correlation analysis, the proposed coded residual transform encodes the relative differences between the original features and their projected prototypes. Embedding space density and spectral decay analysis show that this multi-perspective projection onto diversified prototypes and coded residual representation are able to achieve significantly improved generalization capability in metric learning. Finally, to further enhance the generalization performance, we propose to enforce the consistency on their feature similarity matrices between coded residual transforms with different sizes of projection prototypes and embedding dimensions. Our extensive experimental results and ablation studies demonstrate that the proposed CRT method outperform the state-of-the-art deep metric learning methods by large margins and improving upon the current best method by up to 4.28% on the CUB dataset.

5.2CVJul 2, 2024Code
Multi-Grained Contrast for Data-Efficient Unsupervised Representation Learning

Chengchao Shen, Jianzhong Chen, Jianxin Wang

The existing contrastive learning methods mainly focus on single-grained representation learning, e.g., part-level, object-level or scene-level ones, thus inevitably neglecting the transferability of representations on other granularity levels. In this paper, we aim to learn multi-grained representations, which can effectively describe the image on various granularity levels, thus improving generalization on extensive downstream tasks. To this end, we propose a novel Multi-Grained Contrast method (MGC) for unsupervised representation learning. Specifically, we construct delicate multi-grained correspondences between positive views and then conduct multi-grained contrast by the correspondences to learn more general unsupervised representations. Without pretrained on large-scale dataset, our method significantly outperforms the existing state-of-the-art methods on extensive downstream tasks, including object detection, instance segmentation, scene parsing, semantic segmentation and keypoint detection. Moreover, experimental results support the data-efficient property and excellent representation transferability of our method. The source code and trained weights are available at \url{https://github.com/visresearch/mgc}.

1.4CVJun 27, 2022
Mushroom image recognition and distance generation based on attention-mechanism model and genetic information

Wenbin Liao, Jiewen Xiao, Chengbo Zhao et al.

The species identification of Macrofungi, i.e. mushrooms, has always been a challenging task. There are still a large number of poisonous mushrooms that have not been found, which poses a risk to people's life. However, the traditional identification method requires a large number of experts with knowledge in the field of taxonomy for manual identification, it is not only inefficient but also consumes a lot of manpower and capital costs. In this paper, we propose a new model based on attention-mechanism, MushroomNet, which applies the lightweight network MobileNetV3 as the backbone model, combined with the attention structure proposed by us, and has achieved excellent performance in the mushroom recognition task. On the public dataset, the test accuracy of the MushroomNet model has reached 83.9%, and on the local dataset, the test accuracy has reached 77.4%. The proposed attention mechanisms well focused attention on the bodies of mushroom image for mixed channel attention and the attention heat maps visualized by Grad-CAM. Further, in this study, genetic distance was added to the mushroom image recognition task, the genetic distance was used as the representation space, and the genetic distance between each pair of mushroom species in the dataset was used as the embedding of the genetic distance representation space, so as to predict the image distance and species. identify. We found that using the MES activation function can predict the genetic distance of mushrooms very well, but the accuracy is lower than that of SoftMax. The proposed MushroomNet was demonstrated it shows great potential for automatic and online mushroom image and the proposed automatic procedure would assist and be a reference to traditional mushroom classification.

7.9MLMay 31, 2022Code
Optimal Activation Functions for the Random Features Regression Model

Jianxin Wang, José Bento

The asymptotic mean squared test error and sensitivity of the Random Features Regression model (RFR) have been recently studied. We build on this work and identify in closed-form the family of Activation Functions (AFs) that minimize a combination of the test error and sensitivity of the RFR under different notions of functional parsimony. We find scenarios under which the optimal AFs are linear, saturated linear functions, or expressible in terms of Hermite polynomials. Finally, we show how using optimal AFs impacts well-established properties of the RFR model, such as its double descent curve, and the dependency of its optimal regularization parameter on the observation noise level.

1.4CVJun 28, 2022Code
Taxonomy and evolution predicting using deep learning in images

Jiewen Xiao, Wenbin Liao, Ming Zhang et al.

Molecular and morphological characters, as important parts of biological taxonomy, are contradictory but need to be integrated. Organism's image recognition and bioinformatics are emerging and hot problems nowadays but with a gap between them. In this work, a multi-branching recognition framework mediated by genetic information bridges this barrier, which establishes the link between macro-morphology and micro-molecular information of mushrooms. The novel multi-perspective structure is proposed to fuse the feature images from three branching models, which significantly improves the accuracy of recognition by about 10% and up to more than 90%. Further, genetic information is implemented to the mushroom image recognition task by using genetic distance embeddings as the representation space for predicting image distance and species identification. Semantic overfitting of traditional classification tasks and the granularity of fine-grained image recognition are also discussed in depth for the first time. The generalizability of the model was investigated in fine-grained scenarios using zero-shot learning tasks, which could predict the taxonomic and evolutionary information of unseen samples. We presented the first method to map images to DNA, namely used an encoder mapping image to genetic distances, and then decoded DNA through a pre-trained decoder, where the total test accuracy on 37 species for DNA prediction is 87.45%. This study creates a novel recognition framework by systematically studying the mushroom image recognition problem, bridging the gap between macroscopic biological information and microscopic molecular information, which will provide a new reference for intelligent biometrics in the future.

1.2DCAug 26, 2024
Resource Efficient Asynchronous Federated Learning for Digital Twin Empowered IoT Network

Shunfeng Chu, Jun Li, Jianxin Wang et al.

As an emerging technology, digital twin (DT) can provide real-time status and dynamic topology mapping for Internet of Things (IoT) devices. However, DT and its implementation within industrial IoT networks necessitates substantial, distributed data support, which often leads to ``data silos'' and raises privacy concerns. To address these issues, we develop a dynamic resource scheduling algorithm tailored for the asynchronous federated learning (FL)-based lightweight DT empowered IoT network. Specifically, our approach aims to minimize a multi-objective function that encompasses both energy consumption and latency by optimizing IoT device selection and transmit power control, subject to FL model performance constraints. We utilize the Lyapunov method to decouple the formulated problem into a series of one-slot optimization problems and develop a two-stage optimization algorithm to achieve the optimal transmission power control and IoT device scheduling strategies. In the first stage, we derive closed-form solutions for optimal transmit power on the IoT device side. In the second stage, since partial state information is unknown, e.g., the transmitting power and computational frequency of IoT device, the edge server employs a multi-armed bandit (MAB) framework to model the IoT device selection problem and utilizes an efficient online algorithm, namely the client utility-based upper confidence bound (CU-UCB), to address it. Numerical results validate our algorithm's superiority over benchmark schemes, and simulations demonstrate that our algorithm achieves faster training speeds on the Fashion-MNIST and CIFAR-10 datasets within the same training duration.

10.7LGSep 5, 2023
sasdim: self-adaptive noise scaling diffusion model for spatial time series imputation

Shunyang Zhang, Senzhang Wang, Xianzhen Tan et al.

Spatial time series imputation is critically important to many real applications such as intelligent transportation and air quality monitoring. Although recent transformer and diffusion model based approaches have achieved significant performance gains compared with conventional statistic based methods, spatial time series imputation still remains as a challenging issue due to the complex spatio-temporal dependencies and the noise uncertainty of the spatial time series data. Especially, recent diffusion process based models may introduce random noise to the imputations, and thus cause negative impact on the model performance. To this end, we propose a self-adaptive noise scaling diffusion model named SaSDim to more effectively perform spatial time series imputation. Specially, we propose a new loss function that can scale the noise to the similar intensity, and propose the across spatial-temporal global convolution module to more effectively capture the dynamic spatial-temporal dependencies. Extensive experiments conducted on three real world datasets verify the effectiveness of SaSDim by comparison with current state-of-the-art baselines.

13.5CLSep 18, 2024Code
RUIE: Retrieval-based Unified Information Extraction using Large Language Model

Xincheng Liao, Junwen Duan, Yixi Huang et al.

Unified information extraction (UIE) aims to extract diverse structured information from unstructured text. While large language models (LLMs) have shown promise for UIE, they require significant computational resources and often struggle to generalize to unseen tasks. We propose RUIE (Retrieval-based Unified Information Extraction), a framework that leverages in-context learning for efficient task generalization. RUIE introduces a novel demonstration selection mechanism combining LLM preferences with a keyword-enhanced reward model, and employs a bi-encoder retriever trained through contrastive learning and knowledge distillation. As the first trainable retrieval framework for UIE, RUIE serves as a universal plugin for various LLMs. Experimental results on eight held-out datasets demonstrate RUIE's effectiveness, with average F1-score improvements of 19.22 and 3.22 compared to instruction-tuning methods and other retrievers, respectively.

4.1LGJul 22, 2025Code
Bipartite Patient-Modality Graph Learning with Event-Conditional Modelling of Censoring for Cancer Survival Prediction

Hailin Yue, Hulin Kuang, Jin Liu et al.

Accurately predicting the survival of cancer patients is crucial for personalized treatment. However, existing studies focus solely on the relationships between samples with known survival risks, without fully leveraging the value of censored samples. Furthermore, these studies may suffer performance degradation in modality-missing scenarios and even struggle during the inference process. In this study, we propose a bipartite patient-modality graph learning with event-conditional modelling of censoring for cancer survival prediction (CenSurv). Specifically, we first use graph structure to model multimodal data and obtain representation. Then, to alleviate performance degradation in modality-missing scenarios, we design a bipartite graph to simulate the patient-modality relationship in various modality-missing scenarios and leverage a complete-incomplete alignment strategy to explore modality-agnostic features. Finally, we design a plug-and-play event-conditional modeling of censoring (ECMC) that selects reliable censored data using dynamic momentum accumulation confidences, assigns more accurate survival times to these censored data, and incorporates them as uncensored data into training. Comprehensive evaluations on 5 publicly cancer datasets showcase the superiority of CenSurv over the best state-of-the-art by 3.1% in terms of the mean C-index, while also exhibiting excellent robustness under various modality-missing scenarios. In addition, using the plug-and-play ECMC module, the mean C-index of 8 baselines increased by 1.3% across 5 datasets. Code of CenSurv is available at https://github.com/yuehailin/CenSurv.

3.6CVJun 9, 2025Code
Multiple Object Stitching for Unsupervised Representation Learning

Chengchao Shen, Dawei Liu, Jianxin Wang

Contrastive learning for single object centric images has achieved remarkable progress on unsupervised representation, but suffering inferior performance on the widespread images with multiple objects. In this paper, we propose a simple but effective method, Multiple Object Stitching (MOS), to refine the unsupervised representation for multi-object images. Specifically, we construct the multi-object images by stitching the single object centric ones, where the objects in the synthesized multi-object images are predetermined. Hence, compared to the existing contrastive methods, our method provides additional object correspondences between multi-object images without human annotations. In this manner, our method pays more attention to the representations of each object in multi-object image, thus providing more detailed representations for complicated downstream tasks, such as object detection and semantic segmentation. Experimental results on ImageNet, CIFAR and COCO datasets demonstrate that our proposed method achieves the leading unsupervised representation performance on both single object centric images and multi-object ones. The source code is available at https://github.com/visresearch/MultipleObjectStitching.

3.6CVJun 10, 2025Code
Diversity-Guided MLP Reduction for Efficient Large Vision Transformers

Chengchao Shen, Hourun Zhu, Gongfan Fang et al.

Transformer models achieve excellent scaling property, where the performance is improved with the increment of model capacity. However, large-scale model parameters lead to an unaffordable cost of computing and memory. We analyze popular transformer architectures and find that multilayer perceptron (MLP) modules take up the majority of model parameters. To this end, we focus on the recoverability of the compressed models and propose a Diversity-Guided MLP Reduction (DGMR) method to significantly reduce the parameters of large vision transformers with only negligible performance degradation. Specifically, we conduct a Gram-Schmidt weight pruning strategy to eliminate redundant neurons of MLP hidden layer, while preserving weight diversity for better performance recover during distillation. Compared to the model trained from scratch, our pruned model only requires 0.06\% data of LAION-2B (for the training of large vision transformers) without labels (ImageNet-1K) to recover the original performance. Experimental results on several state-of-the-art large vision transformers demonstrate that our method achieves a more than 57.0\% parameter and FLOPs reduction in a near lossless manner. Notably, for EVA-CLIP-E (4.4B), our method accomplishes a 71.5\% parameter and FLOPs reduction without performance degradation. The source code and trained weights are available at https://github.com/visresearch/DGMR.

9.4RODec 17, 2024Code
C2F-TP: A Coarse-to-Fine Denoising Framework for Uncertainty-Aware Trajectory Prediction

Zichen Wang, Hao Miao, Senzhang Wang et al.

Accurately predicting the trajectory of vehicles is critically important for ensuring safety and reliability in autonomous driving. Although considerable research efforts have been made recently, the inherent trajectory uncertainty caused by various factors including the dynamic driving intends and the diverse driving scenarios still poses significant challenges to accurate trajectory prediction. To address this issue, we propose C2F-TP, a coarse-to-fine denoising framework for uncertainty-aware vehicle trajectory prediction. C2F-TP features an innovative two-stage coarse-to-fine prediction process. Specifically, in the spatial-temporal interaction stage, we propose a spatial-temporal interaction module to capture the inter-vehicle interactions and learn a multimodal trajectory distribution, from which a certain number of noisy trajectories are sampled. Next, in the trajectory refinement stage, we design a conditional denoising model to reduce the uncertainty of the sampled trajectories through a step-wise denoising operation. Extensive experiments are conducted on two real datasets NGSIM and highD that are widely adopted in trajectory prediction. The result demonstrates the effectiveness of our proposal.

8.4CVJul 25, 2025
Perspective from a Higher Dimension: Can 3D Geometric Priors Help Visual Floorplan Localization?

Bolei Chen, Jiaxu Kang, Haonan Yang et al.

Since a building's floorplans are easily accessible, consistent over time, and inherently robust to changes in visual appearance, self-localization within the floorplan has attracted researchers' interest. However, since floorplans are minimalist representations of a building's structure, modal and geometric differences between visual perceptions and floorplans pose challenges to this task. While existing methods cleverly utilize 2D geometric features and pose filters to achieve promising performance, they fail to address the localization errors caused by frequent visual changes and view occlusions due to variously shaped 3D objects. To tackle these issues, this paper views the 2D Floorplan Localization (FLoc) problem from a higher dimension by injecting 3D geometric priors into the visual FLoc algorithm. For the 3D geometric prior modeling, we first model geometrically aware view invariance using multi-view constraints, i.e., leveraging imaging geometric principles to provide matching constraints between multiple images that see the same points. Then, we further model the view-scene aligned geometric priors, enhancing the cross-modal geometry-color correspondences by associating the scene's surface reconstruction with the RGB frames of the sequence. Both 3D priors are modeled through self-supervised contrastive learning, thus no additional geometric or semantic annotations are required. These 3D priors summarized in extensive realistic scenes bridge the modal gap while improving localization success without increasing the computational burden on the FLoc algorithm. Sufficient comparative studies demonstrate that our method significantly outperforms state-of-the-art methods and substantially boosts the FLoc accuracy. All data and code will be released after the anonymous review.

13.5ROMay 27, 2025
PartInstruct: Part-level Instruction Following for Fine-grained Robot Manipulation

Yifan Yin, Zhengtao Han, Shivam Aarya et al.

Fine-grained robot manipulation, such as lifting and rotating a bottle to display the label on the cap, requires robust reasoning about object parts and their relationships with intended tasks. Despite recent advances in training general-purpose robot manipulation policies guided by language instructions, there is a notable lack of large-scale datasets for fine-grained manipulation tasks with part-level instructions and diverse 3D object instances annotated with part-level labels. In this work, we introduce PartInstruct, the first large-scale benchmark for training and evaluating fine-grained robot manipulation models using part-level instructions. PartInstruct comprises 513 object instances across 14 categories, each annotated with part-level information, and 1302 fine-grained manipulation tasks organized into 16 task classes. Our training set consists of over 10,000 expert demonstrations synthesized in a 3D simulator, where each demonstration is paired with a high-level task instruction, a chain of base part-based skill instructions, and ground-truth 3D information about the object and its parts. Additionally, we designed a comprehensive test suite to evaluate the generalizability of learned policies across new states, objects, and tasks. We evaluated several state-of-the-art robot manipulation approaches, including end-to-end vision-language policy learning and bi-level planning models for robot manipulation on our benchmark. The experimental results reveal that current models struggle to robustly ground part concepts and predict actions in 3D space, and face challenges when manipulating object parts in long-horizon tasks.

6.2CVAug 2, 2025
Perspective from a Broader Context: Can Room Style Knowledge Help Visual Floorplan Localization?

Bolei Chen, Shengsheng Yan, Yongzheng Cui et al.

Since a building's floorplan remains consistent over time and is inherently robust to changes in visual appearance, visual Floorplan Localization (FLoc) has received increasing attention from researchers. However, as a compact and minimalist representation of the building's layout, floorplans contain many repetitive structures (e.g., hallways and corners), thus easily result in ambiguous localization. Existing methods either pin their hopes on matching 2D structural cues in floorplans or rely on 3D geometry-constrained visual pre-trainings, ignoring the richer contextual information provided by visual images. In this paper, we suggest using broader visual scene context to empower FLoc algorithms with scene layout priors to eliminate localization uncertainty. In particular, we propose an unsupervised learning technique with clustering constraints to pre-train a room discriminator on self-collected unlabeled room images. Such a discriminator can empirically extract the hidden room type of the observed image and distinguish it from other room types. By injecting the scene context information summarized by the discriminator into an FLoc algorithm, the room style knowledge is effectively exploited to guide definite visual FLoc. We conducted sufficient comparative studies on two standard visual Floc benchmarks. Our experiments show that our approach outperforms state-of-the-art methods and achieves significant improvements in robustness and accuracy.

3.6CVJun 13, 2025
O2Former:Direction-Aware and Multi-Scale Query Enhancement for SAR Ship Instance Segmentation

F. Gao, Y Li, X He et al.

Instance segmentation of ships in synthetic aperture radar (SAR) imagery is critical for applications such as maritime monitoring, environmental analysis, and national security. SAR ship images present challenges including scale variation, object density, and fuzzy target boundary, which are often overlooked in existing methods, leading to suboptimal performance. In this work, we propose O2Former, a tailored instance segmentation framework that extends Mask2Former by fully leveraging the structural characteristics of SAR imagery. We introduce two key components. The first is the Optimized Query Generator(OQG). It enables multi-scale feature interaction by jointly encoding shallow positional cues and high-level semantic information. This improves query quality and convergence efficiency. The second component is the Orientation-Aware Embedding Module(OAEM). It enhances directional sensitivity through direction-aware convolution and polar-coordinate encoding. This effectively addresses the challenge of uneven target orientations in SAR scenes. Together, these modules facilitate precise feature alignment from backbone to decoder and strengthen the model's capacity to capture fine-grained structural details. Extensive experiments demonstrate that O2Former outperforms state of the art instance segmentation baselines, validating its effectiveness and generalization on SAR ship datasets.

2.7CLFeb 20, 2025
ICA-RAG: Information Completeness Guided Adaptive Retrieval-Augmented Generation for Disease Diagnosis

Jiawei He, Mingyi Jia, Zhihao Jia et al.

Retrieval-Augmented Large Language Models (LLMs), which integrate external knowledge, have shown remarkable performance in medical domains, including clinical diagnosis. However, existing RAG methods often struggle to tailor retrieval strategies to diagnostic difficulty and input sample informativeness. This limitation leads to excessive and often unnecessary retrieval, impairing computational efficiency and increasing the risk of introducing noise that can degrade diagnostic accuracy. To address this, we propose ICA-RAG (\textbf{I}nformation \textbf{C}ompleteness Guided \textbf{A}daptive \textbf{R}etrieval-\textbf{A}ugmented \textbf{G}eneration), a novel framework for enhancing RAG reliability in disease diagnosis. ICA-RAG utilizes an adaptive control module to assess the necessity of retrieval based on the input's information completeness. By optimizing retrieval and incorporating knowledge filtering, ICA-RAG better aligns retrieval operations with clinical requirements. Experiments on three Chinese electronic medical record datasets demonstrate that ICA-RAG significantly outperforms baseline methods, highlighting its effectiveness in clinical diagnosis.

13.4LGJun 4, 2024Code
PeFAD: A Parameter-Efficient Federated Framework for Time Series Anomaly Detection

Ronghui Xu, Hao Miao, Senzhang Wang et al.

With the proliferation of mobile sensing techniques, huge amounts of time series data are generated and accumulated in various domains, fueling plenty of real-world applications. In this setting, time series anomaly detection is practically important. It endeavors to identify deviant samples from the normal sample distribution in time series. Existing approaches generally assume that all the time series is available at a central location. However, we are witnessing the decentralized collection of time series due to the deployment of various edge devices. To bridge the gap between the decentralized time series data and the centralized anomaly detection algorithms, we propose a Parameter-efficient Federated Anomaly Detection framework named PeFAD with the increasing privacy concerns. PeFAD for the first time employs the pre-trained language model (PLM) as the body of the client's local model, which can benefit from its cross-modality knowledge transfer capability. To reduce the communication overhead and local model adaptation cost, we propose a parameter-efficient federated training module such that clients only need to fine-tune small-scale parameters and transmit them to the server for update. PeFAD utilizes a novel anomaly-driven mask selection strategy to mitigate the impact of neglected anomalies during training. A knowledge distillation operation on a synthetic privacy-preserving dataset that is shared by all the clients is also proposed to address the data heterogeneity issue across clients. We conduct extensive evaluations on four real datasets, where PeFAD outperforms existing state-of-the-art baselines by up to 28.74%.

6.0NEJan 24, 2020
Runtime Performances of Randomized Search Heuristics for the Dynamic Weighted Vertex Cover Problem

Feng Shi, Frank Neumann, Jianxin Wang

Randomized search heuristics such as evolutionary algorithms are frequently applied to dynamic combinatorial optimization problems. Within this paper, we present a dynamic model of the classic Weighted Vertex Cover problem and analyze the runtime performances of the well-studied algorithms Randomized Local Search and (1+1) EA adapted to it, to contribute to the theoretical understanding of evolutionary computing for problems with dynamic changes. In our investigations, we use an edge-based representation based on the dual form of the Linear Programming formulation for the problem and study the expected runtime that the adapted algorithms require to maintain a 2-approximate solution when the given weighted graph is modified by an edge-editing or weight-editing operation. Considering the weights on the vertices may be exponentially large with respect to the size of the graph, the step size adaption strategy is incorporated, with or without the 1/5-th rule that is employed to control the increasing/decreasing rate of the step size. Our results show that three of the four algorithms presented in the paper can recompute 2-approximate solutions for the studied dynamic changes in polynomial expected runtime, but the (1+1) EA with 1/5-th Rule requires pseudo-polynomial expected runtime.