Ruijie Yang

CV
h-index15
12papers
119citations
Novelty51%
AI Score43

12 Papers

IVMay 18, 2022Code
Global Contrast Masked Autoencoders Are Powerful Pathological Representation Learners

Hao Quan, Xingyu Li, Weixing Chen et al.

Based on digital pathology slice scanning technology, artificial intelligence algorithms represented by deep learning have achieved remarkable results in the field of computational pathology. Compared to other medical images, pathology images are more difficult to annotate, and thus, there is an extreme lack of available datasets for conducting supervised learning to train robust deep learning models. In this paper, we propose a self-supervised learning (SSL) model, the global contrast-masked autoencoder (GCMAE), which can train the encoder to have the ability to represent local-global features of pathological images, also significantly improve the performance of transfer learning across data sets. In this study, the ability of the GCMAE to learn migratable representations was demonstrated through extensive experiments using a total of three different disease-specific hematoxylin and eosin (HE)-stained pathology datasets: Camelyon16, NCTCRC and BreakHis. In addition, this study designed an effective automated pathology diagnosis process based on the GCMAE for clinical applications. The source code of this paper is publicly available at https://github.com/StarUniversus/gcmae.

LGJul 1, 2023
Common Knowledge Learning for Generating Transferable Adversarial Examples

Ruijie Yang, Yuanfang Guo, Junfu Wang et al.

This paper focuses on an important type of black-box attacks, i.e., transfer-based adversarial attacks, where the adversary generates adversarial examples by a substitute (source) model and utilize them to attack an unseen target model, without knowing its information. Existing methods tend to give unsatisfactory adversarial transferability when the source and target models are from different types of DNN architectures (e.g. ResNet-18 and Swin Transformer). In this paper, we observe that the above phenomenon is induced by the output inconsistency problem. To alleviate this problem while effectively utilizing the existing DNN models, we propose a common knowledge learning (CKL) framework to learn better network weights to generate adversarial examples with better transferability, under fixed network architectures. Specifically, to reduce the model-specific features and obtain better output distributions, we construct a multi-teacher framework, where the knowledge is distilled from different teacher architectures into one student network. By considering that the gradient of input is usually utilized to generated adversarial examples, we impose constraints on the gradients between the student and teacher models, to further alleviate the output inconsistency problem and enhance the adversarial transferability. Extensive experiments demonstrate that our proposed work can significantly improve the adversarial transferability.

CVJul 19, 2024
Multi-modal Relation Distillation for Unified 3D Representation Learning

Huiqun Wang, Yiping Bao, Panwang Pan et al.

Recent advancements in multi-modal pre-training for 3D point clouds have demonstrated promising results by aligning heterogeneous features across 3D shapes and their corresponding 2D images and language descriptions. However, current straightforward solutions often overlook intricate structural relations among samples, potentially limiting the full capabilities of multi-modal learning. To address this issue, we introduce Multi-modal Relation Distillation (MRD), a tri-modal pre-training framework, which is designed to effectively distill reputable large Vision-Language Models (VLM) into 3D backbones. MRD aims to capture both intra-relations within each modality as well as cross-relations between different modalities and produce more discriminative 3D shape representations. Notably, MRD achieves significant improvements in downstream zero-shot classification tasks and cross-modality retrieval tasks, delivering new state-of-the-art performance.

CLApr 2
Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy

Ruijie Yang, Yan Zhu, Peiyao Fu et al.

Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic conditions. Here, we present EndoASR, a domain-adapted ASR system designed for real-time deployment in endoscopic workflows. We develop a two-stage adaptation strategy based on synthetic endoscopy reports, targeting domain-specific language modeling and noise robustness. In retrospective evaluation across six endoscopists, EndoASR substantially improves both transcription accuracy and clinical usability, reducing character error rate (CER) from 20.52% to 14.14% and increasing medical term accuracy (Med ACC) from 54.30% to 87.59%. In a prospective multi-center study spanning five independent endoscopy centers, EndoASR demonstrates consistent generalization under heterogeneous real-world conditions. Compared with the baseline Paraformer model, CER is reduced from 16.20% to 14.97%, while Med ACC is improved from 61.63% to 84.16%, confirming its robustness in practical deployment scenarios. Notably, EndoASR achieves a real-time factor (RTF) of 0.005, significantly faster than Whisper-large-v3 (RTF 0.055), while maintaining a compact model size of 220M parameters, enabling efficient edge deployment. Furthermore, integration with large language models demonstrates that improved ASR quality directly enhances downstream structured information extraction and clinician-AI interaction. These results demonstrate that domain-adapted ASR can serve as a reliable interface for human-AI teaming in gastrointestinal endoscopy, with consistent performance validated across multi-center real-world clinical settings.

CVJul 16, 2024
EndoFinder: Online Image Retrieval for Explainable Colorectal Polyp Diagnosis

Ruijie Yang, Yan Zhu, Peiyao Fu et al.

Determining the necessity of resecting malignant polyps during colonoscopy screen is crucial for patient outcomes, yet challenging due to the time-consuming and costly nature of histopathology examination. While deep learning-based classification models have shown promise in achieving optical biopsy with endoscopic images, they often suffer from a lack of explainability. To overcome this limitation, we introduce EndoFinder, a content-based image retrieval framework to find the 'digital twin' polyp in the reference database given a newly detected polyp. The clinical semantics of the new polyp can be inferred referring to the matched ones. EndoFinder pioneers a polyp-aware image encoder that is pre-trained on a large polyp dataset in a self-supervised way, merging masked image modeling with contrastive learning. This results in a generic embedding space ready for different downstream clinical tasks based on image retrieval. We validate the framework on polyp re-identification and optical biopsy tasks, with extensive experiments demonstrating that EndoFinder not only achieves explainable diagnostics but also matches the performance of supervised classification models. EndoFinder's reliance on image retrieval has the potential to support diverse downstream decision-making tasks during real-time colonoscopy procedures.

CVJun 4, 2024Code
Leveraging Predicate and Triplet Learning for Scene Graph Generation

Jiankai Li, Yunhong Wang, Xiefan Guo et al.

Scene Graph Generation (SGG) aims to identify entities and predict the relationship triplets \textit{\textless subject, predicate, object\textgreater } in visual scenes. Given the prevalence of large visual variations of subject-object pairs even in the same predicate, it can be quite challenging to model and refine predicate representations directly across such pairs, which is however a common strategy adopted by most existing SGG methods. We observe that visual variations within the identical triplet are relatively small and certain relation cues are shared in the same type of triplet, which can potentially facilitate the relation learning in SGG. Moreover, for the long-tail problem widely studied in SGG task, it is also crucial to deal with the limited types and quantity of triplets in tail predicates. Accordingly, in this paper, we propose a Dual-granularity Relation Modeling (DRM) network to leverage fine-grained triplet cues besides the coarse-grained predicate ones. DRM utilizes contexts and semantics of predicate and triplet with Dual-granularity Constraints, generating compact and balanced representations from two perspectives to facilitate relation recognition. Furthermore, a Dual-granularity Knowledge Transfer (DKT) strategy is introduced to transfer variation from head predicates/triplets to tail ones, aiming to enrich the pattern diversity of tail classes to alleviate the long-tail problem. Extensive experiments demonstrate the effectiveness of our method, which establishes new state-of-the-art performance on Visual Genome, Open Image, and GQA datasets. Our code is available at \url{https://github.com/jkli1998/DRM}

CVDec 26, 2024Code
Generating Editable Head Avatars with 3D Gaussian GANs

Guohao Li, Hongyu Yang, Yifang Men et al.

Generating animatable and editable 3D head avatars is essential for various applications in computer vision and graphics. Traditional 3D-aware generative adversarial networks (GANs), often using implicit fields like Neural Radiance Fields (NeRF), achieve photorealistic and view-consistent 3D head synthesis. However, these methods face limitations in deformation flexibility and editability, hindering the creation of lifelike and easily modifiable 3D heads. We propose a novel approach that enhances the editability and animation control of 3D head avatars by incorporating 3D Gaussian Splatting (3DGS) as an explicit 3D representation. This method enables easier illumination control and improved editability. Central to our approach is the Editable Gaussian Head (EG-Head) model, which combines a 3D Morphable Model (3DMM) with texture maps, allowing precise expression control and flexible texture editing for accurate animation while preserving identity. To capture complex non-facial geometries like hair, we use an auxiliary set of 3DGS and tri-plane features. Extensive experiments demonstrate that our approach delivers high-quality 3D-aware synthesis with state-of-the-art controllability. Our code and models are available at https://github.com/liguohao96/EGG3D.

CVApr 19, 2024
AED-PADA:Improving Generalizability of Adversarial Example Detection via Principal Adversarial Domain Adaptation

Heqi Peng, Yunhong Wang, Ruijie Yang et al.

Adversarial example detection, which can be conveniently applied in many scenarios, is important in the area of adversarial defense. Unfortunately, existing detection methods suffer from poor generalization performance, because their training process usually relies on the examples generated from a single known adversarial attack and there exists a large discrepancy between the training and unseen testing adversarial examples. To address this issue, we propose a novel method, named Adversarial Example Detection via Principal Adversarial Domain Adaptation (AED-PADA). Specifically, our approach identifies the Principal Adversarial Domains (PADs), i.e., a combination of features of the adversarial examples generated by different attacks, which possesses a large portion of the entire adversarial feature space. Subsequently, we pioneer to exploit Multi-source Unsupervised Domain Adaptation in adversarial example detection, with PADs as the source domains. Experimental results demonstrate the superior generalization ability of our proposed AED-PADA. Note that this superiority is particularly achieved in challenging scenarios characterized by employing the minimal magnitude constraint for the perturbations.

CVMay 14, 2025
Endo-CLIP: Progressive Self-Supervised Pre-training on Raw Colonoscopy Records

Yili He, Yan Zhu, Peiyao Fu et al.

Pre-training on image-text colonoscopy records offers substantial potential for improving endoscopic image analysis, but faces challenges including non-informative background images, complex medical terminology, and ambiguous multi-lesion descriptions. We introduce Endo-CLIP, a novel self-supervised framework that enhances Contrastive Language-Image Pre-training (CLIP) for this domain. Endo-CLIP's three-stage framework--cleansing, attunement, and unification--addresses these challenges by (1) removing background frames, (2) leveraging large language models to extract clinical attributes for fine-grained contrastive learning, and (3) employing patient-level cross-attention to resolve multi-polyp ambiguities. Extensive experiments demonstrate that Endo-CLIP significantly outperforms state-of-the-art pre-training methods in zero-shot and few-shot polyp detection and classification, paving the way for more accurate and clinically relevant endoscopic analysis.

CVAug 25, 2021
iDARTS: Improving DARTS by Node Normalization and Decorrelation Discretization

Huiqun Wang, Ruijie Yang, Di Huang et al.

Differentiable ARchiTecture Search (DARTS) uses a continuous relaxation of network representation and dramatically accelerates Neural Architecture Search (NAS) by almost thousands of times in GPU-day. However, the searching process of DARTS is unstable, which suffers severe degradation when training epochs become large, thus limiting its application. In this paper, we claim that this degradation issue is caused by the imbalanced norms between different nodes and the highly correlated outputs from various operations. We then propose an improved version of DARTS, namely iDARTS, to deal with the two problems. In the training phase, it introduces node normalization to maintain the norm balance. In the discretization phase, the continuous architecture is approximated based on the similarity between the outputs of the node and the decorrelated operations rather than the values of the architecture parameters. Extensive evaluation is conducted on CIFAR-10 and ImageNet, and the error rates of 2.25\% and 24.7\% are reported within 0.2 and 1.9 GPU-day for architecture search respectively, which shows its effectiveness. Additional analysis also reveals that iDARTS has the advantage in robustness and generalization over other DARTS-based counterparts.

CVAug 16, 2021
Exploring Transferable and Robust Adversarial Perturbation Generation from the Perspective of Network Hierarchy

Ruikui Wang, Yuanfang Guo, Ruijie Yang et al.

The transferability and robustness of adversarial examples are two practical yet important properties for black-box adversarial attacks. In this paper, we explore effective mechanisms to boost both of them from the perspective of network hierarchy, where a typical network can be hierarchically divided into output stage, intermediate stage and input stage. Since over-specialization of source model, we can hardly improve the transferability and robustness of the adversarial perturbations in the output stage. Therefore, we focus on the intermediate and input stages in this paper and propose a transferable and robust adversarial perturbation generation (TRAP) method. Specifically, we propose the dynamically guided mechanism to continuously calculate accurate directional guidances for perturbation generation in the intermediate stage. In the input stage, instead of the single-form transformation augmentations adopted in the existing methods, we leverage multiform affine transformation augmentations to further enrich the input diversity and boost the robustness and transferability of the adversarial perturbations. Extensive experiments demonstrate that our TRAP achieves impressive transferability and high robustness against certain interferences.

CVMay 1, 2021
A Perceptual Distortion Reduction Framework: Towards Generating Adversarial Examples with High Perceptual Quality and Attack Success Rate

Ruijie Yang, Yunhong Wang, Ruikui Wang et al.

Most of the adversarial attack methods suffer from large perceptual distortions such as visible artifacts, when the attack strength is relatively high. These perceptual distortions contain a certain portion which contributes less to the attack success rate. This portion of distortions, which is induced by unnecessary modifications and lack of proper perceptual distortion constraint, is the target of the proposed framework. In this paper, we propose a perceptual distortion reduction framework to tackle this problem from two perspectives. Firstly, we propose a perceptual distortion constraint and add it into the objective function to jointly optimize the perceptual distortions and attack success rate. Secondly, we propose an adaptive penalty factor $λ$ to balance the discrepancies between different samples. Since SGD and Momentum-SGD cannot optimize our complex non-convex problem, we exploit Adam in optimization. Extensive experiments have verified the superiority of our proposed framework.