Jiayi Zhu

CV
h-index29
20papers
138citations
Novelty51%
AI Score58

20 Papers

IVMar 9, 2023Code
Hybrid Dual Mean-Teacher Network With Double-Uncertainty Guidance for Semi-Supervised Segmentation of MRI Scans

Jiayi Zhu, Bart Bolsterlee, Brian V. Y. Chow et al.

Semi-supervised learning has made significant progress in medical image segmentation. However, existing methods primarily utilize information acquired from a single dimensionality (2D/3D), resulting in sub-optimal performance on challenging data, such as magnetic resonance imaging (MRI) scans with multiple objects and highly anisotropic resolution. To address this issue, we present a Hybrid Dual Mean-Teacher (HD-Teacher) model with hybrid, semi-supervised, and multi-task learning to achieve highly effective semi-supervised segmentation. HD-Teacher employs a 2D and a 3D mean-teacher network to produce segmentation labels and signed distance fields from the hybrid information captured in both dimensionalities. This hybrid learning mechanism allows HD-Teacher to combine the `best of both worlds', utilizing features extracted from either 2D, 3D, or both dimensions to produce outputs as it sees fit. Outputs from 2D and 3D teacher models are also dynamically combined, based on their individual uncertainty scores, into a single hybrid prediction, where the hybrid uncertainty is estimated. We then propose a hybrid regularization module to encourage both student models to produce results close to the uncertainty-weighted hybrid prediction. The hybrid uncertainty suppresses unreliable knowledge in the hybrid prediction, leaving only useful information to improve network performance further. Extensive experiments of binary and multi-class segmentation conducted on three MRI datasets demonstrate the effectiveness of the proposed framework. Code is available at https://github.com/ThisGame42/Hybrid-Teacher.

NCMar 31, 2023Code
Temporal Dynamic Synchronous Functional Brain Network for Schizophrenia Diagnosis and Lateralization Analysis

Cheng Zhu, Ying Tan, Shuqi Yang et al.

The available evidence suggests that dynamic functional connectivity (dFC) can capture time-varying abnormalities in brain activity in resting-state cerebral functional magnetic resonance imaging (rs-fMRI) data and has a natural advantage in uncovering mechanisms of abnormal brain activity in schizophrenia(SZ) patients. Hence, an advanced dynamic brain network analysis model called the temporal brain category graph convolutional network (Temporal-BCGCN) was employed. Firstly, a unique dynamic brain network analysis module, DSF-BrainNet, was designed to construct dynamic synchronization features. Subsequently, a revolutionary graph convolution method, TemporalConv, was proposed, based on the synchronous temporal properties of feature. Finally, the first modular abnormal hemispherical lateralization test tool in deep learning based on rs-fMRI data, named CategoryPool, was proposed. This study was validated on COBRE and UCLA datasets and achieved 83.62% and 89.71% average accuracies, respectively, outperforming the baseline model and other state-of-the-art methods. The ablation results also demonstrate the advantages of TemporalConv over the traditional edge feature graph convolution approach and the improvement of CategoryPool over the classical graph pooling approach. Interestingly, this study showed that the lower order perceptual system and higher order network regions in the left hemisphere are more severely dysfunctional than in the right hemisphere in SZ and reaffirms the importance of the left medial superior frontal gyrus in SZ. Our core code is available at: https://github.com/swfen/Temporal-BCGCN.

CVJan 28Code
Advancing Open-source World Models

Robbyant Team, Zelin Gao, Qiuyu Wang et al.

We present LingBot-World, an open-sourced world simulator stemming from video generation. Positioned as a top-tier world model, LingBot-World offers the following features. (1) It maintains high fidelity and robust dynamics in a broad spectrum of environments, including realism, scientific contexts, cartoon styles, and beyond. (2) It enables a minute-level horizon while preserving contextual consistency over time, which is also known as "long-term memory". (3) It supports real-time interactivity, achieving a latency of under 1 second when producing 16 frames per second. We provide public access to the code and model in an effort to narrow the divide between open-source and closed-source technologies. We believe our release will empower the community with practical applications across areas like content creation, gaming, and robot learning.

CVMay 29
Cross-Modal Clinical Knowledge Integration for Mammography Report Generation

Jiayi Zhu, Fuxiang Huang, Yu Xie et al.

Breast cancer is a major global health concern, and mammography screening plays a central role in early detection. The large volume of screening examinations creates a substantial workload for radiologists, making accurate and consistent report generation a critical clinical challenge. Existing automated mammography report generation methods primarily focus on direct visual-to-text mapping, while overlooking the structured clinical reasoning process followed by radiologists in real-world practice. To address this limitation, we propose MammoRG, a mammography report generation framework that explicitly simulates the clinical reporting workflow by following the BI-RADS guideline and incorporating prior clinical knowledge to produce diagnostic reports. Specifically, MammoRG adopts a two-stage training framework. In the first stage, the model learns to integrate clinically relevant prior knowledge from a patient's four-view mammograms through classification-based supervision. In the second stage, a terminology-aware supervised fine-tuning strategy is introduced to model mammography-specific clinical terms as atomic semantic units, enabling the generation of high-quality reports with improved clinical consistency. To facilitate clinical efficacy evaluation of generated reports, we further develop MammoRGTool, a dedicated mammography report parsing tool that extracts structured clinical information from free-text reports. Extensive experiments demonstrate that MammoRG consistently outperforms existing methods across multiple clinical efficacy metrics, particularly in diagnosis-related BI-RADS F1, where it surpasses the second-best model by 2.73%, 2.04%, 1.90%, and 3.27% on the internal, external 1, external 2, and VinDr-Mammo datasets, respectively.

CVAug 23, 2023Code
Edge-aware Hard Clustering Graph Pooling for Brain Imaging

Cheng Zhu, Jiayi Zhu, Xi Wu et al.

Graph Convolutional Networks (GCNs) can capture non-Euclidean spatial dependence between different brain regions. The graph pooling operator, a crucial element of GCNs, enhances the representation learning capability and facilitates the acquisition of abnormal brain maps. However, most existing research designs graph pooling operators solely from the perspective of nodes while disregarding the original edge features. This confines graph pooling application scenarios and diminishes its ability to capture critical substructures. In this paper, we propose a novel edge-aware hard clustering graph pool (EHCPool), which is tailored to dominant edge features and redefines the clustering process. EHCPool initially introduced the 'Edge-to-Node' score criterion which utilized edge information to evaluate the significance of nodes. An innovative Iteration n-top strategy was then developed, guided by edge scores, to adaptively learn sparse hard clustering assignments for graphs. Additionally, a N-E Aggregation strategy is designed to aggregate node and edge features in each independent subgraph. Extensive experiments on the multi-site public datasets demonstrate the superiority and robustness of the proposed model. More notably, EHCPool has the potential to probe different types of dysfunctional brain networks from a data-driven perspective. Method code: https://github.com/swfen/EHCPool

CVApr 10Code
StreamMeCo: Long-Term Agent Memory Compression for Efficient Streaming Video Understanding

Junxi Wang, Te Sun, Jiayi Zhu et al.

Vision agent memory has shown remarkable effectiveness in streaming video understanding. However, storing such memory for videos incurs substantial memory overhead, leading to high costs in both storage and computation. To address this issue, we propose StreamMeCo, an efficient Stream Agent Memory Compression framework. Specifically, based on the connectivity of the memory graph, StreamMeCo introduces edge-free minmax sampling for the isolated nodes and an edge-aware weight pruning for connected nodes, evicting the redundant memory nodes while maintaining the accuracy. In addition, we introduce a time-decay memory retrieval mechanism to further eliminate the performance degradation caused by memory compression. Extensive experiments on three challenging benchmark datasets (M3-Bench-robot, M3-Bench-web and Video-MME-Long) demonstrate that under 70% memory graph compression, StreamMeCo achieves a 1.87* speedup in memory retrieval while delivering an average accuracy improvement of 1.0%. Our code is available at https://github.com/Celina-love-sweet/StreamMeCo.

CVMar 3
ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

Jiayi Zhu, Jianing Zhang, Yiying Yang et al.

This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) A dataset for large-scale multi-agent interactive world modeling is built on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories with paired multi-view videos (front/ rear/ left/ right views per agent) and camera data. 2) We propose a spatial concatenation strategy for four-view videos of independent agents to model a broader environment and to ensure internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model, which enable interactive transmission of spatial-temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse, which supports 49-frame large-scale video generation, accurately perceives the position of dynamic agents and achieves consistent shared world modeling.

IMAug 28, 2023
PolarRec: Radio Interferometric Data Reconstruction with Polar Coordinate Representation

Ruoqi Wang, Zhuoyang Chen, Jiayi Zhu et al.

In radio astronomy, visibility data, which are measurements of wave signals from radio telescopes, are transformed into images for observation of distant celestial objects. However, these resultant images usually contain both real sources and artifacts, due to signal sparsity and other factors. One way to obtain cleaner images is to reconstruct samples into dense forms before imaging. Unfortunately, existing reconstruction methods often miss some components of visibility in frequency domain, so blurred object edges and persistent artifacts remain in the images. Furthermore, the computation overhead is high on irregular visibility samples due to the data skew. To address these problems, we propose PolarRec, a transformer-encoder-conditioned reconstruction pipeline with visibility samples converted into the polar coordinate representation. This representation matches the way in which radio telescopes observe a celestial area as the Earth rotates. As a result, visibility samples distribute in the polar system more uniformly than in the Cartesian space. Therefore, we propose to use radial distance in the loss function, to help reconstruct complete visibility effectively. Also, we group visibility samples by their polar angles and propose a group-based encoding scheme to improve the efficiency. Our experiments demonstrate that PolarRec markedly improves imaging results by faithfully reconstructing all frequency components in the visibility domain while significantly reducing the computation cost in visibility data encoding. We believe this high-quality and high-efficiency imaging of PolarRec will better facilitate astronomers to conduct their research.

CVMay 24, 2023Code
DC-Net: Divide-and-Conquer for Salient Object Detection

Jiayi Zhu, Xuebin Qin, Abdulmotaleb Elsaddik

In this paper, we introduce Divide-and-Conquer into the salient object detection (SOD) task to enable the model to learn prior knowledge that is for predicting the saliency map. We design a novel network, Divide-and-Conquer Network (DC-Net) which uses two encoders to solve different subtasks that are conducive to predicting the final saliency map, here is to predict the edge maps with width 4 and location maps of salient objects and then aggregate the feature maps with different semantic information into the decoder to predict the final saliency map. The decoder of DC-Net consists of our newly designed two-level Residual nested-ASPP (ResASPP$^{2}$) modules, which have the ability to capture a large number of different scale features with a small number of convolution operations and have the advantages of maintaining high resolution all the time and being able to obtain a large and compact effective receptive field (ERF). Based on the advantage of Divide-and-Conquer's parallel computing, we use Parallel Acceleration to speed up DC-Net, allowing it to achieve competitive performance on six LR-SOD and five HR-SOD datasets under high efficiency (60 FPS and 55 FPS). Codes and results are available: https://github.com/PiggyJerry/DC-Net.

CVMar 27, 2024
CosalPure: Learning Concept from Group Images for Robust Co-Saliency Detection

Jiayi Zhu, Qing Guo, Felix Juefei-Xu et al.

Co-salient object detection (CoSOD) aims to identify the common and salient (usually in the foreground) regions across a given group of images. Although achieving significant progress, state-of-the-art CoSODs could be easily affected by some adversarial perturbations, leading to substantial accuracy reduction. The adversarial perturbations can mislead CoSODs but do not change the high-level semantic information (e.g., concept) of the co-salient objects. In this paper, we propose a novel robustness enhancement framework by first learning the concept of the co-salient objects based on the input group images and then leveraging this concept to purify adversarial perturbations, which are subsequently fed to CoSODs for robustness enhancement. Specifically, we propose CosalPure containing two modules, i.e., group-image concept learning and concept-guided diffusion purification. For the first module, we adopt a pre-trained text-to-image diffusion model to learn the concept of co-salient objects within group images where the learned concept is robust to adversarial examples. For the second module, we map the adversarial image to the latent space and then perform diffusion generation by embedding the learned concept into the noise prediction function as an extra condition. Our method can effectively alleviate the influence of the SOTA adversarial attack containing different adversarial patterns, including exposure and noise. The extensive results demonstrate that our method could enhance the robustness of CoSODs significantly.

LGSep 25, 2025
Why Attention Fails: The Degeneration of Transformers into MLPs in Time Series Forecasting

Zida Liang, Jiayi Zhu, Weiqiang Sun

Transformer-based architectures achieved high performance in natural language processing and computer vision, yet many studies have shown that they have not demonstrated a clear advantage in time series forecasting and even underperform simple linear baselines in some cases. However, most of these studies have not thoroughly explored the reasons behind the failure of transformers. To better understand time-series transformers(TST), we designed a series of experiments, progressively modifying transformers into MLPs to investigate the impact of the attention mechanism. Surprisingly, transformer blocks often degenerate into simple MLPs in existing time-series transformers. We designed a interpretable dataset to investigate the reasons behind the failure of the attention mechanism and revealed that the attention mechanism is not working in the expected way. We theoretically analyzed the reasons behind this phenomenon, demonstrating that the current embedding methods fail to allow transformers to function in a well-structured latent space, and further analyzed the deeper underlying causes of the failure of embedding.

CVNov 16, 2025
Beyond Pixels: Semantic-aware Typographic Attack for Geo-Privacy Protection

Jiayi Zhu, Yihao Huang, Yue Cao et al.

Large Visual Language Models (LVLMs) now pose a serious yet overlooked privacy threat, as they can infer a social media user's geolocation directly from shared images, leading to unintended privacy leakage. While adversarial image perturbations provide a potential direction for geo-privacy protection, they require relatively strong distortions to be effective against LVLMs, which noticeably degrade visual quality and diminish an image's value for sharing. To overcome this limitation, we identify typographical attacks as a promising direction for protecting geo-privacy by adding text extension outside the visual content. We further investigate which textual semantics are effective in disrupting geolocation inference and design a two-stage, semantics-aware typographical attack that generates deceptive text to protect user privacy. Extensive experiments across three datasets demonstrate that our approach significantly reduces geolocation prediction accuracy of five state-of-the-art commercial LVLMs, establishing a practical and visually-preserving protection strategy against emerging geo-privacy threats.

CVSep 24, 2025
A Versatile Foundation Model for AI-enabled Mammogram Interpretation

Fuxiang Huang, Jiayi Zhu, Yunfang Yu et al.

Breast cancer is the most commonly diagnosed cancer and the leading cause of cancer-related mortality in women globally. Mammography is essential for the early detection and diagnosis of breast lesions. Despite recent progress in foundation models (FMs) for mammogram analysis, their clinical translation remains constrained by several fundamental limitations, including insufficient diversity in training data, limited model generalizability, and a lack of comprehensive evaluation across clinically relevant tasks. Here, we introduce VersaMammo, a versatile foundation model for mammograms, designed to overcome these limitations. We curated the largest multi-institutional mammogram dataset to date, comprising 706,239 images from 21 sources. To improve generalization, we propose a two-stage pre-training strategy to develop VersaMammo, a mammogram foundation model. First, a teacher model is trained via self-supervised learning to extract transferable features from unlabeled mammograms. Then, supervised learning combined with knowledge distillation transfers both features and clinical knowledge into VersaMammo. To ensure a comprehensive evaluation, we established a benchmark comprising 92 specific tasks, including 68 internal tasks and 24 external validation tasks, spanning 5 major clinical task categories: lesion detection, segmentation, classification, image retrieval, and visual question answering. VersaMammo achieves state-of-the-art performance, ranking first in 50 out of 68 specific internal tasks and 20 out of 24 external validation tasks, with average ranks of 1.5 and 1.2, respectively. These results demonstrate its superior generalization and clinical utility, offering a substantial advancement toward reliable and scalable breast cancer screening and diagnosis.

CVAug 5, 2025
Seeing It Before It Happens: In-Generation NSFW Detection for Diffusion-Based Text-to-Image Models

Fan Yang, Yihao Huang, Jiayi Zhu et al.

Diffusion-based text-to-image (T2I) models enable high-quality image generation but also pose significant risks of misuse, particularly in producing not-safe-for-work (NSFW) content. While prior detection methods have focused on filtering prompts before generation or moderating images afterward, the in-generation phase of diffusion models remains largely unexplored for NSFW detection. In this paper, we introduce In-Generation Detection (IGD), a simple yet effective approach that leverages the predicted noise during the diffusion process as an internal signal to identify NSFW content. This approach is motivated by preliminary findings suggesting that the predicted noise may capture semantic cues that differentiate NSFW from benign prompts, even when the prompts are adversarially crafted. Experiments conducted on seven NSFW categories show that IGD achieves an average detection accuracy of 91.32% over naive and adversarial NSFW prompts, outperforming seven baseline methods.

CVJun 28, 2025
Degradation-Modeled Multipath Diffusion for Tunable Metalens Photography

Jianing Zhang, Jiayi Zhu, Feiyu Ji et al.

Metalenses offer significant potential for ultra-compact computational imaging but face challenges from complex optical degradation and computational restoration difficulties. Existing methods typically rely on precise optical calibration or massive paired datasets, which are non-trivial for real-world imaging systems. Furthermore, a lack of control over the inference process often results in undesirable hallucinated artifacts. We introduce Degradation-Modeled Multipath Diffusion for tunable metalens photography, leveraging powerful natural image priors from pretrained models instead of large datasets. Our framework uses positive, neutral, and negative-prompt paths to balance high-frequency detail generation, structural fidelity, and suppression of metalens-specific degradation, alongside \textit{pseudo} data augmentation. A tunable decoder enables controlled trade-offs between fidelity and perceptual quality. Additionally, a spatially varying degradation-aware attention (SVDA) module adaptively models complex optical and sensor-induced degradation. Finally, we design and build a millimeter-scale MetaCamera for real-world validation. Extensive results show that our approach outperforms state-of-the-art methods, achieving high-fidelity and sharp image reconstruction. More materials: https://dmdiff.github.io/.

GNJun 2, 2025
GenDMR: A dynamic multimodal role-swapping network for identifying risk gene phenotypes

Lina Qin, Cheng Zhu, Chuqi Zhou et al.

Recent studies have shown that integrating multimodal data fusion techniques for imaging and genetic features is beneficial for the etiological analysis and predictive diagnosis of Alzheimer's disease (AD). However, there are several critical flaws in current deep learning methods. Firstly, there has been insufficient discussion and exploration regarding the selection and encoding of genetic information. Secondly, due to the significantly superior classification value of AD imaging features compared to genetic features, many studies in multimodal fusion emphasize the strengths of imaging features, actively mitigating the influence of weaker features, thereby diminishing the learning of the unique value of genetic features. To address this issue, this study proposes the dynamic multimodal role-swapping network (GenDMR). In GenDMR, we develop a novel approach to encode the spatial organization of single nucleotide polymorphisms (SNPs), enhancing the representation of their genomic context. Additionally, to adaptively quantify the disease risk of SNPs and brain region, we propose a multi-instance attention module to enhance model interpretability. Furthermore, we introduce a dominant modality selection module and a contrastive self-distillation module, combining them to achieve a dynamic teacher-student role exchange mechanism based on dominant and auxiliary modalities for bidirectional co-updating of different modal data. Finally, GenDMR achieves state-of-the-art performance on the ADNI public dataset and visualizes attention to different SNPs, focusing on confirming 12 potential high-risk genes related to AD, including the most classic APOE and recently highlighted significant risk genes. This demonstrates GenDMR's interpretable analytical capability in exploring AD genetic features, providing new insights and perspectives for the development of multimodal data fusion techniques.

CVDec 21, 2024
Concept Guided Co-salient Object Detection

Jiayi Zhu, Qing Guo, Felix Juefei-Xu et al.

Co-salient object detection (Co-SOD) aims to identify common salient objects across a group of related images. While recent methods have made notable progress, they typically rely on low-level visual patterns and lack semantic priors, limiting their detection performance. We propose ConceptCoSOD, a concept-guided framework that introduces high-level semantic knowledge to enhance co-saliency detection. By extracting shared text-based concepts from the input image group, ConceptCoSOD provides semantic guidance that anchors the detection process. To further improve concept quality, we analyze the effect of diffusion timesteps and design a resampling strategy that selects more informative steps for learning robust concepts. This semantic prior, combined with the resampling-enhanced representation, enables accurate and consistent segmentation even in challenging visual conditions. Extensive experiments on three benchmark datasets and five corrupted settings demonstrate that ConceptCoSOD significantly outperforms existing methods in both accuracy and generalization.

CVJan 17, 2022
Masked Faces with Faced Masks

Jiayi Zhu, Qing Guo, Felix Juefei-Xu et al.

Modern face recognition systems (FRS) still fall short when the subjects are wearing facial masks, a common theme in the age of respiratory pandemics. An intuitive partial remedy is to add a mask detector to flag any masked faces so that the FRS can act accordingly for those low-confidence masked faces. In this work, we set out to investigate the potential vulnerability of such FRS equipped with a mask detector, on large-scale masked faces, which might trigger a serious risk, e.g., letting a suspect evade the FRS where both facial identity and mask are undetected. As existing face recognizers and mask detectors have high performance in their respective tasks, it is significantly challenging to simultaneously fool them and preserve the transferability of the attack. We formulate the new task as the generation of realistic & adversarial-faced mask and make three main contributions: First, we study the naive Delanunay-based masking method (DM) to simulate the process of wearing a faced mask that is cropped from a template image, which reveals the main challenges of this new task. Second, we further equip the DM with the adversarial noise attack and propose the adversarial noise Delaunay-based masking method (AdvNoise-DM) that can fool the face recognition and mask detection effectively but make the face less natural. Third, we propose the adversarial filtering Delaunay-based masking method denoted as MF2M by employing the adversarial filtering for AdvNoise-DM and obtain more natural faces. With the above efforts, the final version not only leads to significant performance deterioration of the state-of-the-art (SOTA) deep learning-based FRS, but also remains undetected by the SOTA facial mask detector, thus successfully fooling both systems at the same time.

CVJan 16, 2022
ALA: Naturalness-aware Adversarial Lightness Attack

Yihao Huang, Liangru Sun, Qing Guo et al.

Most researchers have tried to enhance the robustness of DNNs by revealing and repairing the vulnerability of DNNs with specialized adversarial examples. Parts of the attack examples have imperceptible perturbations restricted by Lp norm. However, due to their high-frequency property, the adversarial examples can be defended by denoising methods and are hard to realize in the physical world. To avoid the defects, some works have proposed unrestricted attacks to gain better robustness and practicality. It is disappointing that these examples usually look unnatural and can alert the guards. In this paper, we propose Adversarial Lightness Attack (ALA), a white-box unrestricted adversarial attack that focuses on modifying the lightness of the images. The shape and color of the samples, which are crucial to human perception, are barely influenced. To obtain adversarial examples with a high attack success rate, we propose unconstrained enhancement in terms of the light and shade relationship in images. To enhance the naturalness of images, we craft the naturalness-aware regularization according to the range and distribution of light. The effectiveness of ALA is verified on two popular datasets for different tasks (i.e., ImageNet for image classification and Places-365 for scene recognition).

FLDec 17, 2019
Prema: A Tool for Precise Requirements Editing, Modeling and Analysis

Yihao Huang, Jincao Feng, Hanyue Zheng et al.

We present Prema, a tool for Precise Requirement Editing, Modeling and Analysis. It can be used in various fields for describing precise requirements using formal notations and performing rigorous analysis. By parsing the requirements written in formal modeling language, Prema is able to get a model which aptly depicts the requirements. It also provides different rigorous verification and validation techniques to check whether the requirements meet users' expectation and find potential errors. We show that our tool can provide a unified environment for writing and verifying requirements without using tools that are not well inter-related. For experimental demonstration, we use the requirements of the automatic train protection (ATP) system of CASCO signal co. LTD., the largest railway signal control system manufacturer of China. The code of the tool cannot be released here because the project is commercially confidential. However, a demonstration video of the tool is available at https://youtu.be/BX0yv8pRMWs.