Ziyi Li

CV
h-index: 14
8 papers
136 citations
Novelty: 52%
AI Score: 57

8 Papers

CV · Feb 13, 2023 · Code
CEDNet: A Cascade Encoder-Decoder Network for Dense Prediction

Gang Zhang, Ziyi Li, Chufeng Tang et al. · tsinghua

Multi-scale features are essential for dense prediction tasks, such as object detection, instance segmentation, and semantic segmentation. The prevailing methods usually utilize a classification backbone to extract multi-scale features and then fuse these features using a lightweight module (e.g., the fusion module in FPN and BiFPN, two typical object detection methods). However, as these methods allocate most computational resources to the classification backbone, the multi-scale feature fusion in these methods is delayed, which may lead to inadequate feature fusion. While some methods perform feature fusion from early stages, they either fail to fully leverage high-level features to guide low-level feature learning or have complex structures, resulting in sub-optimal performance. We propose a streamlined cascade encoder-decoder network, dubbed CEDNet, tailored for dense prediction tasks. All stages in CEDNet share the same encoder-decoder structure and perform multi-scale feature fusion within the decoder. A hallmark of CEDNet is its ability to incorporate high-level features from early stages to guide low-level feature learning in subsequent stages, thereby enhancing the effectiveness of multi-scale feature fusion. We explored three well-known encoder-decoder structures: Hourglass, UNet, and FPN. When integrated into CEDNet, they performed much better than traditional methods that use a pre-designed classification backbone combined with a lightweight fusion module. Extensive experiments on object detection, instance segmentation, and semantic segmentation demonstrated the effectiveness of our method. The code is available at https://github.com/zhanggang001/CEDNet.

CV · Jan 12, 2023
Open-vocabulary Object Segmentation with Diffusion Models

Ziyi Li, Qinye Zhou, Xiaoyun Zhang et al.

The goal of this paper is to extract the visual-language correspondence from a pre-trained text-to-image diffusion model in the form of a segmentation map, i.e., to simultaneously generate images and segmentation masks for the visual entities described in the text prompt. We make the following contributions: (i) we pair the existing Stable Diffusion model with a novel grounding module that can be trained to align the visual and textual embedding spaces of the diffusion model using only a small number of object categories; (ii) we establish an automatic pipeline for constructing a dataset of {image, segmentation mask, text prompt} triplets to train the proposed grounding module; (iii) we evaluate open-vocabulary grounding on images generated by the text-to-image diffusion model and show that the module segments objects of categories beyond those seen at training time well; (iv) we adopt the augmented diffusion model to build a synthetic semantic segmentation dataset and show that a standard segmentation model trained on this dataset achieves competitive performance on the zero-shot segmentation (ZS3) benchmark, which opens up new opportunities for adopting powerful diffusion models for discriminative tasks.

CV · Oct 7, 2022
A Simple Plugin for Transforming Images to Arbitrary Scales

Qinye Zhou, Ziyi Li, Weidi Xie et al.

Existing super-resolution models are often specialized for one scale, fundamentally limiting their use in practical scenarios. In this paper, we aim to develop a general plugin that can be inserted into existing super-resolution models, conveniently augmenting their ability towards Arbitrary Resolution Image Scaling, thus termed ARIS. We make the following contributions: (i) we propose a transformer-based plugin module, which uses spatial coordinates as queries, iteratively attends to the low-resolution image features through cross-attention, and outputs visual features for the queried spatial locations, resembling an implicit representation for images; (ii) we introduce a novel self-supervised training scheme that exploits consistency constraints to effectively augment the model's ability to upsample images to unseen scales, i.e., scales for which ground-truth high-resolution images are not available; (iii) without loss of generality, we inject the proposed ARIS plugin module into several existing models, namely IPT, SwinIR, and HAT, showing that the resulting models not only maintain their original performance on fixed scale factors but also extrapolate to unseen scales, substantially outperforming existing any-scale super-resolution models on standard benchmarks, e.g., Urban100 and DIV2K.

NI · Feb 6 · Code
GraFSTNet: Graph-based Frequency SpatioTemporal Network for Cellular Traffic Prediction

Ziyi Li, Hui Ma, Fei Xing et al.

With the rapid expansion of cellular networks and the proliferation of mobile devices, cellular traffic data exhibits complex temporal dynamics and spatial correlations, posing challenges to accurate traffic prediction. Previous methods often focus predominantly on temporal modeling or depend on predefined spatial topologies, which limits their ability to jointly model spatio-temporal dependencies and effectively capture periodic patterns in cellular traffic. To address these issues, we propose a cellular traffic prediction framework that integrates spatio-temporal modeling with time-frequency analysis. First, we construct a spatial modeling branch to capture inter-cell dependencies through an attention mechanism, minimizing the reliance on predefined topological structures. Second, we build a time-frequency modeling branch to enhance the representation of periodic patterns. Furthermore, we introduce an adaptive-scale LogCosh loss function, which adjusts the error penalty based on traffic magnitude, preventing large errors from dominating the training process and helping the model maintain relatively stable prediction accuracy across different traffic intensities. Experiments on three open-source datasets demonstrate that the proposed method achieves prediction performance superior to state-of-the-art approaches.
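As a rough illustration of the loss idea mentioned in this abstract, the sketch below computes a numerically stable log-cosh and divides each error by a per-sample scale before applying it. The abstract does not give the actual formula, so the scale `abs(target) + eps`, the function names, and the `eps` parameter are all assumptions made for this example rather than the authors' definition.

```python
import math

def log_cosh(x):
    """Numerically stable log(cosh(x)), valid for large |x| as well."""
    a = abs(x)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

def scaled_log_cosh_loss(pred, target, eps=1.0):
    """Mean log-cosh of errors divided by a per-sample scale.

    The scale (|target| + eps) is a hypothetical choice for illustration,
    not the formula from the paper.
    """
    total = 0.0
    for p, t in zip(pred, target):
        scale = abs(t) + eps  # assumed magnitude-dependent scale
        total += log_cosh((p - t) / scale)
    return total / len(pred)

# Same absolute error (10 units) at a high-traffic and a low-traffic cell.
high_traffic = scaled_log_cosh_loss([1010.0], [1000.0])
low_traffic = scaled_log_cosh_loss([20.0], [10.0])
```

Under this assumed scaling, an error of the same absolute size is penalized less at a high-traffic cell than at a low-traffic one, which is one way a loss could keep large-magnitude samples from dominating training.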

IV · Nov 14, 2025
Boosting Neural Video Representation via Online Structural Reparameterization

Ziyi Li, Qingyu Mao, Shuai Liu et al.

Neural Video Representation (NVR) is a promising paradigm for video compression, showing great potential in improving video storage and transmission efficiency. While recent advances have focused on architectural refinements to improve representational capability, these methods typically involve complex designs, which may incur increased computational overhead and lack the flexibility to integrate into other frameworks. Moreover, the inherent limitation in model capacity restricts the expressiveness of NVR networks, resulting in a performance bottleneck. To overcome these limitations, we propose Online-RepNeRV, an NVR framework based on online structural reparameterization. Specifically, we propose a universal reparameterization block named ERB, which incorporates multiple parallel convolutional paths to enhance the model capacity. To mitigate the overhead, an online reparameterization strategy is adopted to dynamically fuse the parameters during training, and the multi-branch structure is equivalently converted into a single-branch structure after training. As a result, the additional computational and parameter complexity is confined to the encoding stage, without affecting the decoding efficiency. Extensive experiments on mainstream video datasets demonstrate that our method achieves an average PSNR gain of 0.37-2.7 dB over baseline methods, while maintaining comparable training time and decoding speed.
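The general principle behind structural reparameterization, that parallel linear convolution branches can be folded into one equivalent kernel, can be sketched in a few lines. The example below is a generic 1-D illustration with two branches (kernel sizes 3 and 1); it is not the ERB block described above, and all function names and values are made up for the example.

```python
# Illustrative sketch only: two parallel 1-D convolution branches (kernel
# sizes 3 and 1) merged into one size-3 kernel. Not the paper's ERB block.

def conv1d_same(signal, kernel):
    """Zero-padded 'same' cross-correlation with an odd-length kernel."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
        for i in range(len(signal))
    ]

def merge_branches(kernel3, kernel1):
    """Fold a size-1 kernel into the centre tap of a size-3 kernel."""
    merged = list(kernel3)
    merged[1] += kernel1[0]
    return merged

def multi_branch(signal, kernel3, kernel1):
    """Multi-branch form: sum of the two parallel branch outputs."""
    out3 = conv1d_same(signal, kernel3)
    out1 = conv1d_same(signal, kernel1)
    return [a + b for a, b in zip(out3, out1)]

signal = [1.0, 2.0, -1.0, 0.5, 3.0]
k3 = [0.2, -0.5, 0.1]
k1 = [0.7]

branch_out = multi_branch(signal, k3, k1)
fused_out = conv1d_same(signal, merge_branches(k3, k1))
# Largest difference between the two forms; only floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(branch_out, fused_out))
```

Because convolution is linear, the fused single-branch kernel reproduces the multi-branch output up to rounding, which is why the extra branches need not add cost once they are merged.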

NI · Apr 28
Probing for Better Age of Information in Energy-Harvesting Random Access Networks

Ziyi Li, Fangming Zhao, Howard H. Yang

In this paper, we investigate the impact of channel probing and reservation on the Age of Information (AoI) in energy-harvesting (EH) random access networks, where each source relies solely on harvested energy for status updating. To mitigate collisions, each node may expend a small amount of energy to send a probing signal before transmission, and a successful probe reserves the channel in the current slot. If probing fails, the node can remain silent, termed strict avoid free competition (SAFC), attempt data transmission with a certain probability, termed reserved nodes competition (RUC), or adopt all-active nodes competition (AUC), where all energy-sufficient nodes may contend regardless of whether they probed. We derive closed-form expressions for the network-average AoI under these three schemes and validate them via simulations. The results show that AUC consistently achieves the lowest AoI by shortening the waiting time to convert harvested energy into successful updates. This finding challenges the conventional wisdom that strict collision avoidance is always optimal in energy-constrained systems, since allowing additional contention can effectively amortize probing overhead across more transmission opportunities. Comparisons with EH-enabled slotted ALOHA further show that probing-based access significantly outperforms direct transmission in energy-constrained regimes, highlighting channel probing as an effective approach to improving freshness.

CR · Apr 20
Sark: Oblivious Integrity Without Global State

Alex Lynham, David Alesch, Ziyi Li et al.

In this paper, we introduce Sark, a reference architecture for transferring unforgeable, stateful, oblivious (USO) assets. We describe the motivation, design, and implementation of the core subsystems of Sark: Porters, which accumulate and roll up commitments from Clients, and Sloop, a permissioned, crash fault-tolerant (CFT) blockchain system. We analyse the operation of the system using the "CIA Triad": Confidentiality, Availability, and Integrity. We then introduce the concept of local centrality and use it to address design trade-offs related to decentralization. Finally, we point to future work on Byzantine fault-tolerance (BFT) and on mitigating the local centrality of Porters.

CV · Mar 5 · Code
Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model

Yulong Shi, Shijie Li, Ziyi Li et al.

Source Free Unsupervised Domain Adaptation (SFUDA) is critical for deploying deep learning models across diverse clinical settings. However, existing methods are typically designed for low-gap, specific domain shifts and cannot generalize into a unified, multi-modality, and multi-target framework, which presents a major barrier to real-world application. To overcome this issue, we introduce Tell2Adapt, a novel SFUDA framework that harnesses the vast, generalizable knowledge of the Vision Foundation Model (VFM). Our approach ensures high-fidelity VFM prompts through Context-Aware Prompts Regularization (CAPR), which robustly translates varied text prompts into canonical instructions. This enables the generation of high-quality pseudo-labels for efficiently adapting the lightweight student model to the target domain. To guarantee clinical reliability, the framework incorporates Visual Plausibility Refinement (VPR), which leverages the VFM's anatomical knowledge to re-ground the adapted model's predictions in the target image's low-level visual features, effectively removing noise and false positives. We conduct one of the most extensive SFUDA evaluations to date, validating our framework across 10 domain adaptation directions and 22 anatomical targets, including brain, cardiac, polyp, and abdominal targets. Our results demonstrate that Tell2Adapt consistently outperforms existing approaches, achieving SOTA for a unified SFUDA framework in medical image segmentation. Code is available at https://github.com/derekshiii/Tell2Adapt.