CVJul 28, 2023Code
DocDeshadower: Frequency-Aware Transformer for Document Shadow RemovalZiyang Zhou, Yingtie Lei, Xuhang Chen et al.
Shadows in scanned documents pose significant challenges for document analysis and recognition tasks due to their negative impact on visual quality and readability. Current shadow removal techniques, including traditional methods and deep learning approaches, face limitations in handling varying shadow intensities and preserving document details. To address these issues, we propose DocDeshadower, a novel multi-frequency Transformer-based model built upon the Laplacian Pyramid. By decomposing the shadow image into multiple frequency bands and employing two critical modules: the Attention-Aggregation Network for low-frequency shadow removal and the Gated Multi-scale Fusion Transformer for global refinement. DocDeshadower effectively removes shadows at different scales while preserving document content. Extensive experiments demonstrate DocDeshadower's superior performance compared to state-of-the-art methods, highlighting its potential to significantly improve document shadow removal techniques. The code is available at https://github.com/leiyingtie/DocDeshadower.
CVAug 31, 2024Code
SMAFormer: Synergistic Multi-Attention Transformer for Medical Image SegmentationFuchen Zheng, Xuhang Chen, Weihuang Liu et al.
In medical image segmentation, specialized computer vision techniques, notably transformers grounded in attention mechanisms and residual networks employing skip connections, have been instrumental in advancing performance. Nonetheless, previous models often falter when segmenting small, irregularly shaped tumors. To this end, we introduce SMAFormer, an efficient, Transformer-based architecture that fuses multiple attention mechanisms for enhanced segmentation of small tumors and organs. SMAFormer can capture both local and global features for medical image segmentation. The architecture comprises two pivotal components. First, a Synergistic Multi-Attention (SMA) Transformer block is proposed, which has the benefits of Pixel Attention, Channel Attention, and Spatial Attention for feature enrichment. Second, addressing the challenge of information loss incurred during attention mechanism transitions and feature fusion, we design a Feature Fusion Modulator. This module bolsters the integration between the channel and spatial attention by mitigating reshaping-induced information attrition. To evaluate our method, we conduct extensive experiments on various medical image segmentation tasks, including multi-organ, liver tumor, and bladder tumor segmentation, achieving state-of-the-art results. Code and models are available at: https://github.com/CXH-Research/SMAFormer.
CVAug 20, 2024Code
UIE-UnFold: Deep Unfolding Network with Color Priors and Vision Transformer for Underwater Image EnhancementYingtie Lei, Jia Yu, Yihang Dong et al.
Underwater image enhancement (UIE) plays a crucial role in various marine applications, but it remains challenging due to the complex underwater environment. Current learning-based approaches frequently lack explicit incorporation of prior knowledge about the physical processes involved in underwater image formation, resulting in limited optimization despite their impressive enhancement results. This paper proposes a novel deep unfolding network (DUN) for UIE that integrates color priors and inter-stage feature transformation to improve enhancement performance. The proposed DUN model combines the iterative optimization and reliability of model-based methods with the flexibility and representational power of deep learning, offering a more explainable and stable solution compared to existing learning-based UIE approaches. The proposed model consists of three key components: a Color Prior Guidance Block (CPGB) that establishes a mapping between color channels of degraded and original images, a Nonlinear Activation Gradient Descent Module (NAGDM) that simulates the underwater image degradation process, and an Inter Stage Feature Transformer (ISF-Former) that facilitates feature exchange between different network stages. By explicitly incorporating color priors and modeling the physical characteristics of underwater image formation, the proposed DUN model achieves more accurate and reliable enhancement results. Extensive experiments on multiple underwater image datasets demonstrate the superiority of the proposed model over state-of-the-art methods in both quantitative and qualitative evaluations. The proposed DUN-based approach offers a promising solution for UIE, enabling more accurate and reliable scientific analysis in marine research. The code is available at https://github.com/CXH-Research/UIE-UnFold.
CVOct 31, 2023
UWFormer: Underwater Image Enhancement via a Semi-Supervised Multi-Scale TransformerWeiwen Chen, Yingtie Lei, Shenghong Luo et al.
Underwater images often exhibit poor quality, distorted color balance and low contrast due to the complex and intricate interplay of light, water, and objects. Despite the significant contributions of previous underwater enhancement techniques, there exist several problems that demand further improvement: (i) The current deep learning methods rely on Convolutional Neural Networks (CNNs) that lack the multi-scale enhancement, and global perception field is also limited. (ii) The scarcity of paired real-world underwater datasets poses a significant challenge, and the utilization of synthetic image pairs could lead to overfitting. To address the aforementioned problems, this paper introduces a Multi-scale Transformer-based Network called UWFormer for enhancing images at multiple frequencies via semi-supervised learning, in which we propose a Nonlinear Frequency-aware Attention mechanism and a Multi-Scale Fusion Feed-forward Network for low-frequency enhancement. Besides, we introduce a special underwater semi-supervised training strategy, where we propose a Subaqueous Perceptual Loss function to generate reliable pseudo labels. Experiments using full-reference and non-reference underwater benchmarks demonstrate that our method outperforms state-of-the-art methods in terms of both quantity and visual quality.
CVSep 13, 2023
ShaDocFormer: A Shadow-Attentive Threshold Detector With Cascaded Fusion Refiner for Document Shadow RemovalWeiwen Chen, Yingtie Lei, Shenghong Luo et al.
Document shadow is a common issue that arises when capturing documents using mobile devices, which significantly impacts readability. Current methods encounter various challenges, including inaccurate detection of shadow masks and estimation of illumination. In this paper, we propose ShaDocFormer, a Transformer-based architecture that integrates traditional methodologies and deep learning techniques to tackle the problem of document shadow removal. The ShaDocFormer architecture comprises two components: the Shadow-attentive Threshold Detector (STD) and the Cascaded Fusion Refiner (CFR). The STD module employs a traditional thresholding technique and leverages the attention mechanism of the Transformer to gather global information, thereby enabling precise detection of shadow masks. The cascaded and aggregative structure of the CFR module facilitates a coarse-to-fine restoration process for the entire image. As a result, ShaDocFormer excels in accurately detecting and capturing variations in both shadow and illumination, thereby enabling effective removal of shadows. Extensive experiments demonstrate that ShaDocFormer outperforms current state-of-the-art methods in both qualitative and quantitative measurements.
AIMay 22
SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural SkillsYingtie Lei, Zhongwei Wan, Jiankun Zhang et al.
Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating this step from experience reuse to skill formation. It contains 180 tasks across six real-world agent environments, organized into role-conditioned task families with shared latent procedures. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks testing context shift, adversarial shortcuts, and composition. By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust reusable skills. Skill-based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes, but these gains are unstable under frozen deployment. Raw-trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. Capacity and cost analyses further show that writing more skills or larger Tier-3 resource libraries is not sufficient: additional updates can improve coverage while introducing episode-specific drift and procedural clutter. These findings position SkillEvolBench as a testbed for measuring when one-off experience becomes durable procedural knowledge rather than task-local memory.
LGFeb 23
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action ModelsJingxuan Zhang, Yunta Hsieh, Zhongwei Wan et al.
Vision-language-action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenecks, we introduce QuantVLA, a training-free post-training quantization (PTQ) framework that, to our knowledge, is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head. QuantVLA incorporates three scale-calibrated components: (1) a selective quantization layout that integerizes all linear layers in both the language backbone and the DiT while keeping attention projections in floating point to preserve the original operator schedule; (2) attention temperature matching, a lightweight per-head scaling mechanism that stabilizes attention logits and is folded into the dequantization scales at inference; and (3) output head balancing, a per-layer residual interface calibration that mitigates post-projection energy drift. The framework requires no additional training, uses only a small unlabeled calibration buffer, and supports integer kernels for low-bit weights and activations while leaving the architecture unchanged. Across representative VLA models on LIBERO, QuantVLA exceeds the task success rates of full-precision baselines, achieves about 70% relative memory savings on the quantized components, and delivers a 1.22x speedup in end-to-end inference latency, providing a practical pathway toward scalable low-bit embodied intelligence under strict compute, memory, and power constraints.
CVJan 27
ProMist-5K: A Comprehensive Dataset for Digital Emulation of Cinematic Pro-Mist Filter EffectsYingtie Lei, Zimeng Li, Chi-Man Pun et al.
Pro-Mist filters are widely used in cinematography for their ability to create soft halation, lower contrast, and produce a distinctive, atmospheric style. These effects are difficult to reproduce digitally due to the complex behavior of light diffusion. We present ProMist-5K, a dataset designed to support cinematic style emulation. It is built using a physically inspired pipeline in a scene-referred linear space and includes 20,000 high-resolution image pairs across four configurations, covering two filter densities (1/2 and 1/8) and two focal lengths (20mm and 50mm). Unlike general style datasets, ProMist-5K focuses on realistic glow and highlight diffusion effects. Multiple blur layers and carefully tuned weighting are used to model the varying intensity and spread of optical diffusion. The dataset provides a consistent and controllable target domain that supports various image translation models and learning paradigms. Experiments show that the dataset works well across different training settings and helps capture both subtle and strong cinematic appearances. ProMist-5K offers a practical and physically grounded resource for film-inspired image transformation, bridging the gap between digital flexibility and traditional lens aesthetics. The dataset is available at https://www.kaggle.com/datasets/yingtielei/promist5k.
CVAug 10, 2025Code
CMAMRNet: A Contextual Mask-Aware Network Enhancing Mural Restoration Through Comprehensive Mask GuidanceYingtie Lei, Fanghai Yi, Yihang Dong et al.
Murals, as invaluable cultural artifacts, face continuous deterioration from environmental factors and human activities. Digital restoration of murals faces unique challenges due to their complex degradation patterns and the critical need to preserve artistic authenticity. Existing learning-based methods struggle with maintaining consistent mask guidance throughout their networks, leading to insufficient focus on damaged regions and compromised restoration quality. We propose CMAMRNet, a Contextual Mask-Aware Mural Restoration Network that addresses these limitations through comprehensive mask guidance and multi-scale feature extraction. Our framework introduces two key components: (1) the Mask-Aware Up/Down-Sampler (MAUDS), which ensures consistent mask sensitivity across resolution scales through dedicated channel-wise feature selection and mask-guided feature fusion; and (2) the Co-Feature Aggregator (CFA), operating at both the highest and lowest resolutions to extract complementary features for capturing fine textures and global structures in degraded regions. Experimental results on benchmark datasets demonstrate that CMAMRNet outperforms state-of-the-art methods, effectively preserving both structural integrity and artistic details in restored murals. The code is available at~\href{https://github.com/CXH-Research/CMAMRNet}{https://github.com/CXH-Research/CMAMRNet}.
CVOct 30, 2024
High-Fidelity Document Stain Removal via A Large-Scale Real-World Dataset and A Memory-Augmented TransformerMingxian Li, Hao Sun, Yingtie Lei et al.
Document images are often degraded by various stains, significantly impacting their readability and hindering downstream applications such as document digitization and analysis. The absence of a comprehensive stained document dataset has limited the effectiveness of existing document enhancement methods in removing stains while preserving fine-grained details. To address this challenge, we construct StainDoc, the first large-scale, high-resolution ($2145\times2245$) dataset specifically designed for document stain removal. StainDoc comprises over 5,000 pairs of stained and clean document images across multiple scenes. This dataset encompasses a diverse range of stain types, severities, and document backgrounds, facilitating robust training and evaluation of document stain removal algorithms. Furthermore, we propose StainRestorer, a Transformer-based document stain removal approach. StainRestorer employs a memory-augmented Transformer architecture that captures hierarchical stain representations at part, instance, and semantic levels via the DocMemory module. The Stain Removal Transformer (SRTransformer) leverages these feature representations through a dual attention mechanism: an enhanced spatial attention with an expanded receptive field, and a channel attention captures channel-wise feature importance. This combination enables precise stain removal while preserving document content integrity. Extensive experiments demonstrate StainRestorer's superior performance over state-of-the-art methods on the StainDoc dataset and its variants StainDoc\_Mark and StainDoc\_Seal, establishing a new benchmark for document stain removal. Our work highlights the potential of memory-augmented Transformers for this task and contributes a valuable dataset to advance future research.
CVApr 10, 2024
Implicit Multi-Spectral Transformer: An Lightweight and Effective Visible to Infrared Image Translation ModelYijia Chen, Pinghua Chen, Xiangxin Zhou et al.
In the field of computer vision, visible light images often exhibit low contrast in low-light conditions, presenting a significant challenge. While infrared imagery provides a potential solution, its utilization entails high costs and practical limitations. Recent advancements in deep learning, particularly the deployment of Generative Adversarial Networks (GANs), have facilitated the transformation of visible light images to infrared images. However, these methods often experience unstable training phases and may produce suboptimal outputs. To address these issues, we propose a novel end-to-end Transformer-based model that efficiently converts visible light images into high-fidelity infrared images. Initially, the Texture Mapping Module and Color Perception Adapter collaborate to extract texture and color features from the visible light image. The Dynamic Fusion Aggregation Module subsequently integrates these features. Finally, the transformation into an infrared image is refined through the synergistic action of the Color Perception Adapter and the Enhanced Perception Attention mechanism. Comprehensive benchmarking experiments confirm that our model outperforms existing methods, producing infrared images of markedly superior quality, both qualitatively and quantitatively. Furthermore, the proposed model enables more effective downstream applications for infrared images than other methods.
IVOct 10, 2025
FS-RWKV: Leveraging Frequency Spatial-Aware RWKV for 3T-to-7T MRI TranslationYingtie Lei, Zimeng Li, Chi-Man Pun et al.
Ultra-high-field 7T MRI offers enhanced spatial resolution and tissue contrast that enables the detection of subtle pathological changes in neurological disorders. However, the limited availability of 7T scanners restricts widespread clinical adoption due to substantial infrastructure costs and technical demands. Computational approaches for synthesizing 7T-quality images from accessible 3T acquisitions present a viable solution to this accessibility challenge. Existing CNN approaches suffer from limited spatial coverage, while Transformer models demand excessive computational overhead. RWKV architectures offer an efficient alternative for global feature modeling in medical image synthesis, combining linear computational complexity with strong long-range dependency capture. Building on this foundation, we propose Frequency Spatial-RWKV (FS-RWKV), an RWKV-based framework for 3T-to-7T MRI translation. To better address the challenges of anatomical detail preservation and global tissue contrast recovery, FS-RWKV incorporates two key modules: (1) Frequency-Spatial Omnidirectional Shift (FSO-Shift), which performs discrete wavelet decomposition followed by omnidirectional spatial shifting on the low-frequency branch to enhance global contextual representation while preserving high-frequency anatomical details; and (2) Structural Fidelity Enhancement Block (SFEB), a module that adaptively reinforces anatomical structure through frequency-aware feature fusion. Comprehensive experiments on UNC and BNU datasets demonstrate that FS-RWKV consistently outperforms existing CNN-, Transformer-, GAN-, and RWKV-based baselines across both T1w and T2w modalities, achieving superior anatomical fidelity and perceptual quality.
CVAug 26, 2025
SFormer: SNR-guided Transformer for Underwater Image Enhancement from the Frequency DomainXin Tian, Yingtie Lei, Xiujun Zhang et al.
Recent learning-based underwater image enhancement (UIE) methods have advanced by incorporating physical priors into deep neural networks, particularly using the signal-to-noise ratio (SNR) prior to reduce wavelength-dependent attenuation. However, spatial domain SNR priors have two limitations: (i) they cannot effectively separate cross-channel interference, and (ii) they provide limited help in amplifying informative structures while suppressing noise. To overcome these, we propose using the SNR prior in the frequency domain, decomposing features into amplitude and phase spectra for better channel modulation. We introduce the Fourier Attention SNR-prior Transformer (FAST), combining spectral interactions with SNR cues to highlight key spectral components. Additionally, the Frequency Adaptive Transformer (FAT) bottleneck merges low- and high-frequency branches using a gated attention mechanism to enhance perceptual quality. Embedded in a unified U-shaped architecture, these modules integrate a conventional RGB stream with an SNR-guided branch, forming SFormer. Trained on 4,800 paired images from UIEB, EUVP, and LSUI, SFormer surpasses recent methods with a 3.1 dB gain in PSNR and 0.08 in SSIM, successfully restoring colors, textures, and contrast in underwater scenes.