CVJul 15, 2023
Spatial-Spectral Hyperspectral Classification based on Learnable 3D Group ConvolutionGuandong Li, Mengxia Ye
Deep neural networks have faced many problems in hyperspectral image classification, including the ineffective utilization of spectral-spatial joint information and the problems of gradient vanishing and overfitting that arise with increasing depth. In order to accelerate the deployment of models on edge devices with strict latency requirements and limited computing power, this paper proposes a learnable group convolution network (LGCNet) based on an improved 3D-DenseNet model and a lightweight model design. The LGCNet module improves the shortcomings of group convolution by introducing a dynamic learning method for the input channels and convolution kernel grouping, enabling flexible grouping structures and generating better representation ability. Through the overall loss and gradient of the backpropagation network, the 3D group convolution is dynamically determined and updated in an end-to-end manner. The learnable number of channels and corresponding grouping can capture different complementary visual features of input images, allowing the CNN to learn richer feature representations. When extracting high-dimensional and redundant hyperspectral data, the 3D convolution kernels also contain a large amount of redundant information. The LGC module allows the 3D-DenseNet to choose channel information with more semantic features, and is very efficient, making it suitable for embedding in any deep neural network for acceleration and efficiency improvements. LGC enables the 3D-CNN to achieve sufficient feature extraction while also meeting speed and computing requirements. Furthermore, LGCNet has achieved progress in inference speed and accuracy, and outperforms mainstream hyperspectral image classification methods on the Indian Pines, Pavia University, and KSC datasets.
45.4CVMar 23Code
AdaEdit: Adaptive Temporal and Channel Modulation for Flow-Based Image EditingGuandong Li, Zhaobin Chu
Inversion-based image editing in flow matching models has emerged as a powerful paradigm for training-free, text-guided image manipulation. A central challenge in this paradigm is the injection dilemma: injecting source features during denoising preserves the background of the original image but simultaneously suppresses the model's ability to synthesize edited content. Existing methods address this with fixed injection strategies -- binary on/off temporal schedules, uniform spatial mixing ratios, and channel-agnostic latent perturbation -- that ignore the inherently heterogeneous nature of injection demand across both the temporal and channel dimensions. In this paper, we present AdaEdit, a training-free adaptive editing framework that resolves this dilemma through two complementary innovations. First, we propose a Progressive Injection Schedule that replaces hard binary cutoffs with continuous decay functions (sigmoid, cosine, or linear), enabling a smooth transition from source-feature preservation to target-feature generation and eliminating feature discontinuity artifacts. Second, we introduce Channel-Selective Latent Perturbation, which estimates per-channel importance based on the distributional gap between the inverted and random latents and applies differentiated perturbation strengths accordingly -- strongly perturbing edit-relevant channels while preserving structure-encoding channels. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing types) demonstrate that AdaEdit achieves an 8.7% reduction in LPIPS, a 2.6% improvement in SSIM, and a 2.3% improvement in PSNR over strong baselines, while maintaining competitive CLIP similarity. AdaEdit is fully plug-and-play and compatible with multiple ODE solvers including Euler, RF-Solver, and FireFlow. Code is available at https://github.com/leeguandong/AdaEdit
CVJul 13, 2023
DGCNet: An Efficient 3D-Densenet based on Dynamic Group Convolution for Hyperspectral Remote Sensing Image ClassificationGuandong Li
Deep neural networks face many problems in the field of hyperspectral image classification, lack of effective utilization of spatial spectral information, gradient disappearance and overfitting as the model depth increases. In order to accelerate the deployment of the model on edge devices with strict latency requirements and limited computing power, we introduce a lightweight model based on the improved 3D-Densenet model and designs DGCNet. It improves the disadvantage of group convolution. Referring to the idea of dynamic network, dynamic group convolution(DGC) is designed on 3d convolution kernel. DGC introduces small feature selectors for each grouping to dynamically decide which part of the input channel to connect based on the activations of all input channels. Multiple groups can capture different and complementary visual and semantic features of input images, allowing convolution neural network(CNN) to learn rich features. 3D convolution extracts high-dimensional and redundant hyperspectral data, and there is also a lot of redundant information between convolution kernels. DGC module allows 3D-Densenet to select channel information with richer semantic features and discard inactive regions. The 3D-CNN passing through the DGC module can be regarded as a pruned network. DGC not only allows 3D-CNN to complete sufficient feature extraction, but also takes into account the requirements of speed and calculation amount. The inference speed and accuracy have been improved, with outstanding performance on the IN, Pavia and KSC datasets, ahead of the mainstream hyperspectral image classification methods.
30.3CVMay 2
AttnRouter: Per-Category Attention Routing for Training-Free Image Editing on MMDiTGuandong Li, Mengxia Ye
We study training-free image editing on Qwen-Image-Edit-2511, a 60-block multi-modal diffusion transformer (MMDiT) that concatenates noise and source-image tokens within a single attention stream. We make three contributions. (i) We introduce KVInject, a single-forward attention manipulation that alpha-blends source-half key/value projections into the noise-half within a localized layer/step band. KVInject is simpler than the classical two-pass MasaCtrl recipe and avoids the prompt-mismatch failure mode that disables MasaCtrl on MMDiT (composite score drops 31% versus baseline). (ii) We show that no single attention operation dominates across edit types, motivating AttnRouter, a per-category routing table that dispatches edits to the operation that best preserves source structure for that type. With ground-truth categories the router improves the CLIP-T+DINO-I composite by 6.4% over the editing baseline; an automatic CLIP zero-shot classifier closes 98% of this gap despite only 55% category accuracy. (iii) Through layer-, step-, and alpha-band ablations we localize the editing-effective attention sub-circuit: K/V injection in early denoising steps (S0-7) recovers nearly all of the gain of full-step injection, while injection in early (L0-15) or late (L45-60) layer bands fails to drive editing entirely; alpha in [0.3, 0.5] is a stable sweet spot. We also report negative results that highlight what does not transfer from the UNet folklore: simple K/V rescaling never beats baseline and aggressive variants collapse generation entirely (composite 0.084). We release code, pre-computed routing tables, and a 100-sample stratified subset of ImgEdit-Bench used in all ablations.
49.4CVApr 19
Edit Fidelity Field: Semantics-Aware Region Isolation for Training-Free Scene Text EditingGuandong Li, Mengxia Ye
Scene text editing (STE) has achieved remarkable progress in accurately rendering target text through diffusion-based methods. However, we identify a critical yet overlooked problem: edit spillover -- when editing a target text region, existing methods inadvertently modify non-target regions, particularly neighboring text. Through systematic evaluation on 50 real-world scenes across four categories, we reveal that state-of-the-art diffusion editing models exhibit a spillover rate of 94%, meaning nearly all non-target text regions are altered during editing. To address this, we propose the Edit Fidelity Field (EFF), a semantics-aware continuous field that controls per-pixel editing fidelity. Unlike binary masks, EFF leverages OCR-detected text regions to construct a four-zone field: Edit Core (fully editable), Transition Zone (smooth decay), Protected Zone (non-target text, explicitly locked), and Background (strictly preserved). EFF operates as a training-free, model-agnostic post-processing module applicable to any diffusion-based STE method. We further propose per-region spillover quantification, a novel evaluation protocol that measures edit leakage at each non-target text region individually. Experiments demonstrate that EFF reduces spillover rate from 94% to 25% while improving non-target region preservation by +91.4 dB PSNR.
CVSep 15, 2024
E-Commerce Inpainting with Mask Guidance in Controlnet for Reducing OvercompletionGuandong Li
E-commerce image generation has always been one of the core demands in the e-commerce field. The goal is to restore the missing background that matches the main product given. In the post-AIGC era, diffusion models are primarily used to generate product images, achieving impressive results. This paper systematically analyzes and addresses a core pain point in diffusion model generation: overcompletion, which refers to the difficulty in maintaining product features. We propose two solutions: 1. Using an instance mask fine-tuned inpainting model to mitigate this phenomenon; 2. Adopting a train-free mask guidance approach, which incorporates refined product masks as constraints when combining ControlNet and UNet to generate the main product, thereby avoiding overcompletion of the product. Our method has achieved promising results in practical applications and we hope it can serve as an inspiring technical report in this field.
40.3CVApr 13
LayerCache: Exploiting Layer-wise Velocity Heterogeneity for Efficient Flow Matching InferenceGuandong Li
Flow Matching models achieve state-of-the-art image generation quality but incur substantial inference cost due to iterative denoising through large Transformer networks. We observe that different layer groups within a Transformer exhibit markedly heterogeneous velocity dynamics: shallow layers are highly stable and amenable to aggressive caching, while deep layers undergo large velocity changes that demand full computation. Existing caching methods, however, treat the entire Transformer as a monolithic unit, applying a single caching decision per timestep and thus failing to exploit this heterogeneity. Based on this finding, we propose LayerCache, a layer-aware caching framework that partitions the Transformer into layer groups and makes independent, per-group caching decisions at each denoising step. LayerCache introduces an adaptive JVP span K selection mechanism that leverages per-group stability measurements to balance estimation accuracy and computational savings. We formulate a three-dimensional scheduling problem over timesteps, layer groups, and JVP span, and solve it with a greedy budget allocation algorithm. On Qwen-Image (1024x1024, 50 steps), LayerCache achieves PSNR 37.46 dB (+5.38 dB over MeanCache), SSIM 0.9834, and LPIPS 0.0178 (a 70% reduction over MeanCache) at 1.37x speedup, dominating all prior caching methods on the quality-speed Pareto frontier.
CVSep 7, 2024
Training-Free Style Consistent Image Synthesis with Condition and Mask Guidance in E-CommerceGuandong Li
Generating style-consistent images is a common task in the e-commerce field, and current methods are largely based on diffusion models, which have achieved excellent results. This paper introduces the concept of the QKV (query/key/value) level, referring to modifications in the attention maps (self-attention and cross-attention) when integrating UNet with image conditions. Without disrupting the product's main composition in e-commerce images, we aim to use a train-free method guided by pre-set conditions. This involves using shared KV to enhance similarity in cross-attention and generating mask guidance from the attention map to cleverly direct the generation of style-consistent images. Our method has shown promising results in practical applications.
12.0CVMar 18
Edit Spillover as a Probe: Do Image Editing Models Implicitly Understand World Relations?Guandong Li, Zhaobin Chu
Instruction-following image editing models are expected to modify only the specified region while keeping the rest of the image unchanged. However, in practice, we observe a pervasive phenomenon -- edit spillover: models alter semantically related but unspecified content outside the edit region. This raises a fundamental question -- does spillover reflect genuine implicit world understanding, or is it merely attention leakage? We propose EditSpilloverProbe, a systematic framework that repurposes edit spillover as a natural probe for world knowledge in image editing models. We introduce a spillover taxonomy (spatial, semantic, mixed, random), an automated detection-and-classification pipeline, and a benchmark dataset constructed from real-world Chinese text editing tasks, EditSpilloverBench. Systematic evaluation of 5 representative editing models reveals three core findings: (1) spillover rates vary dramatically across architectures, from 3.49% to 11.46%, with a 3.3x ratio; (2) absolute semantic spillover quantity reveals models' world understanding capability -- nano_banana produces the most semantic spillover (27.8 per image), while qwen_2511 has the most precise editing control but lower semantic spillover (16.3 per image), revealing a trade-off between editing control and world understanding; (3) spatial decay analysis shows spillover area density decays exponentially with distance, but the proportion of semantically relevant spillover remains constant (40%-58%), providing direct evidence that semantic spillover reflects genuine world understanding rather than spatial diffusion.
39.3CVMay 1
PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal ReasoningGuandong Li, Mengxia Ye
Image editing instructions are heterogeneous: a color swap, an object insertion, and a physical-action edit all demand different spatial coverage and different reasoning depth, yet existing reasoning-based editors apply a single fixed inference recipe to every instruction. We argue that adaptivity along both the spatial and temporal axes is the missing degree of freedom, and we present PhysEdit, an editing framework built around this principle. PhysEdit introduces two inference-time modules that compose without retraining the backbone. At its core, (1) Complexity-Adaptive Reasoning Depth (CARD) predicts edit complexity directly from the instruction and reference image and allocates the reasoning step count N_r and reasoning-token length r per sample -- turning a previously fixed inference schedule into a conditional-computation problem. CARD is supported by (2) a Spatial Reasoning Mask (SRM) that extracts an instruction-conditioned spatial prior from cross-attention to confine reasoning to regions that semantically require it. On the full 737-case ImgEdit Basic-Edit Suite, PhysEdit delivers a 1.18x wall-clock speedup (64.3s vs. 76.1s per sample) over a strong reasoning baseline while slightly improving instruction adherence (CLIP-T 0.2283 vs. 0.2266, +0.7%) and matching identity preservation within noise (CLIP-I 0.8246 vs. 0.8280). The speedup is category-dependent and reaches 1.52x on appearance-level edits, validating CARD's adaptive allocation as the principal source of efficiency gain. A 30-sample pilot with full ablations isolates the contribution of each module.
CVDec 22, 2025
DVI: Disentangling Semantic and Visual Identity for Training-Free Personalized GenerationGuandong Li, Yijun Ding
Recent tuning-free identity customization methods achieve high facial fidelity but often overlook visual context, such as lighting, skin texture, and environmental tone. This limitation leads to ``Semantic-Visual Dissonance,'' where accurate facial geometry clashes with the input's unique atmosphere, causing an unnatural ``sticker-like'' effect. We propose **DVI (Disentangled Visual-Identity)**, a zero-shot framework that orthogonally disentangles identity into fine-grained semantic and coarse-grained visual streams. Unlike methods relying solely on semantic vectors, DVI exploits the inherent statistical properties of the VAE latent space, utilizing mean and variance as lightweight descriptors for global visual atmosphere. We introduce a **Parameter-Free Feature Modulation** mechanism that adaptively modulates semantic embeddings with these visual statistics, effectively injecting the reference's ``visual soul'' without training. Furthermore, a **Dynamic Temporal Granularity Scheduler** aligns with the diffusion process, prioritizing visual atmosphere in early denoising stages while refining semantic details later. Extensive experiments demonstrate that DVI significantly enhances visual consistency and atmospheric fidelity without parameter fine-tuning, maintaining robust identity preservation and outperforming state-of-the-art methods in IBench evaluations.
CVMar 30, 2025
Efficient Dynamic Attention 3D Convolution for Hyperspectral Image ClassificationGuandong Li, Mengxia Ye
Deep neural networks face several challenges in hyperspectral image classification, including insufficient utilization of joint spatial-spectral information, gradient vanishing with increasing depth, and overfitting. To enhance feature extraction efficiency while skipping redundant information, this paper proposes a dynamic attention convolution design based on an improved 3D-DenseNet model. The design employs multiple parallel convolutional kernels instead of a single kernel and assigns dynamic attention weights to these parallel convolutions. This dynamic attention mechanism achieves adaptive feature response based on spatial characteristics in the spatial dimension of hyperspectral images, focusing more on key spatial structures. In the spectral dimension, it enables dynamic discrimination of different bands, alleviating information redundancy and computational complexity caused by high spectral dimensionality. The DAC module enhances model representation capability by attention-based aggregation of multiple convolutional kernels without increasing network depth or width. The proposed method demonstrates superior performance in both inference speed and accuracy, outperforming mainstream hyperspectral image classification methods on the IN, UP, and KSC datasets.
CVApr 15, 2025
3D Wavelet Convolutions with Extended Receptive Fields for Hyperspectral Image ClassificationGuandong Li, Mengxia Ye
Deep neural networks face numerous challenges in hyperspectral image classification, including high-dimensional data, sparse ground object distributions, and spectral redundancy, which often lead to classification overfitting and limited generalization capability. To better adapt to ground object distributions while expanding receptive fields without introducing excessive parameters and skipping redundant information, this paper proposes WCNet, an improved 3D-DenseNet model integrated with wavelet transforms. We introduce wavelet transforms to effectively extend convolutional receptive fields and guide CNNs to better respond to low frequencies through cascading, termed wavelet convolution. Each convolution focuses on different frequency bands of the input signal with gradually increasing effective ranges. This process enables greater emphasis on low-frequency components while adding only a small number of trainable parameters. This dynamic approach allows the model to flexibly focus on critical spatial structures when processing different regions, rather than relying on fixed receptive fields of single static kernels. The Wavelet Conv module enhances model representation capability by expanding receptive fields through 3D wavelet transforms without increasing network depth or width. Experimental results demonstrate superior performance on the IN, UP, and KSC datasets, outperforming mainstream hyperspectral image classification methods.
CVApr 6, 2025
Spatial-Geometry Enhanced 3D Dynamic Snake Convolutional Neural Network for Hyperspectral Image ClassificationGuandong Li, Mengxia Ye
Deep neural networks face several challenges in hyperspectral image classification, including complex and sparse ground object distributions, small clustered structures, and elongated multi-branch features that often lead to missing detections. To better adapt to ground object distributions and achieve adaptive dynamic feature responses while skipping redundant information, this paper proposes a Spatial-Geometry Enhanced 3D Dynamic Snake Network (SG-DSCNet) based on an improved 3D-DenseNet model. The network employs Dynamic Snake Convolution (DSCConv), which introduces deformable offsets to enhance kernel flexibility through constrained self-learning, thereby improving regional perception of ground objects. Additionally, we propose a multi-view feature fusion strategy that generates multiple morphological kernel templates from DSCConv to observe target structures from different perspectives and achieve efficient feature fusion through summarizing key characteristics. This dynamic approach enables the model to focus more flexibly on critical spatial structures when processing different regions, rather than relying on fixed receptive fields of single static kernels. The DSC module enhances model representation capability through dynamic kernel aggregation without increasing network depth or width. Experimental results demonstrate superior performance on the IN, UP, and KSC datasets, outperforming mainstream hyperspectral classification methods.
CVApr 17, 2025
Expert Kernel Generation Network Driven by Contextual Mapping for Hyperspectral Image ClassificationGuandong Li, Mengxia Ye
Deep neural networks face several challenges in hyperspectral image classification, including high-dimensional data, sparse distribution of ground objects, and spectral redundancy, which often lead to classification overfitting and limited generalization capability. To more efficiently adapt to ground object distributions while extracting image features without introducing excessive parameters and skipping redundant information, this paper proposes EKGNet based on an improved 3D-DenseNet model, consisting of a context-aware mapping network and a dynamic kernel generation module. The context-aware mapping module translates global contextual information of hyperspectral inputs into instructions for combining base convolutional kernels, while the dynamic kernels are composed of K groups of base convolutions, analogous to K different types of experts specializing in fundamental patterns across various dimensions. The mapping module and dynamic kernel generation mechanism form a tightly coupled system - the former generates meaningful combination weights based on inputs, while the latter constructs an adaptive expert convolution system using these weights. This dynamic approach enables the model to focus more flexibly on key spatial structures when processing different regions, rather than relying on the fixed receptive field of a single static convolutional kernel. EKGNet enhances model representation capability through a 3D dynamic expert convolution system without increasing network depth or width. The proposed method demonstrates superior performance on IN, UP, and KSC datasets, outperforming mainstream hyperspectral image classification approaches.
CVApr 21, 2025
Dynamic 3D KAN Convolution with Adaptive Grid Optimization for Hyperspectral Image ClassificationGuandong Li, Mengxia Ye
Deep neural networks face several challenges in hyperspectral image classification, including high-dimensional data, sparse distribution of ground objects, and spectral redundancy, which often lead to classification overfitting and limited generalization capability. To more efficiently adapt to ground object distributions while extracting image features without introducing excessive parameters and skipping redundant information, this paper proposes KANet based on an improved 3D-DenseNet model, consisting of 3D KAN Conv and an adaptive grid update mechanism. By introducing learnable univariate B-spline functions on network edges, specifically by flattening three-dimensional neighborhoods into vectors and applying B-spline-parameterized nonlinear activation functions to replace the fixed linear weights of traditional 3D convolutional kernels, we precisely capture complex spectral-spatial nonlinear relationships in hyperspectral data. Simultaneously, through a dynamic grid adjustment mechanism, we adaptively update the grid point positions of B-splines based on the statistical characteristics of input data, optimizing the resolution of spline functions to match the non-uniform distribution of spectral features, significantly improving the model's accuracy in high-dimensional data modeling and parameter efficiency, effectively alleviating the curse of dimensionality. This characteristic demonstrates superior neural scaling laws compared to traditional convolutional neural networks and reduces overfitting risks in small-sample and high-noise scenarios. KANet enhances model representation capability through a 3D dynamic expert convolution system without increasing network depth or width. The proposed method demonstrates superior performance on IN, UP, and KSC datasets, outperforming mainstream hyperspectral image classification approaches.
CVDec 28, 2023
Intelligent Parsing: An Automated Parsing Framework for Extracting Design Semantics from E-commerce CreativesGuandong Li, Xian Yang
In the industrial e-commerce landscape, creative designs such as banners and posters are ubiquitous. Extracting structured semantic information from creative e-commerce design materials (manuscripts crafted by designers) to obtain design semantics represents a core challenge in the realm of intelligent design. In this paper, we propose a comprehensive automated framework for intelligently parsing creative materials. This framework comprises material recognition, preprocess, smartname, and label layers. The material recognition layer consolidates various detection and recognition interfaces, covering business aspects including detection of auxiliary areas within creative materials and layer-level detection, alongside label identification. Algorithmically, it encompasses a variety of coarse-to-fine methods such as Cascade RCNN, GFL, and other models. The preprocess layer involves filtering creative layers and grading creative materials. The smartname layer achieves intelligent naming for creative materials, while the label layer covers multi-level tagging for creative materials, enabling tagging at different hierarchical levels. Intelligent parsing constitutes a complete parsing framework that significantly aids downstream processes such as intelligent creation, creative optimization, and material library construction. Within the practical business applications at Suning, it markedly enhances the exposure, circulation, and click-through rates of creative materials, expediting the closed-loop production of creative materials and yielding substantial benefits.
CVMar 5
Frequency-Aware Error-Bounded Caching for Accelerating Diffusion TransformersGuandong Li
Diffusion Transformers (DiTs) have emerged as the dominant architecture for high-quality image and video generation, yet their iterative denoising process incurs substantial computational cost during inference. Existing caching methods accelerate DiTs by reusing intermediate computations across timesteps, but they share a common limitation: treating the denoising process as uniform across time,depth, and feature dimensions. In this work, we identify three orthogonal axes of non-uniformity in DiT denoising: (1) temporal -- sensitivity to caching errors varies dramatically across the denoising trajectory; (2) depth -- consecutive caching decisions lead to cascading approximation errors; and (3) feature -- different components of the hidden state exhibit heterogeneous temporal dynamics. Based on these observations, we propose SpectralCache, a unified caching framework comprising Timestep-Aware Dynamic Scheduling (TADS), Cumulative Error Budgets (CEB), and Frequency-Decomposed Caching (FDC). On FLUX.1-schnell at 512x512 resolution, SpectralCache achieves 2.46x speedup with LPIPS 0.217 and SSIM 0.727, outperforming TeaCache (2.12x, LPIPS 0.215, SSIM 0.734) by 16% in speed while maintaining comparable quality (LPIPS difference < 1%). Our approach is training-free, plug-and-play, and compatible with existing DiT architectures.
CVFeb 20
Dual-Channel Attention Guidance for Training-Free Image Editing Control in Diffusion TransformersGuandong Li, Mengxia Ye
Training-free control over editing intensity is a critical requirement for diffusion-based image editing models built on the Diffusion Transformer (DiT) architecture. Existing attention manipulation methods focus exclusively on the Key space to modulate attention routing, leaving the Value space -- which governs feature aggregation -- entirely unexploited. In this paper, we first reveal that both Key and Value projections in DiT's multi-modal attention layers exhibit a pronounced bias-delta structure, where token embeddings cluster tightly around a layer-specific bias vector. Building on this observation, we propose Dual-Channel Attention Guidance (DCAG), a training-free framework that simultaneously manipulates both the Key channel (controlling where to attend) and the Value channel (controlling what to aggregate). We provide a theoretical analysis showing that the Key channel operates through the nonlinear softmax function, acting as a coarse control knob, while the Value channel operates through linear weighted summation, serving as a fine-grained complement. Together, the two-dimensional parameter space $(δ_k, δ_v)$ enables more precise editing-fidelity trade-offs than any single-channel method. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing categories) demonstrate that DCAG consistently outperforms Key-only guidance across all fidelity metrics, with the most significant improvements observed in localized editing tasks such as object deletion (4.9% LPIPS reduction) and object addition (3.2% LPIPS reduction).
CVJul 6, 2025
MVNet: Hyperspectral Remote Sensing Image Classification Based on Hybrid Mamba-Transformer Vision Backbone ArchitectureGuandong Li, Mengxia Ye
Hyperspectral image (HSI) classification faces challenges such as high-dimensional data, limited training samples, and spectral redundancy, which often lead to overfitting and insufficient generalization capability. This paper proposes a novel MVNet network architecture that integrates 3D-CNN's local feature extraction, Transformer's global modeling, and Mamba's linear complexity sequence modeling capabilities, achieving efficient spatial-spectral feature extraction and fusion. MVNet features a redesigned dual-branch Mamba module, including a State Space Model (SSM) branch and a non-SSM branch employing 1D convolution with SiLU activation, enhancing modeling of both short-range and long-range dependencies while reducing computational latency in traditional Mamba. The optimized HSI-MambaVision Mixer module overcomes the unidirectional limitation of causal convolution, capturing bidirectional spatial-spectral dependencies in a single forward pass through decoupled attention that focuses on high-value features, alleviating parameter redundancy and the curse of dimensionality. On IN, UP, and KSC datasets, MVNet outperforms mainstream hyperspectral image classification methods in both classification accuracy and computational efficiency, demonstrating robust capability in processing complex HSI data.
CVJun 10, 2025
Hyperspectral Image Classification via Transformer-based Spectral-Spatial Attention Decoupling and Adaptive GatingGuandong Li, Mengxia Ye
Deep neural networks face several challenges in hyperspectral image classification, including high-dimensional data, sparse distribution of ground objects, and spectral redundancy, which often lead to classification overfitting and limited generalization capability. To more effectively extract and fuse spatial context with fine spectral information in hyperspectral image (HSI) classification, this paper proposes a novel network architecture called STNet. The core advantage of STNet stems from the dual innovative design of its Spatial-Spectral Transformer module: first, the fundamental explicit decoupling of spatial and spectral attention ensures targeted capture of key information in HSI; second, two functionally distinct gating mechanisms perform intelligent regulation at both the fusion level of attention flows (adaptive attention fusion gating) and the internal level of feature transformation (GFFN). This characteristic demonstrates superior feature extraction and fusion capabilities compared to traditional convolutional neural networks, while reducing overfitting risks in small-sample and high-noise scenarios. STNet enhances model representation capability without increasing network depth or width. The proposed method demonstrates superior performance on IN, UP, and KSC datasets, outperforming mainstream hyperspectral image classification approaches.
CVMar 16, 2025
EditID: Training-Free Editable ID Customization for Text-to-Image GenerationGuandong Li, Zhaobin Chu
We propose EditID, a training-free approach based on the DiT architecture, which achieves highly editable customized IDs for text to image generation. Existing text-to-image models for customized IDs typically focus more on ID consistency while neglecting editability. It is challenging to alter facial orientation, character attributes, and other features through prompts. EditID addresses this by deconstructing the text-to-image model for customized IDs into an image generation branch and a character feature branch. The character feature branch is further decoupled into three modules: feature extraction, feature fusion, and feature integration. By introducing a combination of mapping features and shift features, along with controlling the intensity of ID feature integration, EditID achieves semantic compression of local features across network depths, forming an editable feature space. This enables the successful generation of high-quality images with editable IDs while maintaining ID consistency, achieving excellent results in the IBench evaluation, which is an editability evaluation framework for the field of customized ID text-to-image generation that quantitatively demonstrates the superior performance of EditID. EditID is the first text-to-image solution to propose customizable ID editability on the DiT architecture, meeting the demands of long prompts and high quality image generation.
CVFeb 21, 2024
A Two-Stage Dual-Path Framework for Text Tampering Detection and RecognitionGuandong Li, Xian Yang, Wenpin Ma
Document tamper detection has always been an important aspect of tamper detection. Before the advent of deep learning, document tamper detection was difficult. We have made some explorations in the field of text tamper detection based on deep learning. Our Ps tamper detection method includes three steps: feature assistance, audit point positioning, and tamper recognition. It involves hierarchical filtering and graded output (tampered/suspected tampered/untampered). By combining artificial tamper data features, we simulate and augment data samples in various scenarios (cropping with noise addition/replacement, single character/space replacement, smearing/splicing, brightness/contrast adjustment, etc.). The auxiliary features include exif/binary stream keyword retrieval/noise, which are used for branch detection based on the results. Audit point positioning uses detection frameworks and controls thresholds for high and low density detection. Tamper recognition employs a dual-path dual-stream recognition network, with RGB and ELA stream feature extraction. After dimensionality reduction through self-correlation percentile pooling, the fused output is processed through vlad, yielding an accuracy of 0.804, recall of 0.659, and precision of 0.913.
CVFeb 15
Inject Where It Matters: Training-Free Spatially-Adaptive Identity Preservation for Text-to-Image PersonalizationGuandong Li, Mengxia Ye
Personalized text-to-image generation aims to integrate specific identities into arbitrary contexts. However, existing tuning-free methods typically employ Spatially Uniform Visual Injection, causing identity features to contaminate non-facial regions (e.g., backgrounds and lighting) and degrading text adherence. To address this without expensive fine-tuning, we propose SpatialID, a training-free spatially-adaptive identity modulation framework. SpatialID fundamentally decouples identity injection into face-relevant and context-free regions using a Spatial Mask Extractor derived from cross-attention responses. Furthermore, we introduce a Temporal-Spatial Scheduling strategy that dynamically adjusts spatial constraints - transitioning from Gaussian priors to attention-based masks and adaptive relaxation - to align with the diffusion generation dynamics. Extensive experiments on IBench demonstrate that SpatialID achieves state-of-the-art performance in text adherence (CLIP-T: 0.281), visual consistency (CLIP-I: 0.827), and image quality (IQ: 0.523), significantly eliminating background contamination while maintaining robust identity preservation.
CVFeb 11
FastUSP: A Multi-Level Collaborative Acceleration Framework for Distributed Diffusion Model InferenceGuandong Li
Large-scale diffusion models such as FLUX (12B parameters) and Stable Diffusion 3 (8B parameters) require multi-GPU parallelism for efficient inference. Unified Sequence Parallelism (USP), which combines Ulysses and Ring attention mechanisms, has emerged as the state-of-the-art approach for distributed attention computation. However, existing USP implementations suffer from significant inefficiencies including excessive kernel launch overhead and suboptimal computation-communication scheduling. In this paper, we propose \textbf{FastUSP}, a multi-level optimization framework that integrates compile-level optimization (graph compilation with CUDA Graphs and computation-communication reordering), communication-level optimization (FP8 quantized collective communication), and operator-level optimization (pipelined Ring attention with double buffering). We evaluate FastUSP on FLUX (12B) and Qwen-Image models across 2, 4, and 8 NVIDIA RTX 5090 GPUs. On FLUX, FastUSP achieves consistent \textbf{1.12$\times$--1.16$\times$} end-to-end speedup over baseline USP, with compile-level optimization contributing the dominant improvement. On Qwen-Image, FastUSP achieves \textbf{1.09$\times$} speedup on 2 GPUs; on 4--8 GPUs, we identify a PyTorch Inductor compatibility limitation with Ring attention that prevents compile optimization, while baseline USP scales to 1.30$\times$--1.46$\times$ of 2-GPU performance. We further provide a detailed analysis of the performance characteristics of distributed diffusion inference, revealing that kernel launch overhead -- rather than communication latency -- is the primary bottleneck on modern high-bandwidth GPU interconnects.
CVSep 6, 2025
EditIDv2: Editable ID Customization with Data-Lubricated ID Feature Integration for Text-to-Image GenerationGuandong Li, Zhaobin Chu
We propose EditIDv2, a tuning-free solution specifically designed for high-complexity narrative scenes and long text inputs. Existing character editing methods perform well under simple prompts, but often suffer from degraded editing capabilities, semantic understanding biases, and identity consistency breakdowns when faced with long text narratives containing multiple semantic layers, temporal logic, and complex contextual relationships. In EditID, we analyzed the impact of the ID integration module on editability. In EditIDv2, we further explore and address the influence of the ID feature integration module. The core of EditIDv2 is to discuss the issue of editability injection under minimal data lubrication. Through a sophisticated decomposition of PerceiverAttention, the introduction of ID loss and joint dynamic training with the diffusion model, as well as an offline fusion strategy for the integration module, we achieve deep, multi-level semantic editing while maintaining identity consistency in complex narrative environments using only a small amount of data lubrication. This meets the demands of long prompts and high-quality image generation, and achieves excellent results in the IBench evaluation.
CVNov 11, 2024
Layout Control and Semantic Guidance with Attention Loss Backward for T2I Diffusion ModelGuandong Li
Controllable image generation has always been one of the core demands in image generation, aiming to create images that are both creative and logical while satisfying additional specified conditions. In the post-AIGC era, controllable generation relies on diffusion models and is accomplished by maintaining certain components or introducing inference interferences. This paper addresses key challenges in controllable generation: 1. mismatched object attributes during generation and poor prompt-following effects; 2. inadequate completion of controllable layouts. We propose a train-free method based on attention loss backward, cleverly controlling the cross attention map. By utilizing external conditions such as prompts that can reasonably map onto the attention map, we can control image generation without any training or fine-tuning. This method addresses issues like attribute mismatch and poor prompt-following while introducing explicit layout constraints for controllable image generation. Our approach has achieved excellent practical applications in production, and we hope it can serve as an inspiring technical report in this field.
IVFeb 14, 2022
Faster hyperspectral image classification based on selective kernel mechanism using deep convolutional networksGuandong Li, Chunju Zhang
Hyperspectral imagery is rich in spatial and spectral information. Using 3D-CNN can simultaneously acquire features of spatial and spectral dimensions to facilitate classification of features, but hyperspectral image information spectral dimensional information redundancy. The use of continuous 3D-CNN will result in a high amount of parameters, and the computational power requirements of the device are high, and the training takes too long. This letter designed the Faster selective kernel mechanism network (FSKNet), FSKNet can balance this problem. It designs 3D-CNN and 2D-CNN conversion modules, using 3D-CNN to complete feature extraction while reducing the dimensionality of spatial and spectrum. However, such a model is not lightweight enough. In the converted 2D-CNN, a selective kernel mechanism is proposed, which allows each neuron to adjust the receptive field size based on the two-way input information scale. Under the Selective kernel mechanism, it mainly includes two components, se module and variable convolution. Se acquires channel dimensional attention and variable convolution to obtain spatial dimension deformation information of ground objects. The model is more accurate, faster, and less computationally intensive. FSKNet achieves high accuracy on the IN, UP, Salinas, and Botswana data sets with very small parameters.