Liting Gao

h-index5

4papers

75citations

4 Papers

5.7CVJul 16

Cotton-SF YOLO: Learning Structural and Frequency Cues for Early Cotton Square Detection in Complex Field Environments

Chengjia Zhang, Yu Li, Feiri Ali et al.

Cotton squares are important phenotypic indicators of the early reproductive growth of cotton, and automatic field detection of cotton squares provides an important basis for cotton growth monitoring and precision cultivation management. However, early cotton square detection in complex field environments remains insufficiently explored, as cotton squares are small, frequently occluded, easily blurred, subject to illumination variations, and exhibit low contrast against surrounding cotton leaves. To address these challenges, we propose a task-oriented framework based on YOLO26m, named Cotton-SF YOLO, for cotton square detection under natural field conditions. To improve the perception of small and irregular cotton square boundaries, we introduce Dynamic Snake Convolution into the detector, enabling adaptive extraction of deformable edge features. Furthermore, a frequency-domain feature modulation module is designed by incorporating spectral enhancement into the C2f structure, which recalibrate frequency-domain representations and strengthen discriminative edge and texture cues while reducing interference from complex cotton leaf backgrounds. Trained and evaluated on our newly constructed and annotated field dataset with manually annotated cotton squares, the proposed model achieves mAP$_{50}$, mAP$_{50:95}$, and recall values of 0.8196, 0.4942, and 0.7939, improving over the baseline YOLO26m by 1.25%, 3.45%, and 2.96%, respectively. Ablation experiments and visualization demonstrate that the best performance is achieved with the complementary effects of structural and frequency cues.

7.3SDMay 28

COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings

Yonggang Zhu, Liting Gao, Aidong Men et al.

Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap between audio and text embeddings. Existing explanations mainly attribute this gap to the cone effect, treating it as a shift between mean embeddings, yet correcting the mean alone yields only limited improvements. Alternative hypotheses, such as information imbalance and dimensionality collapse, have also been proposed, but they remain insufficiently verified and have not been thoroughly studied in the audio domain. Meanwhile, several works attempt to decompose multimodal contrastive embeddings into interpretable concepts, but none explicitly analyze the modality gap from the perspective of concept decomposition. In this work, we introduce COMET (Concept space Organization and Modality gap Explanation with PLS-SVD Transformation), a novel partial least squares singular value decomposition (PLS-SVD) framework for CLAP that unveils a broader perspective of the modality gap. Our framework reveals that only a small, interpretable subset of axes, which captures shared concepts, contributes substantially to similarity computation, and that the mean component represents only partially the modality gap. Building on this insight, we propose a simple spectral truncation method that mitigates the modality gap in a training-free manner. The method enables zero-shot audio captioning with condition swapping to approach fully supervised performance, without requiring large auxiliary memory banks or expensive computation. At the same time, it achieves substantial embedding dimensionality reduction while preserving strong performance on retrieval and audio captioning tasks.

17.5SDJun 18

Hybrid Diffusion Transformer for Instruction-Guided Audio Editing via Rectified Flow

Liting Gao, Yonggang Zhu, Yaru Chen et al.

Audio editing aims to modify specific content in an existing audio clip according to a natural language instruction while preserving the remaining acoustic content. Despite the remarkable progress of diffusion models, existing training-based editing methods mainly rely on the local inductive biases and cross-attention interaction in convolutional U-Net backbones, which often hinder long-range semantic alignment and precise understanding and localization of instructions. In contrast, diffusion transformers provide stronger global modeling and multimodal fusion, but existing editing architectures usually adopt a simple stack of MMDiT and DiT blocks. Applying joint attention over concatenated audio and text tokens in all blocks results in quadratic complexity with respect to token length. To balance editing performance and efficiency, we propose a hybrid two-stage diffusion transformer architecture for instruction-guided audio editing based on rectified flow matching. It performs joint attention over audio and text tokens to establish coarse semantic alignment at low-resolution stage, then switches to alternating joint-attention and cross-attention blocks to refine editing details at high-resolution stage. This coarse-to-fine strategy enables efficient and accurate instruction-guided audio editing. Experiments show that the proposed framework achieves notable performance gains on challenging editing tasks involving overlapping audio events and complex instructions, while substantially improving editing efficiency with a compact model.

6.2CVSep 17, 2025

Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing

Yaru Chen, Ruohao Guo, Liting Gao et al.

Weakly-supervised audio-visual video parsing (AVVP) seeks to detect audible, visible, and audio-visual events without temporal annotations. Previous work has emphasized refining global predictions through contrastive or collaborative learning, but neglected stable segment-level supervision and class-aware cross-modal alignment. To address this, we propose two strategies: (1) an exponential moving average (EMA)-guided pseudo supervision framework that generates reliable segment-level masks via adaptive thresholds or top-k selection, offering stable temporal guidance beyond video-level labels; and (2) a class-aware cross-modal agreement (CMA) loss that aligns audio and visual embeddings at reliable segment-class pairs, ensuring consistency across modalities while preserving temporal structure. Evaluations on LLP and UnAV-100 datasets shows that our method achieves state-of-the-art (SOTA) performance across multiple metrics.