Yanlong Chen

CV
h-index: 26
Papers: 5
Citations: 26
Novelty: 56%
AI Score: 52

5 Papers

CV · Jan 30
Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs

Yanlong Chen, Amirhossein Habibian, Luca Benini et al.

Vision-Language Models (VLMs) achieve strong multimodal performance but are costly to deploy, and post-training quantization often causes significant accuracy loss. Despite its potential, quantization-aware training (QAT) for VLMs remains underexplored. We propose GRACE, a framework unifying knowledge distillation and QAT under the Information Bottleneck principle: quantization constrains information capacity while distillation guides what to preserve within this budget. Treating the teacher as a proxy for task-relevant information, we introduce confidence-gated decoupled distillation to filter unreliable supervision, relational centered kernel alignment to transfer visual token structures, and an adaptive controller via Lagrangian relaxation to balance fidelity against capacity constraints. Across extensive benchmarks on LLaVA and Qwen families, our INT4 models consistently outperform FP16 baselines (e.g., LLaVA-1.5-7B: 70.1 vs. 66.8 on SQA; Qwen2-VL-2B: 76.9 vs. 72.6 on MMBench), nearly matching teacher performance. Using a real INT4 kernel, we achieve 3$\times$ throughput with 54% memory reduction. This principled framework significantly outperforms existing quantization methods, making GRACE a compelling solution for resource-constrained deployment.
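The idea of filtering unreliable teacher supervision can be illustrated with a minimal sketch. It is not GRACE's actual formulation: the abstract does not give the gating rule, so the hard threshold on the teacher's maximum softmax probability, the plain KL objective (without the "decoupled" split), and the names `gated_kd_loss` and `threshold` are all assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def gated_kd_loss(teacher_logits, student_logits, threshold=0.5, temperature=1.0):
    """Mean KL(teacher || student) over tokens that pass a confidence gate.

    Assumption: a token is kept only if the teacher's top probability is at
    least `threshold` (a hypothetical hard gate, not taken from the paper).
    Returns (loss, number_of_tokens_kept).
    """
    total, kept = 0.0, 0
    for t_logits, s_logits in zip(teacher_logits, student_logits):
        p = softmax(t_logits, temperature)
        if max(p) < threshold:
            continue  # low-confidence teacher output: skip this token
        q = softmax(s_logits, temperature)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
        kept += 1
    return (total / kept if kept else 0.0), kept
```

With a threshold of 0.5, a token whose teacher distribution is nearly uniform contributes nothing to the loss, so the student is not penalized for disagreeing with an uncertain teacher.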

LG · Jun 12, 2025
PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation

Yanlong Chen, Mattia Orlandi, Pierangelo Maria Rapa et al.

Physiological signals are often corrupted by motion artifacts, baseline drift, and other low-SNR disturbances, which pose significant challenges for analysis. Additionally, these signals exhibit strong non-stationarity, with sharp peaks and abrupt changes that evolve continuously, making them difficult to represent using traditional time-domain or filtering methods. To address these issues, a novel wavelet-based approach for physiological signal analysis is presented, aiming to capture multi-scale time-frequency features in various physiological signals. Leveraging this technique, two large-scale pretrained models specific to EMG and ECG are introduced for the first time, achieving superior performance and setting new baselines in downstream tasks. Additionally, a unified multi-modal framework is constructed by integrating a pretrained EEG model, where each modality is guided through its dedicated branch and fused via learnable weighted fusion. This design effectively addresses challenges such as low signal-to-noise ratio, high inter-subject variability, and device mismatch, outperforming existing methods on multi-modal tasks. The proposed wavelet-based architecture lays a solid foundation for analysis of diverse physiological signals, while the multi-modal design points to next-generation physiological signal processing with potential impact on wearable health monitoring, clinical diagnostics, and broader biomedical applications. Code and data are available at: github.com/ForeverBlue816/PhysioWave
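One plausible reading of "learnable weighted fusion" is a softmax-normalized scalar weight per modality branch. The sketch below assumes exactly that; the abstract does not specify the normalization or the shape of the weights, and `fuse` and `fusion_logits` are hypothetical names rather than identifiers from the PhysioWave repository.

```python
import math

def fuse(embeddings, fusion_logits):
    """Weighted sum of per-modality embeddings.

    embeddings:    one equal-length vector per modality (e.g. EMG, ECG, EEG).
    fusion_logits: one scalar per modality; in training these would be
                   learnable parameters (assumed here, fixed in this sketch).
    Returns (fused_vector, normalized_weights).
    """
    m = max(fusion_logits)
    exps = [math.exp(w - m) for w in fusion_logits]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax keeps the weights positive, summing to 1
    dim = len(embeddings[0])
    fused = [sum(w * emb[i] for w, emb in zip(weights, embeddings)) for i in range(dim)]
    return fused, weights
```

Equal logits reduce this to a plain average of the branches; raising one logit shifts the fused representation toward that modality.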

CV · May 8
Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs

Song Zhang, Yanlong Chen, Yilin Li et al.

Remote sensing vision-language models (RS-VLMs) face a fundamental mismatch with natural-image counterparts: the same geographic object exhibits radically different visual evidence across ground sampling distances (GSDs) spanning multiple orders of magnitude. Yet existing RS-VLMs often discard GSD or inject it as a discrete text token, forcing a single static parameter set to absorb the entire scale spectrum. We introduce ScaleEarth, a parameter-efficient fine-tuning framework built on Qwen3-VL that treats GSD as a continuous conditioning variable governing the model's computation path. At its core, CS-HLoRA (Continuous Scale-Conditioned Hyper-LoRA) modulates the LoRA low-rank subspace through a GSD-driven gate, enabling the model to dynamically route computation by physical scale. To remove reliance on sensor metadata at deployment, we pair CS-HLoRA with SSE-U, a lightweight heteroscedastic sub-head that predicts GSD and its uncertainty from visual features. To provide matching supervision, we construct GeoScale-VQA, a 1.5M-sample scale-layered RS-VQA corpus whose question-answer generation is conditioned on the same physical scalar that drives CS-HLoRA, forming a closed method-data loop. Trained with QLoRA on an 8B backbone, ScaleEarth achieves state-of-the-art results on remote-sensing benchmarks covering diverse Earth-system tasks, including XLRS-Bench and OmniEarth-Bench.
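To make the idea of a GSD-driven gate on the LoRA subspace concrete, here is a toy forward pass. It is a sketch under stated assumptions, not CS-HLoRA itself: the abstract does not describe the gate, so the per-rank sigmoid over log10(GSD) and the names `gsd_gate`, `slopes`, and `offsets` are invented for illustration.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gsd_gate(gsd_m, slopes, offsets):
    """One gate value in (0, 1) per LoRA rank, driven by log10 of the GSD in meters.

    Hypothetical parameterization: sigmoid(slope * log10(gsd) + offset).
    """
    s = math.log10(gsd_m)
    return [1.0 / (1.0 + math.exp(-(a * s + b))) for a, b in zip(slopes, offsets)]

def gated_lora_forward(x, W, A, B, gate):
    """Compute y = W x + B (gate * (A x)), with the gate applied per rank."""
    base = matvec(W, x)                       # frozen backbone path
    low = matvec(A, x)                        # project into the rank-r subspace
    low = [g * h for g, h in zip(gate, low)]  # scale each rank by its gate
    delta = matvec(B, low)                    # project back to the output size
    return [b + d for b, d in zip(base, delta)]
```

Because the gate is a continuous function of GSD, nearby scales share most of the adapter while distant scales can emphasize different rank directions, which is the contrast with injecting GSD as a discrete text token.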

CV · Jun 12, 2025
WaveFormer: A Lightweight Transformer Model for sEMG-based Gesture Recognition

Yanlong Chen, Mattia Orlandi, Pierangelo Maria Rapa et al.

Human-machine interaction, particularly in prosthetic and robotic control, has seen progress with gesture recognition via surface electromyographic (sEMG) signals. However, classifying similar gestures that produce nearly identical muscle signals remains a challenge, often reducing classification accuracy. Traditional deep learning models for sEMG gesture recognition are large and computationally expensive, limiting their deployment on resource-constrained embedded systems. In this work, we propose WaveFormer, a lightweight transformer-based architecture tailored for sEMG gesture recognition. Our model integrates time-domain and frequency-domain features through a novel learnable wavelet transform, enhancing feature extraction. In particular, the WaveletConv module, a multi-level wavelet decomposition layer with depthwise separable convolution, ensures both efficiency and compactness. With just 3.1 million parameters, WaveFormer achieves 95% classification accuracy on the EPN612 dataset, outperforming larger models. Furthermore, when profiled on a laptop equipped with an Intel CPU, INT8 quantization achieves real-time deployment with a 6.75 ms inference latency.
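For readers unfamiliar with multi-level wavelet decomposition, the following sketch shows the fixed-filter version of the operation on a single channel. It uses the orthonormal Haar wavelet as a stand-in; WaveFormer's WaveletConv is described as learnable and built on depthwise separable convolution, neither of which is reproduced here, and the function names are illustrative.

```python
import math

def haar_step(signal):
    """One Haar level: pairwise scaled sums (approximation) and differences (detail).

    A trailing unpaired sample is dropped when the length is odd.
    """
    s = 1.0 / math.sqrt(2.0)
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) * s for i in pairs]
    detail = [(signal[i] - signal[i + 1]) * s for i in pairs]
    return approx, detail

def haar_decompose(signal, levels):
    """Repeatedly split the approximation band.

    Returns (final_approx, details), with details ordered finest scale first.
    """
    details = []
    approx = list(signal)
    for _ in range(levels):
        if len(approx) < 2:
            break  # nothing left to split
        approx, d = haar_step(approx)
        details.append(d)
    return approx, details
```

Each level halves the time resolution and isolates a lower frequency band, so the detail lists together give a multi-scale time-frequency view of the input. In a learnable variant, the fixed sum and difference filters would be replaced by trained convolution kernels.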

RO · Aug 2, 2020
Edge Computing for Real-Time Near-Crash Detection for Smart Transportation Applications

Ruimin Ke, Zhiyong Cui, Yanlong Chen et al.

Traffic near-crash events serve as critical data sources for various smart transportation applications, such as being surrogate safety measures for traffic safety research and corner case data for automated vehicle testing. However, there are several key challenges for near-crash detection. First, extracting near-crashes from original data sources requires significant computing, communication, and storage resources. Also, existing methods lack efficiency and transferability, which bottlenecks prospective large-scale applications. To this end, this paper leverages the power of edge computing to address these challenges by processing video streams from existing onboard dashcams in real time. We design a multi-thread system architecture that operates on edge devices and model the bounding boxes generated by object detection and tracking in linear complexity. The method is insensitive to camera parameters and backward compatible with different vehicles. The edge computing system has been evaluated with recorded videos and real-world tests on two cars and four buses for over ten thousand hours. It filters out irrelevant videos in real time, thereby saving labor cost, processing time, network bandwidth, and data storage. It collects not only event videos but also other valuable data such as road user type, event location, time to collision, vehicle trajectory, vehicle speed, brake switch, and throttle. The experiments demonstrate the promising performance of the system regarding efficiency, accuracy, reliability, and transferability. It is among the first efforts to apply edge computing to real-time traffic video analytics and is expected to benefit multiple sub-fields in smart transportation research and applications.
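The abstract does not spell out how bounding boxes are modeled, but a common camera-parameter-free way to estimate time to collision (TTC) from tracked boxes is the scale-change approximation: if a box grows by a factor s over an interval dt at constant closing speed, then TTC is approximately dt / (s - 1). The sketch below assumes that approximation; the 2.0 s threshold and the function names are hypothetical, not taken from the paper.

```python
def time_to_collision(width_prev, width_now, dt):
    """Estimate TTC in seconds from the growth of a tracked bounding box.

    Assumes a pinhole camera and constant closing speed, so apparent width is
    inversely proportional to distance. Returns infinity when the box is not
    growing (the object is not approaching).
    """
    if width_prev <= 0 or width_now <= 0 or dt <= 0:
        raise ValueError("widths and dt must be positive")
    scale = width_now / width_prev
    if scale <= 1.0:
        return float("inf")
    return dt / (scale - 1.0)

def is_near_crash(widths, dt, ttc_threshold=2.0):
    """Flag a track if any consecutive frame pair yields TTC below the threshold.

    A single pass over the track, so the cost is linear in the number of frames.
    """
    return any(
        time_to_collision(a, b, dt) < ttc_threshold
        for a, b in zip(widths, widths[1:])
    )
```

Only ratios of box sizes enter the estimate, so focal length and mounting height cancel out, which is one way a method can be insensitive to camera parameters.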