32.0SDJun 1
C2GA: A Class-Controllable Generative Augmentation Framework for Respiratory Sound ClassificationZiqi Ma, Mengyu Han, Anteng Cai et al.
Background: Respiratory sound classification plays a critical role in the clinical identification of pulmonary pathologies. However, its performance is often hindered by the limited size, severe noise, and class imbalance of real-world auscultation datasets. Although conventional audio augmentation techniques are easy to implement, they may inadvertently distort subtle pathological characteristics. Meanwhile, existing Variational Autoencoder (VAE)- or Generative Adversarial Network (GAN)-based generative approaches often suffer from limited sample fidelity and insufficient controllability over class semantics, particularly under conditions of scarce supervision. Methods: To overcome these limitations, we propose C2GA, a class-controllable generative augmentation framework. C2GA first constructs a semantically rich discrete latent space using a conditional Vector-Quantized Variational Autoencoder (VQ-VAE), in which local acoustic tokens are explicitly decoupled from global class prototypes. Subsequently, a Transformer-based autoregressive prior is trained to generate label-consistent token sequences. These generated tokens are then fused with the corresponding class prototypes and decoded into high-fidelity Mel-spectrograms for data augmentation. Conclusion: These results indicate that C2GA provides an effective and semantically reliable augmentation strategy for respiratory sound analysis. By enabling controllable and high-quality data generation, the proposed framework offers a promising solution for improving the robustness and generalization of respiratory sound classification in realistic clinical scenarios.
ROMay 21, 2025
VERDI: VLM-Embedded Reasoning for Autonomous DrivingBowen Feng, Zhiting Mei, Baiang Li et al.
While autonomous driving (AD) stacks struggle with decision making under partial observability and real-world complexity, human drivers are capable of commonsense reasoning to make near-optimal decisions with limited information. Recent work has attempted to leverage finetuned Vision-Language Models (VLMs) for trajectory planning at inference time to emulate human behavior. Despite their success in benchmark evaluations, these methods are often impractical to deploy (a 70B parameter VLM inference at merely 8 tokens per second requires more than 160G of memory), and their monolithic network structure prohibits safety decomposition. To bridge this gap, we propose VLM-Embedded Reasoning for autonomous Driving (VERDI), a training-time framework that distills the reasoning process and commonsense knowledge of VLMs into the AD stack. VERDI augments modular differentiable end-to-end (e2e) AD models by aligning intermediate module outputs at the perception, prediction, and planning stages with text features explaining the driving reasoning process produced by VLMs. By encouraging alignment in latent space, VERDI enables the modular AD stack to internalize structured reasoning, without incurring the inference-time costs of large VLMs. We demonstrate the effectiveness of our method on the NuScenes dataset and find that VERDI outperforms existing e2e methods that do not embed reasoning by 10% in $\ell_{2}$ distance, while maintaining high inference speed.
LGJun 20, 2024
Behavior-Inspired Neural Networks for Relational InferenceYulong Yang, Bowen Feng, Keqin Wang et al.
From pedestrians to Kuramoto oscillators, interactions between agents govern how dynamical systems evolve in space and time. Discovering how these agents relate to each other has the potential to improve our understanding of the often complex dynamics that underlie these systems. Recent works learn to categorize relationships between agents based on observations of their physical behavior. These approaches model relationship categories as outcomes of a categorical distribution which is limiting and contrary to real-world systems, where relationship categories often intermingle and interact. In this work, we introduce a level of abstraction between the observable behavior of agents and the latent categories that determine their behavior. To do this, we learn a mapping from agent observations to agent preferences for a set of latent categories. The learned preferences and inter-agent proximity are integrated in a nonlinear opinion dynamics model, which allows us to naturally identify mutually exclusive categories, predict an agent's evolution in time, and control an agent's behavior. Through extensive experiments, we demonstrate the utility of our model for learning interpretable categories, and the efficacy of our model for long-horizon trajectory prediction.