AS · May 15, 2022
Learning Representations for New Sound Classes With Continual Self-Supervised Learning
Zhepei Wang, Cem Subakan, Xilin Jiang et al. · uw
In this paper, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose adopting representation learning, where an encoder is trained using unlabeled data. This learning framework enables the study and implementation of a practically relevant use case where only a small amount of the labels is available in a continual learning context. We also make the empirical observation that a similarity-based representation learning method within this framework is robust to forgetting even if no explicit mechanism against forgetting is employed. We show that this approach obtains similar performance compared to several distillation-based continual learning methods when employed on self-supervised representation learning methods.
SD · Oct 19, 2023
Audio Editing with Non-Rigid Text Prompts
Francesco Paissan, Luca Della Libera, Zhepei Wang et al.
In this paper, we explore audio-editing with non-rigid text edits. We show that the proposed editing pipeline is able to create audio edits that remain faithful to the input audio. We explore text prompts that perform addition, style transfer, and in-painting. We quantitatively and qualitatively show that the edits are able to obtain results which outperform Audio-LDM, a recently released text-prompted audio generation model. Qualitative inspection of the results points out that the edits given by our approach remain more faithful to the input audio in terms of keeping the original onsets and offsets of the audio events.
AS · Mar 28, 2022
Improved singing voice separation with chromagram-based pitch-aware remixing
Siyuan Yuan, Zhepei Wang, Umut Isik et al.
Singing voice separation aims to separate music into vocals and accompaniment components. One of the major constraints for the task is the limited amount of training data with separated vocals. Data augmentation techniques such as random source mixing have been shown to make better use of existing data and mildly improve model performance. We propose a novel data augmentation technique, chromagram-based pitch-aware remixing, where music segments with high pitch alignment are mixed. By performing controlled experiments in both supervised and semi-supervised settings, we demonstrate that training models with pitch-aware remixing significantly improves the test signal-to-distortion ratio (SDR)
SD · Aug 23, 2024
On Class Separability Pitfalls In Audio-Text Contrastive Zero-Shot Learning
Tiago Tavares, Fabio Ayres, Zhepei Wang et al.
Recent advances in audio-text cross-modal contrastive learning have shown its potential towards zero-shot learning. One possibility for this is by projecting item embeddings from pre-trained backbone neural networks into a cross-modal space in which item similarity can be calculated in either domain. This process relies on a strong unimodal pre-training of the backbone networks, and on a data-intensive training task for the projectors. These two processes can be biased by unintentional data leakage, which can arise from using supervised learning in pre-training or from inadvertently training the cross-modal projection using labels from the zero-shot learning evaluation. In this study, we show that a significant part of the measured zero-shot learning accuracy is due to strengths inherited from the audio and text backbones, that is, they are not learned in the cross-modal domain and are not transferred from one modality to another.
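As a rough illustration of the zero-shot setup described above, the sketch below classifies an audio embedding by cosine similarity against text embeddings of the class labels in a shared space. The function names and the toy two-dimensional embeddings are hypothetical; the paper's actual backbones and projectors are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(audio_embedding, label_embeddings):
    """Return the label whose (hypothetical) text embedding is closest to the audio embedding."""
    return max(
        label_embeddings,
        key=lambda name: cosine(audio_embedding, label_embeddings[name]),
    )


# Toy embeddings standing in for projected audio/text features.
labels = {"dog bark": [1.0, 0.0], "siren": [0.0, 1.0]}
prediction = zero_shot_classify([0.9, 0.2], labels)
```

No class-specific training happens in this step, which is why any class separability already present in the backbone embeddings carries straight through to the measured zero-shot accuracy.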
SD · Feb 3
Rethinking Music Captioning with Music Metadata LLMs
Irmak Bukey, Zhepei Wang, Chris Donahue et al.
Music captioning, or the task of generating a natural language description of music, is useful for both music understanding and controllable music generation. Training captioning models, however, typically requires high-quality music caption data which is scarce compared to metadata (e.g., genre, mood, etc.). As a result, it is common to use large language models (LLMs) to synthesize captions from metadata to generate training data for captioning models, though this process imposes a fixed stylization and entangles factual information with natural language style. As a more direct approach, we propose metadata-based captioning. We train a metadata prediction model to infer detailed music metadata from audio and then convert it into expressive captions via pre-trained LLMs at inference time. Compared to a strong end-to-end baseline trained on LLM-generated captions derived from metadata, our method: (1) achieves comparable performance in less training time over end-to-end captioners, (2) offers flexibility to easily change stylization post-training, enabling output captions to be tailored to specific stylistic and quality requirements, and (3) can be prompted with audio and partial metadata to enable powerful metadata imputation or in-filling--a common task for organizing music data.
RO · May 21, 2021 · Code
Fast-Racing: An Open-source Strong Baseline for SE(3) Planning in Autonomous Drone Racing
Zhichao Han, Zhepei Wang, Neng Pan et al.
With the autonomy of aerial robots advancing in recent years, autonomous drone racing has drawn increasing attention. In a professional pilot competition, a skilled operator controls the drone to agilely avoid obstacles in aggressive attitudes, reaching the destination as fast as possible. Flying autonomously like elite pilots requires planning in SE(3), whose non-triviality and complexity have so far hindered a convincing solution in our community. To bridge this gap, this paper proposes an open-source baseline, which includes a high-performance SE(3) planner and a challenging simulation platform tailored for drone racing. We specify the SE(3) trajectory generation as a soft-penalty optimization problem, and speed up the solving process utilizing its underlying parallel structure. Moreover, to provide a testbed for challenging the planner, we develop delicate drone racing tracks which mimic real-world set-ups and necessitate planning in SE(3). Besides, we provide necessary system components such as common map interfaces and a baseline controller, to make our work plug-in-and-use. With our baseline, we hope to further foster the research of SE(3) planning and the competition of autonomous drone racing.
RO · Feb 27, 2021 · Code
Geometrically Constrained Trajectory Optimization for Multicopters
Zhepei Wang, Xin Zhou, Chao Xu et al.
We present an optimization-based framework for multicopter trajectory planning subject to geometrical configuration constraints and user-defined dynamic constraints. The basis of the framework is a novel trajectory representation built upon our novel optimality conditions for unconstrained control effort minimization. We design linear-complexity operations on this representation to conduct spatial-temporal deformation under various planning requirements. Smooth maps are utilized to exactly eliminate geometrical constraints in a lightweight fashion. A variety of state-input constraints are supported by the decoupling of dense constraint evaluation from sparse parameterization, and backward differentiation of flatness map. As a result, this framework transforms a generally constrained multicopter planning problem into an unconstrained optimization that can be solved reliably and efficiently. Our framework bridges the gaps among solution quality, planning efficiency, and constraint fidelity for a multicopter with limited resources and maneuvering capability. Its generality and robustness are both demonstrated by applications to different flight tasks. Extensive simulations and benchmarks are also conducted to show its capability of generating high-quality solutions while retaining the computation speed against other specialized methods by orders of magnitude. The source code of our framework is available at: https://github.com/ZJU-FAST-Lab/GCOPTER
RO · Aug 8, 2020 · Code
TGK-Planner: An Efficient Topology Guided Kinodynamic Planner for Autonomous Quadrotors
Hongkai Ye, Xin Zhou, Zhepei Wang et al.
In this paper, we propose a lightweight yet effective Topology Guided Kinodynamic planner (TGK-Planner) for quadrotor aggressive flights with limited onboard computing resources. The proposed system follows the traditional hierarchical planning workflow, with novel designs to improve the robustness and efficiency in both the pathfinding and trajectory optimization sub-modules. Firstly, we propose the topology guided graph, which roughly captures the topological structure of the environment and guides the state sampling of a sampling-based kinodynamic planner. In this way, we significantly improve the efficiency of finding a safe and dynamically feasible trajectory. Then, we refine the smoothness and continuity of the trajectory in an optimization framework, which incorporates the homotopy constraint to guarantee the safety of the trajectory. The optimization program is formulated as a sequence of quadratic programmings (QPs) and can be iteratively solved in a few milliseconds. Finally, the proposed system is integrated into a fully autonomous quadrotor and validated in various simulated and real-world scenarios. Benchmark comparisons show that our method outperforms state-of-the-art methods with regard to efficiency and trajectory quality. Moreover, we will release our code as an open-source package.
RO · Feb 25, 2020 · Code
Alternating Minimization Based Trajectory Generation for Quadrotor Aggressive Flight
Zhepei Wang, Xin Zhou, Chao Xu et al.
Although much research has been conducted into trajectory planning for quadrotors, planning spatially and temporally optimal trajectories in real time is still challenging. In this paper, we propose a framework for generating large-scale piecewise polynomial trajectories for aggressive autonomous flights, with highlights on its superior computational efficiency and simultaneous spatial-temporal optimality. Exploiting the implicitly decoupled structure of the planning problem, we conduct alternating minimization between boundary conditions and time durations of trajectory pieces. In each minimization phase, we leverage the algebraic convenience of the sub-problem to escape poor local minima and achieve the lowest time consumption. Theoretical analysis for the global/local convergence rate of our proposed method is provided. Moreover, based on polynomial theory, an extremely fast feasibility check method is designed for various kinds of constraints. By incorporating the method into our alternating structure, a constrained minimization algorithm is constructed to optimize trajectories on the premise of feasibility. Benchmark evaluation shows that our algorithm outperforms state-of-the-art methods regarding efficiency, optimality, and scalability. Aggressive flight experiments in a limited space with dense obstacles are presented to demonstrate the performance of the proposed algorithm. We release our implementation as an open-source ROS package.
SD · May 3, 2023
Unsupervised Improvement of Audio-Text Cross-Modal Representations
Zhepei Wang, Cem Subakan, Krishna Subramani et al.
Recent advances in using language models to obtain cross-modal audio-text representations have overcome the limitations of conventional training approaches that use predefined labels. This has allowed the community to make progress in tasks like zero-shot classification, which would otherwise not be possible. However, learning such representations requires a large amount of human-annotated audio-text pairs. In this paper, we study unsupervised approaches to improve the learning framework of such representations with unpaired text and audio. We explore domain-unspecific and domain-specific curation methods to create audio-text pairs that we use to further improve the model. We also show that when domain-specific curation is used in conjunction with a soft-labeled contrastive loss, we are able to obtain significant improvement in terms of zero-shot classification performance on downstream sound event classification or acoustic scene classification tasks.
RO · Feb 24, 2022
Bubble Planner: Planning High-speed Smooth Quadrotor Trajectories using Receding Corridors
Yunfan Ren, Fangcheng Zhu, Wenyi Liu et al.
Quadrotors are agile platforms. With human experts, they can perform extremely high-speed flights in cluttered environments. However, fully autonomous flight at high speed remains a significant challenge. In this work, we propose a motion planning algorithm based on the corridor-constrained minimum control effort trajectory optimization (MINCO) framework. Specifically, we use a series of overlapping spheres to represent the free space of the environment and propose two novel designs that enable the algorithm to plan high-speed quadrotor trajectories in real-time. One is a sampling-based corridor generation method that generates spheres with large overlapped areas (hence overall corridor size) between two neighboring spheres. The second is a Receding Horizon Corridors (RHC) strategy, where part of the previously generated corridor is reused in each replan. Together, these two designs enlarge the corridor spaces in accordance with the quadrotor's current state and hence allow the quadrotor to maneuver at high speeds. We benchmark our algorithm against other state-of-the-art planning methods to show its superiority in simulation. Comprehensive ablation studies are also conducted to show the necessity of the two designs. The proposed method is finally evaluated on an autonomous LiDAR-navigated quadrotor UAV in woods environments, achieving flight speeds over 13.7 m/s without any prior map of the environment or external localization facility.
RO · Sep 17, 2021
Robust Trajectory Planning for Spatial-Temporal Multi-Drone Coordination in Large Scenes
Zhepei Wang, Chao Xu, Fei Gao
In this paper, we describe a robust multi-drone planning framework for high-speed trajectories in large scenes. It uses a free-space-oriented map to free the optimization from cumbersome environment data. A capsule-like safety constraint is designed to avoid reciprocal collisions when vehicles deviate from their nominal flight progress under disturbance. We further show the minimum-singularity differential flatness of our drone dynamics with nonlinear drag effects involved. Leveraging the flatness map, trajectory optimization is efficiently conducted on the flat outputs while still subject to physical limits considering drag forces at high speeds. The robustness and effectiveness of our framework are both validated in large-scale simulations. It can compute collision-free trajectories satisfying high-fidelity vehicle constraints for hundreds of drones in a few minutes.
RO · Jun 23, 2021
Decentralized Spatial-Temporal Trajectory Planning for Multicopter Swarms
Xin Zhou, Zhepei Wang, Xiangyong Wen et al.
Multicopter swarms with decentralized structure possess the nature of flexibility and robustness, while efficient spatial-temporal trajectory planning still remains a challenge. This report introduces decentralized spatial-temporal trajectory planning, which puts a well-formed trajectory representation named MINCO into multi-agent scenarios. Our method ensures high-quality local planning for each agent subject to any constraint from either the coordination of the swarm or safety requirements in cluttered environments. Then, the local trajectory generation is formulated as an unconstrained optimization problem that is efficiently solved in milliseconds. Moreover, a decentralized asynchronous mechanism is designed to trigger the local planning for each agent. A systematic solution is presented with detailed descriptions of careful engineering considerations. Extensive benchmarks and indoor/outdoor experiments validate its wide applicability and high quality. Our software will be released for the reference of the community.
SD · May 17, 2021
Sound Event Detection with Adaptive Frequency Selection
Zhepei Wang, Jonah Casebeer, Adam Clemmitt et al.
In this work, we present HIDACT, a novel network architecture for adaptive computation for efficiently recognizing acoustic events. We evaluate the model on a sound event detection task where we train it to adaptively process frequency bands. The model learns to adapt to the input without requesting all frequency sub-bands provided. It can make confident predictions within fewer processing steps, hence reducing the amount of computation. Experimental results show that HIDACT has comparable performance to baseline models with more parameters and higher computational complexity. Furthermore, the model can adjust the amount of computation based on the data and computational budget.
SD · May 11, 2021
Separate but Together: Unsupervised Federated Learning for Speech Enhancement from Non-IID Data
Efthymios Tzinis, Jonah Casebeer, Zhepei Wang et al.
We propose FEDENHANCE, an unsupervised federated learning (FL) approach for speech enhancement and separation with non-IID distributed data across multiple clients. We simulate a real-world scenario where each client only has access to a few noisy recordings from a limited and disjoint number of speakers (hence non-IID). Each client trains their model in isolation using mixture invariant training while periodically providing updates to a central server. Our experiments show that our approach achieves competitive enhancement performance compared to IID training on a single device and that we can further facilitate the convergence speed and the overall performance using transfer learning on the server-side. Moreover, we show that we can effectively combine updates from clients trained locally with supervised and unsupervised losses. We also release a new dataset LibriFSD50K and its creation recipe in order to facilitate FL research for source separation problems.
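The server-side step in which client updates are combined can be sketched as a weighted average of client parameters. This assumes a FedAvg-style rule weighted by local dataset size; the abstract does not state the aggregation rule actually used, and `federated_average` is a hypothetical helper operating on flat parameter lists.

```python
def federated_average(client_weights, client_sizes):
    """Average flat parameter lists from clients, weighted by local dataset size.

    Assumption: FedAvg-style aggregation; not necessarily the paper's rule.
    """
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need one size per client and at least one client")
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two toy clients: the second holds three times as many recordings.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Weighting by dataset size keeps a client with very few recordings from dominating the global model, which matters when the per-client data is small and non-IID.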
SD · Mar 3, 2021
Compute and memory efficient universal sound source separation
Efthymios Tzinis, Zhepei Wang, Xilin Jiang et al.
Recent progress in audio source separation led by deep learning has enabled many neural network models to provide robust solutions to this fundamental estimation problem. In this study, we provide a family of efficient neural network architectures for general purpose audio source separation while focusing on multiple computational aspects that hinder the application of neural networks in real-world scenarios. The backbone structure of this convolutional network is the SUccessive DOwnsampling and Resampling of Multi-Resolution Features (SuDoRM-RF) as well as their aggregation which is performed through simple one-dimensional convolutions. This mechanism enables our models to obtain high fidelity signal separation in a wide variety of settings where a variable number of sources are present and with limited computational resources (e.g. floating point operations, memory footprint, number of parameters and latency). Our experiments show that SuDoRM-RF models perform comparably to, and even surpass, several state-of-the-art benchmarks with significantly higher computational resource requirements. The causal variation of SuDoRM-RF is able to obtain competitive performance in real-time speech separation of around 10dB scale-invariant signal-to-distortion ratio improvement (SI-SDRi) while remaining up to 20 times faster than real-time on a laptop device.
RO · Nov 8, 2020
Mapless-Planner: A Robust and Fast Planning Framework for Aggressive Autonomous Flight without Map Fusion
Jialin Ji, Zhepei Wang, Yingjian Wang et al.
Maintaining a map online is resource-consuming while a robust navigation system usually needs environment abstraction via a well-fused map. In this paper, we propose a mapless planner which directly conducts such abstraction on the unfused sensor data. A limited-memory data structure with a reliable proximity query algorithm is proposed for maintaining raw historical information. A sampling-based scheme is designed to extract the free-space skeleton. A smart waypoint selection strategy enables the generation of high-quality trajectories within the resultant flight corridors. Our planner differs from other mapless ones in that it can abstract and exploit the environment information efficiently. The online replan consistency and success rate are both significantly improved against conventional mapless methods.
RO · Nov 5, 2020
Generating Large-Scale Trajectories Efficiently using Double Descriptions of Polynomials
Zhepei Wang, Hongkai Ye, Chao Xu et al.
For quadrotor trajectory planning, describing a polynomial trajectory through coefficients and through end-derivatives each has its own convenience in energy minimization. We name them double descriptions of polynomial trajectories. The transformation between them, causing most of the inefficiency and instability, is formally analyzed in this paper. Leveraging its analytic structure, we design a linear-complexity scheme for both jerk/snap minimization and parameter gradient evaluation, which possesses efficiency, stability, flexibility, and scalability. With the help of our scheme, generating an energy-optimal (minimum snap) trajectory only costs 1 μs per piece at a scale of up to 1,000,000 pieces. Moreover, generating large-scale energy-time optimal trajectories is also accelerated by an order of magnitude against conventional methods.
RO · Aug 20, 2020
EGO-Planner: An ESDF-free Gradient-based Local Planner for Quadrotors
Xin Zhou, Zhepei Wang, Hongkai Ye et al.
Gradient-based planners are widely used for quadrotor local planning, in which a Euclidean Signed Distance Field (ESDF) is crucial for evaluating gradient magnitude and direction. Nevertheless, computing such a field has much redundancy since the trajectory optimization procedure only covers a very limited subspace of the ESDF updating range. In this paper, an ESDF-free gradient-based planning framework is proposed, which significantly reduces computation time. The main improvement is that the collision term in the penalty function is formulated by comparing the colliding trajectory with a collision-free guiding path. The resulting obstacle information will be stored only if the trajectory hits new obstacles, making the planner only extract necessary obstacle information. Then, we lengthen the time allocation if dynamical feasibility is violated. An anisotropic curve fitting algorithm is introduced to adjust higher-order derivatives of the trajectory while maintaining the original shape. Benchmark comparisons and real-world experiments verify its robustness and high-performance. The source code is released as ROS packages.
AS · Jul 14, 2020
Sudo rm -rf: Efficient Networks for Universal Audio Source Separation
Efthymios Tzinis, Zhepei Wang, Paris Smaragdis
In this paper, we present an efficient neural network for end-to-end general purpose audio source separation. Specifically, the backbone structure of this convolutional network is the SUccessive DOwnsampling and Resampling of Multi-Resolution Features (SuDoRMRF) as well as their aggregation which is performed through simple one-dimensional convolutions. In this way, we are able to obtain high quality audio source separation with limited number of floating point operations, memory requirements, number of parameters and latency. Our experiments on both speech and environmental sound separation datasets show that SuDoRMRF performs comparably and even surpasses various state-of-the-art approaches with significantly higher computational resource requirements.
RO · Feb 21, 2020
Detailed Proofs of Alternating Minimization Based Trajectory Generation for Quadrotor Aggressive Flight
Zhepei Wang, Xin Zhou, Chao Xu et al.
This technical report provides a detailed theoretical analysis of the algorithm used in "Alternating Minimization Based Trajectory Generation for Quadrotor Aggressive Flight". An assumption is provided to ensure that settings for the objective function are meaningful. Moreover, we explore the structure of the optimization problem and analyze the global/local convergence rate of the employed algorithm.
LG · Oct 22, 2019
Two-Step Sound Source Separation: Training on Learned Latent Targets
Efthymios Tzinis, Shrikant Venkataramani, Zhepei Wang et al.
In this paper, we propose a two-step training procedure for source separation via a deep neural network. In the first step we learn a transform (and its inverse) to a latent space where masking-based separation performance using oracles is optimal. For the second step, we train a separation module that operates on the previously learned space. In order to do so, we also make use of a scale-invariant signal-to-distortion ratio (SI-SDR) loss function that works in the latent space, and we prove that it lower-bounds the SI-SDR in the time domain. We run various sound separation experiments that show how this approach can obtain better performance as compared to systems that learn the transform and the separation module jointly. The proposed methodology is general enough to be applicable to a large class of neural network end-to-end separation systems.
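For readers unfamiliar with the metric, the sketch below computes the standard time-domain SI-SDR in decibels: the estimate is projected onto the target, and the energy of that projection is compared with the energy of the residual. It is not the latent-space loss proposed in the paper; the function name and the small `eps` stabilizer are assumptions made for illustration.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Time-domain scale-invariant SDR (dB) for two equal-length sample sequences."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    # Optimal scaling of the target that best explains the estimate.
    alpha = dot / (target_energy + eps)
    projection = [alpha * t for t in target]
    residual = [e - p for e, p in zip(estimate, projection)]
    signal_energy = sum(p * p for p in projection)
    noise_energy = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))


target = [1.0, 2.0, 3.0, -1.0]
estimate = [1.0, 2.0, 3.5, -1.0]
score = si_sdr(estimate, target)
# Rescaling the estimate leaves the score essentially unchanged.
score_scaled = si_sdr([2.0 * x for x in estimate], target)
```

Because the target is rescaled before the residual is measured, a separator cannot raise its score simply by changing the output gain.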
LG · Jun 3, 2019
Continual Learning of New Sound Classes using Generative Replay
Zhepei Wang, Cem Subakan, Efthymios Tzinis et al.
Continual learning consists of incrementally training a model on a sequence of datasets and testing on the union of all datasets. In this paper, we examine continual learning for the problem of sound classification, in which we wish to refine already trained models to learn new sound classes. In practice one does not want to maintain all past training data and retrain from scratch, but naively updating a model with new data(sets) results in a degradation of already learned tasks, which is referred to as "catastrophic forgetting." We develop a generative replay procedure for generating training audio spectrogram data, in place of keeping older training datasets. We show that, when incrementally refining a classifier with generative replay, a generator that is 4% of the size of all previous training data matches the performance of refining the classifier while keeping 20% of all previous training data. We thus conclude that we can extend a trained sound classifier to learn new classes without having to keep previously used datasets.
SD · Nov 3, 2018
Multi-View Networks For Multi-Channel Audio Classification
Jonah Casebeer, Zhepei Wang, Paris Smaragdis
In this paper we introduce the idea of multi-view networks for sound classification with multiple sensors. We show how one can build a multi-channel sound recognition model trained on a fixed number of channels, and deploy it to scenarios with arbitrary (and potentially dynamically changing) number of input channels and not observe degradation in performance. We demonstrate that at inference time you can safely provide this model all available channels as it can ignore noisy information and leverage new information better than standard baseline approaches. The model is evaluated in both an anechoic environment and in rooms generated by a room acoustics simulator. We demonstrate that this model can generalize to unseen numbers of channels as well as unseen room geometries.