Jen-Yu Liu

SD
h-index: 42
13 papers
524 citations
Novelty: 53%
AI Score: 39

13 Papers

SD · Feb 5, 2024
ISPA: Inter-Species Phonetic Alphabet for Transcribing Animal Sounds

Masato Hagiwara, Marius Miron, Jen-Yu Liu

Traditionally, bioacoustics has relied on spectrograms and continuous, per-frame audio representations for the analysis of animal sounds, also serving as input to machine learning models. Meanwhile, the International Phonetic Alphabet (IPA) system has provided an interpretable, language-independent method for transcribing human speech sounds. In this paper, we introduce ISPA (Inter-Species Phonetic Alphabet), a precise, concise, and interpretable system designed for transcribing animal sounds into text. We compare acoustics-based and feature-based methods for transcribing and classifying animal sounds, demonstrating their comparable performance with baseline methods utilizing continuous, dense audio representations. By representing animal sounds with text, we effectively treat them as a "foreign language," and we show that established human language ML paradigms and models, such as language models, can be successfully applied to improve performance.

SD · Mar 4, 2025
Robust detection of overlapping bioacoustic sound events

Louis Mahon, Benjamin Hoffman, Logan James et al.

We propose a method for accurately detecting bioacoustic sound events that is robust to overlapping events, a common issue in domains such as ethology, ecology and conservation. While standard methods employ a frame-based, multi-label approach, we introduce an onset-based detection method which we name Voxaboxen. It takes inspiration from object detection methods in computer vision, but simultaneously takes advantage of recent advances in self-supervised audio encoders. For each time window, Voxaboxen predicts whether it contains the start of a vocalization and how long the vocalization is. It also does the same in reverse, predicting whether each window contains the end of a vocalization, and how long ago it started. The two resulting sets of bounding boxes are then fused using a graph-matching algorithm. We also release a new dataset designed to measure performance on detecting overlapping vocalizations. This consists of recordings of zebra finches annotated with temporally-strong labels and showing frequent overlaps. We test Voxaboxen on seven existing data sets and on our new data set. We compare Voxaboxen to natural baselines and existing sound event detection methods and demonstrate SotA results. Further experiments show that improvements are robust to frequent vocalization overlap.
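The two-directional prediction and fusion described in this abstract can be illustrated with a minimal sketch. The abstract says the two sets of boxes are fused with a graph-matching algorithm but gives no details, so the greedy IoU pairing below is a stand-in chosen for brevity, not the authors' method; the function names, the `min_iou` threshold, and the choice to average matched boxes are all assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def fuse_boxes(forward, backward, min_iou=0.5):
    """Greedily pair forward and backward boxes by IoU and average each pair.

    forward:  list of (onset, duration) predictions
    backward: list of (offset, duration) predictions
    Unmatched boxes are dropped in this simplified sketch (an assumption).
    """
    fwd = [(on, on + dur) for on, dur in forward]
    bwd = [(off - dur, off) for off, dur in backward]
    # Score every candidate pair and visit the best-overlapping pairs first.
    pairs = sorted(
        ((iou(f, b), i, j) for i, f in enumerate(fwd) for j, b in enumerate(bwd)),
        reverse=True,
    )
    used_f, used_b, fused = set(), set(), []
    for score, i, j in pairs:
        if score < min_iou:
            break
        if i in used_f or j in used_b:
            continue
        used_f.add(i)
        used_b.add(j)
        fused.append(((fwd[i][0] + bwd[j][0]) / 2, (fwd[i][1] + bwd[j][1]) / 2))
    return sorted(fused)
```

Because every event is a separate interval rather than a per-frame label, two vocalizations that overlap in time remain two distinct boxes after fusion.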

SD · Aug 15, 2025
What Matters for Bioacoustic Encoding

Marius Miron, David Robinson, Milad Alizadeh et al.

Bioacoustics, the study of sounds produced by living organisms, plays a vital role in conservation, biodiversity monitoring, and behavioral studies. Many tasks in this field, such as species, individual, and behavior classification and detection, are well-suited to machine learning. However, they often suffer from limited annotated data, highlighting the need for a general-purpose bioacoustic encoder capable of extracting useful representations for diverse downstream tasks. Such encoders have been proposed before, but are often limited in scope due to a focus on a narrow range of species (typically birds), and a reliance on a single model architecture or training paradigm. Moreover, they are usually evaluated on a small set of tasks and datasets. In this work, we present a large-scale empirical study that covers aspects of bioacoustics that are relevant to research but have previously been scarcely considered: training data diversity and scale, model architectures and training recipes, and the breadth of evaluation tasks and datasets. We obtain encoders that are state-of-the-art on the existing and proposed benchmarks. We also identify what matters for training these encoders, such that this work can be extended when more data are available or better architectures are proposed. Specifically, across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery, we find self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance. We show the importance of data diversity in both stages. To support ongoing research and application, we will release the model checkpoints.

AS · Oct 8, 2021
KaraSinger: Score-Free Singing Voice Synthesis with VQ-VAE using Mel-spectrograms

Chien-Feng Liao, Jen-Yu Liu, Yi-Hsuan Yang

In this paper, we propose a novel neural network model called KaraSinger for a less-studied singing voice synthesis (SVS) task named score-free SVS, in which the prosody and melody are spontaneously decided by machine. KaraSinger comprises a vector-quantized variational autoencoder (VQ-VAE) that compresses the Mel-spectrograms of singing audio to sequences of discrete codes, and a language model (LM) that learns to predict the discrete codes given the corresponding lyrics. For the VQ-VAE part, we employ a Connectionist Temporal Classification (CTC) loss to encourage the discrete codes to carry phoneme-related information. For the LM part, we use location-sensitive attention for learning a robust alignment between the input phoneme sequence and the output discrete code. We keep the architecture of both the VQ-VAE and LM light-weight for fast training and inference speed. We validate the effectiveness of the proposed design choices using a proprietary collection of 550 English pop songs sung by multiple amateur singers. The result of a listening test shows that KaraSinger achieves high scores in intelligibility, musicality, and the overall quality.

SD · Jan 7, 2021
Compound Word Transformer: Learning to Compose Full-Song Music over Dynamic Directed Hypergraphs

Wen-Yi Hsiao, Jen-Yu Liu, Yin-Cheng Yeh et al.

To apply neural sequence models such as Transformers to music generation tasks, one has to represent a piece of music by a sequence of tokens drawn from a finite, pre-defined vocabulary. Such a vocabulary usually involves tokens of various types. For example, to describe a musical note, one needs separate tokens to indicate the note's pitch, duration, velocity (dynamics), and placement (onset time) along the time grid. While different types of tokens may possess different properties, existing models usually treat them equally, in the same way as modeling words in natural languages. In this paper, we present a conceptually different approach that explicitly takes into account the type of the tokens, such as note types and metric types. We also propose a new Transformer decoder architecture that uses different feed-forward heads to model tokens of different types. With an expansion-compression trick, we convert a piece of music to a sequence of compound words by grouping neighboring tokens, greatly reducing the length of the token sequences. We show that the resulting model can be viewed as a learner over dynamic directed hypergraphs. We employ it to learn to compose expressive Pop piano music of full-song length (involving up to 10K individual tokens per song), both conditionally and unconditionally. Our experiment shows that, compared to state-of-the-art models, the proposed model converges 5 to 10 times faster at training (i.e., within a day on a single GPU with 11 GB memory), with comparable quality in the generated music.
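The idea of grouping neighboring tokens into compound words can be sketched as follows. The abstract does not spell out the grouping rule, so the slot names in `SLOTS` and the rule "start a new word when a token type repeats" are hypothetical simplifications for illustration only.

```python
SLOTS = ("pitch", "duration", "velocity")  # hypothetical note-token types


def to_compound_words(tokens):
    """Group a flat list of (type, value) tokens into fixed-slot compound words.

    A new compound word starts whenever the incoming token's type is already
    filled in the current word (an assumed rule); unfilled slots stay None.
    """
    words, current = [], {}
    for kind, value in tokens:
        if kind not in SLOTS:
            raise ValueError(f"unknown token type: {kind}")
        if kind in current:
            # The current word already has this slot, so close it and start anew.
            words.append(tuple(current.get(s) for s in SLOTS))
            current = {}
        current[kind] = value
    if current:
        words.append(tuple(current.get(s) for s in SLOTS))
    return words
```

In this toy setting, three flat tokens per note collapse into one sequence position, which is the source of the sequence-length reduction the abstract mentions.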

AS · May 18, 2020
Unconditional Audio Generation with Generative Adversarial Networks and Cycle Regularization

Jen-Yu Liu, Yu-Hua Chen, Yin-Cheng Yeh et al.

In a recent paper, we presented a generative adversarial network (GAN)-based model for unconditional generation of the mel-spectrograms of singing voices. As the generator of the model is designed to take a variable-length sequence of noise vectors as input, it can generate mel-spectrograms of variable length. However, our previous listening test shows that the quality of the generated audio leaves room for improvement. The present paper extends that previous work in the following aspects. First, we employ a hierarchical architecture in the generator to induce some structure in the temporal dimension. Second, we introduce a cycle regularization mechanism to the generator to avoid mode collapse. Third, we evaluate the performance of the new model not only for generating singing voices, but also for generating speech. Evaluation results show that the new model outperforms the prior one both objectively and subjectively. We also employ the model to unconditionally generate sequences of piano and violin music and find the results promising. Audio examples, as well as the code for implementing our model, will be publicly available online upon paper publication.

SD · Dec 26, 2019
Score and Lyrics-Free Singing Voice Generation

Jen-Yu Liu, Yu-Hua Chen, Yin-Cheng Yeh et al.

Generative models for singing voice have been mostly concerned with the task of "singing voice synthesis," i.e., to produce singing voice waveforms given musical scores and text lyrics. In this work, we explore a novel yet challenging alternative: singing voice generation without pre-assigned scores and lyrics, at both training and inference time. In particular, we outline three such generation schemes, and propose a pipeline to tackle these new tasks. Moreover, we implement such models using generative adversarial networks and evaluate them both objectively and subjectively.

SD · Jun 4, 2019
Dilated Convolution with Dilated GRU for Music Source Separation

Jen-Yu Liu, Yi-Hsuan Yang

Stacked dilated convolutions used in WaveNet have been shown to be effective for generating high-quality audio. By replacing pooling/striding with dilation in convolution layers, they can preserve high-resolution information and still reach distant locations. Producing high-resolution predictions is also crucial in music source separation, whose goal is to separate different sound sources while maintaining the quality of the separated sounds. Therefore, this paper investigates using stacked dilated convolutions as the backbone for music source separation. However, while stacked dilated convolutions can reach wider context than standard convolutions, their effective receptive fields are still fixed and may not be wide enough for complex music audio signals. To reach information at remote locations, we propose to combine dilated convolution with a modified version of gated recurrent units (GRU) called the "Dilated GRU" to form a block. A Dilated GRU unit receives information from k steps before instead of the previous step for a fixed k. This modification allows a GRU unit to reach a location with fewer recurrent steps and run faster because it can execute partially in parallel. We show that the proposed model with a stack of such blocks performs as well as or better than the state-of-the-art models for separating vocals and accompaniments.
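The recurrence "receive information from k steps before" can be made concrete with a minimal sketch. It uses a scalar hidden state and a standard GRU cell; the abstract describes a modified GRU without giving its equations, so the cell below, the parameter layout in `PARAMS`, and the zero initial state are assumptions for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_cell(x, h, p):
    """Scalar GRU step; p maps gate names to (input weight, recurrent weight, bias)."""
    z = sigmoid(p["z"][0] * x + p["z"][1] * h + p["z"][2])  # update gate
    r = sigmoid(p["r"][0] * x + p["r"][1] * h + p["r"][2])  # reset gate
    n = math.tanh(p["n"][0] * x + p["n"][1] * (r * h) + p["n"][2])  # candidate state
    return (1.0 - z) * n + z * h


def dilated_gru(xs, k, p):
    """Run a GRU whose state at step t is computed from the state at step t - k."""
    hs = []
    for t, x in enumerate(xs):
        h_prev = hs[t - k] if t >= k else 0.0  # assumed zero initial state
        hs.append(gru_cell(x, h_prev, p))
    return hs


# Arbitrary illustrative parameters, not taken from the paper.
PARAMS = {"z": (0.5, -0.3, 0.1), "r": (0.8, 0.2, 0.0), "n": (1.0, 0.7, -0.2)}
```

With dilation k, the sequence splits into k interleaved chains that never exchange state, so each chain could be processed independently; this is one way to read the abstract's remark that the unit can execute partially in parallel.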

SD · Jul 6, 2018
Singing Style Transfer Using Cycle-Consistent Boundary Equilibrium Generative Adversarial Networks

Cheng-Wei Wu, Jen-Yu Liu, Yi-Hsuan Yang et al.

Can we make a famous rap singer like Eminem sing any of our favorite songs? Singing style transfer attempts to make this possible by replacing the vocals of a song by the source singer with those of the target singer. This paper presents a method that learns from unpaired data for singing style transfer using generative adversarial networks.

SD · Jul 5, 2018
Denoising Auto-encoder with Recurrent Skip Connections and Residual Regression for Music Source Separation

Jen-Yu Liu, Yi-Hsuan Yang

Convolutional neural networks with skip connections have shown good performance in music source separation. In this work, we propose a denoising Auto-encoder with Recurrent skip Connections (ARC). We use 1D convolution along the temporal axis of the time-frequency feature map in all layers of the fully-convolutional network. The use of 1D convolution makes it possible to apply recurrent layers to the intermediate outputs of the convolution layers. In addition, we propose an enhancement network and a residual regression method to further improve the separation result. The recurrent skip connections, the enhancement module, and the residual regression all improve the separation quality. The ARC model with residual regression achieves a signal-to-distortion ratio (SDR) of 5.74 in vocals with MUSDB in SiSEC 2018. We also evaluate the ARC model alone on the older dataset DSD100 (used in SiSEC 2016) and it achieves 5.91 SDR in vocals.

MM · May 5, 2018
Weakly-supervised Visual Instrument-playing Action Detection in Videos

Jen-Yu Liu, Yi-Hsuan Yang, Shyh-Kang Jeng

Instrument playing is among the most common scenes in music-related videos, which nowadays represent one of the largest sources of online videos. In order to understand the instrument-playing scenes in the videos, it is important to know what instruments are played, when they are played, and where the playing actions occur in the scene. While audio-based recognition of instruments has been widely studied, the visual aspect of musical instrument playing remains largely unaddressed in the literature. One of the main obstacles is the difficulty in collecting annotated data of the action locations for training-based methods. To address this issue, we propose a weakly-supervised framework to find when and where the instruments are played in the videos. We propose to use two auxiliary models, a sound model and an object model, to provide supervision for training the instrument-playing action model. The sound model provides temporal supervision, while the object model provides spatial supervision. Together, they provide temporal and spatial supervision simultaneously. The resulting model only needs to analyze the visual part of a music video to deduce which instruments are played, and when and where. We found that the proposed method significantly improves the localization accuracy. We evaluate the result of the proposed method temporally and spatially on a small dataset (5,400 frames in total) that we manually annotated.

SD · Apr 5, 2017
Revisiting the problem of audio-based hit song prediction using convolutional neural networks

Li-Chia Yang, Szu-Yu Chou, Jen-Yu Liu et al.

Being able to predict whether a song can be a hit has important applications in the music industry. Although it is true that the popularity of a song can be greatly affected by external factors such as social and commercial influences, to which degree audio features computed from musical signals (which we regard as internal factors) can predict song popularity is an interesting research question on its own. Motivated by the recent success of deep learning techniques, we attempt to extend previous work on hit song prediction by jointly learning the audio features and prediction models using deep learning. Specifically, we experiment with a convolutional neural network model that takes the primitive mel-spectrogram as the input for feature learning, a more advanced JYnet model that uses an external song dataset for supervised pre-training and auto-tagging, and the combination of these two models. We also consider the inception model to characterize audio information in different scales. Our experiments suggest that deep structures are indeed more accurate than shallow structures in predicting the popularity of either Chinese or Western Pop songs in Taiwan. We also use the tags predicted by JYnet to gain insights into the result of different models.

NE · Aug 26, 2016
Applying Topological Persistence in Convolutional Neural Network for Music Audio Signals

Jen-Yu Liu, Shyh-Kang Jeng, Yi-Hsuan Yang

Recent years have witnessed an increased interest in the application of persistent homology, a topological tool for data analysis, to machine learning problems. Persistent homology is known for its ability to numerically characterize the shapes of spaces induced by features or functions. On the other hand, deep neural networks have been shown to be effective in various tasks. To the best of our knowledge, however, existing neural network models seldom exploit shape information. In this paper, we investigate a way to use persistent homology in the framework of deep neural networks. Specifically, we propose to embed the so-called "persistence landscape," a rather new topological summary for data, into a convolutional neural network (CNN) for dealing with audio signals. Our evaluation on automatic music tagging, a multi-label classification task, shows that the resulting persistent convolutional neural network (PCNN) model can perform significantly better than state-of-the-art models in prediction accuracy. We also discuss the intuition behind the design of the proposed model, and offer insights into the features that it learns.
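For readers unfamiliar with the persistence landscape mentioned above, the following minimal sketch evaluates it from a persistence diagram using the standard definition: the k-th landscape value at t is the k-th largest of max(0, min(t - b, d - t)) over all (birth, death) pairs. How the landscape is embedded into the CNN is not described in the abstract, so this sketch covers only the summary itself; the function name and sampling grid are illustrative assumptions.

```python
def persistence_landscape(diagram, k, ts):
    """Evaluate the k-th landscape function (k = 1 is the top layer) at each t in ts.

    diagram: list of (birth, death) pairs from a persistence diagram
    """
    values = []
    for t in ts:
        # Each (birth, death) pair contributes a tent function peaking at its midpoint.
        tents = sorted(
            (max(0.0, min(t - b, d - t)) for b, d in diagram), reverse=True
        )
        values.append(tents[k - 1] if k <= len(tents) else 0.0)
    return values
```

Sampling a few layers on a fixed grid yields a fixed-length vector, which is one reason this summary is convenient as input to a learned model.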