A Contextual Latent Space Model: Subsequence Modulation in Melodic Sequence
This work addresses the need for more interactive and controllable generation in domains such as music and text, although it is incremental, building on existing latent space methods.
The paper tackles limited control in subsequence editing for generative sequence models by proposing a contextual latent space model (CLSM) that supports directed exploration such as interpolation and variation. On a monophonic symbolic music dataset, CLSM yields smoother interpolation and higher sample quality than baselines.
Some generative models for sequences such as music and text allow us to edit only subsequences, given the surrounding context sequences; this plays an important part in steering generation interactively. However, editing a subsequence mainly amounts to randomly resampling it from the space of possible generations. We propose a contextual latent space model (CLSM) so that users can explore subsequence generation with a sense of direction in the generation space, e.g., through interpolation, and can also explore variations, that is, semantically similar possible subsequences. A context-informed prior and decoder constitute the generative model of CLSM, and a context position-informed encoder is the inference model. In experiments on a monophonic symbolic music dataset, we demonstrate that our contextual latent space is smoother in interpolation than those of the baselines and that the quality of generated samples is superior to that of the baseline models. The generation examples are available online.
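To make "directed exploration" concrete, here is a minimal sketch of the two operations the abstract names: interpolating between two latent codes and sampling variations near one code, then decoding each code together with a fixed context. This is not the CLSM implementation. The abstract does not specify the interpolation scheme or latent size, so linear interpolation and isotropic Gaussian perturbation are assumptions, and `toy_decoder` is a hypothetical stand-in for the learned context-informed decoder.

```python
import random
from typing import List, Sequence

Vector = List[float]


def lerp(z_a: Sequence[float], z_b: Sequence[float], t: float) -> Vector:
    """Linear interpolation between two latent codes (t=0 -> z_a, t=1 -> z_b)."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def interpolate(z_a: Sequence[float], z_b: Sequence[float], steps: int) -> List[Vector]:
    """Return `steps` evenly spaced latent codes from z_a to z_b, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be >= 2")
    return [lerp(z_a, z_b, i / (steps - 1)) for i in range(steps)]


def variations(z: Sequence[float], n: int, scale: float, rng: random.Random) -> List[Vector]:
    """Sample n nearby latent codes by adding isotropic Gaussian noise (an assumption)."""
    return [[zi + rng.gauss(0.0, scale) for zi in z] for _ in range(n)]


def toy_decoder(z: Sequence[float], context: Sequence[int]) -> List[int]:
    """HYPOTHETICAL stand-in for a learned context-informed decoder.

    Each latent dimension becomes a pitch offset from the last context note,
    so the same z yields different subsequences under different contexts.
    """
    anchor = context[-1]
    return [anchor + round(4 * zi) for zi in z]


if __name__ == "__main__":
    context = [60, 62, 64]  # hypothetical preceding notes as MIDI pitches
    z_a = [0.0, 0.5, -0.5, 1.0]
    z_b = [1.0, -0.5, 0.5, 0.0]

    # Directed exploration: walk from z_a to z_b under the same context.
    for z in interpolate(z_a, z_b, steps=3):
        print(toy_decoder(z, context))

    # Variations: nearby codes should decode to similar subsequences.
    rng = random.Random(0)
    for z in variations(z_a, n=3, scale=0.1, rng=rng):
        print(toy_decoder(z, context))
```

With this toy decoder, the interpolation path decodes to `[64, 66, 62, 68]`, `[66, 64, 64, 66]`, and `[68, 62, 66, 64]`. In a trained model, the smoothness claimed in the abstract would correspond to neighboring codes along such a path decoding to musically similar subsequences.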