AS Sep 4, 2019
Exploiting Parallel Audio Recordings to Enforce Device Invariance in CNN-based Acoustic Scene Classification
Paul Primus, Hamid Eghbal-zadeh, David Eitelsebner et al.
Distribution mismatches between the data seen at training and at application time remain a major challenge in all application areas of machine learning. We study this problem in the context of machine listening (Task 1b of the DCASE 2019 Challenge). We propose a novel approach to learn domain-invariant classifiers in an end-to-end fashion by enforcing equal hidden layer representations for domain-parallel samples, i.e. time-aligned recordings from different recording devices. No classification labels are needed for our domain adaptation (DA) method, which makes the data collection process cheaper.
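The idea of enforcing equal hidden representations for time-aligned recordings can be sketched as a loss term. This is a minimal illustration only: the one-layer `hidden` function stands in for a CNN layer, mean squared error is an assumed distance (the abstract does not name one), and all function names are hypothetical.

```python
def mse(a, b):
    """Mean squared difference between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def hidden(x, weights):
    """Toy hidden layer: ReLU of a linear map (hypothetical stand-in for a CNN layer)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def parallel_invariance_loss(batch_a, batch_b, weights):
    """Average hidden-layer mismatch over time-aligned pairs from two devices.

    batch_a[i] and batch_b[i] are assumed to be the same moment in time,
    recorded by device A and device B. No class labels are involved.
    """
    pairs = list(zip(batch_a, batch_b))
    return sum(mse(hidden(a, weights), hidden(b, weights)) for a, b in pairs) / len(pairs)


weights = [[1.0, 0.0], [0.0, 1.0]]
device_a = [[1.0, 2.0], [0.5, 0.5]]
device_b = [[1.0, 2.0], [0.5, 1.5]]  # second recording differs between devices

loss_same = parallel_invariance_loss(device_a, device_a, weights)
loss_diff = parallel_invariance_loss(device_a, device_b, weights)
```

Minimizing such a term pushes the network toward device-independent hidden activations; because it compares only paired recordings, it needs no classification labels, in line with the abstract.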
SD Jul 13, 2019
Learning Complex Basis Functions for Invariant Representations of Audio
Stefan Lattner, Monika Dörfler, Andreas Arzt
Learning features from data has been shown to be more successful than using hand-crafted features for many machine learning tasks. In music information retrieval (MIR), features learned from windowed spectrograms are highly variant to transformations like transposition or time-shift. Such variances are undesirable when they are irrelevant for the respective MIR task. We propose an architecture called Complex Autoencoder (CAE) which learns features invariant to orthogonal transformations. Mapping signals onto complex basis functions learned by the CAE results in a transformation-invariant "magnitude space" and a transformation-variant "phase space". The phase space is useful to infer transformations between data pairs. When exploiting the invariance property of the magnitude space, we achieve state-of-the-art results in audio-to-score alignment and repeated section discovery for audio. A PyTorch implementation of the CAE, including the repeated section discovery method, is available online.
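The split into an invariant magnitude and a variant phase can be illustrated with a familiar fixed complex basis. The sketch below uses the discrete Fourier transform and a circular time-shift (a permutation, hence an orthogonal transformation). It is an analogy only: the CAE learns its basis from data, and nothing here reproduces the authors' implementation.

```python
import cmath


def dft(signal):
    """Project a real signal onto the complex Fourier basis."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def circular_shift(signal, s):
    """Rotate the signal by s samples (an orthogonal transformation)."""
    s %= len(signal)
    return signal[-s:] + signal[:-s]


signal = [0.0, 1.0, 3.0, 2.0, -1.0, 0.5, 0.0, 1.5]
shifted = circular_shift(signal, 3)

coeffs = dft(signal)
coeffs_shifted = dft(shifted)

# Magnitudes agree up to rounding; the shift shows up only in the phases.
magnitudes_match = all(
    abs(abs(a) - abs(b)) < 1e-9 for a, b in zip(coeffs, coeffs_shifted)
)
phases_differ = any(abs(a - b) > 1e-6 for a, b in zip(coeffs, coeffs_shifted))
```

Comparing the phases of `coeffs` and `coeffs_shifted` would recover the shift amount, which mirrors the abstract's remark that the phase space is useful for inferring transformations between data pairs.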
IR Jun 26, 2019
Learning Soft-Attention Models for Tempo-invariant Audio-Sheet Music Retrieval
Stefan Balke, Matthias Dorfer, Luis Carvalho et al.
Connecting large libraries of digitized audio recordings to their corresponding sheet music images has long been a motivation for researchers to develop new cross-modal retrieval systems. In recent years, retrieval systems based on embedding space learning with deep neural networks have come a step closer to fulfilling this vision. However, global and local tempo deviations in the music recordings still require careful tuning of the amount of temporal context given to the system. In this paper, we address this problem by introducing an additional soft-attention mechanism on the audio input. Quantitative and qualitative results on synthesized piano data indicate that this attention increases the robustness of the retrieval system by focusing on different parts of the input representation based on the tempo of the audio. Encouraged by these results, we argue for the potential of attention models as a very general tool for many MIR tasks.
IR Feb 12, 2019
Cross-Modal Music Retrieval and Applications: An Overview of Key Methodologies
Meinard Müller, Andreas Arzt, Stefan Balke et al.
There has been a rapid growth of digitally available music data, including audio recordings, digitized images of sheet music, album covers and liner notes, and video clips. This huge amount of data calls for retrieval strategies that allow users to explore large music collections in a convenient way. More precisely, there is a need for cross-modal retrieval algorithms that, given a query in one modality (e.g., a short audio excerpt), find corresponding information and entities in other modalities (e.g., the name of the piece and the sheet music). This goes beyond exact audio identification and subsequent retrieval of metainformation as performed by commercial applications like Shazam [1].
SD Jul 19, 2018
Audio-to-Score Alignment using Transposition-invariant Features
Andreas Arzt, Stefan Lattner
Audio-to-score alignment is an important pre-processing step for in-depth analysis of classical music. In this paper, we apply novel transposition-invariant audio features to this task. These low-dimensional features represent local pitch intervals and are learned in an unsupervised fashion by a gated autoencoder. Our results show that the proposed features are indeed fully transposition-invariant and enable accurate alignments between transposed scores and performances. Furthermore, they can even outperform widely used features for audio-to-score alignment on "untransposed data", and thus are a viable and more flexible alternative to well-established features for music alignment and matching.
SD Nov 7, 2017
The ACCompanion v0.1: An Expressive Accompaniment System
Carlos Cancino-Chacón, Martin Bonev, Amaury Durand et al.
In this paper we present a preliminary version of the ACCompanion, an expressive accompaniment system for MIDI input. The system uses a probabilistic monophonic score follower to track the position of the soloist in the score, and a linear Gaussian model to compute tempo updates. The expressiveness of the system is powered by the Basis-Mixer, a state-of-the-art computational model of expressive music performance. The system allows for expressive dynamics, timing and articulation.
MM Aug 7, 2017
Aktuelle Entwicklungen in der Automatischen Musikverfolgung (Current Developments in Automatic Music Tracking)
Andreas Arzt, Matthias Dorfer
In this paper we present current trends in real-time music tracking (a.k.a. score following). Casually speaking, these algorithms "listen" to a live performance of music, compare the audio signal to an abstract representation of the score, and "read" along in the sheet music. In this way, the exact position of the musician(s) in the sheet music is computed at any given time. Here, we focus on the aspects of flexibility and usability of these algorithms. This comprises work on automatic identification and flexible tracking of the piece being played as well as current approaches based on Deep Learning. The latter enables direct learning of correspondences between complex audio data and images of the sheet music, avoiding the complicated and time-consuming definition of a mid-level representation. ----- This work deals with current developments in automatic music tracking by computer. The algorithms in question "listen" to a musical performance, compare the recorded audio signal with an (abstract) representation of the score and, so to speak, read along in it. The algorithm therefore knows the position of the musicians in the score at every point in time. Besides giving a general overview, this work focuses on the flexibility and easier usability of these algorithms. It describes which steps have been taken (and are currently being taken) to make the process of automatic music tracking more accessible. This includes work on the automatic identification of the pieces being played and their flexible tracking, as well as current approaches based on Deep Learning that make it possible to connect image and sound directly, without detours via abstract intermediate representations that take a great deal of time to create.
IR Aug 2, 2017
Piece Identification in Classical Piano Music Without Reference Scores
Andreas Arzt, Gerhard Widmer
In this paper we describe an approach to identify the name of a piece of piano music, based on a short audio excerpt of a performance. Given only a description of the pieces in text format (i.e. no score information is provided), a reference database is automatically compiled by acquiring a number of audio representations (performances of the pieces) from internet sources. These are transcribed, preprocessed, and used to build a reference database via a robust symbolic fingerprinting algorithm, which in turn is used to identify new, incoming queries. The main challenge is the amount of noise that is introduced into the identification process by the music transcription algorithm and the automatic (but possibly suboptimal) choice of performances to represent a piece in the reference database. In a number of experiments we show how to improve the identification performance by increasing redundancy in the reference database and by using a preprocessing step to rate the reference performances regarding their suitability as a representation of the pieces in question. As the results show, this approach leads to a robust system that is able to identify piano music with high accuracy -- without any need for data annotation or manual data preparation.
IR Jul 31, 2017
Learning Audio - Sheet Music Correspondences for Score Identification and Offline Alignment
Matthias Dorfer, Andreas Arzt, Gerhard Widmer
This work addresses the problem of matching short excerpts of audio with their respective counterparts in sheet music images. We show how to employ neural network-based cross-modality embedding spaces for solving the following two sheet music-related tasks: retrieving the correct piece of sheet music from a database when given a music audio as a search query; and aligning an audio recording of a piece with the corresponding images of sheet music. We demonstrate the feasibility of this in experiments on classical piano music by five different composers (Bach, Haydn, Mozart, Beethoven and Chopin), and additionally provide a discussion on why we expect multi-modal neural networks to be a fruitful paradigm for dealing with sheet music and audio at the same time.
IR Jul 14, 2017
Modeling Harmony with Skip-Grams
David R. W. Sears, Andreas Arzt, Harald Frostel et al.
String-based (or viewpoint) models of tonal harmony often struggle with data sparsity in pattern discovery and prediction tasks, particularly when modeling composite events like triads and seventh chords, since the number of distinct n-note combinations in polyphonic textures is potentially enormous. To address this problem, this study examines the efficacy of skip-grams in music research, an alternative viewpoint method developed in corpus linguistics and natural language processing that includes sub-sequences of n events (or n-grams) in a frequency distribution if their constituent members occur within a certain number of skips. Using a corpus consisting of four datasets of Western classical music in symbolic form, we found that including skip-grams reduces data sparsity in n-gram distributions by (1) minimizing the proportion of n-grams with negligible counts, and (2) increasing the coverage of contiguous n-grams in a test corpus. What is more, skip-grams significantly outperformed contiguous n-grams in discovering conventional closing progressions (called cadences).
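A small sketch of skip-gram extraction may make the definition concrete. It assumes the common reading in which an n-gram is admitted when the total number of skipped events between its first and last member does not exceed `max_skip`; the abstract does not fix this detail, and the function name and toy chord sequence are hypothetical.

```python
from collections import Counter
from itertools import combinations


def skip_grams(events, n, max_skip):
    """Return all n-grams whose members span at most max_skip skipped events.

    With max_skip=0 this reduces to ordinary contiguous n-grams.
    """
    grams = []
    for start in range(len(events) - n + 1):
        # The remaining n-1 members may come from a window of n + max_skip events.
        window_end = min(start + n + max_skip, len(events))
        for rest in combinations(range(start + 1, window_end), n - 1):
            grams.append(tuple(events[i] for i in (start,) + rest))
    return grams


chords = ["I", "IV", "V", "I"]
contiguous = skip_grams(chords, n=2, max_skip=0)
with_skips = skip_grams(chords, n=2, max_skip=1)
counts = Counter(with_skips)
```

On this toy sequence the contiguous bigrams are (I, IV), (IV, V) and (V, I); allowing one skip adds (I, V) and (IV, I), which shows how skip-grams populate a frequency distribution with patterns that contiguous n-grams miss.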
SD Dec 15, 2016
On the Potential of Simple Framewise Approaches to Piano Transcription
Rainer Kelz, Matthias Dorfer, Filip Korzeniowski et al.
In an attempt to explore the limitations of simple approaches to the task of piano transcription (as usually defined in MIR), we conduct an in-depth analysis of neural network-based framewise transcription. We systematically compare different popular input representations for transcription systems to determine the ones most suitable for use with neural networks. Exploiting recent advances in training techniques and new regularizers, and taking into account hyper-parameter tuning, we show that it is possible, by simple bottom-up frame-wise processing, to obtain a piano transcriber that outperforms the current published state of the art on the publicly available MAPS dataset -- without any complex post-processing steps. Thus, we propose this simple approach as a new baseline for this dataset, for future transcription research to build on and improve.
SD Dec 15, 2016
Live Score Following on Sheet Music Images
Matthias Dorfer, Andreas Arzt, Sebastian Böck et al.
In this demo we show a novel approach to score following. Instead of relying on some symbolic representation, we use a multi-modal convolutional neural network to match the incoming audio stream directly to sheet music images. This approach is at an early stage and should be seen as a proof of concept. Nonetheless, the audience will have the opportunity to test our implementation themselves on three simple piano pieces.
SD Dec 15, 2016
Towards End-to-End Audio-Sheet-Music Retrieval
Matthias Dorfer, Andreas Arzt, Gerhard Widmer
This paper demonstrates the feasibility of learning to retrieve short snippets of sheet music (images) when given a short query excerpt of music (audio), and vice versa, without any symbolic representation of music or scores. This would be highly useful in many content-based musical retrieval scenarios. Our approach is based on Deep Canonical Correlation Analysis (DCCA) and learns correlated latent spaces allowing for cross-modality retrieval in both directions. Initial experiments with relatively simple monophonic music show promising results.
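Once both modalities are mapped into a shared latent space, retrieval amounts to a nearest-neighbour search. The sketch below assumes cosine similarity as the ranking measure and uses hand-made vectors in place of network outputs; the embeddings, identifiers and function names are illustrative and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query, candidates):
    """Rank candidate ids by decreasing cosine similarity to the query embedding."""
    return sorted(
        candidates,
        key=lambda name: cosine(query, candidates[name]),
        reverse=True,
    )


# Hypothetical latent vectors for three sheet-music snippets.
sheet_embeddings = {
    "snippet_a": [1.0, 0.0, 0.2],
    "snippet_b": [0.0, 1.0, 0.1],
    "snippet_c": [-1.0, 0.3, 0.0],
}
# Hypothetical latent vector for an audio query excerpt.
audio_query = [0.9, 0.1, 0.3]

ranking = retrieve(audio_query, sheet_embeddings)
```

Because the space is shared, the same `retrieve` call works in the other direction: a sheet-music embedding can be used as the query against a dictionary of audio embeddings.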
LG Dec 15, 2016
Towards Score Following in Sheet Music Images
Matthias Dorfer, Andreas Arzt, Gerhard Widmer
This paper addresses the matching of short music audio snippets to the corresponding pixel location in images of sheet music. A system is presented that simultaneously learns to read notes, listen to music, and match the currently played music to its corresponding notes in the sheet. It consists of an end-to-end multi-modal convolutional neural network that takes as input images of sheet music and spectrograms of the respective audio snippets. It learns to predict, for a given unseen audio snippet (covering approximately one bar of music), the corresponding position in the respective score line. Our results suggest that with the use of (deep) neural networks -- which have proven to be powerful image processing models -- working with sheet music becomes feasible and a promising future research direction.