David N. Levin

SD, May 8, 2019

On the representation of speech and music

David N. Levin

In most automatic speech recognition (ASR) systems, the audio signal is processed to produce a time series of sensor measurements (e.g., filterbank outputs). This time series encodes semantic information in a speaker-dependent way. An earlier paper showed how to use the sequence of sensor measurements to derive an "inner" time series that is unaffected by any previous invertible transformation of the sensor measurements. The current paper considers two or more speakers, who mimic one another in the following sense: when they say the same words, they produce sensor states that are invertibly mapped onto one another. It follows that the inner time series of their utterances must be the same when they say the same words. In other words, the inner time series encodes their speech in a manner that is speaker-independent. Consequently, the ASR training process can be simplified by collecting and labelling the inner time series of the utterances of just one speaker, instead of training on the sensor time series of the utterances of a large variety of speakers. A similar argument suggests that the inner time series of music is instrument-independent. This is demonstrated in experiments on monophonic electronic music.

ME, Mar 24, 2017

The Inner Structure of Time-Dependent Signals

David N. Levin

This paper shows how a time series of measurements of an evolving system can be processed to create an inner time series that is unaffected by any instantaneous, invertible, possibly nonlinear transformation of the measurements. An inner time series contains information that does not depend on the nature of the sensors that the observer chose for monitoring the system. Instead, it encodes information that is intrinsic to the evolution of the observed system. Because of its sensor-independence, an inner time series may produce fewer false negatives when it is used to detect events in the presence of sensor drift. Furthermore, if the observed physical system is composed of non-interacting subsystems, its inner time series is separable; i.e., it consists of a collection of time series, each one being the inner time series of an isolated subsystem. Because of this property, an inner time series can be used to detect a specific behavior of one of the independent subsystems without using blind source separation to disentangle that subsystem from the others. The method is illustrated by applying it to: 1) an analytic example; 2) the audio waveform of one speaker; 3) video images from a moving camera; 4) mixtures of audio waveforms of two speakers.

2 Papers