AS | Dec 8, 2022
Framewise WaveGAN: High Speed Adversarial Vocoder in Time Domain with Very Low Computational Complexity
Ahmed Mustafa, Jean-Marc Valin, Jan Büthe et al.
GAN vocoders are currently one of the state-of-the-art methods for building high-quality neural waveform generative models. However, most of their architectures require tens of billions of floating-point operations per second (GFLOPS) to generate speech waveforms in a samplewise manner. This makes GAN vocoders challenging to run on normal CPUs without accelerators or parallel computers. In this work, we propose a new architecture for GAN vocoders that mainly depends on recurrent and fully-connected networks to directly generate the time domain signal in a framewise manner. This results in a considerable reduction of the computational cost and enables very fast generation on both GPUs and low-complexity CPUs. Experimental results show that our Framewise WaveGAN vocoder achieves significantly higher quality than auto-regressive maximum-likelihood vocoders such as LPCNet at a very low complexity of 1.2 GFLOPS. This makes GAN vocoders more practical on edge and low-power devices.
AS | Mar 28, 2022
Improved singing voice separation with chromagram-based pitch-aware remixing
Siyuan Yuan, Zhepei Wang, Umut Isik et al.
Singing voice separation aims to separate music into vocals and accompaniment components. One of the major constraints for the task is the limited amount of training data with separated vocals. Data augmentation techniques such as random source mixing have been shown to make better use of existing data and mildly improve model performance. We propose a novel data augmentation technique, chromagram-based pitch-aware remixing, where music segments with high pitch alignment are mixed. By performing controlled experiments in both supervised and semi-supervised settings, we demonstrate that training models with pitch-aware remixing significantly improves the test signal-to-distortion ratio (SDR).
AS | Feb 23, 2022 | Code
End-to-end LPCNet: A Neural Vocoder With Fully-Differentiable LPC Estimation
Krishna Subramani, Jean-Marc Valin, Umut Isik et al.
Neural vocoders have recently demonstrated high quality speech synthesis, but typically require a high computational complexity. LPCNet was proposed as a way to reduce the complexity of neural synthesis by using linear prediction (LP) to assist an autoregressive model. At inference time, LPCNet relies on the LP coefficients being explicitly computed from the input acoustic features. That makes the design of LPCNet-based systems more complicated, while adding the constraint that the input features must represent a clean speech spectrum. We propose an end-to-end version of LPCNet that lifts these limitations by learning to infer the LP coefficients from the input features in the frame rate network. Results show that the proposed end-to-end approach equals or exceeds the quality of the original LPCNet model, but without explicit LP analysis. Our open-source end-to-end model still benefits from LPCNet's low complexity, while allowing for any type of conditioning features.
AS | Feb 22, 2022 | Code
Neural Speech Synthesis on a Shoestring: Improving the Efficiency of LPCNet
Jean-Marc Valin, Umut Isik, Paris Smaragdis et al.
Neural speech synthesis models can synthesize high quality speech but typically require a high computational complexity to do so. In previous work, we introduced LPCNet, which uses linear prediction to significantly reduce the complexity of neural synthesis. In this work, we further improve the efficiency of LPCNet -- targeting both algorithmic and computational improvements -- to make it usable on a wide variety of devices. We demonstrate an improvement in synthesis quality while operating 2.5x faster. The resulting open-source LPCNet algorithm can perform real-time neural synthesis on most existing phones and is even usable in some embedded devices.
SD | Feb 28, 2016 | Code
Speex: A Free Codec For Free Speech
Jean-Marc Valin
The Speex project was started in 2002 to address the need for a free, open-source speech codec. Speex is based on the Code Excited Linear Prediction (CELP) algorithm and, unlike the previously existing Vorbis codec, is optimised for transmitting speech for low latency communication over an unreliable packet network. This paper presents an overview of Speex, the technology involved in it and how it can be used in applications. The most recent developments in Speex, such as the fixed-point port, acoustic echo cancellation and noise suppression, are also addressed.
AS | Jun 15, 2021
Multi-channel Opus compression for far-field automatic speech recognition with a fixed bitrate budget
Lukas Drude, Jahn Heymann, Andreas Schwarz et al.
Automatic speech recognition (ASR) in the cloud allows the use of larger models and more powerful multi-channel signal processing front-ends compared to on-device processing. However, it also adds an inherent latency due to the transmission of the audio signal, especially when transmitting multiple channels of a microphone array. One way to reduce the network bandwidth requirements is client-side compression with a lossy codec such as Opus. However, this compression can have a detrimental effect especially on multi-channel ASR front-ends, due to the distortion and loss of spatial information introduced by the codec. In this publication, we propose an improved approach for the compression of microphone array signals based on Opus, using a modified joint channel coding approach and additionally introducing a multi-channel spatial decorrelating transform to reduce redundancy in the transmission. We illustrate the effect of the proposed approach on the spatial information retained in multi-channel signals after compression, and evaluate the performance on far-field ASR with a multi-channel beamforming front-end. We demonstrate that our approach can lead to a 37.5 % bitrate reduction or a 5.1 % relative word error rate reduction for a fixed bitrate budget in a seven channel setup.
AS | Feb 12, 2021
Enhancing into the codec: Noise Robust Speech Coding with Vector-Quantized Autoencoders
Jonah Casebeer, Vinjai Vale, Umut Isik et al.
Audio codecs based on discretized neural autoencoders have recently been developed and shown to provide significantly higher compression levels for comparable quality speech output. However, these models are tightly coupled with speech content, and produce unintended outputs in noisy conditions. Based on VQ-VAE autoencoders with WaveRNN decoders, we develop compressor-enhancer encoders and accompanying decoders, and show that they operate well in noisy conditions. We also observe that a compressor-enhancer model performs better on clean speech inputs than a compressor model trained only on clean speech.
AS | Aug 11, 2020
PoCoNet: Better Speech Enhancement with Frequency-Positional Embeddings, Semi-Supervised Conversational Data, and Biased Loss
Umut Isik, Ritwik Giri, Neerad Phansalkar et al.
Neural network applications generally benefit from larger-sized models, but for current speech enhancement models, larger scale networks often suffer from decreased robustness to the variety of real-world use cases beyond what is encountered in training data. We introduce several innovations that lead to better large neural networks for speech enhancement. The novel PoCoNet architecture is a convolutional neural network that, with the use of frequency-positional embeddings, is able to more efficiently build frequency-dependent features in the early layers. A semi-supervised method helps increase the amount of conversational training data by pre-enhancing noisy datasets, improving performance on real recordings. A new loss function biased towards preserving speech quality helps the optimization better match human perceptual opinions on speech quality. Ablation experiments and objective and human opinion metrics show the benefits of the proposed improvements.
AS | May 12, 2019
Improving Opus Low Bit Rate Quality with Neural Speech Synthesis
Jan Skoglund, Jean-Marc Valin
The voice mode of the Opus audio coder can compress wideband speech at bit rates ranging from 6 kb/s to 40 kb/s. However, Opus is at its core a waveform matching coder, and as the rate drops below 10 kb/s, quality degrades quickly. As the rate reduces even further, parametric coders tend to perform better than waveform coders. In this paper we propose a backward-compatible way of improving low bit rate Opus quality by re-synthesizing speech from the decoded parameters. We compare two different neural generative models, WaveNet and LPCNet. WaveNet is a powerful, high-complexity, and high-latency architecture that is not feasible for a practical system, yet provides the best known achievable quality with generative models. LPCNet is a low-complexity, low-latency RNN-based generative model that is practically implementable on mobile phones. We apply these systems with parameters from Opus coded at 6 kb/s as conditioning features for the generative models. A listening test shows that for the same 6 kb/s Opus bit stream, synthesized speech using LPCNet clearly outperforms the output of the standard Opus decoder. This opens up ways to improve the decoding quality of existing speech and audio waveform coders without breaking compatibility.
AS | Mar 28, 2019
A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet
Jean-Marc Valin, Jan Skoglund
Neural speech synthesis algorithms are a promising new approach for coding speech at very low bitrate. They have so far demonstrated quality that far exceeds traditional vocoders, at the cost of very high complexity. In this work, we present a low-bitrate neural vocoder based on the LPCNet model. The use of linear prediction and sparse recurrent networks makes it possible to achieve real-time operation on general-purpose hardware. We demonstrate that LPCNet operating at 1.6 kb/s achieves significantly higher quality than MELP and that uncompressed LPCNet can exceed the quality of a waveform codec operating at low bitrate. This opens the way for new codec designs based on neural synthesis models.
AS | Oct 28, 2018
LPCNet: Improving Neural Speech Synthesis Through Linear Prediction
Jean-Marc Valin, Jan Skoglund
Neural speech synthesis models have recently demonstrated the ability to synthesize high quality speech for text-to-speech and compression applications. These new models often require powerful GPUs to achieve real-time operation, so being able to reduce their complexity would open the way for many new applications. We propose LPCNet, a WaveRNN variant that combines linear prediction with recurrent neural networks to significantly improve the efficiency of speech synthesis. We demonstrate that LPCNet can achieve significantly higher quality than WaveRNN for the same network size and that high quality LPCNet speech synthesis is achievable with a complexity under 3 GFLOPS. This makes it easier to deploy neural synthesis applications on lower-power devices, such as embedded systems and mobile phones.
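The linear prediction step that LPCNet builds on can be illustrated with the classical autocorrelation method. The sketch below is not the LPCNet implementation: it only shows how predictor coefficients can be derived with the Levinson-Durbin recursion and how the prediction residual (the part a neural model would still have to generate) is obtained. The function names and the toy signal are assumptions made for illustration.

```python
def autocorrelation(x, order):
    """Return autocorrelation lags r[0..order] of the sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(order + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for predictor coefficients a_1..a_order,
    where x[n] is approximated by sum_k a_k * x[n - k].
    Returns (coefficients, final prediction error energy)."""
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:               # signal is already perfectly predicted
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err


def prediction_residual(x, a):
    """Return e[n] = x[n] - sum_k a[k] * x[n - 1 - k], with zero history."""
    residual = []
    for n in range(len(x)):
        pred = sum(a[k] * x[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        residual.append(x[n] - pred)
    return residual


# Toy example (assumption): a decaying exponential is a first-order process.
signal = [0.9 ** n for n in range(200)]
coeffs, err = levinson_durbin(autocorrelation(signal, 1), 1)
excitation = prediction_residual(signal, coeffs)
```

For this toy signal the single coefficient comes out very close to 0.9 and the residual is essentially zero after the first sample, which is the sense in which linear prediction removes the easily modelled part of the waveform.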
SD | Sep 24, 2017
A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement
Jean-Marc Valin
Despite noise suppression being a mature area in signal processing, it remains highly dependent on fine tuning of estimator algorithms and parameters. In this paper, we demonstrate a hybrid DSP/deep learning approach to noise suppression. A deep neural network with four hidden layers is used to estimate ideal critical band gains, while a more traditional pitch filter attenuates noise between pitch harmonics. The approach achieves significantly higher quality than a traditional minimum mean squared error spectral estimator, while keeping the complexity low enough for real-time operation at 48 kHz on a low-power processor.
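To make the notion of per-band gains concrete, the sketch below computes a gain for each band from known clean and noisy magnitude spectra and applies it to the noisy spectrum. It rests on several assumptions that are not stated in the abstract: the ideal gain is taken to be the square root of the clean-to-noisy energy ratio in each band, the bands are rectangular rather than perceptually shaped, and the gains are computed from oracle data instead of being estimated by a network. The pitch filter is not shown.

```python
import math


def band_energies(magnitudes, band_edges):
    """Sum squared magnitudes over each band [band_edges[b], band_edges[b + 1])."""
    return [sum(m * m for m in magnitudes[lo:hi])
            for lo, hi in zip(band_edges[:-1], band_edges[1:])]


def ideal_band_gains(clean_mag, noisy_mag, band_edges, floor=1e-12):
    """Per-band gain sqrt(E_clean / E_noisy), clipped to at most 1 (assumed definition)."""
    clean_e = band_energies(clean_mag, band_edges)
    noisy_e = band_energies(noisy_mag, band_edges)
    return [min(1.0, math.sqrt(c / max(n, floor)))
            for c, n in zip(clean_e, noisy_e)]


def apply_band_gains(noisy_mag, gains, band_edges):
    """Scale every frequency bin by the gain of the band it belongs to."""
    out = list(noisy_mag)
    for g, lo, hi in zip(gains, band_edges[:-1], band_edges[1:]):
        for k in range(lo, hi):
            out[k] = g * noisy_mag[k]
    return out


# Toy spectra (assumption): two bands of two bins each.
edges = [0, 2, 4]
clean = [1.0, 1.0, 0.0, 0.0]
noisy = [2.0, 2.0, 1.0, 1.0]
gains = ideal_band_gains(clean, noisy, edges)
enhanced = apply_band_gains(noisy, gains, edges)
```

Working with a handful of band gains rather than one value per frequency bin keeps the number of quantities to estimate small, which is one plausible reason such an approach can stay cheap enough for real-time use.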
MM | Oct 8, 2016
Perceptually-Driven Video Coding with the Daala Video Codec
Yushin Cho, Thomas J. Daede, Nathan E. Egge et al.
The Daala project is a royalty-free video codec that attempts to compete with the best patent-encumbered codecs. Part of our strategy is to replace core tools of traditional video codecs with alternative approaches, many of them designed to take perceptual aspects into account, rather than optimizing for simple metrics like PSNR. This paper documents some of our experiences with these tools, which ones worked and which did not. We evaluate which tools are easy to integrate into a more traditional codec design, and show results in the context of the codec being developed by the Alliance for Open Media.
MM | Aug 5, 2016
Daala: Building A Next-Generation Video Codec From Unconventional Technology
Jean-Marc Valin, Timothy B. Terriberry, Nathan E. Egge et al.
Daala is a new royalty-free video codec that attempts to compete with state-of-the-art royalty-bearing codecs. To do so, it must achieve good compression while avoiding all of their patented techniques. We use technology that is as different as possible from traditional approaches to achieve this. This paper describes the technology behind Daala and discusses where it fits in the newly created AV1 codec from the Alliance for Open Media. We show that Daala is approaching the performance level of more mature, state-of-the-art video codecs and can contribute to improving AV1.
MM | May 16, 2016
Daala: A Perceptually-Driven Still Picture Codec
Jean-Marc Valin, Nathan E. Egge, Thomas Daede et al.
Daala is a new royalty-free video codec based on perceptually-driven coding techniques. We explore using its keyframe format for still picture coding and show how it has improved over the past year. We believe the technology used in Daala could be the basis of an excellent, royalty-free image format.
MM | Mar 10, 2016
Predicting Chroma from Luma with Frequency Domain Intra Prediction
Nathan E. Egge, Jean-Marc Valin
This paper describes a technique for performing intra prediction of the chroma planes based on the reconstructed luma plane in the frequency domain. This prediction exploits the fact that while RGB to YUV color conversion has the property that it decorrelates the color planes globally across an image, there is still some correlation locally at the block level. Previous proposals compute a linear model of the spatial relationship between the luma plane (Y) and the two chroma planes (U and V). In codecs that use lapped transforms this is not possible since transform support extends across the block boundaries and thus neighboring blocks are unavailable during intra-prediction. We design a frequency domain intra predictor for chroma that exploits the same local correlation with lower complexity than the spatial predictor and which works with lapped transforms. We then describe a low-complexity algorithm that directly uses luma coefficients as a chroma predictor based on gain-shape quantization and band partitioning. An experiment is performed that compares these two techniques inside the experimental Daala video codec and shows the lower complexity algorithm to be a better chroma predictor.
SD | Mar 10, 2016
Channel Decorrelation For Stereo Acoustic Echo Cancellation In High-Quality Audio Communication
Jean-Marc Valin
In this paper, we address an important problem in high-quality audio communication systems. Acoustic echo cancellation with stereo signals is generally an under-determined problem because of the strong correlation that usually exists between the left and right channels. We present a novel method of significantly reducing that correlation without affecting the audio quality. This method is perceptually motivated and combines a shaped comb-allpass (SCAL) filter with the injection of psychoacoustically masked noise. We show that the proposed method performs significantly better than other known methods for channel decorrelation.
SD | Mar 10, 2016
Microphone array post-filter for separation of simultaneous non-stationary sources
Jean-Marc Valin, Jean Rouat, François Michaud
Microphone array post-filters have demonstrated their ability to greatly reduce noise at the output of a beamformer. However, current techniques only consider a single source of interest, most of the time assuming stationary background noise. We propose a microphone array post-filter that enhances the signals produced by the separation of simultaneous sources using common source separation algorithms. Our method is based on a loudness-domain optimal spectral estimator and on the assumption that the noise can be described as the sum of a stationary component and of a transient component that is due to leakage between the channels of the initial source separation algorithm. The system is evaluated in the context of mobile robotics and is shown to produce better results than current post-filtering techniques, greatly reducing interference while causing little distortion to the signal of interest, even at very low SNR.
MM | Mar 10, 2016
Daala: A Perceptually-Driven Next Generation Video Codec
Thomas J. Daede, Nathan E. Egge, Jean-Marc Valin et al.
The Daala project is a royalty-free video codec that attempts to compete with the best patent-encumbered codecs. Part of our strategy is to replace core tools of traditional video codecs with alternative approaches, many of them designed to take perceptual aspects into account, rather than optimizing for simple metrics like PSNR. This paper documents some of our experiences with these tools, which ones worked and which did not, and what we've learned from them. The result is a codec which compares favorably with HEVC on still images, and is on a path to do so for video as well.
RO | Mar 7, 2016
Enhanced Robot Audition Based on Microphone Array Source Separation with Post-Filter
Jean-Marc Valin, Jean Rouat, François Michaud
We propose a system that gives a mobile robot the ability to separate simultaneous sound sources. A microphone array is used along with a real-time dedicated implementation of Geometric Source Separation and a post-filter that gives us a further reduction of interference from other sources. We present results and comparisons for separation of multiple non-stationary speech sources combined with noise sources. The main advantage of our approach for mobile robots resides in the fact that both the frequency-domain Geometric Source Separation algorithm and the post-filter are able to adapt rapidly to new sources and non-stationarity. Separation results are presented for three simultaneous interfering speakers in the presence of noise. A reduction of log spectral distortion (LSD) of approximately 10 dB and an increase of signal-to-noise ratio (SNR) of approximately 14 dB are observed.
SD | Mar 6, 2016
Improved Noise Weighting in CELP Coding of Speech - Applying the Vorbis Psychoacoustic Model To Speex
Jean-Marc Valin, Christopher Montgomery
One key aspect of the CELP algorithm is that it shapes the coding noise using a simple, yet effective, weighting filter. In this paper, we improve the noise shaping of CELP using a more modern psychoacoustic model. This has the significant advantage of improving the quality of an existing codec without the need to change the bit-stream. More specifically, we improve the Speex CELP codec by using the psychoacoustic model used in the Vorbis audio codec. The results show a significant increase in quality, especially at high bit-rates, where the improvement is equivalent to a 20% reduction in bit-rate. The technique itself is not specific to Speex and could be applied to other CELP codecs.
SD | Mar 6, 2016
Low-Complexity Iterative Sinusoidal Parameter Estimation
Jean-Marc Valin, Daniel V. Smith, Christopher Montgomery et al.
Sinusoidal parameter estimation is a computationally intensive task, which can pose problems for real-time implementations. In this paper, we propose a low-complexity iterative method for estimating sinusoidal parameters that is based on the linearisation of the model around an initial frequency estimate. We show that for N sinusoids in a frame of length L, the proposed method has a complexity of O(LN), which is significantly less than the matching pursuits method. Furthermore, the proposed method is shown to be more accurate than the matching pursuits and time-frequency reassignment methods in our experiments.
SD | Feb 27, 2016
Perceptually-Motivated Nonlinear Channel Decorrelation For Stereo Acoustic Echo Cancellation
Jean-Marc Valin
Acoustic echo cancellation with stereo signals is generally an under-determined problem because of the high coherence between the left and right channels. In this paper, we present a novel method of significantly reducing inter-channel coherence without affecting the audio quality. Our work takes into account psychoacoustic masking and binaural auditory cues. The proposed non-linear processing combines a shaped comb-allpass (SCAL) filter with the injection of psychoacoustically masked noise. We show that the proposed method performs significantly better than other known methods for reducing inter-channel coherence.
RO | Feb 27, 2016
Localization of Simultaneous Moving Sound Sources for Mobile Robot Using a Frequency-Domain Steered Beamformer Approach
Jean-Marc Valin, François Michaud, Brahim Hadjou et al.
Mobile robots in real-life settings would benefit from being able to localize sound sources. Such a capability can nicely complement vision to help localize a person or an interesting event in the environment, and also to provide enhanced processing for other capabilities such as speech recognition. In this paper we present a robust sound source localization method in three-dimensional space using an array of 8 microphones. The method is based on a frequency-domain implementation of a steered beamformer along with a probabilistic post-processor. Results show that a mobile robot can localize in real time multiple moving sources of different types over a range of 5 meters with a response time of 200 ms.
SD | Feb 27, 2016
A New Robust Frequency Domain Echo Canceller With Closed-Loop Learning Rate Adaptation
Jean-Marc Valin, Iain B. Collings
One of the main difficulties in echo cancellation is the fact that the learning rate needs to vary according to conditions such as double-talk and echo path change. Several methods have been proposed to vary the learning rate. In this paper we propose a new closed-loop method where the learning rate is proportional to a misalignment parameter, which is in turn estimated based on a gradient adaptive approach. The method is presented in the context of a multidelay block frequency domain (MDF) echo canceller. We demonstrate that the proposed algorithm outperforms current popular double-talk detection techniques by up to 6 dB.
RO | Feb 27, 2016
Robust 3D Localization and Tracking of Sound Sources Using Beamforming and Particle Filtering
Jean-Marc Valin, François Michaud, Jean Rouat
In this paper we present a new robust sound source localization and tracking method using an array of eight microphones (US patent pending). The method uses a steered beamformer based on the reliability-weighted phase transform (RWPHAT) along with a particle filter-based tracking algorithm. The proposed system is able to estimate both the direction and the distance of the sources. In a videoconferencing context, the direction was estimated with an accuracy better than one degree while the distance was accurate within 10% RMS. Tracking of up to three simultaneous moving speakers is demonstrated in a noisy environment.
SD | Feb 26, 2016
Bandwidth Extension of Narrowband Speech for Low Bit-Rate Wideband Coding
Jean-Marc Valin, Roch Lefebvre
Wireless telephone speech is usually limited to the 300-3400 Hz band, which reduces its quality. There is thus a growing demand for wideband speech systems that transmit from 50 Hz to 8000 Hz. This paper presents an algorithm to generate wideband speech from narrowband speech using as low as 500 bits/s of side information. The 50-300 Hz band is predicted from the narrowband signal. A source-excitation model is used for the 3400-8000 Hz band, where the excitation is extrapolated at the receiver, and the spectral envelope is transmitted. Though some artifacts are present, the resulting wideband speech has enhanced quality compared to narrowband speech.
RO | Feb 26, 2016
Robust Sound Source Localization Using a Microphone Array on a Mobile Robot
Jean-Marc Valin, François Michaud, Jean Rouat et al.
The hearing sense on a mobile robot is important because it is omnidirectional and it does not require direct line-of-sight with the sound source. Such capabilities can nicely complement vision to help localize a person or an interesting event in the environment. To do so the robot auditory system must be able to work in noisy, unknown and diverse environmental conditions. In this paper we present a robust sound source localization method in three-dimensional space using an array of 8 microphones. The method is based on time delay of arrival estimation. Results show that a mobile robot can localize in real time different types of sound sources over a range of 3 meters and with a precision of 3 degrees.
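Time delay of arrival estimation between one pair of microphones can be sketched with a plain cross-correlation search. This is a simplified illustration and not the paper's method: it uses an unweighted time-domain correlation, a single microphone pair, and a far-field model to turn the delay into an angle. The function names, sampling rate, microphone spacing and speed of sound are all assumptions chosen for the example.

```python
import math


def estimate_delay(ref, delayed, max_lag):
    """Return the lag (in samples) within [-max_lag, max_lag] that maximises
    the cross-correlation sum_n ref[n] * delayed[n + lag]."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, value in enumerate(ref):
            m = n + lag
            if 0 <= m < len(delayed):
                score += value * delayed[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_to_angle(lag, sample_rate, spacing, speed_of_sound=343.0):
    """Far-field arrival angle in degrees for one microphone pair,
    from sin(theta) = c * tau / d (assumed geometry)."""
    ratio = speed_of_sound * (lag / sample_rate) / spacing
    ratio = max(-1.0, min(1.0, ratio))   # guard against rounding outside [-1, 1]
    return math.degrees(math.asin(ratio))


# Toy example (assumption): the second microphone hears the same short
# pulse three samples later than the first.
mic_a = [0.0, 0.0, 1.0, 2.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
mic_b = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, -1.0, 0.0, 0.0]
lag = estimate_delay(mic_a, mic_b, max_lag=5)
angle = delay_to_angle(7, sample_rate=48000, spacing=0.1)
```

With more than two microphones, the delays from several pairs can be combined to resolve a direction in three dimensions, which a single pair cannot do on its own.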
SD | Feb 26, 2016
Extension spectrale d'un signal de parole de la bande téléphonique à la bande AM [Spectral extension of a speech signal from the telephone band to the AM band]
Jean-Marc Valin
This document proposes a bandwidth extension system producing a wideband signal from a narrowband speech signal. The extension is performed independently for high and low frequencies. High-frequency extension uses the excitation-filter model. Extension of the excitation is performed in the time domain using a non-linear function, while the spectral envelope is extended in the cepstral domain using a multi-layer perceptron. Low-band extension is based on the sinusoidal model. The amplitude of sinusoids is also estimated using a multi-layer perceptron. The results show that the sound quality after extension is higher than that of narrowband speech, with a significant variation across listeners. Some of the techniques, including excitation extension, are of interest in the field of speech coding.
RO | Feb 25, 2016
Robust Localization and Tracking of Simultaneous Moving Sound Sources Using Beamforming and Particle Filtering
Jean-Marc Valin, François Michaud, Jean Rouat
Mobile robots in real-life settings would benefit from being able to localize and track sound sources. Such a capability can help localize a person or an interesting event in the environment, and also provides enhanced processing for other capabilities such as speech recognition. To give this capability to a robot, the challenge is not only to localize simultaneous sound sources, but to track them over time. In this paper we propose a robust sound source localization and tracking method using an array of eight microphones. The method is based on a frequency-domain implementation of a steered beamformer along with a particle filter-based tracking algorithm. Results show that a mobile robot can localize and track in real time multiple moving sources of different types over a range of 7 meters. These new capabilities allow a mobile robot to interact using more natural means with people in real-life settings.
SY | Feb 25, 2016
Interference-Normalised Least Mean Square Algorithm
Jean-Marc Valin, Iain B. Collings
An interference-normalised least mean square (INLMS) algorithm for robust adaptive filtering is proposed. The INLMS algorithm extends the gradient-adaptive learning rate approach to the case where the signals are non-stationary. In particular, we show that the INLMS algorithm can work even for highly non-stationary interference signals, where previous gradient-adaptive learning rate algorithms fail.
SD | Feb 25, 2016
On Adjusting the Learning Rate in Frequency Domain Echo Cancellation With Double-Talk
Jean-Marc Valin
One of the main difficulties in echo cancellation is the fact that the learning rate needs to vary according to conditions such as double-talk and echo path change. In this paper we propose a new method of varying the learning rate of a frequency-domain echo canceller. This method is based on the derivation of the optimal learning rate of the NLMS algorithm in the presence of noise. The method is evaluated in conjunction with the multidelay block frequency domain (MDF) adaptive filter. We demonstrate that it performs better than current double-talk detection techniques and is simple to implement.
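The role of the learning rate is easiest to see in a basic NLMS echo canceller. The sketch below is a plain time-domain NLMS filter with a fixed learning rate `mu`; it does not reproduce the paper's optimal learning rate derivation or the MDF structure. The function name, the three-tap echo path and the noise-free microphone signal are assumptions made for illustration.

```python
import random


def nlms_echo_canceller(far_end, mic, taps, mu=0.5, eps=1e-8):
    """Adapt an FIR echo-path estimate with time-domain NLMS.
    Returns (error_signal, weights). mu is a fixed learning rate here."""
    w = [0.0] * taps
    history = [0.0] * taps                 # recent far-end samples, newest first
    errors = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        y = sum(wi * hi for wi, hi in zip(w, history))   # echo estimate
        e = d - y                                        # residual after cancellation
        norm = sum(h * h for h in history) + eps         # input energy normalisation
        step = mu * e / norm
        w = [wi + step * hi for wi, hi in zip(w, history)]
        errors.append(e)
    return errors, w


# Toy example (assumption): white far-end signal, short echo path,
# and no near-end speech or noise at the microphone.
rng = random.Random(0)
echo_path = [0.5, -0.3, 0.1]
far = [rng.gauss(0.0, 1.0) for _ in range(500)]
mic = [sum(h * far[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
       for n in range(len(far))]
residual, weights = nlms_echo_canceller(far, mic, taps=3)
```

In this noise-free case a large fixed `mu` converges quickly to the true echo path. Once near-end speech is added to `mic`, the same `mu` makes the weights drift during double-talk, which is the situation a variable learning rate is meant to handle.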
RO | Feb 22, 2016
Auditory System for a Mobile Robot
Jean-Marc Valin
In this thesis, we propose an artificial auditory system that gives a robot the ability to locate and track sounds, as well as to separate simultaneous sound sources and recognise simultaneous speech. We demonstrate that it is possible to implement these capabilities using an array of microphones, without trying to imitate the human auditory system. The sound source localisation and tracking algorithm uses a steered beamformer to locate sources, which are then tracked using a multi-source particle filter. Separation of simultaneous sound sources is achieved using a variant of the Geometric Source Separation (GSS) algorithm, combined with a multi-source post-filter that further reduces noise, interference and reverberation. Speech recognition is performed on separated sources, either directly or by using Missing Feature Theory (MFT) to estimate the reliability of the speech features. The results obtained show that it is possible to track up to four simultaneous sound sources, even in noisy and reverberant environments. Real-time control of the robot following a sound source is also demonstrated. The sound source separation approach we propose is able to achieve a 13.7 dB improvement in signal-to-noise ratio compared to a single microphone when three speakers are present. In these conditions, the system demonstrates more than 80% accuracy on digit recognition, higher than most human listeners could obtain in our small case study when recognising only one of these sources. All these new capabilities will allow humans to interact more naturally with a mobile robot in real-life settings.
RO | Feb 20, 2016
Robust Recognition of Simultaneous Speech By a Mobile Robot
Jean-Marc Valin, Shun'ichi Yamamoto, Jean Rouat et al.
This paper describes a system that gives a mobile robot the ability to perform automatic speech recognition with simultaneous speakers. A microphone array is used along with a real-time implementation of Geometric Source Separation and a post-filter that gives a further reduction of interference from other sources. The post-filter is also used to estimate the reliability of spectral features and compute a missing feature mask. The mask is used in a missing feature theory-based speech recognition system to recognize the speech from simultaneous Japanese speakers in the context of a humanoid robot. Recognition rates are presented for three simultaneous speakers located at 2 meters from the robot. The system was evaluated on a 200 word vocabulary at different azimuths between sources, ranging from 10 to 90 degrees. Compared to the use of the microphone array source separation alone, we demonstrate an average reduction in relative recognition error rate of 24% with the post-filter and of 42% when the missing features approach is combined with the post-filter. We demonstrate the effectiveness of our multi-source microphone array post-filter and the improvement it provides when used in conjunction with the missing features theory.
MMFeb 18, 2016
The AV1 Constrained Directional Enhancement Filter (CDEF)Steinar Midtskogen, Jean-Marc Valin
This paper presents the constrained directional enhancement filter designed for the AV1 royalty-free video codec. The in-loop filter is based on a non-linear low-pass filter and is designed for vectorization efficiency. It takes into account the direction of edges and patterns being filtered. The filter works by identifying the direction of each block and then adaptively filtering with a high degree of control over the filter strength along the direction and across it. The proposed enhancement filter is shown to improve the quality of the Alliance for Open Media (AOM) AV1 and Thor video codecs, particularly in low-complexity configurations.
SDFeb 17, 2016
A High-Quality Speech and Audio Codec With Less Than 10 ms DelayJean-Marc Valin, Timothy B. Terriberry, Christopher Montgomery et al.
With increasing quality requirements for multimedia communications, audio codecs must maintain both high quality and low delay. Typically, audio codecs offer either low delay or high quality, but rarely both. We propose a codec that simultaneously addresses both these requirements, with a delay of only 8.7 ms at 44.1 kHz. It uses gain-shape algebraic vector quantisation in the frequency domain with time-domain pitch prediction. We demonstrate that the proposed codec operating at 48 kbit/s and 64 kbit/s out-performs both G.722.1C and MP3 and has quality comparable to AAC-LD, despite having less than one fourth of the algorithmic delay of these codecs.
SDFeb 17, 2016
An Iterative Linearised Solution to the Sinusoidal Parameter Estimation ProblemJean-Marc Valin, Daniel V. Smith, Christopher Montgomery et al.
Signal processing applications use sinusoidal modelling for speech synthesis, speech coding, and audio coding. Estimation of the model parameters involves non-linear optimisation methods, which can be very costly for real-time applications. We propose a low-complexity iterative method that starts from initial frequency estimates and converges rapidly. We show that for N sinusoids in a frame of length L, the proposed method has a complexity of O(LN), which is significantly less than the matching pursuits method. Furthermore, the proposed method is shown to be more accurate than the matching pursuits and time-frequency reassignment methods in our experiments.
MMFeb 17, 2016
A Full-Bandwidth Audio Codec With Low Complexity And Very Low DelayJean-Marc Valin, Timothy B. Terriberry, Gregory Maxwell
We propose an audio codec that addresses the low-delay requirements of some applications such as network music performance. The codec is based on the modified discrete cosine transform (MDCT) with very short frames and uses gain-shape quantization to preserve the spectral envelope. The short frame sizes required for low delay typically hinder the performance of transform codecs. However, at 96 kbit/s and with only 4 ms algorithmic delay, the proposed codec out-performs the ULD codec operating at the same rate. The total complexity of the codec is small, at only 17 WMOPS for real-time operation at 48 kHz.
MMFeb 16, 2016
Perceptual Vector Quantization For Video CodingJean-Marc Valin, Timothy B. Terriberry
This paper applies energy conservation principles to the Daala video codec using gain-shape vector quantization to encode a vector of AC coefficients as a length (gain) and direction (shape). The technique originates from the CELT mode of the Opus audio codec, where it is used to conserve the spectral envelope of an audio signal. Conserving energy in video has the potential to preserve textures rather than low-passing them. Explicitly quantizing a gain allows a simple contrast masking model with no signaling cost. Vector quantizing the shape keeps the number of degrees of freedom the same as scalar quantization, avoiding redundancy in the representation. We demonstrate how to predict the vector by transforming the space it is encoded in, rather than subtracting off the predictor, which would make energy conservation impossible. We also derive an encoding of the vector-quantized codewords that takes advantage of their non-uniform distribution. We show that the resulting technique outperforms scalar quantization by an average of 0.90 dB on still images, equivalent to a 24.8% reduction in bitrate at equal quality, while for videos, the improvement averages 0.83 dB, equivalent to a 13.7% reduction in bitrate.
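To make the gain-shape idea in the abstract above concrete, here is a minimal sketch that splits a coefficient vector into a scalar gain and a unit-norm shape, quantizes each, and reconstructs a vector whose energy equals the quantized gain. The integer-pulse shape codebook (sum of absolute values equal to `k`), the greedy pulse search, the uniform gain step and all function names are assumptions made for illustration; the abstract does not specify them, and this is not the paper's encoder.

```python
import math


def gain_shape_split(v):
    """Split v into its L2 norm (gain) and a unit-norm direction (shape)."""
    gain = math.sqrt(sum(c * c for c in v))
    if gain == 0.0:
        return 0.0, [0.0] * len(v)
    return gain, [c / gain for c in v]


def quantize_gain(gain, step):
    """Uniform scalar quantizer for the gain (the step size is a hypothetical parameter)."""
    return round(gain / step) * step


def quantize_shape(shape, k):
    """Greedy search for an integer vector y with sum(|y_i|) == k that is well aligned with shape."""
    mags = [abs(c) for c in shape]
    total = sum(mags)
    if total == 0.0:
        return [k] + [0] * (len(shape) - 1)
    # Start from a rounded-down projection, then place the remaining pulses one at a time.
    y = [int(math.floor(k * m / total)) for m in mags]
    xy = sum(m * p for m, p in zip(mags, y))
    yy = sum(p * p for p in y)
    for _ in range(k - sum(y)):
        # Pick the position that maximizes the squared correlation over the codeword energy.
        best = max(range(len(y)), key=lambda i: (xy + mags[i]) ** 2 / (yy + 2 * y[i] + 1))
        xy += mags[best]
        yy += 2 * y[best] + 1
        y[best] += 1
    return [p if c >= 0 else -p for p, c in zip(y, shape)]


def reconstruct(gain_q, y):
    """Scale the normalized codeword so the output energy matches the quantized gain."""
    norm = math.sqrt(sum(p * p for p in y))
    return [gain_q * p / norm for p in y]


gain, shape = gain_shape_split([3.0, -4.0, 0.0, 0.0])
gain_q = quantize_gain(gain, 0.5)
y = quantize_shape(shape, 7)
decoded = reconstruct(gain_q, y)
```

Because the shape codeword is renormalized before scaling, the decoded vector always has norm `gain_q` regardless of how coarsely the shape is quantized, which is the energy-conserving property the abstract contrasts with scalar quantization.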
MMFeb 15, 2016
High-Quality, Low-Delay Music Coding in the Opus CodecJean-Marc Valin, Gregory Maxwell, Timothy B. Terriberry et al.
The IETF recently standardized the Opus codec as RFC 6716. Opus targets a wide range of real-time Internet applications by combining a linear prediction coder with a transform coder. We describe the transform coder, with particular attention to the psychoacoustic knowledge built into the format. The result out-performs existing audio codecs that do not operate under real-time constraints.