CV · Jul 3, 2022
Supervised learning for improving the accuracy of robot-mounted 3D camera applied to human gait analysis
Diego Guffanti, Alberto Brunete, Miguel Hernando et al.
The use of 3D cameras for gait analysis has been widely questioned because of the low accuracy they have demonstrated in the past. The objective of the study presented in this paper is to improve the accuracy of the estimations made by robot-mounted 3D cameras in human gait analysis by applying a supervised learning stage. The 3D camera was mounted on a mobile robot to obtain a longer walking distance. This study shows an improvement in the detection of kinematic gait signals and gait descriptors, obtained by post-processing the raw estimations of the camera with artificial neural networks trained on data from a certified Vicon system. To achieve this, 37 healthy participants were recruited and data from 207 gait sequences were collected using an Orbbec Astra 3D camera. Two training approaches are possible: using kinematic gait signals and using gait descriptors. The former seeks to improve the waveforms of kinematic gait signals by reducing the error and increasing the correlation with respect to the Vicon system. The latter is a more direct approach that trains the artificial neural networks on the gait descriptors themselves. The accuracy of the 3D camera was measured before and after training. An improvement was observed with both training approaches. Kinematic gait signals showed lower errors and higher correlations with respect to the ground truth. The accuracy of the system in detecting gait descriptors also improved substantially, mostly for kinematic descriptors rather than spatio-temporal ones. When comparing the two training approaches, it was not possible to determine which was better overall. Therefore, we believe that the choice of training approach will depend on the purpose of the study to be conducted. This study reveals the great potential of 3D cameras and encourages the research community to continue exploring their use in gait analysis.
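As an illustration of what "reducing the error and increasing the correlation with respect to the Vicon system" could mean in practice, the sketch below compares a camera-estimated signal against a reference signal. The abstract does not name the metrics used, so root-mean-square error and Pearson correlation are assumptions here, the function names are hypothetical, and the sample values are synthetic rather than data from the study.

```python
import math

def rmse(estimate, reference):
    """Root-mean-square error between two equally long signals (assumed metric)."""
    n = len(reference)
    return math.sqrt(sum((e - r) ** 2 for e, r in zip(estimate, reference)) / n)

def pearson(estimate, reference):
    """Pearson correlation coefficient between two equally long signals (assumed metric)."""
    n = len(reference)
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    cov = sum((e - mean_e) * (r - mean_r) for e, r in zip(estimate, reference))
    var_e = sum((e - mean_e) ** 2 for e in estimate)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    return cov / math.sqrt(var_e * var_r)

# Synthetic joint-angle samples in degrees, for illustration only.
reference_signal = [0.0, 1.0, 2.0, 3.0]   # stands in for the ground-truth system
camera_signal = [1.0, 2.0, 3.0, 4.0]      # stands in for the raw camera estimate

print("RMSE:", rmse(camera_signal, reference_signal))
print("Correlation:", pearson(camera_signal, reference_signal))
```

Computing both values before and after the post-processing stage would give a like-for-like comparison of the kind the abstract describes.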
LG · Sep 25, 2019
Input complexity and out-of-distribution detection with likelihood-based generative models
Joan Serrà, David Álvarez, Vicenç Gómez et al.
Likelihood-based generative models are a promising resource to detect out-of-distribution (OOD) inputs which could compromise the robustness or reliability of a machine learning system. However, likelihoods derived from such models have been shown to be problematic for detecting certain types of inputs that significantly differ from training data. In this paper, we posit that this problem is due to the excessive influence that input complexity has on generative models' likelihoods. We report a set of experiments supporting this hypothesis, and use an estimate of input complexity to derive an efficient and parameter-free OOD score, which can be seen as a likelihood ratio, akin to Bayesian model comparison. We find that this score performs comparably to, or even better than, existing OOD detection approaches across a wide range of data sets, models, model sizes, and complexity estimates.
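One way to read the score described above is as the model's negative log-likelihood minus a complexity estimate, with both expressed in bits. The sketch below follows that reading under stated assumptions: the complexity estimate is taken to be the length of the zlib-compressed input, the function names are hypothetical, and the negative log-likelihood is passed in as a number because no generative model is included here.

```python
import zlib

def complexity_bits(x: bytes) -> float:
    """Complexity estimate (assumed): size in bits of the zlib-compressed input."""
    return 8.0 * len(zlib.compress(x, 9))

def ood_score(nll_bits: float, x: bytes) -> float:
    """Negative log-likelihood minus the complexity estimate, both in bits.

    Under this reading, a higher score suggests the input is more likely
    to be out-of-distribution.
    """
    return nll_bits - complexity_bits(x)

# Synthetic inputs: a constant block compresses far better than a varied one.
simple_input = b"\x00" * 1024
varied_input = bytes(range(256)) * 4

# The same hypothetical likelihood value yields different scores once
# input complexity is subtracted.
hypothetical_nll_bits = 1000.0
print(ood_score(hypothetical_nll_bits, simple_input))
print(ood_score(hypothetical_nll_bits, varied_input))
```

Because the complexity term comes from an off-the-shelf compressor, the score adds no trainable parameters, which matches the "parameter-free" description in the abstract.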
SD · Jun 3, 2019
Problem-Agnostic Speech Embeddings for Multi-Speaker Text-to-Speech with SampleRNN
David Álvarez, Santiago Pascual, Antonio Bonafonte
Text-to-speech (TTS) acoustic models map linguistic features into an acoustic representation out of which an audible waveform is generated. The latest and most natural TTS systems, like SampleRNN, build a direct mapping between the linguistic and waveform domains. This avoids possible losses in signal naturalness, as intermediate acoustic representations are discarded. Another important dimension of study apart from naturalness is their adaptability to generate voices from new speakers that were unseen during training. In this paper we first propose the use of problem-agnostic speech embeddings in a multi-speaker acoustic model for TTS based on SampleRNN. This way we feed the acoustic model with acoustically dependent speaker representations that enrich the waveform generation more than discrete embeddings unrelated to these factors. Our first results suggest that the proposed embeddings lead to better quality voices than those obtained with discrete embeddings. Furthermore, as we can use any speech segment as an encoded representation during inference, the model is capable of generalizing to new speaker identities without retraining the network. We finally show that, with a small increase of speech duration in the embedding extractor, we dramatically reduce the spectral distortion to close the gap towards the target identities.