SDApr 18, 2022
AB/BA analysis: A framework for estimating keyword spotting recall improvement while maintaining audio privacyRaphael Petegrosso, Vasistakrishna Baderdinni, Thibaud Senechal et al.
Evaluation of keyword spotting (KWS) systems that detect keywords in speech is a challenging task under realistic privacy constraints. The KWS is designed to only collect data when the keyword is present, limiting the availability of hard samples that may contain false negatives, and preventing direct estimation of model recall from production data. Alternatively, complementary data collected from other sources may not be fully representative of the real application. In this work, we propose an evaluation technique which we call AB/BA analysis. Our framework evaluates a candidate KWS model B against a baseline model A, using cross-dataset offline decoding for relative recall estimation, without requiring negative examples. Moreover, we propose a formulation with assumptions that allow estimation of relative false positive rate between models with low variance even when the number of false positives is small. Finally, we propose to leverage machine-generated soft labels, in a technique we call Semi-Supervised AB/BA analysis, that improves the analysis time, privacy, and cost. Experiments with both simulation and real data show that AB/BA analysis is successful at measuring recall improvement in conjunction with the trade-off in relative false positive rate.
LGFeb 20, 2018
Scalable Label Propagation for Multi-relational Learning on the Tensor Product of GraphsZhuliu Li, Raphael Petegrosso, Shaden Smith et al.
Multi-relational learning on knowledge graphs infers high-order relations among the entities across the graphs. This learning task can be solved by label propagation on the tensor product of the knowledge graphs to learn the high-order relations as a tensor. In this paper, we generalize a widely used label propagation model to the normalized tensor product graph, and propose an optimization formulation and a scalable Low-rank Tensor-based Label Propagation algorithm (LowrankTLP) to infer multi-relations for two learning tasks, hyperlink prediction and multiple graph alignment. The optimization formulation minimizes the upper bound of the noisy tensor estimation error for multiple graph alignment, by learning with a subset of the eigen-pairs in the spectrum of the normalized tensor product graph. We also provide a data-dependent transductive Rademacher bound for binary hyperlink prediction. We accelerate LowrankTLP with parallel tensor computation which enables label propagation on a tensor product of 100 graphs each of size 1000 in less than half hour in the simulation. LowrankTLP was also applied to predicting the author-paper-venue hyperlinks in publication records, alignment of segmented regions across up to 26 CT-scan images and alignment of protein-protein interaction networks across multiple species. The experiments demonstrate that LowrankTLP indeed well approximates the original label propagation with better scalability and accuracy.
LGFeb 28, 2017
Low-rank Label Propagation for Semi-supervised Learning with 100 Millions SamplesRaphael Petegrosso, Wei Zhang, Zhuliu Li et al.
The success of semi-supervised learning crucially relies on the scalability to a huge amount of unlabelled data that are needed to capture the underlying manifold structure for better classification. Since computing the pairwise similarity between the training data is prohibitively expensive in most kinds of input data, currently, there is no general ready-to-use semi-supervised learning method/tool available for learning with tens of millions or more data points. In this paper, we adopted the idea of two low-rank label propagation algorithms, GLNP (Global Linear Neighborhood Propagation) and Kernel Nyström Approximation, and implemented the parallelized version of the two algorithms accelerated with Nesterov's accelerated projected gradient descent for Big-data Label Propagation (BigLP). The parallel algorithms are tested on five real datasets ranging from 7000 to 10,000,000 in size and a simulation dataset of 100,000,000 samples. In the experiments, the implementation can scale up to datasets with 100,000,000 samples and hundreds of features and the algorithms also significantly improved the prediction accuracy when only a very small percentage of the data is labeled. The results demonstrate that the BigLP implementation is highly scalable to big data and effective in utilizing the unlabeled data for semi-supervised learning.