CL CR LG SD AS

Apr 20, 2022

Detecting Unintended Memorization in Language-Model-Fused ASR

W. Ronny Huang, Steve Chien, Om Thakkar, Rajiv Mathews

arXiv:2204.09606v23.012 citations, h-index: 17

Originality: Incremental advance

AI Analysis

This paper addresses privacy risks in LM-fused ASR systems and offers a method for detecting and mitigating memorization. The advance is incremental because it builds on prior memorization research.

The authors tackle the problem of detecting unintended memorization of rare sequences in language models fused with speech recognizers. They show that memorization of singly-occurring canaries can be detected in an LM trained on 300M examples, and that per-example gradient-clipped training reduces this memorization without quality loss.

End-to-end (E2E) models are often accompanied by language models (LMs) via shallow fusion to boost their overall quality as well as their recognition of rare words. At the same time, several prior works show that LMs are susceptible to unintentionally memorizing rare or unique sequences in the training data. In this work, we design a framework for detecting memorization of random textual sequences (which we call canaries) in the LM training data when one has only black-box (query) access to an LM-fused speech recognizer, as opposed to direct access to the LM. On a production-grade Conformer RNN-T E2E model fused with a Transformer LM, we show that detecting memorization of singly-occurring canaries from the LM training data of 300M examples is possible. Motivated to protect privacy, we also show that such memorization is significantly reduced by per-example gradient-clipped LM training without compromising overall quality.
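The abstract names per-example gradient clipping as the mitigation but does not spell out the mechanics. As a rough illustration only, the general technique can be sketched as follows: each example's gradient is rescaled so its L2 norm does not exceed a threshold, and the clipped gradients are then averaged. The function names and the `clip_norm` parameter are hypothetical, and this is a generic sketch in plain Python, not the authors' training code.

```python
import math

def clip_gradient(grad, clip_norm):
    """Rescale one example's gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    # Gradients already within the bound are left unchanged (scale = 1).
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def clipped_mean_gradient(per_example_grads, clip_norm):
    """Clip every example's gradient separately, then average them."""
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    n = len(clipped)
    return [sum(column) / n for column in zip(*clipped)]

# Two toy per-example gradients: one large (norm 5.0), one small (norm 0.5).
# clip_norm=1.0 is an illustrative value, not one taken from the paper.
avg = clipped_mean_gradient([[3.0, 4.0], [0.3, 0.4]], clip_norm=1.0)
```

Because clipping is applied before averaging, a single unusual example (such as a singly-occurring canary) can move the parameters by at most `clip_norm / n` per step, which is the intuition behind why this kind of training may limit memorization of rare sequences.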

