CLASMar 15, 2023

Cascading and Direct Approaches to Unsupervised Constituency Parsing on Spoken Sentences

MIT
arXiv:2303.08809v25 citationsh-index: 52
AI Analysis

This addresses the problem of parsing spoken language without labeled data for researchers in NLP and speech processing, but it is incremental as it extends existing unsupervised parsing methods to a new modality.

The paper tackles unsupervised constituency parsing for spoken sentences by comparing cascading ASR with parsing and direct parsing on speech representations, finding that using ASR transcripts yields better results and accurate segmentation may suffice for parsing.

Past work on unsupervised parsing is constrained to written form. In this paper, we present the first study on unsupervised spoken constituency parsing given unlabeled spoken sentences and unpaired textual data. The goal is to determine the spoken sentences' hierarchical syntactic structure in the form of constituency parse trees, such that each node is a span of audio that corresponds to a constituent. We compare two approaches: (1) cascading an unsupervised automatic speech recognition (ASR) model and an unsupervised parser to obtain parse trees on ASR transcripts, and (2) direct training an unsupervised parser on continuous word-level speech representations. This is done by first splitting utterances into sequences of word-level segments, and aggregating self-supervised speech representations within segments to obtain segment embeddings. We find that separately training a parser on the unpaired text and directly applying it on ASR transcripts for inference produces better results for unsupervised parsing. Additionally, our results suggest that accurate segmentation alone may be sufficient to parse spoken sentences accurately. Finally, we show the direct approach may learn head-directionality correctly for both head-initial and head-final languages without any explicit inductive bias.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes