CL · AI · Oct 5, 2020

A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese

arXiv:2010.01891v11005 citations
Originality Synthesis-oriented
AI Analysis

This work addresses a gap for Vietnamese NLP researchers by providing a new dataset and incremental improvements in semantic parsing for a low-resource language.

The authors address the lack of a large-scale Text-to-SQL dataset for Vietnamese by releasing the first public one. They find that Vietnamese-specific resources (automatic word segmentation, the monolingual language model PhoBERT, and latent syntactic features from a dependency parser) improve parsing results over the baseline configurations.

Semantic parsing is an important NLP task. However, Vietnamese is a low-resource language in this research area. In this paper, we present the first public large-scale Text-to-SQL semantic parsing dataset for Vietnamese. We extend and evaluate two strong semantic parsing baselines EditSQL (Zhang et al., 2019) and IRNet (Guo et al., 2019) on our dataset. We compare the two baselines with key configurations and find that: automatic Vietnamese word segmentation improves the parsing results of both baselines; the normalized pointwise mutual information (NPMI) score (Bouma, 2009) is useful for schema linking; latent syntactic features extracted from a neural dependency parser for Vietnamese also improve the results; and the monolingual language model PhoBERT for Vietnamese (Nguyen and Nguyen, 2020) helps produce higher performances than the recent best multilingual language model XLM-R (Conneau et al., 2020).
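
The NPMI score mentioned in the abstract can be illustrated with a short sketch. It follows the standard definition from Bouma (2009): pointwise mutual information divided by -log p(x, y), which bounds the score to [-1, 1]. The function name and the count-based inputs are assumptions made for illustration; the abstract does not say how the authors collect co-occurrence statistics or turn the score into schema links, so this is not their implementation.

```python
import math


def npmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Normalized PMI from raw counts (hypothetical helper, not from the paper).

    count_xy: number of observations where x and y co-occur
    count_x, count_y: number of observations containing x (resp. y)
    total: total number of observations
    """
    if count_xy == 0:
        return -1.0  # conventional limit when the pair never co-occurs
    p_xy = count_xy / total
    if p_xy == 1.0:
        return 1.0  # degenerate case: x and y occur in every observation
    p_x = count_x / total
    p_y = count_y / total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


# Example: a question token and a column name that always appear together
# score close to 1; statistically independent items score close to 0.
always_together = npmi(count_xy=10, count_x=10, count_y=10, total=100)
independent = npmi(count_xy=25, count_x=50, count_y=50, total=100)
```

Under this definition, a higher score between a question token and a schema item (a table or column name) indicates a stronger association, which is the kind of signal a schema-linking step could threshold or rank on.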

Code Implementations: 1 repo
