Creating a Universal Dependencies Treebank of Spoken Frisian-Dutch Code-switched Data
This work addresses annotation difficulties for low-resource Frisian and code-switched data, but its contribution is incremental: it improves the annotation process rather than proposing novel methods.
The paper tackles the challenge of annotating transcribed spoken Dutch-Frisian code-switched utterances into Universal Dependencies, achieving an increase of 7.8 UAS and 10.5 LAS points between the first and third annotation rounds.
This paper explores the difficulties of annotating transcribed spoken Dutch-Frisian code-switched utterances into Universal Dependencies. We make use of data from the FAME! corpus, which consists of transcriptions and audio data. Besides the usual annotation difficulties, this dataset is particularly challenging because Frisian is a low-resource language, the data are informal, and the utterances contain code-switching and non-standard sentence segmentation. As a starting point, two annotators annotated 150 random utterances in three stages of 50 utterances each. After each stage, disagreements were discussed and resolved. An increase of 7.8 UAS and 10.5 LAS points was achieved between the first and third round. This paper focuses on the issues that arise when annotating a transcribed speech corpus and proposes several solutions to resolve them.
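For readers unfamiliar with the two scores, the following minimal sketch shows how UAS (unlabeled attachment score) and LAS (labeled attachment score) can be computed between two annotations of the same utterance. It assumes the scores are used as agreement measures between the two annotators, with each token reduced to a (head index, dependency relation) pair; the abstract does not describe the exact procedure, so the function name `attachment_scores` and the example data are purely illustrative.

```python
from typing import List, Tuple

# Hypothetical token representation: (head index, dependency relation).
Token = Tuple[int, str]


def attachment_scores(first: List[Token], second: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) in percent for two annotations of the same tokens."""
    if len(first) != len(second):
        raise ValueError("both annotations must cover the same tokens")
    if not first:
        raise ValueError("annotations must not be empty")

    # UAS counts tokens whose head matches, ignoring the relation label.
    head_matches = sum(1 for (h1, _), (h2, _) in zip(first, second) if h1 == h2)
    # LAS counts tokens whose head and relation label both match.
    full_matches = sum(1 for t1, t2 in zip(first, second) if t1 == t2)

    total = len(first)
    return 100 * head_matches / total, 100 * full_matches / total


# Invented four-token example, not taken from the FAME! corpus.
annotator_a = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
annotator_b = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "punct")]

uas, las = attachment_scores(annotator_a, annotator_b)
print(f"UAS = {uas:.1f}, LAS = {las:.1f}")
```

In this invented example the annotators agree on three of four heads and on two of four head-and-label pairs, so LAS can never exceed UAS.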