CL | SD | AS | Dec 19, 2023

SpokesBiz -- an Open Corpus of Conversational Polish

arXiv:2312.12364v1 | 5 citations | h-index: 6
Originality: Synthesis-oriented
AI Analysis

SpokesBiz gives researchers and developers working on Polish language processing a new dataset, but the contribution is incremental because it adds to existing resources.

The paper introduces SpokesBiz, a freely available corpus of over 650 hours of conversational Polish recordings, which has been transcribed, diarized, and manually annotated for punctuation and casing, and outlines its applications in linguistic research and ASR system evaluation.

This paper announces the early release of SpokesBiz, a freely available corpus of conversational Polish developed within the CLARIN-BIZ project and comprising over 650 hours of recordings. The transcribed recordings have been diarized and manually annotated for punctuation and casing. We outline the general structure and content of the corpus and showcase selected applications in linguistic research and in the evaluation and improvement of automatic speech recognition (ASR) systems.
