Assaf Urieli

h-index4
2papers

2 Papers

CLApr 3, 2022
A Part-of-Speech Tagger for Yiddish

Seth Kulick, Neville Ryant, Beatrice Santorini et al.

We describe the construction and evaluation of a part-of-speech tagger for Yiddish. This is the first step in a larger project of automatically assigning part-of-speech tags and syntactic structure to Yiddish text for purposes of linguistic research. We combine two resources for the current work - an 80K-word subset of the Penn Parsed Corpus of Historical Yiddish (PPCHY) and 650 million words of OCR'd Yiddish text from the Yiddish Book Center (YBC). Yiddish orthography in the YBC corpus has many spelling inconsistencies, and we present some evidence that even simple non-contextualized embeddings trained on YBC are able to capture the relationships among spelling variants without the need to first "standardize" the corpus. We also use YBC for continued pretraining of contexualized embeddings, which are then integrated into a tagger model trained and evaluated on the PPCHY. We evaluate the tagger performance on a 10-fold cross-validation split, showing that the use of the YBC text for the contextualized embeddings improves tagger performance. We conclude by discussing some next steps, including the need for additional annotated training and test data.

CLJan 14, 2025Code
Jochre 3 and the Yiddish OCR corpus

Assaf Urieli, Amber Clooney, Michelle Sigiel et al.

We describe the construction of a publicly available Yiddish OCR Corpus, and describe and evaluate the open source OCR tool suite Jochre 3, including an Alto editor for corpus annotation, OCR software for Alto OCR layer generation, and a customizable OCR search engine. The current version of the Yiddish OCR corpus contains 658 pages, 186K tokens and 840K glyphs. The Jochre 3 OCR tool uses various fine-tuned YOLOv8 models for top-down page layout analysis, and a custom CNN network for glyph recognition. It attains a CER of 1.5% on our test corpus, far out-performing all other existing public models for Yiddish. We analyzed the full 660M word Yiddish Book Center with Jochre 3 OCR, and the new OCR is searchable through the Yiddish Book Center OCR search engine.