CL
Aug 5, 2023

Multi-Source (Pre-)Training for Cross-Domain Measurement, Unit and Context Extraction

arXiv:2308.02951v1223 citations
h-index: 12
Has Code
Originality: Incremental advance
AI Analysis

This work addresses cross-domain generalization for automated extraction tasks. The advance is incremental, as it builds on existing pre-trained language models and multi-source training methods.

The paper tackles automated measurement and context extraction across domains using a multi-source pre-training approach. It finds that multi-source training yields the best overall results, while single-source training performs best for each individual domain. Extraction of quantity values and units is successful, but extraction of contextual entities still needs improvement.

We present a cross-domain approach for automated measurement and context extraction based on pre-trained language models. We construct a multi-source, multi-domain corpus and train an end-to-end extraction pipeline. We then apply multi-source task-adaptive pre-training and fine-tuning to benchmark the cross-domain generalization capability of our model. Further, we conceptualize and apply a task-specific error analysis and derive insights for future work. Our results suggest that multi-source training leads to the best overall results, while single-source training yields the best results for the respective individual domain. While our setup is successful at extracting quantity values and units, more research is needed to improve the extraction of contextual entities. We make the cross-domain corpus used in this work available online.
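
The comparison between single-source and multi-source training described in the abstract can be pictured as a grid of training configurations and evaluation domains. The following is only a minimal sketch of that layout, not the authors' code: the function names, source names, and toy sentences are all hypothetical, and no model training is performed.

```python
from itertools import chain


def build_training_configs(corpora):
    """Map a config name to its list of training examples.

    `corpora` maps a source name to that source's examples.
    Yields one single-source config per source plus one pooled
    multi-source config (assumed here to be a plain union).
    """
    configs = {f"single:{name}": list(examples) for name, examples in corpora.items()}
    configs["multi"] = list(chain.from_iterable(corpora.values()))
    return configs


def evaluation_grid(corpora):
    """List every (training config, evaluation domain) pair to benchmark."""
    configs = build_training_configs(corpora)
    return [(cfg, domain) for cfg in configs for domain in corpora]


# Hypothetical toy corpora; names and sentences are made up for illustration.
toy = {
    "source_a": ["The sample was heated to 300 K."],
    "source_b": ["The dose was 5 mg per day.", "Length was 2 m."],
}
configs = build_training_configs(toy)
grid = evaluation_grid(toy)
```

Each pair in `grid` would correspond to one benchmark run: a model trained under the named configuration and scored on the named domain. Comparing the `multi` row against the `single:` rows is the kind of contrast the abstract reports.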

Code Implementations: 1 repo
