CL · AI · DL · Mar 25, 2024

SPACE-IDEAS: A Dataset for Salient Information Detection in Space Innovation

arXiv:2403.16941v181 citations · h-index: 2 · LREC
Originality: Synthesis-oriented
AI Analysis

This paper provides a domain-specific dataset for researchers in NLP and space innovation, but the contribution is incremental: it adapts existing methods to a new domain.

The authors tackled the lack of diverse datasets for salient information detection by introducing SPACE-IDEAS, a dataset from space innovation ideas with varied writing styles, and showed that using automatically annotated data via multitask learning improves classifier performance.

Detecting salient parts in text using natural language processing has been widely used to mitigate the effects of information overflow. Nevertheless, most of the datasets available for this task are derived mainly from academic publications. We introduce SPACE-IDEAS, a dataset for salient information detection from innovation ideas related to the Space domain. The text in SPACE-IDEAS varies greatly and includes informal, technical, academic and business-oriented writing styles. In addition to a manually annotated dataset we release an extended version that is annotated using a large generative language model. We train different sentence and sequential sentence classifiers, and show that the automatically annotated dataset can be leveraged using multitask learning to train better classifiers.
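The multitask setup mentioned in the abstract can be illustrated on the data side. The following is a minimal sketch, not the authors' implementation. It assumes the manually and automatically annotated datasets are treated as two tasks whose batches are interleaved, so that a shared encoder could route each batch to its own task head. The task names, toy sentences, placeholder labels, and the helpers `make_batches` and `multitask_schedule` are all hypothetical.

```python
import random


def make_batches(examples, batch_size):
    """Split a list of examples into consecutive batches."""
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]


def multitask_schedule(datasets, batch_size=2, seed=0):
    """Return a shuffled list of (task_name, batch) pairs.

    Each batch holds examples from one task only, so a shared encoder
    could pass it to that task's own classification head.
    """
    schedule = [
        (task, batch)
        for task, examples in datasets.items()
        for batch in make_batches(examples, batch_size)
    ]
    random.Random(seed).shuffle(schedule)
    return schedule


# Hypothetical toy data: (sentence, label) pairs with placeholder labels.
datasets = {
    "manual": [
        ("sentence a", "label_x"),
        ("sentence b", "label_y"),
        ("sentence c", "label_x"),
    ],
    "auto": [
        ("sentence d", "label_y"),
        ("sentence e", "label_x"),
        ("sentence f", "label_y"),
        ("sentence g", "label_x"),
    ],
}

schedule = multitask_schedule(datasets, batch_size=2)
for task, batch in schedule:
    # A real training loop would compute a task-specific loss here,
    # using a shared encoder and the head that belongs to `task`.
    print(task, len(batch))
```

Keeping each batch single-task is one common design choice for multitask training; mixing tasks within a batch is an alternative that this sketch does not cover.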

Code Implementations: 1 repo
