DL CL · Dec 20, 2024

Enriching Social Science Research via Survey Item Linking

Tornike Tsereteli, Daniel Ruffinelli, Simone Paolo Ponzetto

arXiv:2412.15831v11.2 · h-index: 12 · Has Code

Originality: Synthesis-oriented

AI Analysis

This paper addresses a problem for social science researchers by improving the traceability and comparability of survey items, though the contribution is incremental: it builds on existing linking tasks and adds a new dataset and analysis.

The paper tackles the challenge of automatically linking implicit survey item mentions in social science research to a knowledge base, creating a high-quality dataset of 20,454 sentences and demonstrating feasibility with deep learning systems, though error propagation limits overall performance.

Questions within surveys, called survey items, are used in the social sciences to study latent concepts, such as the factors influencing life satisfaction. Instead of using explicit citations, researchers paraphrase the content of the survey items they use in-text. However, this makes it challenging to find survey items of interest when comparing related work. Automatically parsing and linking these implicit mentions to survey items in a knowledge base can provide more fine-grained references. We model this task, called Survey Item Linking (SIL), in two stages: mention detection and entity disambiguation. Due to an imprecise definition of the task, existing datasets used for evaluating performance on SIL are too small and of low quality. We argue that latent concepts and survey item mentions should be differentiated. To this end, we create a high-quality and richly annotated dataset consisting of 20,454 English and German sentences. By benchmarking deep learning systems for each of the two stages independently and sequentially, we demonstrate that the task is feasible, but observe that errors propagate from the first stage, leading to lower overall task performance. Moreover, mentions that require the context of multiple sentences are more challenging for models to identify in the first stage. Modeling the entire context of a document and combining the two stages into an end-to-end system could mitigate these problems in future work, and errors could additionally be reduced by collecting more diverse data and by improving the quality of the knowledge base. The data and code are available at https://github.com/e-tornike/SIL.
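
To make the two-stage framing concrete, the following is a minimal toy sketch of a sequential SIL pipeline. It is not the authors' implementation: the keyword-cue detector and the Jaccard-overlap ranker are simple stand-ins for the deep learning systems the abstract mentions, and every name here (`SurveyItem`, `detect_mentions`, `disambiguate`, `link`, the item IDs, and the example sentences) is hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SurveyItem:
    item_id: str  # hypothetical identifier scheme
    text: str     # question wording stored in the knowledge base


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for a learned text encoder."""
    return set(re.findall(r"[a-z]+", text.lower()))


def detect_mentions(sentences: list[str], cue_words: set[str]) -> list[int]:
    """Stage 1 (mention detection): indices of sentences containing a cue word."""
    return [i for i, s in enumerate(sentences) if tokens(s) & cue_words]


def disambiguate(sentence: str, kb: list[SurveyItem]) -> str | None:
    """Stage 2 (entity disambiguation): KB item with the highest Jaccard overlap."""
    query = tokens(sentence)
    best_id, best_score = None, 0.0
    for item in kb:
        cand = tokens(item.text)
        union = query | cand
        score = len(query & cand) / len(union) if union else 0.0
        if score > best_score:
            best_id, best_score = item.item_id, score
    return best_id


def link(sentences: list[str], kb: list[SurveyItem],
         cue_words: set[str]) -> dict[int, str | None]:
    """Sequential pipeline: only sentences kept by stage 1 reach stage 2."""
    return {i: disambiguate(sentences[i], kb)
            for i in detect_mentions(sentences, cue_words)}


# Illustrative data only; not taken from the paper's dataset.
kb = [
    SurveyItem("Q1", "How satisfied are you with your life as a whole?"),
    SurveyItem("Q2", "How much do you trust the national parliament?"),
]
sentences = [
    "Respondents were asked how satisfied they are with their life.",
    "Trust in the national parliament was lower among younger participants.",
    "The weather was pleasant during fieldwork.",
]
cue_words = {"asked", "respondents"}

links = link(sentences, kb, cue_words)
print(links)                            # {0: 'Q1'}
print(disambiguate(sentences[1], kb))   # Q2
```

In this toy run, the second sentence is missed by stage 1 even though stage 2 would have linked it correctly in isolation, which illustrates the kind of first-stage error propagation the abstract describes.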
