Quantifying and extending the coverage of spatial categorization data sets
This work addresses the challenge of scaling spatial data sets for linguistic research. The contribution is incremental, building on existing methods and data sets.
The authors address the limited coverage of spatial categorization data sets by using large language models (LLMs) to generate labels that align relatively well with human labels. They use these labels to add 42 new scenes to the Topological Relations Picture Series (TRPS), achieving better coverage of the space of possible scenes than two previous extensions.
Variation in spatial categorization across languages is often studied by eliciting human labels for the relations depicted in a set of scenes known as the Topological Relations Picture Series (TRPS). We demonstrate that labels generated by large language models (LLMs) align relatively well with human labels, and show how LLM-generated labels can help to decide which scenes and languages to add to existing spatial data sets. To illustrate our approach we extend the TRPS by adding 42 new scenes, and show that this extension achieves better coverage of the space of possible scenes than two previous extensions of the TRPS. Our results provide a foundation for scaling towards spatial data sets with dozens of languages and hundreds of scenes.