Building a Custom Taxonomy of AI Skills and Tasks from the Ground Up with Job Postings
For researchers and practitioners building domain-specific taxonomies from large corpora, this paper offers a systematic study of data inclusion decisions, though its results are incremental.
The paper investigates how best to leverage large-scale job postings for automated construction of a taxonomy of AI skills. It finds that filtering the inputs to the authors' TaxonomyBuilder framework yields better domain-specific coverage than passing unfiltered data to clustering and LLM-based labeling tools.
Utilizing LLMs for automated taxonomy construction presents a clear opportunity for the comprehensive, yet efficient mapping of potentially complex domains. When contending with high volumes of rapidly growing corpora, however, it becomes unclear how to best leverage such data for optimal taxonomy construction. Taking the case of systematizing AI skills in the workplace, we use two large-scale job postings corpora to investigate key design decisions for the inclusion (or exclusion) of data points for taxonomy construction. We propose TaxonomyBuilder as a blueprint for our systematic study, with which we evaluate various configurations of custom, data-informed, and hierarchical taxonomies. We demonstrate that less data can provide more clarity: filtering inputs to TaxonomyBuilder provides better domain-specific coverage than offering unfiltered inputs to clustering and LLM-enhanced hierarchical taxonomy labeling tools.
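The "less data, more clarity" idea can be illustrated with a toy sketch. This is not the TaxonomyBuilder pipeline, whose filtering criteria and coverage metric are not described in the abstract. The seed terms in `AI_TERMS`, the comma-separated posting format, and the helpers `filter_postings`, `top_skills`, and `domain_coverage` are all hypothetical, chosen only to show how filtering postings before counting skills can raise the share of domain-relevant terms among the most frequent ones.

```python
from collections import Counter

# Hypothetical seed terms; the paper's actual filtering criteria are not given here.
AI_TERMS = {"machine learning", "deep learning", "nlp", "computer vision"}


def mentions_ai(posting: str) -> bool:
    """Return True if the posting contains any seed term (simple substring match)."""
    text = posting.lower()
    return any(term in text for term in AI_TERMS)


def filter_postings(postings):
    """Keep only postings that mention at least one seed term."""
    return [p for p in postings if mentions_ai(p)]


def top_skills(postings, k=3):
    """Count comma-separated skill phrases and return the k most frequent."""
    counts = Counter()
    for posting in postings:
        for phrase in posting.lower().split(","):
            phrase = phrase.strip()
            if phrase:
                counts[phrase] += 1
    return [skill for skill, _ in counts.most_common(k)]


def domain_coverage(skills):
    """Fraction of the given skills that are seed terms (a toy stand-in metric)."""
    if not skills:
        return 0.0
    return sum(skill in AI_TERMS for skill in skills) / len(skills)


# Made-up postings, each a comma-separated list of skills.
postings = [
    "machine learning, python, nlp",
    "forklift operation, inventory, customer service",
    "deep learning, computer vision, machine learning",
    "customer service, inventory",
    "nlp, machine learning",
    "inventory, customer service, scheduling",
]

filtered = filter_postings(postings)
coverage_unfiltered = domain_coverage(top_skills(postings))
coverage_filtered = domain_coverage(top_skills(filtered))
```

On this made-up data, the most frequent skills in the unfiltered set are dominated by general workplace terms, while the filtered set surfaces mostly AI-related ones. A real study would replace the substring filter and the toy metric with the paper's own criteria.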