CY CL · Oct 1, 2025

Extracting O*NET Features from the NLx Corpus to Build Public Use Aggregate Labor Market Data

Stephen Meisenbacher, Svetlozar Nestorov, Peter Norlander

arXiv:2510.01470v11.21 citations · h-index: 15 · Has Code

Originality Synthesis-oriented

AI Analysis

This work provides standardized, public labor market data for researchers and policymakers, though its contribution is incremental: it applies existing NLP methods to a new dataset.

The authors address the problem of inaccessible, non-standardized online job posting data by developing the Job Ad Analysis Toolkit (JAAT), which extracts structured O*NET features from job ads. Applied to more than 155 million job ads, it yields more than 10 billion data points, with reliability and accuracy demonstrated in out-of-sample and LLM-as-a-Judge testing.

Data from online job postings are difficult to access and are not built in a standard or transparent manner. Data included in the standard taxonomy and occupational information database (O*NET) are updated infrequently and based on small survey samples. We adopt O*NET as a framework for building natural language processing tools that extract structured information from job postings. We publish the Job Ad Analysis Toolkit (JAAT), a collection of open-source tools built for this purpose, and demonstrate its reliability and accuracy in out-of-sample and LLM-as-a-Judge testing. We extract more than 10 billion data points from more than 155 million online job ads provided by the National Labor Exchange (NLx) Research Hub, including O*NET tasks, occupation codes, tools, and technologies, as well as wages, skills, industry, and more features. We describe the construction of a dataset of occupation, state, and industry level features aggregated by monthly active jobs from 2015 - 2025. We illustrate the potential for research and future uses in education and workforce development.
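The abstract's aggregation "by monthly active jobs" can be illustrated with a minimal Python sketch. This is not the authors' pipeline or the JAAT API: the field names (`soc`, `state`, `posted`, `closed`), the sample records, and the definition of "active" (a posting counts in every calendar month between its posting and closing dates, inclusive) are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date


def months_active(posted: date, closed: date):
    """Yield (year, month) for every calendar month the posting overlaps."""
    year, month = posted.year, posted.month
    while (year, month) <= (closed.year, closed.month):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def aggregate_active_jobs(postings):
    """Count active postings per (year, month, occupation code, state)."""
    counts = Counter()
    for p in postings:
        for year, month in months_active(p["posted"], p["closed"]):
            counts[(year, month, p["soc"], p["state"])] += 1
    return counts


# Made-up records; field names and values are hypothetical.
postings = [
    {"soc": "15-1252", "state": "IL",
     "posted": date(2024, 11, 20), "closed": date(2025, 1, 5)},
    {"soc": "15-1252", "state": "IL",
     "posted": date(2025, 1, 10), "closed": date(2025, 1, 25)},
    {"soc": "29-1141", "state": "NY",
     "posted": date(2025, 1, 3), "closed": date(2025, 2, 1)},
]

counts = aggregate_active_jobs(postings)
for key in sorted(counts):
    print(key, counts[key])
```

Under this assumed definition, a posting open across a year boundary contributes to each month it spans, so the same ad can appear in several monthly cells. Adding an industry code to the key would give the occupation, state, and industry grouping the abstract mentions.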
