CL · AI · Nov 23, 2025

OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph

arXiv:2511.18622v1
Originality: Incremental advance
AI Analysis

OpenGloss addresses gaps in pedagogical applications and NLP tasks by providing integrated content such as definitions and examples. The advance is incremental, however, because it builds on existing lexical resources and generation methods.

The authors address the problem of building comprehensive lexical resources by introducing OpenGloss, a synthetic encyclopedic dictionary and semantic knowledge graph for English. It contains 537K senses across 150K lexemes with 9.1M semantic edges, and it was generated in under one week for under $1,000. The result demonstrates that structured generation can produce such resources at cost and time scales impractical for manual curation.

We present OpenGloss, a synthetic encyclopedic dictionary and semantic knowledge graph for English that integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a unified resource. OpenGloss contains 537K senses across 150K lexemes, on par with WordNet 3.1 and Open English WordNet, while providing more than four times as many sense definitions. These lexemes include 9.1M semantic edges, 1M usage examples, 3M collocations, and 60M words of encyclopedic content. Generated through a multi-agent procedural generation pipeline with schema-validated LLM outputs and automated quality assurance, the entire resource was produced in under one week for under $1,000. This demonstrates that structured generation can create comprehensive lexical resources at cost and time scales impractical for manual curation, enabling rapid iteration as foundation models improve. The resource addresses gaps in pedagogical applications by providing integrated content -- definitions, examples, collocations, encyclopedias, etymology -- that supports both vocabulary learning and natural language processing tasks. As a synthetically generated resource, OpenGloss reflects both the capabilities and limitations of current foundation models. The dataset is publicly available on Hugging Face under CC-BY 4.0, enabling researchers and educators to build upon and adapt this resource.
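The abstract mentions schema-validated LLM outputs without giving the schema. As a rough illustration of the idea only, the sketch below checks one JSON-encoded sense record before it is accepted. The field names (`lexeme`, `part_of_speech`, `definition`, `examples`, `edges`), the relation labels, and the `validate_sense` helper are all hypothetical and are not taken from the paper or the released dataset.

```python
import json
from dataclasses import dataclass, field

# Hypothetical relation labels; the paper's actual edge types are not listed here.
ALLOWED_RELATIONS = {"synonym", "antonym", "hypernym", "hyponym"}


@dataclass
class SenseRecord:
    """Hypothetical shape of one generated sense entry."""
    lexeme: str
    part_of_speech: str
    definition: str
    examples: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (relation, target) pairs


def validate_sense(raw: str) -> SenseRecord:
    """Parse one JSON string; raise ValueError if it breaks the assumed schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")

    # Required text fields must be present and non-empty.
    for key in ("lexeme", "part_of_speech", "definition"):
        value = data.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {key}")

    examples = data.get("examples", [])
    if not isinstance(examples, list) or not all(isinstance(e, str) for e in examples):
        raise ValueError("examples must be a list of strings")

    raw_edges = data.get("edges", [])
    if not isinstance(raw_edges, list):
        raise ValueError("edges must be a list")
    edges = []
    for edge in raw_edges:
        if not isinstance(edge, dict):
            raise ValueError(f"invalid edge: {edge!r}")
        relation, target = edge.get("relation"), edge.get("target")
        if not isinstance(relation, str) or relation not in ALLOWED_RELATIONS:
            raise ValueError(f"invalid edge: {edge!r}")
        if not isinstance(target, str) or not target.strip():
            raise ValueError(f"invalid edge: {edge!r}")
        edges.append((relation, target))

    return SenseRecord(
        lexeme=data["lexeme"],
        part_of_speech=data["part_of_speech"],
        definition=data["definition"],
        examples=examples,
        edges=edges,
    )
```

In a pipeline of this kind, a record that raises `ValueError` would typically be regenerated or discarded rather than stored. Whether OpenGloss handles failures that way is not stated in the abstract.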

