AI | CL | Sep 30, 2025

DeepJSONEval: Benchmarking Complex Nested JSON Data Mining for Large Language Models

Zhicheng Zhou, Jing Li, Suming Qiu, Junjie Huang, Linyuan Qiu, Zhijie Sun

arXiv:2509.25922v1 | 7.81 citations | h-index: 7 | Has Code

Originality: Incremental advance

AI Analysis

The work addresses a practical limitation in web data mining by giving researchers and practitioners a more relevant benchmark, though the advance is incremental because it builds on existing JSON generation benchmarks.

The authors tackled the problem of evaluating large language models' ability to comprehend and extract data into complex nested JSON structures for web data mining, introducing DeepJSONEval, a benchmark with 2100 multi-domain instances that revealed significant performance gaps among LLMs.

The internet is saturated with low-density, high-redundancy information, such as social media comments, repetitive news, and lengthy discussions, making it difficult to extract valuable insights efficiently. Multi-layer nested JSON structures provide an effective solution by compressing such information into semantically rich, hierarchical representations that organize data into key-value pairs, arrays, and nested objects, preserving contextual relationships and enabling efficient storage, retrieval, and semantic querying. For instance, in news aggregation, a JSON object can hierarchically nest an article's metadata (title, author, date), content (text, multimedia), and multimedia information (multimedia type, caption). Large Language Models (LLMs) play a transformative role in web data mining by parsing unstructured text and outputting structured results directly into complex JSON schemas. However, current benchmarks for evaluating LLMs' JSON output capabilities overemphasize pure JSON generation rather than assessing data comprehension and extraction abilities, which limits their relevance to practical web data mining tasks. To address this, we introduce DeepJSONEval, a novel benchmark featuring 2100 multi-domain instances with deep nested structures, categorized by difficulty. Experiments show significant performance gaps among LLMs in handling such complexity. Our benchmark and datasets are open-sourced to advance research in structured JSON generation (https://github.com/GTS-AI-Infra-Lab-SotaS/DeepJSONEval).
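The news-aggregation example in the abstract can be made concrete with a short Python sketch. It shows one possible layout of the nested object, a helper that measures nesting depth, and a simple leaf-level comparison between an extracted object and a reference object. The field values, the choice to place multimedia information under `content`, and the helper names `nesting_depth`, `flatten`, and `leaf_accuracy` are all assumptions made for illustration; in particular, the comparison is not the scoring method used by DeepJSONEval, which this summary does not describe.

```python
import json

# Hypothetical record: field names follow the abstract's news example,
# values and the exact layout are invented for illustration.
article = {
    "metadata": {
        "title": "Example headline",
        "author": "A. Writer",
        "date": "2025-09-30",
    },
    "content": {
        "text": "Body text of the article.",
        "multimedia": [
            {"type": "image", "caption": "A sample caption"},
            {"type": "video", "caption": "Another caption"},
        ],
    },
}


def nesting_depth(value):
    """Return the nesting depth: scalars are 0, each dict or list adds 1."""
    if isinstance(value, dict):
        return 1 + max((nesting_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((nesting_depth(v) for v in value), default=0)
    return 0


def flatten(value, prefix=""):
    """Map every scalar leaf to a path such as 'content.multimedia[0].type'."""
    leaves = {}
    if isinstance(value, dict):
        for key, child in value.items():
            leaves.update(flatten(child, f"{prefix}.{key}" if prefix else key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            leaves.update(flatten(child, f"{prefix}[{index}]"))
    else:
        leaves[prefix] = value
    return leaves


def leaf_accuracy(predicted, gold):
    """Fraction of gold leaves that the prediction reproduces exactly.

    An illustrative measure only, not the benchmark's own metric.
    """
    gold_leaves = flatten(gold)
    predicted_leaves = flatten(predicted)
    if not gold_leaves:
        return 1.0
    hits = sum(
        1
        for path, expected in gold_leaves.items()
        if path in predicted_leaves and predicted_leaves[path] == expected
    )
    return hits / len(gold_leaves)


if __name__ == "__main__":
    # Simulate a model output that gets one caption wrong.
    predicted = json.loads(json.dumps(article))  # deep copy via JSON round trip
    predicted["content"]["multimedia"][1]["caption"] = "Wrong caption"
    print("depth:", nesting_depth(article))
    print("leaf accuracy:", leaf_accuracy(predicted, article))
```

Comparing by leaf path rather than by whole-object equality gives partial credit when most of a deep structure is extracted correctly, which is one reasonable way to expose the kind of performance gaps the abstract mentions.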
