Generating Hierarchical JSON Representations of Scientific Sentences Using LLMs
The paper addresses the problem of structured data extraction for scientific text processing, though the contribution appears incremental because it builds on existing LLM and reconstruction methods.
The paper tackles the problem of preserving the meaning of scientific sentences through structured representations. It fine-tunes a lightweight LLM with a novel structural loss to generate hierarchical JSON from sentences, then reconstructs text from these JSONs. The results show that hierarchical formats retain information effectively, as measured by semantic and lexical similarity between the original and reconstructed sentences.
This paper investigates whether structured representations can preserve the meaning of scientific sentences. To test this, a lightweight LLM is fine-tuned using a novel structural loss function to generate hierarchical JSON structures from sentences collected from scientific articles. These JSONs are then used by a generative model to reconstruct the original text. Comparing the original and reconstructed sentences using semantic and lexical similarity, we show that hierarchical formats are capable of effectively retaining the information in scientific texts.
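The evaluation idea in the abstract, comparing an original sentence with its reconstruction, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the JSON schema (the keys `subject`, `action`, `object`), the helper names `tokenize`, `jaccard_similarity`, and `flatten_leaves`, and the choice of token-level Jaccard overlap as a stand-in for lexical similarity. The abstract does not specify the schema or the exact metrics, and the semantic similarity measure (which would typically need a sentence embedding model) is not shown.

```python
import json
import re

# Hypothetical hierarchical JSON for one sentence. The keys and nesting are
# an illustrative assumption, not the schema used in the paper.
example_json = json.loads("""
{
  "subject": {"entity": "lightweight LLM", "modifier": "fine-tuned"},
  "action": "generates",
  "object": {"entity": "hierarchical JSON", "source": "scientific sentences"}
}
""")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard_similarity(original, reconstructed):
    """Token-set overlap, used here as a simple proxy for lexical similarity."""
    a, b = set(tokenize(original)), set(tokenize(reconstructed))
    if not a and not b:
        return 1.0  # two empty sentences are treated as identical
    return len(a & b) / len(a | b)


def flatten_leaves(node):
    """Collect the leaf values of a nested JSON object in document order."""
    if isinstance(node, dict):
        return [leaf for value in node.values() for leaf in flatten_leaves(value)]
    if isinstance(node, list):
        return [leaf for item in node for leaf in flatten_leaves(item)]
    return [str(node)]


if __name__ == "__main__":
    original = "A fine-tuned lightweight LLM generates hierarchical JSON from scientific sentences."
    # In the paper's pipeline a generative model produces the reconstruction;
    # here the leaves are simply joined so the example runs without a model.
    reconstructed = " ".join(flatten_leaves(example_json))
    print(reconstructed)
    print(round(jaccard_similarity(original, reconstructed), 3))
```

A set-based overlap ignores word order and repetition, so it only gives a rough lower bound on how much surface content survives the round trip; it is not a substitute for the metrics the authors report.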