SE AI | Nov 28, 2024

Structured Object Language Modeling (SoLM): Native Structured Objects Generation Conforming to Complex Schemas with Self-Supervised Denoising

Amir Tavanaei, Kee Kiat Koo, Hayreddin Ceker, Shaobai Jiang, Qi Li, Julien Han, Karim Bouyarmane

arXiv:2411.19301v15.93 citations | h-index: 8 | EMNLP

Originality: Incremental advance

AI Analysis

The paper addresses the challenge of generating self-consistent, grounded structured objects for applications such as data normalization and completion, though the contribution appears incremental because it builds on existing LLM techniques.

The paper tackles the problem of generating structured objects that conform to complex schemas with intricate dependencies, using a self-supervised denoising method to train a language model natively, without instructions. The method matches or outperforms state-of-the-art LLMs such as Claude 3 and Mixtral-8x7B while being an order of magnitude more cost-efficient.

In this paper, we study the problem of generating structured objects that conform to a complex schema, with intricate dependencies between the different components (facets) of the object. The facets of the object (attributes, fields, columns, properties) can be a mix of short, structured, type-constrained facts and long natural-language descriptions. The object has to be self-consistent across its facets in the redundant information it carries (relative consistency), while being grounded with respect to world knowledge (absolute consistency). We frame the problem as a Language Modeling problem (Structured Object Language Modeling) and train an LLM to perform the task natively, without requiring instructions or prompt engineering. We propose a self-supervised denoising method to train the model from an existing dataset of such objects. The input query can be the existing object itself, in which case the model acts as a regenerator that completes, corrects, and normalizes the input, or any unstructured blurb to be structured. We show that the self-supervised denoising training provides a strong baseline, and that additional supervised fine-tuning with a small amount of human demonstrations leads to further improvement. Experimental results show that the proposed method matches or outperforms prompt-engineered general-purpose state-of-the-art LLMs (Claude 3, Mixtral-8x7B), while being an order of magnitude more cost-efficient.
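
To make the self-supervised denoising idea concrete, the following is a minimal sketch of how training pairs might be built from an existing dataset of objects: each clean object is corrupted, and the model would be trained to regenerate the clean object from the noisy one. The abstract does not specify the noise functions, so the two used here (dropping facets and truncating text values), the function names `corrupt` and `make_training_pair`, the probabilities, and the sample product object are all illustrative assumptions, not the authors' method.

```python
import json
import random


def corrupt(obj, rng, drop_prob=0.3, truncate_prob=0.2):
    """Return a noisy copy of obj (hypothetical noise functions).

    Each facet is either dropped, truncated (string values only),
    or kept unchanged.
    """
    noisy = {}
    for key, value in obj.items():
        r = rng.random()
        if r < drop_prob:
            continue  # drop the facet entirely
        if r < drop_prob + truncate_prob and isinstance(value, str):
            words = value.split()
            # keep roughly the first half of the words
            value = " ".join(words[: max(1, len(words) // 2)])
        noisy[key] = value
    return noisy


def make_training_pair(obj, rng):
    """Pair a corrupted object (model input) with the clean object (target)."""
    return {
        "input": json.dumps(corrupt(obj, rng), sort_keys=True),
        "target": json.dumps(obj, sort_keys=True),
    }


if __name__ == "__main__":
    # Hypothetical object with short typed facets and a long description.
    product = {
        "title": "Stainless steel water bottle",
        "capacity_ml": 750,
        "material": "stainless steel",
        "description": "A 750 ml insulated bottle made of stainless steel",
    }
    pair = make_training_pair(product, random.Random(0))
    print(pair["input"])
    print(pair["target"])
```

Because the target is always the original object, no human labels are needed to produce such pairs. The same input format also covers the regeneration use case described above, where an incomplete or inconsistent object is passed in as the query.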
