CL | AI | Jan 7

Towards Compositional Generalization of LLMs via Skill Taxonomy Guided Data Synthesis

arXiv:2601.03676v1
h-index: 8
Originality: Incremental advance
AI Analysis

The paper addresses a data bottleneck that limits how LLMs and agents handle complex skill combinations, though the advance is incremental because it builds on existing data synthesis methods.

The paper tackles compositional generalization in LLMs and agent-based systems. It proposes STEPS, a framework that uses a skill taxonomy to generate compositionally challenging data. The authors report that STEPS outperforms existing data synthesis baselines on instruction-following benchmarks and improves compositional generalization in agent tasks.

Large Language Models (LLMs) and agent-based systems often struggle with compositional generalization due to a data bottleneck in which complex skill combinations follow a long-tailed, power-law distribution, limiting both instruction-following performance and generalization in agent-centric tasks. To address this challenge, we propose STEPS, a Skill Taxonomy guided Entropy-based Post-training data Synthesis framework for generating compositionally challenging data. STEPS explicitly targets compositional generalization by uncovering latent relationships among skills and organizing them into an interpretable, hierarchical skill taxonomy using structural information theory. Building on this taxonomy, we formulate data synthesis as a constrained information maximization problem, selecting skill combinations that maximize marginal structural information within the hierarchy while preserving semantic coherence. Experiments on challenging instruction-following benchmarks show that STEPS outperforms existing data synthesis baselines, while also yielding improved compositional generalization in downstream agent-based evaluations.
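
To make the idea of "selecting skill combinations that maximize marginal information while preserving semantic coherence" more concrete, here is a minimal Python sketch of a greedy selector. It is not the STEPS algorithm: the abstract does not give the objective, so the score below is a simple surprisal-style proxy rather than structural information, and the toy taxonomy (`TAXONOMY`), the compatibility set (`COMPATIBLE`), and all function names are hypothetical, chosen only for illustration.

```python
import math
from itertools import combinations

# Hypothetical toy taxonomy (not from the paper): leaf skill -> parent category.
TAXONOMY = {
    "json_output": "formatting",
    "word_limit": "formatting",
    "tool_call": "agent",
    "multi_step_plan": "agent",
}

# Hypothetical coherence constraint: skill pairs assumed to combine sensibly.
COMPATIBLE = {
    frozenset(("json_output", "tool_call")),
    frozenset(("json_output", "word_limit")),
    frozenset(("multi_step_plan", "word_limit")),
}


def nodes_of(combo, taxonomy):
    """Taxonomy nodes touched by a combination: its skills plus their parents."""
    return set(combo) | {taxonomy[s] for s in combo}


def is_coherent(combo, compatible):
    """True when every pair of skills in the combination is marked compatible."""
    return all(frozenset(pair) in compatible for pair in combinations(combo, 2))


def gain(combo, counts, taxonomy):
    """Toy marginal gain (an assumption, not the paper's objective):
    summed surprisal of the touched nodes, so rarely used nodes score higher."""
    n_nodes = len(set(taxonomy) | set(taxonomy.values()))
    total = sum(counts.values()) + n_nodes  # add-one smoothing over all nodes
    return sum(
        -math.log2((counts.get(node, 0) + 1) / total)
        for node in nodes_of(combo, taxonomy)
    )


def select_combinations(taxonomy, compatible, k, size=2):
    """Greedily pick up to k coherent combinations with the highest marginal gain."""
    candidates = [
        combo
        for combo in combinations(sorted(taxonomy), size)
        if is_coherent(combo, compatible)
    ]
    counts, chosen = {}, []
    for _ in range(min(k, len(candidates))):
        best = max(candidates, key=lambda c: gain(c, counts, taxonomy))
        chosen.append(best)
        candidates.remove(best)
        # Update usage counts so already-covered nodes contribute less next round.
        for node in nodes_of(best, taxonomy):
            counts[node] = counts.get(node, 0) + 1
    return chosen
```

Calling `select_combinations(TAXONOMY, COMPATIBLE, k=2)` on this toy data returns two coherent pairs that together cover all four skills, because the second pick is rewarded for touching nodes the first pick left unused. A real system would replace `gain` with the paper's structural-information objective and `is_coherent` with a learned or model-based coherence check.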
