Fine-Tuning LLMs with Fine-Grained Human Feedback on Text Spans
The paper addresses the problem of improving language model alignment with human preferences through more structured feedback, though it appears incremental because it builds on existing preference tuning methods.
The paper tackles fine-tuning language models with human feedback by introducing a method in which annotators give fine-grained feedback on specific text spans and the model incrementally rewrites the disliked spans, producing improvement chains. The results show that this approach outperforms direct alignment methods based on standard A/B preference ranking or full contrastive rewrites, leading to more efficient and effective preference tuning.
We present a method and dataset for fine-tuning language models with preference supervision using feedback-driven improvement chains. Given a model response, an annotator provides fine-grained feedback by marking ``liked'' and ``disliked'' spans and specifying what they liked or disliked about them. The base model then rewrites the disliked spans accordingly, proceeding from left to right, forming a sequence of incremental improvements. We construct preference pairs for direct alignment from each adjacent step in the chain, enabling the model to learn from localized, targeted edits. We find that our approach outperforms direct alignment methods based on standard A/B preference ranking or full contrastive rewrites, demonstrating that structured, revision-based supervision leads to more efficient and effective preference tuning.
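The chain construction and pairing described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `DislikedSpan`, `build_chain`, and `adjacent_pairs` are hypothetical, the `rewrite` callable stands in for the base model's span rewriting, spans are assumed to be non-overlapping character ranges, and the `prompt`/`chosen`/`rejected` field names follow a common convention for preference data rather than anything specified in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DislikedSpan:
    """Hypothetical annotation record for one disliked span."""
    start: int     # character offset in the ORIGINAL response (inclusive)
    end: int       # character offset in the ORIGINAL response (exclusive)
    feedback: str  # annotator's note on what they disliked


def build_chain(
    response: str,
    spans: List[DislikedSpan],
    rewrite: Callable[[str, str], str],
) -> List[str]:
    """Rewrite disliked spans left to right, keeping every intermediate text.

    `rewrite(span_text, feedback)` is a stand-in for the base model
    (an assumption of this sketch). Spans are assumed not to overlap.
    """
    chain = [response]
    current = response
    shift = 0  # how far offsets have moved because of earlier rewrites
    for span in sorted(spans, key=lambda s: s.start):
        start, end = span.start + shift, span.end + shift
        new_text = rewrite(current[start:end], span.feedback)
        current = current[:start] + new_text + current[end:]
        shift += len(new_text) - (end - start)
        chain.append(current)
    return chain


def adjacent_pairs(prompt: str, chain: List[str]) -> List[Dict[str, str]]:
    """Turn each adjacent step of the chain into one preference pair.

    The later step is treated as preferred over the step before it.
    """
    return [
        {"prompt": prompt, "chosen": chain[i + 1], "rejected": chain[i]}
        for i in range(len(chain) - 1)
    ]
```

A response with n disliked spans yields a chain of n + 1 texts and therefore n preference pairs, each differing by a single localized edit. Sorting by start offset and tracking the cumulative shift keeps the original annotation offsets valid even when a rewrite changes the span length.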