Mix and Match: Context Pairing for Scalable Topic-Controlled Educational Summarisation

Nathikan Yodthapa, Thanapong Intharah, Sahan Bulathwela

arXiv:2604.18087 · 74.9 · h-index: 7

Predicted impact: top 83% in CL · last 90 days
Originality: Incremental advance

AI Analysis

This work addresses the challenge of training small language models for topic-controlled summarisation with limited data, offering a scalable data augmentation strategy that improves performance without requiring larger models or more real data.

The paper proposes a pairwise data augmentation method for topic-controlled summarisation that combines contexts from different documents to create contrastive training examples. Using the SciTLDR dataset, a T5-base model trained with this approach achieves competitive performance relative to larger models while using significantly fewer parameters and substantially fewer real training examples.

Topic-controlled summarisation enables users to generate summaries focused on specific aspects of source documents. This paper investigates a data augmentation strategy for training small language models (sLMs) to perform topic-controlled summarisation. We propose a pairwise data augmentation method that combines contexts from different documents to create contrastive training examples, enabling models to learn the relationship between topics and summaries more effectively. Using the SciTLDR dataset enriched with Wikipedia-derived topics, we systematically evaluate how augmentation scale affects model performance. Results show consistent improvements in win rate and semantic alignment as the augmentation scale increases, while the amount of real training data remains fixed. Consequently, a T5-base model trained with our augmentation approach achieves competitive performance relative to larger models, despite using significantly fewer parameters and substantially fewer real training examples.
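
The abstract does not spell out the exact pairing procedure, so the following is only a minimal sketch of one way context pairing could work, under stated assumptions: each real example is concatenated with the context of a different-topic document, the order of the two contexts is shuffled, and the topic and reference summary of the original example are kept as the target. The names `Example`, `pair_contexts`, and `pairs_per_example` are hypothetical and do not come from the paper.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    context: str  # source document text
    topic: str    # topic label associated with the document
    summary: str  # reference summary for that topic


def pair_contexts(examples, pairs_per_example, seed=0):
    """Build augmented examples by mixing each context with other documents.

    Assumption (not from the paper): the target topic and summary stay those
    of the original example, so the model has to use the topic to decide
    which half of the mixed context to summarise.
    """
    rng = random.Random(seed)
    augmented = []
    for i, target in enumerate(examples):
        # Candidate distractors: other documents with a different topic.
        others = [
            e for j, e in enumerate(examples)
            if j != i and e.topic != target.topic
        ]
        k = min(pairs_per_example, len(others))
        for distractor in rng.sample(others, k):
            contexts = [target.context, distractor.context]
            rng.shuffle(contexts)  # avoid a positional shortcut
            augmented.append(
                Example(" ".join(contexts), target.topic, target.summary)
            )
    return augmented
```

In this sketch, `pairs_per_example` plays the role of the augmentation scale: raising it increases the number of synthetic training examples while the set of real examples stays fixed.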
