CL | Jan 29

inversedMixup: Data Augmentation via Inverting Mixed Embeddings

arXiv:2601.21543v2 | h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses controllable and interpretable data augmentation for text. It is an incremental advance that builds on existing Mixup and LLM inversion methods.

The paper tackles the problem of generating human-interpretable augmented text by proposing inversedMixup, which combines Mixup's controllability with the interpretability of LLM-based generation. Experiments in few-shot and fully supervised scenarios show improved augmentation performance.

Mixup generates augmented samples by linearly interpolating inputs and labels with a controllable ratio. However, since it operates at the latent embedding level, the resulting samples are not human-interpretable. In contrast, LLM-based augmentation methods produce sentences via prompts at the token level, yielding readable outputs but offering limited control over the generation process. Inspired by recent advances in LLM inversion, which reconstructs natural language from embeddings and helps bridge the gap between the latent embedding space and the discrete token space, we propose inversedMixup, a unified framework that combines the controllability of Mixup with the interpretability of LLM-based generation. Specifically, inversedMixup adopts a three-stage training procedure to align the output embedding space of a task-specific model with the input embedding space of an LLM. Upon successful alignment, inversedMixup can reconstruct mixed embeddings with a controllable mixing ratio into human-interpretable augmented sentences, thereby improving augmentation performance. Additionally, inversedMixup provides the first empirical evidence of the manifold intrusion phenomenon in text Mixup and introduces a simple yet effective strategy to mitigate it. Extensive experiments demonstrate the effectiveness and generalizability of our approach in both few-shot and fully supervised scenarios.
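
The interpolation step the abstract describes can be illustrated with a minimal sketch. It shows only the generic Mixup operation (a convex combination of two embeddings and their labels under a mixing ratio), not the paper's three-stage alignment or the inversion of embeddings back into text. The function names `mixup` and `sample_ratio`, the one-hot label encoding, and the `alpha` default are assumptions made for illustration. Drawing the ratio from a Beta(alpha, alpha) distribution follows the original Mixup formulation and may differ from what inversedMixup does.

```python
import random


def mixup(emb_a, emb_b, label_a, label_b, lam):
    """Convex combination of two embeddings and their (one-hot) labels.

    lam is the mixing ratio: lam = 1.0 returns sample A, lam = 0.0 sample B.
    Hypothetical helper for illustration, not the paper's implementation.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(emb_a) != len(emb_b) or len(label_a) != len(label_b):
        raise ValueError("embeddings and labels must have matching sizes")
    mixed_emb = [lam * a + (1.0 - lam) * b for a, b in zip(emb_a, emb_b)]
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_emb, mixed_label


def sample_ratio(alpha=0.4, rng=random):
    """Draw a mixing ratio from Beta(alpha, alpha), as in the original Mixup.

    The alpha value here is an assumed default, not taken from the paper.
    """
    return rng.betavariate(alpha, alpha)


# Two toy 2-dimensional "sentence embeddings" with one-hot class labels.
emb, label = mixup([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], lam=0.75)
```

With `lam=0.75`, the mixed embedding and the soft label both sit three quarters of the way toward the first sample. In the framework described above, such a mixed embedding would then be passed to the aligned LLM to be reconstructed as a readable sentence; that step is not sketched here.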

