LG | CV | ML | Nov 30, 2024

TAROT: Targeted Data Selection via Optimal Transport

arXiv:2412.00420v2 | 7 citations | h-index: 13 | Has Code | ICML
Originality: Incremental advance
AI Analysis

The paper addresses a domain-specific problem for machine learning practitioners by providing a more effective data selection method. The advance is incremental, since it builds on existing targeted data selection approaches.

The paper tackles suboptimal data selection when the target data follows a multimodal distribution. It proposes TAROT, a framework that uses optimal transport theory to improve selection, and reports that it outperforms state-of-the-art methods on tasks such as semantic segmentation and motion prediction.

We propose TAROT, a targeted data selection framework grounded in optimal transport theory. Previous targeted data selection methods primarily rely on influence-based greedy heuristics to enhance domain-specific performance. While effective on limited, unimodal data (i.e., data following a single pattern), these methods struggle as target data complexity increases. Specifically, in multimodal distributions, these heuristics fail to account for multiple inherent patterns, leading to suboptimal data selection. This work identifies two primary factors contributing to this limitation: (i) the disproportionate impact of dominant feature components in high-dimensional influence estimation, and (ii) the restrictive linear additive assumptions inherent in greedy selection strategies. To address these challenges, TAROT incorporates whitened feature distance to mitigate dominant feature bias, providing a more reliable measure of data influence. Building on this, TAROT uses whitened feature distance to quantify and minimize the optimal transport distance between the selected data and target domains. Notably, this minimization also facilitates the estimation of optimal selection ratios. We evaluate TAROT across multiple tasks, including semantic segmentation, motion prediction, and instruction tuning. Results consistently show that TAROT outperforms state-of-the-art methods, highlighting its versatility across various deep learning tasks. Code is available at https://github.com/vita-epfl/TAROT.
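The two ingredients named in the abstract, a whitened feature distance and an optimal transport distance between selected data and the target domain, can be illustrated with a small sketch. This is not the authors' implementation (that lives in the linked repository). It assumes PCA whitening estimated from a pooled feature matrix, Euclidean distance in the whitened space, and an entropic (Sinkhorn) approximation of the transport cost with uniform weights. The function names `whiten`, `pairwise_distance`, and `sinkhorn_cost`, the parameters `eps`, `reg`, and `n_iter`, and the toy data are all hypothetical.

```python
import numpy as np


def whiten(features, reference, eps=1e-6):
    """Whiten `features` with the mean and covariance of `reference` (assumed PCA whitening)."""
    mean = reference.mean(axis=0)
    cov = np.cov(reference, rowvar=False)
    # Symmetric eigendecomposition; eps guards against near-zero eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov)
    scale = 1.0 / np.sqrt(np.clip(eigvals, 0.0, None) + eps)
    # Equivalent to eigvecs @ diag(scale): each eigen-direction gets unit variance.
    return (features - mean) @ (eigvecs * scale)


def pairwise_distance(a, b):
    """Euclidean distance between every row of `a` and every row of `b`."""
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.sqrt(np.clip(sq, 0.0, None))


def sinkhorn_cost(cost, reg=0.05, n_iter=500):
    """Approximate OT cost between two uniformly weighted point sets (entropic regularisation)."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    # Scale the regulariser by the largest cost so the kernel does not underflow.
    kernel = np.exp(-cost / (reg * cost.max()))
    u = np.ones(n)
    for _ in range(n_iter):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum())


# Toy data: the target has two modes that differ only along dimension 1,
# while dimension 0 carries much larger but uninformative variance.
rng = np.random.default_rng(0)
noise_scale = np.array([50.0, 0.3])
mode_a, mode_b = np.array([0.0, -2.0]), np.array([0.0, 2.0])


def sample(center, n):
    return center + rng.normal(size=(n, 2)) * noise_scale


target = np.vstack([sample(mode_a, 40), sample(mode_b, 40)])
only_a = sample(mode_a, 40)                                   # covers a single mode
both = np.vstack([sample(mode_a, 20), sample(mode_b, 20)])    # covers both modes
pool = np.vstack([target, only_a, both])

target_w = whiten(target, pool)
cost_only_a = sinkhorn_cost(pairwise_distance(whiten(only_a, pool), target_w))
cost_both = sinkhorn_cost(pairwise_distance(whiten(both, pool), target_w))
print(f"OT cost, single-mode subset: {cost_only_a:.3f}")
print(f"OT cost, two-mode subset:    {cost_both:.3f}")
```

In this toy setup, the subset that covers both target modes should obtain the lower transport cost. That is the property a transport-based selection criterion rewards and that a purely additive per-sample score cannot express. Whitening matters here because, in raw coordinates, the high-variance dimension 0 would dominate every distance.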

Code Implementations: 1 repo
