PLAISE · Feb 18, 2024

Solving Data-centric Tasks using Large Language Models

arXiv:2402.11734v2 · 31 citations · h-index: 17 · NAACL-HLT
AI Analysis

This work addresses a practical problem for non-professional programmers and end users who rely on LLMs for tasks such as spreadsheet manipulation, though its improvement to prompt selection methods is incremental.

The paper tackles the problem of deciding how much data, and which data, to include in prompts for LLMs performing data-centric tasks. It introduces a cluster-then-select prompting technique that adds representative rows from the input data to the prompt. This technique outperformed a random-selection baseline on tasks with high syntactic variation in the input table.

Large language models (LLMs) are rapidly replacing help forums like StackOverflow, and are especially helpful for non-professional programmers and end users. These users are often interested in data-centric tasks, such as spreadsheet manipulation and data wrangling, which are hard to solve if the intent is only communicated using a natural-language description, without including the data. But how do we decide how much data and which data to include in the prompt? This paper makes two contributions towards answering this question. First, we create a dataset of real-world NL-to-code tasks manipulating tabular data, mined from StackOverflow posts. Second, we introduce a cluster-then-select prompting technique, which adds the most representative rows from the input data to the LLM prompt. Our experiments show that LLM performance is indeed sensitive to the amount of data passed in the prompt, and that for tasks with a lot of syntactic variation in the input table, our cluster-then-select technique outperforms a random selection baseline.
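To make the cluster-then-select idea concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not say how rows are clustered, so this sketch assumes that rows are grouped by a coarse "syntactic shape" of their cells (runs of letters become `A`, runs of digits become `9`) and that the first row of each of the largest groups is taken as its representative. The names `shape` and `cluster_then_select`, and the sample table, are hypothetical.

```python
import re
from collections import defaultdict


def shape(value: str) -> str:
    """Collapse a cell into a coarse syntactic shape (assumed heuristic).

    Runs of letters become 'A' and runs of digits become '9';
    punctuation and whitespace are kept as-is.
    """
    collapsed = re.sub(r"[A-Za-z]+", "A", value)
    return re.sub(r"\d+", "9", collapsed)


def cluster_then_select(rows, k):
    """Group rows by the shapes of their cells, then pick one row per cluster.

    Clusters are visited from largest to smallest; ties keep the order in
    which the clusters first appeared. At most k rows are returned.
    """
    clusters = defaultdict(list)
    for row in rows:
        signature = tuple(shape(str(cell)) for cell in row)
        clusters[signature].append(row)
    # sorted() is stable, so equally sized clusters keep insertion order.
    ordered = sorted(clusters.values(), key=len, reverse=True)
    return [members[0] for members in ordered[:k]]


# Hypothetical input table with three different date formats.
rows = [
    ("2021-03-04", "Alice"),
    ("2021-07-19", "Bob"),
    ("03/04/2021", "Carol"),
    ("2022-01-01", "Dan"),
    ("July 4 2020", "Eve"),
]

selected = cluster_then_select(rows, k=3)
for row in selected:
    print(row)
```

With this table, a random sample of three rows could easily contain only ISO-style dates, while the sketch returns one row per date format. That mirrors the case the abstract highlights: input tables with a lot of syntactic variation.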

