Timur Galimzyanov

LG
h-index18
4papers
88citations
Novelty34%
AI Score36

4 Papers

SEDec 3, 2024Code
Drawing Pandas: A Benchmark for LLMs in Generating Plotting Code

Timur Galimzyanov, Sergey Titov, Yaroslav Golubev et al.

This paper introduces the human-curated PandasPlotBench dataset, designed to evaluate language models' effectiveness as assistants in visual data exploration. Our benchmark focuses on generating code for visualizing tabular data - such as a Pandas DataFrame - based on natural language instructions, complementing current evaluation tools and expanding their scope. The dataset includes 175 unique tasks. Our experiments assess several leading Large Language Models (LLMs) across three visualization libraries: Matplotlib, Seaborn, and Plotly. We show that the shortening of tasks has a minimal effect on plotting capabilities, allowing for the user interface that accommodates concise user input without sacrificing functionality or accuracy. Another of our findings reveals that while LLMs perform well with popular libraries like Matplotlib and Seaborn, challenges persist with Plotly, highlighting areas for improvement. We hope that the modular design of our benchmark will broaden the current studies on generating visualizations. Our dataset and benchmark code are available online: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench; https://github.com/JetBrains-Research/PandasPlotBench.

LGJun 17, 2024Code
Long Code Arena: a Set of Benchmarks for Long-Context Code Models

Egor Bogomolov, Aleksandra Eliseeva, Timur Galimzyanov et al.

Nowadays, the fields of code and natural language processing are evolving rapidly. In particular, models become better at processing long context windows - supported context sizes have increased by orders of magnitude over the last few years. However, there is a shortage of benchmarks for code processing that go beyond a single file of context, while the most popular ones are limited to a single method. With this work, we aim to close this gap by introducing Long Code Arena, a suite of six benchmarks for code processing tasks that require project-wide context. These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization. For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions based on popular LLMs to showcase the usage of the dataset and to simplify adoption by other researchers. We publish the benchmark page on HuggingFace Spaces with the leaderboard, links to HuggingFace Hub for all the datasets, and link to the GitHub repository with baselines: https://huggingface.co/spaces/JetBrains-Research/long-code-arena.

SEDec 14, 2023
Dynamic Retrieval-Augmented Generation

Anton Shapkin, Denis Litvinov, Yaroslav Zharov et al.

Current state-of-the-art large language models are effective in generating high-quality text and encapsulating a broad spectrum of world knowledge. These models, however, often hallucinate and lack locally relevant factual data. Retrieval-augmented approaches were introduced to overcome these problems and provide more accurate responses. Typically, the retrieved information is simply appended to the main request, restricting the context window size of the model. We propose a novel approach for the Dynamic Retrieval-Augmented Generation (DRAG), based on the entity-augmented generation, which injects compressed embeddings of the retrieved entities into the generative model. The proposed pipeline was developed for code-generation tasks, yet can be transferred to some domains of natural language processing. To train the model, we collect and publish a new project-level code generation dataset. We use it for the evaluation along with publicly available datasets. Our approach achieves several targets: (1) lifting the length limitations of the context window, saving on the prompt size; (2) allowing huge expansion of the number of retrieval entities available for the context; (3) alleviating the problem of misspelling or failing to find relevant entity names. This allows the model to beat all baselines (except GPT-3.5) with a strong margin.

LGOct 23, 2025
Practical Code RAG at Scale: Task-Aware Retrieval Design Choices under Compute Budgets

Timur Galimzyanov, Olga Kolomyttseva, Egor Bogomolov

We study retrieval design for code-focused generation tasks under realistic compute budgets. Using two complementary tasks from Long Code Arena -- code completion and bug localization -- we systematically compare retrieval configurations across various context window sizes along three axes: (i) chunking strategy, (ii) similarity scoring, and (iii) splitting granularity. (1) For PL-PL, sparse BM25 with word-level splitting is the most effective and practical, significantly outperforming dense alternatives while being an order of magnitude faster. (2) For NL-PL, proprietary dense encoders (Voyager-3 family) consistently beat sparse retrievers, however requiring 100x larger latency. (3) Optimal chunk size scales with available context: 32-64 line chunks work best at small budgets, and whole-file retrieval becomes competitive at 16000 tokens. (4) Simple line-based chunking matches syntax-aware splitting across budgets. (5) Retrieval latency varies by up to 200x across configurations; BPE-based splitting is needlessly slow, and BM25 + word splitting offers the best quality-latency trade-off. Thus, we provide evidence-based recommendations for implementing effective code-oriented RAG systems based on task requirements, model constraints, and computational efficiency.