Sergey Mechtaev

h-index: 13

Papers: 5

Citations: 5

Novelty: 55%

AI Score: 44

Ranked #49,178 of 194,257 authors (top 25%); #452 in SE (top 15%)

5 Papers

7.7 · SE · Mar 30

Compressing Code Context for LLM-based Issue Resolution

Haoxiang Jia, Earl T. Barr, Sergey Mechtaev

Large Language Models (LLMs) are now capable of resolving real-world GitHub issues. However, current approaches overapproximate the code context and suffer from two compounding problems: the prohibitive cost of processing massive inputs, and low effectiveness as noise floods the context window and distracts the model from the bug-fixing signal. Existing compression techniques fail to resolve this tension: generic compressors compromise the semantic integrity of code, while code-specific tools lack awareness of code structure and task context to preserve essential patch ingredients. To address this, we propose a novel framework consisting of two components. The first is Oracle-guided Code Distillation (OCD), a context distillation algorithm that combines genetic search and delta debugging to systematically reduce code contexts to their minimal sufficient subsequence, retaining only the ingredients required for a successful fix. The second is SWEzze, a lightweight model fine-tuned on this distilled data that learns to compress code context at inference time, filtering noise and combating distraction while preserving fix ingredients. Evaluated on SWE-bench Verified across three frontier LLMs, SWEzze maintains a stable compression rate of about 6 times across models, reduces the total token budget by 51.8%-71.3% relative to the uncompressed setting, improves issue resolution rates by 5.0%-9.2%, and delivers the best overall balance among effectiveness, compression ratio, and latency compared with state-of-the-art context compression baselines.
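The abstract names delta debugging as one ingredient of OCD. As a rough illustration only, the sketch below shows a textbook-style complement-removal variant of delta debugging applied to a list of context snippets. It is not the paper's algorithm: the genetic-search part is omitted, and the `oracle` callback, the snippet names, and the `needed` set are hypothetical stand-ins for "the fix still succeeds with this context".

```python
from typing import Callable, List, Sequence


def ddmin(items: Sequence[str], oracle: Callable[[List[str]], bool]) -> List[str]:
    """Shrink `items` to a subsequence from which no single element can be
    dropped without breaking `oracle` (1-minimality). Order is preserved."""
    current = list(items)
    assert oracle(current), "the full context must satisfy the oracle"
    n = 2  # current granularity: number of chunks to split into
    while len(current) >= 2:
        chunk = max(1, len(current) // n)
        chunks = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False
        for i in range(len(chunks)):
            # Keep everything except chunk i.
            candidate = [x for j, c in enumerate(chunks) if j != i for x in c]
            if candidate and oracle(candidate):
                current = candidate
                n = max(n - 1, 2)  # coarsen again after a successful removal
                reduced = True
                break
        if not reduced:
            if n >= len(current):
                break  # already tried removing every single element
            n = min(len(current), n * 2)  # refine granularity
    return current


# Hypothetical oracle: the "fix" succeeds iff both needed snippets survive.
needed = {"b.py:g", "e.py:m"}
context = ["a.py:f", "b.py:g", "c.py:h", "d.py:k", "e.py:m", "f.py:n"]
minimal = ddmin(context, lambda kept: needed <= set(kept))
print(minimal)
```

In a realistic setting the oracle would be expensive (for example, running an LLM and a test suite), so the number of oracle calls, not the loop itself, would dominate the cost.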

12.3 · SE · Mar 30

Reducing Hallucinations in LLM-Generated Code via Semantic Triangulation

Yihan Dai, Sijie Liang, Haotian Xu et al.

Large language models (LLMs) can generate executable code from natural language descriptions, but the resulting programs frequently contain bugs due to hallucinations. In the absence of formal specifications, existing approaches attempt to assess correctness using LLM-generated proxies such as tests or auto-formalized specifications. However, these proxies are produced by the same imperfect models and thus often corroborate rather than catch errors, especially when the model exhibits correlated errors. We introduce semantic triangulation, a theory-grounded framework that decorrelates model errors by transforming the original problem into a dissociative variant (one likely requiring a fundamentally different algorithm) and checking consistency between independently sampled solutions to both problems. We identify theoretical requirements for this framework, and we prove that under a formal model of LLM hallucinations, these properties confer higher confidence in program correctness. We instantiate the framework through four concrete triangulation methods based on problem inversion, decomposition, and solution enumeration. Evaluated on LiveCodeBench and CodeElo across GPT-4o, DeepSeek-V3, and Gemini 2.5 Flash, our tool increases the probability of selecting a correct program by 24% over baselines (test generation, metamorphic testing, and auto-formalized specifications) and achieves 26% higher F1 score in selection-or-abstention scenarios, while being the only method that consistently handles inexact problems admitting multiple valid solutions.
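To make the idea of checking consistency between a problem and its inverse more concrete, here is a minimal sketch assuming a toy run-length-encoding task. Everything in it is hypothetical: the candidate functions stand in for independently sampled LLM solutions, the round-trip check is only one plausible reading of "problem inversion", and the paper's actual methods and guarantees are not reproduced.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Pairs = List[Tuple[str, int]]


def triangulate_by_inversion(
    forward_candidates: Iterable[Callable],
    inverse_candidates: Iterable[Callable],
    inputs: List,
) -> Optional[Tuple[Callable, Callable]]:
    """Return the first (forward, inverse) pair that round-trips on every
    input, or None to abstain when no pair is mutually consistent."""
    inverse_list = list(inverse_candidates)
    for f in forward_candidates:
        for g in inverse_list:
            try:
                if all(g(f(x)) == x for x in inputs):
                    return f, g
            except Exception:
                continue  # a crashing candidate pair is treated as inconsistent
    return None


def rle_ok(s: str) -> Pairs:
    """Correct run-length encoding."""
    out: List[List] = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1][1] += 1
        else:
            out.append([ch, 1])
    return [(c, n) for c, n in out]


def rle_buggy(s: str) -> Pairs:
    """Plausible-looking bug: counts characters globally, so "aba" is wrong."""
    return [(c, s.count(c)) for c in dict.fromkeys(s)]


def unrle(pairs: Pairs) -> str:
    """Candidate solution to the inverse problem (decoding)."""
    return "".join(c * n for c, n in pairs)


inputs = ["aab", "aba", ""]
chosen = triangulate_by_inversion([rle_buggy, rle_ok], [unrle], inputs)
abstained = triangulate_by_inversion([rle_buggy], [unrle], inputs)
```

A round-trip check alone cannot rule out every wrong pair (two identity functions would also pass), which is presumably why the abstract stresses theoretical requirements on the transformed problem.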

1.2 · PL · Jan 31

Defusing Logic Bombs in Symbolic Execution with LLM-Generated Ghost Code

Dimitrios Stamatios Bouras, Sergey Mechtaev

Symbolic execution is a powerful program analysis technique, but its effectiveness is fundamentally limited by solver-hostile program fragments, complex numerical reasoning, and unbounded heap structures. Recent work proposed replacing constraint solvers with large language models (LLMs) to bypass these limitations, but such approaches struggle to analyze real-world codebases, where deep execution paths require globally consistent reasoning across many interacting constraints. We present Gordian, a hybrid symbolic execution framework that uses LLMs selectively to generate lightweight ghost code that aids an SMT solver in handling solver-hostile code fragments, while preserving its precise, global reasoning capability. In particular, we propose three types of ghost code: (1) inversion of difficult code fragments with iterative bidirectional constraint propagation, (2) modeling via solver-friendly surrogates while preserving relevant behavior, and (3) semantic partitioning of unbounded heap spaces. We implemented Gordian on top of the KLEE symbolic execution engine and evaluated it on synthetic "logic bombs" capturing distinct symbolic reasoning challenges, a popular mathematical library FDLibM, and three structured-input programs (libexpat, jq, and bc). Across all benchmarks, Gordian improves coverage on average by 52-84% over traditional symbolic execution baselines, and by 86-419% over LLM-based techniques, while reducing LLM token usage by an average of 90-96%. This highlights the practicality and effectiveness of this approach in real-world settings.

2.3 · LG · Nov 22, 2020

Fairness-guided SMT-based Rectification of Decision Trees and Random Forests

Jiang Zhang, Ivan Beschastnikh, Sergey Mechtaev et al.

Data-driven decision making is gaining prominence with the popularity of various machine learning models. Unfortunately, real-life data used in machine learning training may capture human biases, and as a result the learned models may lead to unfair decision making. In this paper, we provide a solution to this problem for decision trees and random forests. Our approach converts any decision tree or random forest into a fair one with respect to a specific data set, fairness criteria, and sensitive attributes. The FairRepair tool, built on our approach, is inspired by automated program repair techniques for traditional programs. It uses an SMT solver to decide which paths in the decision tree could have their outcomes flipped to improve the fairness of the model. Our experiments on the well-known adult dataset from UC Irvine demonstrate that FairRepair scales to realistic decision trees and random forests. Furthermore, FairRepair provides formal guarantees about soundness and completeness of finding a repair. Since our fairness-guided repair technique repairs decision trees and random forests obtained from a given (unfair) data set, it can help to identify and rectify biases in decision-making in an organisation.

2.9 · SE · Jul 11, 2017

Partitioning Patches into Test-equivalence Classes for Scaling Program Repair

Sergey Mechtaev, Xiang Gao, Shin Hwei Tan et al.

Automated program repair is the problem of finding a transformation (called a patch) of a given incorrect program that eliminates the observable failures. It has important applications such as providing debugging aids, automatically grading assignments and patching security vulnerabilities. A common challenge faced by all existing repair techniques is scalability to large patch spaces, since there are many candidate patches that these techniques explicitly or implicitly consider. The correctness criterion for program repair is often given as a suite of tests, since a formal specification of the intended program behavior may not be available. Current repair techniques do not scale due to the large number of test executions performed by the underlying search algorithms. We address this problem by introducing a methodology of patch generation based on a test-equivalence relation (if two programs are "test-equivalent" for a given test, they produce indistinguishable results on this test). We propose two test-equivalence relations based on runtime values and dependencies respectively and present an algorithm that performs on-the-fly partitioning of patches into test-equivalence classes. Our experiments on real-world programs reveal that the proposed methodology drastically reduces the number of test executions and therefore provides an order of magnitude efficiency improvement over existing repair techniques, without sacrificing patch quality.
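The value-based test-equivalence idea can be sketched on a toy example. The code below assumes, purely for illustration, that each candidate patch replaces a single condition that is evaluated exactly once per test; under that assumption, patches whose condition takes the same value on a test input must produce the same test result, so the test needs to run only once per class. The program, the patch set, and the function names are hypothetical, and the paper's on-the-fly partitioning is more general than this.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def classify(x: int, cond_value: bool) -> str:
    """Toy program under repair; `cond_value` is the patched condition's value."""
    return "negative" if cond_value else "non-negative"


def passing_patches(
    patches: Dict[str, Callable[[int], bool]],
    test_input: int,
    expected: str,
    run_test: Callable[[int, bool], str],
) -> Tuple[List[str], int]:
    """Group patches by the runtime value of the patched expression, then
    execute the test once per group instead of once per patch."""
    classes: Dict[bool, List[str]] = defaultdict(list)
    for name, expr in patches.items():
        classes[expr(test_input)].append(name)

    passing: List[str] = []
    executions = 0
    for value, names in classes.items():
        executions += 1  # one test run covers the whole equivalence class
        if run_test(test_input, value) == expected:
            passing.extend(names)
    return passing, executions


candidate_patches = {
    "x < 0": lambda x: x < 0,
    "x <= 0": lambda x: x <= 0,
    "x != 0": lambda x: x != 0,
    "x > 0": lambda x: x > 0,
    "x >= 0": lambda x: x >= 0,
}
passing, executions = passing_patches(candidate_patches, -1, "negative", classify)
print(passing, executions)
```

Here five candidate patches collapse into two classes for the input -1, so two test executions replace five; further tests would then be needed to separate the patches that remain in the passing class.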