Elena Ericheva

IV
h-index: 7
Papers: 5
Citations: 113
Novelty: 47%
AI Score: 41

5 Papers

LG | Nov 22, 2024 | Code
RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts

Hjalmar Wijk, Tao Lin, Joel Becker et al.

Frontier AI safety policies highlight automation of AI research and development (R&D) by AI agents as an important capability to anticipate. However, there exist few evaluations for AI R&D capabilities, and none that are highly realistic and have a direct comparison to human performance. We introduce RE-Bench (Research Engineering Benchmark, v1), which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 distinct human experts. We confirm that our experts make progress in the environments given 8 hours, with 82% of expert attempts achieving a non-zero score and 24% matching or exceeding our strong reference solutions. We compare humans to several public frontier models through best-of-k with varying time budgets and agent designs, and find that the best AI agents achieve a score 4x higher than human experts when both are given a total time budget of 2 hours per environment. However, humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2x the score of the top AI agent when both are given 32 total hours (across different attempts). Qualitatively, we find that modern AI agents possess significant expertise in many ML topics -- e.g. an agent wrote a faster custom Triton kernel than any of our human experts' -- and can generate and test solutions over ten times faster than humans, at much lower cost. We open-source the evaluation environments, human expert data, analysis code and agent trajectories to facilitate future research.
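The best-of-k comparison mentioned in the abstract can be illustrated with a small calculation. The sketch below is an assumption about one common way to estimate a best-of-k score (the expected maximum of k scores drawn with replacement from a pool of observed run scores); it is not taken from the RE-Bench analysis code, and the function name `expected_best_of_k` is hypothetical.

```python
def expected_best_of_k(scores, k):
    """Expected maximum of k scores drawn with replacement from `scores`.

    Hypothetical helper for illustration; not the RE-Bench estimator.
    """
    if k < 1 or not scores:
        raise ValueError("need k >= 1 and at least one score")
    ordered = sorted(scores)
    n = len(ordered)
    expected = 0.0
    for i, score in enumerate(ordered, start=1):
        # Probability that the maximum of k draws lands on the i-th smallest score:
        # P(max <= rank i) - P(max <= rank i - 1).
        p_max_at_rank = (i / n) ** k - ((i - 1) / n) ** k
        expected += score * p_max_at_rank
    return expected


if __name__ == "__main__":
    # Made-up scores from four independent runs of one environment.
    run_scores = [0.0, 0.2, 0.5, 0.9]
    for k in (1, 2, 4, 8):
        print(k, round(expected_best_of_k(run_scores, k), 3))
```

Under this estimator the score never decreases as k grows, which is why spending a fixed time budget on several shorter attempts can favor an agent whose individual runs vary widely in quality.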

CL | Nov 21, 2025 | Code
Polarity-Aware Probing for Quantifying Latent Alignment in Language Models

Sabrina Sadiekh, Elena Ericheva, Chirag Agarwal

Advances in unsupervised probes such as Contrast-Consistent Search (CCS), which reveal latent beliefs without relying on token outputs, raise the question of whether these methods can reliably assess model alignment. We investigate this by examining the sensitivity of CCS to harmful vs. safe statements and by introducing Polarity-Aware CCS (PA-CCS), a method for evaluating whether a model's internal representations remain consistent under polarity inversion. We propose two alignment-oriented metrics, Polar-Consistency and the Contradiction Index, to quantify the semantic robustness of a model's latent knowledge. To validate PA-CCS, we curate two main datasets and one control dataset containing matched harmful-safe sentence pairs constructed using different methodologies (concurrent and antagonistic statements). We apply PA-CCS to 16 language models. Our results show that PA-CCS identifies both architectural and layer-specific differences in the encoding of latent harmful knowledge. Notably, replacing the negation token with a meaningless marker degrades PA-CCS scores for models with well-aligned internal representations, while models lacking robust internal calibration do not exhibit this degradation. Our findings highlight the potential of unsupervised probing for alignment evaluation and emphasize the need to incorporate structural robustness checks into interpretability benchmarks. Code and datasets are available at: https://github.com/SadSabrina/polarity-probing. WARNING: This paper contains potentially sensitive, harmful, and offensive content.
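For readers unfamiliar with CCS, the objective that such probes minimize can be written in a few lines. The sketch below shows the original CCS loss (a consistency term plus a confidence term) on paired probabilities for a statement and its negation. It does not implement PA-CCS, Polar-Consistency, or the Contradiction Index, whose definitions are not given in this abstract; the function name `ccs_loss` is an assumption.

```python
def ccs_loss(p_pos, p_neg):
    """Mean CCS loss over paired probe outputs.

    p_pos[i] is the probe's probability that statement i is true,
    p_neg[i] is the probability that its negation is true.
    Illustrative helper; not code from the polarity-probing repository.
    """
    if len(p_pos) != len(p_neg) or not p_pos:
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for pp, pn in zip(p_pos, p_neg):
        # Consistency: a statement and its negation should have probabilities summing to 1.
        consistency = (pp - (1.0 - pn)) ** 2
        # Confidence: penalize the degenerate solution p_pos = p_neg = 0.5.
        confidence = min(pp, pn) ** 2
        total += consistency + confidence
    return total / len(p_pos)


if __name__ == "__main__":
    print(ccs_loss([1.0], [0.0]))  # consistent and confident
    print(ccs_loss([0.5], [0.5]))  # consistent but uninformative
```

A probe that assigns complementary, confident probabilities to a statement and its negation reaches zero loss, while an uninformative probe is penalized by the confidence term.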

IV | Jun 10, 2021
End-to-end lung nodule detection framework with model-based feature projection block

Ivan Drokin, Elena Ericheva

This paper proposes a novel end-to-end framework for detecting suspicious pulmonary nodules in chest CT scans. The core idea of the method is a new nodule segmentation architecture with a model-based feature projection block built on three-dimensional convolutions. This block acts as a preliminary feature extractor for a two-dimensional U-Net-like convolutional network. Combining the proposed approach with analysis of axial, coronal, and sagittal projections makes it possible to abandon the widely used false-positive reduction step. The proposed method achieves SOTA on LUNA2016 with an average sensitivity of 0.959, and a sensitivity of 0.936 at 0.25 false positives per scan. The paper describes the proposed approach and presents experimental results on LUNA2016 as well as ablation studies.

IV | May 7, 2020
Deep Learning on Point Clouds for False Positive Reduction at Nodule Detection in Chest CT Scans

Ivan Drokin, Elena Ericheva

This paper presents a novel approach to false-positive reduction (FPR) of nodule candidates in computer-aided detection (CADe) systems, applied after the suspicious-lesion detection stage. Contrary to typical practice in medical image analysis, the proposed approach treats the input not as a 2D or 3D image but as a point cloud, and uses deep learning models for point clouds. We found that point cloud models require less memory and are faster in both training and inference than traditional 3D CNNs; they also achieve better performance and impose no restrictions on the size of the input image, i.e. on the size of the nodule candidate. We propose an algorithm for transforming 3D CT scan data into a point cloud. In some cases, the volume of the nodule candidate can be much smaller than the surrounding context, for example when the nodule is subpleural. We therefore developed an algorithm for sampling points from a point cloud constructed from a 3D image of the candidate region; the algorithm guarantees that both context and candidate information are captured in the point cloud of the nodule candidate. We describe in detail an experiment in which we built a dataset for the FPR task from the open LIDC-IDRI database. Data augmentation was applied both to avoid overfitting and as an upsampling method. Experiments were conducted with PointNet, PointNet++, and DGCNN. We show that the proposed approach outperforms baseline 3D CNN models, reaching 85.98 FROC versus 77.26 FROC for the baselines. We compare our algorithm with published SOTA and demonstrate that, even without significant modifications, it performs at a comparable level on LUNA2016 and achieves SOTA on LIDC-IDRI.
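The two steps described in the abstract, converting a 3D region to a point cloud and sampling so that both candidate and context are represented, might look roughly like the sketch below. The abstract does not specify the authors' algorithms, so everything here is an assumption for illustration: the intensity threshold, the spherical candidate region, the fixed candidate fraction, and the function names `volume_to_point_cloud` and `sample_candidate_and_context`.

```python
import random


def volume_to_point_cloud(volume, threshold):
    """Convert a 3D grid (nested lists indexed [z][y][x]) into (x, y, z, intensity) points.

    Keeps only voxels at or above `threshold` (an assumed filtering rule).
    """
    points = []
    for z, plane in enumerate(volume):
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                if value >= threshold:
                    points.append((x, y, z, value))
    return points


def sample_candidate_and_context(points, center, radius, n_points,
                                 candidate_fraction=0.5, seed=0):
    """Draw a fixed share of points from inside and outside a sphere around the candidate.

    Hypothetical sampling rule: it keeps a small candidate from being swamped by context.
    A pool smaller than its quota is padded with repeats; an empty pool contributes nothing.
    """
    rng = random.Random(seed)
    cx, cy, cz = center
    inside, outside = [], []
    for p in points:
        dist_sq = (p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2
        (inside if dist_sq <= radius ** 2 else outside).append(p)

    def draw(pool, n):
        if not pool:
            return []
        if len(pool) >= n:
            return rng.sample(pool, n)
        return pool + [rng.choice(pool) for _ in range(n - len(pool))]

    n_inside = int(n_points * candidate_fraction)
    return draw(inside, n_inside) + draw(outside, n_points - n_inside)


if __name__ == "__main__":
    # Toy 4x4x4 volume with uniform intensity; real CT data would be much larger.
    toy = [[[1.0 for _ in range(4)] for _ in range(4)] for _ in range(4)]
    cloud = volume_to_point_cloud(toy, threshold=0.5)
    sample = sample_candidate_and_context(cloud, center=(0, 0, 0), radius=1, n_points=10)
    print(len(cloud), len(sample))
```

The resulting fixed-size list of points is the kind of input that point cloud networks such as PointNet consume, independent of the original size of the candidate region.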

IV | Aug 1, 2019
GANs 'N Lungs: improving pneumonia prediction

Tatiana Malygina, Elena Ericheva, Ivan Drokin

We propose a novel method to improve deep learning model performance on highly imbalanced tasks. The proposed method uses CycleGAN to obtain a balanced dataset. We show that data augmentation with a GAN helps to improve the accuracy of binary pneumonia classification even if the generative network was trained on the same training dataset.