CLNov 7, 2023
Unveiling Safety Vulnerabilities of Large Language ModelsGeorge Kour, Marcel Zalmanovici, Naama Zwerdling et al. · ibm-research
As large language models become more prevalent, their possible harmful or inappropriate responses are a cause for concern. This paper introduces a unique dataset containing adversarial examples in the form of questions, which we call AttaQ, designed to provoke such harmful or inappropriate responses. We assess the efficacy of our dataset by analyzing the vulnerabilities of various models when subjected to it. Additionally, we introduce a novel automatic approach for identifying and naming vulnerable semantic regions - input semantic areas for which the model is likely to produce harmful outputs. This is achieved through the application of specialized clustering techniques that consider both the semantic similarity of the input attacks and the harmfulness of the model's responses. Automatically identifying vulnerable semantic regions enhances the evaluation of model weaknesses, facilitating targeted improvements to its safety mechanisms and overall reliability.
CLNov 29, 2022
Measuring the Measuring Tools: An Automatic Evaluation of Semantic Metrics for Text CorporaGeorge Kour, Samuel Ackerman, Orna Raz et al. · ibm-research
The ability to compare the semantic similarity between text corpora is important in a variety of natural language processing applications. However, standard methods for evaluating these metrics have yet to be established. We propose a set of automatic and interpretable measures for assessing the characteristics of corpus-level semantic similarity metrics, allowing sensible comparison of their behavior. We demonstrate the effectiveness of our evaluation measures in capturing fundamental characteristics by evaluating them on a collection of classical and state-of-the-art metrics. Our measures revealed that recently-developed metrics are becoming better in identifying semantic distributional mismatch while classical metrics are more sensitive to perturbations in the surface text levels.
CLNov 2, 2023
Predicting Question-Answering Performance of Large Language Models through Semantic ConsistencyElla Rabinovich, Samuel Ackerman, Orna Raz et al.
Semantic consistency of a language model is broadly defined as the model's ability to produce semantically-equivalent outputs, given semantically-equivalent inputs. We address the task of assessing question-answering (QA) semantic consistency of contemporary large language models (LLMs) by manually creating a benchmark dataset with high-quality paraphrases for factual questions, and release the dataset to the community. We further combine the semantic consistency metric with additional measurements suggested in prior work as correlating with LLM QA accuracy, for building and evaluating a framework for factual QA reference-less performance prediction -- predicting the likelihood of a language model to accurately answer a question. Evaluating the framework on five contemporary LLMs, we demonstrate encouraging, significantly outperforming baselines, results.
AIAug 22, 2024
How Safe is Your Safety Metric? Automatic Concatenation Tests for Metric ReliabilityOra Nova Fandina, Leshem Choshen, Eitan Farchi et al. · ibm-research
Consider a scenario where a harmfulness evaluation metric intended to filter unsafe responses from a Large Language Model. When applied to individual harmful prompt-response pairs, it correctly flags them as unsafe by assigning a high-risk score. Yet, if those same pairs are concatenated, the metrics decision unexpectedly reverses - labelling the combined content as safe with a low score, allowing the harmful text to bypass the filter. We found that multiple safety metrics, including advanced metrics such as GPT-based judges, exhibit this non-safe behaviour. Moreover, they show a strong sensitivity to input order: responses are often classified as safe if safe content appears first, regardless of any harmful content that follows, and vice versa. These findings underscore the importance of evaluating the safety of safety metrics, that is, the reliability of their output scores. To address this, we developed general, automatic, concatenation-based tests to assess key properties of these metrics. When applied in a model safety scenario, the tests revealed significant inconsistencies in harmfulness evaluations.
SEOct 31, 2025
Vintage Code, Modern Judges: Meta-Validation in Low Data RegimesOra Nova Fandina, Gal Amram, Eitan Farchi et al.
Application modernization in legacy languages such as COBOL, PL/I, and REXX faces an acute shortage of resources, both in expert availability and in high-quality human evaluation data. While Large Language Models as a Judge (LaaJ) offer a scalable alternative to expert review, their reliability must be validated before being trusted in high-stakes workflows. Without principled validation, organizations risk a circular evaluation loop, where unverified LaaJs are used to assess model outputs, potentially reinforcing unreliable judgments and compromising downstream deployment decisions. Although various automated approaches to validating LaaJs have been proposed, alignment with human judgment remains a widely used and conceptually grounded validation strategy. In many real-world domains, the availability of human-labeled evaluation data is severely limited, making it difficult to assess how well a LaaJ aligns with human judgment. We introduce SparseAlign, a formal framework for assessing LaaJ alignment with sparse human-labeled data. SparseAlign combines a novel pairwise-confidence concept with a score-sensitive alignment metric that jointly capture ranking consistency and score proximity, enabling reliable evaluator selection even when traditional statistical methods are ineffective due to limited annotated examples. SparseAlign was applied internally to select LaaJs for COBOL code explanation. The top-aligned evaluators were integrated into assessment workflows, guiding model release decisions. We present a case study of four LaaJs to demonstrate SparseAlign's utility in real-world evaluation scenarios.
AIJul 29, 2024
Generating Unseen Code Tests In InfinitumMarcel Zalmanovici, Orna Raz, Eitan Farchi et al.
Large Language Models (LLMs) are used for many tasks, including those related to coding. An important aspect of being able to utilize LLMs is the ability to assess their fitness for specific usages. The common practice is to evaluate LLMs against a set of benchmarks. While benchmarks provide a sound foundation for evaluation and comparison of alternatives, they suffer from the well-known weakness of leaking into the training data \cite{Xu2024Benchmarking}. We present a method for creating benchmark variations that generalize across coding tasks and programming languages, and may also be applied to in-house code bases. Our approach enables ongoing generation of test-data thus mitigating the leaking into the training data issue. We implement one benchmark, called \textit{auto-regression}, for the task of text-to-code generation in Python. Auto-regression is specifically created to aid in debugging and in tracking model generation changes as part of the LLM regression testing process.
SENov 5, 2025
Uncovering Code Insights: Leveraging GitHub Artifacts for Deeper Code UnderstandingZiv Nevo, Orna Raz, Karen Yorav
Understanding the purpose of source code is a critical task in software maintenance, onboarding, and modernization. While large language models (LLMs) have shown promise in generating code explanations, they often lack grounding in the broader software engineering context. We propose a novel approach that leverages natural language artifacts from GitHub -- such as pull request descriptions, issue descriptions and discussions, and commit messages -- to enhance LLM-based code understanding. Our system consists of three components: one that extracts and structures relevant GitHub context, another that uses this context to generate high-level explanations of the code's purpose, and a third that validates the explanation. We implemented this as a standalone tool, as well as a server within the Model Context Protocol (MCP), enabling integration with other AI-assisted development tools. Our main use case is that of enhancing a standard LLM-based code explanation with code insights that our system generates. To evaluate explanations' quality, we conducted a small scale user study, with developers of several open projects, as well as developers of proprietary projects. Our user study indicates that when insights are generated they often are helpful and non trivial, and are free from hallucinations.
LGMar 9, 2024
Detectors for Safe and Reliable LLMs: Implementations, Uses, and LimitationsSwapnaja Achintalwar, Adriana Alvarado Garcia, Ateret Anaby-Tavor et al. · ibm-research
Large language models (LLMs) are susceptible to a variety of risks, from non-faithful output to biased and toxic generations. Due to several limiting factors surrounding LLMs (training cost, API access, data availability, etc.), it may not always be feasible to impose direct safety constraints on a deployed model. Therefore, an efficient and reliable alternative is required. To this end, we present our ongoing efforts to create and deploy a library of detectors: compact and easy-to-build classification models that provide labels for various harms. In addition to the detectors themselves, we discuss a wide range of uses for these detector models - from acting as guardrails to enabling effective AI governance. We also deep dive into inherent challenges in their development and discuss future work aimed at making the detectors more reliable and broadening their scope.
CLMar 8, 2024
Alignment Studio: Aligning Large Language Models to Particular Contextual RegulationsSwapnaja Achintalwar, Ioana Baldini, Djallel Bouneffouf et al. · ibm-research
The alignment of large language models is usually done by model providers to add or control behaviors that are common or universally understood across use cases and contexts. In contrast, in this article, we present an approach and architecture that empowers application developers to tune a model to their particular values, social norms, laws and other regulations, and orchestrate between potentially conflicting requirements in context. We lay out three main components of such an Alignment Studio architecture: Framers, Instructors, and Auditors that work in concert to control the behavior of a language model. We illustrate this approach with a running example of aligning a company's internal-facing enterprise chatbot to its business conduct guidelines.
SEAug 4, 2025
Automated Validation of LLM-based Evaluators for Software Engineering ArtifactsOra Nova Fandina, Eitan Farchi, Shmulik Froimovich et al.
Automation in software engineering increasingly relies on large language models (LLMs) to generate, review, and assess code artifacts. However, establishing LLMs as reliable evaluators remains an open challenge: human evaluations are costly, subjective and non scalable, while existing automated methods fail to discern fine grained variations in artifact quality. We introduce REFINE (Ranking Evaluators for FIne grained Nuanced Evaluation), an automated framework for benchmarking LLM based evaluators across software engineering tasks. REFINE comprises of two modules: Hierarchy Dataset Builder applies novel generation techniques to automatically synthesize artifacts with progressively reduced quality, and Evaluator Tester quantifies each candidate evaluator configuration by measuring how closely its rankings align with expected ordering. A key feature of REFINE is controllability: users can tune the granularity of degradation to progressively refine evaluator configurations, from coarse filtering to stress testing on subtle quality gaps. While the methodology is general, we focus on coding tasks reflecting the practical demands in our production setting. REFINE was integrated into IBM's internal development workflows and applied to code generation, translation, and summarization for COBOL, an enterprise critical programming language, using industrial data. It was used to identify LLM as a Judge configurations that lifted alignment scores from below $0.7$ to above $0.9$ in some coding tasks. These nuance sensitive evaluators are now actively used by model training teams to support model release decisions.
CLMay 15, 2024
Using Combinatorial Optimization to Design a High quality LLM SolutionSamuel Ackerman, Eitan Farchi, Rami Katan et al.
We introduce a novel LLM based solution design approach that utilizes combinatorial optimization and sampling. Specifically, a set of factors that influence the quality of the solution are identified. They typically include factors that represent prompt types, LLM inputs alternatives, and parameters governing the generation and design alternatives. Identifying the factors that govern the LLM solution quality enables the infusion of subject matter expert knowledge. Next, a set of interactions between the factors are defined and combinatorial optimization is used to create a small subset $P$ that ensures all desired interactions occur in $P$. Each element $p \in P$ is then developed into an appropriate benchmark. Applying the alternative solutions on each combination, $p \in P$ and evaluating the results facilitate the design of a high quality LLM solution pipeline. The approach is especially applicable when the design and evaluation of each benchmark in $P$ is time-consuming and involves manual steps and human evaluation. Given its efficiency the approach can also be used as a baseline to compare and validate an autoML approach that searches over the factors governing the solution.
SEDec 11, 2025
PACIFIC: a framework for generating benchmarks to check Precise Automatically Checked Instruction Following In CodeItay Dreyfuss, Antonio Abu Nassar, Samuel Ackerman et al.
Large Language Model (LLM)-based code assistants have emerged as a powerful application of generative AI, demonstrating impressive capabilities in code generation and comprehension. A key requirement for these systems is their ability to accurately follow user instructions. We present Precise Automatically Checked Instruction Following In Code (PACIFIC), a novel framework designed to automatically generate benchmarks that rigorously assess sequential instruction-following and code dry-running capabilities in LLMs, while allowing control over benchmark difficulty. PACIFIC produces benchmark variants with clearly defined expected outputs, enabling straightforward and reliable evaluation through simple output comparisons. In contrast to existing approaches that often rely on tool usage or agentic behavior, our work isolates and evaluates the LLM's intrinsic ability to reason through code behavior step-by-step without execution (dry running) and to follow instructions. Furthermore, our framework mitigates training data contamination by facilitating effortless generation of novel benchmark variations. We validate our framework by generating a suite of benchmarks spanning a range of difficulty levels and evaluating multiple state-of-the-art LLMs. Our results demonstrate that PACIFIC can produce increasingly challenging benchmarks that effectively differentiate instruction-following and dry running capabilities, even among advanced models. Overall, our framework offers a scalable, contamination-resilient methodology for assessing core competencies of LLMs in code-related tasks.
SENov 23, 2025
Evaluating perturbation robustness of generative systems that use COBOL code inputsSamuel Ackerman, Wesam Ibraheem, Orna Raz et al.
Systems incorporating large language models (LLMs) as a component are known to be sensitive (i.e., non-robust) to minor input variations that do not change the meaning of the input; such sensitivity may reduce the system's usefulness. Here, we present a framework to evaluate robustness of systems using COBOL code as input; our application is translation between COBOL and Java programming languages, but the approach extends to other tasks such as code generation or explanation. Targeting robustness of systems with COBOL as input is essential yet challenging. Many business-critical applications are written in COBOL, yet these are typically proprietary legacy applications and their code is unavailable to LLMs for training. We develop a library of COBOL paragraph and full-program perturbation methods, and create variant-expanded versions of a benchmark dataset of examples for a specific task. The robustness of the LLM-based system is evaluated by measuring changes in values of individual and aggregate metrics calculated on the system's outputs. Finally, we present a series of dynamic table and chart visualization dashboards that assist in debugging the system's outputs, and monitoring and understanding root causes of the system's sensitivity to input variation. These tools can be further used to improve the system by, for instance, indicating variations that should be handled by pre-processing steps.
APJan 30, 2025
Statistical multi-metric evaluation and visualization of LLM system predictive performanceSamuel Ackerman, Eitan Farchi, Orna Raz et al.
The evaluation of generative or discriminative large language model (LLM)-based systems is often a complex multi-dimensional problem. Typically, a set of system configuration alternatives are evaluated on one or more benchmark datasets, each with one or more evaluation metrics, which may differ between datasets. We often want to evaluate -- with a statistical measure of significance -- whether systems perform differently either on a given dataset according to a single metric, on aggregate across metrics on a dataset, or across datasets. Such evaluations can be done to support decision-making, such as deciding whether a particular system component change (e.g., choice of LLM or hyperparameter values) significantly improves performance over the current system configuration, or, more generally, whether a fixed set of system configurations (e.g., a leaderboard list) have significantly different performances according to metrics of interest. We present a framework implementation that automatically performs the correct statistical tests, properly aggregates the statistical results across metrics and datasets (a nontrivial task), and can visualize the results. The framework is demonstrated on the multi-lingual code generation benchmark CrossCodeEval, for several state-of-the-art LLMs.
LGMay 14, 2023
Automatic Generation of Attention Rules For Containment of Machine Learning Model ErrorsSamuel Ackerman, Axel Bendavid, Eitan Farchi et al.
Machine learning (ML) solutions are prevalent in many applications. However, many challenges exist in making these solutions business-grade. For instance, maintaining the error rate of the underlying ML models at an acceptably low level. Typically, the true relationship between feature inputs and the target feature to be predicted is uncertain, and hence statistical in nature. The approach we propose is to separate the observations that are the most likely to be predicted incorrectly into 'attention sets'. These can directly aid model diagnosis and improvement, and be used to decide on alternative courses of action for these problematic observations. We present several algorithms (`strategies') for determining optimal rules to separate these observations. In particular, we prefer strategies that use feature-based slicing because they are human-interpretable, model-agnostic, and require minimal supplementary inputs or knowledge. In addition, we show that these strategies outperform several common baselines, such as selecting observations with prediction confidence below a threshold. To evaluate strategies, we introduce metrics to measure various desired qualities, such as their performance, stability, and generalizability to unseen data; the strategies are evaluated on several publicly-available datasets. We use TOPSIS, a Multiple Criteria Decision Making method, to aggregate these metrics into a single quality score for each strategy, to allow comparison.
LGJan 2, 2022
Theory and Practice of Quality Assurance for Machine Learning Systems An Experiment Driven ApproachSamuel Ackerman, Guy Barash, Eitan Farchi et al.
The crafting of machine learning (ML) based systems requires statistical control throughout its life cycle. Careful quantification of business requirements and identification of key factors that impact the business requirements reduces the risk of a project failure. The quantification of business requirements results in the definition of random variables representing the system key performance indicators that need to be analyzed through statistical experiments. In addition, available data for training and experiment results impact the design of the system. Once the system is developed, it is tested and continually monitored to ensure it meets its business requirements. This is done through the continued application of statistical experiments to analyze and control the key performance indicators. This book teaches the art of crafting and developing ML based systems. It advocates an "experiment first" approach stressing the need to define statistical experiments from the beginning of the project life cycle. It also discusses in detail how to apply statistical control on the ML based system throughout its lifecycle.
LGDec 22, 2021
Classifier Data Quality: A Geometric Complexity Based Method for Automated Baseline And Insights GenerationGeorge Kour, Marcel Zalmanovici, Orna Raz et al.
Testing Machine Learning (ML) models and AI-Infused Applications (AIIAs), or systems that contain ML models, is highly challenging. In addition to the challenges of testing classical software, it is acceptable and expected that statistical ML models sometimes output incorrect results. A major challenge is to determine when the level of incorrectness, e.g., model accuracy or F1 score for classifiers, is acceptable and when it is not. In addition to business requirements that should provide a threshold, it is a best practice to require any proposed ML solution to out-perform simple baseline models, such as a decision tree. We have developed complexity measures, which quantify how difficult given observations are to assign to their true class label; these measures can then be used to automatically determine a baseline performance threshold. These measures are superior to the best practice baseline in that, for a linear computation cost, they also quantify each observation' classification complexity in an explainable form, regardless of the classifier model used. Our experiments with both numeric synthetic data and real natural language chatbot data demonstrate that the complexity measures effectively highlight data regions and observations that are likely to be misclassified.
LGNov 10, 2021
Automatically detecting data drift in machine learning classifiersSamuel Ackerman, Orna Raz, Marcel Zalmanovici et al.
Classifiers and other statistics-based machine learning (ML) techniques generalize, or learn, based on various statistical properties of the training data. The assumption underlying statistical ML resulting in theoretical or empirical performance guarantees is that the distribution of the training data is representative of the production data distribution. This assumption often breaks; for instance, statistical distributions of the data may change. We term changes that affect ML performance `data drift' or `drift'. Many classification techniques compute a measure of confidence in their results. This measure might not reflect the actual ML performance. A famous example is the Panda picture that is correctly classified as such with a confidence of about 60\%, but when noise is added it is incorrectly classified as a Gibbon with a confidence of above 99\%. However, the work we report on here suggests that a classifier's measure of confidence can be used for the purpose of detecting data drift. We propose an approach based solely on classifier suggested labels and its confidence in them, for alerting on data distribution or feature space changes that are likely to cause data drift. Our approach identities degradation in model performance and does not require labeling of data in production which is often lacking or delayed. Our experiments with three different data sets and classifiers demonstrate the effectiveness of this approach in detecting data drift. This is especially encouraging as the classification itself may or may not be correct and no model input data is required. We further explore the statistical approach of sequential change-point tests to automatically determine the amount of data needed in order to identify drift while controlling the false positive rate (Type-1 error).
LGOct 24, 2021
Detecting model drift using polynomial relationsEliran Roffe, Samuel Ackerman, Orna Raz et al.
Machine learning models serve critical functions, such as classifying loan applicants as good or bad risks. Each model is trained under the assumption that the data used in training and in the field come from the same underlying unknown distribution. Often, this assumption is broken in practice. It is desirable to identify when this occurs, to minimize the impact on model performance. We suggest a new approach to detecting change in the data distribution by identifying polynomial relations between the data features. We measure the strength of each identified relation using its R-square value. A strong polynomial relation captures a significant trait of the data which should remain stable if the data distribution does not change. We thus use a set of learned strong polynomial relations to identify drift. For a set of polynomial relations that are stronger than a given threshold, we calculate the amount of drift observed for that relation. The amount of drift is measured by calculating the Bayes Factor for the polynomial relation likelihood of the baseline data versus field data. We empirically validate the approach by simulating a range of changes, and identify drift using the Bayes Factor of the polynomial relation likelihood change.
LGOct 11, 2021
Density-based interpretable hypercube region partitioning for mixed numeric and categorical dataSamuel Ackerman, Eitan Farchi, Orna Raz et al.
Consider a structured dataset of features, such as $\{\textrm{SEX}, \textrm{INCOME}, \textrm{RACE}, \textrm{EXPERIENCE}\}$. A user may want to know where in the feature space observations are concentrated, and where it is sparse or empty. The existence of large sparse or empty regions can provide domain knowledge of soft or hard feature constraints (e.g., what is the typical income range, or that it may be unlikely to have a high income with few years of work experience). Also, these can suggest to the user that machine learning (ML) model predictions for data inputs in sparse or empty regions may be unreliable. An interpretable region is a hyper-rectangle, such as $\{\textrm{RACE} \in\{\textrm{Black}, \textrm{White}\}\}\:\&$ $\{10 \leq \:\textrm{EXPERIENCE} \:\leq 13\}$, containing all observations satisfying the constraints; typically, such regions are defined by a small number of features. Our method constructs an observation density-based partition of the observed feature space in the dataset into such regions. It has a number of advantages over others in that it works on features of mixed type (numeric or categorical) in the original domain, and can separate out empty regions as well. As can be seen from visualizations, the resulting partitions accord with spatial groupings that a human eye might identify; the results should thus extend to higher dimensions. We also show some applications of the partition to other data analysis tasks, such as inferring about ML model error, measuring high-dimensional density variability, and causal inference for treatment effect. Many of these applications are made possible by the hyper-rectangular form of the partition regions.
LGAug 12, 2021
FreaAI: Automated extraction of data slices to test machine learning modelsSamuel Ackerman, Orna Raz, Marcel Zalmanovici
Machine learning (ML) solutions are prevalent. However, many challenges exist in making these solutions business-grade. One major challenge is to ensure that the ML solution provides its expected business value. In order to do that, one has to bridge the gap between the way ML model performance is measured and the solution requirements. In previous work (Barash et al, "Bridging the gap...") we demonstrated the effectiveness of utilizing feature models in bridging this gap. Whereas ML performance metrics, such as the accuracy or F1-score of a classifier, typically measure the average ML performance, feature models shed light on explainable data slices that are too far from that average, and therefore might indicate unsatisfied requirements. For example, the overall accuracy of a bank text terms classifier may be very high, say $98\% \pm 2\%$, yet it might perform poorly for terms that include short descriptions and originate from commercial accounts. A business requirement, which may be implicit in the training data, may be to perform well regardless of the type of account and length of the description. Therefore, the under-performing data slice that includes short descriptions and commercial accounts suggests poorly-met requirements. In this paper we show the feasibility of automatically extracting feature models that result in explainable data slices over which the ML solution under-performs. Our novel technique, IBM FreaAI aka FreaAI, extracts such slices from structured ML test data or any other labeled data. We demonstrate that FreaAI can automatically produce explainable and statistically-significant data slices over seven open datasets.
LGAug 11, 2021
Machine Learning Model Drift Detection Via Weak Data SlicesSamuel Ackerman, Parijat Dube, Eitan Farchi et al.
Detecting drift in performance of Machine Learning (ML) models is an acknowledged challenge. For ML models to become an integral part of business applications it is essential to detect when an ML model drifts away from acceptable operation. However, it is often the case that actual labels are difficult and expensive to get, for example, because they require expert judgment. Therefore, there is a need for methods that detect likely degradation in ML operation without labels. We propose a method that utilizes feature space rules, called data slices, for drift detection. We provide experimental indications that our method is likely to identify that the ML model will likely change in performance, based on changes in the underlying data.
APDec 16, 2020
Detection of data drift and outliers affecting machine learning model performance over timeSamuel Ackerman, Eitan Farchi, Orna Raz et al.
A trained ML model is deployed on another `test' dataset where target feature values (labels) are unknown. Drift is distribution change between the training and deployment data, which is concerning if model performance changes. For a cat/dog image classifier, for instance, drift during deployment could be rabbit images (new class) or cat/dog images with changed characteristics (change in distribution). We wish to detect these changes but can't measure accuracy without deployment data labels. We instead detect drift indirectly by nonparametrically testing the distribution of model prediction confidence for changes. This generalizes our method and sidesteps domain-specific feature representation. We address important statistical issues, particularly Type-1 error control in sequential testing, using Change Point Models (CPMs; see Adams and Ross 2012). We also use nonparametric outlier methods to show the user suspicious observations for model diagnosis, since the before/after change confidence distributions overlap significantly. In experiments to demonstrate robustness, we train on a subset of MNIST digit classes, then insert drift (e.g., unseen digit class) in deployment data in various settings (gradual/sudden changes in the drift proportion). A novel loss function is introduced to compare the performance (detection delay, Type-1 and 2 errors) of a drift detector under different levels of drift class contamination.