AISep 26, 2024Code
Data-Prep-Kit: getting your data ready for LLM application developmentDavid Wood, Boris Lublinsky, Alexy Roytman et al.
Data preparation is the first and a very important step towards any Large Language Model (LLM) development. This paper introduces an easy-to-use, extensible, and scale-flexible open-source data preparation toolkit called Data Prep Kit (DPK). DPK is architected and designed to enable users to scale their data preparation to their needs. With DPK they can prepare data on a local machine or effortlessly scale to run on a cluster with thousands of CPU Cores. DPK comes with a highly scalable, yet extensible set of modules that transform natural language and code data. If the user needs additional transforms, they can be easily developed using extensive DPK support for transform creation. These modules can be used independently or pipelined to perform a series of operations. In this paper, we describe DPK architecture and show its performance from a small scale to a very large number of CPUs. The modules from DPK have been used for the preparation of Granite Models [1] [2]. We believe DPK is a valuable contribution to the AI community to easily prepare data to enhance the performance of their LLM models or to fine-tune models with Retrieval-Augmented Generation (RAG).
AIJul 8, 2023
Multi-Intent Detection in User Provided Annotations for Programming by Examples SystemsNischal Ashok Kumar, Nitin Gupta, Shanmukha Guttula et al. · cmu
In mapping enterprise applications, data mapping remains a fundamental part of integration development, but its time consuming. An increasing number of applications lack naming standards, and nested field structures further add complexity for the integration developers. Once the mapping is done, data transformation is the next challenge for the users since each application expects data to be in a certain format. Also, while building integration flow, developers need to understand the format of the source and target data field and come up with transformation program that can change data from source to target format. The problem of automatic generation of a transformation program through program synthesis paradigm from some specifications has been studied since the early days of Artificial Intelligence (AI). Programming by Example (PBE) is one such kind of technique that targets automatic inferencing of a computer program to accomplish a format or string conversion task from user-provided input and output samples. To learn the correct intent, a diverse set of samples from the user is required. However, there is a possibility that the user fails to provide a diverse set of samples. This can lead to multiple intents or ambiguity in the input and output samples. Hence, PBE systems can get confused in generating the correct intent program. In this paper, we propose a deep neural network based ambiguity prediction model, which analyzes the input-output strings and maps them to a different set of properties responsible for multiple intent. Users can analyze these properties and accordingly can provide new samples or modify existing samples which can help in building a better PBE system for mapping enterprise applications.
AIJul 18, 2024
Scaling Granite Code Models to 128K ContextMatt Stallone, Vaibhav Saxena, Leonid Karlinsky et al.
This paper introduces long-context Granite code models that support effective context windows of up to 128K tokens. Our solution for scaling context length of Granite 3B/8B code models from 2K/4K to 128K consists of a light-weight continual pretraining by gradually increasing its RoPE base frequency with repository-level file packing and length-upsampled long-context data. Additionally, we also release instruction-tuned models with long-context support which are derived by further finetuning the long context base models on a mix of permissively licensed short and long-context instruction-response pairs. While comparing to the original short-context Granite code models, our long-context models achieve significant improvements on long-context tasks without any noticeable performance degradation on regular code completion benchmarks (e.g., HumanEval). We release all our long-context Granite code models under an Apache 2.0 license for both research and commercial use.
CLApr 21
STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMsSungeun An, Swanand Ravindra Kadhe, Shailja Thakur et al.
Benchmarks are often used as a standard to understand LLM capabilities in different domains. However, aggregate benchmark scores provide limited insight into compositional skill gaps of LLMs and how to improve them. To make these weaknesses visible, we propose Scaffolded Task Design (STaD) framework. STaD generates controlled variations of benchmark tasks based on the concept of scaffolding, which introduces structured, incremental support in a step-by-step manner. Rather than inspecting failures individually, this approach enables systematic and scalable probing of model behavior by identifying the specific reasoning skill compositions they lack. Treating the LLM as a black box, our experiments on six models of varying sizes reveal multiple failure points in three reasoning benchmarks and highlight each model's unique and distinct skill gaps.
AIMay 7, 2024Code
Granite Code Models: A Family of Open Foundation Models for Code IntelligenceMayank Mishra, Matt Stallone, Gaoyuan Zhang et al. · ibm-research
Large Language Models (LLMs) trained on code are revolutionizing the software development process. Increasingly, code LLMs are being integrated into software development environments to improve the productivity of human programmers, and LLM-based agents are beginning to show promise for handling complex tasks autonomously. Realizing the full potential of code LLMs requires a wide range of capabilities, including code generation, fixing bugs, explaining and documenting code, maintaining repositories, and more. In this work, we introduce the Granite series of decoder-only code models for code generative tasks, trained with code written in 116 programming languages. The Granite Code models family consists of models ranging in size from 3 to 34 billion parameters, suitable for applications ranging from complex application modernization tasks to on-device memory-constrained use cases. Evaluation on a comprehensive set of tasks demonstrates that Granite Code models consistently reaches state-of-the-art performance among available open-source code LLMs. The Granite Code model family was optimized for enterprise software development workflows and performs well across a range of coding tasks (e.g. code generation, fixing and explanation), making it a versatile all around code model. We release all our Granite Code models under an Apache 2.0 license for both research and commercial use.
CLFeb 13, 2023
Identifying Semantically Difficult Samples to Improve Text ClassificationShashank Mujumdar, Stuti Mehta, Hima Patel et al.
In this paper, we investigate the effect of addressing difficult samples from a given text dataset on the downstream text classification task. We define difficult samples as being non-obvious cases for text classification by analysing them in the semantic embedding space; specifically - (i) semantically similar samples that belong to different classes and (ii) semantically dissimilar samples that belong to the same class. We propose a penalty function to measure the overall difficulty score of every sample in the dataset. We conduct exhaustive experiments on 13 standard datasets to show a consistent improvement of up to 9% and discuss qualitative results to show effectiveness of our approach in identifying difficult samples for a text classification model.
AIDec 1, 2025
STRIDE: A Systematic Framework for Selecting AI Modalities -- Agentic AI, AI Assistants, or LLM CallsShubhi Asthana, Bing Zhang, Chad DeLuca et al.
The rapid shift from stateless large language models (LLMs) to autonomous, goal-driven agents raises a central question: When is agentic AI truly necessary? While agents enable multi-step reasoning, persistent memory, and tool orchestration, deploying them indiscriminately leads to higher cost, complexity, and risk. We present STRIDE (Systematic Task Reasoning Intelligence Deployment Evaluator), a framework that provides principled recommendations for selecting between three modalities: (i) direct LLM calls, (ii) guided AI assistants, and (iii) fully autonomous agentic AI. STRIDE integrates structured task decomposition, dynamism attribution, and self-reflection requirement analysis to produce an Agentic Suitability Score, ensuring that full agentic autonomy is reserved for tasks with inherent dynamism or evolving context. Evaluated across 30 real-world tasks spanning SRE, compliance, and enterprise automation, STRIDE achieved 92% accuracy in modality selection, reduced unnecessary agent deployments by 45%, and cut resource costs by 37%. Expert validation over six months in SRE and compliance domains confirmed its practical utility, with domain specialists agreeing that STRIDE effectively distinguishes between tasks requiring simple LLM calls, guided assistants, or full agentic autonomy. This work reframes agent adoption as a necessity-driven design decision, ensuring autonomy is applied only when its benefits justify the costs.
SEMay 14
Runtime-Structured Task Decomposition for Agentic Coding SystemsShubhi Asthana, Bing Zhang, Chad DeLuca et al.
Agentic coding systems increasingly use large language models (LLMs) for software engineering tasks such as debugging, root cause analysis, and code review. However, many existing systems encode task logic, execution flow, and output generation inside monolithic prompts. This design creates brittle behavior, limited debuggability, and high retry costs because failures often require rerunning the full workflow. We present runtime-structured task decomposition, an architectural approach in which task partitioning and execution flow are managed through executable control logic rather than prompt structure alone. LLMs are used only for focused judgment tasks, and outputs are validated against predefined schemas before downstream execution. We evaluate this approach on two software engineering workloads using three configurations: monolithic execution, static decomposition with fixed subtasks and no runtime branching, and runtime-structured decomposition. Each configuration was evaluated across 10 runs. Our results show that decomposition alone does not necessarily reduce retry cost. In the Kubernetes root cause analysis workload, the static decomposition baseline produced a retry cost of 1,632 +/- 145 tokens versus 904 +/- 17 tokens for the monolithic baseline because failures forced reruns of downstream subtasks. A similar pattern appeared in the multi-file debugging workload, where the static baseline consumed 933 tokens compared to 703 tokens for the monolithic system. The runtime-structured approach reran only failed subtasks, reducing retry costs to 436 +/- 132 tokens for root cause analysis and 460 tokens for debugging. Overall, the approach achieved up to 51.7% lower retry cost than monolithic systems and 73.2% lower retry cost than static decomposition baselines, improving efficiency, debuggability, and operational reliability in agentic coding systems.
SEDec 19, 2025
CIFE: Code Instruction-Following EvaluationSravani Gunnu, Shanmukha Guttula, Hima Patel
Large Language Models (LLMs) are increasingly applied to real-world code generation, where functional correctness alone is insufficient for reliable deployment, developers also expect adherence to explicit requirements for robustness, formatting, and security. Existing benchmarks primarily assess correctness through test-case execution, offering limited insight into how reliably models follow such constraints. We introduce a benchmark of 1,000 Python tasks, each paired with an average of 7 developer-specified constraints spanning 13 categories. Constraints are curated through a four-stage human-LLM pipeline to ensure they are atomic, relevant, and objective. We evaluate 14 open- and closed-source models using complementary adherence metrics and propose the C2A Score, a composite measure that jointly captures correctness and constraint compliance. Results reveal a substantial gap between partial and strict satisfaction, while strong models achieve over 90% partial adherence, strict adherence remains between 39-66%. These findings highlight that trustworthy code generation requires not only correctness but also consistent adherence to developer intent.
AIApr 24
A Systematic Approach for Large Language Models DebuggingBasel Shbita, Anna Lisa Gentile, Bing Zhang et al.
Large language models (LLMs) have become central to modern AI workflows, powering applications from open-ended text generation to complex agent-based reasoning. However, debugging these models remains a persistent challenge due to their opaque and probabilistic nature and the difficulty of diagnosing errors across diverse tasks and settings. This paper introduces a systematic approach for LLM debugging that treats models as observable systems, providing structured, model-agnostic methods from issue detection to model refinement. By unifying evaluation, interpretability, and error-analysis practices, our approach enables practitioners to iteratively diagnose model weaknesses, refine prompts and model parameters, and adapt data for fine-tuning or assessment, while remaining effective in contexts where standardized benchmarks and evaluation criteria are lacking. We argue that such a structured methodology not only accelerates troubleshooting but also fosters reproducibility, transparency, and scalability in the deployment of LLM-based systems.
SEMar 19, 2025
LLM-Aided Customizable Profiling of Code Data Based On Programming Language ConceptsPankaj Thorat, Adnan Qidwai, Adrija Dhar et al.
Data profiling is critical in machine learning for generating descriptive statistics, supporting both deeper understanding and downstream tasks like data valuation and curation. This work addresses profiling specifically in the context of code datasets for Large Language Models (code-LLMs), where data quality directly influences tasks such as code generation and summarization. Characterizing code datasets in terms of programming language concepts enables better insights and targeted data curation. Our proposed methodology decomposes code data profiling into two phases: (1) an offline phase where LLMs are leveraged to derive and learn rules for extracting syntactic and semantic concepts across various programming languages, including previously unseen or low-resource languages, and (2) an online deterministic phase applying these derived rules for efficient real-time analysis. This hybrid approach is customizable, extensible to new syntactic and semantic constructs, and scalable to multiple languages. Experimentally, our LLM-aided method achieves a mean accuracy of 90.33% for syntactic extraction rules and semantic classification accuracies averaging 80% and 77% across languages and semantic concepts, respectively.
LGOct 15, 2024
ILAEDA: An Imitation Learning Based Approach for Automatic Exploratory Data AnalysisAbhijit Manatkar, Devarsh Patel, Hima Patel et al.
Automating end-to-end Exploratory Data Analysis (AutoEDA) is a challenging open problem, often tackled through Reinforcement Learning (RL) by learning to predict a sequence of analysis operations (FILTER, GROUP, etc). Defining rewards for each operation is a challenging task and existing methods rely on various \emph{interestingness measures} to craft reward functions to capture the importance of each operation. In this work, we argue that not all of the essential features of what makes an operation important can be accurately captured mathematically using rewards. We propose an AutoEDA model trained through imitation learning from expert EDA sessions, bypassing the need for manually defined interestingness measures. Our method, based on generative adversarial imitation learning (GAIL), generalizes well across datasets, even with limited expert data. We also introduce a novel approach for generating synthetic EDA demonstrations for training. Our method outperforms the existing state-of-the-art end-to-end EDA approach on benchmarks by upto 3x, showing strong performance and generalization, while naturally capturing diverse interestingness measures in generated EDA sessions.
SENov 28, 2025
Generating Verifiable Chain of Thoughts from Exection-TracesShailja Thakur, Vaibhav Saxena, Rohan Kulkarni et al.
Teaching language models to reason about code execution remains a fundamental challenge. While Chain-of-Thought (CoT) prompting has shown promise, current synthetic training data suffers from a critical weakness: the reasoning steps are often plausible-sounding explanations generated by teacher models, not verifiable accounts of what the code actually does. This creates a troubling failure mode where models learn to mimic superficially convincing but logically flawed reasoning patterns. We address this by grounding CoT generation directly in program execution traces. Our pipeline instruments code to capture its dynamic behavior, then narrates these execution traces into natural language and factually-grounded rationales that are verifiable by design. This execution-grounded approach ensures every reasoning step reflects what the program computes, eliminating logical hallucinations at the source. We evaluate our method on code reasoning tasks, code generation and explanation tasks from HumanEval. Models trained on our bi-directional trace-grounded data achieve substantial improvements on reasoning tasks, with gains of up to 30 points on output prediction and 28 points on input prediction over base models, alongside competitive explanation and code generation performance. https://github.ibm.com/IBM-Research-AI/Verified-Code-CoT
LGMay 9, 2024
An Automatic Prompt Generation System for Tabular Data TasksAshlesha Akella, Abhijit Manatkar, Brij Chavda et al.
Efficient processing of tabular data is important in various industries, especially when working with datasets containing a large number of columns. Large language models (LLMs) have demonstrated their ability on several tasks through carefully crafted prompts. However, creating effective prompts for tabular datasets is challenging due to the structured nature of the data and the need to manage numerous columns. This paper presents an innovative auto-prompt generation system suitable for multiple LLMs, with minimal training. It proposes two novel methods; 1) A Reinforcement Learning-based algorithm for identifying and sequencing task-relevant columns 2) Cell-level similarity-based approach for enhancing few-shot example selection. Our approach has been extensively tested across 66 datasets, demonstrating improved performance in three downstream tasks: data imputation, error detection, and entity matching using two distinct LLMs; Google flan-t5-xxl and Mixtral 8x7B.
AIMar 14, 2024
xLP: Explainable Link Prediction for Master Data ManagementBalaji Ganesan, Matheen Ahmed Pasha, Srinivasa Parkala et al.
Explaining neural model predictions to users requires creativity. Especially in enterprise applications, where there are costs associated with users' time, and their trust in the model predictions is critical for adoption. For link prediction in master data management, we have built a number of explainability solutions drawing from research in interpretability, fact verification, path ranking, neuro-symbolic reasoning and self-explaining AI. In this demo, we present explanations for link prediction in a creative way, to allow users to choose explanations they are more comfortable with.
LGAug 12, 2021
Data Quality Toolkit: Automatic assessment of data quality and remediation for machine learning datasetsNitin Gupta, Hima Patel, Shazia Afzal et al.
The quality of training data has a huge impact on the efficiency, accuracy and complexity of machine learning tasks. Various tools and techniques are available that assess data quality with respect to general cleaning and profiling checks. However these techniques are not applicable to detect data issues in the context of machine learning tasks, like noisy labels, existence of overlapping classes etc. We attempt to re-look at the data quality issues in the context of building a machine learning pipeline and build a tool that can detect, explain and remediate issues in the data, and systematically and automatically capture all the changes applied to the data. We introduce the Data Quality Toolkit for machine learning as a library of some key quality metrics and relevant remediation techniques to analyze and enhance the readiness of structured training datasets for machine learning projects. The toolkit can reduce the turn-around times of data preparation pipelines and streamline the data quality assessment process. Our toolkit is publicly available via IBM API Hub [1] platform, any developer can assess the data quality using the IBM's Data Quality for AI apis [2]. Detailed tutorials are also available on IBM Learning Path [3].
CRDec 10, 2020
Explainable Link Prediction for Privacy-Preserving Contact TracingBalaji Ganesan, Hima Patel, Sameep Mehta
Contact Tracing has been used to identify people who were in close proximity to those infected with SARS-Cov2 coronavirus. A number of digital contract tracing applications have been introduced to facilitate or complement physical contact tracing. However, there are a number of privacy issues in the implementation of contract tracing applications, which make people reluctant to install or update their infection status on these applications. In this concept paper, we present ideas from Graph Neural Networks and explainability, that could improve trust in these applications, and encourage adoption by people.
DBOct 14, 2020
Data Readiness ReportShazia Afzal, Rajmohan C, Manish Kesarwani et al.
Data exploration and quality analysis is an important yet tedious process in the AI pipeline. Current practices of data cleaning and data readiness assessment for machine learning tasks are mostly conducted in an arbitrary manner which limits their reuse and results in loss of productivity. We introduce the concept of a Data Readiness Report as an accompanying documentation to a dataset that allows data consumers to get detailed insights into the quality of input data. Data characteristics and challenges on various quality dimensions are identified and documented keeping in mind the principles of transparency and explainability. The Data Readiness Report also serves as a record of all data assessment operations including applied transformations. This provides a detailed lineage for the purpose of data governance and management. In effect, the report captures and documents the actions taken by various personas in a data readiness and assessment workflow. Overtime this becomes a repository of best practices and can potentially drive a recommendation system for building automated data readiness workflows on the lines of AutoML [8]. We anticipate that together with the Datasheets [9], Dataset Nutrition Label [11], FactSheets [1] and Model Cards [15], the Data Readiness Report makes significant progress towards Data and AI lifecycle documentation.
CLSep 30, 2020
Towards a Multi-modal, Multi-task Learning based Pre-training Framework for Document Representation LearningSubhojeet Pramanik, Shashank Mujumdar, Hima Patel
Recent approaches in literature have exploited the multi-modal information in documents (text, layout, image) to serve specific downstream document tasks. However, they are limited by their - (i) inability to learn cross-modal representations across text, layout and image dimensions for documents and (ii) inability to process multi-page documents. Pre-training techniques have been shown in Natural Language Processing (NLP) domain to learn generic textual representations from large unlabelled datasets, applicable to various downstream NLP tasks. In this paper, we propose a multi-task learning-based framework that utilizes a combination of self-supervised and supervised pre-training tasks to learn a generic document representation applicable to various downstream document tasks. Specifically, we introduce Document Topic Modelling and Document Shuffle Prediction as novel pre-training tasks to learn rich image representations along with the text and layout representations for documents. We utilize the Longformer network architecture as the backbone to encode the multi-modal information from multi-page documents in an end-to-end fashion. We showcase the applicability of our pre-training framework on a variety of different real-world document tasks such as document classification, document information extraction, and document retrieval. We evaluate our framework on different standard document datasets and conduct exhaustive experiments to compare performance against various ablations of our framework and state-of-the-art baselines.
SIMar 7, 2020
Link Prediction using Graph Neural Networks for Master Data ManagementBalaji Ganesan, Srinivas Parkala, Neeraj R Singh et al.
Learning graph representations of n-ary relational data has a number of real world applications like anti-money laundering, fraud detection, and customer due diligence. Contact tracing of COVID19 positive persons could also be posed as a Link Prediction problem. Predicting links between people using Graph Neural Networks requires careful ethical and privacy considerations than in domains where GNNs have typically been applied so far. We introduce novel methods for anonymizing data, model training, explainability and verification for Link Prediction in Master Data Management, and discuss our results.
IRFeb 23, 2020
Data Augmentation for Personal Knowledge Base PopulationLingraj S Vannur, Balaji Ganesan, Lokesh Nagalapatti et al.
Cold start knowledge base population (KBP) is the problem of populating a knowledge base from unstructured documents. While artificial neural networks have led to significant improvements in the different tasks that are part of KBP, the overall F1 of the end-to-end system remains quite low. This problem is more acute in personal knowledge bases, which present additional challenges with regard to data protection, fairness and privacy. In this work, we present a system that uses rule based annotators and a graph neural network for missing link prediction, to populate a more complete, fair and diverse knowledge base from the TACRED dataset.
AIJan 22, 2020
A Neural Architecture for Person Ontology populationBalaji Ganesan, Riddhiman Dasgupta, Akshay Parekh et al.
A person ontology comprising concepts, attributes and relationships of people has a number of applications in data protection, didentification, population of knowledge graphs for business intelligence and fraud prevention. While artificial neural networks have led to improvements in Entity Recognition, Entity Classification, and Relation Extraction, creating an ontology largely remains a manual process, because it requires a fixed set of semantic relations between concepts. In this work, we present a system for automatically populating a person ontology graph from unstructured data using neural models for Entity Classification and Relation Extraction. We introduce a new dataset for these tasks and discuss our results.
LGJul 11, 2014
An SVM Based Approach for Cardiac View PlanningRamasubramanian Sundararajan, Hima Patel, Dattesh Shanbhag et al.
We consider the problem of automatically prescribing oblique planes (short axis, 4 chamber and 2 chamber views) in Cardiac Magnetic Resonance Imaging (MRI). A concern with technologist-driven acquisitions of these planes is the quality and time taken for the total examination. We propose an automated solution incorporating anatomical features external to the cardiac region. The solution uses support vector machine regression models wherein complexity and feature selection are optimized using multi-objective genetic algorithms. Additionally, we examine the robustness of our approach by training our models on images with additive Rician-Gaussian mixtures at varying Signal to Noise (SNR) levels. Our approach has shown promising results, with an angular deviation of less than 15 degrees on 90% cases across oblique planes, measured in terms of average 6-fold cross validation performance -- this is generally within acceptable bounds of variation as specified by clinicians.
LGJul 10, 2014
A multi-instance learning algorithm based on a stacked ensemble of lazy learnersRamasubramanian Sundararajan, Hima Patel, Manisha Srivastava
This document describes a novel learning algorithm that classifies "bags" of instances rather than individual instances. A bag is labeled positive if it contains at least one positive instance (which may or may not be specifically identified), and negative otherwise. This class of problems is known as multi-instance learning problems, and is useful in situations where the class label at an instance level may be unavailable or imprecise or difficult to obtain, or in situations where the problem is naturally posed as one of classifying instance groups. The algorithm described here is an ensemble-based method, wherein the members of the ensemble are lazy learning classifiers learnt using the Citation Nearest Neighbour method. Diversity among the ensemble members is achieved by optimizing their parameters using a multi-objective optimization method, with the objectives being to maximize Class 1 accuracy and minimize false positive rate. The method has been found to be effective on the Musk1 benchmark dataset.