CLJun 3, 2025Code
FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure ModesChristodoulos Constantinides, Dhaval Patel, Shuxin Lin et al.
We introduce FailureSensorIQ, a novel Multi-Choice Question-Answering (MCQA) benchmarking system designed to assess the ability of Large Language Models (LLMs) to reason and understand complex, domain-specific scenarios in Industry 4.0. Unlike traditional QA benchmarks, our system focuses on multiple aspects of reasoning through failure modes, sensor data, and the relationships between them across various industrial assets. Through this work, we envision a paradigm shift where modeling decisions are not only data-driven using statistical tools like correlation analysis and significance tests, but also domain-driven by specialized LLMs which can reason about the key contributors and useful patterns that can be captured with feature engineering. We evaluate the Industrial knowledge of over a dozen LLMs-including GPT-4, Llama, and Mistral-on FailureSensorIQ from different lens using Perturbation-Uncertainty-Complexity analysis, Expert Evaluation study, Asset-Specific Knowledge Gap analysis, ReAct agent using external knowledge-bases. Even though closed-source models with strong reasoning capabilities approach expert-level performance, the comprehensive benchmark reveals a significant drop in performance that is fragile to perturbations, distractions, and inherent knowledge gaps in the models. We also provide a real-world case study of how LLMs can drive the modeling decisions on 3 different failure prediction datasets related to various assets. We release: (a) expert-curated MCQA for various industrial assets, (b) FailureSensorIQ benchmark and Hugging Face leaderboard based on MCQA built from non-textual data found in ISO documents, and (c) LLMFeatureSelector, an LLM-based feature selection scikit-learn pipeline. The software is available at https://github.com/IBM/FailureSensorIQ.
CLOct 21, 2025Code
Fine-Tuned Thoughts: Leveraging Chain-of-Thought Reasoning for Industrial Asset Health MonitoringShuxin Lin, Dhaval Patel, Christodoulos Constantinides
Small Language Models (SLMs) are becoming increasingly popular in specialized fields, such as industrial applications, due to their efficiency, lower computational requirements, and ability to be fine-tuned for domain-specific tasks, enabling accurate and cost-effective solutions. However, performing complex reasoning using SLMs in specialized fields such as Industry 4.0 remains challenging. In this paper, we propose a knowledge distillation framework for industrial asset health, which transfers reasoning capabilities via Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) to smaller, more efficient models (SLMs). We discuss the advantages and the process of distilling LLMs using multi-choice question answering (MCQA) prompts to enhance reasoning and refine decision-making. We also perform in-context learning to verify the quality of the generated knowledge and benchmark the performance of fine-tuned SLMs with generated knowledge against widely used LLMs. The results show that the fine-tuned SLMs with CoT reasoning outperform the base models by a significant margin, narrowing the gap to their LLM counterparts. Our code is open-sourced at: https://github.com/IBM/FailureSensorIQ.
AIAug 21, 2025Code
Transduction is All You Need for Structured Data WorkflowsAlfio Gliozzo, Naweed Khan, Christodoulos Constantinides et al.
This paper introduces Agentics, a functional agentic AI framework for building LLM-based structured data workflow pipelines. Designed for both research and practical applications, Agentics offers a new data-centric paradigm in which agents are embedded within data types, enabling logical transduction between structured states. This design shifts the focus toward principled data modeling, providing a declarative language where data types are directly exposed to large language models and composed through transductions triggered by type connections. We present a range of structured data workflow tasks and empirical evidence demonstrating the effectiveness of this approach, including data wrangling, text-to-SQL semantic parsing, and domain-specific multiple-choice question answering. The open source Agentics is available at https://github.com/IBM/Agentics.
61.0AIMay 9
DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic RulesDevin Yasith De Silva, Dhaval Patel, Christodoulos Constantinides et al.
Monitoring complex industrial assets relies on engineer-authored symbolic rules that trigger based on sensor conditions and prompt technicians to perform corrective actions. The bottleneck is not detection but response: translating rules into maintenance steps requires asset-specific knowledge gained through years of practice. We investigate whether LLMs can serve as decision support for this rule-to-action step and introduce \ours{}, a benchmark of 6{,}690 expert-validated multiple-choice questions from 118 rule-action pairs across 16 asset types. We contribute (i) a symbolic-to-MCQA pipeline normalizing rules to Disjunctive Normal Form with embedding-based distractor sampling, (ii) five variants probing distinct failure modes (Pro, Pert, Verbose, Aug, Rationale), and (iii) a benchmark of 29 LLMs and 4 embedding baselines. A human evaluation (9 practitioners, mean 45.0\%) confirms \ours{} requires specialist knowledge beyond operational experience. Three findings stand out. The frontier has closed: the top three LLMs lie within one Macro point, with Bradley-Terry Elo placing claude-opus-4-6 30 points above the next model. Yet \ours{}\,Pro exposes brittleness, with every model losing 13--60\% relative accuracy under distractor expansion. \ours{}\,Aug exposes pattern-matching: under condition inversion, frontier models still select the original answer 49--63\% of the time. The deployment bottleneck is not capability but calibration: frontier models handle template-style fault detection but break under structural perturbation.
LGJan 28, 2025
LLM Assisted Anomaly Detection Service for Site Reliability Engineers: Enhancing Cloud Infrastructure ResilienceNimesh Jha, Shuxin Lin, Srideepika Jayaraman et al. · ibm-research
This paper introduces a scalable Anomaly Detection Service with a generalizable API tailored for industrial time-series data, designed to assist Site Reliability Engineers (SREs) in managing cloud infrastructure. The service enables efficient anomaly detection in complex data streams, supporting proactive identification and resolution of issues. Furthermore, it presents an innovative approach to anomaly modeling in cloud infrastructure by utilizing Large Language Models (LLMs) to understand key components, their failure modes, and behaviors. A suite of algorithms for detecting anomalies is offered in univariate and multivariate time series data, including regression-based, mixture-model-based, and semi-supervised approaches. We provide insights into the usage patterns of the service, with over 500 users and 200,000 API calls in a year. The service has been successfully applied in various industrial settings, including IoT-based AI applications. We have also evaluated our system on public anomaly benchmarks to show its effectiveness. By leveraging it, SREs can proactively identify potential issues before they escalate, reducing downtime and improving response times to incidents, ultimately enhancing the overall customer experience. We plan to extend the system to include time series foundation models, enabling zero-shot anomaly detection capabilities.
CLJun 14, 2025
Towards Building General Purpose Embedding Models for Industry 4.0 AgentsChristodoulos Constantinides, Shuxin Lin, Dhaval Patel
In this work we focus on improving language models' understanding for asset maintenance to guide the engineer's decisions and minimize asset downtime. Given a set of tasks expressed in natural language for Industry 4.0 domain, each associated with queries related to a specific asset, we want to recommend relevant items and generalize to queries of similar assets. A task may involve identifying relevant sensors given a query about an asset's failure mode. Our approach begins with gathering a qualitative, expert-vetted knowledge base to construct nine asset-specific task datasets. To create more contextually informed embeddings, we augment the input tasks using Large Language Models (LLMs), providing concise descriptions of the entities involved in the queries. This embedding model is then integrated with a Reasoning and Acting agent (ReAct), which serves as a powerful tool for answering complex user queries that require multi-step reasoning, planning, and knowledge inference. Through ablation studies, we demonstrate that: (a) LLM query augmentation improves the quality of embeddings, (b) Contrastive loss and other methods that avoid in-batch negatives are superior for datasets with queries related to many items, and (c) It is crucial to balance positive and negative in-batch samples. After training and testing on our dataset, we observe a substantial improvement: HIT@1 increases by +54.2%, MAP@100 by +50.1%, and NDCG@10 by +54.7%, averaged across all tasks and models. Additionally, we empirically demonstrate the model's planning and tool invocation capabilities when answering complex questions related to industrial asset maintenance, showcasing its effectiveness in supporting Subject Matter Experts (SMEs) in their day-to-day operations.
CLJun 11, 2025
Chat-of-Thought: Collaborative Multi-Agent System for Generating Domain Specific InformationChristodoulos Constantinides, Shuxin Lin, Nianjun Zhou et al.
This paper presents a novel multi-agent system called Chat-of-Thought, designed to facilitate the generation of Failure Modes and Effects Analysis (FMEA) documents for industrial assets. Chat-of-Thought employs multiple collaborative Large Language Model (LLM)-based agents with specific roles, leveraging advanced AI techniques and dynamic task routing to optimize the generation and validation of FMEA tables. A key innovation in this system is the introduction of a Chat of Thought, where dynamic, multi-persona-driven discussions enable iterative refinement of content. This research explores the application domain of industrial equipment monitoring, highlights key challenges, and demonstrates the potential of Chat-of-Thought in addressing these challenges through interactive, template-driven workflows and context-aware agent collaboration.
NEDec 29, 2021
Effects of Plasticity Functions on Neural AssembliesChristodoulos Constantinides, Kareem Nassar
We explore the effects of various plasticity functions on assemblies of neurons. To bridge the gap between experimental and computational theories we make use of a conceptual framework, the Assembly Calculus, which is a formal system for the description of brain function based on assemblies of neurons. The Assembly Calculus includes operations for projecting, associating, and merging assemblies of neurons. Our research is focused on simulating different plasticity functions with Assembly Calculus. Our main contribution is the modification and evaluation of the projection operation. We experiment with Oja's and Spike Time-Dependent Plasticity (STDP) rules and test the effect of various hyper-parameters.