AIMar 14, 2022
Conformance Checking Over Stochastically Known LogsEli Bogdanov, Izack Cohen, Avigdor Gal
With the growing number of devices, sensors and digital systems, data logs may become uncertain due to, e.g., sensor reading inaccuracies or incorrect interpretation of readings by processing programs. At times, such uncertainties can be captured stochastically, especially when using probabilistic data classification models. In this work we focus on conformance checking, which compares a process model with an event log, when event logs are stochastically known. Building on existing alignment-based conformance checking fundamentals, we mathematically define a stochastic trace model, a stochastic synchronous product, and a cost function that reflects the uncertainty of events in a log. Then, we search for an optimal alignment over the reachability graph of the stochastic synchronous product for finding an optimal alignment between a model and a stochastic process observation. Via structured experiments with two well-known process mining benchmarks, we explore the behavior of the suggested stochastic conformance checking approach and compare it to a standard alignment-based approach as well as to an approach that creates a lower bound on performance. We envision the proposed stochastic conformance checking approach as a viable process mining component for future analysis of stochastic event logs.
AIApr 24
On the Hybrid Nature of ABPMS Process Frames and its Implications on Automated Process DiscoveryAnti Alman, Izack Cohen, Avigdor Gal et al.
A core component of any AI-Augmented Business Process Management System (ABPMS) is the process frame, which gives the system process-awareness and defines the boundaries in which the system must operate. Compared to traditional process models, the process frame should, in principle, provide a somewhat more permissive representation of the managed processes, such that the (semi) autonomous behavior of an ABPMS, referred to as framed autonomy, could emerge. At the same time, it is not limited to a single linguistic or symbolic formalism and may incorporate heterogeneous knowledge ranging from predefined procedures to commonsense rules and best practices. In this paper, we conceptualize the notion of an ABPMS process frame as a hybrid business process representation, consisting of semi-concurrently executed procedural and declarative process models. We rely on our earlier works to outline the execution semantics of this type of process frame, arguing in favor of adopting the open-world assumption of the declarative paradigm also for procedural process models. The latter leads to a constraint-like interpretation, where each procedural model is considered to constrain the activities within that model, without imposing explicit execution requirements nor limitations on activities that may be present in other models. This is analogous to existing declarative languages, such as Declare, where each constraint has a direct effect only on the specific activities being constrained. Given this similarity, we propose mapping subsets of discovered declarative constraints into equivalent semi-concurrently executed procedural fragments, thus laying the foundation for a corresponding process (frame) discovery approach.
LGJun 25, 2022
SKTR: Trace Recovery from Stochastically Known LogsEli Bogdanov, Izack Cohen, Avigdor Gal
Developments in machine learning together with the increasing usage of sensor data challenge the reliance on deterministic logs, requiring new process mining solutions for uncertain, and in particular stochastically known, logs. In this work we formulate {trace recovery}, the task of generating a deterministic log from stochastically known logs that is as faithful to reality as possible. An effective trace recovery algorithm would be a powerful aid for maintaining credible process mining tools for uncertain settings. We propose an algorithmic framework for this task that recovers the best alignment between a stochastically known log and a process model, with three innovative features. Our algorithm, SKTR, 1) handles both Markovian and non-Markovian processes; 2) offers a quality-based balance between a process model and a log, depending on the available process information, sensor quality, and machine learning predictiveness power; and 3) offers a novel use of a synchronous product multigraph to create the log. An empirical analysis using five publicly available datasets, three of which use predictive models over standard video capturing benchmarks, shows an average relative accuracy improvement of more than 10 over a common baseline.
DBApr 27, 2022
Human's Role in-the-LoopAvigdor Gal, Roee Shraga
Data integration has been recently challenged by the need to handle large volumes of data, arriving at high velocity from a variety of sources, which demonstrate varying levels of veracity. This challenging setting, often referred to as big data, renders many of the existing techniques, especially those that are human-intensive, obsolete. Big data also produces technological advancements such as Internet of things, cloud computing, and deep learning, and accordingly, provides a new, exciting, and challenging research agenda. Given the availability of data and the improvement of machine learning techniques, this blog discusses the respective roles of humans and machines in achieving cognitive tasks in matching, aiming to determine whether traditional roles of humans and machines are subject to change. Such investigation, we believe, will pave a way to better utilize both human and machine resources in new and innovative manners. We shall discuss two possible modes of change, namely humans out and humans in. Humans out aim at exploring out-of-the-box latent matching reasoning using machine learning algorithms when attempting to overpower human matcher performance. Pursuing out-of-the-box thinking, machine and deep learning can be involved in matching. Humans in explores how to better involve humans in the matching loop by assigning human matchers with a symmetric role to algorithmic matcher in the matching process.
CLAug 23, 2022
FlexER: Flexible Entity Resolution for Multiple IntentsBar Genossar, Roee Shraga, Avigdor Gal
Entity resolution, a longstanding problem of data cleaning and integration, aims at identifying data records that represent the same real-world entity. Existing approaches treat entity resolution as a universal task, assuming the existence of a single interpretation of a real-world entity and focusing only on finding matched records, separating corresponding from non-corresponding ones, with respect to this single interpretation. However, in real-world scenarios, where entity resolution is part of a more general data project, downstream applications may have varying interpretations of real-world entities relating, for example, to various user needs. In what follows, we introduce the problem of multiple intents entity resolution (MIER), an extension to the universal (single intent) entity resolution task. As a solution, we propose FlexER, utilizing contemporary solutions to universal entity resolution tasks to solve multiple intents entity resolution. FlexER addresses the problem as a multi-label classification problem. It combines intent-based representations of tuple pairs using a multiplex graph representation that serves as an input to a graph neural network (GNN). FlexER learns intent representations and improves the outcome to multiple resolution problems. A large-scale empirical evaluation introduces a new benchmark and, using also two well-known benchmarks, shows that FlexER effectively solves the MIER problem and outperforms the state-of-the-art for a universal entity resolution.
DBNov 27, 2023
The Battleship Approach to the Low Resource Entity Matching ProblemBar Genossar, Avigdor Gal, Roee Shraga
Entity matching, a core data integration problem, is the task of deciding whether two data tuples refer to the same real-world entity. Recent advances in deep learning methods, using pre-trained language models, were proposed for resolving entity matching. Although demonstrating unprecedented results, these solutions suffer from a major drawback as they require large amounts of labeled data for training, and, as such, are inadequate to be applied to low resource entity matching problems. To overcome the challenge of obtaining sufficient labeled data we offer a new active learning approach, focusing on a selection mechanism that exploits unique properties of entity matching. We argue that a distributed representation of a tuple pair indicates its informativeness when considered among other pairs. This is used consequently in our approach that iteratively utilizes space-aware considerations. Bringing it all together, we treat the low resource entity matching problem as a Battleship game, hunting indicative samples, focusing on positive ones, through awareness of the latent space along with careful planning of next sampling iterations. An extensive experimental analysis shows that the proposed algorithm outperforms state-of-the-art active learning solutions to low resource entity matching, and although using less samples, can be as successful as state-of-the-art fully trained known algorithms.
LGNov 9, 2025
3dSAGER: Geospatial Entity Resolution over 3D Objects (Technical Report)Bar Genossar, Sagi Dalyot, Roee Shraga et al.
Urban environments are continuously mapped and modeled by various data collection platforms, including satellites, unmanned aerial vehicles and street cameras. The growing availability of 3D geospatial data from multiple modalities has introduced new opportunities and challenges for integrating spatial knowledge at scale, particularly in high-impact domains such as urban planning and rapid disaster management. Geospatial entity resolution is the task of identifying matching spatial objects across different datasets, often collected independently under varying conditions. Existing approaches typically rely on spatial proximity, textual metadata, or external identifiers to determine correspondence. While useful, these signals are often unavailable, unreliable, or misaligned, especially in cross-source scenarios. To address these limitations, we shift the focus to the intrinsic geometry of 3D spatial objects and present 3dSAGER (3D Spatial-Aware Geospatial Entity Resolution), an end-to-end pipeline for geospatial entity resolution over 3D objects. 3dSAGER introduces a novel, spatial-reference-independent featurization mechanism that captures intricate geometric characteristics of matching pairs, enabling robust comparison even across datasets with incompatible coordinate systems where traditional spatial methods fail. As a key component of 3dSAGER, we also propose a new lightweight and interpretable blocking method, BKAFI, that leverages a trained model to efficiently generate high-recall candidate sets. We validate 3dSAGER through extensive experiments on real-world urban datasets, demonstrating significant gains in both accuracy and efficiency over strong baselines. Our empirical study further dissects the contributions of each component, providing insights into their impact and the overall design choices.
LGApr 26, 2022
From Limited Annotated Raw Material Data to Quality Production Data: A Case Study in the Milk Industry (Technical Report)Roee Shraga, Gil Katz, Yael Badian et al.
Industry 4.0 offers opportunities to combine multiple sensor data sources using IoT technologies for better utilization of raw material in production lines. A common belief that data is readily available (the big data phenomenon), is oftentimes challenged by the need to effectively acquire quality data under severe constraints. In this paper we propose a design methodology, using active learning to enhance learning capabilities, for building a model of production outcome using a constrained amount of raw material training data. The proposed methodology extends existing active learning methods to effectively solve regression-based learning problems and may serve settings where data acquisition requires excessive resources in the physical world. We further suggest a set of qualitative measures to analyze learners performance. The proposed methodology is demonstrated using an actual application in the milk industry, where milk is gathered from multiple small milk farms and brought to a dairy production plant to be processed into cottage cheese.
DBOct 16, 2020Code
Discovering Hierarchical Processes Using Flexible Activity Trees for Event AbstractionXixi Lu, Avigdor Gal, Hajo A. Reijers
Processes, such as patient pathways, can be very complex, comprising of hundreds of activities and dozens of interleaved subprocesses. While existing process discovery algorithms have proven to construct models of high quality on clean logs of structured processes, it still remains a challenge when the algorithms are being applied to logs of complex processes. The creation of a multi-level, hierarchical representation of a process can help to manage this complexity. However, current approaches that pursue this idea suffer from a variety of weaknesses. In particular, they do not deal well with interleaving subprocesses. In this paper, we propose FlexHMiner, a three-step approach to discover processes with multi-level interleaved subprocesses. We implemented FlexHMiner in the open source Process Mining toolkit ProM. We used seven real-life logs to compare the qualities of hierarchical models discovered using domain knowledge, random clustering, and flat approaches. Our results indicate that the hierarchical process models that the FlexHMiner generates compare favorably to approaches that do not exploit hierarchy.
LGOct 26, 2025
DDTR: Diffusion Denoising Trace RecoveryMaximilian Matyash, Avigdor Gal, Arik Senderovich
With recent technological advances, process logs, which were traditionally deterministic in nature, are being captured from non-deterministic sources, such as uncertain sensors or machine learning models (that predict activities using cameras). In the presence of stochastically-known logs, logs that contain probabilistic information, the need for stochastic trace recovery increases, to offer reliable means of understanding the processes that govern such systems. We design a novel deep learning approach for stochastic trace recovery, based on Diffusion Denoising Probabilistic Models (DDPM), which makes use of process knowledge (either implicitly by discovering a model or explicitly by injecting process knowledge in the training phase) to recover traces by denoising. We conduct an empirical evaluation demonstrating state-of-the-art performance with up to a 25% improvement over existing methods, along with increased robustness under high noise levels.
DBMar 9, 2025
Conformance Checking for Less: Efficient Conformance Checking for Long Event SequencesEli Bogdanov, Izack Cohen, Avigdor Gal
Long event sequences (termed traces) and large data logs that originate from sensors and prediction models are becoming increasingly common in our data-rich world. In such scenarios, conformance checking-validating a data log against an expected system behavior (the process model) can become computationally infeasible due to the exponential complexity of finding an optimal alignment. To alleviate scalability challenges for this task, we propose ConLES, a sliding-window conformance checking approach for long event sequences that preserves the interpretability of alignment-based methods. ConLES partitions traces into manageable subtraces and iteratively aligns each against the expected behavior, leading to significant reduction of the search space while maintaining overall accuracy. We use global information that captures structural properties of both the trace and the process model, enabling informed alignment decisions and discarding unpromising alignments, even if they appear locally optimal. Performance evaluations across multiple datasets highlight that ConLES outperforms the leading optimal and heuristic algorithms for long traces, consistently achieving the optimal or near-optimal solution. Unlike other conformance methods that struggle with long event sequences, ConLES significantly reduces the search space, scales efficiently, and uniquely supports both predefined and discovered process models, making it a viable and leading option for conformance checking of long event sequences.
LGDec 9, 2024
MISFEAT: Feature Selection for Subgroups with Systematic Missing DataBar Genossar, Thinh On, Md. Mouinul Islam et al.
We investigate the problem of selecting features for datasets that can be naturally partitioned into subgroups (e.g., according to socio-demographic groups and age), each with its own dominant set of features. Within this subgroup-oriented framework, we address the challenge of systematic missing data, a scenario in which some feature values are missing for all tuples of a subgroup, due to flawed data integration, regulatory constraints, or privacy concerns. Feature selection is governed by finding mutual Information, a popular quantification of correlation, between features and a target variable. Our goal is to identify top-K feature subsets of some fixed size with the highest joint mutual information with a target variable. In the presence of systematic missing data, the closed form of mutual information could not simply be applied. We argue that in such a setting, leveraging relationships between available feature mutual information within a subgroup or across subgroups can assist inferring missing mutual information values. We propose a generalizable model based on heterogeneous graph neural network to identify interdependencies between feature-subgroup-target variable connections by modeling it as a multiplex graph, and employing information propagation between its nodes. We address two distinct scalability challenges related to training and propose principled solutions to tackle them. Through an extensive empirical evaluation, we demonstrate the efficacy of the proposed solutions both qualitatively and running time wise.
AIJun 8, 2024
A Scalable and Near-Optimal Conformance Checking Approach for Long TracesEli Bogdanov, Izack Cohen, Avigdor Gal
Long traces and large event logs that originate from sensors and prediction models are becoming more common in our data-rich world. In such circumstances, conformance checking, a key task in process mining, can become computationally infeasible due to the exponential complexity of finding an optimal alignment. This paper introduces a novel sliding window approach to address these scalability challenges while preserving the interpretability of alignment-based methods. By breaking down traces into manageable subtraces and iteratively aligning each with the process model, our method significantly reduces the search space. The approach uses global information that captures structural properties of the trace and the process model to make informed alignment decisions, discarding unpromising alignments even if they are optimal for a local subtrace. This improves the overall accuracy of the results. Experimental evaluations demonstrate that the proposed method consistently finds optimal alignments in most cases and highlight its scalability. This is further supported by a theoretical complexity analysis, which shows the reduced growth of the search space compared to other common conformance checking methods. This work provides a valuable contribution towards efficient conformance checking for large-scale process mining applications.
AIJan 30, 2022
AI-Augmented Business Process Management Systems: A Research ManifestoMarlon Dumas, Fabiana Fournier, Lior Limonad et al.
AI-Augmented Business Process Management Systems (ABPMSs) are an emerging class of process-aware information systems, empowered by trustworthy AI technology. An ABPMS enhances the execution of business processes with the aim of making these processes more adaptable, proactive, explainable, and context-sensitive. This manifesto presents a vision for ABPMSs and discusses research challenges that need to be surmounted to realize this vision. To this end, we define the concept of ABPMS, we outline the lifecycle of processes within an ABPMS, we discuss core characteristics of an ABPMS, and we derive a set of challenges to realize systems with these characteristics.
DBSep 15, 2021
PoWareMatch: a Quality-aware Deep Learning Approach to Improve Human Schema MatchingRoee Shraga, Avigdor Gal
Schema matching is a core task of any data integration process. Being investigated in the fields of databases, AI, Semantic Web and data mining for many years, the main challenge remains the ability to generate quality matches among data concepts (e.g., database attributes). In this work, we examine a novel angle on the behavior of humans as matchers, studying match creation as a process. We analyze the dynamics of common evaluation measures (precision, recall, and f-measure), with respect to this angle and highlight the need for unbiased matching to support this analysis. Unbiased matching, a newly defined concept that describes the common assumption that human decisions represent reliable assessments of schemata correspondences, is, however, not an inherent property of human matchers. In what follows, we design PoWareMatch that makes use of a deep learning mechanism to calibrate and filter human matching decisions adhering the quality of a match, which are then combined with algorithmic matching to generate better match results. We provide an empirical evidence, established based on an experiment with more than 200 human matchers over common benchmarks, that PoWareMatch predicts well the benefit of extending the match with an additional correspondence and generates high quality matches. In addition, PoWareMatch outperforms state-of-the-art matching algorithms.
AIJun 7, 2021
Uncertain Process Data with Probabilistic Knowledge: Problem Characterization and ChallengesIzack Cohen, Avigdor Gal
Motivated by the abundance of uncertain event data from multiple sources including physical devices and sensors, this paper presents the task of relating a stochastic process observation to a process model that can be rendered from a dataset. In contrast to previous research that suggested to transform a stochastically known event log into a less informative uncertain log with upper and lower bounds on activity frequencies, we consider the challenge of accommodating the probabilistic knowledge into conformance checking techniques. Based on a taxonomy that captures the spectrum of conformance checking cases under stochastic process observations, we present three types of challenging cases. The first includes conformance checking of a stochastically known log with respect to a given process model. The second case extends the first to classify a stochastically known log into one of several process models. The third case extends the two previous ones into settings in which process models are only stochastically known. The suggested problem captures the increasingly growing number of applications in which sensors provide probabilistic process information.
AIDec 3, 2020
Mapping Patterns for Virtual Knowledge GraphsDiego Calvanese, Avigdor Gal, Davide Lanti et al.
Virtual Knowledge Graphs (VKG) constitute one of the most promising paradigms for integrating and accessing legacy data sources. A critical bottleneck in the integration process involves the definition, validation, and maintenance of mappings that link data sources to a domain ontology. To support the management of mappings throughout their entire lifecycle, we propose a comprehensive catalog of sophisticated mapping patterns that emerge when linking databases to ontologies. To do so, we build on well-established methodologies and patterns studied in data management, data analysis, and conceptual modeling. These are extended and refined through the analysis of concrete VKG benchmarks and real-world use cases, and considering the inherent impedance mismatch between data sources and ontologies. We validate our catalog on the considered VKG scenarios, showing that it covers the vast majority of patterns present therein.
DBDec 2, 2020
Learning to Characterize Matching ExpertsRoee Shraga, Ofra Amir, Avigdor Gal
Matching is a task at the heart of any data integration process, aimed at identifying correspondences among data elements. Matching problems were traditionally solved in a semi-automatic manner, with correspondences being generated by matching algorithms and outcomes subsequently validated by human experts. Human-in-the-loop data integration has been recently challenged by the introduction of big data and recent studies have analyzed obstacles to effective human matching and validation. In this work we characterize human matching experts, those humans whose proposed correspondences can mostly be trusted to be valid. We provide a novel framework for characterizing matching experts that, accompanied with a novel set of features, can be used to identify reliable and valuable human experts. We demonstrate the usefulness of our approach using an extensive empirical evaluation. In particular, we show that our approach can improve matching results by filtering out inexpert matchers.
IRFeb 27, 2019
Query Scheduling in the Presence of Complex User ProfilesHaggai Roitman, Avigdor Gal, Louiqa Raschid
Advances in Web technology enable personalization proxies that assist users in satisfying their complex information monitoring and aggregation needs through the repeated querying of multiple volatile data sources. Such proxies face a scalability challenge when trying to maximize the number of clients served while at the same time fully satisfying clients' complex user profiles. In this work we use an abstraction of complex execution intervals (CEIs) constructed over simple execution intervals (EIs) represents user profiles and use existing offline approximation as a baseline for maximizing completeness of capturing CEIs. We present three heuristic solutions for the online problem of query scheduling to satisfy complex user profiles. The first only considers properties of individual EIs while the other two exploit properties of all EIs in the CEI. We use an extensive set of experiments on real traces and synthetic data to show that heuristics that exploit knowledge of the CEIs dominate across multiple parameter settings.
SEApr 12, 2017
Blockchains for Business Process Management - Challenges and OpportunitiesJan Mendling, Ingo Weber, Wil van der Aalst et al.
Blockchain technology promises a sizable potential for executing inter-organizational business processes without requiring a central party serving as a single point of trust (and failure). This paper analyzes its impact on business process management (BPM). We structure the discussion using two BPM frameworks, namely the six BPM core capabilities and the BPM lifecycle. This paper provides research directions for investigating the application of blockchain technology to BPM.
AIJul 4, 2012
A Model for Reasoning with Uncertain Rules in Event Composition SystemsSegev Wasserkrug, Avigdor Gal, Opher Etzion
In recent years, there has been an increased need for the use of active systems - systems required to act automatically based on events, or changes in the environment. Such systems span many areas, from active databases to applications that drive the core business processes of today's enterprises. However, in many cases, the events to which the system must respond are not generated by monitoring tools, but must be inferred from other events based on complex temporal predicates. In addition, in many applications, such inference is inherently uncertain. In this paper, we introduce a formal framework for knowledge representation and reasoning enabling such event inference. Based on probability theory, we define the representation of the associated uncertainty. In addition, we formally define the probability space, and show how the relevant probabilities can be calculated by dynamically constructing a Bayesian network. To the best of our knowledge, this is the first work that enables taking such uncertainty into account in the context of active systems. herefore, our contribution is twofold: We formally define the representation and semantics of event composition for probabilistic settings, and show how to apply these extensions to the quantification of the occurrence probability of events. These results enable any active system to handle such uncertainty.