Efficient Conformance Checking using Approximate Alignment Computation with Tandem Repeats
This work addresses efficiency challenges in process mining for practitioners dealing with large, complex event logs that exceed one million events, though it is an incremental improvement over existing alignment methods.
The paper tackles the computational expense of optimal alignments in conformance checking for process mining. It proposes an approximate technique that exploits tandem repeats in traces to compress trace length and then recomputes the alignment cost, guaranteeing that the result never under-approximates the optimal cost. In an evaluation with 50 real-life model-log pairs against six state-of-the-art techniques, the approach outperforms the baselines by up to an order of magnitude for traces with repetitions, and any cost over-approximation is negligible.
Conformance checking encompasses a body of process mining techniques which aim to find and describe the differences between a process model capturing the expected process behavior and a corresponding event log recording the observed behavior. Alignments are an established technique to compute the distance between a trace in the event log and the closest execution trace of a corresponding process model. Given a cost function, an alignment is optimal when it contains the fewest mismatches between a log trace and a model trace. Determining optimal alignments, however, is computationally expensive, especially in light of the growing size and complexity of event logs from practice, which can easily exceed one million events with traces of several hundred activities. A common limitation of existing alignment techniques is the inability to exploit repetitions in the log. By exploiting a specific form of sequential pattern in traces, namely tandem repeats, we propose a novel approximate technique that uses pre- and post-processing steps to compress the length of a trace and recomputes the alignment cost while guaranteeing that the cost result never under-approximates the optimal cost. In an extensive empirical evaluation with 50 real-life model-log pairs and against six state-of-the-art alignment techniques, we show that the proposed compression approach systematically outperforms the baselines by up to an order of magnitude in the presence of traces with repetitions, and that the cost over-approximation, when it occurs, is negligible.
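To make the pre-processing idea concrete, the following is a minimal Python sketch of trace compression via tandem repeats. It is an illustration under stated assumptions, not the paper's algorithm: the greedy left-to-right detection, the function names (`find_tandem_repeat`, `compress`, `expand`), and the parameters `keep` (how many copies of a repeated pattern are retained) and `max_len` (longest pattern considered) are all hypothetical choices made for this example. The alignment computation and the cost recomputation in the post-processing step are not shown.

```python
from typing import List, Tuple

Trace = List[str]
# (position in the reduced trace, repeated pattern, original number of copies)
Repeat = Tuple[int, Trace, int]


def find_tandem_repeat(trace: Trace, start: int, max_len: int) -> Tuple[int, int]:
    """Return (pattern_length, copies) of the tandem repeat starting at `start`
    that covers the most events, or (0, 1) if none exists."""
    best = (0, 1)
    remaining = len(trace) - start
    for plen in range(1, min(max_len, remaining // 2) + 1):
        pattern = trace[start:start + plen]
        copies, pos = 1, start + plen
        # Count how many times the pattern repeats back to back.
        while trace[pos:pos + plen] == pattern:
            copies += 1
            pos += plen
        if copies >= 2 and plen * copies > best[0] * best[1]:
            best = (plen, copies)
    return best


def compress(trace: Trace, keep: int = 2, max_len: int = 10) -> Tuple[Trace, List[Repeat]]:
    """Shorten every tandem repeat with more than `keep` copies to `keep` copies.
    `keep` and `max_len` are assumptions of this sketch."""
    reduced: Trace = []
    repeats: List[Repeat] = []
    i = 0
    while i < len(trace):
        plen, copies = find_tandem_repeat(trace, i, max_len)
        if copies > keep:
            pattern = trace[i:i + plen]
            # Remember where the repeat sits so it can be restored later.
            repeats.append((len(reduced), pattern, copies))
            reduced.extend(pattern * keep)
            i += plen * copies
        else:
            reduced.append(trace[i])
            i += 1
    return reduced, repeats


def expand(reduced: Trace, repeats: List[Repeat], keep: int = 2) -> Trace:
    """Restore the original trace from the reduced trace and the recorded repeats."""
    out: Trace = []
    cursor = 0
    for idx, pattern, copies in repeats:
        out.extend(reduced[cursor:idx])
        out.extend(pattern * copies)
        cursor = idx + len(pattern) * keep
    out.extend(reduced[cursor:])
    return out


trace = list("abcbcbcbcd")          # 'bc' occurs four times in a row
reduced, repeats = compress(trace)
print("".join(reduced))             # abcbcd
print(expand(reduced, repeats) == trace)  # True
```

In this sketch the reduced trace is what an alignment technique would process, and the recorded repeats carry the information a post-processing step would need to account for the removed copies when recomputing the cost for the original trace.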