Simon Poon

h-index: 26

8 papers

78 citations

Novelty: 48%

AI Score: 25

Ranked #166,768 of 194,257 authors (top 86%); #1,765 in IR (top 81%)

8 Papers

7.7 | IR | Feb 17, 2022

A Deep Learning Approach for Repairing Missing Activity Labels in Event Logs for Process Mining

Yang Lu, Qifan Chen, Simon K. Poon

Process mining is a relatively new field that bridges traditional process modeling and data mining. Process discovery, one of its most critical parts, aims to discover process models automatically from event logs. The performance of existing process discovery algorithms can degrade when activity labels are missing from event logs. Several methods have been proposed to repair missing activity labels, but their accuracy can drop when a large number of labels are missing. In this paper, we propose an LSTM-based prediction model to predict the missing activity labels in event logs. The proposed model takes both the prefix and suffix sequences of the events with missing activity labels as input. Additional attributes of event logs are also utilized to improve the performance. Our evaluation on several publicly available datasets shows that the proposed method performs consistently better than existing methods at repairing missing activity labels in event logs.
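As a rough illustration of the input described above, the sketch below splits a trace around each missing label into a prefix and a suffix. It covers only this data preparation step, not the LSTM model itself; the function name `split_context`, the use of `None` for a missing label, and the `window` cap are assumptions made for this example and do not come from the paper.

```python
from typing import List, Optional, Tuple


def split_context(
    trace: List[Optional[str]], window: int = 3
) -> List[Tuple[int, List[str], List[str]]]:
    """For each event whose label is None, return (position, prefix, suffix).

    Hypothetical helper: prefix and suffix hold up to `window` known labels
    before and after the gap, in their original order.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    contexts = []
    for i, label in enumerate(trace):
        if label is not None:
            continue
        # Known labels before the gap, keeping the `window` most recent ones.
        prefix = [a for a in trace[:i] if a is not None][-window:]
        # Known labels after the gap, keeping the `window` nearest ones.
        suffix = [a for a in trace[i + 1:] if a is not None][:window]
        contexts.append((i, prefix, suffix))
    return contexts


example = ["register", "check", None, "pay", "archive"]
contexts = split_context(example, window=2)
```

Each (prefix, suffix) pair could then be encoded and passed to a sequence model that predicts the label at the recorded position.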

1.2 | DB | Mar 30, 2021

A Multi-View Framework to Detect Redundant Activity Labels for More Representative Event Logs in Process Mining

Qifan Chen, Yang Lu, Charmaine S. Tam et al.

Process mining aims to gain knowledge of business processes via the discovery of process models from event logs generated by information systems. The insights revealed by process mining rely heavily on the quality of the event logs. Activities extracted from different data sources, or entered as free text within the same system, may carry inconsistent labels. Such inconsistency leads to redundant activity labels, that is, labels that have different syntax but share the same behaviour. Redundant activity labels can introduce unnecessary complexity into the event logs. Identifying these labels through data-driven process discovery is difficult and relies heavily on human intervention. Neither existing process discovery algorithms nor event data preprocessing techniques can resolve such redundancy efficiently. In this paper, we propose a multi-view approach to automatically detect redundant activity labels using not only context-aware features such as control-flow relations and attribute values but also semantic features from the event logs. Our evaluation on several publicly available datasets and a real-life case study demonstrates that our approach can efficiently detect redundant activity labels even when they occur with low frequency. The proposed approach can add value to the preprocessing step to generate more representative event logs.

8.6 | SE | Mar 19, 2021 | Code

A Robust and Accurate Approach to Detect Process Drifts from Event Streams

Yang Lu, Qifan Chen, Simon Poon

Business processes are bound to evolve as a form of adaptation to changes, and such changes are referred to as process drifts. Current process drift detection methods perform well on clean event log data, but their performance can be severely affected by noise. A good process drift detection method should be accurate, fast, and robust to noise. In this paper, we propose an offline process drift detection method which identifies each newly observed behaviour as a candidate drift point and checks whether the new behaviour signifies a significant change to the original process behaviours. In addition, a bidirectional search method is proposed to accurately locate both the addition and the removal of behaviours. The proposed method can accurately detect drift points from event logs and is robust to noise. Both artificial and real-life event logs are used to evaluate our method. Results show that our method can consistently report accurate process drift times while maintaining a reasonably fast detection speed.
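The candidate-then-check idea in this abstract can be sketched as follows. This is a simplified stand-in, not the authors' algorithm: it assumes a "behaviour" is a directly-follows pair of activities, and it treats a new pair as significant when it appears in at least `min_support` of the next `window` traces. The names `candidate_drifts`, `warmup`, `window` and `min_support` are hypothetical, and the bidirectional search is not reproduced.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]


def directly_follows(trace: List[str]) -> Set[Pair]:
    """Return the set of (a, b) pairs where b directly follows a."""
    return set(zip(trace, trace[1:]))


def candidate_drifts(
    log: List[List[str]], warmup: int = 2, window: int = 4, min_support: float = 0.5
) -> List[Tuple[int, Pair]]:
    """Flag trace indices where a new, sufficiently frequent pair first appears."""
    seen: Set[Pair] = set()
    for trace in log[:warmup]:
        seen |= directly_follows(trace)  # baseline behaviour
    drifts = []
    for i in range(warmup, len(log)):
        new_pairs = directly_follows(log[i]) - seen
        following = log[i:i + window]  # never empty, since i < len(log)
        for pair in sorted(new_pairs):
            hits = sum(pair in directly_follows(t) for t in following)
            if hits / len(following) >= min_support:
                drifts.append((i, pair))
                seen.add(pair)  # accepted as part of the process from now on
            # Rejected pairs stay unseen, so isolated noise is re-checked later.
    return drifts


log = (
    [["a", "b", "c"]] * 3
    + [["a", "c", "b"]]          # a single noisy trace
    + [["a", "b", "c"]] * 3
    + [["a", "b", "d", "c"]] * 4  # activity "d" is inserted from here on
)
drifts = candidate_drifts(log)
```

In this toy log, the isolated noisy trace is ignored because its new pairs lack support, while the insertion of "d" is reported at the trace where it first appears.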

3.6 | IR | Mar 19, 2021

Detecting and Understanding Branching Frequency Changes in Process Models

Yang Lu, Qifan Chen, Simon Poon

Business processes continuously evolve in order to adapt to changes caused by various factors. One type of process change is the branching frequency change, which is a change in how often each option is taken at an exclusive choice. Existing methods either cannot detect such changes or cannot provide accurate and comprehensive results. In this paper, we propose a method which takes both event logs and process models as input and generates a choice sequence for each exclusive choice in the process model. The method then identifies change points based on the choice sequences. We evaluate our method on a real-life event log. Results show that our method can identify branching frequency changes in process models and provide comprehensive results to users.
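A minimal sketch of the choice-sequence idea, under stated assumptions: the exclusive choice is represented only by the name of the activity before the split and the set of options after it (standing in for the process model), and the change point is located by comparing option frequencies in two adjacent sliding windows. The functions `choice_sequence` and `strongest_change_point`, and the parameters `window` and `threshold`, are illustrative names and are not taken from the paper.

```python
from typing import List, Optional, Set


def choice_sequence(log: List[List[str]], split: str, options: Set[str]) -> List[str]:
    """Record, per trace, which option directly follows `split` (first occurrence)."""
    seq = []
    for trace in log:
        for cur, nxt in zip(trace, trace[1:]):
            if cur == split and nxt in options:
                seq.append(nxt)
                break
    return seq


def strongest_change_point(
    seq: List[str], option: str, window: int = 4, threshold: float = 0.5
) -> Optional[int]:
    """Return the index with the largest frequency gap between adjacent windows.

    Returns None if no gap reaches `threshold`.
    """
    best_index, best_gap = None, 0.0
    for i in range(window, len(seq) - window + 1):
        before = seq[i - window:i].count(option) / window
        after = seq[i:i + window].count(option) / window
        gap = abs(after - before)
        if gap > best_gap:
            best_index, best_gap = i, gap
    return best_index if best_gap >= threshold else None


log = [["a", "s", "x", "e"]] * 6 + [["a", "s", "y", "e"]] * 6
seq = choice_sequence(log, split="s", options={"x", "y"})
point = strongest_change_point(seq, option="x")
```

Here `point` is the position in the choice sequence at which option "x" stops being chosen; a stable sequence yields `None`.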

2.0 | IR | Mar 19, 2021

Discovering Redundant Activities in Event Logs for the Simplification of Process Models

Qifan Chen, Yang Lu, Simon Poon

Process mining is a valuable tool for analysing the behaviour of an organisation, offering techniques to discover, monitor and enhance real processes. The key to process mining is to discover understandable process models. However, real-life logs can be complex and contain redundant activities, which share similar behaviour but have different syntax. We show that the existence of such redundant activities heavily affects the quality of discovered process models. Existing approaches filter activities by frequency, which cannot solve the problems caused by redundant activities. In this paper, we propose first to discover redundant activities at the log level and then to use the discovery results to simplify event logs. Two publicly available data sets are used to evaluate the usability of our approach in real-life processes. Our approach can be adopted as a preprocessing step before applying any discovery algorithm to produce simplified models.
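The two steps named in this abstract (find redundant activities in the log, then simplify the log) can be illustrated with a small sketch. It is not the authors' method: it assumes that "similar behaviour" can be approximated by the Jaccard similarity of each activity's (predecessor, successor) contexts, and the names `contexts`, `redundant_pairs`, `simplify` and the `threshold` value are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def contexts(log: List[List[str]]) -> Dict[str, Set[Tuple[str, str]]]:
    """Map each activity to the set of (predecessor, successor) pairs around it."""
    ctx: Dict[str, Set[Tuple[str, str]]] = {}
    for trace in log:
        padded = ["<start>"] + trace + ["<end>"]
        for prev, cur, nxt in zip(padded, padded[1:], padded[2:]):
            ctx.setdefault(cur, set()).add((prev, nxt))
    return ctx


def redundant_pairs(log: List[List[str]], threshold: float = 0.5) -> List[Tuple[str, str]]:
    """Return activity pairs whose context sets overlap by at least `threshold`."""
    ctx = contexts(log)
    pairs = []
    for a, b in combinations(sorted(ctx), 2):
        jaccard = len(ctx[a] & ctx[b]) / len(ctx[a] | ctx[b])
        if jaccard >= threshold:
            pairs.append((a, b))
    return pairs


def simplify(log: List[List[str]], pairs: List[Tuple[str, str]]) -> List[List[str]]:
    """Rename the second label of each pair to the first (no transitive merging)."""
    rename = {b: a for a, b in pairs}
    return [[rename.get(x, x) for x in trace] for trace in log]


log = [
    ["register", "check", "pay"],
    ["register", "verify", "pay"],
    ["register", "check", "pay"],
]
pairs = redundant_pairs(log)
simplified = simplify(log, pairs)
```

In this toy log, "check" and "verify" occur in identical contexts, so they are merged and every trace becomes the same sequence.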

3.4 | LG | Jul 3, 2019

AMI-Net+: A Novel Multi-Instance Neural Network for Medical Diagnosis from Incomplete and Imbalanced Data

Zeyuan Wang, Josiah Poon, Simon Poon

In medical real-world studies (RWS), fully utilizing fragmentary and scarce information in model training to generate solid diagnosis results is a challenging task. In this work, we introduce a novel multi-instance neural network, AMI-Net+, to train on and predict from incomplete and extremely imbalanced data. It is more effective than the state-of-the-art method, AMI-Net. First, we also implement embedding, multi-head attention and gated attention-based multi-instance pooling to capture the relations among the symptoms themselves and between the symptoms and the given disease. In addition, we propose various improvements to AMI-Net: the cross-entropy loss is replaced by focal loss, and a novel self-adaptive multi-instance pooling method at the instance level is used to obtain the bag representation. We validate the performance of AMI-Net+ on two real-world datasets from two different medical domains. Results show that our approach outperforms other baseline models by a considerable margin.
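To show why replacing cross-entropy with focal loss helps on imbalanced data, here is a minimal sketch of the standard binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The defaults `gamma=2.0` and `alpha=0.25` are common choices from the focal loss literature, not values reported in this abstract, and the function is a plain-Python illustration rather than the training code of AMI-Net+.

```python
import math


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss for one example.

    p is the predicted probability of the positive class, y is 0 or 1.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, 1e-12), 1.0)             # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)  # well-classified positive
hard = focal_loss(0.1, 1)  # badly misclassified positive
```

With `gamma=0` and a constant `alpha`, the expression reduces to weighted cross-entropy; larger `gamma` shrinks the contribution of well-classified examples so that hard, often minority-class, examples dominate the loss.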

3.4 | LG | Apr 9, 2019

Attention-based Multi-instance Neural Network for Medical Diagnosis from Incomplete and Low Quality Data

Zeyuan Wang, Josiah Poon, Shiding Sun et al.

One way to extract patterns from clinical records is to consider each patient record as a bag with a varying number of instances in the form of symptoms. Medical diagnosis then consists of discovering the informative instances first and mapping them to one or more diseases. In many cases, patients are represented as vectors in some feature space and a classifier is then applied to generate diagnosis results. However, in many real-world cases, data is of low quality due to problems with data consistency, integrity, completeness, accuracy, etc. In this paper, we propose a novel approach, the attention-based multi-instance neural network (AMI-Net), to perform single-disease classification based only on the existing and valid information in real-world outpatient records. For each patient, it takes a bag of instances as input and outputs the bag label directly in an end-to-end way. An embedding layer is adopted at the beginning, mapping instances into an embedding space which represents the individual patient condition. The correlations among instances and their importance for the final classification are captured by a multi-head attention transformer, instance-level multi-instance pooling and bag-level multi-instance pooling. The proposed approach was tested on two non-standardized and highly imbalanced datasets, one in the Traditional Chinese Medicine (TCM) domain and the other in the Western Medicine (WM) domain. Our preliminary results show that the proposed approach outperforms all baseline results by a significant margin.
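The bag-of-instances view can be made concrete with a small attention pooling sketch. This is a generic, simplified form of attention-based multi-instance pooling and not the AMI-Net architecture: it assumes each instance is already an embedding vector and scores it by a dot product with a single query vector, which in a real model would be learned. The names `attention_pool` and `query` are hypothetical.

```python
import math
from typing import List, Tuple


def attention_pool(
    instances: List[List[float]], query: List[float]
) -> Tuple[List[float], List[float]]:
    """Pool a bag of instance embeddings into one vector.

    Returns (bag_vector, attention_weights). Weights are a softmax over
    the dot products between each instance and `query`.
    """
    scores = [sum(q * x for q, x in zip(query, inst)) for inst in instances]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(instances[0])
    bag = [sum(w * inst[d] for w, inst in zip(weights, instances)) for d in range(dim)]
    return bag, weights


bag, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], query=[10.0, 0.0])
```

Because the weights sum to one for any bag size, records with different numbers of symptoms map to vectors of the same dimension, and the weights indicate which instances drove the result.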

2.2 | LG | Dec 19, 2018

CNN based Multi-Instance Multi-Task Learning for Syndrome Differentiation of Diabetic Patients

Zeyuan Wang, Josiah Poon, Shiding Sun et al.

Syndrome differentiation in Traditional Chinese Medicine (TCM) is the process of understanding and reasoning about body condition, and it is the essential step and premise of effective treatment. However, due to its complexity and lack of standardization, it is challenging to achieve. In this study, we consider each patient's record as a one-dimensional image and symptoms as pixels, in which missing and negative values are represented by zero pixels. The objective is to find relevant symptoms first and then map them to the proper syndromes, which is similar to the object detection problem in computer vision. Inspired by this, we employ multi-instance multi-task learning combined with a convolutional neural network (MIMT-CNN) for syndrome differentiation, which takes region proposals as input and outputs image labels directly. The neural network consists of region proposal generation, a convolutional layer, a fully connected layer, and a max pooling (multi-instance pooling) layer followed by the sigmoid function in each syndrome prediction task, for image representation learning and final result generation. On the diabetes dataset, it performs better than all other baseline methods. Moreover, it generates stable and reliable results even on a dataset with a small sample size, a large number of missing values and noise.
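The final stage described above, max pooling over instances followed by a sigmoid per task, can be sketched in a few lines. The sketch assumes the earlier layers have already produced one raw score per region proposal and per syndrome task; the function name `bag_probabilities` and the score layout are assumptions for illustration, and the convolutional layers are not reproduced.

```python
import math
from typing import List


def bag_probabilities(instance_scores: List[List[float]]) -> List[float]:
    """Turn per-instance, per-task scores into one probability per task.

    instance_scores[i][t] is the raw score of region proposal i for task t.
    """
    n_tasks = len(instance_scores[0])
    probs = []
    for t in range(n_tasks):
        # Multi-instance max pooling: the strongest proposal decides the bag score.
        bag_score = max(row[t] for row in instance_scores)
        # Independent sigmoid per task, so several syndromes can be positive.
        probs.append(1.0 / (1.0 + math.exp(-bag_score)))
    return probs


scores = [[0.0, -2.0], [3.0, -1.0]]  # two proposals, two syndrome tasks
probs = bag_probabilities(scores)
```

Max pooling reflects the multi-instance assumption that a record is positive for a syndrome if at least one group of symptoms supports it.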