LG Aug 19, 2022
SimLDA: A tool for topic model evaluation
Rebecca M. C. Taylor, Johan A. du Preez
Variational Bayes (VB) applied to latent Dirichlet allocation (LDA) has become the most popular algorithm for aspect modeling. While sufficiently successful in text topic extraction from large corpora, VB is less successful in identifying aspects in the presence of limited data. We present a novel variational message passing algorithm as applied to LDA and compare it with the gold standard VB and collapsed Gibbs sampling. In situations where marginalisation leads to non-conjugate messages, we use ideas from sampling to derive approximate update equations. In cases where conjugacy holds, loopy belief update (LBU) (also known as Lauritzen-Spiegelhalter) is used. Our algorithm, ALBU (approximate LBU), has strong similarities with variational message passing (VMP), which is the message passing variant of VB. To compare the performance of the algorithms in the presence of limited data, we use data sets consisting of tweets and newsgroups. Using coherence measures, we show that ALBU learns latent distributions more accurately than VB does, especially for smaller data sets.
LG Aug 25, 2022
Rail break and derailment prediction using Probabilistic Graphical Modelling
Rebecca M. C. Taylor, Johan A. du Preez
Rail breaks are one of the most common causes of derailments internationally. This is no different for the South African Iron Ore line. Many rail breaks occur as a heavy-haul train passes over a crack, large defect or defective weld. In such cases, it is usually too late for the train to slow down in time to prevent a derailment. Knowing the risk that a rail break will occur when a train passes over a section of rail allows for better implementation of maintenance initiatives and mitigating measures. In this paper, the Ore Line's specific challenges are discussed and the currently available data that can be used to create a rail break risk prediction model is reviewed. The development of a basic rail break risk prediction model for the Ore Line is then presented. Finally, the insight gained from the model is demonstrated by discussing scenarios with differing levels of rail break risk. In future work, we plan to extend this basic model to allow input from live monitoring systems such as the ultrasonic broken rail detection system.
LG Nov 2, 2021
A derivation of variational message passing (VMP) for latent Dirichlet allocation (LDA)
Rebecca M. C. Taylor, Dirko Coetsee, Johan A. du Preez
Latent Dirichlet Allocation (LDA) is a probabilistic model used to uncover latent topics in a corpus of documents. Inference is often performed using variational Bayes (VB) algorithms, which calculate a lower bound to the posterior distribution over the parameters. Deriving the variational update equations for new models requires considerable manual effort; variational message passing (VMP) has emerged as a "black-box" tool to expedite the process of variational inference. However, applying VMP in practice still presents subtle challenges: the existing literature does not contain the steps necessary to implement VMP for the standard smoothed LDA model, nor is available black-box probabilistic graphical modelling software able to perform the word-topic updates necessary to implement LDA. In this paper, we therefore present a detailed derivation of the VMP update equations for LDA. We see this as a first step to enabling other researchers to calculate the VMP updates for similar graphical models.
CV Oct 8, 2021
Automated Feature-Specific Tree Species Identification from Natural Images using Deep Semi-Supervised Learning
Dewald Homan, Johan A. du Preez
Prior work on plant species classification predominantly focuses on building models from isolated plant attributes. Hence, there is a need for tools that can assist in species identification in the natural world. We present a novel and robust two-fold approach capable of identifying trees in a real-world natural setting. Further, we leverage unlabelled data through deep semi-supervised learning and demonstrate superior performance to supervised learning. Our single-GPU implementation for feature recognition uses minimal annotated data and achieves accuracies of 93.96% and 93.11% for leaves and bark, respectively. Further, we extract feature-specific datasets of 50 species by employing this technique. Finally, our semi-supervised species classification method attains 94.04% top-5 accuracy for leaves and 83.04% top-5 accuracy for bark.
LG Oct 1, 2021
ALBU: An approximate Loopy Belief message passing algorithm for LDA to improve performance on small data sets
Rebecca M. C. Taylor, Johan A. du Preez
Variational Bayes (VB) applied to latent Dirichlet allocation (LDA) has become the most popular algorithm for aspect modeling. While sufficiently successful in text topic extraction from large corpora, VB is less successful in identifying aspects in the presence of limited data. We present a novel variational message passing algorithm as applied to LDA and compare it with the gold standard VB and collapsed Gibbs sampling. In situations where marginalisation leads to non-conjugate messages, we use ideas from sampling to derive approximate update equations. In cases where conjugacy holds, loopy belief update (LBU) (also known as Lauritzen-Spiegelhalter) is used. Our algorithm, ALBU (approximate LBU), has strong similarities with variational message passing (VMP), which is the message passing variant of VB. To compare the performance of the algorithms in the presence of limited data, we use data sets consisting of tweets and newsgroups. Additionally, to perform more fine-grained evaluations and comparisons, we use simulations that enable comparisons with the ground truth via Kullback-Leibler divergence (KLD). Using coherence measures for the text corpora and KLD with the simulations, we show that ALBU learns latent distributions more accurately than VB does, especially for smaller data sets.
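As a rough illustration of the simulation-based comparison mentioned in the abstract above, the sketch below computes the Kullback-Leibler divergence between a known "ground truth" discrete distribution and a learned one. It is not the authors' evaluation code: the function name `kl_divergence` and the example distributions are hypothetical, and the direction KL(true || learned) is an assumption made here for illustration.

```python
import math


def kl_divergence(p, q):
    """Return KL(p || q) in nats for two discrete distributions.

    Both inputs are normalised first, so unnormalised counts are accepted.
    Terms with p_i == 0 contribute nothing; if q_i == 0 where p_i > 0,
    the divergence is infinite.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    p_total = sum(p)
    q_total = sum(q)
    divergence = 0.0
    for p_raw, q_raw in zip(p, q):
        p_i = p_raw / p_total
        q_i = q_raw / q_total
        if p_i == 0.0:
            continue  # 0 * log(0 / q) is taken to be 0
        if q_i == 0.0:
            return math.inf  # q assigns no mass where p does
        divergence += p_i * math.log(p_i / q_i)
    return divergence


# Hypothetical example: a simulated topic's word distribution (ground truth)
# versus the distribution recovered by some inference algorithm.
true_topic = [0.5, 0.3, 0.2]
learned_topic = [0.4, 0.4, 0.2]
print(kl_divergence(true_topic, learned_topic))
```

A smaller value indicates that the learned distribution is closer to the ground truth, and the value is zero only when the two distributions are identical.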
ML Feb 4, 2020
Open-set learning with augmented categories by exploiting unlabelled data
Emile R. Engelbrecht, Johan A. du Preez
Novel categories are commonly defined as those unobserved during training but present during testing. However, partially labelled training datasets can contain unlabelled training samples that belong to novel categories, meaning these can be present in both training and testing. This research is the first to generalise between what we call observed-novel and unobserved-novel categories within a new learning policy called open-set learning with augmented categories by exploiting unlabelled data, or Open-LACU. After surveying existing learning policies, we introduce Open-LACU as a unified policy of positive and unlabelled learning, semi-supervised learning and open-set recognition. Subsequently, we develop the first Open-LACU model using an algorithmic training process that draws on the relevant research fields. The proposed Open-LACU classifier achieves state-of-the-art and first-of-its-kind results.