Toby Dylan Hocking

h-index20

23papers

226citations

Novelty47%

AI Score45

Ranked #41,303 of 194,257 authors (top 21%)#9,604 in LG (top 24%)

23 Papers

5.0LGApr 4

Understanding When Poisson Log-Normal Models Outperform Penalized Poisson Regression for Microbiome Count Data

Daniel Agyapong, Julien Chiquet, Jane Marks et al.

Multivariate count models are often justified by their ability to capture latent dependence, but researchers receive little guidance on when this added structure improves on simpler penalized marginal Poisson regression. We study this question using real microbiome data under a unified held-out evaluation framework. For count prediction, we compare PLN and GLMNet(Poisson) on 20 datasets spanning 32 to 18,270 samples and 24 to 257 taxa, using held-out Poisson deviance under leave-one-taxon-out prediction with 3-fold sample cross-validation rather than synthetic or in-sample criteria. For network inference, we compare PLNNetwork and GLMNet(Poisson) neighborhood selection on five publicly available datasets with experimentally validated microbial interaction truth. PLN outperforms GLMNet(Poisson) on most count-prediction datasets, with gains up to 38 percent. The primary predictor of the winner is the sample-to-taxon ratio, with mean absolute correlation as the strongest secondary signal and overdispersion as an additional predictor. PLNNetwork performs best on broad undirected interaction benchmarks, whereas GLMNet(Poisson) is better aligned with local or directional effects. Taken together, these results provide guidance for choosing between latent multivariate count models and penalized Poisson regression in biological count prediction and interaction recovery.

5.3LGSep 26, 2023

Cross-Validation for Training and Testing Co-occurrence Network Inference Algorithms

Daniel Agyapong, Jeffrey Ryan Propster, Jane Marks et al.

Microorganisms are found in almost every environment, including the soil, water, air, and inside other organisms, like animals and plants. While some microorganisms cause diseases, most of them help in biological processes such as decomposition, fermentation and nutrient cycling. A lot of research has gone into studying microbial communities in various environments and how their interactions and relationships can provide insights into various diseases. Co-occurrence network inference algorithms help us understand the complex associations of micro-organisms, especially bacteria. Existing network inference algorithms employ techniques such as correlation, regularized linear regression, and conditional dependence, which have different hyper-parameters that determine the sparsity of the network. Previous methods for evaluating the quality of the inferred network include using external data, and network consistency across sub-samples, both which have several drawbacks that limit their applicability in real microbiome composition data sets. We propose a novel cross-validation method to evaluate co-occurrence network inference algorithms, and new methods for applying existing algorithms to predict on test data. Our empirical study shows that the proposed method is useful for hyper-parameter selection (training) and comparing the quality of the inferred networks between different algorithms (testing).

3.8LGFeb 21, 2023

A Log-linear Gradient Descent Algorithm for Unbalanced Binary Classification using the All Pairs Squared Hinge Loss

Kyle R. Rust, Toby D. Hocking

Receiver Operating Characteristic (ROC) curves are plots of true positive rate versus false positive rate which are used to evaluate binary classification algorithms. Because the Area Under the Curve (AUC) is a constant function of the predicted values, learning algorithms instead optimize convex relaxations which involve a sum over all pairs of labeled positive and negative examples. Naive learning algorithms compute the gradient in quadratic time, which is too slow for learning using large batch sizes. We propose a new functional representation of the square loss and squared hinge loss, which results in algorithms that compute the gradient in either linear or log-linear time, and makes it possible to use gradient descent learning with large batch sizes. In our empirical study of supervised binary classification problems, we show that our new algorithm can achieve higher test AUC values on imbalanced data sets than previous algorithms, and make use of larger batch sizes than were previously feasible.

1.8LGOct 5, 2022

Functional Labeled Optimal Partitioning

Toby D. Hocking, Jacob M. Kaufman, Alyssa J. Stenberg

Peak detection is a problem in sequential data analysis that involves differentiating regions with higher counts (peaks) from regions with lower counts (background noise). It is crucial to correctly predict areas that deviate from the background noise, in both the train and test sets of labels. Dynamic programming changepoint algorithms have been proposed to solve the peak detection problem by constraining the mean to alternatively increase and then decrease. The current constrained changepoint algorithms only create predictions on the test set, while completely ignoring the train set. Changepoint algorithms that are both accurate when fitting the train set, and make predictions on the test set, have been proposed but not in the context of peak detection models. We propose to resolve these issues by creating a new dynamic programming algorithm, FLOPART, that has zero train label errors, and is able to provide highly accurate predictions on the test set. We provide an empirical analysis that shows FLOPART has a similar time complexity while being more accurate than the existing algorithms in terms of train and test label errors.

14.0LGJun 8, 2020Code

Survival regression with accelerated failure time model in XGBoost

Avinash Barnwal, Hyunsu Cho, Toby Dylan Hocking

Survival regression is used to estimate the relation between time-to-event and feature variables, and is important in application domains such as medicine, marketing, risk management and sales management. Nonlinear tree based machine learning algorithms as implemented in libraries such as XGBoost, scikit-learn, LightGBM, and CatBoost are often more accurate in practice than linear models. However, existing state-of-the-art implementations of tree-based models have offered limited support for survival regression. In this work, we implement loss functions for learning accelerated failure time (AFT) models in XGBoost, to increase the support for survival modeling for different kinds of label censoring. We demonstrate with real and simulated experiments the effectiveness of AFT in XGBoost with respect to a number of baselines, in two respects: generalization performance and training speed. Furthermore, we take advantage of the support for NVIDIA GPUs in XGBoost to achieve substantial speedup over multi-core CPUs. To our knowledge, our work is the first implementation of AFT that utilizes the processing power of NVIDIA GPUs. Starting from the 1.2.0 release, the XGBoost package natively supports the AFT model. The addition of AFT in XGBoost has had significant impact in the open source community, and a few statistics packages now utilize the XGBoost AFT model.

5.9COMar 9, 2017Code

A log-linear time algorithm for constrained changepoint detection

Toby Dylan Hocking, Guillem Rigaill, Paul Fearnhead et al.

Changepoint detection is a central problem in time series and genomic data. For some applications, it is natural to impose constraints on the directions of changes. One example is ChIP-seq data, for which adding an up-down constraint improves peak detection accuracy, but makes the optimization problem more complicated. We show how a recently proposed functional pruning technique can be adapted to solve such constrained changepoint detection problems. This leads to a new algorithm which can solve problems with arbitrary affine constraints on adjacent segment means, and which has empirical time complexity that is log-linear in the amount of data. This algorithm achieves state-of-the-art accuracy in a benchmark of several genomic data sets, and is orders of magnitude faster than existing algorithms that have similar accuracy. Our implementation is available as the PeakSegPDPA function in the coseg R package, https://github.com/tdhock/coseg

9.2MLAug 1, 2024Code

Penalty Learning for Optimal Partitioning using Multilayer Perceptron

Tung L Nguyen, Toby Dylan Hocking

Changepoint detection is a technique used to identify significant shifts in sequences and is widely used in fields such as finance, genomics, and medicine. To identify the changepoints, dynamic programming (DP) algorithms, particularly Optimal Partitioning (OP) family, are widely used. To control the changepoints count, these algorithms use a fixed penalty to penalize the changepoints presence. To predict the optimal value of that penalty, existing methods used simple models such as linear or tree-based, which may limit predictive performance. To address this issue, this study proposes using a multilayer perceptron (MLP) with a ReLU activation function to predict the penalty. The proposed model generates continuous predictions -- as opposed to the stepwise ones in tree-based models -- and handles non-linearity better than linear models. Experiments on large benchmark genomic datasets demonstrate that the proposed model improves accuracy and F1 score compared to existing models.

7.1LGMar 3, 2025

Interval Regression: A Comparative Study with Proposed Models

Tung L Nguyen, Toby Dylan Hocking

Regression models are essential for a wide range of real-world applications. However, in practice, target values are not always precisely known; instead, they may be represented as intervals of acceptable values. This challenge has led to the development of Interval Regression models. In this study, we provide a comprehensive review of existing Interval Regression models and introduce alternative models for comparative analysis. Experiments are conducted on both real-world and synthetic datasets to offer a broad perspective on model performance. The results demonstrate that no single model is universally optimal, highlighting the importance of selecting the most suitable model for each specific scenario.

4.1LGMay 12, 2025

Learning Penalty for Optimal Partitioning via Automatic Feature Extraction

Tung L Nguyen, Toby Hocking

Changepoint detection identifies significant shifts in data sequences, making it important in areas like finance, genetics, and healthcare. The Optimal Partitioning algorithms efficiently detect these changes, using a penalty parameter to limit the changepoints count. Determining the optimal value for this penalty can be challenging. Traditionally, this process involved manually extracting statistical features, such as sequence length or variance to make the prediction. This study proposes a novel approach that uses recurrent networks to learn this penalty directly from raw sequences by automatically extracting features. Experiments conducted on 20 benchmark genomic datasets show that this novel method generally outperforms traditional ones in changepoint detection accuracy.

4.6LGOct 11, 2024Code

Finite Sample Complexity Analysis of Binary Segmentation

Toby Dylan Hocking

Binary segmentation is the classic greedy algorithm which recursively splits a sequential data set by optimizing some loss or likelihood function. Binary segmentation is widely used for changepoint detection in data sets measured over space or time, and as a sub-routine for decision tree learning. In theory it should be extremely fast for $N$ data and $K$ splits, $O(N K)$ in the worst case, and $O(N \log K)$ in the best case. In this paper we describe new methods for analyzing the time and space complexity of binary segmentation for a given finite $N$, $K$, and minimum segment length parameter. First, we describe algorithms that can be used to compute the best and worst case number of splits the algorithm must consider. Second, we describe synthetic data that achieve the best and worst case and which can be used to test for correct implementation of the algorithm. Finally, we provide an empirical analysis of real data which suggests that binary segmentation is often close to optimal speed in practice.

3.1MLOct 11, 2024Code

SOAK: Same/Other/All K-fold cross-validation for estimating similarity of patterns in data subsets

Toby Dylan Hocking, Gabrielle Thibault, Cameron Scott Bodine et al.

In many real-world applications of machine learning, we are interested to know if it is possible to train on the data that we have gathered so far, and obtain accurate predictions on a new test data subset that is qualitatively different in some respect (time period, geographic region, etc). Another question is whether data subsets are similar enough so that it is beneficial to combine subsets during model training. We propose SOAK, Same/Other/All K-fold cross-validation, a new method which can be used to answer both questions. SOAK systematically compares models which are trained on different subsets of data, and then used for prediction on a fixed test subset, to estimate the similarity of learnable/predictable patterns in data subsets. We show results of using SOAK on six new real data sets (with geographic/temporal subsets, to check if predictions are accurate on new subsets), 3 image pair data sets (subsets are different image types, to check that we get smaller prediction error on similar images), and 11 benchmark data sets with predefined train/test splits (to check similarity of predefined splits).

2.6LGOct 11, 2024Code

Efficient line search for optimizing Area Under the ROC Curve in gradient descent

Jadon Fowler, Toby Dylan Hocking

Receiver Operating Characteristic (ROC) curves are useful for evaluation in binary classification and changepoint detection, but difficult to use for learning since the Area Under the Curve (AUC) is piecewise constant (gradient zero almost everywhere). Recently the Area Under Min (AUM) of false positive and false negative rates has been proposed as a differentiable surrogate for AUC. In this paper we study the piecewise linear/constant nature of the AUM/AUC, and propose new efficient path-following algorithms for choosing the learning rate which is optimal for each step of gradient descent (line search), when optimizing a linear model. Remarkably, our proposed line search algorithm has the same log-linear asymptotic time complexity as gradient descent with constant step size, but it computes a complete representation of the AUM/AUC as a function of step size. In our empirical study of binary classification problems, we verify that our proposed algorithm is fast and exact; in changepoint detection problems we show that the proposed algorithm is just as accurate as grid search, but faster.

1.9MLJul 2, 2021Code

Optimizing ROC Curves with a Sort-Based Surrogate Loss Function for Binary Classification and Changepoint Detection

Jonathan Hillman, Toby Dylan Hocking

Receiver Operating Characteristic (ROC) curves are plots of true positive rate versus false positive rate which are useful for evaluating binary classification models, but difficult to use for learning since the Area Under the Curve (AUC) is non-convex. ROC curves can also be used in other problems that have false positive and true positive rates such as changepoint detection. We show that in this more general context, the ROC curve can have loops, points with highly sub-optimal error rates, and AUC greater than one. This observation motivates a new optimization objective: rather than maximizing the AUC, we would like a monotonic ROC curve with AUC=1 that avoids points with large values for Min(FP,FN). We propose a convex relaxation of this objective that results in a new surrogate loss function called the AUM, short for Area Under Min(FP, FN). Whereas previous loss functions are based on summing over all labeled examples or pairs, the AUM requires a sort and a sum over the sequence of points on the ROC curve. We show that AUM directional derivatives can be efficiently computed and used in a gradient descent learning algorithm. In our empirical study of supervised binary classification and changepoint detection problems, we show that our new AUM minimization learning algorithm results in improved AUC and comparable speed relative to previous baselines.

1.2SPFeb 6, 2021

A Greedy Graph Search Algorithm Based on Changepoint Analysis for Automatic QRS Complex Detection

Atiyeh Fotoohinasab, Toby Hocking, Fatemeh Afghah

The electrocardiogram (ECG) signal is the most widely used non-invasive tool for the investigation of cardiovascular diseases. Automatic delineation of ECG fiducial points, in particular the R-peak, serves as the basis for ECG processing and analysis. This study proposes a new method of ECG signal analysis by introducing a new class of graphical models based on optimal changepoint detection models, named the graph-constrained changepoint detection (GCCD) model. The GCCD model treats fiducial points delineation in the non-stationary ECG signal as a changepoint detection problem. The proposed model exploits the sparsity of changepoints to detect abrupt changes within the ECG signal; thereby, the R-peak detection task can be relaxed from any preprocessing step. In this novel approach, prior biological knowledge about the expected sequence of changes is incorporated into the model using the constraint graph, which can be defined manually or automatically. First, we define the constraint graph manually; then, we present a graph learning algorithm that can search for an optimal graph in a greedy scheme. Finally, we compare the manually defined graphs and learned graphs in terms of graph structure and detection accuracy. We evaluate the performance of the algorithm using the MIT-BIH Arrhythmia Database. The proposed model achieves an overall sensitivity of 99.64%, positive predictivity of 99.71%, and detection error rate of 0.19 for the manually defined constraint graph and overall sensitivity of 99.76%, positive predictivity of 99.68%, and detection error rate of 0.55 for the automatic learning constraint graph.

2.3SPFeb 2, 2021

A Graph-Constrained Changepoint Learning Approach for Automatic QRS-Complex Detection

Atiyeh Fotoohinasab, Toby Hocking, Fatemeh Afghah

This study presents a new viewpoint on ECG signal analysis by applying a graph-based changepoint detection model to locate R-peak positions. This model is based on a new graph learning algorithm to learn the constraint graph given the labeled ECG data. The proposed learning algorithm starts with a simple initial graph and iteratively edits the graph so that the final graph has the maximum accuracy in R-peak detection. We evaluate the performance of the algorithm on the MIT-BIH Arrhythmia Database. The evaluation results demonstrate that the proposed method can obtain comparable results to other state-of-the-art approaches. The proposed method achieves the overall sensitivity of Sen = 99.64%, positive predictivity of PPR = 99.71%, and detection error rate of DER = 0.19.

8.6HCJan 26, 2021

Chatbots language design: the influence of language variation on user experience

Ana Paula Chaves, Jesse Egbert, Toby Hocking et al.

Chatbots are often designed to mimic social roles attributed to humans. However, little is known about the impact on user's perceptions of using language that fails to conform to the associated social role. Our research draws on sociolinguistic theory to investigate how a chatbot's language choices can adhere to the expected social role the agent performs within a given context. In doing so, we seek to understand whether chatbots design should account for linguistic register. This research analyzes how register differences play a role in shaping the user's perception of the human-chatbot interaction. Ultimately, we want to determine whether register-specific language influences users' perceptions and experiences with chatbots. We produced parallel corpora of conversations in the tourism domain with similar content and varying register characteristics and evaluated users' preferences of chatbot's linguistic choices in terms of appropriateness, credibility, and user experience. Our results show that register characteristics are strong predictors of user's preferences, which points to the needs of designing chatbots with register-appropriate language to improve acceptance and users' perceptions of chatbot interactions.

4.3QMDec 12, 2020Code

Increased peak detection accuracy in over-dispersed ChIP-seq data with supervised segmentation models

Arnaud Liehrmann, Guillem Rigaill, Toby Dylan Hocking

Motivation: Histone modification constitutes a basic mechanism for the genetic regulation of gene expression. In early 2000s, a powerful technique has emerged that couples chromatin immunoprecipitation with high-throughput sequencing (ChIP-seq). This technique provides a direct survey of the DNA regions associated to these modifications. In order to realize the full potential of this technique, increasingly sophisticated statistical algorithms have been developed or adapted to analyze the massive amount of data it generates. Many of these algorithms were built around natural assumptions such as the Poisson one to model the noise in the count data. In this work we start from these natural assumptions and show that it is possible to improve upon them. Results: The results of our comparisons on seven reference datasets of histone modifications (H3K36me3 and H3K4me3) suggest that natural assumptions are not always realistic under application conditions. We show that the unconstrained multiple changepoint detection model, with alternative noise assumptions and a suitable setup, reduces the over-dispersion exhibited by count data and turns out to detect peaks more accurately than algorithms which rely on these natural assumptions.

2.7MLJun 24, 2020Code

Labeled Optimal Partitioning

Toby Dylan Hocking, Anuraag Srivastava

In data sequences measured over space or time, an important problem is accurate detection of abrupt changes. In partially labeled data, it is important to correctly predict presence/absence of changes in positive/negative labeled regions, in both the train and test sets. One existing dynamic programming algorithm is designed for prediction in unlabeled test regions (and ignores the labels in the train set); another is for accurate fitting of train labels (but does not predict changepoints in unlabeled test regions). We resolve these issues by proposing a new optimal changepoint detection model that is guaranteed to fit the labels in the train data, and can also provide predictions of unlabeled changepoints in test data. We propose a new dynamic programming algorithm, Labeled Optimal Partitioning (LOPART), and we provide a formal proof that it solves the resulting non-convex optimization problem. We provide theoretical and empirical analysis of the time complexity of our algorithm, in terms of the number of labels and the size of the data sequence to segment. Finally, we provide empirical evidence that our algorithm is more accurate than the existing baselines, in terms of train and test label error.

5.1SPApr 24, 2020

A Graph-constrained Changepoint Detection Approach for ECG Segmentation

Atiyeh Fotoohinasab, Toby Hocking, Fatemeh Afghah

Electrocardiogram (ECG) signal is the most commonly used non-invasive tool in the assessment of cardiovascular diseases. Segmentation of the ECG signal to locate its constitutive waves, in particular the R-peaks, is a key step in ECG processing and analysis. Over the years, several segmentation and QRS complex detection algorithms have been proposed with different features; however, their performance highly depends on applying preprocessing steps which makes them unreliable in real-time data analysis of ambulatory care settings and remote monitoring systems, where the collected data is highly noisy. Moreover, some issues still remain with the current algorithms in regard to the diverse morphological categories for the ECG signal and their high computation cost. In this paper, we introduce a novel graph-based optimal changepoint detection (GCCD) method for reliable detection of R-peak positions without employing any preprocessing step. The proposed model guarantees to compute the globally optimal changepoint detection solution. It is also generic in nature and can be applied to other time-series biomedical signals. Based on the MIT-BIH arrhythmia (MIT-BIH-AR) database, the proposed method achieves overall sensitivity Sen = 99.76, positive predictivity PPR = 99.68, and detection error rate DER = 0.55 which are comparable to other state-of-the-art approaches.

2.3LGMar 5, 2020

Linear time dynamic programming for the exact path of optimal models selected from a finite set

Toby Hocking, Joseph Vargovich

Many learning algorithms are formulated in terms of finding model parameters which minimize a data-fitting loss function plus a regularizer. When the regularizer involves the l0 pseudo-norm, the resulting regularization path consists of a finite set of models. The fastest existing algorithm for computing the breakpoints in the regularization path is quadratic in the number of models, so it scales poorly to high dimensional problems. We provide new formal proofs that a dynamic programming algorithm can be used to compute the breakpoints in linear time. Empirical results on changepoint detection problems demonstrate the improved accuracy and speed relative to grid search and the previous quadratic time algorithm.

6.6MLOct 11, 2017Code

Maximum Margin Interval Trees

Alexandre Drouin, Toby Dylan Hocking, François Laviolette

Learning a regression function using censored or interval-valued output data is an important problem in fields such as genomics and medicine. The goal is to learn a real-valued prediction function, and the training output labels indicate an interval of possible values. Whereas most existing algorithms for this task are linear models, in this paper we investigate learning nonlinear tree models. We propose to learn a tree by minimizing a margin-based discriminative objective function, and we provide a dynamic programming algorithm for computing the optimal solution in log-linear time. We show empirically that this algorithm achieves state-of-the-art speed and prediction accuracy in a benchmark of several data sets.

1.5MLJun 3, 2015Code

PeakSegJoint: fast supervised peak detection via joint segmentation of multiple count data samples

Toby Dylan Hocking, Guillaume Bourque

Joint peak detection is a central problem when comparing samples in genomic data analysis, but current algorithms for this task are unsupervised and limited to at most 2 sample types. We propose PeakSegJoint, a new constrained maximum likelihood segmentation model for any number of sample types. To select the number of peaks in the segmentation, we propose a supervised penalty learning model. To infer the parameters of these two models, we propose to use a discrete optimization heuristic for the segmentation, and convex optimization for the penalty learning. In comparisons with state-of-the-art peak detection algorithms, PeakSegJoint achieves similar accuracy, faster speeds, and a more interpretable model with overlapping peaks that occur in exactly the same positions across all samples.

1.4MLJan 30, 2014Code

Support vector comparison machines

David Venuto, Toby Dylan Hocking, Lakjaree Sphanurattana et al.

In ranking problems, the goal is to learn a ranking function from labeled pairs of input points. In this paper, we consider the related comparison problem, where the label indicates which element of the pair is better, or if there is no significant difference. We cast the learning problem as a margin maximization, and show that it can be solved by converting it to a standard SVM. We use simulated nonlinear patterns, a real learning to rank sushi data set, and a chess data set to show that our proposed SVMcompare algorithm outperforms SVMrank when there are equality pairs.