Logistic regression models for aggregated data
This work addresses computational efficiency for logistic regression in large-scale applications, such as supersymmetry and crop classification, but is incremental as it builds on existing symbolic data analysis and composite likelihood techniques.
The authors tackled the computational burden of logistic regression on large datasets by summarizing predictor variables into histograms and using composite likelihoods for inference, achieving comparable classification rates to standard methods and state-of-the-art subsampling algorithms at a substantially lower computational cost.
Logistic regression models are a popular and effective method to predict the probability of categorical response data. However inference for these models can become computationally prohibitive for large datasets. Here we adapt ideas from symbolic data analysis to summarise the collection of predictor variables into histogram form, and perform inference on this summary dataset. We develop ideas based on composite likelihoods to derive an efficient one-versus-rest approximate composite likelihood model for histogram-based random variables, constructed from low-dimensional marginal histograms obtained from the full histogram. We demonstrate that this procedure can achieve comparable classification rates compared to the standard full data multinomial analysis and against state-of-the-art subsampling algorithms for logistic regression, but at a substantially lower computational cost. Performance is explored through simulated examples, and analyses of large supersymmetry and satellite crop classification datasets.