Thalaiyasingam Ajanthan

CV
h-index26
37papers
4,148citations
Novelty57%
AI Score61

37 Papers

LGDec 22, 2022
Understanding and Improving the Role of Projection Head in Self-Supervised Learning

Kartik Gupta, Thalaiyasingam Ajanthan, Anton van den Hengel et al.

Self-supervised learning (SSL) aims to produce useful feature representations without access to any human-labeled data annotations. Due to the success of recent SSL methods based on contrastive learning, such as SimCLR, this problem has gained popularity. Most current contrastive learning approaches append a parametrized projection head to the end of some backbone network to optimize the InfoNCE objective and then discard the learned projection head after training. This raises a fundamental question: Why is a learnable projection head required if we are to discard it after training? In this work, we first perform a systematic study on the behavior of SSL training focusing on the role of the projection head layers. By formulating the projection head as a parametric component for the InfoNCE objective rather than a part of the network, we present an alternative optimization scheme for training contrastive learning based SSL frameworks. Our experimental study on multiple image classification datasets demonstrates the effectiveness of the proposed approach over alternatives in the SSL literature.

LGMar 4
NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training

Hadi Mohaghegh Dolatabadi, Thalaiyasingam Ajanthan, Sameera Ramasinghe et al. · amazon-science

The rapid progress of large language models (LLMs) is increasingly constrained by memory and deployment costs, motivating compression methods for practical deployment. Many state-of-the-art compression pipelines leverage the low-rank structure of trained weight matrices, a phenomenon often associated with the properties of popular optimizers such as Adam. In this context, Muon is a recently proposed optimizer that improves LLM pretraining via full-rank update steps, but its induced weight-space structure has not been characterized yet. In this work, we report a surprising empirical finding: despite imposing full-rank updates, Muon-trained models exhibit pronounced low-rank structure in their weight matrices and are readily compressible under standard pipelines. Motivated by this insight, we propose NuMuon, which augments Muon with a nuclear-norm constraint on the update direction, further constraining the learned weights toward low-rank structure. Across billion-parameter-scale models, we show that NuMuon increases weight compressibility and improves post-compression model quality under state-of-the-art LLM compression pipelines while retaining Muon's favorable convergence behavior.

CVMar 30, 2023
Adaptive Cross Batch Normalization for Metric Learning

Thalaiyasingam Ajanthan, Matt Ma, Anton van den Hengel et al.

Metric learning is a fundamental problem in computer vision whereby a model is trained to learn a semantically useful embedding space via ranking losses. Traditionally, the effectiveness of a ranking loss depends on the minibatch size, and is, therefore, inherently limited by the memory constraints of the underlying hardware. While simply accumulating the embeddings across minibatches has proved useful (Wang et al. [2020]), we show that it is equally important to ensure that the accumulated embeddings are up to date. In particular, it is necessary to circumvent the representational drift between the accumulated embeddings and the feature embeddings at the current training iteration as the learnable parameters are being updated. In this paper, we model representational drift as distribution misalignment and tackle it using moment matching. The result is a simple method for updating the stored embeddings to match the first and second moments of the current embeddings at each training iteration. Experiments on three popular image retrieval datasets, namely, SOP, In-Shop, and DeepFashion2, demonstrate that our approach significantly improves the performance in all scenarios.

LGJul 25, 2024
Self-Supervision Improves Diffusion Models for Tabular Data Imputation

Yixin Liu, Thalaiyasingam Ajanthan, Hisham Husain et al.

The ubiquity of missing data has sparked considerable attention and focus on tabular data imputation methods. Diffusion models, recognized as the cutting-edge technique for data generation, demonstrate significant potential in tabular data imputation tasks. However, in pursuit of diversity, vanilla diffusion models often exhibit sensitivity to initialized noises, which hinders the models from generating stable and accurate imputation results. Additionally, the sparsity inherent in tabular data poses challenges for diffusion models in accurately modeling the data manifold, impacting the robustness of these models for data imputation. To tackle these challenges, this paper introduces an advanced diffusion model named Self-supervised imputation Diffusion Model (SimpDM for brevity), specifically tailored for tabular data imputation tasks. To mitigate sensitivity to noise, we introduce a self-supervised alignment mechanism that aims to regularize the model, ensuring consistent and stable imputation predictions. Furthermore, we introduce a carefully devised state-dependent data augmentation strategy within SimpDM, enhancing the robustness of the diffusion model when dealing with limited data. Extensive experiments demonstrate that SimpDM matches or outperforms state-of-the-art imputation methods across various scenarios.

LGJan 30
AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism

Thalaiyasingam Ajanthan, Sameera Ramasinghe, Gil Avraham et al. · amazon-science

Data and pipeline parallelism are key strategies for scaling neural network training across distributed devices, but their high communication cost necessitates co-located computing clusters with fast interconnects, limiting their scalability. We address this communication bottleneck by introducing asynchronous updates across both parallelism axes, relaxing the co-location requirement at the expense of introducing staleness between pipeline stages and data parallel replicas. To mitigate staleness, for pipeline parallelism, we adopt a weight look-ahead approach, and for data parallelism, we introduce an asynchronous sparse averaging method equipped with an exponential moving average based correction mechanism. We provide convergence guarantees for both sparse averaging and asynchronous updates. Experiments on large-scale language models (up to \em 1B parameters) demonstrate that our approach matches the performance of the fully synchronous baseline, while significantly reducing communication overhead.

DCMar 3
SENTINEL: Stagewise Integrity Verification for Pipeline Parallel Decentralized Training

Hadi Mohaghegh Dolatabadi, Thalaiyasingam Ajanthan, Sameera Ramasinghe et al. · amazon-science

Decentralized training introduces critical security risks when executed across untrusted, geographically distributed nodes. While existing Byzantine-tolerant literature addresses data parallel (DP) training through robust aggregation methods, pipeline parallelism (PP) presents fundamentally distinct challenges. In PP, model layers are distributed across workers where the activations and their gradients flow between stages rather than being aggregated, making traditional DP approaches inapplicable. We propose SENTINEL, a verification mechanism for PP training without computation duplication. SENTINEL employs lightweight momentum-based monitoring using exponential moving averages (EMAs) to detect corrupted inter-stage communication. Unlike existing Byzantine-tolerant approaches for DP that aggregate parameter gradients across replicas, our approach verifies sequential activation/gradient transmission between layers. We provide theoretical convergence guarantees for this new setting that recovers classical convergence rates when relaxed to standard training. Experiments demonstrate successful training of up to 4B-parameter LLMs across untrusted distributed environments with up to 176 workers while maintaining model convergence and performance.

49.7LGMay 22
Unextractable Protocol Models: Collaborative Training and Inference without Weight Materialization

Alexander Long, Chamin Hewa Koneputugodage, Thalaiyasingam Ajanthan et al.

We consider a decentralized setup in which the participants collaboratively train and serve a large neural network, and where each participant only processes a subset of the model. In this setup, we explore the possibility of unmaterializable weights, where a full weight set is never available to any one participant. We introduce Unextractable Protocol Models (UPMs): a training and inference framework that leverages the sharded model setup to ensure model shards (i.e., subsets) held by participants are incompatible at different time steps. UPMs periodically inject time-varying, random, invertible transforms at participant boundaries; preserving the overall network function yet rendering cross-time assemblies incoherent. On Qwen-2.5-0.5B and Llama-3.2-1B, 10,000 transforms leave FP32 perplexity unchanged ($Δ$PPL $< 0.01$; Jensen-Shannon drift $< 4 \times 10^{-5}$), and we show how to control growth for lower precision datatypes. Applying a transform every 30s adds 3% latency, 0.1% bandwidth, and 10% GPU-memory overhead at inference, while training overhead falls to 1.6% time and $< 1$% memory. We consider several attacks, showing that the requirements of direct attacks are impractical and easy to defend against, and that gradient-based fine-tuning of stitched partitions consumes $\geq 60$% of the tokens required to train from scratch. By enabling models to be collaboratively trained yet not extracted, UPMs make it practical to embed programmatic incentive mechanisms in community-driven decentralized training.

LGJun 23, 2020Code
Calibration of Neural Networks using Splines

Kartik Gupta, Amir Rahimi, Thalaiyasingam Ajanthan et al.

Calibrating neural networks is of utmost importance when employing them in safety-critical applications where the downstream decision making depends on the predicted probabilities. Measuring calibration error amounts to comparing two empirical distributions. In this work, we introduce a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test in which the main idea is to compare the respective cumulative probability distributions. From this, by approximating the empirical cumulative distribution using a differentiable function via splines, we obtain a recalibration function, which maps the network outputs to actual (calibrated) class assignment probabilities. The spine-fitting is performed using a held-out calibration set and the obtained recalibration function is evaluated on an unseen test set. We tested our method against existing calibration approaches on various image classification datasets and our spline-based recalibration approach consistently outperforms existing methods on KS error as well as other commonly used calibration measures. Our Code is available at https://github.com/kartikgupta-at-anu/spline-calibration.

CVMar 30, 2020Code
Improved Gradient based Adversarial Attacks for Quantized Networks

Kartik Gupta, Thalaiyasingam Ajanthan

Neural network quantization has become increasingly popular due to efficient memory consumption and faster computation resulting from bitwise operations on the quantized networks. Even though they exhibit excellent generalization capabilities, their robustness properties are not well-understood. In this work, we systematically study the robustness of quantized networks against gradient based adversarial attacks and demonstrate that these quantized models suffer from gradient vanishing issues and show a fake sense of robustness. By attributing gradient vanishing to poor forward-backward signal propagation in the trained network, we introduce a simple temperature scaling approach to mitigate this issue while preserving the decision boundary. Despite being a simple modification to existing gradient based adversarial attacks, experiments on multiple image classification datasets with multiple network architectures demonstrate that our temperature scaled attacks obtain near-perfect success rate on quantized networks while outperforming original attacks on adversarially trained models as well as floating-point networks. Code is available at https://github.com/kartikgupta-at-anu/attack-bnn.

LGOct 18, 2019Code
Mirror Descent View for Neural Network Quantization

Thalaiyasingam Ajanthan, Kartik Gupta, Philip H. S. Torr et al.

Quantizing large Neural Networks (NN) while maintaining the performance is highly desirable for resource-limited devices due to reduced memory and time complexity. It is usually formulated as a constrained optimization problem and optimized via a modified version of gradient descent. In this work, by interpreting the continuous parameters (unconstrained) as the dual of the quantized ones, we introduce a Mirror Descent (MD) framework for NN quantization. Specifically, we provide conditions on the projections (i.e., mapping from continuous to quantized ones) which would enable us to derive valid mirror maps and in turn the respective MD updates. Furthermore, we present a numerically stable implementation of MD that requires storing an additional set of auxiliary variables (unconstrained), and show that it is strikingly analogous to the Straight Through Estimator (STE) based method which is typically viewed as a "trick" to avoid vanishing gradients issue. Our experiments on CIFAR-10/100, TinyImageNet, and ImageNet classification datasets with VGG-16, ResNet-18, and MobileNetV2 architectures show that our MD variants obtain quantized networks with state-of-the-art performance. Code is available at https://github.com/kartikgupta-at-anu/md-bnn.

LGNov 2, 2024
Guiding Neural Collapse: Optimising Towards the Nearest Simplex Equiangular Tight Frame

Evan Markou, Thalaiyasingam Ajanthan, Stephen Gould

Neural Collapse (NC) is a recently observed phenomenon in neural networks that characterises the solution space of the final classifier layer when trained until zero training loss. Specifically, NC suggests that the final classifier layer converges to a Simplex Equiangular Tight Frame (ETF), which maximally separates the weights corresponding to each class. By duality, the penultimate layer feature means also converge to the same simplex ETF. Since this simple symmetric structure is optimal, our idea is to utilise this property to improve convergence speed. Specifically, we introduce the notion of nearest simplex ETF geometry for the penultimate layer features at any given training iteration, by formulating it as a Riemannian optimisation. Then, at each iteration, the classifier weights are implicitly set to the nearest simplex ETF by solving this inner-optimisation, which is encapsulated within a declarative node to allow backpropagation. Our experiments on synthetic and real-world architectures for classification tasks demonstrate that our approach accelerates convergence and enhances training stability.

OCJun 10, 2025
Sharper Convergence Rates for Nonconvex Optimisation via Reduction Mappings

Evan Markou, Thalaiyasingam Ajanthan, Stephen Gould

Many high-dimensional optimisation problems exhibit rich geometric structures in their set of minimisers, often forming smooth manifolds due to over-parametrisation or symmetries. When this structure is known, at least locally, it can be exploited through reduction mappings that reparametrise part of the parameter space to lie on the solution manifold. These reductions naturally arise from inner optimisation problems and effectively remove redundant directions, yielding a lower-dimensional objective. In this work, we introduce a general framework to understand how such reductions influence the optimisation landscape. We show that well-designed reduction mappings improve curvature properties of the objective, leading to better-conditioned problems and theoretically faster convergence for gradient-based methods. Our analysis unifies a range of scenarios where structural information at optimality is leveraged to accelerate convergence, offering a principled explanation for the empirical gains observed in such optimisation algorithms.

LGJun 2, 2025
Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism

Sameera Ramasinghe, Thalaiyasingam Ajanthan, Gil Avraham et al.

Scaling models has led to significant advancements in deep learning, but training these models in decentralized settings remains challenging due to communication bottlenecks. While existing compression techniques are effective in data-parallel, they do not extend to model parallelism. Unlike data-parallel training, where weight gradients are exchanged, model-parallel requires compressing activations and activation gradients as they propagate through layers, accumulating compression errors. We propose a novel compression algorithm that compresses both forward and backward passes, enabling up to 99% compression with no convergence degradation with negligible memory/compute overhead. By leveraging a recursive structure in transformer networks, we predefine a low-dimensional subspace to confine the activations and gradients, allowing full reconstruction in subsequent layers. Our method achieves up to 100x improvement in communication efficiency and enables training billion-parameter-scale models over low-end GPUs connected via consumer-grade internet speeds as low as 80Mbps, matching the convergence of centralized datacenter systems with 100Gbps connections with model parallel.

LGMay 2, 2025
Nesterov Method for Asynchronous Pipeline Parallel Optimization

Thalaiyasingam Ajanthan, Sameera Ramasinghe, Yan Zuo et al.

Pipeline Parallelism (PP) enables large neural network training on small, interconnected devices by splitting the model into multiple stages. To maximize pipeline utilization, asynchronous optimization is appealing as it offers 100% pipeline utilization by construction. However, it is inherently challenging as the weights and gradients are no longer synchronized, leading to stale (or delayed) gradients. To alleviate this, we introduce a variant of Nesterov Accelerated Gradient (NAG) for asynchronous optimization in PP. Specifically, we modify the look-ahead step in NAG to effectively address the staleness in gradients. We theoretically prove that our approach converges at a sublinear rate in the presence of fixed delay in gradients. Our experiments on large-scale language modelling tasks using decoder-only architectures with up to 1B parameters, demonstrate that our approach significantly outperforms existing asynchronous methods, even surpassing the synchronous baseline.

CVNov 26, 2024
Learning Visual Hierarchies in Hyperbolic Space for Image Retrieval

Ziwei Wang, Sameera Ramasinghe, Chenchen Xu et al.

Structuring latent representations in a hierarchical manner enables models to learn patterns at multiple levels of abstraction. However, most prevalent image understanding models focus on visual similarity, and learning visual hierarchies is relatively unexplored. In this work, for the first time, we introduce a learning paradigm that can encode user-defined multi-level complex visual hierarchies in hyperbolic space without requiring explicit hierarchical labels. As a concrete example, first, we define a part-based image hierarchy using object-level annotations within and across images. Then, we introduce an approach to enforce the hierarchy using contrastive loss with pairwise entailment metrics. Finally, we discuss new evaluation metrics to effectively measure hierarchical image retrieval. Encoding these complex relationships ensures that the learned representations capture semantic and structural information that transcends mere visual similarity. Experiments in part-based image retrieval show significant improvements in hierarchical retrieval tasks, demonstrating the capability of our model in capturing visual hierarchies.

CVFeb 22, 2022
Retrieval Augmented Classification for Long-Tail Visual Recognition

Alexander Long, Wei Yin, Thalaiyasingam Ajanthan et al.

We introduce Retrieval Augmented Classification (RAC), a generic approach to augmenting standard image classification pipelines with an explicit retrieval module. RAC consists of a standard base image encoder fused with a parallel retrieval branch that queries a non-parametric external memory of pre-encoded images and associated text snippets. We apply RAC to the problem of long-tail classification and demonstrate a significant improvement over previous state-of-the-art on Places365-LT and iNaturalist-2018 (14.5% and 6.7% respectively), despite using only the training datasets themselves as the external information source. We demonstrate that RAC's retrieval module, without prompting, learns a high level of accuracy on tail classes. This, in turn, frees the base encoder to focus on common classes, and improve its performance thereon. RAC represents an alternative approach to utilizing large, pretrained models without requiring fine-tuning, as well as a first step towards more effectively making use of external memory within common computer vision architectures.

CVMar 25, 2021
Few-shot Weakly-Supervised Object Detection via Directional Statistics

Amirreza Shaban, Amir Rahimi, Thalaiyasingam Ajanthan et al.

Detecting novel objects from few examples has become an emerging topic in computer vision recently. However, these methods need fully annotated training images to learn new object categories which limits their applicability in real world scenarios such as field robotics. In this work, we propose a probabilistic multiple instance learning approach for few-shot Common Object Localization (COL) and few-shot Weakly Supervised Object Detection (WSOD). In these tasks, only image-level labels, which are much cheaper to acquire, are available. We find that operating on features extracted from the last layer of a pre-trained Faster-RCNN is more effective compared to previous episodic learning based few-shot COL methods. Our model simultaneously learns the distribution of the novel objects and localizes them via expectation-maximization steps. As a probabilistic model, we employ von Mises-Fisher (vMF) distribution which captures the semantic information better than Gaussian distribution when applied to the pre-trained embedding space. When the novel objects are localized, we utilize them to learn a linear appearance model to detect novel classes in new images. Our extensive experiments show that the proposed method, despite being simple, outperforms strong baselines in few-shot COL and WSOD, as well as large-scale WSOD tasks.

CVFeb 9, 2021
RANP: Resource Aware Neuron Pruning at Initialization for 3D CNNs

Zhiwei Xu, Thalaiyasingam Ajanthan, Vibhav Vineet et al.

Although 3D Convolutional Neural Networks are essential for most learning based applications involving dense 3D data, their applicability is limited due to excessive memory and computational requirements. Compressing such networks by pruning therefore becomes highly desirable. However, pruning 3D CNNs is largely unexplored possibly because of the complex nature of typical pruning algorithms that embeds pruning into an iterative optimization paradigm. In this work, we introduce a Resource Aware Neuron Pruning (RANP) algorithm that prunes 3D CNNs at initialization to high sparsity levels. Specifically, the core idea is to obtain an importance score for each neuron based on their sensitivity to the loss function. This neuron importance is then reweighted according to the neuron resource consumption related to FLOPs or memory. We demonstrate the effectiveness of our pruning method on 3D semantic segmentation with widely used 3D-UNets on ShapeNet and BraTS'18 datasets, video classification with MobileNetV2 and I3D on UCF101 dataset, and two-view stereo matching with Pyramid Stereo Matching (PSM) network on SceneFlow dataset. In these experiments, our RANP leads to roughly 50%-95% reduction in FLOPs and 35%-80% reduction in memory with negligible loss in accuracy compared to the unpruned networks. This significantly reduces the computational resources required to train 3D CNNs. The pruned network obtained by our algorithm can also be easily scaled up and transferred to another dataset for training.

CVOct 9, 2020
Refining Semantic Segmentation with Superpixel by Transparent Initialization and Sparse Encoder

Zhiwei Xu, Thalaiyasingam Ajanthan, Richard Hartley

Although deep learning greatly improves the performance of semantic segmentation, its success mainly lies in object central areas without accurate edges. As superpixels are a popular and effective auxiliary to preserve object edges, in this paper, we jointly learn semantic segmentation with trainable superpixels. We achieve it with fully-connected layers with Transparent Initialization (TI) and efficient logit consistency using a sparse encoder. The proposed TI preserves the effects of learned parameters of pretrained networks. This avoids a significant increase of the loss of pretrained networks, which otherwise may be caused by inappropriate parameter initialization of the additional layers. Meanwhile, consistent pixel labels in each superpixel are guaranteed by logit consistency. The sparse encoder with sparse matrix operations substantially reduces both the memory requirement and the computational complexity. We demonstrated the superiority of TI over other parameter initialization methods and tested its numerical stability. The effectiveness of our proposal was validated on PASCAL VOC 2012, ADE20K, and PASCAL Context showing enhanced semantic segmentation edges. With quantitative evaluations on segmentation edges using performance ratio and F-measure, our method outperforms the state-of-the-art.

CVOct 6, 2020
RANP: Resource Aware Neuron Pruning at Initialization for 3D CNNs

Zhiwei Xu, Thalaiyasingam Ajanthan, Vibhav Vineet et al.

Although 3D Convolutional Neural Networks (CNNs) are essential for most learning based applications involving dense 3D data, their applicability is limited due to excessive memory and computational requirements. Compressing such networks by pruning therefore becomes highly desirable. However, pruning 3D CNNs is largely unexplored possibly because of the complex nature of typical pruning algorithms that embeds pruning into an iterative optimization paradigm. In this work, we introduce a Resource Aware Neuron Pruning (RANP) algorithm that prunes 3D CNNs at initialization to high sparsity levels. Specifically, the core idea is to obtain an importance score for each neuron based on their sensitivity to the loss function. This neuron importance is then reweighted according to the neuron resource consumption related to FLOPs or memory. We demonstrate the effectiveness of our pruning method on 3D semantic segmentation with widely used 3D-UNets on ShapeNet and BraTS'18 as well as on video classification with MobileNetV2 and I3D on UCF101 dataset. In these experiments, our RANP leads to roughly 50-95 reduction in FLOPs and 35-80 reduction in memory with negligible loss in accuracy compared to the unpruned networks. This significantly reduces the computational resources required to train 3D CNNs. The pruned network obtained by our algorithm can also be easily scaled up and transferred to another dataset for training.

LGJun 23, 2020
Post-hoc Calibration of Neural Networks by g-Layers

Amir Rahimi, Thomas Mensink, Kartik Gupta et al.

Calibration of neural networks is a critical aspect to consider when incorporating machine learning models in real-world decision-making systems where the confidence of decisions are equally important as the decisions themselves. In recent years, there is a surge of research on neural network calibration and the majority of the works can be categorized into post-hoc calibration methods, defined as methods that learn an additional function to calibrate an already trained base network. In this work, we intend to understand the post-hoc calibration methods from a theoretical point of view. Especially, it is known that minimizing Negative Log-Likelihood (NLL) will lead to a calibrated network on the training set if the global optimum is attained (Bishop, 1994). Nevertheless, it is not clear learning an additional function in a post-hoc manner would lead to calibration in the theoretical sense. To this end, we prove that even though the base network ($f$) does not lead to the global optimum of NLL, by adding additional layers ($g$) and minimizing NLL by optimizing the parameters of $g$ one can obtain a calibrated network $g \circ f$. This not only provides a less stringent condition to obtain a calibrated network but also provides a theoretical justification of post-hoc calibration methods. Our experiments on various image classification benchmarks confirm the theory.

LGJun 22, 2020
Bidirectionally Self-Normalizing Neural Networks

Yao Lu, Stephen Gould, Thalaiyasingam Ajanthan

The problem of vanishing and exploding gradients has been a long-standing obstacle that hinders the effective training of neural networks. Despite various tricks and techniques that have been employed to alleviate the problem in practice, there still lacks satisfactory theories or provable solutions. In this paper, we address the problem from the perspective of high-dimensional probability theory. We provide a rigorous result that shows, under mild conditions, how the vanishing/exploding gradients problem disappears with high probability if the neural networks have sufficient width. Our main idea is to constrain both forward and backward signal propagation in a nonlinear neural network through a new class of activation functions, namely Gaussian-Poincaré normalized functions, and orthogonal weight matrices. Experiments on both synthetic and real-world data validate our theory and confirm its effectiveness on very deep neural networks when applied in practice.

LGMar 25, 2020
Understanding the Effects of Data Parallelism and Sparsity on Neural Network Training

Namhoon Lee, Thalaiyasingam Ajanthan, Philip H. S. Torr et al.

We study two factors in neural network training: data parallelism and sparsity; here, data parallelism means processing training data in parallel using distributed systems (or equivalently increasing batch size), so that training can be accelerated; for sparsity, we refer to pruning parameters in a neural network model, so as to reduce computational and memory cost. Despite their promising benefits, however, understanding of their effects on neural network training remains elusive. In this work, we first measure these effects rigorously by conducting extensive experiments while tuning all metaparameters involved in the optimization. As a result, we find across various workloads of data set, network model, and optimization algorithm that there exists a general scaling trend between batch size and number of training steps to convergence for the effect of data parallelism, and further, difficulty of training under sparsity. Then, we develop a theoretical analysis based on the convergence properties of stochastic gradient methods and smoothness of the optimization landscape, which illustrates the observed phenomena precisely and generally, establishing a better account of the effects of data parallelism and sparsity on neural network training.

CVMar 18, 2020
Pairwise Similarity Knowledge Transfer for Weakly Supervised Object Localization

Amir Rahimi, Amirreza Shaban, Thalaiyasingam Ajanthan et al.

Weakly Supervised Object Localization (WSOL) methods only require image level labels as opposed to expensive bounding box annotations required by fully supervised algorithms. We study the problem of learning localization model on target classes with weakly supervised image labels, helped by a fully annotated source dataset. Typically, a WSOL model is first trained to predict class generic objectness scores on an off-the-shelf fully supervised source dataset and then it is progressively adapted to learn the objects in the weakly supervised target dataset. In this work, we argue that learning only an objectness function is a weak form of knowledge transfer and propose to learn a classwise pairwise similarity function that directly compares two input proposals as well. The combined localization model and the estimated object annotations are jointly learned in an alternating optimization paradigm as is typically done in standard WSOL methods. In contrast to the existing work that learns pairwise similarities, our approach optimizes a unified objective with convergence guarantee and it is computationally efficient for large-scale applications. Experiments on the COCO and ILSVRC 2013 detection datasets show that the performance of the localization model improves significantly with the inclusion of pairwise similarity function. For instance, in the ILSVRC dataset, the Correct Localization (CorLoc) performance improves from 72.8% to 78.2% which is a new state-of-the-art for WSOL task in the context of knowledge transfer.

CVOct 24, 2019
Fast and Differentiable Message Passing on Pairwise Markov Random Fields

Zhiwei Xu, Thalaiyasingam Ajanthan, Richard Hartley

Despite the availability of many Markov Random Field (MRF) optimization algorithms, their widespread usage is currently limited due to imperfect MRF modelling arising from hand-crafted model parameters and the selection of inferior inference algorithm. In addition to differentiability, the two main aspects that enable learning these model parameters are the forward and backward propagation time of the MRF optimization algorithm and its inference capabilities. In this work, we introduce two fast and differentiable message passing algorithms, namely, Iterative Semi-Global Matching Revised (ISGMR) and Parallel Tree-Reweighted Message Passing (TRWP) which are greatly sped up on a GPU by exploiting massive parallelism. Specifically, ISGMR is an iterative and revised version of the standard SGM for general pairwise MRFs with improved optimization effectiveness, and TRWP is a highly parallel version of Sequential TRW (TRWS) for faster optimization. Our experiments on the standard stereo and denoising benchmarks demonstrated that ISGMR and TRWP achieve much lower energies than SGM and Mean-Field (MF), and TRWP is two orders of magnitude faster than TRWS without losing effectiveness in optimization. We further demonstrated the effectiveness of our algorithms on end-to-end learning for semantic segmentation. Notably, our CUDA implementations are at least $7$ and $700$ times faster than PyTorch GPU implementations for forward and backward propagation respectively, enabling efficient end-to-end learning with message passing.

LGJun 14, 2019
A Signal Propagation Perspective for Pruning Neural Networks at Initialization

Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould et al.

Network pruning is a promising avenue for compressing deep neural networks. A typical approach to pruning starts by training a model and then removing redundant parameters while minimizing the impact on what is learned. Alternatively, a recent approach shows that pruning can be done at initialization prior to training, based on a saliency criterion called connection sensitivity. However, it remains unclear exactly why pruning an untrained, randomly initialized neural network is effective. In this work, by noting connection sensitivity as a form of gradient, we formally characterize initialization conditions to ensure reliable connection sensitivity measurements, which in turn yields effective pruning results. Moreover, we analyze the signal propagation properties of the resulting pruned networks and introduce a simple, data-free method to improve their trainability. Our modifications to the existing pruning at initialization method lead to improved results on all tested network models for image classification tasks. Furthermore, we empirically study the effect of supervision for pruning and demonstrate that our signal propagation perspective, combined with unsupervised pruning, can be useful in various scenarios where pruning is applied to non-standard arbitrarily-designed architectures.

CVApr 5, 2019
Learning to Adapt for Stereo

Alessio Tonioni, Oscar Rahnama, Thomas Joy et al.

Real world applications of stereo depth estimation require models that are robust to dynamic variations in the environment. Even though deep learning based stereo methods are successful, they often fail to generalize to unseen variations in the environment, making them less suitable for practical applications such as autonomous driving. In this work, we introduce a "learning-to-adapt" framework that enables deep stereo methods to continuously adapt to new target domains in an unsupervised manner. Specifically, our approach incorporates the adaptation procedure into the learning objective to obtain a base set of parameters that are better suited for unsupervised online adaptation. To further improve the quality of the adaptation, we learn a confidence measure that effectively masks the errors introduced during the unsupervised adaptation. We evaluate our method on synthetic and real-world stereo datasets and our experiments evidence that learning-to-adapt is, indeed beneficial for online adaptation on vastly different domains.

LGFeb 27, 2019
On Tiny Episodic Memories in Continual Learning

Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny et al.

In continual learning (CL), an agent learns from a stream of tasks leveraging prior experience to transfer knowledge to future tasks. It is an ideal framework to decrease the amount of supervision in the existing learning algorithms. But for a successful knowledge transfer, the learner needs to remember how to perform previous tasks. One way to endow the learner the ability to perform tasks seen in the past is to store a small memory, dubbed episodic memory, that stores few examples from previous tasks and then to replay these examples when training for future tasks. In this work, we empirically analyze the effectiveness of a very small episodic memory in a CL setup where each training example is only seen once. Surprisingly, across four rather different supervised learning benchmarks adapted to CL, a very simple baseline, that jointly trains on both examples from the current task as well as examples stored in the episodic memory, significantly outperforms specifically designed CL approaches with and without episodic memory. Interestingly, we find that repetitive training on even tiny memories of past tasks does not harm generalization, on the contrary, it improves it, with gains between 7\% and 17\% when the memory is populated with a single example per class.

CVDec 11, 2018
Proximal Mean-field for Neural Network Quantization

Thalaiyasingam Ajanthan, Puneet K. Dokania, Richard Hartley et al.

Compressing large Neural Networks (NN) by quantizing the parameters, while maintaining the performance is highly desirable due to reduced memory and time complexity. In this work, we cast NN quantization as a discrete labelling problem, and by examining relaxations, we design an efficient iterative optimization procedure that involves stochastic gradient descent followed by a projection. We prove that our simple projected gradient descent approach is, in fact, equivalent to a proximal version of the well-known mean-field method. These findings would allow the decades-old and theoretically grounded research on MRF optimization to be used to design better network quantization schemes. Our experiments on standard classification datasets (MNIST, CIFAR10/100, TinyImageNet) with convolutional and residual architectures show that our algorithm obtains fully-quantized networks with accuracies very close to the floating-point reference networks.

CVNov 22, 2018
Generalized Range Moves

Richard Hartley, Thalaiyasingam Ajanthan

We consider move-making algorithms for energy minimization of multi-label Markov Random Fields (MRFs). Since this is not a tractable problem in general, a commonly used heuristic is to minimize over subsets of labels and variables in an iterative procedure. Such methods include α-expansion, αβ-swap, and range-moves. In each iteration, a small subset of variables are active in the optimization, which diminishes their effectiveness, and increases the required number of iterations. In this paper, we present a method in which optimization can be carried out over all labels, and most, or all variables at once. Experiments show substantial improvement with respect to previous move-making algorithms.

CVOct 4, 2018
SNIP: Single-shot Network Pruning based on Connection Sensitivity

Namhoon Lee, Thalaiyasingam Ajanthan, Philip H. S. Torr

Pruning large neural networks while maintaining their performance is often desirable due to the reduced space and time complexity. In existing methods, pruning is done within an iterative optimization procedure with either heuristically designed pruning schedules or additional hyperparameters, undermining their utility. In this work, we present a new approach that prunes a given network once at initialization prior to training. To achieve this, we introduce a saliency criterion based on connection sensitivity that identifies structurally important connections in the network for the given task. This eliminates the need for both pretraining and the complex pruning schedule while making it robust to architecture variations. After pruning, the sparse network is trained in the standard way. Our method obtains extremely sparse networks with virtually the same accuracy as the reference network on the MNIST, CIFAR-10, and Tiny-ImageNet classification tasks and is broadly applicable to various architectures including convolutional, residual and recurrent networks. Unlike existing methods, our approach enables us to demonstrate that the retained connections are indeed relevant to the given task.

CVMay 23, 2018
Efficient Relaxations for Dense CRFs with Sparse Higher Order Potentials

Thomas Joy, Alban Desmaison, Thalaiyasingam Ajanthan et al.

Dense conditional random fields (CRFs) have become a popular framework for modelling several problems in computer vision such as stereo correspondence and multi-class semantic segmentation. By modelling long-range interactions, dense CRFs provide a labelling that captures finer detail than their sparse counterparts. Currently, the state-of-the-art algorithm performs mean-field inference using a filter-based method but fails to provide a strong theoretical guarantee on the quality of the solution. A question naturally arises as to whether it is possible to obtain a maximum a posteriori (MAP) estimate of a dense CRF using a principled method. Within this paper, we show that this is indeed possible. We will show that, by using a filter-based method, continuous relaxations of the MAP problem can be optimised efficiently using state-of-the-art algorithms. Specifically, we will solve a quadratic programming (QP) relaxation using the Frank-Wolfe algorithm and a linear programming (LP) relaxation by developing a proximal minimisation framework. By exploiting labelling consistency in the higher-order potentials and utilising the filter-based method, we are able to formulate the above algorithms such that each iteration has a complexity linear in the number of classes and random variables. The presented algorithms can be applied to any labelling problem using a dense CRF with sparse higher-order potentials. In this paper, we use semantic segmentation as an example application as it demonstrates the ability of the algorithm to scale to dense CRFs with large dimensions. We perform experiments on the Pascal dataset to indicate that the presented algorithms are able to attain lower energies than the mean-field inference method.

CVApr 17, 2018
DGPose: Deep Generative Models for Human Body Analysis

Rodrigo de Bem, Arnab Ghosh, Thalaiyasingam Ajanthan et al.

Deep generative modelling for human body analysis is an emerging problem with many interesting applications. However, the latent space learned by such approaches is typically not interpretable, resulting in less flexibility. In this work, we present deep generative models for human body analysis in which the body pose and the visual appearance are disentangled. Such a disentanglement allows independent manipulation of pose and appearance, and hence enables applications such as pose-transfer without specific training for such a task. Our proposed models, the Conditional-DGPose and the Semi-DGPose, have different characteristics. In the first, body pose labels are taken as conditioners, from a fully-supervised training set. In the second, our structured semi-supervised approach allows for pose estimation to be performed by the model itself and relaxes the need for labelled data. Therefore, the Semi-DGPose aims for the joint understanding and generation of people in images. It is not only capable of mapping images to interpretable latent representations but also able to map these representations back to the image space. We compare our models with relevant baselines, the ClothNet-Body and the Pose Guided Person Generation networks, demonstrating their merits on the Human3.6M, ChictopiaPlus and DeepFashion benchmarks.

CVJan 30, 2018
Riemannian Walk for Incremental Learning: Understanding Forgetting and Intransigence

Arslan Chaudhry, Puneet K. Dokania, Thalaiyasingam Ajanthan et al.

Incremental learning (IL) has received a lot of attention recently, however, the literature lacks a precise problem definition, proper evaluation settings, and metrics tailored specifically for the IL problem. One of the main objectives of this work is to fill these gaps so as to provide a common ground for better understanding of IL. The main challenge for an IL algorithm is to update the classifier whilst preserving existing knowledge. We observe that, in addition to forgetting, a known issue while preserving knowledge, IL also suffers from a problem we call intransigence, inability of a model to update its knowledge. We introduce two metrics to quantify forgetting and intransigence that allow us to understand, analyse, and gain better insights into the behaviour of IL algorithms. We present RWalk, a generalization of EWC++ (our efficient version of EWC [Kirkpatrick2016EWC]) and Path Integral [Zenke2017Continual] with a theoretically grounded KL-divergence based perspective. We provide a thorough analysis of various IL algorithms on MNIST and CIFAR-100 datasets. In these experiments, RWalk obtains superior results in terms of accuracy, and also provides a better trade-off between forgetting and intransigence.

DSFeb 20, 2017
Memory Efficient Max Flow for Multi-label Submodular MRFs

Thalaiyasingam Ajanthan, Richard Hartley, Mathieu Salzmann

Multi-label submodular Markov Random Fields (MRFs) have been shown to be solvable using max-flow based on an encoding of the labels proposed by Ishikawa, in which each variable $X_i$ is represented by $\ell$ nodes (where $\ell$ is the number of labels) arranged in a column. However, this method in general requires $2\,\ell^2$ edges for each pair of neighbouring variables. This makes it inapplicable to realistic problems with many variables and labels, due to excessive memory requirement. In this paper, we introduce a variant of the max-flow algorithm that requires much less storage. Consequently, our algorithm makes it possible to optimally solve multi-label submodular problems involving large numbers of variables and labels on a standard computer.

CVNov 29, 2016
Efficient Linear Programming for Dense CRFs

Thalaiyasingam Ajanthan, Alban Desmaison, Rudy Bunel et al.

The fully connected conditional random field (CRF) with Gaussian pairwise potentials has proven popular and effective for multi-class semantic segmentation. While the energy of a dense CRF can be minimized accurately using a linear programming (LP) relaxation, the state-of-the-art algorithm is too slow to be useful in practice. To alleviate this deficiency, we introduce an efficient LP minimization algorithm for dense CRFs. To this end, we develop a proximal minimization framework, where the dual of each proximal problem is optimized via block coordinate descent. We show that each block of variables can be efficiently optimized. Specifically, for one block, the problem decomposes into significantly smaller subproblems, each of which is defined over a single pixel. For the other block, the problem is optimized via conditional gradient descent. This has two advantages: 1) the conditional gradient can be computed in a time linear in the number of pixels and labels; and 2) the optimal step size can be computed analytically. Our experiments on standard datasets provide compelling evidence that our approach outperforms all existing baselines including the previous LP based approach for dense CRFs.

CVNov 24, 2014
Iteratively Reweighted Graph Cut for Multi-label MRFs with Non-convex Priors

Thalaiyasingam Ajanthan, Richard Hartley, Mathieu Salzmann et al.

While widely acknowledged as highly effective in computer vision, multi-label MRFs with non-convex priors are difficult to optimize. To tackle this, we introduce an algorithm that iteratively approximates the original energy with an appropriately weighted surrogate energy that is easier to minimize. Our algorithm guarantees that the original energy decreases at each iteration. In particular, we consider the scenario where the global minimizer of the weighted surrogate energy can be obtained by a multi-label graph cut algorithm, and show that our algorithm then lets us handle of large variety of non-convex priors. We demonstrate the benefits of our method over state-of-the-art MRF energy minimization techniques on stereo and inpainting problems.