CVMar 9, 2023Code
Probabilistic 3d regression with projected huber distributionDavid Mohlin, Josephine Sullivan
Estimating probability distributions which describe where an object is likely to be from camera data is a task with many applications. In this work we describe properties which we argue such methods should conform to. We also design a method which conform to these properties. In our experiments we show that our method produces uncertainties which correlate well with empirical errors. We also show that the mode of the predicted distribution outperform our regression baselines. The code for our implementation is available online.
21.3CVMar 20Code
Wildfire Spread Scenarios: Increasing Sample Diversity of Segmentation Diffusion Models with Training-Free MethodsSebastian Gerard, Josephine Sullivan
Predicting future states in uncertain environments, such as wildfire spread, medical diagnosis, or autonomous driving, requires models that can consider multiple plausible outcomes. While diffusion models can effectively learn such multi-modal distributions, naively sampling from these models is computationally inefficient, potentially requiring hundreds of samples to find low-probability modes that may still be operationally relevant. In this work, we address the challenge of sample-efficient ambiguous segmentation by evaluating several training-free sampling methods that encourage diverse predictions. We adapt two techniques, particle guidance and SPELL, originally designed for the generation of diverse natural images, to discrete segmentation tasks, and additionally propose a simple clustering-based technique. We validate these approaches on the LIDC medical dataset, a modified version of the Cityscapes dataset, and MMFire, a new simulation-based wildfire spread dataset introduced in this paper. Compared to naive sampling, these approaches increase the HM IoU* metric by up to 7.5% on MMFire and 16.4% on Cityscapes, demonstrating that training-free methods can be used to efficiently increase the sample diversity of segmentation diffusion models with little cost to image quality and runtime. Code and dataset: https://github.com/SebastianGer/wildfire-spread-scenarios
CVNov 24, 2022
Contrastive pretraining for semantic segmentation is robust to noisy positive pairsSebastian Gerard, Josephine Sullivan
Domain-specific variants of contrastive learning can construct positive pairs from two distinct in-domain images, while traditional methods just augment the same image twice. For example, we can form a positive pair from two satellite images showing the same location at different times. Ideally, this teaches the model to ignore changes caused by seasons, weather conditions or image acquisition artifacts. However, unlike in traditional contrastive methods, this can result in undesired positive pairs, since we form them without human supervision. For example, a positive pair might consist of one image before a disaster and one after. This could teach the model to ignore the differences between intact and damaged buildings, which might be what we want to detect in the downstream task. Similar to false negative pairs, this could impede model performance. Crucially, in this setting only parts of the images differ in relevant ways, while other parts remain similar. Surprisingly, we find that downstream semantic segmentation is either robust to such badly matched pairs or even benefits from them. The experiments are conducted on the remote sensing dataset xBD, and a synthetic segmentation dataset for which we have full control over the pairing conditions. As a result, practitioners can use these domain-specific contrastive methods without having to filter their positive pairs beforehand, or might even be encouraged to purposefully include such pairs in their pretraining dataset.
CVJan 30, 2024Code
A simple, strong baseline for building damage detection on the xBD datasetSebastian Gerard, Paul Borne-Pons, Josephine Sullivan
We construct a strong baseline method for building damage detection by starting with the highly-engineered winning solution of the xView2 competition, and gradually stripping away components. This way, we obtain a much simpler method, while retaining adequate performance. We expect the simplified solution to be more widely and easily applicable. This expectation is based on the reduced complexity, as well as the fact that we choose hyperparameters based on simple heuristics, that transfer to other datasets. We then re-arrange the xView2 dataset splits such that the test locations are not seen during training, contrary to the competition setup. In this setting, we find that both the complex and the simplified model fail to generalize to unseen locations. Analyzing the dataset indicates that this failure to generalize is not only a model-based problem, but that the difficulty might also be influenced by the unequal class distributions between events. Code, including the baseline model, is available under https://github.com/PaulBorneP/Xview2_Strong_Baseline
CVNov 19, 2021Code
Probabilistic Regression with Huber DistributionsDavid Mohlin, Gerald Bianchi, Josephine Sullivan
In this paper we describe a probabilistic method for estimating the position of an object along with its covariance matrix using neural networks. Our method is designed to be robust to outliers, have bounded gradients with respect to the network outputs, among other desirable properties. To achieve this we introduce a novel probability distribution inspired by the Huber loss. We also introduce a new way to parameterize positive definite matrices to ensure invariance to the choice of orientation for the coordinate system we regress over. We evaluate our method on popular body pose and facial landmark datasets and get performance on par or exceeding the performance of non-heatmap methods. Our code is available at github.com/Davmo049/Public_prob_regression_with_huber_distributions
11.4CVMar 20
Deterministic Mode Proposals: An Efficient Alternative to Generative Sampling for Ambiguous SegmentationSebastian Gerard, Josephine Sullivan
Many segmentation tasks, such as medical image segmentation or future state prediction, are inherently ambiguous, meaning that multiple predictions are equally correct. Current methods typically rely on generative models to capture this uncertainty. However, identifying the underlying modes of the distribution with these methods is computationally expensive, requiring large numbers of samples and post-hoc clustering. In this paper, we shift the focus from stochastic sampling to the direct generation of likely outcomes. We introduce mode proposal models, a deterministic framework that efficiently produces a fixed-size set of proposal masks in a single forward pass. To handle superfluous proposals, we adapt a confidence mechanism, traditionally used in object detection, to the high-dimensional space of segmentation masks. Our approach significantly reduces inference time while achieving higher ground-truth coverage than existing generative models. Furthermore, we demonstrate that our model can be trained without knowing the full distribution of outcomes, making it applicable to real-world datasets. Finally, we show that by decomposing the velocity field of a pre-trained flow model, we can efficiently estimate prior mode probabilities for our proposals.
CVOct 14, 2025
Unconditional Human Motion and Shape Generation via Balanced Score-Based DiffusionDavid Björkstrand, Tiesheng Wang, Lars Bretzner et al.
Recent work has explored a range of model families for human motion generation, including Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and diffusion-based models. Despite their differences, many methods rely on over-parameterized input features and auxiliary losses to improve empirical results. These strategies should not be strictly necessary for diffusion models to match the human motion distribution. We show that on par with state-of-the-art results in unconditional human motion generation are achievable with a score-based diffusion model using only careful feature-space normalization and analytically derived weightings for the standard L2 score-matching loss, while generating both motion and shape directly, thereby avoiding slow post hoc shape recovery from joints. We build the method step by step, with a clear theoretical motivation for each component, and provide targeted ablations demonstrating the effectiveness of each proposed addition in isolation.
LGFeb 23, 2022
Are All Linear Regions Created Equal?Matteo Gamba, Adrian Chmielewski-Anders, Josephine Sullivan et al.
The number of linear regions has been studied as a proxy of complexity for ReLU networks. However, the empirical success of network compression techniques like pruning and knowledge distillation, suggest that in the overparameterized setting, linear regions density might fail to capture the effective nonlinearity. In this work, we propose an efficient algorithm for discovering linear regions and use it to investigate the effectiveness of density in capturing the nonlinearity of trained VGGs and ResNets on CIFAR-10 and CIFAR-100. We contrast the results with a more principled nonlinearity measure based on function variation, highlighting the shortcomings of linear regions density. Furthermore, interestingly, our measure of nonlinearity clearly correlates with model-wise deep double descent, connecting reduced test error with reduced nonlinearity, and increased local similarity of linear regions.
CVMar 17, 2021
Large-Scale Zero-Shot Image Classification from Rich and Diverse Textual DescriptionsSebastian Bujwid, Josephine Sullivan
We study the impact of using rich and diverse textual descriptions of classes for zero-shot learning (ZSL) on ImageNet. We create a new dataset ImageNet-Wiki that matches each ImageNet class to its corresponding Wikipedia article. We show that merely employing these Wikipedia articles as class descriptions yields much higher ZSL performance than prior works. Even a simple model using this type of auxiliary data outperforms state-of-the-art models that rely on standard features of word embedding encodings of class names. These results highlight the usefulness and importance of textual descriptions for ZSL, as well as the relative importance of auxiliary data type compared to algorithmic progress. Our experimental results also show that standard zero-shot learning approaches generalize poorly across categories of classes.
CVJun 16, 2020
Explanation-based Weakly-supervised Learning of Visual Relations with Graph NetworksFederico Baldassarre, Kevin Smith, Josephine Sullivan et al.
Visual relationship detection is fundamental for holistic image understanding. However, the localization and classification of (subject, predicate, object) triplets remain challenging tasks, due to the combinatorial explosion of possible relationships, their long-tailed distribution in natural images, and an expensive annotation process. This paper introduces a novel weakly-supervised method for visual relationship detection that relies on minimal image-level predicate labels. A graph neural network is trained to classify predicates in images from a graph representation of detected objects, implicitly encoding an inductive bias for pairwise relations. We then frame relationship detection as the explanation of such a predicate classifier, i.e. we obtain a complete relation by recovering the subject and object of a predicted predicate. We present results comparable to recent fully- and weakly-supervised methods on three diverse and challenging datasets: HICO-DET for human-object interaction, Visual Relationship Detection for generic object-to-object relations, and UnRel for unusual triplets; demonstrating robustness to non-comprehensive annotations and good few-shot generalization.
CVFeb 12, 2016
Face Attribute Prediction Using Off-the-Shelf CNN FeaturesYang Zhong, Josephine Sullivan, Haibo Li
Predicting attributes from face images in the wild is a challenging computer vision problem. To automatically describe face attributes from face containing images, traditionally one needs to cascade three technical blocks --- face localization, facial descriptor construction, and attribute classification --- in a pipeline. As a typical classification problem, face attribute prediction has been addressed using deep learning. Current state-of-the-art performance was achieved by using two cascaded Convolutional Neural Networks (CNNs), which were specifically trained to learn face localization and attribute description. In this paper, we experiment with an alternative way of employing the power of deep representations from CNNs. Combining with conventional face localization techniques, we use off-the-shelf architectures trained for face recognition to build facial descriptors. Recognizing that the describable face attributes are diverse, our face descriptors are constructed from different levels of the CNNs for different attributes to best facilitate face attribute prediction. Experiments on two large datasets, LFWA and CelebA, show that our approach is entirely comparable to the state-of-the-art. Our findings not only demonstrate an efficient face attribute prediction approach, but also raise an important question: how to leverage the power of off-the-shelf CNN representations for novel tasks.
CVFeb 4, 2016
Leveraging Mid-Level Deep Representations For Predicting Face Attributes in the WildYang Zhong, Josephine Sullivan, Haibo Li
Predicting facial attributes from faces in the wild is very challenging due to pose and lighting variations in the real world. The key to this problem is to build proper feature representations to cope with these unfavourable conditions. Given the success of Convolutional Neural Network (CNN) in image classification, the high-level CNN feature, as an intuitive and reasonable choice, has been widely utilized for this problem. In this paper, however, we consider the mid-level CNN features as an alternative to the high-level ones for attribute prediction. This is based on the observation that face attributes are different: some of them are locally oriented while others are globally defined. Our investigations reveal that the mid-level deep representations outperform the prediction accuracy achieved by the (fine-tuned) high-level abstractions. We empirically demonstrate that the midlevel representations achieve state-of-the-art prediction performance on CelebA and LFWA datasets. Our investigations also show that by utilizing the mid-level representations one can employ a single deep network to achieve both face recognition and attribute prediction.
CVDec 20, 2014
Visual Instance Retrieval with Deep Convolutional NetworksAli Sharif Razavian, Josephine Sullivan, Stefan Carlsson et al.
This paper provides an extensive study on the availability of image representations based on convolutional networks (ConvNets) for the task of visual instance retrieval. Besides the choice of convolutional layers, we present an efficient pipeline exploiting multi-scale schemes to extract local features, in particular, by taking geometric invariance into explicit account, i.e. positions, scales and spatial consistency. In our experiments using five standard image retrieval datasets, we demonstrate that generic ConvNet image representations can outperform other state-of-the-art methods if they are extracted appropriately.
CVNov 24, 2014
Persistent Evidence of Local Image Properties in Generic ConvNetsAli Sharif Razavian, Hossein Azizpour, Atsuto Maki et al.
Supervised training of a convolutional network for object classification should make explicit any information related to the class of objects and disregard any auxiliary information associated with the capture of the image or the variation within the object class. Does this happen in practice? Although this seems to pertain to the very final layers in the network, if we look at earlier layers we find that this is not the case. Surprisingly, strong spatial information is implicit. This paper addresses this, in particular, exploiting the image representation at the first fully connected layer, i.e. the global image descriptor which has been recently shown to be most effective in a range of visual recognition tasks. We empirically demonstrate evidences for the finding in the contexts of four different tasks: 2d landmark detection, 2d object keypoints prediction, estimation of the RGB values of input image, and recovery of semantic label of each pixel. We base our investigation on a simple framework with ridge rigression commonly across these tasks, and show results which all support our insight. Such spatial information can be used for computing correspondence of landmarks to a good accuracy, but should potentially be useful for improving the training of the convolutional nets for classification purposes.
CVJun 22, 2014
Factors of Transferability for a Generic ConvNet RepresentationHossein Azizpour, Ali Sharif Razavian, Josephine Sullivan et al.
Evidence is mounting that Convolutional Networks (ConvNets) are the most effective representation learning method for visual recognition tasks. In the common scenario, a ConvNet is trained on a large labeled dataset (source) and the feed-forward units activation of the trained network, at a certain layer of the network, is used as a generic representation of an input image for a task with relatively smaller training set (target). Recent studies have shown this form of representation transfer to be suitable for a wide range of target visual recognition tasks. This paper introduces and investigates several factors affecting the transferability of such representations. It includes parameters for training of the source ConvNet such as its architecture, distribution of the training data, etc. and also the parameters of feature extraction such as layer of the trained ConvNet, dimensionality reduction, etc. Then, by optimizing these factors, we show that significant improvements can be achieved on various (17) visual recognition tasks. We further show that these visual recognition tasks can be categorically ordered based on their distance from the source task such that a correlation between the performance of tasks and their distance from the source task w.r.t. the proposed factors is observed.
CVMar 23, 2014
CNN Features off-the-shelf: an Astounding Baseline for RecognitionAli Sharif Razavian, Hossein Azizpour, Josephine Sullivan et al.
Recent results indicate that the generic descriptors extracted from the convolutional neural networks are very powerful. This paper adds to the mounting evidence that this is indeed the case. We report on a series of experiments conducted for different recognition tasks using the publicly available code and model of the \overfeat network which was trained to perform object classification on ILSVRC13. We use features extracted from the \overfeat network as a generic image representation to tackle the diverse range of recognition tasks of object image classification, scene recognition, fine grained recognition, attribute detection and image retrieval applied to a diverse set of datasets. We selected these tasks and datasets as they gradually move further away from the original task and data the \overfeat network was trained to solve. Astonishingly, we report consistent superior results compared to the highly tuned state-of-the-art systems in all the visual classification tasks on various datasets. For instance retrieval it consistently outperforms low memory footprint methods except for sculptures dataset. The results are achieved using a linear SVM classifier (or $L2$ distance in case of retrieval) applied to a feature representation of size 4096 extracted from a layer in the net. The representations are further modified using simple augmentation techniques e.g. jittering. The results strongly suggest that features obtained from deep learning with convolutional nets should be the primary candidate in most visual recognition tasks.