CV Jun 6, 2022
A Deeper Dive Into What Deep Spatiotemporal Networks Encode: Quantifying Static vs. Dynamic Information
Matthew Kowal, Mennatullah Siam, Md Amirul Islam et al.
Deep spatiotemporal models are used in a variety of computer vision tasks, such as action recognition and video object segmentation. Currently, there is a limited understanding of what information is captured by these models in their intermediate representations. For example, while it has been observed that action recognition algorithms are heavily influenced by visual appearance in single static frames, there is no quantitative methodology for evaluating such static bias in the latent representation compared to bias toward dynamic information (e.g. motion). We tackle this challenge by proposing a novel approach for quantifying the static and dynamic biases of any spatiotemporal model. To show the efficacy of our approach, we analyse two widely studied tasks, action recognition and video object segmentation. Our key findings are threefold: (i) Most examined spatiotemporal models are biased toward static information; although, certain two-stream architectures with cross-connections show a better balance between the static and dynamic information captured. (ii) Some datasets that are commonly assumed to be biased toward dynamics are actually biased toward static information. (iii) Individual units (channels) in an architecture can be biased toward static, dynamic or a combination of the two.
CV Nov 3, 2022
Quantifying and Learning Static vs. Dynamic Information in Deep Spatiotemporal Networks
Matthew Kowal, Mennatullah Siam, Md Amirul Islam et al.
There is limited understanding of the information captured by deep spatiotemporal models in their intermediate representations. For example, while evidence suggests that action recognition algorithms are heavily influenced by visual appearance in single frames, no quantitative methodology exists for evaluating such static bias in the latent representation compared to bias toward dynamics. We tackle this challenge by proposing an approach for quantifying the static and dynamic biases of any spatiotemporal model, and apply our approach to three tasks, action recognition, automatic video object segmentation (AVOS) and video instance segmentation (VIS). Our key findings are: (i) Most examined models are biased toward static information. (ii) Some datasets that are assumed to be biased toward dynamics are actually biased toward static information. (iii) Individual channels in an architecture can be biased toward static, dynamic or a combination of the two. (iv) Most models converge to their culminating biases in the first half of training. We then explore how these biases affect performance on dynamically biased datasets. For action recognition, we propose StaticDropout, a semantically guided dropout that debiases a model from static information toward dynamics. For AVOS, we design a better combination of fusion and cross connection layers compared with previous architectures.
CV Dec 20, 2018
SMILER: Saliency Model Implementation Library for Experimental Research
Calden Wloka, Toni Kunić, Iuliia Kotseruba et al.
The Saliency Model Implementation Library for Experimental Research (SMILER) is a new software package which provides an open, standardized, and extensible framework for maintaining and executing computational saliency models. This work drastically reduces the human effort required to apply saliency algorithms to new tasks and datasets, while also ensuring consistency and procedural correctness for results and conclusions produced by different parties. At its launch SMILER already includes twenty three saliency models (fourteen models based in MATLAB and nine supported through containerization), and the open design of SMILER encourages this number to grow with future contributions from the community. The project may be downloaded and contributed to through its GitHub page: https://github.com/tsotsoslab/smiler
CV Oct 20, 2021
Simpler Does It: Generating Semantic Labels with Objectness Guidance
Md Amirul Islam, Matthew Kowal, Sen Jia et al.
Existing weakly or semi-supervised semantic segmentation methods utilize image or box-level supervision to generate pseudo-labels for weakly labeled images. However, due to the lack of strong supervision, the generated pseudo-labels are often noisy near the object boundaries, which severely impacts the network's ability to learn strong representations. To address this problem, we present a novel framework that generates pseudo-labels for training images, which are then used to train a segmentation model. To generate pseudo-labels, we combine information from: (i) a class agnostic objectness network that learns to recognize object-like regions, and (ii) either image-level or bounding box annotations. We show the efficacy of our approach by demonstrating how the objectness network can naturally be leveraged to generate object-like regions for unseen categories. We then propose an end-to-end multi-task learning strategy, that jointly learns to segment semantics and objectness using the generated pseudo-labels. Extensive experiments demonstrate the high quality of our generated pseudo-labels and effectiveness of the proposed framework in a variety of domains. Our approach achieves better or competitive performance compared to existing weakly-supervised and semi-supervised methods.
CV Aug 23, 2021
SegMix: Co-occurrence Driven Mixup for Semantic Segmentation and Adversarial Robustness
Md Amirul Islam, Matthew Kowal, Konstantinos G. Derpanis et al.
In this paper, we present a strategy for training convolutional neural networks to effectively resolve interference arising from competing hypotheses relating to inter-categorical information throughout the network. The premise is based on the notion of feature binding, which is defined as the process by which activations spread across space and layers in the network are successfully integrated to arrive at a correct inference decision. In our work, this is accomplished for the task of dense image labelling by blending images based on (i) categorical clustering or (ii) the co-occurrence likelihood of categories. We then train a feature binding network which simultaneously segments and separates the blended images. Subsequent feature denoising to suppress noisy activations reveals additional desirable properties and high degrees of successful predictions. Through this process, we reveal a general mechanism, distinct from any prior methods, for boosting the performance of the base segmentation and saliency network while simultaneously increasing robustness to adversarial attacks.
CV Aug 17, 2021
Global Pooling, More than Meets the Eye: Position Information is Encoded Channel-Wise in CNNs
Md Amirul Islam, Matthew Kowal, Sen Jia et al.
In this paper, we challenge the common assumption that collapsing the spatial dimensions of a 3D (spatial-channel) tensor in a convolutional neural network (CNN) into a vector via global pooling removes all spatial information. Specifically, we demonstrate that positional information is encoded based on the ordering of the channel dimensions, while semantic information is largely not. Following this demonstration, we show the real-world impact of these findings by applying them to two applications. First, we propose a simple yet effective data augmentation strategy and loss function which improves the translation invariance of a CNN's output. Second, we propose a method to efficiently determine which channels in the latent representation are responsible for (i) encoding overall position information or (ii) region-specific positions. We first show that semantic segmentation has a significant reliance on the overall position channels to make predictions. We then show for the first time that it is possible to perform a 'region-specific' attack, and degrade a network's performance in a particular part of the input. We believe our findings and demonstrated applications will benefit research areas concerned with understanding the characteristics of CNNs.
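As a toy illustration of how a globally pooled vector can still carry position, the sketch below pools two hand-built channels. This is not the authors' method or experiment: the helper names (`toy_features`, `image_with_dot`) and the left/right channels are assumptions made purely for illustration, whereas the abstract concerns channels learned by a CNN.

```python
def global_average_pool(feature_map):
    """Collapse a C x H x W nested list into a length-C vector of channel means."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def toy_features(image):
    """Hypothetical two-channel 'network': channel 0 keeps the left half of
    the image, channel 1 keeps the right half (hand-built, not learned)."""
    width = len(image[0])
    left = [[v if x < width // 2 else 0.0 for x, v in enumerate(row)] for row in image]
    right = [[v if x >= width // 2 else 0.0 for x, v in enumerate(row)] for row in image]
    return [left, right]


def image_with_dot(height, width, y, x):
    """Blank image with a single bright pixel at (y, x)."""
    image = [[0.0] * width for _ in range(height)]
    image[y][x] = 1.0
    return image


# Same content (one bright pixel), different horizontal position.
pooled_left = global_average_pool(toy_features(image_with_dot(4, 4, 1, 0)))
pooled_right = global_average_pool(toy_features(image_with_dot(4, 4, 1, 3)))

print(pooled_left)   # the non-zero entry is in channel 0
print(pooled_right)  # the non-zero entry is in channel 1
```

Although pooling removes the spatial axes, which channel is active still reveals where the pixel was, so in this toy case position is readable from the channel index alone.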
CV Jan 28, 2021
Position, Padding and Predictions: A Deeper Look at Position Information in CNNs
Md Amirul Islam, Matthew Kowal, Sen Jia et al.
In contrast to fully connected networks, Convolutional Neural Networks (CNNs) achieve efficiency by learning weights associated with local filters with a finite spatial extent. An implication of this is that a filter may know what it is looking at, but not where it is positioned in the image. In this paper, we first test this hypothesis and reveal that a surprising degree of absolute position information is encoded in commonly used CNNs. We show that zero padding drives CNNs to encode position information in their internal representations, while a lack of padding precludes position encoding. This gives rise to deeper questions about the role of position information in CNNs: (i) What boundary heuristics enable optimal position encoding for downstream tasks?; (ii) Does position encoding affect the learning of semantic representations?; (iii) Does position encoding always improve performance? To provide answers, we perform the largest case study to date on the role that padding and border heuristics play in CNNs. We design novel tasks which allow us to quantify boundary effects as a function of the distance to the border. Numerous semantic objectives reveal the effect of the border on semantic representations. Finally, we demonstrate the implications of these findings on multiple real-world tasks to show that position information can both help or hurt performance.
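The link between zero padding and position can be seen in a minimal sketch: convolving a constant image with and without zero padding. The `conv2d` helper, the 5x5 image and the all-ones 3x3 kernel are assumptions chosen for illustration, not taken from the paper.

```python
def conv2d(image, kernel, pad):
    """Naive 2D cross-correlation with zero padding of width `pad` (stride 1)."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    # Build the zero-padded input.
    padded = [[0.0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for i in range(h):
        for j in range(w):
            padded[i + pad][j + pad] = image[i][j]
    out_h = h + 2 * pad - k + 1
    out_w = w + 2 * pad - k + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(
                kernel[a][b] * padded[i + a][j + b]
                for a in range(k)
                for b in range(k)
            )
    return out


image = [[1.0] * 5 for _ in range(5)]   # constant input: no content cue at all
kernel = [[1.0] * 3 for _ in range(3)]  # all-ones 3x3 filter

same = conv2d(image, kernel, pad=1)   # zero padding, output stays 5x5
valid = conv2d(image, kernel, pad=0)  # no padding, output shrinks to 3x3

for row in same:
    print(row)
for row in valid:
    print(row)
```

With zero padding the response drops toward the border (corners see fewer non-zero inputs than the interior), so the output depends on location even though the input is uniform. Without padding every output value is identical, and this single filter gives no cue about where it sits in the image.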
CV Aug 13, 2020
Feature Binding with Category-Dependant MixUp for Semantic Segmentation and Adversarial Robustness
Md Amirul Islam, Matthew Kowal, Konstantinos G. Derpanis et al.
In this paper, we present a strategy for training convolutional neural networks to effectively resolve interference arising from competing hypotheses relating to inter-categorical information throughout the network. The premise is based on the notion of feature binding, which is defined as the process by which activations spread across space and layers in the network are successfully integrated to arrive at a correct inference decision. In our work, this is accomplished for the task of dense image labelling by blending images based on their class labels, and then training a feature binding network, which simultaneously segments and separates the blended images. Subsequent feature denoising to suppress noisy activations reveals additional desirable properties and high degrees of successful predictions. Through this process, we reveal a general mechanism, distinct from any prior methods, for boosting the performance of the base segmentation network while simultaneously increasing robustness to adversarial attacks.
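The blending step can be sketched with a standard mixup-style convex combination of two images. The partner-selection rule here (take the first sample with a different label) and the names `mixup` and `pick_partner` are hypothetical stand-ins; the abstract does not specify how class labels drive the pairing.

```python
def mixup(img_a, img_b, lam):
    """Pixel-wise convex blend: lam * img_a + (1 - lam) * img_b."""
    return [
        [lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]


def pick_partner(index, labels):
    """Hypothetical label-dependent rule: return the first sample whose
    class label differs from that of `index` (falls back to `index`)."""
    for j, label in enumerate(labels):
        if label != labels[index]:
            return j
    return index


# Three tiny 1x2 "images" with their class labels (illustrative data only).
images = [[[1.0, 0.0]], [[1.0, 1.0]], [[0.0, 1.0]]]
labels = ["cat", "cat", "dog"]

partner = pick_partner(0, labels)
blended = mixup(images[0], images[partner], lam=0.75)
print(partner, blended)
```

A network trained on such blends would receive `blended` as input and be asked to recover the labelling of both source images, which is the "segment and separate" objective described above.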
CV Feb 24, 2020
Revisiting Saliency Metrics: Farthest-Neighbor Area Under Curve
Sen Jia, Neil D. B. Bruce
Saliency detection has been widely studied because it plays an important role in various vision applications, but it is difficult to evaluate saliency systems because each measure has its own bias. In this paper, we first revisit the problem of applying the widely used saliency metrics to modern Convolutional Neural Networks (CNNs). Our investigation shows that saliency datasets have been built based on different choices of parameters and that CNNs are designed to fit a dataset-specific distribution. Secondly, we show that the Shuffled Area Under Curve (S-AUC) metric still suffers from spatial biases. We propose a new saliency metric based on the AUC property, denoted Farthest-Neighbor AUC (FN-AUC), which aims at sampling a more directional negative set for evaluation. We also propose a strategy to measure the quality of the sampled negative set. Our experiments show that FN-AUC can measure spatial biases, central and peripheral, more effectively than S-AUC without penalizing the fixation locations. Thirdly, we propose a global smoothing function to overcome the problem of few value degrees (output quantization) in computing AUC metrics. Compared with random noise, our smoothing function can create unique values without losing the relative saliency relationship.
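To make the AUC setting concrete, here is a minimal sketch of a shuffled-AUC-style score: positives are saliency values at an image's own fixations, negatives are values at fixation locations borrowed from other images. It uses the pairwise (rank) form of AUC and does not reproduce the FN-AUC sampling or the paper's smoothing function; the toy saliency map and coordinates are assumptions.

```python
def auc(positives, negatives):
    """Probability that a random positive outscores a random negative.
    Ties receive half credit, as in the rank-based definition of AUC."""
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def shuffled_auc(saliency, fixations, other_fixations):
    """Score a saliency map using fixations from other images as negatives."""
    pos = [saliency[y][x] for y, x in fixations]
    neg = [saliency[y][x] for y, x in other_fixations]
    return auc(pos, neg)


saliency = [
    [0.1, 0.2, 0.1],
    [0.2, 0.9, 0.8],
    [0.1, 0.2, 0.1],
]
fixations = [(1, 1), (0, 1)]        # (row, col) fixations on this image
other_fixations = [(0, 0), (2, 2)]  # fixations taken from other images

score = shuffled_auc(saliency, fixations, other_fixations)

# Coarse output quantization collapses distinct values into ties.
quantized = [[round(v) for v in row] for row in saliency]
quantized_score = shuffled_auc(quantized, fixations, other_fixations)

print(score, quantized_score)
```

The continuous map ranks every positive above every negative, while the quantized map produces ties that pull the score down. This is the "few value degrees" problem that the abstract's smoothing function is meant to address.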
CV Jan 22, 2020
How Much Position Information Do Convolutional Neural Networks Encode?
Md Amirul Islam, Sen Jia, Neil D. B. Bruce
In contrast to fully connected networks, Convolutional Neural Networks (CNNs) achieve efficiency by learning weights associated with local filters with a finite spatial extent. An implication of this is that a filter may know what it is looking at, but not where it is positioned in the image. Information concerning absolute position is inherently useful, and it is reasonable to assume that deep CNNs may implicitly learn to encode this information if there is a means to do so. In this paper, we test this hypothesis revealing the surprising degree of absolute position information that is encoded in commonly used neural networks. A comprehensive set of experiments show the validity of this hypothesis and shed light on how and where this information is represented while offering clues to where positional information is derived from in deep CNNs.
CV Sep 28, 2019
Distributed Iterative Gating Networks for Semantic Segmentation
Rezaul Karim, Md Amirul Islam, Neil D. B. Bruce
In this paper, we present a canonical structure for controlling information flow in neural networks with an efficient feedback routing mechanism based on a strategy of Distributed Iterative Gating (DIGNet). The structure of this mechanism derives from a strong conceptual foundation and presents a light-weight mechanism for adaptive control of computation similar to recurrent convolutional neural networks by integrating feedback signals with a feed-forward architecture. In contrast to other RNN formulations, DIGNet generates feedback signals in a cascaded manner that implicitly carries information from all the layers above. This cascaded feedback propagation by means of the propagator gates is found to be more effective compared to other feedback mechanisms that use feedback from the output of either the corresponding stage or from the previous stage. Experiments reveal the high degree of capability that this recurrent approach with cascaded feedback presents over feed-forward baselines and other recurrent models for pixel-wise labeling problems on three challenging datasets, PASCAL VOC 2012, COCO-Stuff, and ADE20K.
CV Jan 8, 2019
Richer and Deeper Supervision Network for Salient Object Detection
Sen Jia, Neil D. B. Bruce
Recent Salient Object Detection (SOD) systems are mostly based on Convolutional Neural Networks (CNNs). Specifically, the Deeply Supervised Saliency (DSS) system has shown that it is very useful to add short connections to the network and to supervise the side outputs. In this work, we propose a new SOD system which aims at designing a more efficient and effective way to pass back global information. Richer and Deeper Supervision (RDS) is applied to better combine features from each side output without demanding much extra computational space. Meanwhile, the backbone network used for SOD is normally pre-trained on the object classification dataset ImageNet. However, the pre-trained model has been trained on cropped images so that it focuses only on distinguishing features within the region of the object, whereas the ignored background information is also significant in the task of SOD. We try to solve this problem by introducing training data designed for object detection. Coarse global information is learned from entire images with their bounding boxes before training on the SOD dataset. The large scale of the object images can slightly improve the performance of SOD. Our experiments show that the proposed RDS network achieves state-of-the-art results on five public SOD datasets.
CV Nov 20, 2018
Recurrent Iterative Gating Networks for Semantic Segmentation
Rezaul Karim, Md Amirul Islam, Neil D. B. Bruce
In this paper, we present an approach for Recurrent Iterative Gating called RIGNet. The core elements of RIGNet involve recurrent connections that control the flow of information in neural networks in a top-down manner, and different variants on the core structure are considered. The iterative nature of this mechanism allows for gating to spread in both spatial extent and feature space. This is revealed to be a powerful mechanism with broad compatibility with common existing networks. Analysis shows how gating interacts with different network characteristics, and we also show that more shallow networks with gating may be made to perform better than much deeper networks that do not include RIGNet modules.
CV Oct 3, 2018
Relative Saliency and Ranking: Models, Metrics, Data, and Benchmarks
Mahmoud Kalash, Md Amirul Islam, Neil D. B. Bruce
Salient object detection is a problem that has been considered in detail and many solutions have been proposed. In this paper, we argue that work to date has addressed a problem that is relatively ill-posed. Specifically, there is not universal agreement about what constitutes a salient object when multiple observers are queried. This implies that some objects are more likely to be judged salient than others, and implies a relative rank exists on salient objects. Initially, we present a novel deep learning solution based on a hierarchical representation of relative saliency and stage-wise refinement. Further to this, we present data, analysis and baseline benchmark results towards addressing the problem of salient object ranking. Methods for deriving suitable ranked salient object instances are presented, along with metrics suitable to measuring algorithm performance. In addition, we show how a derived dataset can be successively refined to provide cleaned results that correlate well with pristine ground truth in its characteristics and value for training and testing models. Finally, we provide a comparison among prevailing algorithms that address salient object ranking or detection to establish initial baselines providing a basis for comparison with future efforts addressing this problem. The source code and data are publicly available via our project page: https://ryersonvisionlab.github.io/cocosalrank.html
CV Jul 25, 2018
Semantics Meet Saliency: Exploring Domain Affinity and Models for Dual-Task Prediction
Md Amirul Islam, Mahmoud Kalash, Neil D. B. Bruce
Much research has examined models for prediction of semantic labels or instances including dense pixel-wise prediction. The problem of predicting salient objects or regions of an image has also been examined in a similar light. With that said, there is an apparent relationship between these two problem domains in that the composition of a scene and associated semantic categories is certain to play into what is deemed salient. In this paper, we explore the relationship between these two problem domains. This is carried out in constructing deep neural networks that perform both predictions together albeit with different configurations for flow of conceptual information related to each distinct problem. This is accompanied by a detailed analysis of object co-occurrences that shed light on dataset bias and semantic precedence specific to individual categories.
CV Jun 29, 2018
Gated Feedback Refinement Network for Coarse-to-Fine Dense Semantic Image Labeling
Md Amirul Islam, Mrigank Rochan, Shujon Naha et al.
Effective integration of local and global contextual information is crucial for semantic segmentation and dense image labeling. We develop two encoder-decoder based deep learning architectures to address this problem. We first propose a network architecture called Label Refinement Network (LRN) that predicts segmentation labels in a coarse-to-fine fashion at several spatial resolutions. In this network, we also define loss functions at several stages to provide supervision at different stages of training. However, there are limits to the quality of refinement possible if ambiguous information is passed forward. To address this limitation, we also propose the Gated Feedback Refinement Network (G-FRNet). Initially, G-FRNet makes a coarse-grained prediction which it progressively refines to recover details by effectively integrating local and global contextual information during the refinement stages. This is achieved by gate units proposed in this work, which control the information passed forward in order to resolve the ambiguity. Experiments were conducted on five challenging dense labeling datasets (CamVid, PASCAL VOC 2012, Horse-Cow Parsing, PASCAL-Person-Part, and SUN-RGBD). G-FRNet achieves state-of-the-art semantic segmentation results on the CamVid and Horse-Cow Parsing datasets and produces results competitive with the best performing approaches that appear in the literature for the other three datasets.
CV May 2, 2018
EML-NET: An Expandable Multi-Layer NETwork for Saliency Prediction
Sen Jia, Neil D. B. Bruce
Saliency prediction can benefit from training that involves scene understanding that may be tangential to the central task; this may include understanding places, spatial layout, objects or involve different datasets and their bias. One can combine models, but to do this in a sophisticated manner can be complex, and also result in unwieldy networks or produce competing objectives that are hard to balance. In this paper, we propose a scalable system to leverage multiple powerful deep CNN models to better extract visual features for saliency prediction. Our design differs from previous studies in that the whole system is trained in an almost end-to-end piece-wise fashion. The encoder and decoder components are separately trained to deal with complexity tied to the computational paradigm and required space. Furthermore, the encoder can contain more than one CNN model to extract features, and models can have different architectures or be pre-trained on different datasets. This parallel design yields a better computational paradigm overcoming limits to the variety of information or inference that can be combined at the encoder stage towards deeper networks and a more powerful encoding. Our network can be easily expanded almost without any additional cost, and other pre-trained CNN models can be incorporated availing a wider range of visual knowledge. We denote our expandable multi-layer network as EML-NET and our method achieves the state-of-the-art results on the public saliency benchmarks, SALICON, MIT300 and CAT2000.
CV Mar 14, 2018
Revisiting Salient Object Detection: Simultaneous Detection, Ranking, and Subitizing of Multiple Salient Objects
Md Amirul Islam, Mahmoud Kalash, Neil D. B. Bruce
Salient object detection is a problem that has been considered in detail and many solutions proposed. In this paper, we argue that work to date has addressed a problem that is relatively ill-posed. Specifically, there is not universal agreement about what constitutes a salient object when multiple observers are queried. This implies that some objects are more likely to be judged salient than others, and implies a relative rank exists on salient objects. The solution presented in this paper solves this more general problem that considers relative rank, and we propose data and metrics suitable to measuring success in a relative object saliency landscape. A novel deep learning solution is proposed based on a hierarchical representation of relative saliency and stage-wise refinement. We also show that the problem of salient object subitizing can be addressed with the same network, and our approach exceeds performance of any prior work across all metrics considered (both traditional and newly proposed).