CV · Nov 30, 2022 · Code
ObjCAViT: Improving Monocular Depth Estimation Using Natural Language Models And Image-Object Cross-Attention
Dylan Auty, Krystian Mikolajczyk
While monocular depth estimation (MDE) is an important problem in computer vision, it is difficult due to the ambiguity that results from the compression of a 3D scene into only 2 dimensions. It is common practice in the field to treat it as simple image-to-image translation, without consideration for the semantics of the scene and the objects within it. In contrast, humans and animals have been shown to use higher-level information to solve MDE: prior knowledge of the nature of the objects in the scene, their positions and likely configurations relative to one another, and their apparent sizes have all been shown to help resolve this ambiguity. In this paper, we present a novel method to enhance MDE performance by encouraging use of known-useful information about the semantics of objects and inter-object relationships within a scene. Our novel ObjCAViT module sources world-knowledge from language models and learns inter-object relationships in the context of the MDE problem using transformer attention, incorporating apparent size information. Our method produces highly accurate depth maps, and we obtain competitive results on the NYUv2 and KITTI datasets. Our ablation experiments show that the use of language and cross-attention within the ObjCAViT module increases performance. Code is released at https://github.com/DylanAuty/ObjCAViT.
CV · Mar 20, 2023
Understanding the Role of the Projector in Knowledge Distillation
Roy Miles, Krystian Mikolajczyk
In this paper we revisit the efficacy of knowledge distillation as a function matching and metric learning problem. In doing so we verify three important design decisions, namely the normalisation, soft maximum function, and projection layers as key ingredients. We theoretically show that the projector implicitly encodes information on past examples, enabling relational gradients for the student. We then show that the normalisation of representations is tightly coupled with the training dynamics of this projector, which can have a large impact on the student's performance. Finally, we show that a simple soft maximum function can be used to address any significant capacity gap problems. Experimental results on various benchmark datasets demonstrate that using these insights can lead to superior or comparable performance to state-of-the-art knowledge distillation techniques, despite being much more computationally efficient. In particular, we obtain these results across image classification (CIFAR100 and ImageNet), object detection (COCO2017), and on more difficult distillation objectives, such as training data efficient transformers, whereby we attain a 77.2% top-1 accuracy with DeiT-Ti on ImageNet. Code and models are publicly available.
CV · Dec 1, 2022
Multi-Class Segmentation from Aerial Views using Recursive Noise Diffusion
Benedikt Kolbeinsson, Krystian Mikolajczyk
Semantic segmentation from aerial views is a crucial task for autonomous drones, as they rely on precise and accurate segmentation to navigate safely and efficiently. However, aerial images present unique challenges such as diverse viewpoints, extreme scale variations, and high scene complexity. In this paper, we propose an end-to-end multi-class semantic segmentation diffusion model that addresses these challenges. We introduce recursive denoising to allow information to propagate through the denoising process, as well as a hierarchical multi-scale approach that complements the diffusion process. Our method achieves promising results on the UAVid dataset and state-of-the-art performance on the Vaihingen Building segmentation benchmark. Being the first iteration of this method, it shows great promise for future improvements.
CV · Sep 30, 2024
Match Stereo Videos via Bidirectional Alignment
Junpeng Jing, Ye Mao, Anlan Qiu et al.
Video stereo matching is the task of estimating consistent disparity maps from rectified stereo videos. There is considerable scope for improvement in both datasets and methods within this area. Recent learning-based methods often focus on optimizing performance for independent stereo pairs, leading to temporal inconsistencies in videos. Existing video methods typically employ a sliding-window operation over the time dimension, which can result in low-frequency oscillations corresponding to the window size. To address these challenges, we propose a bidirectional alignment mechanism for adjacent frames as a fundamental operation. Building on this, we introduce a novel video processing framework, BiDAStereo, and a plugin stabilizer network, BiDAStabilizer, compatible with general image-based methods. Regarding datasets, current synthetic object-based and indoor datasets are commonly used for training and benchmarking, with a lack of outdoor nature scenarios. To bridge this gap, we present a realistic synthetic dataset and benchmark focused on natural scenes, along with a real-world dataset captured by a stereo camera in diverse urban scenes for qualitative evaluation. Extensive experiments on in-domain, out-of-domain, and robustness evaluation demonstrate the contribution of our methods and datasets, showcasing improvements in prediction quality and achieving state-of-the-art results on various commonly used benchmarks. The project page, demos, code, and datasets are available at: https://tomtomtommi.github.io/BiDAVideo/.
LG · Nov 29, 2023
Adaptive Early Exiting for Collaborative Inference over Noisy Wireless Channels
Mikolaj Jankowski, Deniz Gunduz, Krystian Mikolajczyk
Collaborative inference systems are one of the emerging solutions for deploying deep neural networks (DNNs) at the wireless network edge. Their main idea is to divide a DNN into two parts, where the first is shallow enough to be reliably executed at edge devices of limited computational power, while the second part is executed at an edge server with higher computational capabilities. The main advantage of such systems is that the input of the DNN gets compressed as the subsequent layers of the shallow part extract only the information necessary for the task. As a result, significant communication savings can be achieved compared to transmitting raw input samples. In this work, we study early exiting in the context of collaborative inference, which allows obtaining inference results at the edge device for certain samples, without the need to transmit the partially processed data to the edge server at all, leading to further communication savings. The central part of our system is the transmission-decision (TD) mechanism, which, given the information from the early exit and the wireless channel conditions, decides whether to keep the early exit prediction or transmit the data to the edge server for further processing. In this paper, we evaluate various TD mechanisms and show experimentally that, for an image classification task over the wireless edge, proper utilization of early exits can provide both performance gains and significant communication savings.
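As a rough illustration of what a transmission-decision rule could look like, the sketch below keeps the early-exit prediction when its softmax confidence clears a threshold that is relaxed as the channel degrades. The abstract does not specify the TD mechanisms it evaluates, so the function name `transmission_decision`, the parameters `base_threshold`, `snr_ref_db`, and `slope`, and the linear SNR adjustment are all assumptions made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def transmission_decision(early_exit_logits, snr_db,
                          base_threshold=0.8, snr_ref_db=10.0, slope=0.01):
    """Hypothetical TD rule: return ("keep", label) or ("transmit", None).

    The confidence threshold is lowered when the channel SNR is below a
    reference value, so that fewer samples are sent over a poor channel.
    This rule is an assumption, not the mechanism from the paper.
    """
    probs = softmax(early_exit_logits)
    confidence = max(probs)
    label = probs.index(confidence)

    # Linear adjustment of the threshold with channel quality (assumed).
    threshold = base_threshold - slope * (snr_ref_db - snr_db)
    threshold = min(max(threshold, 0.5), 0.99)  # keep it in a sane range

    if confidence >= threshold:
        return ("keep", label)
    return ("transmit", None)


if __name__ == "__main__":
    logits = [2.0, 0.0, 0.0]  # moderately confident early exit
    print(transmission_decision(logits, snr_db=10.0))  # good channel
    print(transmission_decision(logits, snr_db=0.0))   # poor channel
```

With these assumed defaults, the same moderately confident sample is forwarded to the server on a good channel but resolved locally on a poor one, which is the trade-off between accuracy and communication cost that the abstract describes.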
CV · Apr 21, 2022
Monocular Depth Estimation Using Cues Inspired by Biological Vision Systems
Dylan Auty, Krystian Mikolajczyk
Monocular depth estimation (MDE) aims to transform an RGB image of a scene into a pixelwise depth map from the same camera view. It is fundamentally ill-posed due to missing information: any single image could have been taken from many possible 3D scenes. Part of the MDE task is, therefore, to learn which visual cues in the image can be used for depth estimation, and how. With training data limited by cost of annotation or network capacity limited by computational power, this is challenging. In this work we demonstrate that explicitly injecting visual cue information into the model is beneficial for depth estimation. Following research into biological vision systems, we focus on semantic information and prior knowledge of object sizes and their relations, to emulate the biological cues of relative size, familiar size, and absolute size. We use state-of-the-art semantic and instance segmentation models to provide external information, and exploit language embeddings to encode relational information between classes. We also provide a prior on the average real-world size of objects. This external information overcomes the limitation in data availability, and ensures that the limited capacity of a given network is focused on known-helpful cues, therefore improving performance. We experimentally validate our hypothesis and evaluate the proposed model on the widely used NYUD2 indoor depth estimation benchmark. The results show improvements in depth prediction when the semantic information, size prior and instance size are explicitly provided along with the RGB images, and our method can be easily adapted to any depth estimation system.
CV · Apr 2
Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding
Ye Mao, Weixun Luo, Ranran Huang et al.
Pretraining 3D encoders by aligning with Contrastive Language Image Pretraining (CLIP) has emerged as a promising direction to learn generalizable representations for 3D scene understanding. In this paper, we propose UniScene3D, a transformer-based encoder that learns unified scene representations from multi-view colored pointmaps, jointly modeling image appearance and geometry. For robust colored pointmap representation learning, we introduce novel cross-view geometric alignment and grounded view alignment to enforce cross-view geometry and semantic consistency. Extensive low-shot and task-specific fine-tuning evaluations on viewpoint grounding, scene retrieval, scene type classification, and 3D VQA demonstrate our state-of-the-art performance. These results highlight the effectiveness of our approach for unified 3D scene understanding. https://yebulabula.github.io/UniScene3D/
CV · Mar 29
From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis
Ranran Huang, Weixun Luo, Ye Mao et al.
In this paper, we introduce NAS3R, a self-supervised feed-forward framework that jointly learns explicit 3D geometry and camera parameters with no ground-truth annotations and no pretrained priors. During training, NAS3R reconstructs 3D Gaussians from uncalibrated and unposed context views and renders target views using its self-predicted camera parameters, enabling self-supervised training from 2D photometric supervision. To ensure stable convergence, NAS3R integrates reconstruction and camera prediction within a shared transformer backbone regulated by masked attention, and adopts a depth-based Gaussian formulation that facilitates well-conditioned optimization. The framework is compatible with state-of-the-art supervised 3D reconstruction architectures and can incorporate pretrained priors or intrinsic information when available. Extensive experiments show that NAS3R achieves superior results to other self-supervised methods, establishing a scalable and geometry-aware paradigm for 3D reconstruction from unconstrained data. Code and models are publicly available at https://ranrhuang.github.io/nas3r/.
RO · Jun 1, 2022
SAMPLE-HD: Simultaneous Action and Motion Planning Learning Environment
Michal Nazarczuk, Tony Ng, Krystian Mikolajczyk
Humans exhibit incredibly high levels of multi-modal understanding: combining visual cues with read or heard knowledge comes easily to us and allows for very accurate interaction with the surrounding environment. Various simulation environments focus on providing data for tasks related to scene understanding, question answering, space exploration, and visual navigation. In this work, we provide a solution that encompasses both the visual and behavioural aspects of simulation in a new environment for learning interactive reasoning in a manipulation setup. The SAMPLE-HD environment makes it possible to generate various scenes composed of small household objects, to procedurally generate language instructions for manipulation, and to generate ground truth paths serving as training data.
CV · Dec 9, 2021 · Code
ScaleNet: A Shallow Architecture for Scale Estimation
Axel Barroso-Laguna, Yurun Tian, Krystian Mikolajczyk
In this paper, we address the problem of estimating scale factors between images. We formulate the scale estimation problem as a prediction of a probability distribution over scale factors. We design a new architecture, ScaleNet, that exploits dilated convolutions as well as self and cross-correlation layers to predict the scale between images. We demonstrate that rectifying images with estimated scales leads to significant performance improvements for various tasks and methods. Specifically, we show how ScaleNet can be combined with sparse local features and dense correspondence networks to improve camera pose estimation, 3D reconstruction, or dense geometric matching in different benchmarks and datasets. We provide an extensive evaluation on several tasks and analyze the computational overhead of ScaleNet. The code, evaluation protocols, and trained models are publicly available at https://github.com/axelBarroso/ScaleNet.
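To make the idea of predicting a probability distribution over scale factors concrete, the sketch below turns a vector of per-bin scores into a single scale estimate by taking the expectation in log2 space. The abstract does not say how ScaleNet reduces its distribution to a scalar, so the helper `expected_scale`, the bin layout, and the log-space averaging are assumptions for illustration only.

```python
import math


def expected_scale(logits, log2_scale_bins):
    """Collapse a distribution over scale bins into one scale factor.

    logits          : unnormalised scores, one per bin (hypothetical output)
    log2_scale_bins : bin centres expressed as log2 of the scale factor

    Averaging in log space treats a 2x zoom-in and a 2x zoom-out
    symmetrically; this choice is an assumption of the sketch.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    mean_log2 = sum(p * b for p, b in zip(probs, log2_scale_bins))
    return 2.0 ** mean_log2


if __name__ == "__main__":
    bins = [-2.0, -1.0, 0.0, 1.0, 2.0]  # scale factors 1/4, 1/2, 1, 2, 4
    scores = [0.1, 0.2, 0.5, 3.0, 0.4]  # made-up scores peaking near 2x
    s = expected_scale(scores, bins)
    print(f"estimated scale factor: {s:.3f}")
    # One image could then be resized by s before running a matcher.
```

The estimated factor could then be used to resize one image of the pair before feature extraction, which corresponds to the rectification step the abstract mentions.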
CV · Dec 1, 2021 · Code
Information Theoretic Representation Distillation
Roy Miles, Adrian Lopez Rodriguez, Krystian Mikolajczyk
Despite the empirical success of knowledge distillation, current state-of-the-art methods are computationally expensive to train, which makes them difficult to adopt in practice. To address this problem, we introduce two distinct complementary losses inspired by a cheap entropy-like estimator. These losses aim to maximise the correlation and mutual information between the student and teacher representations. Our method incurs significantly less training overheads than other approaches and achieves competitive performance to the state-of-the-art on the knowledge distillation and cross-model transfer tasks. We further demonstrate the effectiveness of our method on a binary distillation task, whereby it leads to a new state-of-the-art for binary quantisation and approaches the performance of a full precision model. Code: www.github.com/roymiles/ITRD
CV · Jan 24, 2020 · Code
SOLAR: Second-Order Loss and Attention for Image Retrieval
Tony Ng, Vassileios Balntas, Yurun Tian et al.
Recent works in deep learning have shown that second-order information is beneficial in many computer vision tasks. Second-order information can be enforced both in the spatial context and the abstract feature dimensions. In this work, we explore two second-order components. One is focused on second-order spatial information to increase the performance of image descriptors, both local and global. It is used to re-weight feature maps, and thus emphasise salient image locations that are subsequently used for description. The second component is concerned with a second-order similarity (SOS) loss, which we extend to global descriptors for image retrieval, and is used to enhance the triplet loss with hard-negative mining. We validate our approach on two different tasks and datasets for image retrieval and image matching. The results show that our two second-order components complement each other, bringing significant performance improvements in both tasks and leading to state-of-the-art results across the public benchmarks. Code available at: http://github.com/tonyngjichun/SOLAR
CV · Mar 16, 2024
Match-Stereo-Videos: Bidirectional Alignment for Consistent Dynamic Stereo Matching
Junpeng Jing, Ye Mao, Krystian Mikolajczyk
Dynamic stereo matching is the task of estimating consistent disparities from stereo videos with dynamic objects. Recent learning-based methods prioritize optimal performance on a single stereo pair, resulting in temporal inconsistencies. Existing video methods apply per-frame matching and window-based cost aggregation across the time dimension, leading to low-frequency oscillations at the scale of the window size. To address this challenge, we develop a bidirectional alignment mechanism for adjacent frames as a fundamental operation. We further propose a novel framework, BiDAStereo, that achieves consistent dynamic stereo matching. Unlike the existing methods, we model this task as local matching and global aggregation. Locally, we consider correlation in a triple-frame manner to pool information from adjacent frames and improve the temporal consistency. Globally, to exploit the entire sequence's consistency and extract dynamic scene cues for aggregation, we develop a motion-propagation recurrent unit. Extensive experiments demonstrate the performance of our method, showcasing improvements in prediction quality and achieving state-of-the-art results on various commonly used benchmarks.
CV · Dec 19, 2023
DDOS: The Drone Depth and Obstacle Segmentation Dataset
Benedikt Kolbeinsson, Krystian Mikolajczyk
The advancement of autonomous drones, essential for sectors such as remote sensing and emergency services, is hindered by the absence of training datasets that fully capture the environmental challenges present in real-world scenarios, particularly operations in non-optimal weather conditions and the detection of thin structures like wires. We present the Drone Depth and Obstacle Segmentation (DDOS) dataset to fill this critical gap with a collection of synthetic aerial images, created to provide comprehensive training samples for semantic segmentation and depth estimation. Specifically designed to enhance the identification of thin structures, DDOS allows drones to navigate a wide range of weather conditions, significantly elevating drone training and operational safety. Additionally, this work introduces innovative drone-specific metrics aimed at refining the evaluation of algorithms in depth estimation, with a focus on thin structure detection. These contributions not only pave the way for substantial improvements in autonomous drone technology but also set a new benchmark for future research, opening avenues for further advancements in drone navigation and safety.
CV · Apr 25, 2024
OpenDlign: Open-World Point Cloud Understanding with Depth-Aligned Images
Ye Mao, Junpeng Jing, Krystian Mikolajczyk
Recent open-world 3D representation learning methods using Vision-Language Models (VLMs) to align 3D point cloud with image-text information have shown superior 3D zero-shot performance. However, CAD-rendered images for this alignment often lack realism and texture variation, compromising alignment robustness. Moreover, the volume discrepancy between 3D and 2D pretraining datasets highlights the need for effective strategies to transfer the representational abilities of VLMs to 3D learning. In this paper, we present OpenDlign, a novel open-world 3D model using depth-aligned images generated from a diffusion model for robust multimodal alignment. These images exhibit greater texture diversity than CAD renderings due to the stochastic nature of the diffusion model. By refining the depth map projection pipeline and designing depth-specific prompts, OpenDlign leverages rich knowledge in pre-trained VLM for 3D representation learning with streamlined fine-tuning. Our experiments show that OpenDlign achieves high zero-shot and few-shot performance on diverse 3D tasks, despite only fine-tuning 6 million parameters on a limited ShapeNet dataset. In zero-shot classification, OpenDlign surpasses previous models by 8.0% on ModelNet40 and 16.4% on OmniObject3D. Additionally, using depth-aligned images for multimodal alignment consistently enhances the performance of other state-of-the-art models.
RO · Apr 23, 2024
Closed Loop Interactive Embodied Reasoning for Robot Manipulation
Michal Nazarczuk, Jan Kristof Behrens, Karla Stepanova et al.
Embodied reasoning systems integrate robotic hardware and cognitive processes to perform complex tasks, typically in response to a natural language query about a specific physical environment. This usually involves changing the belief about the scene or physically interacting with and changing the scene (e.g. sort the objects from lightest to heaviest). To facilitate the development of such systems, we introduce a new modular Closed Loop Interactive Embodied Reasoning (CLIER) approach that takes into account the measurements of non-visual object properties, changes in the scene caused by external disturbances, and uncertain outcomes of robotic actions. CLIER performs multi-modal reasoning and action planning and generates a sequence of primitive actions that can be executed by a robot manipulator. Our method operates in a closed loop, responding to changes in the environment. Our approach is developed using the MuBle simulation environment and tested in 10 interactive benchmark scenarios. We extensively evaluate our reasoning approach in simulation and in real-world manipulation tasks with a success rate above 76% and 64%, respectively.
CV · Sep 18, 2025
UCorr: Wire Detection and Depth Estimation for Autonomous Drones
Benedikt Kolbeinsson, Krystian Mikolajczyk
In the realm of fully autonomous drones, the accurate detection of obstacles is paramount to ensure safe navigation and prevent collisions. Among these challenges, the detection of wires stands out due to their slender profile, which poses a unique and intricate problem. To address this issue, we present an innovative solution in the form of a monocular end-to-end model for wire segmentation and depth estimation. Our approach leverages a temporal correlation layer trained on synthetic data, providing the model with the ability to effectively tackle the complex joint task of wire detection and depth estimation. We demonstrate the superiority of our proposed method over existing competitive approaches in the joint task of wire detection and depth estimation. Our results underscore the potential of our model to enhance the safety and precision of autonomous drones, shedding light on its promising applications in real-world scenarios.
CV · Aug 2, 2025
No Pose at All: Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views
Ranran Huang, Krystian Mikolajczyk
We introduce SPFSplat, an efficient framework for 3D Gaussian splatting from sparse multi-view images, requiring no ground-truth poses during training or inference. It employs a shared feature extraction backbone, enabling simultaneous prediction of 3D Gaussian primitives and camera poses in a canonical space from unposed inputs within a single feed-forward step. Alongside the rendering loss based on estimated novel-view poses, a reprojection loss is integrated to enforce the learning of pixel-aligned Gaussian primitives for enhanced geometric constraints. This pose-free training paradigm and efficient one-step feed-forward design make SPFSplat well-suited for practical applications. Remarkably, despite the absence of pose supervision, SPFSplat achieves state-of-the-art performance in novel view synthesis even under significant viewpoint changes and limited image overlap. It also surpasses recent methods trained with geometry priors in relative pose estimation. Code and trained models are available on our project page: https://ranrhuang.github.io/spfsplat/.
RO · Apr 10, 2024
Interactive Learning of Physical Object Properties Through Robot Manipulation and Database of Object Measurements
Andrej Kruzliak, Jiri Hartvich, Shubhan P. Patni et al.
This work presents a framework for automatically extracting physical object properties, such as material composition, mass, volume, and stiffness, through robot manipulation and a database of object measurements. The framework involves exploratory action selection to maximize learning about objects on a table. A Bayesian network models conditional dependencies between object properties, incorporating prior probability distributions and uncertainty associated with measurement actions. The algorithm selects optimal exploratory actions based on expected information gain and updates object properties through Bayesian inference. Experimental evaluation demonstrates effective action selection compared to a baseline and correct termination of the experiments if there is nothing more to be learned. The algorithm proved to behave intelligently when presented with trick objects with material properties in conflict with their appearance. The robot pipeline integrates with a logging module and an online database of objects, containing over 24,000 measurements of 63 objects with different grippers. All code and data are publicly available, facilitating automatic digitization of objects and their physical properties through exploratory manipulations.
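The action-selection step described above, choosing the exploratory action with the highest expected information gain and then updating beliefs by Bayesian inference, can be sketched for a single discrete property. The paper uses a Bayesian network over several properties; the flat two-hypothesis model, the action names, and the likelihood tables below are made up for this example.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def posterior(prior, likelihood_col):
    """Bayes update: prior over hypotheses times P(obs | hypothesis)."""
    unnorm = [p * l for p, l in zip(prior, likelihood_col)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def expected_information_gain(prior, likelihood):
    """EIG = H(prior) - E_obs[H(posterior)].

    likelihood[h][o] is P(observation o | hypothesis h) for one action.
    """
    gain = entropy(prior)
    for o in range(len(likelihood[0])):
        col = [likelihood[h][o] for h in range(len(prior))]
        p_obs = sum(p * l for p, l in zip(prior, col))
        if p_obs > 0:
            gain -= p_obs * entropy(posterior(prior, col))
    return gain


def select_action(prior, actions):
    """Pick the action name with the largest expected information gain."""
    return max(actions, key=lambda a: expected_information_gain(prior, actions[a]))


if __name__ == "__main__":
    prior = [0.5, 0.5]  # hypothetical: object is "metal" or "plastic"
    actions = {
        # rows: hypotheses, columns: possible observations (illustrative)
        "weigh": [[0.9, 0.1], [0.2, 0.8]],
        "look":  [[0.5, 0.5], [0.5, 0.5]],  # uninformative action
    }
    best = select_action(prior, actions)
    print(best, expected_information_gain(prior, actions[best]))
```

When every remaining action has an expected gain near zero, a loop built on this function has nothing more to learn, which is one simple way to realise the termination behaviour mentioned in the abstract.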
CV · Feb 2, 2025
Hypo3D: Exploring Hypothetical Reasoning in 3D
Ye Mao, Weixun Luo, Junpeng Jing et al.
The rise of vision-language foundation models marks an advancement in bridging the gap between human and machine capabilities in 3D scene reasoning. Existing 3D reasoning benchmarks assume real-time scene accessibility, which is impractical due to the high cost of frequent scene updates. To this end, we introduce Hypothetical 3D Reasoning, namely Hypo3D, a benchmark designed to evaluate models' ability to reason without access to real-time scene data. Models need to imagine the scene state based on a provided change description before reasoning. Hypo3D is formulated as a 3D Visual Question Answering (VQA) benchmark, comprising 7,727 context changes across 700 indoor scenes, resulting in 14,885 question-answer pairs. An anchor-based world frame is established for all scenes, ensuring consistent reference to a global frame for directional terms in context changes and QAs. Extensive experiments show that state-of-the-art foundation models struggle to reason in hypothetically changed scenes. This reveals a substantial performance gap compared to humans, particularly in scenarios involving movement changes and directional reasoning. Even when the context change is irrelevant to the question, models often incorrectly adjust their answers. Project website: https://matchlab-imperial.github.io/Hypo3D/
CV · Mar 22, 2024
Language-Based Depth Hints for Monocular Depth Estimation
Dylan Auty, Krystian Mikolajczyk
Monocular depth estimation (MDE) is inherently ambiguous, as a given image may result from many different 3D scenes and vice versa. To resolve this ambiguity, an MDE system must make assumptions about the most likely 3D scenes for a given input. These assumptions can be either explicit or implicit. In this work, we demonstrate the use of natural language as a source of an explicit prior about the structure of the world. The assumption is made that human language encodes the likely distribution in depth-space of various objects. We first show that a language model encodes this implicit bias during training, and that it can be extracted using a very simple learned approach. We then show that this prediction can be provided as an explicit source of assumption to an MDE system, using an off-the-shelf instance segmentation model that provides the labels used as the input to the language model. We demonstrate the performance of our method on the NYUD2 dataset, showing improvement compared to the baseline and to random controls.
CV · Mar 7, 2025
Stereo Any Video: Temporally Consistent Stereo Matching
Junpeng Jing, Weixun Luo, Ye Mao et al.
This paper introduces Stereo Any Video, a powerful framework for video stereo matching. It can estimate spatially accurate and temporally consistent disparities without relying on auxiliary information such as camera poses or optical flow. The strong capability is driven by rich priors from monocular video depth models, which are integrated with convolutional features to produce stable representations. To further enhance performance, key architectural innovations are introduced: all-to-all-pairs correlation, which constructs smooth and robust matching cost volumes, and temporal convex upsampling, which improves temporal coherence. These components collectively ensure robustness, accuracy, and temporal consistency, setting a new standard in video stereo matching. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple datasets both qualitatively and quantitatively in zero-shot settings, as well as strong generalization to real-world indoor and outdoor scenarios.
CV · Nov 20, 2025
POMA-3D: The Point Map Way to 3D Scene Understanding
Ye Mao, Weixun Luo, Ranran Huang et al.
In this paper, we introduce POMA-3D, the first self-supervised 3D representation model learned from point maps. Point maps encode explicit 3D coordinates on a structured 2D grid, preserving global 3D geometry while remaining compatible with the input format of 2D foundation models. To transfer rich 2D priors into POMA-3D, a view-to-scene alignment strategy is designed. Moreover, as point maps are view-dependent with respect to a canonical space, we introduce POMA-JEPA, a joint embedding-predictive architecture that enforces geometrically consistent point map features across multiple views. Additionally, we introduce ScenePoint, a point map dataset constructed from 6.5K room-level RGB-D scenes and 1M 2D image scenes to facilitate large-scale POMA-3D pretraining. Experiments show that POMA-3D serves as a strong backbone for both specialist and generalist 3D understanding. It benefits diverse tasks, including 3D question answering, embodied navigation, scene retrieval, and embodied localization, all achieved using only geometric inputs (i.e., 3D coordinates). Overall, our POMA-3D explores a point map way to 3D scene understanding, addressing the scarcity of pretrained priors and limited data in 3D representation learning. Project Page: https://matchlab-imperial.github.io/poma3d/
CV · Nov 20, 2025
Lite Any Stereo: Efficient Zero-Shot Stereo Matching
Junpeng Jing, Weixun Luo, Ye Mao et al.
Recent advances in stereo matching have focused on accuracy, often at the cost of significantly increased model size. Traditionally, the community has regarded efficient models as incapable of zero-shot ability due to their limited capacity. In this paper, we introduce Lite Any Stereo, a stereo depth estimation framework that achieves strong zero-shot generalization while remaining highly efficient. To this end, we design a compact yet expressive backbone to ensure scalability, along with a carefully crafted hybrid cost aggregation module. We further propose a three-stage training strategy on million-scale data to effectively bridge the sim-to-real gap. Together, these components demonstrate that an ultra-light model can deliver strong generalization, ranking 1st across four widely used real-world benchmarks. Remarkably, our model attains accuracy comparable to or exceeding state-of-the-art non-prior-based accurate methods while requiring less than 1% computational cost, setting a new standard for efficient stereo matching.
CV · Sep 21, 2025
SPFSplatV2: Efficient Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views
Ranran Huang, Krystian Mikolajczyk
We introduce SPFSplatV2, an efficient feed-forward framework for 3D Gaussian splatting from sparse multi-view images, requiring no ground-truth poses during training and inference. It employs a shared feature extraction backbone, enabling simultaneous prediction of 3D Gaussian primitives and camera poses in a canonical space from unposed inputs. A masked attention mechanism is introduced to efficiently estimate target poses during training, while a reprojection loss enforces pixel-aligned Gaussian primitives, providing stronger geometric constraints. We further demonstrate the compatibility of our training framework with different reconstruction architectures, resulting in two model variants. Remarkably, despite the absence of pose supervision, our method achieves state-of-the-art performance in both in-domain and out-of-domain novel view synthesis, even under extreme viewpoint changes and limited image overlap, and surpasses recent methods that rely on geometric supervision for relative pose estimation. By eliminating dependence on ground-truth poses, our method offers the scalability to leverage larger and more diverse datasets. Code and pretrained models will be available on our project page: https://ranrhuang.github.io/spfsplatv2/.
CVMar 21, 2024
Learning to Project for Cross-Task Knowledge DistillationDylan Auty, Roy Miles, Benedikt Kolbeinsson et al.
Traditional knowledge distillation (KD) relies on a proficient teacher trained on the target task, which is not always available. In this setting, cross-task distillation can be used, enabling the use of any teacher model trained on a different task. However, many KD methods prove ineffective when applied to this cross-task setting. To address this limitation, we propose a simple modification: the use of an inverted projection. We show that this drop-in replacement for a standard projector is effective by learning to disregard any task-specific features which might degrade the student's performance. We find that this simple modification is sufficient for extending many KD methods to the cross-task setting, where the teacher and student tasks can be very different. In doing so, we obtain up to a 1.9% improvement in the cross-task setting compared to the traditional projection, at no additional cost. Our method can obtain significant performance improvements (up to 7%) when using even a randomly-initialised teacher on various tasks such as depth estimation, image translation, and semantic segmentation, despite the lack of any learned knowledge to transfer. To provide conceptual and analytical insights into this result, we show that using an inverted projection allows the distillation loss to be decomposed into a knowledge transfer and a spectral regularisation component. Through this analysis we are additionally able to propose a novel regularisation loss that allows teacher-free distillation, enabling performance improvements of up to 8.57% on ImageNet with no additional training costs.
CVDec 23, 2021
NinjaDesc: Content-Concealing Visual Descriptors via Adversarial LearningTony Ng, Hyo Jin Kim, Vincent Lee et al.
In the light of recent analyses on privacy-concerning scene revelation from visual descriptors, we develop descriptors that conceal the input image content. In particular, we propose an adversarial learning framework for training visual descriptors that prevent image reconstruction, while maintaining the matching accuracy. We let a feature encoding network and image reconstruction network compete with each other, such that the feature encoder tries to impede the image reconstruction with its generated descriptors, while the reconstructor tries to recover the input image from the descriptors. The experimental results demonstrate that the visual descriptors obtained with our method significantly deteriorate the image reconstruction quality with minimal impact on correspondence matching and camera localization performance.
CVOct 25, 2021
Reconstructing Pruned Filters using Cheap Spatial TransformationsRoy Miles, Krystian Mikolajczyk
We present an efficient alternative to the convolutional layer using cheap spatial transformations. This construction exploits an inherent spatial redundancy of the learned convolutional filters to enable a much greater parameter efficiency, while maintaining the top-end accuracy of their dense counterparts. Training these networks is modelled as a generalised pruning problem, whereby the pruned filters are replaced with cheap transformations from the set of non-pruned filters. We provide an efficient implementation of the proposed layer, followed by two natural extensions to avoid excessive feature compression and to improve the expressivity of the transformed features. We show that these networks can achieve comparable or improved performance to state-of-the-art pruning models across both the CIFAR-10 and ImageNet-1K datasets.
CVOct 6, 2021
Grasp-Oriented Fine-grained Cloth Segmentation without Real SupervisionRuijie Ren, Mohit Gurnani Rajesh, Jordi Sanchez-Riera et al.
Automatically detecting graspable regions from a single depth image is a key ingredient in cloth manipulation. The large variability of cloth deformations has motivated most of the current approaches to focus on identifying specific grasping points rather than semantic parts, as the appearance and depth variations of local regions are smaller and easier to model than the larger ones. However, tasks like cloth folding or assisted dressing require recognising larger segments, such as semantic edges that carry more information than points. The first goal of this paper is therefore to tackle the problem of fine-grained region detection in deformed clothes using only a depth image. As a proof of concept, we implement an approach for T-shirts, and define up to 6 semantic regions of varying extent, including edges on the neckline, sleeve cuffs, and hem, plus top and bottom grasping points. We introduce a U-net based network to segment and label these parts. The second contribution of our work is concerned with the level of supervision that we require to train the proposed network. While most approaches learn to detect grasping points by combining real and synthetic annotations, in this work we defy the limitations of the synthetic data, and propose a multilayered domain adaptation (DA) strategy that does not use real annotations at all. We thoroughly evaluate our approach on real depth images of a T-shirt annotated with fine-grained labels. We show that training our network solely with synthetic data and the proposed DA yields results competitive with models trained on real data.
CVAug 16, 2021
Reassessing the Limitations of CNN Methods for Camera Pose RegressionTony Ng, Adrian Lopez-Rodriguez, Vassileios Balntas et al.
In this paper, we address the problem of camera pose estimation in outdoor and indoor scenarios. In comparison to the currently top-performing methods that rely on 2D to 3D matching, we propose a model that can directly regress the camera pose from images with significantly higher accuracy than existing methods of the same class. We first analyse why regression methods are still behind the state-of-the-art, and we bridge the performance gap with our new approach. Specifically, we propose a way to overcome the biased training data by a novel training technique, which generates poses guided by a probability distribution from the training set for synthesising new training views. Lastly, we evaluate our approach on two widely used benchmarks and show that it achieves significantly improved performance compared to prior regression-based methods, retrieval techniques as well as 3D pipelines with local feature matching.
NIMay 24, 2021
AirNet: Neural Network Transmission over the AirMikolaj Jankowski, Deniz Gunduz, Krystian Mikolajczyk
State-of-the-art performance for many edge applications is achieved by deep neural networks (DNNs). Often, these DNNs are location- and time-sensitive, and must be delivered over a wireless channel rapidly and efficiently. In this paper, we introduce AirNet, a family of novel training and transmission methods that allow DNNs to be efficiently delivered over wireless channels under stringent transmit power and latency constraints. This corresponds to a new class of joint source-channel coding problems, aimed at delivering DNNs with the goal of maximizing their accuracy at the receiver, rather than recovering them with high fidelity. In AirNet, we propose the direct mapping of the DNN parameters to transmitted channel symbols, while the network is trained to meet the channel constraints, and exhibit robustness against channel noise. AirNet achieves higher accuracy compared to separation-based alternatives. We further improve the performance of AirNet by pruning the network below the available bandwidth, and expanding it for improved robustness. We also benefit from unequal error protection by selectively expanding important layers of the network. Finally, we develop an approach, which simultaneously trains a spectrum of DNNs, each targeting a different channel condition, resolving the impractical memory requirements of training distinct networks for different channel conditions.
CVSep 3, 2020
DESC: Domain Adaptation for Depth Estimation via Semantic ConsistencyAdrian Lopez-Rodriguez, Krystian Mikolajczyk
Accurate real depth annotations are difficult to acquire, needing the use of special devices such as a LiDAR sensor. Self-supervised methods try to overcome this problem by processing video or stereo sequences, which may not always be available. Instead, in this paper, we propose a domain adaptation approach to train a monocular depth estimation model using a fully-annotated source dataset and a non-annotated target dataset. We bridge the domain gap by leveraging semantic predictions and low-level edge features to provide guidance for the target domain. We enforce consistency between the main model and a second model trained with semantic segmentation and edge maps, and introduce priors in the form of instance heights. Our approach is evaluated on standard domain adaptation benchmarks for monocular depth estimation and shows consistent improvement upon the state-of-the-art.
CVAug 16, 2020
Cascaded channel pruning using hierarchical self-distillationRoy Miles, Krystian Mikolajczyk
In this paper, we propose an approach for filter-level pruning with hierarchical knowledge distillation based on the teacher, teaching-assistant, and student framework. Our method makes use of teaching assistants at intermediate pruning levels that share the same architecture and weights as the target student. We propose to prune each model independently using the gradient information from its corresponding teacher. By considering the relative sizes of each student-teacher pair, this formulation provides a natural trade-off between the capacity gap for knowledge distillation and the bias of the filter saliency updates. Our results show improvements in the attainable accuracy and model compression across the CIFAR10 and ImageNet classification tasks using the VGG16 and ResNet50 architectures. We provide an extensive evaluation that demonstrates the benefits of using a varying number of teaching assistant models at different sizes.
CVAug 3, 2020
Project to Adapt: Domain Adaptation for Depth Completion from Noisy and Sparse Sensor DataAdrian Lopez-Rodriguez, Benjamin Busam, Krystian Mikolajczyk
Depth completion aims to predict a dense depth map from a sparse depth input. The acquisition of dense ground truth annotations for depth completion settings can be difficult and, at the same time, a significant domain gap between real LiDAR measurements and synthetic data has prevented the successful training of models in virtual settings. We propose a domain adaptation approach for sparse-to-dense depth completion that is trained from synthetic data, without annotations in the real domain or additional sensors. Our approach simulates the real sensor noise in an RGB+LiDAR set-up, and consists of three modules: simulating the real LiDAR input in the synthetic domain via projections, filtering the real noisy LiDAR for supervision and adapting the synthetic RGB image using a CycleGAN approach. We extensively evaluate these modules against the state-of-the-art in the KITTI depth completion benchmark, showing significant improvements.
ITJul 21, 2020
Wireless Image Retrieval at the EdgeMikolaj Jankowski, Deniz Gunduz, Krystian Mikolajczyk
We study the image retrieval problem at the wireless edge, where an edge device captures an image, which is then used to retrieve similar images from an edge server. These can be images of the same person or a vehicle taken from other cameras at different times and locations. Our goal is to maximize the accuracy of the retrieval task under power and bandwidth constraints over the wireless link. Due to the stringent delay constraint of the underlying application, sending the whole image at a sufficient quality is not possible. We propose two alternative schemes based on digital and analog communications, respectively. In the digital approach, we first propose a deep neural network (DNN) aided retrieval-oriented image compression scheme, whose output bit sequence is transmitted over the channel using conventional channel codes. In the analog joint source and channel coding (JSCC) approach, the feature vectors are directly mapped into channel symbols. We evaluate both schemes on image based re-identification (re-ID) tasks under different channel conditions, including both static and fading channels. We show that the JSCC scheme significantly increases the end-to-end accuracy, speeds up the encoding process, and provides graceful degradation with channel conditions. The proposed architecture is evaluated through extensive simulations on different datasets and channel conditions, as well as through ablation studies.
CVJun 17, 2020
HyNet: Learning Local Descriptor with Hybrid Similarity Measure and Triplet LossYurun Tian, Axel Barroso-Laguna, Tony Ng et al.
Recent works show that local descriptor learning benefits from the use of L2 normalisation; however, an in-depth analysis of this effect is lacking in the literature. In this paper, we investigate how L2 normalisation affects the back-propagated descriptor gradients during training. Based on our observations, we propose HyNet, a new local descriptor that leads to state-of-the-art results in matching. HyNet introduces a hybrid similarity measure for triplet margin loss, a regularisation term constraining the descriptor norm, and a new network architecture that performs L2 normalisation of all intermediate feature maps and the output descriptors. HyNet surpasses previous methods by a significant margin on standard benchmarks that include patch matching, verification, and retrieval, as well as outperforming full end-to-end methods on 3D reconstruction tasks.
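As a rough illustration of why L2 normalisation changes the back-propagated descriptor gradients, the sketch below applies the standard Jacobian of y = x / ||x||, which is (I - y yᵀ) / ||x||. This is textbook calculus, not code from HyNet; the function names `l2_normalise` and `backprop_through_l2` are hypothetical and chosen only for this example.

```python
import math


def l2_normalise(x):
    """Return x scaled to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def backprop_through_l2(x, grad_y):
    """Gradient w.r.t. x, given the gradient grad_y w.r.t. y = x / ||x||.

    Uses the Jacobian (I - y y^T) / ||x||: the part of grad_y that points
    along y is removed, and the remainder is divided by the norm of x.
    """
    norm = math.sqrt(sum(v * v for v in x))
    y = [v / norm for v in x]
    radial = sum(g * v for g, v in zip(grad_y, y))  # component of grad_y along y
    return [(g - radial * v) / norm for g, v in zip(grad_y, y)]


if __name__ == "__main__":
    x = [3.0, 4.0]          # hypothetical un-normalised descriptor
    grad_y = [1.0, 0.0]     # hypothetical upstream gradient
    grad_x = backprop_through_l2(x, grad_y)
    print(grad_x)
    # The gradient is orthogonal to x, so a small step does not change its norm.
    print(sum(g * v for g, v in zip(grad_x, x)))
```

Two consequences follow directly from the formula: the gradient reaching the un-normalised descriptor has no component along the descriptor itself, and its magnitude shrinks as the descriptor norm grows.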
CVMay 27, 2020
D2D: Keypoint Extraction with Describe to Detect ApproachYurun Tian, Vassileios Balntas, Tony Ng et al.
In this paper, we present a novel approach that exploits the information within the descriptor space to propose keypoint locations. Detect then describe, or detect and describe jointly are two typical strategies for extracting local descriptors. In contrast, we propose an approach that inverts this process by first describing and then detecting the keypoint locations. Describe-to-Detect (D2D) leverages successful descriptor models without the need for any additional training. Our method selects keypoints as salient locations with high information content which is defined by the descriptors rather than some independent operators. We perform experiments on multiple benchmarks including image matching, camera localisation, and 3D reconstruction. The results indicate that our method improves the matching performance of various descriptors and that it generalises across methods and tasks.
CVMay 12, 2020
HDD-Net: Hybrid Detector Descriptor with Mutual Interactive LearningAxel Barroso-Laguna, Yannick Verdie, Benjamin Busam et al.
Local feature extraction remains an active research area due to the advances in fields such as SLAM, 3D reconstructions, or AR applications. The success in these applications relies on the performance of the feature detector and descriptor. While the detector-descriptor interaction of most methods is based on unifying detections and descriptors in a single network, we propose a method that treats both extractions independently and focuses on their interaction in the learning process rather than by parameter sharing. We formulate the classical hard-mining triplet loss as a new detector optimisation term to refine candidate positions based on the descriptor map. We propose a dense descriptor that uses a multi-scale approach and a hybrid combination of hand-crafted and learned features to obtain rotation and scale robustness by design. We evaluate our method extensively on different benchmarks and show improvements over the state of the art in terms of image matching on HPatches and 3D reconstruction quality while keeping on par on camera localisation tasks.
CVApr 6, 2020
SHOP-VRB: A Visual Reasoning Benchmark for Object PerceptionMichal Nazarczuk, Krystian Mikolajczyk
In this paper we present an approach and a benchmark for visual reasoning in robotics applications, in particular small object grasping and manipulation. The approach and benchmark are focused on inferring object properties from visual and text data. It concerns small household objects with their properties, functionality, natural language descriptions as well as question-answer pairs for visual reasoning queries along with their corresponding scene semantic representations. We also present a method for generating synthetic data which allows the benchmark to be extended to other objects or scenes and propose an evaluation protocol that is more challenging than in the existing datasets. We propose a reasoning system based on symbolic program execution. A disentangled representation of the visual and textual inputs is obtained and used to execute symbolic programs that represent a 'reasoning process' of the algorithm. We perform a set of experiments on the proposed benchmark and compare to results for the state-of-the-art methods. These results expose the shortcomings of the existing benchmarks that may lead to misleading conclusions on the actual performance of the visual reasoning systems.
CVMar 9, 2020
Domain Adversarial Training for Infrared-colour Person Re-IdentificationNima Mohammadi Meshky, Sara Iodice, Krystian Mikolajczyk
Person re-identification (re-ID) is a very active area of research in computer vision, due to the role it plays in video surveillance. Currently, most methods only address the task of matching between colour images. However, in poorly-lit environments CCTV cameras switch to infrared imaging, hence developing a system which can correctly perform matching between infrared and colour images is a necessity. In this paper, we propose a part-feature extraction network to better focus on subtle, unique signatures on the person which are visible across both infrared and colour modalities. To train the model we propose a novel variant of the domain adversarial feature-learning framework. Through extensive experimentation, we show that our approach outperforms state-of-the-art methods.
ITMar 4, 2020
Joint Device-Edge Inference over Wireless Links with PruningMikolaj Jankowski, Deniz Gunduz, Krystian Mikolajczyk
We propose a joint feature compression and transmission scheme for efficient inference at the wireless network edge. Our goal is to enable efficient and reliable inference at the edge server assuming limited computational resources at the edge device. Previous work focused mainly on feature compression, ignoring the computational cost of channel coding. We incorporate the recently proposed deep joint source-channel coding (DeepJSCC) scheme, and combine it with novel filter pruning strategies aimed at reducing the redundant complexity from neural networks. We evaluate our approach on a classification task, and show improved results in both end-to-end reliability and workload reduction at the edge device. This is the first work that combines DeepJSCC with network pruning, and applies it to image classification over the wireless edge.
CVJan 9, 2020
Compression of descriptor models for mobile applicationsRoy Miles, Krystian Mikolajczyk
Deep neural networks have demonstrated state-of-the-art performance for feature-based image matching through the advent of new large and diverse datasets. However, there has been little work on evaluating the computational cost, model size, and matching accuracy tradeoffs for these models. This paper explicitly addresses these practical metrics by considering the state-of-the-art HardNet model. We observe a significant redundancy in the learned weights, which we exploit through the use of depthwise separable layers and an efficient Tucker decomposition. We demonstrate that a combination of these methods is very effective, but still sacrifices the top-end accuracy. To resolve this, we propose the Convolution-Depthwise-Pointwise (CDP) layer, which provides a means of interpolating between the standard and depthwise separable convolutions. With this proposed layer, we can achieve an 8 times reduction in the number of parameters on the HardNet model, 13 times reduction in the computational complexity, while sacrificing less than 1% on the overall accuracy across the HPatches benchmarks. To further demonstrate the generalisation of this approach, we apply it to the state-of-the-art SuperPoint model, where we can significantly reduce the number of parameters and floating-point operations, with minimal degradation in the matching accuracy.
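To make the two endpoints of that interpolation concrete, the following sketch counts the weights in a standard convolution and in a depthwise separable one using the usual textbook formulas. It does not implement the CDP layer, and the layer sizes used here are hypothetical rather than taken from HardNet; the function names are illustrative only.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias terms ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def reduction_factor(c_in, c_out, k):
    """How many times fewer weights the depthwise separable layer needs."""
    return conv_params(c_in, c_out, k) / depthwise_separable_params(c_in, c_out, k)


if __name__ == "__main__":
    # Hypothetical layer: 128 input channels, 128 output channels, 3 x 3 kernel.
    print(conv_params(128, 128, 3))
    print(depthwise_separable_params(128, 128, 3))
    print(reduction_factor(128, 128, 3))
```

For a fixed kernel size, the saving grows with the number of output channels, which is why the wider layers of a network tend to benefit most from the separable form.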
CVNov 22, 2019
Domain Adaptation for Object Detection via Style ConsistencyAdrian Lopez Rodriguez, Krystian Mikolajczyk
We propose a domain adaptation approach for object detection. We introduce a two-step method: the first step makes the detector robust to low-level differences and the second step adapts the classifiers to changes in the high-level features. For the first step, we use a style transfer method for pixel-adaptation of source images to the target domain. We find that enforcing low distance in the high-level features of the object detector between the style transferred images and the source images improves the performance in the target domain. For the second step, we propose a robust pseudo labelling approach to reduce the noise in both positive and negative sampling. Experimental evaluation is performed using the detector SSD300 on PASCAL VOC extended with the dataset proposed in arxiv:1803.11365 where the target domain images are of different styles. Our approach significantly improves the state-of-the-art performance in this benchmark.
ITOct 28, 2019
Deep Joint Source-Channel Coding for Wireless Image RetrievalMikolaj Jankowski, Deniz Gunduz, Krystian Mikolajczyk
Motivated by surveillance applications with wireless cameras or drones, we consider the problem of image retrieval over a wireless channel. Conventional systems apply lossy compression on query images to reduce the data that must be transmitted over the bandwidth and power limited wireless link. We first note that reconstructing the original image is not needed for retrieval tasks; hence, we introduce a deep neural network (DNN) based compression scheme targeting the retrieval task. Then, we completely remove the compression step, and propose another DNN-based communication scheme that directly maps the feature vectors to channel inputs. This joint source-channel coding (JSCC) approach not only improves the end-to-end accuracy, but also simplifies and speeds up the encoding operation which is highly beneficial for power and latency constrained IoT applications.
CVApr 1, 2019
Key.Net: Keypoint Detection by Handcrafted and Learned CNN FiltersAxel Barroso-Laguna, Edgar Riba, Daniel Ponsa et al.
We introduce a novel approach for the keypoint detection task that combines handcrafted and learned CNN filters within a shallow multi-scale architecture. Handcrafted filters provide anchor structures for learned filters, which localize, score and rank repeatable features. Scale-space representation is used within the network to extract keypoints at different levels. We design a loss function to detect robust features that exist across a range of scales and to maximize the repeatability score. Our Key.Net model is trained on data synthetically created from ImageNet and evaluated on the HPatches benchmark. Results show that our approach outperforms state-of-the-art detectors in terms of repeatability, matching performance and complexity.
CVMar 30, 2019
Person Re-identification with Bias-controlled Adversarial TrainingSara Iodice, Krystian Mikolajczyk
Inspired by the effectiveness of adversarial training in the area of Generative Adversarial Networks we present a new approach for learning feature representations in person re-identification. We investigate different types of bias that typically occur in re-ID scenarios, i.e., pose, body part and camera view, and propose a general approach to address them. We introduce an adversarial strategy for controlling bias, named Bias-controlled Adversarial framework (BCA), with two complementary branches to reduce or to enhance bias-related features. The results and comparison to the state of the art on different benchmarks show that our framework is an effective strategy for person re-identification. The performance improvements are in both full and partial views of persons.
CVJul 24, 2018
Partial Person Re-identification with Alignment and HallucinationSara Iodice, Krystian Mikolajczyk
Partial person re-identification involves matching pedestrian frames where only a part of a body is visible in corresponding images. This reflects a practical CCTV surveillance scenario, where full person views are often not available. Missing body parts make the comparison very challenging due to significant misalignment and varying scale of the views. We propose Partial Matching Net (PMN) that detects body joints, aligns partial views and hallucinates the missing parts based on the information present in the frame and a learned model of a person. The aligned and reconstructed views are then combined into a joint representation and used for matching images. We evaluate our approach and compare to other methods on three different datasets, demonstrating significant improvements.
CVMay 16, 2018
Deep Segmentation and Registration in X-Ray Angiography VideoAthanasios Vlontzos, Krystian Mikolajczyk
In interventional radiology, short video sequences of vein structure in motion are captured in order to help medical personnel identify vascular issues or plan intervention. Semantic segmentation can greatly improve the usefulness of these videos by indicating the exact position of vessels and instruments, thus reducing the ambiguity. We propose a real-time segmentation method for these tasks, based on a U-Net network trained in a Siamese architecture from automatically generated annotations. We make use of noisy low-level binary segmentation and optical flow to generate multi-class annotations that are successively improved in a multistage segmentation approach. We significantly improve the performance of a state-of-the-art U-Net at a processing speed of 90 fps.
CVOct 3, 2017
Person Re-Identification with Vision and LanguageFei Yan, Krystian Mikolajczyk, Josef Kittler
In this paper we propose a new approach to person re-identification using images and natural language descriptions. We propose a joint vision and language model based on CCA and CNN architectures to match across the two modalities as well as to enrich visual examples for which there are no language descriptions. We also introduce new annotations in the form of natural language descriptions for two standard Re-ID benchmarks, namely CUHK03 and VIPeR. We perform experiments on these two datasets with techniques based on CNN, hand-crafted features as well as LSTM for analysing visual and natural description data. We investigate and demonstrate the advantages of using natural language descriptions compared to attributes as well as CNN compared to LSTM in the context of Re-ID. We show that the joint use of language and vision can significantly improve the state-of-the-art performance on standard Re-ID benchmarks.
CVApr 19, 2017
HPatches: A benchmark and evaluation of handcrafted and learned local descriptorsVassileios Balntas, Karel Lenc, Andrea Vedaldi et al.
In this paper, we propose a novel benchmark for evaluating local image descriptors. We demonstrate that the existing datasets and evaluation protocols do not specify unambiguously all aspects of evaluation, leading to ambiguities and inconsistencies in results reported in the literature. Furthermore, these datasets are nearly saturated due to the recent improvements in local descriptors obtained by learning them from large annotated datasets. Therefore, we introduce a new large dataset suitable for training and testing modern descriptors, together with strictly defined evaluation protocols in several tasks such as matching, retrieval and classification. This allows for more realistic, and thus more reliable comparisons in different application scenarios. We evaluate the performance of several state-of-the-art descriptors and analyse their properties. We show that a simple normalisation of traditional hand-crafted descriptors can boost their performance to the level of deep learning based descriptors within a realistic benchmark evaluation.