Ajad Chhatkuli

CV
h-index30
28papers
499citations
Novelty57%
AI Score50

28 Papers

CVJul 21, 2022Code
Mining Relations among Cross-Frame Affinities for Video Semantic Segmentation

Guolei Sun, Yun Liu, Hao Tang et al.

The essence of video semantic segmentation (VSS) is how to leverage temporal information for prediction. Previous efforts are mainly devoted to developing new techniques to calculate the cross-frame affinities such as optical flow and attention. Instead, this paper contributes from a different angle by mining relations among cross-frame affinities, upon which better temporal information aggregation could be achieved. We explore relations among affinities in two aspects: single-scale intrinsic correlations and multi-scale relations. Inspired by traditional feature processing, we propose Single-scale Affinity Refinement (SAR) and Multi-scale Affinity Aggregation (MAA). To make it feasible to execute MAA, we propose a Selective Token Masking (STM) strategy to select a subset of consistent reference tokens for different scales when calculating affinities, which also improves the efficiency of our method. At last, the cross-frame affinities strengthened by SAR and MAA are adopted for adaptively aggregating temporal information. Our experiments demonstrate that the proposed method performs favorably against state-of-the-art VSS methods. The code is publicly available at https://github.com/GuoleiSun/VSS-MRCFA

CVMar 7, 2022Code
ZippyPoint: Fast Interest Point Detection, Description, and Matching through Mixed Precision Discretization

Menelaos Kanakis, Simon Maurer, Matteo Spallanzani et al.

Efficient detection and description of geometric regions in images is a prerequisite in visual systems for localization and mapping. Such systems still rely on traditional hand-crafted methods for efficient generation of lightweight descriptors, a common limitation of the more powerful neural network models that come with high compute and specific hardware requirements. In this paper, we focus on the adaptations required by detection and description neural networks to enable their use in computationally limited platforms such as robots, mobile, and augmented reality devices. To that end, we investigate and adapt network quantization techniques to accelerate inference and enable its use on compute limited platforms. In addition, we revisit common practices in descriptor quantization and propose the use of a binary descriptor normalization layer, enabling the generation of distinctive binary descriptors with a constant number of ones. ZippyPoint, our efficient quantized network with binary descriptors, improves the network runtime speed, the descriptor matching speed, and the 3D model size, by at least an order of magnitude when compared to full-precision counterparts. These improvements come at a minor performance degradation as evaluated on the tasks of homography estimation, visual localization, and map-free visual relocalization. Code and models are available at https://github.com/menelaoskanakis/ZippyPoint.

CVMar 16, 2022Code
Zero Pixel Directional Boundary by Vector Transform

Edoardo Mello Rella, Ajad Chhatkuli, Yun Liu et al.

Boundaries are among the primary visual cues used by human and computer vision systems. One of the key problems in boundary detection is the label representation, which typically leads to class imbalance and, as a consequence, to thick boundaries that require non-differential post-processing steps to be thinned. In this paper, we re-interpret boundaries as 1-D surfaces and formulate a one-to-one vector transform function that allows for training of boundary prediction completely avoiding the class imbalance issue. Specifically, we define the boundary representation at any point as the unit vector pointing to the closest boundary surface. Our problem formulation leads to the estimation of direction as well as richer contextual information of the boundary, and, if desired, the availability of zero-pixel thin boundaries also at training time. Our method uses no hyper-parameter in the training loss and a fixed stable hyper-parameter at inference. We provide theoretical justification/discussions of the vector transform representation. We evaluate the proposed loss method using a standard architecture and show the excellent performance over other losses and representations on several datasets. Code is available at https://github.com/edomel/BoundaryVT.

CVApr 13, 2022Code
Neural Vector Fields for Implicit Surface Representation and Inference

Edoardo Mello Rella, Ajad Chhatkuli, Ender Konukoglu et al.

Implicit fields have recently shown increasing success in representing and learning 3D shapes accurately. Signed distance fields and occupancy fields are decades old and still the preferred representations, both with well-studied properties, despite their restriction to closed surfaces. With neural networks, several other variations and training principles have been proposed with the goal to represent all classes of shapes. In this paper, we develop a novel and yet a fundamental representation considering unit vectors in 3D space and call it Vector Field (VF): at each point in $\mathbb{R}^3$, VF is directed at the closest point on the surface. We theoretically demonstrate that VF can be easily transformed to surface density by computing the flux density. Unlike other standard representations, VF directly encodes an important physical property of the surface, its normal. We further show the advantages of VF representation, in learning open, closed, or multi-layered as well as piecewise planar surfaces. We compare our method on several datasets including ShapeNet where the proposed new neural implicit field shows superior accuracy in representing any type of shape, outperforming other standard methods. Code is available at https://github.com/edomel/ImplicitVF.

CVSep 15, 2023
Deformable Neural Radiance Fields using RGB and Event Cameras

Qi Ma, Danda Pani Paudel, Ajad Chhatkuli et al.

Modeling Neural Radiance Fields for fast-moving deformable objects from visual data alone is a challenging problem. A major issue arises due to the high deformation and low acquisition rates. To address this problem, we propose to use event cameras that offer very fast acquisition of visual change in an asynchronous manner. In this work, we develop a novel method to model the deformable neural radiance fields using RGB and event cameras. The proposed method uses the asynchronous stream of events and calibrated sparse RGB frames. In our setup, the camera pose at the individual events required to integrate them into the radiance fields remains unknown. Our method jointly optimizes these poses and the radiance field. This happens efficiently by leveraging the collection of events at once and actively sampling the events during learning. Experiments conducted on both realistically rendered graphics and real-world datasets demonstrate a significant benefit of the proposed method over the state-of-the-art and the compared baseline. This shows a promising direction for modeling deformable neural radiance fields in real-world dynamic scenes.

CVJul 15, 2024
iHuman: Instant Animatable Digital Humans From Monocular Videos

Pramish Paudel, Anubhav Khanal, Ajad Chhatkuli et al.

Personalized 3D avatars require an animatable representation of digital humans. Doing so instantly from monocular videos offers scalability to broad class of users and wide-scale applications. In this paper, we present a fast, simple, yet effective method for creating animatable 3D digital humans from monocular videos. Our method utilizes the efficiency of Gaussian splatting to model both 3D geometry and appearance. However, we observed that naively optimizing Gaussian splats results in inaccurate geometry, thereby leading to poor animations. This work achieves and illustrates the need of accurate 3D mesh-type modelling of the human body for animatable digitization through Gaussian splats. This is achieved by developing a novel pipeline that benefits from three key aspects: (a) implicit modelling of surface's displacements and the color's spherical harmonics; (b) binding of 3D Gaussians to the respective triangular faces of the body template; (c) a novel technique to render normals followed by their auxiliary supervision. Our exhaustive experiments on three different benchmark datasets demonstrates the state-of-the-art results of our method, in limited time settings. In fact, our method is faster by an order of magnitude (in terms of training time) than its closest competitor. At the same time, we achieve superior rendering and 3D reconstruction performance under the change of poses.

CVNov 28, 2023
Continuous Pose for Monocular Cameras in Neural Implicit Representation

Qi Ma, Danda Pani Paudel, Ajad Chhatkuli et al.

In this paper, we showcase the effectiveness of optimizing monocular camera poses as a continuous function of time. The camera poses are represented using an implicit neural function which maps the given time to the corresponding camera pose. The mapped camera poses are then used for the downstream tasks where joint camera pose optimization is also required. While doing so, the network parameters -- that implicitly represent camera poses -- are optimized. We exploit the proposed method in four diverse experimental settings, namely, (1) NeRF from noisy poses; (2) NeRF from asynchronous Events; (3) Visual Simultaneous Localization and Mapping (vSLAM); and (4) vSLAM with IMUs. In all four settings, the proposed method performs significantly better than the compared baselines and the state-of-the-art methods. Additionally, using the assumption of continuous motion, changes in pose may actually live in a manifold that has lower than 6 degrees of freedom (DOF) is also realized. We call this low DOF motion representation as the \emph{intrinsic motion} and use the approach in vSLAM settings, showing impressive camera tracking performance.

CVSep 24, 2024
Self-supervised Shape Completion via Involution and Implicit Correspondences

Mengya Liu, Ajad Chhatkuli, Janis Postels et al.

3D shape completion is traditionally solved using supervised training or by distribution learning on complete shape examples. Recently self-supervised learning approaches that do not require any complete 3D shape examples have gained more interests. In this paper, we propose a non-adversarial self-supervised approach for the shape completion task. Our first finding is that completion problems can be formulated as an involutory function trivially, which implies a special constraint on the completion function G, such that G(G(X)) = X. Our second constraint on self-supervised shape completion relies on the fact that shape completion becomes easier to solve with correspondences and similarly, completion can simplify the correspondences problem. We formulate a consistency measure in the canonical space in order to supervise the completion function. We efficiently optimize the completion and correspondence modules using "freeze and alternate" strategy. The overall approach performs well for rigid shapes in a category as well as dynamic non-rigid shapes. We ablate our design choices and compare our solution against state-of-the-art methods, showing remarkable accuracy approaching supervised accuracy in some cases.

CVDec 4, 2025
Inferring Compositional 4D Scenes without Ever Seeing One

Ahmet Berke Gokmen, Ajad Chhatkuli, Luc Van Gool et al.

Scenes in the real world are often composed of several static and dynamic objects. Capturing their 4-dimensional structures, composition and spatio-temporal configuration in-the-wild, though extremely interesting, is equally hard. Therefore, existing works often focus on one object at a time, while relying on some category-specific parametric shape model for dynamic objects. This can lead to inconsistent scene configurations, in addition to being limited to the modeled object categories. We propose COM4D (Compositional 4D), a method that consistently and jointly predicts the structure and spatio-temporal configuration of 4D/3D objects using only static multi-object or dynamic single object supervision. We achieve this by a carefully designed training of spatial and temporal attentions on 2D video input. The training is disentangled into learning from object compositions on the one hand, and single object dynamics throughout the video on the other, thus completely avoiding reliance on 4D compositional training data. At inference time, our proposed attention mixing mechanism combines these independently learned attentions, without requiring any 4D composition examples. By alternating between spatial and temporal reasoning, COM4D reconstructs complete and persistent 4D scenes with multiple interacting objects directly from monocular videos. Furthermore, COM4D provides state-of-the-art results in existing separate problems of 4D object and composed 3D reconstruction despite being purely data-driven.

CVAug 16, 2024
VF-NeRF: Learning Neural Vector Fields for Indoor Scene Reconstruction

Albert Gassol Puigjaner, Edoardo Mello Rella, Erik Sandström et al.

Implicit surfaces via neural radiance fields (NeRF) have shown surprising accuracy in surface reconstruction. Despite their success in reconstructing richly textured surfaces, existing methods struggle with planar regions with weak textures, which account for the majority of indoor scenes. In this paper, we address indoor dense surface reconstruction by revisiting key aspects of NeRF in order to use the recently proposed Vector Field (VF) as the implicit representation. VF is defined by the unit vector directed to the nearest surface point. It therefore flips direction at the surface and equals to the explicit surface normals. Except for this flip, VF remains constant along planar surfaces and provides a strong inductive bias in representing planar surfaces. Concretely, we develop a novel density-VF relationship and a training scheme that allows us to learn VF via volume rendering By doing this, VF-NeRF can model large planar surfaces and sharp corners accurately. We show that, when depth cues are available, our method further improves and achieves state-of-the-art results in reconstructing indoor scenes and rendering novel views. We extensively evaluate VF-NeRF on indoor datasets and run ablations of its components.

CVSep 10, 2021Code
TACS: Taxonomy Adaptive Cross-Domain Semantic Segmentation

Rui Gong, Martin Danelljan, Dengxin Dai et al.

Traditional domain adaptive semantic segmentation addresses the task of adapting a model to a novel target domain under limited or no additional supervision. While tackling the input domain gap, the standard domain adaptation settings assume no domain change in the output space. In semantic prediction tasks, different datasets are often labeled according to different semantic taxonomies. In many real-world settings, the target domain task requires a different taxonomy than the one imposed by the source domain. We therefore introduce the more general taxonomy adaptive cross-domain semantic segmentation (TACS) problem, allowing for inconsistent taxonomies between the two domains. We further propose an approach that jointly addresses the image-level and label-level domain adaptation. On the label-level, we employ a bilateral mixed sampling strategy to augment the target domain, and a relabelling method to unify and align the label spaces. We address the image-level domain gap by proposing an uncertainty-rectified contrastive learning method, leading to more domain-invariant and class-discriminative features. We extensively evaluate the effectiveness of our framework under different TACS settings: open taxonomy, coarse-to-fine taxonomy, and implicitly-overlapping taxonomy. Our approach outperforms the previous state-of-the-art by a large margin, while being capable of adapting to target taxonomies. Our implementation is publicly available at https://github.com/ETHRuiGong/TADA.

CVJun 6, 2021Code
Vision Transformers with Hierarchical Attention

Yun Liu, Yu-Huan Wu, Guolei Sun et al.

This paper tackles the high computational/space complexity associated with Multi-Head Self-Attention (MHSA) in vanilla vision transformers. To this end, we propose Hierarchical MHSA (H-MHSA), a novel approach that computes self-attention in a hierarchical fashion. Specifically, we first divide the input image into patches as commonly done, and each patch is viewed as a token. Then, the proposed H-MHSA learns token relationships within local patches, serving as local relationship modeling. Then, the small patches are merged into larger ones, and H-MHSA models the global dependencies for the small number of the merged tokens. At last, the local and global attentive features are aggregated to obtain features with powerful representation capacity. Since we only calculate attention for a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can efficiently model global relationships among tokens without sacrificing fine-grained information. With the H-MHSA module incorporated, we build a family of Hierarchical-Attention-based Transformer Networks, namely HAT-Net. To demonstrate the superiority of HAT-Net in scene understanding, we conduct extensive experiments on fundamental vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. Therefore, HAT-Net provides a new perspective for vision transformers. Code and pretrained models are available at https://github.com/yun-liu/HAT-Net.

CVFeb 12, 2021Code
Efficient Conditional GAN Transfer with Knowledge Propagation across Classes

Mohamad Shahbazi, Zhiwu Huang, Danda Pani Paudel et al.

Generative adversarial networks (GANs) have shown impressive results in both unconditional and conditional image generation. In recent literature, it is shown that pre-trained GANs, on a different dataset, can be transferred to improve the image generation from a small target data. The same, however, has not been well-studied in the case of conditional GANs (cGANs), which provides new opportunities for knowledge transfer compared to unconditional setup. In particular, the new classes may borrow knowledge from the related old classes, or share knowledge among themselves to improve the training. This motivates us to study the problem of efficient conditional GAN transfer with knowledge propagation across classes. To address this problem, we introduce a new GAN transfer method to explicitly propagate the knowledge from the old classes to the new classes. The key idea is to enforce the popularly used conditional batch normalization (BN) to learn the class-specific information of the new classes from that of the old classes, with implicit knowledge sharing among the new ones. This allows for an efficient knowledge propagation from the old classes to the new ones, with the BN parameters increasing linearly with the number of new classes. The extensive evaluation demonstrates the clear superiority of the proposed method over state-of-the-art competitors for efficient conditional GAN transfer tasks. The code is available at: https://github.com/mshahbazi72/cGANTransfer

CVMay 7, 2025
One2Any: One-Reference 6D Pose Estimation for Any Object

Mengya Liu, Siyuan Li, Ajad Chhatkuli et al.

6D object pose estimation remains challenging for many applications due to dependencies on complete 3D models, multi-view images, or training limited to specific object categories. These requirements make generalization to novel objects difficult for which neither 3D models nor multi-view images may be available. To address this, we propose a novel method One2Any that estimates the relative 6-degrees of freedom (DOF) object pose using only a single reference-single query RGB-D image, without prior knowledge of its 3D model, multi-view data, or category constraints. We treat object pose estimation as an encoding-decoding process, first, we obtain a comprehensive Reference Object Pose Embedding (ROPE) that encodes an object shape, orientation, and texture from a single reference view. Using this embedding, a U-Net-based pose decoding module produces Reference Object Coordinate (ROC) for new views, enabling fast and accurate pose estimation. This simple encoding-decoding framework allows our model to be trained on any pair-wise pose data, enabling large-scale training and demonstrating great scalability. Experiments on multiple benchmark datasets demonstrate that our model generalizes well to novel objects, achieving state-of-the-art accuracy and robustness even rivaling methods that require multi-view or CAD inputs, at a fraction of compute.

CVOct 7, 2025
EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark

Deheng Zhang, Yuqian Fu, Runyi Yang et al.

Most existing benchmarks for egocentric vision understanding focus primarily on daytime scenarios, overlooking the low-light conditions that are inevitable in real-world applications. To investigate this gap, we present EgoNight, the first comprehensive benchmark for nighttime egocentric vision, with visual question answering (VQA) as the core task. A key feature of EgoNight is the introduction of day-night aligned videos, which enhance night annotation quality using the daytime data and reveal clear performance gaps between lighting conditions. To achieve this, we collect both synthetic videos rendered by Blender and real-world recordings, ensuring that scenes and actions are visually and temporally aligned. Leveraging these paired videos, we construct EgoNight-VQA, supported by a novel day-augmented night auto-labeling engine and refinement through extensive human verification. Each QA pair is double-checked by annotators for reliability. In total, EgoNight-VQA contains 3658 QA pairs across 90 videos, spanning 12 diverse QA types, with more than 300 hours of human work. Evaluations of state-of-the-art multimodal large language models (MLLMs) reveal substantial performance drops when transferring from day to night, underscoring the challenges of reasoning under low-light conditions. Beyond VQA, EgoNight also introduces two auxiliary tasks, day-night correspondence retrieval and egocentric depth estimation at night, that further explore the boundaries of existing models. We believe EgoNight-VQA provides a strong foundation for advancing application-driven egocentric vision research and for developing models that generalize across illumination domains. All the data and code will be made available upon acceptance.

CVSep 22, 2025
Enhancing Semantic Segmentation with Continual Self-Supervised Pre-training

Brown Ebouky, Ajad Chhatkuli, Cristiano Malossi et al.

Self-supervised learning (SSL) has emerged as a central paradigm for training foundation models by leveraging large-scale unlabeled datasets, often producing representations with strong generalization capabilities. These models are typically pre-trained on general-purpose datasets such as ImageNet and subsequently adapted to various downstream tasks through finetuning. While recent advances have explored parameter-efficient strategies for adapting pre-trained models, extending SSL pre-training itself to new domains - particularly under limited data regimes and for dense prediction tasks - remains underexplored. In this work, we address the problem of adapting vision foundation models to new domains in an unsupervised and data-efficient manner, specifically targeting downstream semantic segmentation. We propose GLARE (Global Local and Regional Enforcement), a novel continual self-supervised pre-training task designed to enhance downstream segmentation performance. GLARE introduces patch-level augmentations to encourage local consistency and incorporates a regional consistency constraint that leverages spatial semantics in the data. For efficient continual pre-training, we initialize Vision Transformers (ViTs) with weights from existing SSL models and update only lightweight adapter modules - specifically UniAdapter - while keeping the rest of the backbone frozen. Experiments across multiple semantic segmentation benchmarks on different domains demonstrate that GLARE consistently improves downstream performance with minimal computational and parameter overhead.

CVDec 24, 2023
Residual Learning for Image Point Descriptors

Rashik Shrestha, Ajad Chhatkuli, Menelaos Kanakis et al.

Local image feature descriptors have had a tremendous impact on the development and application of computer vision methods. It is therefore unsurprising that significant efforts are being made for learning-based image point descriptors. However, the advantage of learned methods over handcrafted methods in real applications is subtle and more nuanced than expected. Moreover, handcrafted descriptors such as SIFT and SURF still perform better point localization in Structure-from-Motion (SfM) compared to many learned counterparts. In this paper, we propose a very simple and effective approach to learning local image descriptors by using a hand-crafted detector and descriptor. Specifically, we choose to learn only the descriptors, supported by handcrafted descriptors while discarding the point localization head. We optimize the final descriptor by leveraging the knowledge already present in the handcrafted descriptor. Such an approach of optimization allows us to discard learning knowledge already present in non-differentiable functions such as the hand-crafted descriptors and only learn the residual knowledge in the main network branch. This offers 50X convergence speed compared to the standard baseline architecture of SuperPoint while at inference the combined descriptor provides superior performance over the learned and hand-crafted descriptors. This is done with minor increase in the computations over the baseline learned descriptor. Our approach has potential applications in ensemble learning and learning with non-differentiable functions. We perform experiments in matching, camera localization and Structure-from-Motion in order to showcase the advantages of our approach.

CVDec 31, 2020
Unsupervised Monocular Depth Reconstruction of Non-Rigid Scenes

Ayça Takmaz, Danda Pani Paudel, Thomas Probst et al.

Monocular depth reconstruction of complex and dynamic scenes is a highly challenging problem. While for rigid scenes learning-based methods have been offering promising results even in unsupervised cases, there exists little to no literature addressing the same for dynamic and deformable scenes. In this work, we present an unsupervised monocular framework for dense depth estimation of dynamic scenes, which jointly reconstructs rigid and non-rigid parts without explicitly modelling the camera motion. Using dense correspondences, we derive a training objective that aims to opportunistically preserve pairwise distances between reconstructed 3D points. In this process, the dense depth map is learned implicitly using the as-rigid-as-possible hypothesis. Our method provides promising results, demonstrating its capability of reconstructing 3D from challenging videos of non-rigid scenes. Furthermore, the proposed method also provides unsupervised motion segmentation results as an auxiliary output.

CVDec 15, 2020
Cluster, Split, Fuse, and Update: Meta-Learning for Open Compound Domain Adaptive Semantic Segmentation

Rui Gong, Yuhua Chen, Danda Pani Paudel et al.

Open compound domain adaptation (OCDA) is a domain adaptation setting, where target domain is modeled as a compound of multiple unknown homogeneous domains, which brings the advantage of improved generalization to unseen domains. In this work, we propose a principled meta-learning based approach to OCDA for semantic segmentation, MOCDA, by modeling the unlabeled target domain continuously. Our approach consists of four key steps. First, we cluster target domain into multiple sub-target domains by image styles, extracted in an unsupervised manner. Then, different sub-target domains are split into independent branches, for which batch normalization parameters are learnt to treat them independently. A meta-learner is thereafter deployed to learn to fuse sub-target domain-specific predictions, conditioned upon the style code. Meanwhile, we learn to online update the model by model-agnostic meta-learning (MAML) algorithm, thus to further improve generalization. We validate the benefits of our approach by extensive experiments on synthetic-to-real knowledge transfer benchmark datasets, where we achieve the state-of-the-art performance in both compound and open domains.

CVAug 27, 2020
Learning Condition Invariant Features for Retrieval-Based Localization from 1M Images

Janine Thoma, Danda Pani Paudel, Ajad Chhatkuli et al.

Image features for retrieval-based localization must be invariant to dynamic objects (e.g. cars) as well as seasonal and daytime changes. Such invariances are, up to some extent, learnable with existing methods using triplet-like losses, given a large number of diverse training images. However, due to the high algorithmic training complexity, there exists insufficient comparison between different loss functions on large datasets. In this paper, we train and evaluate several localization methods on three different benchmark datasets, including Oxford RobotCar with over one million images. This large scale evaluation yields valuable insights into the generalizability and performance of retrieval-based localization. Based on our findings, we develop a novel method for learning more accurate and better generalizing localization features. It consists of two main contributions: (i) a feature volume-based loss function, and (ii) hard positive and pairwise negative mining. On the challenging Oxford RobotCar night condition, our method outperforms the well-known triplet loss by 24.4% in localization accuracy within 5m.

CVJul 4, 2020
Self-Calibration Supported Robust Projective Structure-from-Motion

Rui Gong, Danda Pani Paudel, Ajad Chhatkuli et al.

Typical Structure-from-Motion (SfM) pipelines rely on finding correspondences across images, recovering the projective structure of the observed scene and upgrading it to a metric frame using camera self-calibration constraints. Solving each problem is mainly carried out independently from the others. For instance, camera self-calibration generally assumes correct matches and a good projective reconstruction have been obtained. In this paper, we propose a unified SfM method, in which the matching process is supported by self-calibration constraints. We use the idea that good matches should yield a valid calibration. In this process, we make use of the Dual Image of Absolute Quadric projection equations within a multiview correspondence framework, in order to obtain robust matching from a set of putative correspondences. The matching process classifies points as inliers or outliers, which is learned in an unsupervised manner using a deep neural network. Together with theoretical reasoning why the self-calibration constraints are necessary, we show experimental results demonstrating robust multiview matching and accurate camera calibration by exploiting these constraints.

CVMar 21, 2020
Geometrically Mappable Image Features

Janine Thoma, Danda Pani Paudel, Ajad Chhatkuli et al.

Vision-based localization of an agent in a map is an important problem in robotics and computer vision. In that context, localization by learning matchable image features is gaining popularity due to recent advances in machine learning. Features that uniquely describe the visual contents of images have a wide range of applications, including image retrieval and understanding. In this work, we propose a method that learns image features targeted for image-retrieval-based localization. Retrieval-based localization has several benefits, such as easy maintenance and quick computation. However, the state-of-the-art features only provide visual similarity scores which do not explicitly reveal the geometric distance between query and retrieved images. Knowing this distance is highly desirable for accurate localization, especially when the reference images are sparsely distributed in the scene. Therefore, we propose a novel loss function for learning image features which are both visually representative and geometrically relatable. This is achieved by guiding the learning process such that the feature and geometric distances between images are directly proportional. In our experiments we show that our features not only offer significantly better localization accuracy, but also allow to estimate the trajectory of a query sequence in absence of the reference images.

CVMar 17, 2020
Unsupervised Learning of Category-Specific Symmetric 3D Keypoints from Point Sets

Clara Fernandez-Labrador, Ajad Chhatkuli, Danda Pani Paudel et al.

Automatic discovery of category-specific 3D keypoints from a collection of objects of some category is a challenging problem. One reason is that not all objects in a category necessarily have the same semantic parts. The level of difficulty adds up further when objects are represented by 3D point clouds, with variations in shape and unknown coordinate frames. We define keypoints to be category-specific, if they meaningfully represent objects' shape and their correspondences can be simply established order-wise across all objects. This paper aims at learning category-specific 3D keypoints, in an unsupervised manner, using a collection of misaligned 3D point clouds of objects from an unknown category. In order to do so, we model shapes defined by the keypoints, within a category, using the symmetric linear basis shapes without assuming the plane of symmetry to be known. The usage of symmetry prior leads us to learn stable keypoints suitable for higher misalignments. To the best of our knowledge, this is the first work on learning such keypoints directly from 3D point clouds. Using categories from four benchmark datasets, we demonstrate the quality of our learned keypoints by quantitative and qualitative evaluations. Our experiments also show that the keypoints discovered by our method are geometrically and semantically consistent.

CVSep 26, 2019
Convex Relaxations for Consensus and Non-Minimal Problems in 3D Vision

Thomas Probst, Danda Pani Paudel, Ajad Chhatkuli et al.

In this paper, we formulate a generic non-minimal solver using the existing tools of Polynomials Optimization Problems (POP) from computational algebraic geometry. The proposed method exploits the well known Shor's or Lasserre's relaxations, whose theoretical aspects are also discussed. Notably, we further exploit the POP formulation of non-minimal solver also for the generic consensus maximization problems in 3D vision. Our framework is simple and straightforward to implement, which is also supported by three diverse applications in 3D vision, namely rigid body transformation estimation, Non-Rigid Structure-from-Motion (NRSfM), and camera autocalibration. In all three cases, both non-minimal and consensus maximization are tested, which are also compared against the state-of-the-art methods. Our results are competitive to the compared methods, and are also coherent with our theoretical analysis. The main contribution of this paper is the claim that a good approximate solution for many polynomial problems involved in 3D vision can be obtained using the existing theory of numerical computational algebra. This claim leads us to reason about why many relaxed methods in 3D vision behave so well? And also allows us to offer a generic relaxed solver in a rather straightforward way. We further show that the convex relaxation of these polynomials can easily be used for maximizing consensus in a deterministic manner. We support our claim using several experiments for aforementioned three diverse problems in 3D vision.

CVDec 10, 2018
Mapping, Localization and Path Planning for Image-based Navigation using Visual Features and Map

Janine Thoma, Danda Pani Paudel, Ajad Chhatkuli et al.

Building on progress in feature representations for image retrieval, image-based localization has seen a surge of research interest. Image-based localization has the advantage of being inexpensive and efficient, often avoiding the use of 3D metric maps altogether. That said, the need to maintain a large number of reference images as an effective support of localization in a scene, nonetheless calls for them to be organized in a map structure of some kind. The problem of localization often arises as part of a navigation process. We are, therefore, interested in summarizing the reference images as a set of landmarks, which meet the requirements for image-based navigation. A contribution of this paper is to formulate such a set of requirements for the two sub-tasks involved: map construction and self-localization. These requirements are then exploited for compact map representation and accurate self-localization, using the framework of a network flow problem. During this process, we formulate the map construction and self-localization problems as convex quadratic and second-order cone programs, respectively. We evaluate our methods on publicly available indoor and outdoor datasets, where they outperform existing methods significantly.

CVAug 13, 2018
Incremental Non-Rigid Structure-from-Motion with Unknown Focal Length

Thomas Probst, Danda Pani Paudel, Ajad Chhatkuli et al.

The perspective camera and the isometric surface prior have recently gathered increased attention for Non-Rigid Structure-from-Motion (NRSfM). Despite the recent progress, several challenges remain, particularly the computational complexity and the unknown camera focal length. In this paper we present a method for incremental Non-Rigid Structure-from-Motion (NRSfM) with the perspective camera model and the isometric surface prior with unknown focal length. In the template-based case, we provide a method to estimate four parameters of the camera intrinsics. For the template-less scenario of NRSfM, we propose a method to upgrade reconstructions obtained for one focal length to another based on local rigidity and the so-called Maximum Depth Heuristics (MDH). On its basis we propose a method to simultaneously recover the focal length and the non-rigid shapes. We further solve the problem of incorporating a large number of points and adding more views in MDH-based NRSfM and efficiently solve them with Second-Order Cone Programming (SOCP). This does not require any shape initialization and produces results orders of times faster than many methods. We provide evaluations on standard sequences with ground-truth and qualitative reconstructions on challenging YouTube videos. These evaluations show that our method performs better in both speed and accuracy than the state of the art.

CVJul 5, 2018
Model-free Consensus Maximization for Non-Rigid Shapes

Thomas Probst, Ajad Chhatkuli, Danda Pani Paudel et al.

Many computer vision methods use consensus maximization to relate measurements containing outliers with the correct transformation model. In the context of rigid shapes, this is typically done using Random Sampling and Consensus (RANSAC) by estimating an analytical model that agrees with the largest number of measurements (inliers). However, small parameter models may not be always available. In this paper, we formulate the model-free consensus maximization as an Integer Program in a graph using `rules' on measurements. We then provide a method to solve it optimally using the Branch and Bound (BnB) paradigm. We focus its application on non-rigid shapes, where we apply the method to remove outlier 3D correspondences and achieve performance superior to the state of the art. Our method works with outlier ratio as high as 80\%. We further derive a similar formulation for 3D template to image matching, achieving similar or better performance compared to the state of the art.

CVSep 17, 2017
Automatic Tool Landmark Detection for Stereo Vision in Robot-Assisted Retinal Surgery

Thomas Probst, Kevis-Kokitsi Maninis, Ajad Chhatkuli et al.

Computer vision and robotics are being increasingly applied in medical interventions. Especially in interventions where extreme precision is required they could make a difference. One such application is robot-assisted retinal microsurgery. In recent works, such interventions are conducted under a stereo-microscope, and with a robot-controlled surgical tool. The complementarity of computer vision and robotics has however not yet been fully exploited. In order to improve the robot control we are interested in 3D reconstruction of the anatomy and in automatic tool localization using a stereo microscope. In this paper, we solve this problem for the first time using a single pipeline, starting from uncalibrated cameras to reach metric 3D reconstruction and registration, in retinal microsurgery. The key ingredients of our method are: (a) surgical tool landmark detection, and (b) 3D reconstruction with the stereo microscope, using the detected landmarks. To address the former, we propose a novel deep learning method that detects and recognizes keypoints in high definition images at higher than real-time speed. We use the detected 2D keypoints along with their corresponding 3D coordinates obtained from the robot sensors to calibrate the stereo microscope using an affine projection model. We design an online 3D reconstruction pipeline that makes use of smoothness constraints and performs robot-to-camera registration. The entire pipeline is extensively validated on open-sky porcine eye sequences. Quantitative and qualitative results are presented for all steps.