CV · Aug 3, 2022 · Code
SC6D: Symmetry-agnostic and Correspondence-free 6D Object Pose Estimation
Dingding Cai, Janne Heikkilä, Esa Rahtu
This paper presents an efficient symmetry-agnostic and correspondence-free framework, referred to as SC6D, for 6D object pose estimation from a single monocular RGB image. SC6D requires neither the 3D CAD model of the object nor any prior knowledge of the symmetries. The pose estimation is decomposed into three sub-tasks: a) object 3D rotation representation learning and matching; b) estimation of the 2D location of the object center; and c) scale-invariant distance estimation (the translation along the z-axis) via classification. SC6D is evaluated on three benchmark datasets, T-LESS, YCB-V, and ITODD, and results in state-of-the-art performance on the T-LESS dataset. Moreover, SC6D is computationally much more efficient than the previous state-of-the-art method SurfEmb. The implementation and pre-trained models are publicly available at https://github.com/dingdingcai/SC6D-pose.
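As a minimal sketch of how the last two sub-tasks could combine into a 3D translation, the snippet below back-projects a 2D object center through a standard pinhole camera model once the distance along the z-axis is known. The function name, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the numbers are illustrative assumptions; the abstract does not state how SC6D performs this step.

```python
def backproject_center(u, v, z, fx, fy, cx, cy):
    """Recover a 3D translation (x, y, z) from a 2D object center (u, v)
    and a distance z along the optical axis, assuming a pinhole camera."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)


# Hypothetical values: center at pixel (400, 300), object 2 m from the camera.
t = backproject_center(u=400.0, v=300.0, z=2.0,
                       fx=800.0, fy=800.0, cx=320.0, cy=240.0)
print(t)
```

Estimating the center and the distance separately keeps each sub-task low-dimensional; the back-projection itself has no learned parameters.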
CV · Apr 4, 2023
Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation
Mayu Otani, Riku Togashi, Yu Sawai et al.
Human evaluation is critical for validating the performance of text-to-image generative models, as this highly cognitive process requires deep comprehension of text and images. However, our survey of 37 recent papers reveals that many works rely solely on automatic measures (e.g., FID) or perform poorly described human evaluations that are not reliable or repeatable. This paper proposes a standardized and well-defined human evaluation protocol to facilitate verifiable and reproducible human evaluation in future works. In our pilot data collection, we experimentally show that the current automatic measures are incompatible with human perception in evaluating the performance of the text-to-image generation results. Furthermore, we provide insights for designing human evaluation experiments reliably and conclusively. Finally, we make several resources publicly available to the community to facilitate easy and fast implementations.
CV · Mar 2, 2022
OVE6D: Object Viewpoint Encoding for Depth-based 6D Object Pose Estimation
Dingding Cai, Janne Heikkilä, Esa Rahtu
This paper proposes a universal framework, called OVE6D, for model-based 6D object pose estimation from a single depth image and a target object mask. Our model is trained using purely synthetic data rendered from ShapeNet, and, unlike most of the existing methods, it generalizes well on new real-world objects without any fine-tuning. We achieve this by decomposing the 6D pose into viewpoint, in-plane rotation around the camera optical axis and translation, and introducing novel lightweight modules for estimating each component in a cascaded manner. The resulting network contains less than 4M parameters while demonstrating excellent performance on the challenging T-LESS and Occluded LINEMOD datasets without any dataset-specific training. We show that OVE6D outperforms some contemporary deep learning-based pose estimation methods specifically trained for individual objects or datasets with real-world training data. The implementation and the pre-trained model will be made publicly available.
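The rotation part of this decomposition can be illustrated with a short sketch: a full rotation is assembled from a viewpoint rotation followed by an in-plane rotation about the camera optical (z) axis. The composition order and all helper names are assumptions made for illustration, not taken from the OVE6D implementation.

```python
import math


def rot_z(theta):
    """Rotation by theta radians about the z (optical) axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def compose_rotation(r_viewpoint, theta):
    """Apply the in-plane rotation after the viewpoint rotation (assumed order)."""
    return matmul3(rot_z(theta), r_viewpoint)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
r_full = compose_rotation(identity, math.pi / 2)
```

Under this order, the in-plane angle leaves the third row of the viewpoint rotation unchanged, which is what allows the two components to be estimated one after the other.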
CV · Mar 28, 2022
Optimal Correction Cost for Object Detection Evaluation
Mayu Otani, Riku Togashi, Yuta Nakashima et al.
Mean Average Precision (mAP) is the primary evaluation measure for object detection. Although object detection has a broad range of applications, mAP evaluates detectors in terms of the performance of ranked instance retrieval. Such an assumption about the evaluation task does not suit some downstream tasks. To alleviate the gap between downstream tasks and the evaluation scenario, we propose Optimal Correction Cost (OC-cost), which assesses detection accuracy at the image level. OC-cost computes the cost of correcting detections to ground truths as a measure of accuracy. The cost is obtained by solving an optimal transportation problem between the detections and the ground truths. Unlike mAP, OC-cost is designed to penalize false positive and false negative detections properly, and every image in a dataset is treated equally. Our experimental results validate that OC-cost has better agreement with human preference than a ranking-based measure, i.e., mAP for a single image. We also show that detectors' rankings by OC-cost are more consistent on different data splits than mAP. Our goal is not to replace mAP with OC-cost but to provide an additional tool to evaluate detectors from another aspect. To help future researchers and developers choose a target measure, we provide a series of experiments to clarify how mAP and OC-cost differ.
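To make the idea of an image-level correction cost concrete, here is a simplified sketch. It swaps the optimal transportation problem for a brute-force optimal one-to-one assignment, uses `1 - IoU` as the matching cost, and charges a fixed `unmatched_cost` for every false positive or false negative. These choices and all names are assumptions for illustration; they are not the cost function defined in the paper.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def correction_cost(dets, gts, unmatched_cost=1.0):
    """Average cost of the cheapest one-to-one matching for one image.
    Brute force over permutations, so only suitable for a handful of boxes."""
    n = max(len(dets), len(gts))
    if n == 0:
        return 0.0
    # Pad to a square matrix; padded rows/columns stand for unmatched boxes.
    cost = [[unmatched_cost] * n for _ in range(n)]
    for i, d in enumerate(dets):
        for j, g in enumerate(gts):
            cost[i][j] = min(1.0 - iou(d, g), unmatched_cost)
    best = min(sum(cost[i][p[i]] for i in range(n))
               for p in permutations(range(n)))
    return best / n


gts = [(0, 0, 10, 10)]
perfect = correction_cost([(0, 0, 10, 10)], gts)
with_false_positive = correction_cost([(0, 0, 10, 10), (20, 20, 30, 30)], gts)
```

Unlike a ranking-based measure, the extra box in the second call raises the cost of that image directly, regardless of its confidence score.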
CV · Feb 14, 2023
MSDA: Monocular Self-supervised Domain Adaptation for 6D Object Pose Estimation
Dingding Cai, Janne Heikkilä, Esa Rahtu
Acquiring labeled 6D poses from real images is an expensive and time-consuming task. Though massive amounts of synthetic RGB images are easy to obtain, the models trained on them suffer from noticeable performance degradation due to the synthetic-to-real domain gap. To mitigate this degradation, we propose a practical self-supervised domain adaptation approach that takes advantage of real RGB(-D) data without needing real pose labels. We first pre-train the model with synthetic RGB images and then utilize real RGB(-D) images to fine-tune the pre-trained model. The fine-tuning process is self-supervised by the RGB-based pose-aware consistency and the depth-guided object distance pseudo-label, which does not require the time-consuming online differentiable rendering. We build our domain adaptation method based on the recent pose estimator SC6D and evaluate it on the YCB-Video dataset. We experimentally demonstrate that our method achieves comparable performance against its fully-supervised counterpart while outperforming existing state-of-the-art approaches.
CV · Jan 3, 2023
BS3D: Building-scale 3D Reconstruction from RGB-D Images
Janne Mustaniemi, Juho Kannala, Esa Rahtu et al.
Various datasets have been proposed for simultaneous localization and mapping (SLAM) and related problems. Existing datasets often include small environments, have incomplete ground truth, or lack important sensor data, such as depth and infrared images. We propose an easy-to-use framework for acquiring building-scale 3D reconstruction using a consumer depth camera. Unlike complex and expensive acquisition setups, our system enables crowd-sourcing, which can greatly benefit data-hungry algorithms. Compared to similar systems, we utilize raw depth maps for odometry computation and loop closure refinement which results in better reconstructions. We acquire a building-scale 3D dataset (BS3D) and demonstrate its value by training an improved monocular depth estimation model. As a unique experiment, we benchmark visual-inertial odometry methods using both color and active infrared images.
CV · Jul 11, 2023
Unbiased Scene Graph Generation via Two-stage Causal Modeling
Shuzhou Sun, Shuaifeng Zhi, Qing Liao et al.
Despite the impressive performance of recent unbiased Scene Graph Generation (SGG) methods, the current debiasing literature mainly focuses on the long-tailed distribution problem, whereas it overlooks another source of bias, i.e., semantic confusion, which makes the SGG model prone to yield false predictions for similar relationships. In this paper, we explore a debiasing procedure for the SGG task leveraging causal inference. Our central insight is that the Sparse Mechanism Shift (SMS) in causality allows independent intervention on multiple biases, thereby potentially preserving head category performance while pursuing the prediction of high-informative tail relationships. However, the noisy datasets lead to unobserved confounders for the SGG task, and thus the constructed causal models are always causal-insufficient to benefit from SMS. To remedy this, we propose Two-stage Causal Modeling (TsCM) for the SGG task, which takes the long-tailed distribution and semantic confusion as confounders to the Structural Causal Model (SCM) and then decouples the causal intervention into two stages. The first stage is causal representation learning, where we use a novel Population Loss (P-Loss) to intervene in the semantic confusion confounder. The second stage introduces the Adaptive Logit Adjustment (AL-Adjustment) to eliminate the long-tailed distribution confounder to complete causal calibration learning. These two stages are model agnostic and thus can be used in any SGG model that seeks unbiased predictions. Comprehensive experiments conducted on the popular SGG backbones and benchmarks show that our TsCM can achieve state-of-the-art performance in terms of mean recall rate. Furthermore, TsCM can maintain a higher recall rate than other debiasing methods, which indicates that our method can achieve a better tradeoff between head and tail relationships.
CV · Jan 16, 2023
Sparse resultant based minimal solvers in computer vision and their connection with the action matrix
Snehal Bhayani, Janne Heikkilä, Zuzana Kukelova
Many computer vision applications require robust and efficient estimation of camera geometry from a minimal number of input data measurements, i.e., solving minimal problems in a RANSAC framework. Minimal problems are usually formulated as complex systems of sparse polynomials. The systems usually are overdetermined and consist of polynomials with algebraically constrained coefficients. Most state-of-the-art efficient polynomial solvers are based on the action matrix method that has been automated and highly optimized in recent years. On the other hand, the alternative theory of sparse resultants and Newton polytopes has been less successful for generating efficient solvers, primarily because the polytopes do not respect the constraints on the coefficients. Therefore, in this paper, we propose a simple iterative scheme to test various subsets of the Newton polytopes and search for the most efficient solver. Moreover, we propose to use an extra polynomial with a special form to further improve the solver efficiency via a Schur complement computation. We show that for some camera geometry problems our extra polynomial-based method leads to smaller and more stable solvers than the state-of-the-art Gröbner basis-based solvers. The proposed method can be fully automated and incorporated into existing tools for automatic generation of efficient polynomial solvers. It provides a competitive alternative to popular Gröbner basis-based methods for minimal problems in computer vision. We also study the conditions under which the minimal solvers generated by the state-of-the-art action matrix-based methods and the proposed extra polynomial resultant-based method are equivalent. Specifically, we consider a step-by-step comparison between the approaches based on the action matrix and the sparse resultant, followed by a set of substitutions, which would lead to equivalent minimal solvers.
CV · May 19
OP2GS: Object-Aware 3D Gaussian Splatting with Dual-Opacity Primitives
Guiyu Liu, Niklas Vaara, Janne Mustaniemi et al.
3D Gaussian Splatting (3DGS) provides an explicit and efficient scene representation, but its primitives lack inherent object-level identity, hindering downstream tasks such as open-vocabulary scene understanding. Existing methods typically address this by either distilling high-dimensional feature embeddings into Gaussians or by lifting 2D mask labels into 3D via heuristic refinement. However, feature-based approaches incur heavy storage and decoding overhead, while lifting-based pipelines remain vulnerable to label contamination: Gaussians necessary for appearance reconstruction often receive incorrect object labels during 2D-to-3D projection. We propose OP2GS, an object-aware Gaussian representation that augments each primitive with an explicit instance identity and a dedicated instance opacity $σ^{*}$ for object-mask rendering. The original opacity $σ$ remains responsible for visual reconstruction, while $σ^{*}$ models whether a Gaussian should contribute to a particular object mask. This dual-opacity formulation decouples visual existence from instance occupancy: mislabeled Gaussians can remain available for image rendering while becoming transparent in the object-mask branch. To learn this representation, we introduce a random object loss that optimizes the 1D instance occupancy field using the standard transmittance-based visibility of 3DGS. Semantic descriptors are then attached at the object level through multi-view aggregation, eliminating per-Gaussian feature storage. Compared with feature-training approaches, OP2GS achieves competitive open-vocabulary performance while significantly reducing computational overhead. Compared with training-free pipelines, it leverages physically consistent occupancy learning to resolve visibility ambiguities.
CV · May 9, 2017 · Code
Cell Tracking via Proposal Generation and Selection
Saad Ullah Akram, Juho Kannala, Lauri Eklund et al.
Microscopy imaging plays a vital role in understanding many biological processes in development and disease. The recent advances in automation of microscopes and development of methods and markers for live cell imaging have led to rapid growth in the amount of image data being captured. To efficiently and reliably extract useful insights from these captured sequences, automated cell tracking is essential. This is a challenging problem due to large variation in the appearance and shapes of cells depending on many factors including imaging methodology, biological characteristics of cells, cell matrix composition, labeling methodology, etc. Often cell tracking methods require a sequence-specific segmentation method and manual tuning of many tracking parameters, which limits their applicability to sequences other than those they are designed for. In this paper, we propose 1) a deep learning based cell proposal method, which proposes candidates for cells along with their scores, and 2) a cell tracking method, which links proposals in adjacent frames in a graphical model using edges representing different cellular events and poses joint cell detection and tracking as the selection of a subset of cell and edge proposals. Our method is completely automated and given enough training data can be applied to a wide variety of microscopy sequences. We evaluate our method on multiple fluorescence and phase contrast microscopy sequences containing cells of various shapes and appearances from the ISBI cell tracking challenge, and show that our method outperforms existing cell tracking methods. Code is available at: https://github.com/SaadUllahAkram/CellTracker
CV · May 8
Differentiable Ray Tracing with Gaussians for Unified Radio Propagation Simulation and View Synthesis
Niklas Vaara, Lam Huynh, Pekka Sangi et al.
Explicit neural representations such as 3D Gaussian Splatting (3DGS) enable high-fidelity and real-time novel view synthesis, yet optimize for alpha-composited optical appearance rather than ray-intersectable geometry. In contrast, radio-frequency (RF) digital twins require deterministic multi-bounce paths, where the geometry dictates trajectories and their associated attenuation and delay. We introduce a framework enabling differentiable RF propagation simulation directly within visually reconstructed neural scenes, allowing point-to-point path computation between arbitrary 3D locations while preserving high-quality visual rendering. Unlike conventional RF simulation pipelines that rely on manually constructed meshes, we embed Gaussian primitives into a hardware-accelerated ray tracing structure as the underlying spatial representation. By extracting physically meaningful channel impulse responses from visual-only reconstructions, we provide cross-modal evidence that neural reconstructions can serve as unified spatial representations for both electromagnetic propagation simulation and photorealistic view synthesis.
CV · May 7
Solving Minimal Problems Without Matrix Inversion Using FFT-Based Interpolation
Haidong Wu, Snehal Bhayani, Janne Heikkilä
Estimating camera geometry typically involves solving minimal problems formulated as systems of multivariate polynomial equations, which often pose computational challenges when using existing Gröbner-basis or resultant-based methods due to matrix inversion needed in the online solver. Here we propose a sampling-based, matrix inversion-free method that constructs the solvers using sparse hidden-variable resultants. The determinant polynomial in the hidden variable is efficiently reconstructed via inverse fast Fourier transform interpolation from sampled evaluations, avoiding symbolic expansion. Solving this polynomial yields the hidden variable, and the remaining unknowns are recovered by identifying rank-1 deficient submatrices and applying Cramer's rule. A greatest common divisor-based criterion ensures robust submatrix identification under noise. Experiments on diverse minimal problems demonstrate that the proposed solver achieves strong numerical stability and competitive runtime, particularly for small-scale problems, providing a practical alternative to traditional Gröbner-basis and resultant-based solvers.
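The interpolation step can be sketched on a toy problem: the determinant of a matrix whose entries are polynomials in a hidden variable is sampled at roots of unity, and its coefficients are recovered with an inverse discrete Fourier transform, without expanding the determinant symbolically. The 2x2 matrix below is a hypothetical example, and a plain O(n^2) inverse DFT stands in for the inverse FFT so that the snippet needs only the standard library.

```python
import cmath


def matrix_at(x):
    """Toy 2x2 matrix polynomial (hypothetical); det M(x) = x^2 - 3x - 2."""
    return [[x, 1.0], [2.0, x - 3.0]]


def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def interpolate_det(n_samples):
    """Recover the coefficients of det M(x), lowest degree first.
    n_samples must exceed the degree of the determinant polynomial."""
    # Numerically evaluate the determinant at the n-th roots of unity.
    roots = [cmath.exp(2j * cmath.pi * k / n_samples) for k in range(n_samples)]
    samples = [det2(matrix_at(x)) for x in roots]
    # Inverse DFT: c_j = (1/n) * sum_k samples[k] * exp(-2*pi*i*j*k/n).
    coeffs = []
    for j in range(n_samples):
        acc = sum(s * cmath.exp(-2j * cmath.pi * j * k / n_samples)
                  for k, s in enumerate(samples))
        coeffs.append(acc / n_samples)
    return coeffs


coeffs = interpolate_det(4)
print([round(c.real, 6) for c in coeffs])
```

Each sample only requires a numeric determinant, which is why this route avoids symbolic expansion; the roots of the recovered polynomial then give candidate values of the hidden variable.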
CV · Mar 22, 2025
A Causal Adjustment Module for Debiasing Scene Graph Generation
Li Liu, Shuzhou Sun, Shuaifeng Zhi et al.
While recent debiasing methods for Scene Graph Generation (SGG) have shown impressive performance, these efforts often attribute model bias solely to the long-tail distribution of relationships, overlooking the more profound causes stemming from skewed object and object pair distributions. In this paper, we employ causal inference techniques to model the causality among these observed skewed distributions. Our insight lies in the ability of causal inference to capture the unobservable causal effects between complex distributions, which is crucial for tracing the roots of model bias. Specifically, we introduce the Mediator-based Causal Chain Model (MCCM), which, in addition to modeling causality among objects, object pairs, and relationships, incorporates mediator variables, i.e., cooccurrence distribution, for complementing the causality. Following this, we propose the Causal Adjustment Module (CAModule) to estimate the modeled causal structure, using variables from MCCM as inputs to produce a set of adjustment factors aimed at correcting biased model predictions. Moreover, our method enables the composition of zero-shot relationships, thereby enhancing the model's ability to recognize such relationships. Experiments conducted across various SGG backbones and popular benchmarks demonstrate that CAModule achieves state-of-the-art mean recall rates, with significant improvements also observed on the challenging zero-shot recall rate metric.
CV · Mar 15, 2024
GS-Pose: Generalizable Segmentation-based 6D Object Pose Estimation with 3D Gaussian Splatting
Dingding Cai, Janne Heikkilä, Esa Rahtu
This paper introduces GS-Pose, a unified framework for localizing and estimating the 6D pose of novel objects. GS-Pose begins with a set of posed RGB images of a previously unseen object and builds three distinct representations stored in a database. At inference, GS-Pose operates sequentially by locating the object in the input image, estimating its initial 6D pose using a retrieval approach, and refining the pose with a render-and-compare method. The key insight is the application of the appropriate object representation at each stage of the process. In particular, for the refinement step, we leverage 3D Gaussian splatting, a novel differentiable rendering technique that offers high rendering speed and relatively low optimization time. Off-the-shelf toolchains and commodity hardware, such as mobile phones, can be used to capture new objects to be added to the database. Extensive evaluations on the LINEMOD and OnePose-LowTexture datasets demonstrate excellent performance, establishing the new state-of-the-art. Project page: https://dingdingcai.github.io/gs-pose.
LG · Jan 14, 2025
Uncovering Bias in Foundation Models: Impact, Testing, Harm, and Mitigation
Shuzhou Sun, Li Liu, Yongxiang Liu et al.
Bias in Foundation Models (FMs) - trained on vast datasets spanning societal and historical knowledge - poses significant challenges for fairness and equity across fields such as healthcare, education, and finance. These biases, rooted in the overrepresentation of stereotypes and societal inequalities in training data, exacerbate real-world discrimination, reinforce harmful stereotypes, and erode trust in AI systems. To address this, we introduce Trident Probe Testing (TriProTesting), a systematic testing method that detects explicit and implicit biases using semantically designed probes. Here we show that FMs, including CLIP, ALIGN, BridgeTower, and OWLv2, demonstrate pervasive biases across single and mixed social attributes (gender, race, age, and occupation). Notably, we uncover mixed biases when social attributes are combined, such as gender x race, gender x age, and gender x occupation, revealing deeper layers of discrimination. We further propose Adaptive Logit Adjustment (AdaLogAdjustment), a post-processing technique that dynamically redistributes probability power to mitigate these biases effectively, achieving significant improvements in fairness without retraining models. These findings highlight the urgent need for ethical AI practices and interdisciplinary solutions to address biases not only at the model level but also in societal structures. Our work provides a scalable and interpretable solution that advances fairness in AI systems while offering practical insights for future research on fair AI technologies.
CV · May 29, 2025
A Reverse Causal Framework to Mitigate Spurious Correlations for Debiasing Scene Graph Generation
Shuzhou Sun, Li Liu, Tianpeng Liu et al.
Existing two-stage Scene Graph Generation (SGG) frameworks typically incorporate a detector to extract relationship features and a classifier to categorize these relationships; therefore, the training paradigm follows a causal chain structure, where the detector's inputs determine the classifier's inputs, which in turn influence the final predictions. However, such a causal chain structure can yield spurious correlations between the detector's inputs and the final predictions, i.e., the prediction of a certain relationship may be influenced by other relationships. This influence can induce at least two observable biases: tail relationships are predicted as head ones, and foreground relationships are predicted as background ones; notably, the latter bias is seldom discussed in the literature. To address this issue, we propose reconstructing the causal chain structure into a reverse causal structure, wherein the classifier's inputs are treated as the confounder, and both the detector's inputs and the final predictions are viewed as causal variables. Specifically, we term the reconstructed causal paradigm as the Reverse causal Framework for SGG (RcSGG). RcSGG initially employs the proposed Active Reverse Estimation (ARE) to intervene on the confounder to estimate the reverse causality, i.e., the causality from final predictions to the classifier's inputs. Then, the Maximum Information Sampling (MIS) is suggested to enhance the reverse causality estimation further by considering the relationship information. Theoretically, RcSGG can mitigate the spurious correlations inherent in the SGG framework, subsequently eliminating the induced biases. Comprehensive experiments on popular benchmarks and diverse SGG frameworks show the state-of-the-art mean recall rate.
CV · Apr 2, 2025
A Conic Transformation Approach for Solving the Perspective-Three-Point Problem
Haidong Wu, Snehal Bhayani, Janne Heikkilä
We propose a conic transformation method to solve the Perspective-Three-Point (P3P) problem. In contrast to the current state-of-the-art solvers, which formulate the P3P problem by intersecting two conics and constructing a degenerate conic to find the intersection, our approach builds upon a new formulation based on a transformation that maps the two conics to a new coordinate system, where one of the conics becomes a standard parabola in a canonical form. This enables expressing one variable in terms of the other variable, and as a consequence, substantially simplifies the problem of finding the conic intersection. Moreover, the polynomial coefficients are fast to compute, and we only need to determine the real-valued intersection points, which avoids the requirement of using computationally expensive complex arithmetic. While the current state-of-the-art methods reduce the conic intersection problem to solving a univariate cubic equation, our approach, despite resulting in a quartic equation, is still faster thanks to this new simplified formulation. Extensive evaluations demonstrate that our method achieves higher speed while maintaining robustness and stability comparable to state-of-the-art methods.
SP · Jul 5, 2025
Differentiable High-Performance Ray Tracing-Based Simulation of Radio Propagation with Point Clouds
Niklas Vaara, Pekka Sangi, Miguel Bordallo López et al.
Ray tracing is a widely used deterministic method for radio propagation simulations, capable of producing physically accurate multipath components. The accuracy depends on the quality of the environment model and its electromagnetic properties. Recent advances in computer vision and machine learning have made it possible to reconstruct detailed environment models augmented with semantic segmentation labels. In this letter, we propose a differentiable ray tracing-based radio propagation simulator that operates directly on point clouds. We showcase the efficiency of our method by simulating multi-bounce propagation paths with up to five interactions with specular reflections and diffuse scattering in two indoor scenarios, each completing in less than 90 ms. Lastly, we demonstrate how the differentiability of electromagnetic computations can be combined with segmentation labels to learn the electromagnetic properties of the environment.
CV · Jun 30, 2025
Towards an Automated Multimodal Approach for Video Summarization: Building a Bridge Between Text, Audio and Facial Cue-Based Summarization
Md Moinul Islam, Sofoklis Kakouros, Janne Heikkilä et al.
The increasing volume of video content in educational, professional, and social domains necessitates effective summarization techniques that go beyond traditional unimodal approaches. This paper proposes a behaviour-aware multimodal video summarization framework that integrates textual, audio, and visual cues to generate timestamp-aligned summaries. By extracting prosodic features, textual cues and visual indicators, the framework identifies semantically and emotionally important moments. A key contribution is the identification of bonus words, which are terms emphasized across multiple modalities and used to improve the semantic relevance and expressive clarity of the summaries. The approach is evaluated against pseudo-ground truth (pGT) summaries generated using an LLM-based extractive method. Experimental results demonstrate significant improvements over traditional extractive methods, such as the Edmundson method, in both text and video-based evaluation metrics. Text-based metrics show ROUGE-1 increasing from 0.4769 to 0.7929 and BERTScore from 0.9152 to 0.9536, while in video-based evaluation, our proposed framework improves F1-Score by almost 23%. The findings underscore the potential of multimodal integration in producing comprehensive and behaviourally informed video summaries.
CV · Dec 26, 2024
An End-to-End Depth-Based Pipeline for Selfie Image Rectification
Ahmed Alhawwary, Janne Mustaniemi, Phong Nguyen-Ha et al.
Portraits or selfie images taken from a close distance typically suffer from perspective distortion. In this paper, we propose an end-to-end deep learning-based rectification pipeline to mitigate the effects of perspective distortion. We learn to predict the facial depth by training a deep CNN. The estimated depth is utilized to adjust the camera-to-subject distance by moving the camera farther, increasing the camera focal length, and reprojecting the 3D image features to the new perspective. The reprojected features are then fed to an inpainting module to fill in the missing pixels. We leverage a differentiable renderer to enable end-to-end training of our depth estimation and feature extraction nets to improve the rectified outputs. To boost the results of the inpainting module, we incorporate an auxiliary module to predict the horizontal movement of the camera which decreases the area that requires hallucination of challenging face parts such as ears. Unlike previous works, we process the full-frame input image at once without cropping the subject's face and processing it separately from the rest of the body, eliminating the need for complex post-processing steps to attach the face back to the subject's body. To train our network, we utilize the popular game engine Unreal Engine to generate a large synthetic face dataset containing various subjects, head poses, expressions, eyewear, clothes, and lighting. Quantitative and qualitative results show that our rectification pipeline outperforms previous methods, and produces comparable results with a time-consuming 3D GAN-based method while being more than 260 times faster.
CV · Sep 1, 2020
Uncovering Hidden Challenges in Query-Based Video Moment Retrieval
Mayu Otani, Yuta Nakashima, Esa Rahtu et al.
Query-based moment retrieval is the problem of localising a specific clip from an untrimmed video according to a query sentence. This is a challenging task that requires interpretation of both the natural language query and the video content. Like in many other areas in computer vision and machine learning, the progress in query-based moment retrieval is heavily driven by the benchmark datasets and, therefore, their quality has a significant impact on the field. In this paper, we present a series of experiments assessing how well the benchmark results reflect the true progress in solving the moment retrieval task. Our results indicate substantial biases in the popular datasets and unexpected behaviour of the state-of-the-art models. Moreover, we present new sanity check experiments and approaches for visualising the results. Finally, we suggest possible directions to improve the temporal sentence grounding in the future. Our code for this paper is available at https://mayu-ot.github.io/hidden-challenges-MR.
CV · Jul 17, 2020
Computing stable resultant-based minimal solvers by hiding a variable
Snehal Bhayani, Zuzana Kukelova, Janne Heikkilä
Many computer vision applications require robust and efficient estimation of camera geometry. The robust estimation is usually based on solving camera geometry problems from a minimal number of input data measurements, i.e., solving minimal problems, in a RANSAC-style framework. Minimal problems often result in complex systems of polynomial equations. The existing state-of-the-art methods for solving such systems are based either on Gröbner bases and the action matrix method, which have been extensively studied and optimized in recent years, or on a recently proposed approach based on a sparse resultant computation using an extra variable. In this paper, we study an interesting alternative sparse resultant-based method for solving sparse systems of polynomial equations by hiding one variable. This approach results in a larger eigenvalue problem than the action matrix and extra variable sparse resultant-based methods; however, it does not need to compute an inverse or elimination of large matrices that may be numerically unstable. The proposed approach includes several improvements to the standard sparse resultant algorithms, which significantly improve the efficiency and stability of the hidden variable resultant-based solvers, as we demonstrate on several interesting computer vision problems. We show that for the studied problems, our sparse resultant-based approach leads to more stable solvers than the state-of-the-art Gröbner bases-based solvers as well as existing sparse resultant-based solvers, especially in close to critical configurations. Our new method can be fully automated and incorporated into existing tools for the automatic generation of efficient minimal solvers.
CV Jun 18, 2020
Learning non-rigid surface reconstruction from spatio-temporal image patches
Matteo Pedone, Abdelrahman Mostafa, Janne Heikkilä
We present a method to reconstruct a dense spatio-temporal depth map of a non-rigidly deformable object directly from a video sequence. The estimation of depth is performed locally on spatio-temporal patches of the video, and then the full depth video of the entire shape is recovered by combining them. Since the geometry of a local spatio-temporal patch of a deforming non-rigid object is often simple enough to be faithfully represented with a parametric model, we artificially generate a database of small deforming rectangular meshes rendered with different material properties and lighting conditions, along with their corresponding depth videos, and use such data to train a convolutional neural network. We tested our method on both synthetic and Kinect data and experimentally observed that the reconstruction error is significantly lower than that obtained using other approaches, such as conventional non-rigid structure from motion.
CV Dec 21, 2019
A sparse resultant based method for efficient minimal solvers
Snehal Bhayani, Zuzana Kukelova, Janne Heikkilä
Many computer vision applications require robust and efficient estimation of camera geometry. The robust estimation is usually based on solving camera geometry problems from a minimal number of input data measurements, i.e., solving minimal problems in a RANSAC framework. Minimal problems often result in complex systems of polynomial equations. Many state-of-the-art efficient polynomial solvers for these problems are based on Gröbner bases and the action-matrix method, which has been automated and highly optimized in recent years. In this paper, we study an alternative algebraic method for solving systems of polynomial equations, i.e., the sparse resultant-based method, and propose a novel approach to convert the resultant constraint to an eigenvalue problem. This technique can significantly improve the efficiency and stability of existing resultant-based solvers. We applied our new resultant-based method to a large variety of computer vision problems and show that for most of the considered problems, the new method leads to solvers that are the same size as the best available Gröbner basis solvers and of similar accuracy. For some problems, the new sparse resultant-based method leads to even smaller and more stable solvers than the state-of-the-art Gröbner basis solvers. Our new method can be fully automated and incorporated into existing tools for automatic generation of efficient polynomial solvers, and as such it represents a competitive alternative to popular Gröbner basis methods for minimal problems in computer vision.
CV Nov 25, 2019
Improving land cover segmentation across satellites using domain adaptation
Nadir Bengana, Janne Heikkilä
Land use and land cover mapping are essential to various fields of study, including forestry, agriculture, and urban management. Using earth observation satellites both facilitates and accelerates the task. Lately, deep learning methods have proven to be excellent at automating the mapping via semantic image segmentation. However, because deep neural networks require large amounts of labeled data, it is not easy to exploit the full potential of satellite imagery. Additionally, the land cover tends to differ in appearance from one region to another; therefore, having labeled data from one location does not necessarily help in mapping others. Furthermore, satellite images come in various multispectral bands (the bands could range from RGB to over twelve bands). In this paper, we aim at using domain adaptation to solve the aforementioned problems. We applied a well-performing domain adaptation approach to datasets we have built using RGB images from Sentinel-2, WorldView-2, and Pleiades-1 satellites with Corine Land Cover as ground-truth labels. We have also used the DeepGlobe land cover dataset. Experiments show a significant improvement over results obtained without the use of domain adaptation, in some cases an improvement of over 20% MIoU. At times, the approach even manages to correct errors in the ground-truth labels.
CV Mar 27, 2019
Rethinking the Evaluation of Video Summaries
Mayu Otani, Yuta Nakashima, Esa Rahtu et al.
Video summarization is a technique to create a short skim of the original video while preserving the main stories/content. There is substantial interest in automating this process due to the rapid growth of the available material. The recent progress has been facilitated by public benchmark datasets, which enable easy and fair comparison of methods. Currently, the established evaluation protocol is to compare the generated summary with a set of reference summaries provided by the dataset. In this paper, we provide an in-depth assessment of this pipeline using two popular benchmark datasets. Surprisingly, we observe that randomly generated summaries achieve performance comparable to or better than the state of the art. In some cases, the random summaries outperform even the human-generated summaries in leave-one-out experiments. Moreover, it turns out that the video segmentation, which is often considered a fixed pre-processing method, has the most significant impact on the performance measure. Based on our observations, we propose alternative approaches for assessing the importance scores as well as an intuitive visualization of the correlation between the estimated scoring and human annotations.
CV Feb 20, 2019
An efficient solution for semantic segmentation: ShuffleNet V2 with atrous separable convolutions
Sercan Türkmen, Janne Heikkilä
Assigning a label to each pixel in an image, namely semantic segmentation, has been an important task in computer vision, and has applications in autonomous driving, robotic navigation, localization, and scene understanding. Fully convolutional neural networks have proved to be a successful solution for the task over the years, but most of the work being done focuses primarily on accuracy. In this paper, we present a computationally efficient approach to semantic segmentation that achieves a high mean intersection over union (mIOU) of 70.33% on the Cityscapes challenge. The proposed network is capable of running in real time on mobile devices. In addition, we make our code and model weights publicly available.
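For readers unfamiliar with the metric, mean intersection over union can be sketched as follows. This is a minimal illustration on a single pair of label maps, assuming integer class labels; the function name `mean_iou` is hypothetical, and benchmark tooling such as the Cityscapes evaluation accumulates counts over the whole dataset rather than per image.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in either label map."""
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            # Class absent from both maps: skip it instead of counting it as 0 or 1.
            continue
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

if __name__ == "__main__":
    pred = [[0, 0], [1, 1]]
    target = [[0, 1], [1, 1]]
    print(mean_iou(pred, target, num_classes=2))
```

In the usage example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.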
CV Nov 23, 2018
LSD$_2$ -- Joint Denoising and Deblurring of Short and Long Exposure Images with CNNs
Janne Mustaniemi, Juho Kannala, Jiri Matas et al.
The paper addresses the problem of acquiring high-quality photographs with handheld smartphone cameras in low-light imaging conditions. We propose an approach based on capturing pairs of short and long exposure images in rapid succession and fusing them into a single high-quality photograph. Unlike existing methods, we take advantage of both images simultaneously and perform a joint denoising and deblurring using a convolutional neural network. A novel approach is introduced to generate realistic short-long exposure image pairs. The method produces good images in extremely challenging conditions and outperforms existing denoising and deblurring methods. It also enables exposure fusion in the presence of motion blur.
CV Oct 1, 2018
Gyroscope-Aided Motion Deblurring with Deep Networks
Janne Mustaniemi, Juho Kannala, Simo Särkkä et al.
We propose a deblurring method that incorporates gyroscope measurements into a convolutional neural network (CNN). With the help of such measurements, it can handle extremely strong and spatially-variant motion blur. At the same time, the image data is used to overcome the limitations of gyro-based blur estimation. To train our network, we also introduce a novel way of generating realistic training data using the gyroscope. The evaluation shows a clear improvement in visual quality over the state of the art while achieving real-time performance. Furthermore, the method is shown to improve the performance of existing feature detectors and descriptors against motion blur.
CV Jul 31, 2018
Leveraging Unlabeled Whole-Slide-Images for Mitosis Detection
Saad Ullah Akram, Talha Qaiser, Simon Graham et al.
Mitosis count is an important biomarker for prognosis of various cancers. At present, pathologists typically perform manual counting on a few selected regions of interest in breast whole-slide-images (WSIs) of patient biopsies. This task is very time-consuming, tedious and subjective. Automated mitosis detection methods have made great advances in recent years. However, these methods require exhaustive labeling of a large number of selected regions of interest. This task is very expensive because expert pathologists are needed for reliable and accurate annotations. In this paper, we present a semi-supervised mitosis detection method which is designed to leverage a large number of unlabeled breast cancer WSIs. As a result, our method capitalizes on the growing number of digitized histology images, without relying on exhaustive annotations, subsequently improving mitosis detection. Our method first learns a mitosis detector from labeled data, uses this detector to mine additional mitosis samples from unlabeled WSIs, and then trains the final model using this larger and more diverse set of mitosis samples. The use of unlabeled data improves the F1-score by $\sim$5\% compared to our best performing fully-supervised model on the TUPAC validation set. Our submission (single model) to the TUPAC challenge ranks highly on the leaderboard with an F1-score of 0.64.
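The F1-score quoted above is the harmonic mean of precision and recall over detections. A minimal sketch, assuming detections have already been matched to ground-truth annotations so that only the counts of true positives, false positives, and false negatives are needed (the function name `f1_score` and the counts in the example are hypothetical, not results from the paper):

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from detection counts."""
    denominator = 2 * true_positives + false_positives + false_negatives
    if denominator == 0:
        # No detections and no ground truth: define the score as 0 by convention here.
        return 0.0
    return 2 * true_positives / denominator

if __name__ == "__main__":
    # Illustrative counts only.
    print(f1_score(true_positives=8, false_positives=2, false_negatives=4))
```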
CV May 22, 2018
Fast Motion Deblurring for Feature Detection and Matching Using Inertial Measurements
Janne Mustaniemi, Juho Kannala, Simo Särkkä et al.
Many computer vision and image processing applications rely on local features. It is well known that motion blur decreases the performance of traditional feature detectors and descriptors. We propose an inertial-based deblurring method for improving the robustness of existing feature detectors and descriptors against motion blur. Unlike most deblurring algorithms, the method can handle spatially-variant blur and rolling shutter distortion. Furthermore, it is capable of running in real time, contrary to state-of-the-art algorithms. The limitations of inertial-based blur estimation are taken into account by validating the blur estimates using image data. The evaluation shows that when the method is used with a traditional feature detector and descriptor, it increases the number of detected keypoints, provides higher repeatability and improves the localization accuracy. We also demonstrate that such features lead to more accurate and complete reconstructions when used in the application of 3D visual reconstruction.
CV Apr 24, 2018
Accurate 3-D Reconstruction with RGB-D Cameras using Depth Map Fusion and Pose Refinement
Markus Ylimäki, Juho Kannala, Janne Heikkilä
Depth map fusion is an essential part of both stereo and RGB-D based 3-D reconstruction pipelines. Whether produced with passive stereo reconstruction or using an active depth sensor, such as Microsoft Kinect, the depth maps have noise and may have poor initial registration. In this paper, we introduce a method which is capable of handling outliers and, in particular, even significant registration errors. The proposed method first fuses a sequence of depth maps into a single non-redundant point cloud so that the redundant points are merged together by giving more weight to more certain measurements. Then, the original depth maps are re-registered to the fused point cloud to refine the original camera extrinsic parameters. The fusion is then performed again with the refined extrinsic parameters. This procedure is repeated until the result is satisfactory or no significant changes happen between iterations. The method is robust to outliers and erroneous depth measurements as well as even significant depth map registration errors due to inaccurate initial camera poses.
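The merging step, in which redundant points are combined by giving more weight to more certain measurements, can be sketched as a confidence-weighted average. This is an illustrative assumption about the form of the weighting, not the paper's exact rule; the function name `merge_redundant_points` and the weights in the example are hypothetical.

```python
import numpy as np

def merge_redundant_points(points, weights):
    """Merge measurements of the same surface point into one weighted mean.

    points  : (N, 3) array of 3-D measurements believed to be redundant
    weights : (N,) array of positive certainty weights (higher = more certain)
    """
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    merged = (weights[:, None] * points).sum(axis=0) / weights.sum()
    # Carry the accumulated weight forward so later merges stay consistent.
    return merged, float(weights.sum())

if __name__ == "__main__":
    pts = [[0.0, 0.0, 1.0], [0.0, 0.0, 3.0]]
    merged, total_weight = merge_redundant_points(pts, [3.0, 1.0])
    print(merged, total_weight)
```

Returning the summed weight alongside the merged point lets a later measurement be folded in incrementally with the same function.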
CV Nov 29, 2016
Inertial-Based Scale Estimation for Structure from Motion on Mobile Devices
Janne Mustaniemi, Juho Kannala, Simo Särkkä et al.
Structure from motion algorithms have an inherent limitation that the reconstruction can only be determined up to an unknown scale factor. Modern mobile devices are equipped with an inertial measurement unit (IMU), which can be used for estimating the scale of the reconstruction. We propose a method that recovers the metric scale given inertial measurements and camera poses. In the process, we also perform a temporal and spatial alignment of the camera and the IMU. Therefore, our solution can be easily combined with any existing visual reconstruction software. The method can cope with noisy camera pose estimates, typically caused by motion blur or rolling shutter artifacts, by utilizing a Rauch-Tung-Striebel (RTS) smoother. Furthermore, the scale estimation is performed in the frequency domain, which provides more robustness to inaccurate sensor time stamps and noisy IMU samples than the previously used time domain representation. In contrast to previous methods, our approach has no parameters that need to be tuned to achieve good performance. In the experiments, we show that the algorithm outperforms the state of the art in both accuracy and convergence speed of the scale estimate. The accuracy of the scale is around $1\%$ from the ground truth depending on the recording. We also demonstrate that our method can improve the scale accuracy of Project Tango's built-in motion tracking.
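Why a frequency-domain formulation tolerates timestamp errors can be seen in a simplified one-dimensional sketch. It is not the paper's algorithm: gravity, sensor biases, the RTS smoother, and the spatial alignment are all ignored, and the function name `estimate_scale` is hypothetical. The sketch fits a single scale factor between the amplitude spectra of an up-to-scale visual acceleration signal and a metric IMU acceleration signal by least squares.

```python
import numpy as np

def estimate_scale(visual_acc, imu_acc):
    """Least-squares scale s such that |FFT(imu_acc)| ~= s * |FFT(visual_acc)|."""
    visual_spectrum = np.abs(np.fft.rfft(visual_acc))
    imu_spectrum = np.abs(np.fft.rfft(imu_acc))
    # Closed-form minimiser of sum_k (imu_spectrum[k] - s * visual_spectrum[k])**2.
    return float(np.dot(visual_spectrum, imu_spectrum) /
                 np.dot(visual_spectrum, visual_spectrum))

if __name__ == "__main__":
    t = np.linspace(0.0, 1.0, 256, endpoint=False)
    visual = np.sin(2 * np.pi * 3 * t) + 0.5 * np.cos(2 * np.pi * 7 * t)
    # Simulate a metric signal that is scaled and circularly delayed by 5 samples.
    imu = 2.5 * np.roll(visual, 5)
    print(estimate_scale(visual, imu))
```

Because the magnitude of the discrete Fourier transform is unchanged by a circular shift, the simulated delay does not affect the estimate; real recordings are not circular, so this invariance holds only approximately in practice.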
CV Sep 28, 2016
Video Summarization using Deep Semantic Features
Mayu Otani, Yuta Nakashima, Esa Rahtu et al.
This paper presents a video summarization technique for Internet videos that provides a quick way to overview their content. This is a challenging problem because finding important or informative parts of the original video requires understanding its content. Furthermore, the content of Internet videos is very diverse, ranging from home videos to documentaries, which makes video summarization much harder as prior knowledge is rarely available. To tackle this problem, we propose to use deep video features that can encode various levels of content semantics, including objects, actions, and scenes, improving the efficiency of standard video summarization techniques. For this, we design a deep neural network that maps videos as well as descriptions to a common semantic space and jointly train it with associated pairs of videos and descriptions. To generate a video summary, we extract the deep features from each segment of the original video and apply a clustering-based summarization technique to them. We evaluate our video summaries on the SumMe dataset and compare them with baseline approaches. The results demonstrate the advantages of incorporating our deep semantic features in a video summarization technique.
CV Aug 8, 2016
Learning Joint Representations of Videos and Sentences with Web Image Search
Mayu Otani, Yuta Nakashima, Esa Rahtu et al.
Our objective is video retrieval based on natural language queries. In addition, we consider the analogous problem of retrieving sentences or generating descriptions given an input video. Recent work has addressed the problem by embedding visual and textual inputs into a common space where semantic similarities correlate with distances. We also adopt the embedding approach and make the following contributions: First, we utilize web image search in the sentence embedding process to disambiguate fine-grained visual concepts. Second, we propose embedding models for sentence, image, and video inputs whose parameters are learned simultaneously. Finally, we show how the proposed model can be applied to description generation. Overall, we observe a clear improvement over the state-of-the-art methods in the video and sentence retrieval tasks. In description generation, the performance level is comparable to the current state of the art, although our embeddings were trained for the retrieval tasks.
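Once sentences and videos live in a common space, retrieval reduces to ranking candidates by similarity to the query embedding. A minimal sketch, assuming the embeddings are already computed and using cosine similarity as the measure (an assumption for illustration; the function name `rank_by_cosine` and the toy vectors are hypothetical, and the embedding networks themselves are not shown):

```python
import numpy as np

def rank_by_cosine(query, candidates):
    """Return candidate indices sorted from most to least similar, plus the similarities."""
    query = np.asarray(query, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    query_unit = query / np.linalg.norm(query)
    candidate_units = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    similarities = candidate_units @ query_unit
    order = np.argsort(-similarities)
    return order, similarities

if __name__ == "__main__":
    sentence_embedding = [1.0, 0.0]
    video_embeddings = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
    order, sims = rank_by_cosine(sentence_embedding, video_embeddings)
    print(order, sims)
```

The same function serves both directions: a sentence query ranked against video embeddings, or a video query ranked against sentence embeddings.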