Kalle Åström

h-index75

21papers

256citations

Novelty51%

AI Score46

Ranked #62,851 of 201,326 authors (top 31%)#22,649 in CV (top 38%)

21 Papers

SPFeb 10, 2023Code

The LuViRA Dataset: Synchronized Vision, Radio, and Audio Sensors for Indoor Localization

Ilayda Yaman, Guoda Tian, Martin Larsson et al.

We present a synchronized multisensory dataset for accurate and robust indoor localization: the Lund University Vision, Radio, and Audio (LuViRA) Dataset. The dataset includes color images, corresponding depth maps, inertial measurement unit (IMU) readings, channel response between a 5G massive multiple-input and multiple-output (MIMO) testbed and user equipment, audio recorded by 12 microphones, and accurate six degrees of freedom (6DOF) pose ground truth of 0.5 mm. We synchronize these sensors to ensure that all data is recorded simultaneously. A camera, speaker, and transmit antenna are placed on top of a slowly moving service robot, and 89 trajectories are recorded. Each trajectory includes 20 to 50 seconds of recorded sensor data and ground truth labels. Data from different sensors can be used separately or jointly to perform localization tasks, and data from the motion capture (mocap) system is used to verify the results obtained by the localization algorithms. The main aim of this dataset is to enable research on sensor fusion with the most commonly used sensors for localization tasks. Moreover, the full dataset or some parts of it can also be used for other research areas such as channel estimation, image classification, etc. Our dataset is available at: https://github.com/ilaydayaman/LuViRA_Dataset

CVDec 13, 2022Code

LidarCLIP or: How I Learned to Talk to Point Clouds

Georg Hess, Adam Tonderski, Christoffer Petersson et al.

Research connecting text and images has recently seen several breakthroughs, with models like CLIP, DALL-E 2, and Stable Diffusion. However, the connection between text and other visual modalities, such as lidar data, has received less attention, prohibited by the lack of text-lidar datasets. In this work, we propose LidarCLIP, a mapping from automotive point clouds to a pre-existing CLIP embedding space. Using image-lidar pairs, we supervise a point cloud encoder with the image CLIP embeddings, effectively relating text and lidar data with the image domain as an intermediary. We show the effectiveness of LidarCLIP by demonstrating that lidar-based retrieval is generally on par with image-based retrieval, but with complementary strengths and weaknesses. By combining image and lidar features, we improve upon both single-modality methods and enable a targeted search for challenging detection scenarios under adverse sensor conditions. We also explore zero-shot classification and show that LidarCLIP outperforms existing attempts to use CLIP for point clouds by a large margin. Finally, we leverage our compatibility with CLIP to explore a range of applications, such as point cloud captioning and lidar-to-image generation, without any additional training. Code and pre-trained models are available at https://github.com/atonderski/lidarclip.

CVApr 21, 2022Code

Future Object Detection with Spatiotemporal Transformers

Adam Tonderski, Joakim Johnander, Christoffer Petersson et al.

We propose the task Future Object Detection, in which the goal is to predict the bounding boxes for all visible objects in a future video frame. While this task involves recognizing temporal and kinematic patterns, in addition to the semantic and geometric ones, it only requires annotations in the standard form for individual, single (future) frames, in contrast to expensive full sequence annotations. We propose to tackle this task with an end-to-end method, in which a detection transformer is trained to directly output the future objects. In order to make accurate predictions about the future, it is necessary to capture the dynamics in the scene, both object motion and the movement of the ego-camera. To this end, we extend existing detection transformers in two ways. First, we experiment with three different mechanisms that enable the network to spatiotemporally process multiple frames. Second, we provide ego-motion information to the model in a learnable manner. We show that both of these extensions improve the future object detection performance substantially. Our final approach learns to capture the dynamics and makes predictions on par with an oracle for prediction horizons up to 100 ms, and outperforms all baselines for longer prediction horizons. By visualizing the attention maps, we observe that a form of tracking emerges within the network. Code is available at github.com/atonderski/future-object-detection.

CVSep 8, 2022Code

Aerial View Localization with Reinforcement Learning: Towards Emulating Search-and-Rescue

Aleksis Pirinen, Anton Samuelsson, John Backsund et al.

Climate-induced disasters are and will continue to be on the rise, and thus search-and-rescue (SAR) operations, where the task is to localize and assist one or several people who are missing, become increasingly relevant. In many cases the rough location may be known and a UAV can be deployed to explore a given, confined area to precisely localize the missing people. Due to time and battery constraints it is often critical that localization is performed as efficiently as possible. In this work we approach this type of problem by abstracting it as an aerial view goal localization task in a framework that emulates a SAR-like setup without requiring access to actual UAVs. In this framework, an agent operates on top of an aerial image (proxy for a search area) and is tasked with localizing a goal that is described in terms of visual cues. To further mimic the situation on an actual UAV, the agent is not able to observe the search area in its entirety, not even at low resolution, and thus it has to operate solely based on partial glimpses when navigating towards the goal. To tackle this task, we propose AiRLoc, a reinforcement learning (RL)-based model that decouples exploration (searching for distant goals) and exploitation (localizing nearby goals). Extensive evaluations show that AiRLoc outperforms heuristic search methods as well as alternative learnable approaches, and that it generalizes across datasets, e.g. to disaster-hit areas without seeing a single disaster scenario during training. We also conduct a proof-of-concept study which indicates that the learnable methods outperform humans on average. Code and models have been made publicly available at https://github.com/aleksispi/airloc.

CVJun 1, 2022

Semantic Room Wireframe Detection from a Single View

David Gillsjö, Gabrielle Flood, Kalle Åström

Reconstruction of indoor surfaces with limited texture information or with repeated textures, a situation common in walls and ceilings, may be difficult with a monocular Structure from Motion system. We propose a Semantic Room Wireframe Detection task to predict a Semantic Wireframe from a single perspective image. Such predictions may be used with shape priors to estimate the Room Layout and aid reconstruction. To train and test the proposed algorithm we create a new set of annotations from the simulated Structured3D dataset. We show qualitatively that the SRW-Net handles complex room geometries better than previous Room Layout Estimation algorithms while quantitatively out-performing the baseline in non-semantic Wireframe Detection.

ASAug 9, 2022

Extending GCC-PHAT using Shift Equivariant Neural Networks

Axel Berg, Mark O'Connor, Kalle Åström et al.

Speaker localization using microphone arrays depends on accurate time delay estimation techniques. For decades, methods based on the generalized cross correlation with phase transform (GCC-PHAT) have been widely adopted for this purpose. Recently, the GCC-PHAT has also been used to provide input features to neural networks in order to remove the effects of noise and reverberation, but at the cost of losing theoretical guarantees in noise-free conditions. We propose a novel approach to extending the GCC-PHAT, where the received signals are filtered using a shift equivariant neural network that preserves the timing information contained in the signals. By extensive experiments we show that our model consistently reduces the error of the GCC-PHAT in adverse environments, with guarantees of exact time delay recovery in ideal conditions.

CVApr 11, 2024Code

NeuroNCAP: Photorealistic Closed-loop Safety Testing for Autonomous Driving

William Ljungbergh, Adam Tonderski, Joakim Johnander et al.

We present a versatile NeRF-based simulator for testing autonomous driving (AD) software systems, designed with a focus on sensor-realistic closed-loop evaluation and the creation of safety-critical scenarios. The simulator learns from sequences of real-world driving sensor data and enables reconfigurations and renderings of new, unseen scenarios. In this work, we use our simulator to test the responses of AD models to safety-critical scenarios inspired by the European New Car Assessment Programme (Euro NCAP). Our evaluation reveals that, while state-of-the-art end-to-end planners excel in nominal driving scenarios in an open-loop setting, they exhibit critical flaws when navigating our safety-critical scenarios in a closed-loop setting. This highlights the need for advancements in the safety and real-world usability of end-to-end planners. By publicly releasing our simulator and scenarios as an easy-to-run evaluation suite, we invite the research community to explore, refine, and validate their AD models in controlled, yet highly configurable and challenging sensor-realistic environments. Code and instructions can be found at https://github.com/atonderski/neuro-ncap

CVJun 21, 2023

Polygon Detection for Room Layout Estimation using Heterogeneous Graphs and Wireframes

David Gillsjö, Gabrielle Flood, Kalle Åström

This paper presents a neural network based semantic plane detection method utilizing polygon representations. The method can for example be used to solve room layout estimations tasks. The method is built on, combines and further develops several different modules from previous research. The network takes an RGB image and estimates a wireframe as well as a feature space using an hourglass backbone. From these, line and junction features are sampled. The lines and junctions are then represented as an undirected graph, from which polygon representations of the sought planes are obtained. Two different methods for this last step are investigated, where the most promising method is built on a heterogeneous graph transformer. The final output is in all cases a projection of the semantic planes in 2D. The methods are evaluated on the Structured 3D dataset and we investigate the performance both using sampled and estimated wireframes. The experiments show the potential of the graph-based method by outperforming state of the art methods in Room Layout estimation in the 2D metrics using synthetic wireframe detections.

CVAug 6, 2025Code

PixCuboid: Room Layout Estimation from Multi-view Featuremetric Alignment

Gustav Hanning, Kalle Åström, Viktor Larsson

Coarse room layout estimation provides important geometric cues for many downstream tasks. Current state-of-the-art methods are predominantly based on single views and often assume panoramic images. We introduce PixCuboid, an optimization-based approach for cuboid-shaped room layout estimation, which is based on multi-view alignment of dense deep features. By training with the optimization end-to-end, we learn feature maps that yield large convergence basins and smooth loss landscapes in the alignment. This allows us to initialize the room layout using simple heuristics. For the evaluation we propose two new benchmarks based on ScanNet++ and 2D-3D-Semantics, with manually verified ground truth 3D cuboids. In thorough experiments we validate our approach and significantly outperform the competition. Finally, while our network is trained with single cuboids, the flexibility of the optimization-based approach allow us to easily extend to multi-room estimation, e.g. larger apartments or offices. Code and model weights are available at https://github.com/ghanning/PixCuboid.

CVDec 28, 2023

Geometry-Biased Transformer for Robust Multi-View 3D Human Pose Reconstruction

Olivier Moliner, Sangxia Huang, Kalle Åström

We address the challenges in estimating 3D human poses from multiple views under occlusion and with limited overlapping views. We approach multi-view, single-person 3D human pose reconstruction as a regression problem and propose a novel encoder-decoder Transformer architecture to estimate 3D poses from multi-view 2D pose sequences. The encoder refines 2D skeleton joints detected across different views and times, fusing multi-view and temporal information through global self-attention. We enhance the encoder by incorporating a geometry-biased attention mechanism, effectively leveraging geometric relationships between views. Additionally, we use detection scores provided by the 2D pose detector to further guide the encoder's attention based on the reliability of the 2D detections. The decoder subsequently regresses the 3D pose sequence from these refined tokens, using pre-defined queries for each joint. To enhance the generalization of our method to unseen scenes and improve resilience to missing joints, we implement strategies including scene centering, synthetic views, and token dropout. We conduct extensive experiments on three benchmark public datasets, Human3.6M, CMU Panoptic and Occlusion-Persons. Our results demonstrate the efficacy of our approach, particularly in occluded scenes and when few views are available, which are traditionally challenging scenarios for triangulation-based methods.

SDNov 20, 2024

SONNET: Enhancing Time Delay Estimation by Leveraging Simulated Audio

Erik Tegler, Magnus Oskarsson, Kalle Åström

Time delay estimation or Time-Difference-Of-Arrival estimates is a critical component for multiple localization applications such as multilateration, direction of arrival, and self-calibration. The task is to estimate the time difference between a signal arriving at two different sensors. For the audio sensor modality, most current systems are based on classical methods such as the Generalized Cross-Correlation Phase Transform (GCC-PHAT) method. In this paper we demonstrate that learning based methods can, even based on synthetic data, significantly outperform GCC-PHAT on novel real world data. To overcome the lack of data with ground truth for the task, we train our model on a simulated dataset which is sufficiently large and varied, and that captures the relevant characteristics of the real world problem. We provide our trained model, SONNET (Simulation Optimized Neural Network Estimator of Timeshifts), which is runnable in real-time and works on novel data out of the box for many real data applications, i.e. without re-training. We further demonstrate greatly improved performance on the downstream task of self-calibration when using our model compared to classical methods.

CVSep 19, 2025

Sparse Multiview Open-Vocabulary 3D Detection

Olivier Moliner, Viktor Larsson, Kalle Åström

The ability to interpret and comprehend a 3D scene is essential for many vision and robotics systems. In numerous applications, this involves 3D object detection, i.e.~identifying the location and dimensions of objects belonging to a specific category, typically represented as bounding boxes. This has traditionally been solved by training to detect a fixed set of categories, which limits its use. In this work, we investigate open-vocabulary 3D object detection in the challenging yet practical sparse-view setting, where only a limited number of posed RGB images are available as input. Our approach is training-free, relying on pre-trained, off-the-shelf 2D foundation models instead of employing computationally expensive 3D feature fusion or requiring 3D-specific learning. By lifting 2D detections and directly optimizing 3D proposals for featuremetric consistency across views, we fully leverage the extensive training data available in 2D compared to 3D. Through standard benchmarks, we demonstrate that this simple pipeline establishes a powerful baseline, performing competitively with state-of-the-art techniques in densely sampled scenarios while significantly outperforming them in the sparse-view setting.

IVJul 18, 2025

Converting T1-weighted MRI from 3T to 7T quality using deep learning

Malo Gicquel, Ruoyi Zhao, Anika Wuestefeld et al.

Ultra-high resolution 7 tesla (7T) magnetic resonance imaging (MRI) provides detailed anatomical views, offering better signal-to-noise ratio, resolution and tissue contrast than 3T MRI, though at the cost of accessibility. We present an advanced deep learning model for synthesizing 7T brain MRI from 3T brain MRI. Paired 7T and 3T T1-weighted images were acquired from 172 participants (124 cognitively unimpaired, 48 impaired) from the Swedish BioFINDER-2 study. To synthesize 7T MRI from 3T images, we trained two models: a specialized U-Net, and a U-Net integrated with a generative adversarial network (GAN U-Net). Our models outperformed two additional state-of-the-art 3T-to-7T models in image-based evaluation metrics. Four blinded MRI professionals judged our synthetic 7T images as comparable in detail to real 7T images, and superior in subjective visual quality to 7T images, apparently due to the reduction of artifacts. Importantly, automated segmentations of the amygdalae of synthetic GAN U-Net 7T images were more similar to manually segmented amygdalae (n=20), than automated segmentations from the 3T images that were used to synthesize the 7T images. Finally, synthetic 7T images showed similar performance to real 3T images in downstream prediction of cognitive status using MRI derivatives (n=3,168). In all, we show that synthetic T1-weighted brain images approaching 7T quality can be generated from 3T images, which may improve image quality and segmentation, without compromising performance in downstream tasks. Future directions, possible clinical use cases, and limitations are discussed.

CVFeb 4, 2022

Bootstrapped Representation Learning for Skeleton-Based Action Recognition

Olivier Moliner, Sangxia Huang, Kalle Åström

In this work, we study self-supervised representation learning for 3D skeleton-based action recognition. We extend Bootstrap Your Own Latent (BYOL) for representation learning on skeleton sequence data and propose a new data augmentation strategy including two asymmetric transformation pipelines. We also introduce a multi-viewpoint sampling method that leverages multiple viewing angles of the same action captured by different cameras. In the semi-supervised setting, we show that the performance can be further improved by knowledge distillation from wider networks, leveraging once more the unlabeled samples. We conduct extensive experiments on the NTU-60 and NTU-120 datasets to demonstrate the performance of our proposed method. Our method consistently outperforms the current state of the art on both linear evaluation and semi-supervised benchmarks.

CVMar 24, 2021

Generic Merging of Structure from Motion Maps with a Low Memory Footprint

Gabrielle Flood, David Gillsjö, Patrik Persson et al.

With the development of cheap image sensors, the amount of available image data have increased enormously, and the possibility of using crowdsourced collection methods has emerged. This calls for development of ways to handle all these data. In this paper, we present new tools that will enable efficient, flexible and robust map merging. Assuming that separate optimisations have been performed for the individual maps, we show how only relevant data can be stored in a low memory footprint representation. We use these representations to perform map merging so that the algorithm is invariant to the merging order and independent of the choice of coordinate system. The result is a robust algorithm that can be applied to several maps simultaneously. The result of a merge can also be represented with the same type of low-memory footprint format, which enables further merging and updating of the map in a hierarchical way. Furthermore, the method can perform loop closing and also detect changes in the scene between the capture of the different image sequences. Using both simulated and real data - from both a hand held mobile phone and from a drone - we verify the performance of the proposed method.

CVMar 15, 2021

Trust Your IMU: Consequences of Ignoring the IMU Drift

Marcus Valtonen Örnhag, Patrik Persson, Mårten Wadenbäck et al.

In this paper, we argue that modern pre-integration methods for inertial measurement units (IMUs) are accurate enough to ignore the drift for short time intervals. This allows us to consider a simplified camera model, which in turn admits further intrinsic calibration. We develop the first-ever solver to jointly solve the relative pose problem with unknown and equal focal length and radial distortion profile while utilizing the IMU data. Furthermore, we show significant speed-up compared to state-of-the-art algorithms, with small or negligible loss in accuracy for partially calibrated setups. The proposed algorithms are tested on both synthetic and real data, where the latter is focused on navigation using unmanned aerial vehicles (UAVs). We evaluate the proposed solvers on different commercially available low-cost UAVs, and demonstrate that the novel assumption on IMU drift is feasible in real-life applications. The extended intrinsic auto-calibration enables us to use distorted input images, making tedious calibration processes obsolete, compared to current state-of-the-art methods.

CVOct 16, 2020

In Depth Bayesian Semantic Scene Completion

David Gillsjö, Kalle Åström

This work studies Semantic Scene Completion which aims to predict a 3D semantic segmentation of our surroundings, even though some areas are occluded. For this we construct a Bayesian Convolutional Neural Network (BCNN), which is not only able to perform the segmentation, but also predict model uncertainty. This is an important feature not present in standard CNNs. We show on the MNIST dataset that the Bayesian approach performs equal or better to the standard CNN when processing digits unseen in the training phase when looking at accuracy, precision and recall. With the added benefit of having better calibrated scores and the ability to express model uncertainty. We then show results for the Semantic Scene Completion task where a category is introduced at test time on the SUNCG dataset. In this more complex task the Bayesian approach outperforms the standard CNN. Showing better Intersection over Union score and excels in Average Precision and separation scores.

CVOct 8, 2020

Efficient Real-Time Radial Distortion Correction for UAVs

Marcus Valtonen Örnhag, Patrik Persson, Mårten Wadenbäck et al.

In this paper we present a novel algorithm for onboard radial distortion correction for unmanned aerial vehicles (UAVs) equipped with an inertial measurement unit (IMU), that runs in real-time. This approach makes calibration procedures redundant, thus allowing for exchange of optics extemporaneously. By utilizing the IMU data, the cameras can be aligned with the gravity direction. This allows us to work with fewer degrees of freedom, and opens up for further intrinsic calibration. We propose a fast and robust minimal solver for simultaneously estimating the focal length, radial distortion profile and motion parameters from homographies. The proposed solver is tested on both synthetic and real data, and perform better or on par with state-of-the-art methods relying on pre-calibration procedures.

CVMar 16, 2020

Minimal Solvers for Indoor UAV Positioning

Marcus Valtonen Örnhag, Patrik Persson, Mårten Wadenbäck et al.

In this paper we consider a collection of relative pose problems which arise naturally in applications for visual indoor UAV navigation. We focus on cases where additional information from an onboard IMU is available and thus provides a partial extrinsic calibration through the gravitational vector. The solvers are designed for a partially calibrated camera, for a variety of realistic indoor scenarios, which makes it possible to navigate using images of the ground floor. Current state-of-the-art solvers use more general assumptions, such as using arbitrary planar structures; however, these solvers do not yield adequate reconstructions for real scenes, nor do they perform fast enough to be incorporated in real-time systems. We show that the proposed solvers enjoy better numerical stability, are faster, and require fewer point correspondences, compared to state-of-the-art solvers. These properties are vital components for robust navigation in real-time systems, and we demonstrate on both synthetic and real data that our method outperforms other methods, and yields superior motion estimation.

CVMar 12, 2018

Beyond Gröbner Bases: Basis Selection for Minimal Solvers

Viktor Larsson, Magnus Oskarsson, Kalle Åström et al.

Many computer vision applications require robust estimation of the underlying geometry, in terms of camera motion and 3D structure of the scene. These robust methods often rely on running minimal solvers in a RANSAC framework. In this paper we show how we can make polynomial solvers based on the action matrix method faster, by careful selection of the monomial bases. These monomial bases have traditionally been based on a Gröbner basis for the polynomial ideal. Here we describe how we can enumerate all such bases in an efficient way. We also show that going beyond Gröbner bases leads to more efficient solvers in many cases. We present a novel basis sampling scheme that we evaluate on a number of problems.

SDOct 7, 2016

An Automatic System for Acoustic Microphone Geometry Calibration based on Minimal Solvers

Simayijiang Zhayida, Simon Segerblom Rex, Yubin Kuang et al.

In this paper, robust detection, tracking and geometry estimation methods are developed and combined into a system for estimating time-difference estimates, microphone localization and sound source movement. No assumptions on the 3D locations of the microphones and sound sources are made. The system is capable of tracking continuously moving sound sources in an reverberant environment. The multi-path components are explicitly tracked and used in the geometry estimation parts. The system is based on matching between pairs of channels using GCC-PHAT. Instead of taking a single maximum at each time instant from each such pair, we select the four strongest local maxima. This produce a set of hypothesis to work with in the subsequent steps, where consistency constraints between the channels and time-continuity constraints are exploited. In the paper it demonstrated how such detections can be used to estimate microphone positions, sound source movement and room geometry. The methods are tested and verified using real data from several reverberant environments. The evaluation demonstrated accuracy in the order of few millimeters.