CVOct 4, 2022Code
FreDSNet: Joint Monocular Depth and Semantic Segmentation with Fast Fourier ConvolutionsBruno Berenguel-Baeta, Jesus Bermudez-Cameo, Jose J. Guerrero
In this work we present FreDSNet, a deep learning solution which obtains semantic 3D understanding of indoor environments from single panoramas. Omnidirectional images reveal task-specific advantages when addressing scene understanding problems due to the 360-degree contextual information about the entire environment they provide. However, the inherent characteristics of the omnidirectional images add additional problems to obtain an accurate detection and segmentation of objects or a good depth estimation. To overcome these problems, we exploit convolutions in the frequential domain obtaining a wider receptive field in each convolutional layer. These convolutions allow to leverage the whole context information from omnidirectional images. FreDSNet is the first network that jointly provides monocular depth estimation and semantic segmentation from a single panoramic image exploiting fast Fourier convolutions. Our experiments show that FreDSNet has similar performance as specific state of the art methods for semantic segmentation and depth estimation. FreDSNet code is publicly available in https://github.com/Sbrunoberenguel/FreDSNet
CVMar 2, 2023
Bayesian Deep Learning for Affordance Segmentation in imagesLorenzo Mur-Labadia, Ruben Martinez-Cantin, Jose J. Guerrero
Affordances are a fundamental concept in robotics since they relate available actions for an agent depending on its sensory-motor capabilities and the environment. We present a novel Bayesian deep network to detect affordances in images, at the same time that we quantify the distribution of the aleatoric and epistemic variance at the spatial level. We adapt the Mask-RCNN architecture to learn a probabilistic representation using Monte Carlo dropout. Our results outperform the state-of-the-art of deterministic networks. We attribute this improvement to a better probabilistic feature space representation on the encoder and the Bayesian variability induced at the mask generation, which adapts better to the object contours. We also introduce the new Probability-based Mask Quality measure that reveals the semantic and spatial differences on a probabilistic instance segmentation model. We modify the existing Probabilistic Detection Quality metric by comparing the binary masks rather than the predicted bounding boxes, achieving a finer-grained evaluation of the probabilistic segmentation. We find aleatoric variance in the contours of the objects due to the camera noise, while epistemic variance appears in visual challenging pixels.
CVSep 5, 2023
Multi-label affordance mapping from egocentric visionLorenzo Mur-Labadia, Jose J. Guerrero, Ruben Martinez-Cantin
Accurate affordance detection and segmentation with pixel precision is an important piece in many complex systems based on interactions, such as robots and assitive devices. We present a new approach to affordance perception which enables accurate multi-label segmentation. Our approach can be used to automatically extract grounded affordances from first person videos of interactions using a 3D map of the environment providing pixel level precision for the affordance location. We use this method to build the largest and most complete dataset on affordances based on the EPIC-Kitchen dataset, EPIC-Aff, which provides interaction-grounded, multi-label, metric and spatial affordance annotations. Then, we propose a new approach to affordance segmentation based on multi-label detection which enables multiple affordances to co-exists in the same space, for example if they are associated with the same object. We present several strategies of multi-label detection using several segmentation architectures. The experimental results highlight the importance of the multi-label detection. Finally, we show how our metric representation can be exploited for build a map of interaction hotspots in spatial action-centric zones and use that representation to perform a task-oriented navigation.
CVMay 20, 2022
Assessing visual acuity in visual prostheses through a virtual-reality systemMelani Sanchez-Garcia, Roberto Morollon-Ruiz, Ruben Martinez-Cantin et al.
Current visual implants still provide very low resolution and limited field of view, thus limiting visual acuity in implanted patients. Developments of new strategies of artificial vision simulation systems by harnessing new advancements in technologies are of upmost priorities for the development of new visual devices. In this work, we take advantage of virtual-reality software paired with a portable head-mounted display and evaluated the performance of normally sighted participants under simulated prosthetic vision with variable field of view and number of pixels. Our simulated prosthetic vision system allows simple experimentation in order to study the design parameters of future visual prostheses. Ten normally sighted participants volunteered for a visual acuity study. Subjects were required to identify computer-generated Landolt-C gap orientation and different stimulus based on light perception, time-resolution, light location and motion perception commonly used for visual acuity examination in the sighted. Visual acuity scores were recorded across different conditions of number of electrodes and size of field of view. Our results showed that of all conditions tested, a field of view of 20° and 1000 phosphenes of resolution proved the best, with a visual acuity of 1.3 logMAR. Furthermore, performance appears to be correlated with phosphene density, but showing a diminishing return when field of view is less than 20°. The development of new artificial vision simulation systems can be useful to guide the development of new visual devices and the optimization of field of view and resolution to provide a helpful and valuable visual aid to profoundly or totally blind patients.
CVFeb 16
Integrating Affordances and Attention models for Short-Term Object Interaction AnticipationLorenzo Mur Labadia, Ruben Martinez-Cantin, Jose J. Guerrero et al.
Short Term object-interaction Anticipation consists in detecting the location of the next active objects, the noun and verb categories of the interaction, as well as the time to contact from the observation of egocentric video. This ability is fundamental for wearable assistants to understand user goals and provide timely assistance, or to enable human-robot interaction. In this work, we present a method to improve the performance of STA predictions. Our contributions are two-fold: 1 We propose STAformer and STAformer plus plus, two novel attention-based architectures integrating frame-guided temporal pooling, dual image-video attention, and multiscale feature fusion to support STA predictions from an image-input video pair; 2 We introduce two novel modules to ground STA predictions on human behavior by modeling affordances. First, we integrate an environment affordance model which acts as a persistent memory of interactions that can take place in a given physical scene. We explore how to integrate environment affordances via simple late fusion and with an approach which adaptively learns how to best fuse affordances with end-to-end predictions. Second, we predict interaction hotspots from the observation of hands and object trajectories, increasing confidence in STA predictions localized around the hotspot. Our results show significant improvements on Overall Top-5 mAP, with gain up to +23p.p on Ego4D and +31p.p on a novel set of curated EPIC-Kitchens STA labels. We released the code, annotations, and pre-extracted affordances on Ego4D and EPIC-Kitchens to encourage future research in this area.
CVJul 22, 2018Code
RGBiD-SLAM for Accurate Real-time Localisation and 3D MappingDaniel Gutierrez-Gomez, Jose J. Guerrero
In this paper we present a complete SLAM system for RGB-D cameras, namely RGB-iD SLAM. The presented approach is a dense direct SLAM method with the main characteristic of working with the depth maps in inverse depth parametrisation for the routines of dense alignment or keyframe fusion. The system consists in 2 CPU threads working in parallel, which share the use of the GPU for dense alignment and keyframe fusion routines. The first thread is a front-end operating at frame rate, which processes every incoming frame from the RGB-D sensor to compute the incremental odometry and integrate it in a keyframe which is changed periodically following a covisibility-based strategy. The second thread is a back-end which receives keyframes from the front-end. This thread is in charge of segmenting the keyframes based on their structure, describing them using Bags of Words, trying to find potential loop closures with previous keyframes, and in such case perform pose-graph optimisation for trajectory correction. In addition, our system allows is able to compute the odometry both with unregistered and registered depth maps, allowing to use customised calibrations of the RGB-D sensor. As a consequence in the paper we also propose a detailed calibration pipeline to compute customised calibrations for particular RGB-D cameras. The experiments with our approach in the TUM RGB-D benchmark datasets show results superior in accuracy to the state-of-the-art in many of the sequences. The code has been made available on-line for research purposes https://github.com/dangut/RGBiD-SLAM.
CVMay 5
Uncertainty Estimation in Instance Segmentation of Affordances via Bayesian Visual TransformersLorenzo Mur-Labadia, Ruben Martinez-Cantina, Jose J. Guerrero
Visual affordances identify regions in an image with potential interactions, offering a novel paradigm for scene understanding. Recognizing affordances allows autonomous robots to act more naturally, could enhance human-robot interactions, enrich augmented reality systems, and benefit prosthetic vision devices. Accurate and localized prediction of affordance regions, rather than general saliency maps is crucial for these applications. We present a model for instance segmentation of affordances by adopting sample-based and ensembles approaches for uncertainty estimation. We extend an attention-based architecture for our novel task, showing with detailed ablation experiments the effects of each component. By comparing the distribution of these different detections, we extract pixel-wise epistemic and aleatoric variances at both the semantic and spatial levels. In addition, we propose a novel measure called Probability-based Mask Quality, which enables a comprehensive analysis of semantic and spatial variations in a probabilistic instance segmentation model. Our results show that the global consensus of multiple sub-networks of Bayesian models improve deterministic networks due to a better mask refinement and generalization. This fact, joined with the more powerful features extracted by attention-based mechanisms, represent an improvement of +7.4 p.p on the $F_β^w$ score in the challenging IIT-Aff dataset. Bayesian models are also better calibrated, producing less overconfident probabilities and with a better uncertainty estimation. Qualitative results show that aleatoric variance appears in the contour of the objects, while the epistemic variance is observed in visual challenging pixels, adding interpretability to the neural network.
CVJan 30, 2024
OmniSCV: An Omnidirectional Synthetic Image Generator for Computer VisionBruno Berenguel-Baeta, Jesus Bermudez-Cameo, Jose J. Guerrero
Omnidirectional and 360° images are becoming widespread in industry and in consumer society, causing omnidirectional computer vision to gain attention. Their wide field of view allows the gathering of a great amount of information about the environment from only an image. However, the distortion of these images requires the development of specific algorithms for their treatment and interpretation. Moreover, a high number of images is essential for the correct training of computer vision algorithms based on learning. In this paper, we present a tool for generating datasets of omnidirectional images with semantic and depth information. These images are synthesized from a set of captures that are acquired in a realistic virtual environment for Unreal Engine 4 through an interface plugin. We gather a variety of well-known projection models such as equirectangular and cylindrical panoramas, different fish-eye lenses, catadioptric systems, and empiric models. Furthermore, we include in our tool photorealistic non-central-projection systems as non-central panoramas and non-central catadioptric systems. As far as we know, this is the first reported tool for generating photorealistic non-central images in the literature. Moreover, since the omnidirectional images are made virtually, we provide pixel-wise information about semantics and depth as well as perfect knowledge of the calibration parameters of the cameras. This allows the creation of ground-truth information with pixel precision for training learning algorithms and testing 3D vision approaches. To validate the proposed tool, different computer vision algorithms are tested as line extractions from dioptric and catadioptric central images, 3D Layout recovery and SLAM using equirectangular panoramas, and 3D reconstruction from non-central panoramas.
HCJan 28, 2025
Influence of field of view in visual prostheses design: Analysis with a VR systemMelani Sanchez-Garcia, Ruben Martinez-Cantin, Jesus Bermudez-Cameo et al.
Visual prostheses are designed to restore partial functional vision in patients with total vision loss. Retinal visual prostheses provide limited capabilities as a result of low resolution, limited field of view and poor dynamic range. Understanding the influence of these parameters in the perception results can guide prostheses research and design. In this work, we evaluate the influence of field of view with respect to spatial resolution in visual prostheses, measuring the accuracy and response time in a search and recognition task. Twenty-four normally sighted participants were asked to find and recognize usual objects, such as furniture and home appliance in indoor room scenes. For the experiment, we use a new simulated prosthetic vision system that allows simple and effective experimentation. Our system uses a virtual-reality environment based on panoramic scenes. The simulator employs a head-mounted display which allows users to feel immersed in the scene by perceiving the entire scene all around. Our experiments use public image datasets and a commercial head-mounted display. We have also released the virtual-reality software for replicating and extending the experimentation. Results show that the accuracy and response time decrease when the field of view is increased. Furthermore, performance appears to be correlated with the angular resolution, but showing a diminishing return even with a resolution of less than 2.3 phosphenes per degree. Our results seem to indicate that, for the design of retinal prostheses, it is better to concentrate the phosphenes in a small area, to maximize the angular resolution, even if that implies sacrificing field of view.
ROJan 30, 2024
Floor extraction and door detection for visually impaired guidanceBruno Berenguel-Baeta, Manuel Guerrero-Viu, Alejandro de Nova et al.
Finding obstacle-free paths in unknown environments is a big navigation issue for visually impaired people and autonomous robots. Previous works focus on obstacle avoidance, however they do not have a general view of the environment they are moving in. New devices based on computer vision systems can help impaired people to overcome the difficulties of navigating in unknown environments in safe conditions. In this work it is proposed a combination of sensors and algorithms that can lead to the building of a navigation system for visually impaired people. Based on traditional systems that use RGB-D cameras for obstacle avoidance, it is included and combined the information of a fish-eye camera, which will give a better understanding of the user's surroundings. The combination gives robustness and reliability to the system as well as a wide field of view that allows to obtain many information from the environment. This combination of sensors is inspired by human vision where the center of the retina (fovea) provides more accurate information than the periphery, where humans have a wider field of view. The proposed system is mounted on a wearable device that provides the obstacle-free zones of the scene, allowing the planning of trajectories for people guidance.
CVJan 30, 2024
Atlanta Scaled layouts from non-central panoramasBruno Berenguel-Baeta, Jesus Bermudez-Cameo, Jose J. Guerrero
In this work we present a novel approach for 3D layout recovery of indoor environments using a non-central acquisition system. From a non-central panorama, full and scaled 3D lines can be independently recovered by geometry reasoning without geometric nor scale assumptions. However, their sensitivity to noise and complex geometric modeling has led these panoramas being little investigated. Our new pipeline aims to extract the boundaries of the structural lines of an indoor environment with a neural network and exploit the properties of non-central projection systems in a new geometrical processing to recover an scaled 3D layout. The results of our experiments show that we improve state-of-the-art methods for layout reconstruction and line extraction in non-central projection systems. We completely solve the problem in Manhattan and Atlanta environments, handling occlusions and retrieving the metric scale of the room without extra measurements. As far as the authors knowledge goes, our approach is the first work using deep learning on non-central panoramas and recovering scaled layouts from single panoramas.
CVFeb 2, 2024
Scaled 360 layouts: Revisiting non-central panoramasBruno Berenguel-Baeta, Jesus Bermudez-Cameo, Jose J. Guerrero
From a non-central panorama, 3D lines can be recovered by geometric reasoning. However, their sensitivity to noise and the complex geometric modeling required has led these panoramas being very little investigated. In this work we present a novel approach for 3D layout recovery of indoor environments using single non-central panoramas. We obtain the boundaries of the structural lines of the room from a non-central panorama using deep learning and exploit the properties of non-central projection systems in a new geometrical processing to recover the scaled layout. We solve the problem for Manhattan environments, handling occlusions, and also for Atlanta environments in an unified method. The experiments performed improve the state-of-the-art methods for 3D layout recovery from a single panorama. Our approach is the first work using deep learning with non-central panoramas and recovering the scale of single panorama layouts.
CVFeb 2, 2024
Convolution kernel adaptation to calibrated fisheyeBruno Berenguel-Baeta, Maria Santos-Villafranca, Jesus Bermudez-Cameo et al.
Convolution kernels are the basic structural component of convolutional neural networks (CNNs). In the last years there has been a growing interest in fisheye cameras for many applications. However, the radially symmetric projection model of these cameras produces high distortions that affect the performance of CNNs, especially when the field of view is very large. In this work, we tackle this problem by proposing a method that leverages the calibration of cameras to deform the convolution kernel accordingly and adapt to the distortion. That way, the receptive field of the convolution is similar to standard convolutions in perspective images, allowing us to take advantage of pre-trained networks in large perspective datasets. We show how, with just a brief fine-tuning stage in a small dataset, we improve the performance of the network for the calibrated fisheye with respect to standard convolutions in depth estimation and semantic segmentation.
CVApr 11, 2025
Multimodal Knowledge Distillation for Egocentric Action Recognition Robust to Missing ModalitiesMaria Santos-Villafranca, Dustin Carrión-Ojeda, Alejandro Perez-Yus et al.
Existing methods for egocentric action recognition often rely solely on RGB videos, while additional modalities, e.g., audio, can improve accuracy in challenging scenarios. However, most prior multimodal approaches assume all modalities are available at inference, leading to significant accuracy drops, or even failure, when inputs are missing. To address this, we introduce KARMMA, a multimodal Knowledge distillation approach for egocentric Action Recognition robust to Missing ModAlities that requires no modality alignment across all samples during training or inference. KARMMA distills knowledge from a multimodal teacher into a multimodal student that benefits from all available modalities while remaining robust to missing ones, making it suitable for diverse multimodal scenarios without retraining. Our student uses approximately 50% fewer computational resources than our teacher, resulting in a lightweight and fast model. Experiments on Epic-Kitchens and Something-Something show that our student achieves competitive accuracy while significantly reducing accuracy drops under missing modality conditions.
CVFeb 2, 2024
Visual Gyroscope: Combination of Deep Learning Features and Direct Alignment for Panoramic StabilizationBruno Berenguel-Baeta, Antoine N. Andre, Guillaume Caron et al.
In this article we present a visual gyroscope based on equirectangular panoramas. We propose a new pipeline where we take advantage of combining three different methods to obtain a robust and accurate estimation of the attitude of the camera. We quantitatively and qualitatively validate our method on two image sequences taken with a $360^\circ$ dual-fisheye camera mounted on different aerial vehicles.
DBJan 30, 2024
Non-central panorama indoor datasetBruno Berenguel-Baeta, Jesus Bermudez-Cameo, Jose J. Guerrero
Omnidirectional images are one of the main sources of information for learning based scene understanding algorithms. However, annotated datasets of omnidirectional images cannot keep the pace of these learning based algorithms development. Among the different panoramas and in contrast to standard central ones, non-central panoramas provide geometrical information in the distortion of the image from which we can retrieve 3D information of the environment [2]. However, due to the lack of commercial non-central devices, up until now there was no dataset of these kinds of panoramas. In this data paper, we present the first dataset of non-central panoramas for indoor scene understanding. The dataset is composed by {\bf 2574} RGB non-central panoramas taken in around 650 different rooms. Each panorama has associated a depth map and annotations to obtain the layout of the room from the image as a structural edge map, list of corners in the image, the 3D corners of the room and the camera pose. The images are taken from photorealistic virtual environments and pixel-wise automatically annotated.
CVMar 31, 2022
Semantic Pose Verification for Outdoor Visual Localization with Self-supervised Contrastive LearningSemih Orhan, Jose J. Guerrero, Yalin Bastanlar
Any city-scale visual localization system has to overcome long-term appearance changes, such as varying illumination conditions or seasonal changes between query and database images. Since semantic content is more robust to such changes, we exploit semantic information to improve visual localization. In our scenario, the database consists of gnomonic views generated from panoramic images (e.g. Google Street View) and query images are collected with a standard field-of-view camera at a different time. To improve localization, we check the semantic similarity between query and database images, which is not trivial since the position and viewpoint of the cameras do not exactly match. To learn similarity, we propose training a CNN in a self-supervised fashion with contrastive learning on a dataset of semantically segmented images. With experiments we showed that this semantic similarity estimation approach works better than measuring the similarity at pixel-level. Finally, we used the semantic similarity scores to verify the retrievals obtained by a state-of-the-art visual localization method and observed that contrastive learning-based pose verification increases top-1 recall value to 0.90 which corresponds to a 2% improvement.
CVSep 30, 2021
Augmented reality navigation system for visual prosthesisMelani Sanchez-Garcia, Alejandro Perez-Yus, Ruben Martinez-Cantin et al.
The visual functions of visual prostheses such as field of view, resolution and dynamic range, seriously restrict the person's ability to navigate in unknown environments. Implanted patients still require constant assistance for navigating from one location to another. Hence, there is a need for a system that is able to assist them safely during their journey. In this work, we propose an augmented reality navigation system for visual prosthesis that incorporates a software of reactive navigation and path planning which guides the subject through convenient, obstacle-free route. It consists on four steps: locating the subject on a map, planning the subject trajectory, showing it to the subject and re-planning without obstacles. We have also designed a simulated prosthetic vision environment which allows us to systematically study navigation performance. Twelve subjects participated in the experiment. Subjects were guided by the augmented reality navigation system and their instruction was to navigate through different environments until they reached two goals, cross the door and find an object (bin), as fast and accurately as possible. Results show how our augmented navigation system help navigation performance by reducing the time and distance to reach the goals, even significantly reducing the number of obstacles collisions, compared to other baseline methods.
CVMar 17, 2020
Unsupervised Learning of Category-Specific Symmetric 3D Keypoints from Point SetsClara Fernandez-Labrador, Ajad Chhatkuli, Danda Pani Paudel et al.
Automatic discovery of category-specific 3D keypoints from a collection of objects of some category is a challenging problem. One reason is that not all objects in a category necessarily have the same semantic parts. The level of difficulty adds up further when objects are represented by 3D point clouds, with variations in shape and unknown coordinate frames. We define keypoints to be category-specific, if they meaningfully represent objects' shape and their correspondences can be simply established order-wise across all objects. This paper aims at learning category-specific 3D keypoints, in an unsupervised manner, using a collection of misaligned 3D point clouds of objects from an unknown category. In order to do so, we model shapes defined by the keypoints, within a category, using the symmetric linear basis shapes without assuming the plane of symmetry to be known. The usage of symmetry prior leads us to learn stable keypoints suitable for higher misalignments. To the best of our knowledge, this is the first work on learning such keypoints directly from 3D point clouds. Using categories from four benchmark datasets, we demonstrate the quality of our learned keypoints by quantitative and qualitative evaluations. Our experiments also show that the keypoints discovered by our method are geometrically and semantically consistent.
CVOct 14, 2019
What's in my Room? Object Recognition on Indoor Panoramic ImagesJulia Guerrero-Viu, Clara Fernandez-Labrador, Cédric Demonceaux et al.
In the last few years, there has been a growing interest in taking advantage of the 360 panoramic images potential, while managing the new challenges they imply. While several tasks have been improved thanks to the contextual information these images offer, object recognition in indoor scenes still remains a challenging problem that has not been deeply investigated. This paper provides an object recognition system that performs object detection and semantic segmentation tasks by using a deep learning model adapted to match the nature of equirectangular images. From these results, instance segmentation masks are recovered, refined and transformed into 3D bounding boxes that are placed into the 3D model of the room. Quantitative and qualitative results support that our method outperforms the state of the art by a large margin and show a complete understanding of the main objects in indoor scenes.
CVMar 19, 2019
Corners for Layout: End-to-End Layout Recovery from 360 ImagesClara Fernandez-Labrador, Jose M. Facil, Alejandro Perez-Yus et al.
The problem of 3D layout recovery in indoor scenes has been a core research topic for over a decade. However, there are still several major challenges that remain unsolved. Among the most relevant ones, a major part of the state-of-the-art methods make implicit or explicit assumptions on the scenes -- e.g. box-shaped or Manhattan layouts. Also, current methods are computationally expensive and not suitable for real-time applications like robot navigation and AR/VR. In this work we present CFL (Corners for Layout), the first end-to-end model for 3D layout recovery on 360 images. Our experimental results show that we outperform the state of the art relaxing assumptions about the scene and at a lower cost. We also show that our model generalizes better to camera position variations than conventional approaches by using EquiConvs, a type of convolution applied directly on the sphere projection and hence invariant to the equirectangular distortions. CFL Webpage: https://cfernandezlab.github.io/CFL/
CVSep 25, 2018
Semantic and structural image segmentation for prosthetic visionMelani Sanchez-Garcia, Ruben Martinez-Cantin, Jose J. Guerrero
Prosthetic vision is being applied to partially recover the retinal stimulation of visually impaired people. However, the phosphenic images produced by the implants have very limited information bandwidth due to the poor resolution and lack of color or contrast. The ability of object recognition and scene understanding in real environments is severely restricted for prosthetic users. Computer vision can play a key role to overcome the limitations and to optimize the visual information in the simulated prosthetic vision, improving the amount of information that is presented. We present a new approach to build a schematic representation of indoor environments for phosphene images. The proposed method combines a variety of convolutional neural networks for extracting and conveying relevant information about the scene such as structural informative edges of the environment and silhouettes of segmented objects. Experiments were conducted with normal sighted subjects with a Simulated Prosthetic Vision system. The results show good accuracy for object recognition and room identification tasks for indoor scenes using the proposed approach, compared to other image processing methods.
CVAug 29, 2018
PanoRoom: From the Sphere to the 3D LayoutClara Fernandez-Labrador, Jose M. Facil, Alejandro Perez-Yus et al.
We propose a novel FCN able to work with omnidirectional images that outputs accurate probability maps representing the main structure of indoor scenes, which is able to generalize on different data. Our approach handles occlusions and recovers complex shaped rooms more faithful to the actual shape of the real scenes. We outperform the state of the art not only in accuracy of the 3D models but also in speed.
CVJun 21, 2018
Layouts from Panoramic Images with Geometry and Deep LearningClara Fernandez-Labrador, Alejandro Perez-Yus, Gonzalo Lopez-Nicolas et al.
In this paper, we propose a novel procedure for 3D layout recovery of indoor scenes from single 360 degrees panoramic images. With such images, all scene is seen at once, allowing to recover closed geometries. Our method combines strategically the accuracy provided by geometric reasoning (lines and vanishing points) with the higher level of data abstraction and pattern recognition achieved by deep learning techniques (edge and normal maps). Thus, we extract structural corners from which we generate layout hypotheses of the room assuming Manhattan world. The best layout model is selected, achieving good performance on both simple rooms (box-type) and complex shaped rooms (with more than four walls). Experiments of the proposed approach are conducted within two public datasets, SUN360 and Stanford (2D-3D-S) demonstrating the advantages of estimating layouts by combining geometry and deep learning and the effectiveness of our proposal with respect to the state of the art.