CV Aug 27, 2022
Weakly and Semi-Supervised Detection, Segmentation and Tracking of Table Grapes with Limited and Noisy Data
Thomas A. Ciarfuglia, Ionut M. Motoi, Leonardo Saraceni et al.
Detection, segmentation and tracking of fruits and vegetables are three fundamental tasks for precision agriculture, enabling robotic harvesting and yield estimation applications. However, modern algorithms are data hungry and it is not always possible to gather enough data to apply the best performing supervised approaches. Since data collection is an expensive and cumbersome task, the enabling technologies for using computer vision in agriculture are often out of reach for small businesses. Following previous work in this context, where we proposed an initial weakly supervised solution to reduce the data needed to get state-of-the-art detection and segmentation in precision agriculture applications, here we improve that system and explore the problem of tracking fruits in orchards. We present the case of vineyards of table grapes in southern Lazio (Italy), since grapes are a difficult fruit to segment due to occlusion, color and general illumination conditions. We consider the case in which there is some initial labelled data that could work as source data (e.g. wine grape data), but it is considerably different from the target data (e.g. table grape data). To improve detection and segmentation on the target data, we propose to train the segmentation algorithm with a weak bounding box label, while for tracking we leverage 3D Structure from Motion algorithms to generate new labels from already labelled samples. Finally, the two systems are combined in a full semi-supervised approach. Comparisons with state-of-the-art supervised solutions show that our methods are able to train new models that achieve high performance with few labelled images and with very simple labelling.
AI Dec 14, 2022
A Hierarchical Framework for Collaborative Artificial Intelligence
James L. Crowley, Joëlle L Coutaz, Jasmin Grosinger et al.
We propose a hierarchical framework for collaborative intelligent systems. This framework organizes research challenges based on the nature of the collaborative activity and the information that must be shared, with each level building on capabilities provided by lower levels. We review research paradigms at each level, with a description of classical engineering-based approaches and modern alternatives based on machine learning, illustrated with a running example using a hypothetical personal service robot. We discuss cross-cutting issues that occur at all levels, focusing on the problem of communicating and sharing comprehension, the role of explanation and the social nature of collaboration. We conclude with a summary of research challenges and a discussion of the potential for economic and societal impact provided by technologies that enhance human abilities and empower people and society through collaboration with Intelligent Systems.
CV Apr 11, 2022
Permutation-Invariant Relational Network for Multi-person 3D Pose Estimation
Nicolas Ugrinovic, Adria Ruiz, Antonio Agudo et al.
The recovery of multi-person 3D poses from a single RGB image is a severely ill-conditioned problem due to the inherent 2D-3D depth ambiguity, inter-person occlusions, and body truncations. To tackle these issues, recent works have shown promising results by simultaneously reasoning for different people. However, in most cases this is done by only considering pairwise person interactions, thus hindering a holistic scene representation able to capture long-range interactions. This is addressed by approaches that jointly process all people in the scene, although they require defining one of the individuals as a reference and a pre-defined person ordering, and are sensitive to this choice. In this paper, we overcome both these limitations and propose an approach for multi-person 3D pose estimation that captures long-range interactions independently of the input order. For this purpose, we build a residual-like permutation-invariant network that successfully refines potentially corrupted initial 3D poses estimated by an off-the-shelf detector. The residual function is learned via Set Transformer blocks, which model the interactions among all initial poses, no matter their ordering or number. A thorough evaluation demonstrates that our approach is able to boost the performance of the initially estimated 3D poses by large margins, achieving state-of-the-art results on standardized benchmarks. Additionally, the proposed module works in a computationally efficient manner and can potentially be used as a drop-in complement for any 3D pose detector in multi-person scenes.
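To illustrate what "independently of the input order" means for a residual refinement step, the following is a minimal sketch and not the authors' network: it applies one untrained self-attention pass (identity projections, no positional encoding) over a set of pose vectors and adds the result as a residual. The function name `attention_refine` and the residual scale `0.1` are hypothetical choices made for this example only.

```python
import math

def attention_refine(poses, residual_scale=0.1):
    """Residual self-attention over a set of pose vectors.

    Hypothetical sketch: identity query/key/value projections and no
    positional encoding, so the output for each person does not depend
    on where that person appears in the input list.
    """
    dim = len(poses[0])
    scale = math.sqrt(dim)
    refined = []
    for query in poses:
        # Scaled dot-product scores of this pose against every pose in the set.
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in poses]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        context = [
            sum(w * value[i] for w, value in zip(weights, poses)) / total
            for i in range(dim)
        ]
        # Residual update: the initial estimate plus a correction from the set.
        refined.append([x + residual_scale * c for x, c in zip(query, context)])
    return refined

poses = [[0.0, 1.0, 2.0], [1.0, 0.5, -1.0], [2.0, 2.0, 0.0]]
print(attention_refine(poses))
```

Reordering the input list reorders the output rows in the same way and leaves each person's refined vector unchanged (up to floating-point rounding), and the function accepts any number of people.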
RO Jun 15, 2022
Body Gesture Recognition to Control a Social Robot
Javier Laplaza, Joan Jaume Oliver, Ramón Romero et al.
In this work, we propose a gesture-based language to allow humans to interact with robots using their body in a natural way. We have created a new gesture detection model using neural networks and a custom dataset of humans performing a set of body gestures to train our network. Furthermore, we compare body gesture communication with other communication channels to acknowledge the importance of adding this knowledge to robots. The presented approach is extensively validated in diverse simulations and real-life experiments with non-trained volunteers. It attains remarkable results and shows that it is a valuable framework for social robotics applications, such as human-robot collaboration or human-robot interaction.
RO Jun 1, 2022
Perception-Intention-Action Cycle in Human-Robot Collaborative Tasks
J. E. Dominguez-Vidal, Nicolas Rodriguez, Rene Alquezar et al.
In this work we argue that, in Human-Robot Collaboration (HRC) tasks, the Perception-Action cycle cannot fully explain the collaborative behaviour of the human and the robot, and that it has to be extended to a Perception-Intention-Action cycle, where Intention is a key element. In some cases, an agent's Intention can be perceived or inferred by the other agent, but in others it has to be explicitly communicated to the other agent for the HRC task to achieve its goal. The Perception-Intention-Action cycle includes three basic functional procedures: Perception-Intention, Situation Awareness and Action. Perception and Intention are the inputs to Situation Awareness, which evaluates the current situation and projects it into the future. The agents receive this information, plan and agree on the actions to be executed, and modify their action roles while performing the HRC task. In this work, we validate the Perception-Intention-Action cycle in a joint object transportation task, modeling the cycle through a force model which uses real-life and social forces. The perceived world is projected into a force world, and the human intention (perceived or informed) is also modelled as a force that acts in the HRC task. Finally, we show that the action roles (master-slave, collaborative, neutral or adversary) are intrinsic to any HRC task and that they appear in the different steps of a collaborative sequence of actions performed during the task.
RO Oct 15, 2022
Robot Navigation Anticipative Strategies in Deep Reinforcement Motion Planning
Óscar Gil, Alberto Sanfeliu
The navigation of robots in dynamic urban environments requires elaborate anticipative strategies for the robot to avoid collisions with dynamic objects, like bicycles or pedestrians, and to be human aware. We have developed and analyzed three anticipative strategies in motion planning that take into account the future motion of mobile objects moving at up to 18 km/h. First, we have used our hybrid policy resulting from a Deep Deterministic Policy Gradient (DDPG) training and the Social Force Model (SFM), and we have tested it in simulation in four complex map scenarios with many pedestrians. Second, we have used these anticipative strategies in real-life experiments using the hybrid motion planning method and the ROS Navigation Stack with Dynamic Window Approach (NS-DWA). The simulations and real-life experiments show very good results in open environments and also in mixed scenarios with narrow spaces.
CV May 9, 2022
Single-view 3D Body and Cloth Reconstruction under Complex Poses
Nicolas Ugrinovic, Albert Pumarola, Alberto Sanfeliu et al.
Recent advances in 3D human shape reconstruction from single images have shown impressive results, leveraging on deep networks that model the so-called implicit function to learn the occupancy status of arbitrarily dense 3D points in space. However, while current algorithms based on this paradigm, like PiFuHD, are able to estimate accurate geometry of the human shape and clothes, they require high-resolution input images and are not able to capture complex body poses. Most training and evaluation is performed on 1k-resolution images of humans standing in front of the camera under neutral body poses. In this paper, we leverage publicly available data to extend existing implicit function-based models to deal with images of humans that can have arbitrary poses and self-occluded limbs. We argue that the representation power of the implicit function is not sufficient to simultaneously model details of the geometry and of the body pose. We, therefore, propose a coarse-to-fine approach in which we first learn an implicit function that maps the input image to a 3D body shape with a low level of detail, but which correctly fits the underlying human pose, despite its complexity. We then learn a displacement map, conditioned on the smoothed surface and on the input image, which encodes the high-frequency details of the clothes and body. In the experimental section, we show that this coarse-to-fine strategy represents a very good trade-off between shape detail and pose correctness, comparing favorably to the most recent state-of-the-art approaches. Our code will be made publicly available.
LG Jul 6, 2022
Humans Social Relationship Classification during Accompaniment
Oscar Castro, Ely Repiso, Anais Garrell et al.
This paper presents the design of deep learning architectures that classify the social relationship between two people walking in a side-by-side formation into four possible categories: colleagues, couple, family or friendship. The models are built with Neural Networks or Recurrent Neural Networks and are trained and evaluated using a database of readings obtained from humans performing an accompaniment process in an urban environment. The best model achieves relatively good accuracy in the classification problem, and its results partially improve on the outcomes of a previous study [1]. Furthermore, the proposed model shows potential for further gains in efficiency and for implementation in a real robot.
RO Nov 17, 2023
Human motion trajectory prediction using the Social Force Model for real-time and low computational cost applications
Oscar Gil, Alberto Sanfeliu
Human motion trajectory prediction is a very important functionality for human-robot collaboration, specifically in accompanying, guiding, or approaching tasks, but also in social robotics, self-driving vehicles, or security systems. In this paper, a novel trajectory prediction model, Social Force Generative Adversarial Network (SoFGAN), is proposed. SoFGAN uses a Generative Adversarial Network (GAN) and the Social Force Model (SFM) to generate different plausible trajectories of people while reducing collisions in a scene. Furthermore, a Conditional Variational Autoencoder (CVAE) module is added to emphasize the destination learning. We show that our method is more accurate in making predictions on the UCY and BIWI datasets than most of the current state-of-the-art models and also reduces collisions in comparison to other approaches. Through real-life experiments, we demonstrate that the model can be used in real time without GPUs to perform good-quality predictions with a low computational cost.
RO Feb 21
When the Inference Meets the Explicitness or Why Multimodality Can Make Us Forget About the Perfect Predictor
J. E. Domínguez-Vidal, Alberto Sanfeliu
Although in the literature it is common to find predictors and inference systems that try to predict human intentions, the uncertainty of these models due to the randomness of human behavior has led some authors to start advocating the use of communication systems that explicitly elicit human intention. In this work, we analyze the use of four different communication systems with a human-robot collaborative object transportation task as the experimental testbed: two intention predictors (one based on force prediction and another with an enhanced velocity prediction algorithm) and two explicit communication methods (a button interface and a voice-command recognition system). These systems were integrated into IVO, a custom mobile social robot equipped with a force sensor to detect the force exchange between both agents and a LiDAR to detect the environment. The collaborative task required transporting an object over a 5-7 meter distance with obstacles in the middle, demanding rapid decisions and precise physical coordination. 75 volunteers performed a total of 255 executions divided into three groups, testing inference systems in the first round, communication systems in the second, and the combined strategies in the third. The results show that 1) once sufficient performance is achieved, the human no longer notices and positively assesses technical improvements; 2) the human prefers systems that are more natural to them even though they have higher failure rates; and 3) the preferred option is the right combination of both systems.
CV Jan 30, 2025
Learning Priors of Human Motion With Vision Transformers
Placido Falqueto, Alberto Sanfeliu, Luigi Palopoli et al.
A clear understanding of where humans move in a scenario, their usual paths and speeds, and where they stop is very important for different applications, such as mobility studies in urban areas or robot navigation tasks within human-populated environments. In this article, we propose a neural architecture based on Vision Transformers (ViTs) to provide this information. This solution can arguably capture spatial correlations more effectively than Convolutional Neural Networks (CNNs). In the paper, we describe the methodology and the proposed neural architecture and show the results of experiments on a standard dataset. We show that the proposed ViT architecture improves the metrics compared to a method based on a CNN.
CV Nov 2, 2021
Body Size and Depth Disambiguation in Multi-Person Reconstruction from Single Images
Nicolas Ugrinovic, Adria Ruiz, Antonio Agudo et al.
We address the problem of multi-person 3D body pose and shape estimation from a single image. While this problem can be addressed by applying single-person approaches multiple times for the same scene, recent works have shown the advantages of building upon deep architectures that simultaneously reason about all people in the scene in a holistic manner by enforcing, e.g., depth order constraints or minimizing interpenetration among reconstructed bodies. However, existing approaches are still unable to capture the size variability of people caused by the inherent body scale and depth ambiguity. In this work, we tackle this challenge by devising a novel optimization scheme that learns the appropriate body scale and relative camera pose, by enforcing the feet of all people to remain on the ground floor. A thorough evaluation on the MuPoTS-3D and 3DPW datasets demonstrates that our approach is able to robustly estimate the body translation and shape of multiple people while retrieving their spatial arrangement, consistently improving on the current state of the art, especially in scenes with people of very different heights.
RO Dec 8, 2019
Effects of a Social Force Model reward in Robot Navigation based on Deep Reinforcement Learning
Óscar Gil Viyuela, Alberto Sanfeliu
This paper proposes including the Social Force Model (SFM) in a concrete Deep Reinforcement Learning (RL) framework for robot navigation. These types of techniques have proven useful for reaching a goal in different types of environments. In Deep RL, a description of the world that defines the states and a reward adapted to the environment are crucial elements for obtaining the desired behaviour and achieving high performance. For this reason, this work adds a dense reward function based on the SFM and uses the forces in the states as an additional description. Furthermore, obstacles are added to improve on the behaviour of works that only consider moving agents. This SFM inclusion can offer a better description of the obstacles for navigation. Several simulations have been carried out to check the effects of these modifications on the average performance.
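As a rough illustration of a dense reward built from social forces, the sketch below uses the common exponential repulsion form of the SFM, with magnitude A·exp((r − d)/B) directed away from the other agent. It is a sketch under stated assumptions and not the paper's reward: the parameter values `a`, `b`, `radii`, the `weight`, and the function names are all hypothetical.

```python
import math

def repulsive_force(robot, other, a=2.0, b=0.5, radii=0.6):
    """Exponential SFM-style repulsion exerted on the robot by one agent.

    a, b and radii are hypothetical parameters chosen for illustration.
    """
    dx, dy = robot[0] - other[0], robot[1] - other[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        # Direction is undefined when positions coincide; return no force.
        return (0.0, 0.0)
    magnitude = a * math.exp((radii - dist) / b)
    return (magnitude * dx / dist, magnitude * dy / dist)

def sfm_reward(robot, others, weight=0.1):
    """Dense penalty proportional to the net repulsive force on the robot."""
    forces = [repulsive_force(robot, other) for other in others]
    fx = sum(f[0] for f in forces)
    fy = sum(f[1] for f in forces)
    return -weight * math.hypot(fx, fy)

near = sfm_reward((0.0, 0.0), [(0.5, 0.0)])
far = sfm_reward((0.0, 0.0), [(3.0, 0.0)])
print(near, far)
```

Because the penalty is evaluated at every step and grows smoothly as the robot approaches a pedestrian or obstacle, it provides a learning signal well before any collision occurs, which is the usual motivation for a dense reward.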
CV Oct 8, 2019
Improving Map Re-localization with Deep 'Movable' Objects Segmentation on 3D LiDAR Point Clouds
Victor Vaquero, Kai Fischer, Francesc Moreno-Noguer et al.
Localization and Mapping is an essential component to enable Autonomous Vehicle navigation, and requires an accuracy exceeding that of commercial GPS-based systems. Current odometry and mapping algorithms are able to provide this accurate information. However, the lack of robustness of these algorithms against dynamic obstacles and environmental changes, even for short time periods, forces the generation of new maps on every session without taking advantage of previously obtained ones. In this paper we propose the use of a deep learning architecture to segment movable objects from 3D LiDAR point clouds in order to obtain longer-lasting 3D maps. This will in turn allow for better, faster and more accurate re-localization and trajectory estimation on subsequent days. We show the effectiveness of our approach in a very dynamic and cluttered scenario, a supermarket parking lot. For that, we record several sequences on different days and compare localization errors with and without our movable objects segmentation method. Results show that we are able to accurately re-locate over a filtered map, consistently reducing trajectory errors by an average of 35.1% with respect to a non-filtered map version and 47.9% with respect to a standalone map created in the current session.
RO Sep 10, 2019
Human-robot Collaborative Navigation Search using Social Reward Sources
Marc Dalmasso, Anaís Garrell, Pablo Jiménez et al.
This paper proposes a Social Reward Sources (SRS) design for a Human-Robot Collaborative Navigation (HRCN) task: human-robot collaborative search. It is a flexible approach capable of handling the collaborative task, human-robot interaction and environment restrictions, all integrated on a common environment. We modelled task rewards based on unexplored area observability and isolation and evaluated the model through different levels of human-robot communication. The models are validated through quantitative evaluation against both agents' individual performance and qualitative surveying of participants' perception. After that, the three proposed communication levels are compared against each other using the previous metrics.
CV Apr 9, 2019
3DPeople: Modeling the Geometry of Dressed Humans
Albert Pumarola, Jordi Sanchez, Gary P. T. Choi et al.
Recent advances in 3D human shape estimation build upon parametric representations that model very well the shape of the naked body, but are not appropriate to represent the clothing geometry. In this paper, we present an approach to model dressed humans and predict their geometry from single images. We contribute in three fundamental aspects of the problem, namely, a new dataset, a novel shape parameterization algorithm and an end-to-end deep generative network for predicting shape. First, we present 3DPeople, a large-scale synthetic dataset with 2.5 Million photo-realistic images of 80 subjects performing 70 activities and wearing diverse outfits. Besides providing textured 3D meshes for clothes and body, we annotate the dataset with segmentation masks, skeletons, depth, normal maps and optical flow. All this together makes 3DPeople suitable for a plethora of tasks. We then represent the 3D shapes using 2D geometry images. To build these images we propose a novel spherical area-preserving parameterization algorithm based on the optimal mass transportation method. We show this approach to improve existing spherical maps which tend to shrink the elongated parts of the full body models such as the arms and legs, making the geometry images incomplete. Finally, we design a multi-resolution deep generative network that, given an input image of a dressed human, predicts his/her geometry image (and thus the clothed body shape) in an end-to-end manner. We obtain very promising results in jointly capturing body pose and clothing shape, both for synthetic validation and on the wild images.
CV Mar 28, 2019
Fast video object segmentation with Spatio-Temporal GANs
Sergi Caelles, Albert Pumarola, Francesc Moreno-Noguer et al.
Learning descriptive spatio-temporal object models from data is paramount for the task of semi-supervised video object segmentation. Most existing approaches mainly rely on models that estimate the segmentation mask based on a reference mask at the first frame (aided sometimes by optical flow or the previous mask). These models, however, are prone to fail under rapid appearance changes or occlusions due to their limitations in modelling the temporal component. On the other hand, very recently, other approaches learned long-term features using a convolutional LSTM to leverage the information from all previous video frames. Even though these models achieve better temporal representations, they still have to be fine-tuned for every new video sequence. In this paper, we present an intermediate solution and devise a novel GAN architecture, FaSTGAN, to learn spatio-temporal object models over finite temporal windows. To achieve this, we concentrate all the heavy computational load to the training phase with two critics that enforce spatial and temporal mask consistency over the last K frames. Then at test time, we only use a relatively light regressor, which reduces the inference time considerably. As a result, our approach combines a high resiliency to sudden geometric and photometric object changes with efficiency at test time (no need for fine-tuning nor post-processing). We demonstrate that the accuracy of our method is on par with state-of-the-art techniques on the challenging YouTube-VOS and DAVIS datasets, while running at 32 fps, about 4x faster than the closest competitor.
CV Sep 27, 2018
Geometry-Aware Network for Non-Rigid Shape Prediction from a Single View
Albert Pumarola, Antonio Agudo, Lorenzo Porzi et al.
We propose a method for predicting the 3D shape of a deformable surface from a single view. By contrast with previous approaches, we do not need a pre-registered template of the surface, and our method is robust to the lack of texture and partial occlusions. At the core of our approach is a geometry-aware deep architecture that tackles the problem as usually done in analytic solutions: first perform 2D detection of the mesh and then estimate a 3D shape that is geometrically consistent with the image. We train this architecture in an end-to-end manner using a large dataset of synthetic renderings of shapes under different levels of deformation, material properties, textures and lighting conditions. We evaluate our approach on a test split of this dataset and available real benchmarks, consistently improving state-of-the-art solutions with a significantly lower computational time.
CV Sep 27, 2018
Unsupervised Person Image Synthesis in Arbitrary Poses
Albert Pumarola, Antonio Agudo, Alberto Sanfeliu et al.
We present a novel approach for synthesizing photo-realistic images of people in arbitrary poses using generative adversarial learning. Given an input image of a person and a desired pose represented by a 2D skeleton, our model renders the image of the same person under the new pose, synthesizing novel views of the parts visible in the input image and hallucinating those that are not seen. This problem has recently been addressed in a supervised manner, i.e., during training the ground truth images under the new poses are given to the network. We go beyond these approaches by proposing a fully unsupervised strategy. We tackle this challenging scenario by splitting the problem into two principal subtasks. First, we consider a pose conditioned bidirectional generator that maps back the initially rendered image to the original pose, hence being directly comparable to the input image without the need to resort to any training image. Second, we devise a novel loss function that incorporates content and style terms, and aims at producing images of high perceptual quality. Extensive experiments conducted on the DeepFashion dataset demonstrate that the images rendered by our model are very close in appearance to those obtained by fully supervised approaches.
CV Aug 30, 2018
Hallucinating Dense Optical Flow from Sparse Lidar for Autonomous Vehicles
Victor Vaquero, Alberto Sanfeliu, Francesc Moreno-Noguer
In this paper we propose a novel approach to estimate dense optical flow from sparse lidar data acquired on an autonomous vehicle. This is intended to be used as a drop-in replacement of any image-based optical flow system when images are not reliable due to e.g. adverse weather conditions or at night. In order to infer high resolution 2D flows from discrete range data we devise a three-block architecture of multiscale filters that combines multiple intermediate objectives, both in the lidar and image domain. To train this network we introduce a dataset with approximately 20K lidar samples of the Kitti dataset which we have augmented with a pseudo ground-truth image-based optical flow computed using FlowNet2. We demonstrate the effectiveness of our approach on Kitti, and show that despite using the low-resolution and sparse measurements of the lidar, we can regress dense optical flow maps which are at par with those estimated with image-based methods.
CV Aug 28, 2018
Deep Lidar CNN to Understand the Dynamics of Moving Vehicles
Victor Vaquero, Alberto Sanfeliu, Francesc Moreno-Noguer
Perception technologies in Autonomous Driving are experiencing their golden age due to the advances in Deep Learning. Yet, most of these systems rely on the semantically rich information of RGB images. Deep Learning solutions applied to the data of other sensors typically mounted on autonomous cars (e.g. lidars or radars) are not explored much. In this paper we propose a novel solution to understand the dynamics of moving vehicles of the scene from only lidar information. The main challenge of this problem stems from the fact that we need to disambiguate the proprio-motion of the 'observer' vehicle from that of the external 'observed' vehicles. For this purpose, we devise a CNN architecture which at testing time is fed with pairs of consecutive lidar scans. However, in order to properly learn the parameters of this network, during training we introduce a series of so-called pretext tasks which also leverage on image data. These tasks include semantic information about vehicleness and a novel lidar-flow feature which combines standard image-based optical flow with lidar scans. We obtain very promising results and show that including distilled image information only during training, allows improving the inference results of the network at test time, even when image data is no longer used.
CV Aug 23, 2018
Deconvolutional Networks for Point-Cloud Vehicle Detection and Tracking in Driving Scenarios
Victor Vaquero, Ivan del Pino, Francesc Moreno-Noguer et al.
Vehicle detection and tracking is a core ingredient for developing autonomous driving applications in urban scenarios. Recent image-based Deep Learning (DL) techniques are obtaining breakthrough results in these perceptive tasks. However, DL research has not yet advanced much towards processing 3D point clouds from lidar range-finders. These sensors are very common in autonomous vehicles since, despite not providing as semantically rich information as images, their performance is more robust under harsh weather conditions than that of vision sensors. In this paper we present a full vehicle detection and tracking system that works with 3D lidar information only. Our detection step uses a Convolutional Neural Network (CNN) that receives as input a featured representation of the 3D information provided by a Velodyne HDL-64 sensor and returns a per-point classification of whether it belongs to a vehicle or not. The classified point cloud is then geometrically processed to generate observations for a multi-object tracking system implemented via a number of Multi-Hypothesis Extended Kalman Filters (MH-EKF) that estimate the position and velocity of the surrounding vehicles. The system is thoroughly evaluated on the KITTI tracking dataset, and we show the performance boost provided by our CNN-based vehicle detector over a standard geometric approach. Our lidar-based approach uses about 4% of the data needed for an image-based detector with similarly competitive results.
CV Aug 22, 2018
Joint Coarse-And-Fine Reasoning for Deep Optical Flow
Victor Vaquero, German Ros, Francesc Moreno-Noguer et al.
We propose a novel representation for dense pixel-wise estimation tasks using CNNs that boosts accuracy and reduces training time, by explicitly exploiting joint coarse-and-fine reasoning. The coarse reasoning is performed over a discrete classification space to obtain a general rough solution, while the fine details of the solution are obtained over a continuous regression space. In our approach both components are jointly estimated, which proved to be beneficial for improving estimation accuracy. Additionally, we propose a new network architecture, which combines coarse and fine components by treating the fine estimation as a refinement built on top of the coarse solution, and therefore adding details to the general prediction. We apply our approach to the challenging problem of optical flow estimation and empirically validate it against state-of-the-art CNN-based solutions trained from scratch and tested on large optical flow datasets.
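The coarse-plus-fine idea can be pictured on a single flow component: a coarse class picks a bin of the value range, and a continuous residual locates the value inside that bin. The sketch below is only an illustration under assumed settings; the range `[-8, 8]`, the number of bins, and the function names are hypothetical and are not taken from the paper.

```python
def encode(value, lo=-8.0, hi=8.0, bins=8):
    """Split a value into a coarse bin index and a fine residual.

    lo, hi and bins are hypothetical settings for this illustration.
    """
    width = (hi - lo) / bins
    clipped = min(max(value, lo), hi)
    # Values exactly at the upper edge fall into the last bin.
    index = min(int((clipped - lo) / width), bins - 1)
    center = lo + (index + 0.5) * width
    return index, clipped - center

def decode(index, residual, lo=-8.0, hi=8.0, bins=8):
    """Rebuild the value as the bin center refined by the residual."""
    width = (hi - lo) / bins
    return lo + (index + 0.5) * width + residual

index, residual = encode(3.2)
print(index, residual, decode(index, residual))
```

In a network, the bin index would be the target of a classification head and the residual the target of a regression head, so the regression only has to cover half a bin width in either direction instead of the full value range.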
CV Jul 24, 2018
GANimation: Anatomically-aware Facial Animation from a Single Image
Albert Pumarola, Antonio Agudo, Aleix M. Martinez et al.
Recent advances in Generative Adversarial Networks (GANs) have shown impressive results for the task of facial expression synthesis. The most successful architecture is StarGAN, which conditions the GAN's generation process on images of a specific domain, namely a set of images of persons sharing the same expression. While effective, this approach can only generate a discrete number of expressions, determined by the content of the dataset. To address this limitation, in this paper, we introduce a novel GAN conditioning scheme based on Action Unit (AU) annotations, which describe in a continuous manifold the anatomical facial movements defining a human expression. Our approach allows controlling the magnitude of activation of each AU and combining several of them. Additionally, we propose a fully unsupervised strategy to train the model, which only requires images annotated with their activated AUs, and exploit attention mechanisms that make our network robust to changing backgrounds and lighting conditions. Extensive evaluation shows that our approach goes beyond competing conditional generators both in the capability to synthesize a much wider range of expressions ruled by anatomically feasible muscle movements and in the capacity to deal with images in the wild.