Valdir Grassi

h-index17

6papers

161citations

Novelty44%

AI Score37

Ranked #94,997 of 194,257 authors (top 49%)#31,904 in CV (top 54%)

6 Papers

4.6LGApr 14, 2022Code

Leveraging convergence behavior to balance conflicting tasks in multi-task learning

Angelica Tiemi Mizuno Nakamura, Denis Fernando Wolf, Valdir Grassi

Multi-Task Learning is a learning paradigm that uses correlated tasks to improve performance generalization. A common way to learn multiple tasks is through the hard parameter sharing approach, in which a single architecture is used to share the same subset of parameters, creating an inductive bias between them during the training process. Due to its simplicity, potential to improve generalization, and reduce computational cost, it has gained the attention of the scientific and industrial communities. However, tasks often conflict with each other, which makes it challenging to define how the gradients of multiple tasks should be combined to allow simultaneous learning. To address this problem, we use the idea of multi-objective optimization to propose a method that takes into account temporal behaviour of the gradients to create a dynamic bias that adjust the importance of each task during the backpropagation. The result of this method is to give more attention to the tasks that are diverging or that are not being benefited during the last iterations, allowing to ensure that the simultaneous learning is heading to the performance maximization of all tasks. As a result, we empirically show that the proposed method outperforms the state-of-art approaches on learning conflicting tasks. Unlike the adopted baselines, our method ensures that all tasks reach good generalization performances.

2.6CVNov 2, 2022Code

Semantic SuperPoint: A Deep Semantic Descriptor

Gabriel S. Gama, Nícolas S. Rosa, Valdir Grassi

Several SLAM methods benefit from the use of semantic information. Most integrate photometric methods with high-level semantics such as object detection and semantic segmentation. We propose that adding a semantic segmentation decoder in a shared encoder architecture would help the descriptor decoder learn semantic information, improving the feature extractor. This would be a more robust approach than only using high-level semantic information since it would be intrinsically learned in the descriptor and would not depend on the final quality of the semantic prediction. To add this information, we take advantage of multi-task learning methods to improve accuracy and balance the performance of each task. The proposed models are evaluated according to detection and matching metrics on the HPatches dataset. The results show that the Semantic SuperPoint model performs better than the baseline one.

4.1LGMay 15, 2025Code

Uniform Loss vs. Specialized Optimization: A Comparative Analysis in Multi-Task Learning

Gabriel S. Gama, Valdir Grassi

Specialized Multi-Task Optimizers (SMTOs) balance task learning in Multi-Task Learning by addressing issues like conflicting gradients and differing gradient norms, which hinder equal-weighted task training. However, recent critiques suggest that equally weighted tasks can achieve competitive results compared to SMTOs, arguing that previous SMTO results were influenced by poor hyperparameter optimization and lack of regularization. In this work, we evaluate these claims through an extensive empirical evaluation of SMTOs, including some of the latest methods, on more complex multi-task problems to clarify this behavior. Our findings indicate that SMTOs perform well compared to uniform loss and that fixed weights can achieve competitive performance compared to SMTOs. Furthermore, we demonstrate why uniform loss perform similarly to SMTOs in some instances. The source code is available at https://github.com/Gabriel-SGama/UnitScal_vs_SMTOs.

12.4CVOct 13, 2020

On Deep Learning Techniques to Boost Monocular Depth Estimation for Autonomous Navigation

Raul de Queiroz Mendes, Eduardo Godinho Ribeiro, Nicolas dos Santos Rosa et al.

Inferring the depth of images is a fundamental inverse problem within the field of Computer Vision since depth information is obtained through 2D images, which can be generated from infinite possibilities of observed real scenes. Benefiting from the progress of Convolutional Neural Networks (CNNs) to explore structural features and spatial image information, Single Image Depth Estimation (SIDE) is often highlighted in scopes of scientific and technological innovation, as this concept provides advantages related to its low implementation cost and robustness to environmental conditions. In the context of autonomous vehicles, state-of-the-art CNNs optimize the SIDE task by producing high-quality depth maps, which are essential during the autonomous navigation process in different locations. However, such networks are usually supervised by sparse and noisy depth data, from Light Detection and Ranging (LiDAR) laser scans, and are carried out at high computational cost, requiring high-performance Graphic Processing Units (GPUs). Therefore, we propose a new lightweight and fast supervised CNN architecture combined with novel feature extraction models which are designed for real-world autonomous navigation. We also introduce an efficient surface normals module, jointly with a simple geometric 2.5D loss function, to solve SIDE problems. We also innovate by incorporating multiple Deep Learning techniques, such as the use of densification algorithms and additional semantic, surface normals and depth information to train our framework. The method introduced in this work focuses on robotic applications in indoor and outdoor environments and its results are evaluated on the competitive and publicly available NYU Depth V2 and KITTI Depth datasets.

10.4ROOct 13, 2020

Real-Time Deep Learning Approach to Visual Servo Control and Grasp Detection for Autonomous Robotic Manipulation

Eduardo Godinho Ribeiro, Raul de Queiroz Mendes, Valdir Grassi

In order to explore robotic grasping in unstructured and dynamic environments, this work addresses the visual perception phase involved in the task. This phase involves the processing of visual data to obtain the location of the object to be grasped, its pose and the points at which the robot`s grippers must make contact to ensure a stable grasp. For this, the Cornell Grasping dataset is used to train a convolutional neural network that, having an image of the robot`s workspace, with a certain object, is able to predict a grasp rectangle that symbolizes the position, orientation and opening of the robot`s grippers before its closing. In addition to this network, which runs in real-time, another one is designed to deal with situations in which the object moves in the environment. Therefore, the second network is trained to perform a visual servo control, ensuring that the object remains in the robot`s field of view. This network predicts the proportional values of the linear and angular velocities that the camera must have so that the object is always in the image processed by the grasp network. The dataset used for training was automatically generated by a Kinova Gen3 manipulator. The robot is also used to evaluate the applicability in real-time and obtain practical results from the designed algorithms. Moreover, the offline results obtained through validation sets are also analyzed and discussed regarding their efficiency and processing speed. The developed controller was able to achieve a millimeter accuracy in the final position considering a target object seen for the first time. To the best of our knowledge, we have not found in the literature other works that achieve such precision with a controller learned from scratch. Thus, this work presents a new system for autonomous robotic manipulation with high processing speed and the ability to generalize to several different objects.

3.9CVSep 24, 2018Code

Sparse-to-Continuous: Enhancing Monocular Depth Estimation using Occupancy Maps

Nícolas Rosa, Vitor Guizilini, Valdir Grassi

This paper addresses the problem of single image depth estimation (SIDE), focusing on improving the quality of deep neural network predictions. In a supervised learning scenario, the quality of predictions is intrinsically related to the training labels, which guide the optimization process. For indoor scenes, structured-light-based depth sensors (e.g. Kinect) are able to provide dense, albeit short-range, depth maps. On the other hand, for outdoor scenes, LiDARs are considered the standard sensor, which comparatively provides much sparser measurements, especially in areas further away. Rather than modifying the neural network architecture to deal with sparse depth maps, this article introduces a novel densification method for depth maps, using the Hilbert Maps framework. A continuous occupancy map is produced based on 3D points from LiDAR scans, and the resulting reconstructed surface is projected into a 2D depth map with arbitrary resolution. Experiments conducted with various subsets of the KITTI dataset show a significant improvement produced by the proposed Sparse-to-Continuous technique, without the introduction of extra information into the training stage.