Kuniaki Uto

4 papers

109 citations

Novelty: 50%

AI Score: 31

Ranked #142,633 of 205,806 authors (top 69%); #44,229 in CV (top 75%)

4 Papers

CV, Sep 26, 2024

CAMOT: Camera Angle-aware Multi-Object Tracking

Felix Limanta, Kuniaki Uto, Koichi Shinoda

This paper proposes CAMOT, a simple camera angle estimator for multi-object tracking that tackles two problems: 1) occlusion and 2) inaccurate distance estimation in the depth direction. Under the assumption that multiple objects are located on a flat plane in each video frame, CAMOT estimates the camera angle using object detection. In addition, it gives the depth of each object, enabling pseudo-3D MOT. We evaluated its performance by adding it to various 2D MOT methods on the MOT17 and MOT20 datasets and confirmed its effectiveness. Applying CAMOT to ByteTrack, we obtained 63.8% HOTA, 80.6% MOTA, and 78.5% IDF1 on MOT17, which are state-of-the-art results. Its computational cost is significantly lower than that of existing deep-learning-based depth estimators for tracking.

CV, Sep 19, 2020

MSR-DARTS: Minimum Stable Rank of Differentiable Architecture Search

Kengo Machida, Kuniaki Uto, Koichi Shinoda et al.

In neural architecture search (NAS), differentiable architecture search (DARTS) has recently attracted much attention due to its high efficiency. It defines an over-parameterized network with mixed edges, each of which represents all operator candidates, and jointly optimizes the weights of the network and its architecture in an alternating manner. However, this method tends to find the model whose weights converge faster than the others, and such a fast-converging model often overfits. Accordingly, the resulting model does not always generalize well. To overcome this problem, we propose a method called minimum stable rank DARTS (MSR-DARTS), which finds a model with the best generalization error by replacing architecture optimization with a selection process based on the minimum stable rank criterion. Specifically, a convolution operator is represented by a matrix, and MSR-DARTS selects the one with the smallest stable rank. We evaluated MSR-DARTS on the CIFAR-10 and ImageNet datasets. It achieves an error rate of 2.54% with 4.0M parameters within 0.3 GPU-days on CIFAR-10, and a top-1 error rate of 23.9% on ImageNet. The official code is available at https://github.com/mtaecchhi/msrdarts.git.
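
To make the selection criterion concrete, here is a minimal sketch of the stable rank, using its standard definition as the squared Frobenius norm divided by the squared spectral norm. It is not the authors' implementation: the abstract does not say how a convolution operator is turned into a matrix, so the candidates are plain lists of rows, and the names `stable_rank`, `select_min_stable_rank`, `op_a`, and `op_b` are hypothetical.

```python
import math
import random


def stable_rank(matrix, iters=200, seed=0):
    """Return ||A||_F^2 / ||A||_2^2 for a non-zero matrix given as a list of rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    frob_sq = sum(x * x for row in matrix for x in row)
    if frob_sq == 0:
        raise ValueError("stable rank is undefined for the zero matrix")

    # Power iteration on A^T A estimates the squared spectral norm.
    # Assumption: a seeded random positive start vector is good enough here.
    rng = random.Random(seed)
    v = [rng.random() + 0.5 for _ in range(n_cols)]
    scale = math.sqrt(sum(x * x for x in v))
    v = [x / scale for x in v]

    sigma_max_sq = 0.0
    for _ in range(iters):
        u = [sum(a * b for a, b in zip(row, v)) for row in matrix]  # u = A v
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows))
             for j in range(n_cols)]                                 # w = A^T u
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("start vector fell in the null space; try another seed")
        sigma_max_sq = norm  # for unit v, ||A^T A v|| tends to the top eigenvalue
        v = [x / norm for x in w]
    return frob_sq / sigma_max_sq


def select_min_stable_rank(candidates):
    """Pick the candidate name whose matrix has the smallest stable rank."""
    return min(candidates, key=lambda name: stable_rank(candidates[name]))


# Hypothetical candidates: an identity matrix (stable rank 3) and a
# rank-one matrix (stable rank 1).
candidates = {
    "op_a": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    "op_b": [[1.0, 2.0], [2.0, 4.0]],
}
chosen = select_min_stable_rank(candidates)
```

The stable rank is at most the ordinary rank and, unlike it, changes smoothly with the singular values, which makes it usable as a comparison score between operators.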

SY, Feb 6, 2022

3D Map Reconstruction of an Orchard using an Angle-Aware Covering Control Strategy

Martina Mammarella, Cesare Donati, Takumi Shimizu et al.

In recent years, unmanned aerial vehicles have become a reality in the context of precision agriculture, mainly for monitoring, patrolling and remote sensing tasks, but also for 3D map reconstruction. In this paper, we present an innovative approach in which a fleet of unmanned aerial vehicles performs remote sensing over an apple orchard to reconstruct a 3D map of the field, formulating the covering control problem so that it combines the position of a monitoring target with the viewing angle. Moreover, the objective function of the controller is defined by an importance index, which is computed from a multi-spectral map of the field, obtained in a preliminary flight, using a semantic interpretation scheme based on a convolutional neural network. This objective function is then updated according to the history of past coverage states, thus allowing the drones to take situation-adaptive actions. The effectiveness of the proposed covering control strategy has been validated through simulations on the Robot Operating System.

AS, Sep 29, 2021

Multimodal Emotion Recognition with High-level Speech and Text Features

Mariana Rodrigues Makiuchi, Kuniaki Uto, Koichi Shinoda

Automatic emotion recognition is one of the central concerns of the Human-Computer Interaction field as it can bridge the gap between humans and machines. Current works train deep learning models on low-level data representations to solve the emotion recognition task. Since emotion datasets often have a limited amount of data, these approaches may suffer from overfitting, and they may learn based on superficial cues. To address these issues, we propose a novel cross-representation speech model, inspired by disentanglement representation learning, to perform emotion recognition on wav2vec 2.0 speech features. We also train a CNN-based model to recognize emotions from text features extracted with Transformer-based models. We further combine the speech-based and text-based results with a score fusion approach. Our method is evaluated on the IEMOCAP dataset in a 4-class classification problem, and it surpasses current works on speech-only, text-only, and multimodal emotion recognition.
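
The score fusion step mentioned above can be illustrated with a weighted average of per-class probabilities. This is only a sketch under stated assumptions: the abstract does not give the fusion rule or weights, so the weighted average, the `speech_weight` parameter, the function names, and the four label names are all hypothetical choices, and the probability values are made-up inputs rather than results.

```python
# Assumed label set for a 4-class setup; the abstract does not name the classes.
LABELS = ("angry", "happy", "neutral", "sad")


def fuse_scores(speech_probs, text_probs, speech_weight=0.5):
    """Weighted average of two per-class score vectors (late fusion)."""
    if len(speech_probs) != len(text_probs):
        raise ValueError("both models must score the same classes")
    if not 0.0 <= speech_weight <= 1.0:
        raise ValueError("speech_weight must lie in [0, 1]")
    w = speech_weight
    return [w * s + (1.0 - w) * t for s, t in zip(speech_probs, text_probs)]


def predict_label(fused_scores, labels=LABELS):
    """Return the label with the highest fused score."""
    best = max(range(len(fused_scores)), key=lambda i: fused_scores[i])
    return labels[best]


# Made-up per-class probabilities from a speech model and a text model.
speech = [0.1, 0.2, 0.6, 0.1]
text = [0.5, 0.2, 0.2, 0.1]

equal_weight_label = predict_label(fuse_scores(speech, text))
text_heavy_label = predict_label(fuse_scores(speech, text, speech_weight=0.2))
```

With equal weights the speech model's preference wins, while shifting weight toward the text model flips the decision, which is why the fusion weight is usually tuned on held-out data.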