CVJun 8, 2022
Patch-based Object-centric Transformers for Efficient Video GenerationWilson Yan, Ryo Okumura, Stephen James et al.
In this work, we present Patch-based Object-centric Video Transformer (POVT), a novel region-based video generation architecture that leverages object-centric information to efficiently model temporal dynamics in videos. We build upon prior work in video prediction via an autoregressive transformer over the discrete latent space of compressed videos, with an added modification to model object-centric information via bounding boxes. Due to better compressibility of object-centric representations, we can improve training efficiency by allowing the model to only access object information for longer horizon temporal information. When evaluated on various difficult object-centric datasets, our method achieves better or equal performance to other video generation models, while remaining computationally more efficient and scalable. In addition, we show that our method is able to perform object-centric controllability through bounding box manipulation, which may aid downstream tasks such as video editing, or visual planning. Samples are available at https://sites.google.com/view/povt-public
ROMar 10, 2022
Tactile-Sensitive NewtonianVAE for High-Accuracy Industrial Connector InsertionRyo Okumura, Nobuki Nishio, Tadahiro Taniguchi
An industrial connector insertion task requires submillimeter positioning and grasp pose compensation for a plug. Thus, highly accurate estimation of the relative pose between a plug and socket is fundamental for achieving the task. World models are promising technologies for visuomotor control because they obtain appropriate state representation to jointly optimize feature extraction and latent dynamics model. Recent studies show that the NewtonianVAE, a type of the world model, acquires latent space equivalent to mapping from images to physical coordinates. Proportional control can be achieved in the latent space of NewtonianVAE. However, applying NewtonianVAE to high-accuracy industrial tasks in physical environments is an open problem. Moreover, the existing framework does not consider the grasp pose compensation in the obtained latent space. In this work, we proposed tactile-sensitive NewtonianVAE and applied it to a USB connector insertion with grasp pose variation in the physical environments. We adopted a GelSight-type tactile sensor and estimated the insertion position compensated by the grasp pose of the plug. Our method trains the latent space in an end-to-end manner, and no additional engineering and annotation are required. Simple proportional control is available in the obtained latent space. Moreover, we showed that the original NewtonianVAE fails in some situations, and demonstrated that domain knowledge induction improves model accuracy. This domain knowledge can be easily obtained using robot specification and grasp pose error measurement. We demonstrated that our proposed method achieved a 100\% success rate and 0.3 mm positioning accuracy in the USB connector insertion task in the physical environment. It outperformed SOTA CNN-based two-stage goal pose regression with grasp pose compensation using coordinate transformation.
AIMar 15, 2022
Multi-View Dreaming: Multi-View World Model with Contrastive LearningAkira Kinose, Masashi Okada, Ryo Okumura et al.
In this paper, we propose Multi-View Dreaming, a novel reinforcement learning agent for integrated recognition and control from multi-view observations by extending Dreaming. Most current reinforcement learning method assumes a single-view observation space, and this imposes limitations on the observed data, such as lack of spatial information and occlusions. This makes obtaining ideal observational information from the environment difficult and is a bottleneck for real-world robotics applications. In this paper, we use contrastive learning to train a shared latent space between different viewpoints, and show how the Products of Experts approach can be used to integrate and control the probability distributions of latent states for multiple viewpoints. We also propose Multi-View DreamingV2, a variant of Multi-View Dreaming that uses a categorical distribution to model the latent state instead of the Gaussian distribution. Experiments show that the proposed method outperforms simple extensions of existing methods in a realistic robot control task.
LGJan 31, 2020
Domain-Adversarial and Conditional State Space Model for Imitation LearningRyo Okumura, Masashi Okada, Tadahiro Taniguchi
State representation learning (SRL) in partially observable Markov decision processes has been studied to learn abstract features of data useful for robot control tasks. For SRL, acquiring domain-agnostic states is essential for achieving efficient imitation learning. Without these states, imitation learning is hampered by domain-dependent information useless for control. However, existing methods fail to remove such disturbances from the states when the data from experts and agents show large domain shifts. To overcome this issue, we propose a domain-adversarial and conditional state space model (DAC-SSM) that enables control systems to obtain domain-agnostic and task- and dynamics-aware states. DAC-SSM jointly optimizes the state inference, observation reconstruction, forward dynamics, and reward models. To remove domain-dependent information from the states, the model is trained with domain discriminators in an adversarial manner, and the reconstruction is conditioned on domain labels. We experimentally evaluated the model predictive control performance via imitation learning for continuous control of sparse reward tasks in simulators and compared it with the performance of the existing SRL method. The agents from DAC-SSM achieved performance comparable to experts and more than twice the baselines. We conclude domain-agnostic states are essential for imitation learning that has large domain shifts and can be obtained using DAC-SSM.