Daiki Kimura

CV
19papers
3,147citations
Novelty52%
AI Score35

19 Papers

CVOct 28, 2023Code
Foundation Models for Generalist Geospatial Artificial Intelligence

Johannes Jakubik, Sujit Roy, C. E. Phillips et al.

Significant progress in the development of highly adaptable and reusable Artificial Intelligence (AI) models is expected to have a significant impact on Earth science and remote sensing. Foundation models are pre-trained on large unlabeled datasets through self-supervision, and then fine-tuned for various downstream tasks with small labeled datasets. This paper introduces a first-of-a-kind framework for the efficient pre-training and fine-tuning of foundational models on extensive geospatial data. We have utilized this framework to create Prithvi, a transformer-based geospatial foundational model pre-trained on more than 1TB of multispectral satellite imagery from the Harmonized Landsat-Sentinel 2 (HLS) dataset. Our study demonstrates the efficacy of our framework in successfully fine-tuning Prithvi to a range of Earth observation tasks that have not been tackled by previous work on foundation models involving multi-temporal cloud gap imputation, flood mapping, wildfire scar segmentation, and multi-temporal crop segmentation. Our experiments show that the pre-trained model accelerates the fine-tuning process compared to leveraging randomly initialized weights. In addition, pre-trained Prithvi compares well against the state-of-the-art, e.g., outperforming a conditional GAN model in multi-temporal cloud imputation by up to 5pp (or 5.7%) in the structural similarity index. Finally, due to the limited availability of labeled data in the field of Earth observation, we gradually reduce the quantity of available labeled data for refining the model to evaluate data efficiency and demonstrate that data can be decreased significantly without affecting the model's accuracy. The pre-trained 100 million parameter model and corresponding fine-tuning workflows have been released publicly as open source contributions to the global Earth sciences community through Hugging Face.

CLNov 29, 2022Code
DiffG-RL: Leveraging Difference between State and Common Sense

Tsunehiko Tanaka, Daiki Kimura, Michiaki Tatsubori

Taking into account background knowledge as the context has always been an important part of solving tasks that involve natural language. One representative example of such tasks is text-based games, where players need to make decisions based on both description text previously shown in the game, and their own background knowledge about the language and common sense. In this work, we investigate not simply giving common sense, as can be seen in prior research, but also its effective usage. We assume that a part of the environment states different from common sense should constitute one of the grounds for action selection. We propose a novel agent, DiffG-RL, which constructs a Difference Graph that organizes the environment states and common sense by means of interactive objects with a dedicated graph encoder. DiffG-RL also contains a framework for extracting the appropriate amount and representation of common sense from the source to support the construction of the graph. We validate DiffG-RL in experiments with text-based games that require common sense and show that it outperforms baselines by 17% of scores. The code is available at https://github.com/ibm/diffg-rl

CLJul 5, 2023
Learning Symbolic Rules over Abstract Meaning Representations for Textual Reinforcement Learning

Subhajit Chaudhury, Sarathkrishna Swaminathan, Daiki Kimura et al.

Text-based reinforcement learning agents have predominantly been neural network-based models with embeddings-based representation, learning uninterpretable policies that often do not generalize well to unseen games. On the other hand, neuro-symbolic methods, specifically those that leverage an intermediate formal representation, are gaining significant attention in language understanding tasks. This is because of their advantages ranging from inherent interpretability, the lesser requirement of training data, and being generalizable in scenarios with unseen data. Therefore, in this paper, we propose a modular, NEuro-Symbolic Textual Agent (NESTA) that combines a generic semantic parser with a rule induction system to learn abstract interpretable rules as policies. Our experiments on established text-based game benchmarks show that the proposed NESTA method outperforms deep reinforcement learning-based techniques by achieving better generalization to unseen test games and learning from fewer training interactions.

LGSep 5, 2023Code
TensorBank: Tensor Lakehouse for Foundation Model Training

Romeo Kienzler, Leonardo Pondian Tizzei, Benedikt Blumenstiel et al.

Storing and streaming high dimensional data for foundation model training became a critical requirement with the rise of foundation models beyond natural language. In this paper we introduce TensorBank, a petabyte scale tensor lakehouse capable of streaming tensors from Cloud Object Store (COS) to GPU memory at wire speed based on complex relational queries. We use Hierarchical Statistical Indices (HSI) for query acceleration. Our architecture allows to directly address tensors on block level using HTTP range reads. Once in GPU memory, data can be transformed using PyTorch transforms. We provide a generic PyTorch dataset type with a corresponding dataset factory translating relational queries and requested transformations as an instance. By making use of the HSI, irrelevant blocks can be skipped without reading them as those indices contain statistics on their content at different hierarchical resolution levels. This is an opinionated architecture powered by open standards and making heavy use of open-source technology. Although, hardened for production use using geospatial-temporal data, this architecture generalizes to other use case like computer vision, computational neuroscience, biological sequence analysis and more.

CLJun 6, 2023
Utterance Classification with Logical Neural Network: Explainable AI for Mental Disorder Diagnosis

Yeldar Toleubay, Don Joven Agravante, Daiki Kimura et al.

In response to the global challenge of mental health problems, we proposes a Logical Neural Network (LNN) based Neuro-Symbolic AI method for the diagnosis of mental disorders. Due to the lack of effective therapy coverage for mental disorders, there is a need for an AI solution that can assist therapists with the diagnosis. However, current Neural Network models lack explainability and may not be trusted by therapists. The LNN is a Recurrent Neural Network architecture that combines the learning capabilities of neural networks with the reasoning capabilities of classical logic-based AI. The proposed system uses input predicates from clinical interviews to output a mental disorder class, and different predicate pruning techniques are used to achieve scalability and higher scores. In addition, we provide an insight extraction method to aid therapists with their diagnosis. The proposed system addresses the lack of explainability of current Neural Network models and provides a more trustworthy solution for mental disorder diagnosis.

CVOct 19, 2022
Commonsense Knowledge from Scene Graphs for Textual Environments

Tsunehiko Tanaka, Daiki Kimura, Michiaki Tatsubori

Text-based games are becoming commonly used in reinforcement learning as real-world simulation environments. They are usually imperfect information games, and their interactions are only in the textual modality. To challenge these games, it is effective to complement the missing information by providing knowledge outside the game, such as human common sense. However, such knowledge has only been available from textual information in previous works. In this paper, we investigate the advantage of employing commonsense reasoning obtained from visual datasets such as scene graph datasets. In general, images convey more comprehensive information compared with text for humans. This property enables to extract commonsense relationship knowledge more useful for acting effectively in a game. We compare the statistics of spatial relationships available in Visual Genome (a scene graph dataset) and ConceptNet (a text-based knowledge) to analyze the effectiveness of introducing scene graph datasets. We also conducted experiments on a text-based game task that requires commonsense reasoning. Our experimental results demonstrated that our proposed methods have higher and competitive performance than existing state-of-the-art methods.

AIOct 21, 2021Code
LOA: Logical Optimal Actions for Text-based Interaction Games

Daiki Kimura, Subhajit Chaudhury, Masaki Ono et al.

We present Logical Optimal Actions (LOA), an action decision architecture of reinforcement learning applications with a neuro-symbolic framework which is a combination of neural network and symbolic knowledge acquisition approach for natural language interaction games. The demonstration for LOA experiments consists of a web-based interactive platform for text-based games and visualization for acquired knowledge for improving interpretability for trained rules. This demonstration also provides a comparison module with other neuro-symbolic approaches as well as non-symbolic state-of-the-art agent models on the same text-based games. Our LOA also provides open-sourced implementation in Python for the reinforcement learning environment to facilitate an experiment for studying neuro-symbolic agents. Code: https://github.com/ibm/loa

ROJun 3, 2018Code
MaestROB: A Robotics Framework for Integrated Orchestration of Low-Level Control and High-Level Reasoning

Asim Munawar, Giovanni De Magistris, Tu-Hoa Pham et al.

This paper describes a framework called MaestROB. It is designed to make the robots perform complex tasks with high precision by simple high-level instructions given by natural language or demonstration. To realize this, it handles a hierarchical structure by using the knowledge stored in the forms of ontology and rules for bridging among different levels of instructions. Accordingly, the framework has multiple layers of processing components; perception and actuation control at the low level, symbolic planner and Watson APIs for cognitive capabilities and semantic understanding, and orchestration of these components by a new open source robot middleware called Project Intu at its core. We show how this framework can be used in a complex scenario where multiple actors (human, a communication robot, and an industrial robot) collaborate to perform a common industrial task. Human teaches an assembly task to Pepper (a humanoid robot from SoftBank Robotics) using natural language conversation and demonstration. Our framework helps Pepper perceive the human demonstration and generate a sequence of actions for UR5 (collaborative robot arm from Universal Robots), which ultimately performs the assembly (e.g. insertion) task.

AIJun 28, 2024
Fine-tuning of Geospatial Foundation Models for Aboveground Biomass Estimation

Michal Muszynski, Levente Klein, Ademir Ferreira da Silva et al.

Global vegetation structure mapping is critical for understanding the global carbon cycle and maximizing the efficacy of nature-based carbon sequestration initiatives. Moreover, vegetation structure mapping can help reduce the impacts of climate change by, for example, guiding actions to improve water security, increase biodiversity and reduce flood risk. Global satellite measurements provide an important set of observations for monitoring and managing deforestation and degradation of existing forests, natural forest regeneration, reforestation, biodiversity restoration, and the implementation of sustainable agricultural practices. In this paper, we explore the effectiveness of fine-tuning of a geospatial foundation model to estimate above-ground biomass (AGB) using space-borne data collected across different eco-regions in Brazil. The fine-tuned model architecture consisted of a Swin-B transformer as the encoder (i.e., backbone) and a single convolutional layer for the decoder head. All results were compared to a U-Net which was trained as the baseline model Experimental results of this sparse-label prediction task demonstrate that the fine-tuned geospatial foundation model with a frozen encoder has comparable performance to a U-Net trained from scratch. This is despite the fine-tuned model having 13 times less parameters requiring optimization, which saves both time and compute resources. Further, we explore the transfer-learning capabilities of the geospatial foundation models by fine-tuning on satellite imagery with sparse labels from different eco-regions in Brazil.

AIOct 21, 2021
Neuro-Symbolic Reinforcement Learning with First-Order Logic

Daiki Kimura, Masaki Ono, Subhajit Chaudhury et al.

Deep reinforcement learning (RL) methods often require many trials before convergence, and no direct interpretability of trained policies is provided. In order to achieve fast convergence and interpretability for the policy in RL, we propose a novel RL method for text-based games with a recent neuro-symbolic framework called Logical Neural Network, which can learn symbolic and interpretable rules in their differentiable network. The method is first to extract first-order logical facts from text observation and external word meaning network (ConceptNet), then train a policy in the network with directly interpretable logical operators. Our experimental results show RL training with the proposed method converges significantly faster than other state-of-the-art neuro-symbolic methods in a TextWorld benchmark.

AIMar 3, 2021
Reinforcement Learning with External Knowledge by using Logical Neural Networks

Daiki Kimura, Subhajit Chaudhury, Akifumi Wachi et al.

Conventional deep reinforcement learning methods are sample-inefficient and usually require a large number of training trials before convergence. Since such methods operate on an unconstrained action set, they can lead to useless actions. A recent neuro-symbolic framework called the Logical Neural Networks (LNNs) can simultaneously provide key-properties of both neural networks and symbolic logic. The LNNs functions as an end-to-end differentiable network that minimizes a novel contradiction loss to learn interpretable rules. In this paper, we utilize LNNs to define an inference graph using basic logical operations, such as AND and NOT, for faster convergence in reinforcement learning. Specifically, we propose an integrated method that enables model-free reinforcement learning from external knowledge sources in an LNNs-based logical constrained framework such as action shielding and guide. Our results empirically demonstrate that our method converges faster compared to a model-free reinforcement learning method that doesn't have such logical constraints.

LGSep 24, 2020
Bootstrapped Q-learning with Context Relevant Observation Pruning to Generalize in Text-based Games

Subhajit Chaudhury, Daiki Kimura, Kartik Talamadupula et al.

We show that Reinforcement Learning (RL) methods for solving Text-Based Games (TBGs) often fail to generalize on unseen games, especially in small data regimes. To address this issue, we propose Context Relevant Episodic State Truncation (CREST) for irrelevant token removal in observation text for improved generalization. Our method first trains a base model using Q-learning, which typically overfits the training games. The base model's action token distribution is used to perform observation pruning that removes irrelevant tokens. A second bootstrapped model is then retrained on the pruned observation text. Our bootstrapped agent shows improved generalization in solving unseen TextWorld games, using 10x-20x fewer training games compared to previous state-of-the-art methods despite requiring less number of training episodes.

CVFeb 19, 2020
Unsupervised Temporal Feature Aggregation for Event Detection in Unstructured Sports Videos

Subhajit Chaudhury, Daiki Kimura, Phongtharin Vinayavekhin et al.

Image-based sports analytics enable automatic retrieval of key events in a game to speed up the analytics process for human experts. However, most existing methods focus on structured television broadcast video datasets with a straight and fixed camera having minimum variability in the capturing pose. In this paper, we study the case of event detection in sports videos for unstructured environments with arbitrary camera angles. The transition from structured to unstructured video analysis produces multiple challenges that we address in our paper. Specifically, we identify and solve two major problems: unsupervised identification of players in an unstructured setting and generalization of the trained models to pose variations due to arbitrary shooting angles. For the first problem, we propose a temporal feature aggregation algorithm using person re-identification features to obtain high player retrieval precision by boosting a weak heuristic scoring method. Additionally, we propose a data augmentation technique, based on multi-modal image translation model, to reduce bias in the appearance of training samples. Experimental evaluations show that our proposed method improves precision for player retrieval from 0.78 to 0.86 for obliquely angled videos. Additionally, we obtain an improvement in F1 score for rally detection in table tennis videos from 0.79 in case of global frame-level features to 0.89 using our proposed player-level features. Please see the supplementary video submission at https://ibm.biz/BdzeZA.

CVMar 23, 2019
Spatially-weighted Anomaly Detection with Regression Model

Daiki Kimura, Minori Narita, Asim Munawar et al.

Visual anomaly detection is common in several applications including medical screening and production quality check. Although a definition of the anomaly is an unknown trend in data, in many cases some hints or samples of the anomaly class can be given in advance. Conventional methods cannot use the available anomaly data, and also do not have a robustness of noise. In this paper, we propose a novel spatially-weighted reconstruction-loss-based anomaly detection with a likelihood value from a regression model trained by all known data. The spatial weights are calculated by a region of interest generated from employing visualization of the regression model. We introduce some ways to combine with various strategies to propose a state-of-the-art method. Comparing with other methods on three different datasets, we empirically verify the proposed method performs better than the others.

CVOct 5, 2018
Spatially-weighted Anomaly Detection

Minori Narita, Daiki Kimura, Ryuki Tachibana

Many types of anomaly detection methods have been proposed recently, and applied to a wide variety of fields including medical screening and production quality checking. Some methods have utilized images, and, in some cases, a part of the anomaly images is known beforehand. However, this kind of information is dismissed by previous methods, because the methods can only utilize a normal pattern. Moreover, the previous methods suffer a decrease in accuracy due to negative effects from surrounding noises. In this study, we propose a spatially-weighted anomaly detection method (SPADE) that utilizes all of the known patterns and lessens the vulnerability to ambient noises by applying Grad-CAM, which is the visualization method of a CNN. We evaluated our method quantitatively using two datasets, the MNIST dataset with noise and a dataset based on a brief screening test for dementia.

LGOct 2, 2018
Injective State-Image Mapping facilitates Visual Adversarial Imitation Learning

Subhajit Chaudhury, Daiki Kimura, Asim Munawar et al.

The growing use of virtual autonomous agents in applications like games and entertainment demands better control policies for natural-looking movements and actions. Unlike the conventional approach of hard-coding motion routines, we propose a deep learning method for obtaining control policies by directly mimicking raw video demonstrations. Previous methods in this domain rely on extracting low-dimensional features from expert videos followed by a separate hand-crafted reward estimation step. We propose an imitation learning framework that reduces the dependence on hand-engineered reward functions by jointly learning the feature extraction and reward estimation steps using Generative Adversarial Networks (GANs). Our main contribution in this paper is to show that under injective mapping between low-level joint state (angles and velocities) trajectories and corresponding raw video stream, performing adversarial imitation learning on video demonstrations is equivalent to learning from the state trajectories. Experimental results show that the proposed adversarial learning method from raw videos produces a similar performance to state-of-the-art imitation learning techniques while frequently outperforming existing hand-crafted video imitation methods. Furthermore, we show that our method can learn action policies by imitating video demonstrations on YouTube with similar performance to learned agents from true reward signals. Please see the supplementary video submission at https://ibm.biz/BdzzNA.

CVJun 22, 2018
Focusing on What is Relevant: Time-Series Learning and Understanding using Attention

Phongtharin Vinayavekhin, Subhajit Chaudhury, Asim Munawar et al.

This paper is a contribution towards interpretability of the deep learning models in different applications of time-series. We propose a temporal attention layer that is capable of selecting the relevant information to perform various tasks, including data completion, key-frame detection and classification. The method uses the whole input sequence to calculate an attention value for each time step. This results in more focused attention values and more plausible visualisation than previous methods. We apply the proposed method to three different tasks. Experimental results show that the proposed network produces comparable results to a state of the art. In addition, the network provides better interpretability of the decision, that is, it generates more significant attention weight to related frames compared to similar techniques attempted in the past.

CVJun 2, 2018
DAQN: Deep Auto-encoder and Q-Network

Daiki Kimura

The deep reinforcement learning method usually requires a large number of training images and executing actions to obtain sufficient results. When it is extended a real-task in the real environment with an actual robot, the method will be required more training images due to complexities or noises of the input images, and executing a lot of actions on the real robot also becomes a serious problem. Therefore, we propose an extended deep reinforcement learning method that is applied a generative model to initialize the network for reducing the number of training trials. In this paper, we used a deep q-network method as the deep reinforcement learning method and a deep auto-encoder as the generative model. We conducted experiments on three different tasks: a cart-pole game, an atari game, and a real-game with an actual robot. The proposed method trained efficiently on all tasks than the previous method, especially 2.5 times faster on a task with real environment images.

LGJun 2, 2018
Internal Model from Observations for Reward Shaping

Daiki Kimura, Subhajit Chaudhury, Ryuki Tachibana et al.

Reinforcement learning methods require careful design involving a reward function to obtain the desired action policy for a given task. In the absence of hand-crafted reward functions, prior work on the topic has proposed several methods for reward estimation by using expert state trajectories and action pairs. However, there are cases where complete or good action information cannot be obtained from expert demonstrations. We propose a novel reinforcement learning method in which the agent learns an internal model of observation on the basis of expert-demonstrated state trajectories to estimate rewards without completely learning the dynamics of the external environment from state-action pairs. The internal model is obtained in the form of a predictive model for the given expert state distribution. During reinforcement learning, the agent predicts the reward as a function of the difference between the actual state and the state predicted by the internal model. We conducted multiple experiments in environments of varying complexity, including the Super Mario Bros and Flappy Bird games. We show our method successfully trains good policies directly from expert game-play videos.