Kazuhiko Kawamoto

CV
h-index11
30papers
90citations
Novelty51%
AI Score55

30 Papers

CVJan 21, 2023
Improving Zero-Shot Action Recognition using Human Instruction with Text Description

Nan Wu, Hiroshi Kera, Kazuhiko Kawamoto

Zero-shot action recognition, which recognizes actions in videos without having received any training examples, is gaining wide attention considering it can save labor costs and training time. Nevertheless, the performance of zero-shot learning is still unsatisfactory, which limits its practical application. To solve this problem, this study proposes a framework to improve zero-shot action recognition using human instructions with text descriptions. The proposed framework manually describes video contents, which incurs some labor costs; in many situations, the labor costs are worth it. We manually annotate text features for each action, which can be a word, phrase, or sentence. Then by computing the matching degrees between the video and all text features, we can predict the class of the video. Furthermore, the proposed model can also be combined with other models to improve its accuracy. In addition, our model can be continuously optimized to improve the accuracy by repeating human instructions. The results with UCF101 and HMDB51 showed that our model achieved the best accuracy and improved the accuracies of other models.

LGNov 7, 2022
Spatiotemporal forecasting of vertical track alignment with exogenous factors

Katsuya Kosukegawa, Yasukuni Mori, Hiroki Suyari et al.

To ensure the safety of railroad operations, it is important to monitor and forecast track geometry irregularities. A higher safety requires forecasting with higher spatiotemporal frequencies, which in turn requires capturing spatial correlations. Additionally, track geometry irregularities are influenced by multiple exogenous factors. In this study, a method is proposed to forecast one type of track geometry irregularity, vertical alignment, by incorporating spatial and exogenous factor calculations. The proposed method embeds exogenous factors and captures spatiotemporal correlations using a convolutional long short-term memory. The proposed method is also experimentally compared with other methods in terms of the forecasting performance. Additionally, an ablation study on exogenous factors is conducted to examine their individual contributions to the forecasting performance. The results reveal that spatial calculations and maintenance record data improve the forecasting of vertical alignment.

ROMay 20, 2022
Adversarial joint attacks on legged robots

Takuto Otomo, Hiroshi Kera, Kazuhiko Kawamoto

We address adversarial attacks on the actuators at the joints of legged robots trained by deep reinforcement learning. The vulnerability to the joint attacks can significantly impact the safety and robustness of legged robots. In this study, we demonstrate that the adversarial perturbations to the torque control signals of the actuators can significantly reduce the rewards and cause walking instability in robots. To find the adversarial torque perturbations, we develop black-box adversarial attacks, where, the adversary cannot access the neural networks trained by deep reinforcement learning. The black box attack can be applied to legged robots regardless of the architecture and algorithms of deep reinforcement learning. We employ three search methods for the black-box adversarial attacks: random search, differential evolution, and numerical gradient descent methods. In experiments with the quadruped robot Ant-v2 and the bipedal robot Humanoid-v2, in OpenAI Gym environments, we find that differential evolution can efficiently find the strongest torque perturbations among the three methods. In addition, we realize that the quadruped robot Ant-v2 is vulnerable to the adversarial perturbations, whereas the bipedal robot Humanoid-v2 is robust to the perturbations. Consequently, the joint attacks can be used for proactive diagnosis of robot walking instability.

CVAug 20, 2024
Low-Quality Image Detection by Hierarchical VAE

Tomoyasu Nanaumi, Kazuhiko Kawamoto, Hiroshi Kera

To make an employee roster, photo album, or training dataset of generative models, one needs to collect high-quality images while dismissing low-quality ones. This study addresses a new task of unsupervised detection of low-quality images. We propose a method that not only detects low-quality images with various types of degradation but also provides visual clues of them based on an observation that partial reconstruction by hierarchical variational autoencoders fails for low-quality images. The experiments show that our method outperforms several unsupervised out-of-distribution detection methods and also gives visual clues for low-quality images that help humans recognize them even in thumbnail view.

ROMay 20, 2022
Adversarial Body Shape Search for Legged Robots

Takaaki Azakami, Hiroshi Kera, Kazuhiko Kawamoto

We propose an evolutionary computation method for an adversarial attack on the length and thickness of parts of legged robots by deep reinforcement learning. This attack changes the robot body shape and interferes with walking-we call the attacked body as adversarial body shape. The evolutionary computation method searches adversarial body shape by minimizing the expected cumulative reward earned through walking simulation. To evaluate the effectiveness of the proposed method, we perform experiments with three-legged robots, Walker2d, Ant-v2, and Humanoid-v2 in OpenAI Gym. The experimental results reveal that Walker2d and Ant-v2 are more vulnerable to the attack on the length than the thickness of the body parts, whereas Humanoid-v2 is vulnerable to the attack on both of the length and thickness. We further identify that the adversarial body shapes break left-right symmetry or shift the center of gravity of the legged robots. Finding adversarial body shape can be used to proactively diagnose the vulnerability of legged robot walking.

CVOct 7, 2022
Game-Theoretic Understanding of Misclassification

Kosuke Sumiyasu, Kazuhiko Kawamoto, Hiroshi Kera

This paper analyzes various types of image misclassification from a game-theoretic view. Particularly, we consider the misclassification of clean, adversarial, and corrupted images and characterize it through the distribution of multi-order interactions. We discover that the distribution of multi-order interactions varies across the types of misclassification. For example, misclassified adversarial images have a higher strength of high-order interactions than correctly classified clean images, which indicates that adversarial perturbations create spurious features that arise from complex cooperation between pixels. By contrast, misclassified corrupted images have a lower strength of low-order interactions than correctly classified clean images, which indicates that corruptions break the local cooperation between pixels. We also provide the first analysis of Vision Transformers using interactions. We found that Vision Transformers show a different tendency in the distribution of interactions from that in CNNs, and this implies that they exploit the features that CNNs do not use for the prediction. Our study demonstrates that the recent game-theoretic analysis of deep learning models can be broadened to analyze various malfunctions of deep learning models including Vision Transformers by using the distribution, order, and sign of interactions.

CVMar 14, 2022
Adversarial amplitude swap towards robust image classifiers

Chun Yang Tan, Kazuhiko Kawamoto, Hiroshi Kera

The vulnerability of convolutional neural networks (CNNs) to image perturbations such as common corruptions and adversarial perturbations has recently been investigated from the perspective of frequency. In this study, we investigate the effect of the amplitude and phase spectra of adversarial images on the robustness of CNN classifiers. Extensive experiments revealed that the images generated by combining the amplitude spectrum of adversarial images and the phase spectrum of clean images accommodates moderate and general perturbations, and training with these images equips a CNN classifier with more general robustness, performing well under both common corruptions and adversarial perturbations. We also found that two types of overfitting (catastrophic overfitting and robust overfitting) can be circumvented by the aforementioned spectrum recombination. We believe that these results contribute to the understanding and the training of truly robust classifiers.

CVDec 24, 2024Code
Adapter Merging with Centroid Prototype Mapping for Scalable Class-Incremental Learning

Takuma Fukuda, Hiroshi Kera, Kazuhiko Kawamoto

We propose Adapter Merging with Centroid Prototype Mapping (ACMap), an exemplar-free framework for class-incremental learning (CIL) that addresses both catastrophic forgetting and scalability. While existing methods involve a trade-off between inference time and accuracy, ACMap consolidates task-specific adapters into a single adapter, thus achieving constant inference time across tasks without sacrificing accuracy. The framework employs adapter merging to build a shared subspace that aligns task representations and mitigates forgetting, while centroid prototype mapping maintains high accuracy by consistently adapting representations within the shared subspace. To further improve scalability, an early stopping strategy limits adapter merging as tasks increase. Extensive experiments on five benchmark datasets demonstrate that ACMap matches state-of-the-art accuracy while maintaining inference time comparable to the fastest existing methods. The code is available at https://github.com/tf63/ACMap.

CVFeb 26
Robust Human Trajectory Prediction via Self-Supervised Skeleton Representation Learning

Taishu Arashima, Hiroshi Kera, Kazuhiko Kawamoto

Human trajectory prediction plays a crucial role in applications such as autonomous navigation and video surveillance. While recent works have explored the integration of human skeleton sequences to complement trajectory information, skeleton data in real-world environments often suffer from missing joints caused by occlusions. These disturbances significantly degrade prediction accuracy, indicating the need for more robust skeleton representations. We propose a robust trajectory prediction method that incorporates a self-supervised skeleton representation model pretrained with masked autoencoding. Experimental results in occlusion-prone scenarios show that our method improves robustness to missing skeletal data without sacrificing prediction accuracy, and consistently outperforms baseline models in clean-to-moderate missingness regimes.

CVMay 16
Zero-Shot Faithful Textual Explanations via Directional-Derivative Influence on Predictions

Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto

Zero-shot textual explanations aim to make image classifiers more transparent by probing their internal representations, without relying on task-specific supervision or LVLMs. However, existing methods often miss the features that truly drive the prediction, resulting in limited \textit{faithfulness} to the evidence underlying the model's decision. To address this, we propose FaithTrace. Motivated by the idea that faithful explanations should describe concepts that strongly influence the prediction, FaithTrace directly measures how much the representation induced by the explanation changes the class logit. We introduce an influence score, computed as the directional derivative of the class logit along the text-induced direction in the classifier's feature space, and use it as a proxy for faithfulness. Moreover, we extend this influence score into quantitative evaluation metrics, helping fill the gap in faithfulness evaluation for textual explanations. Experiments show that FaithTrace yields more faithful explanations than baselines, facilitating a more accurate understanding of the model. The code will be publicly released.

LGMay 14
Guided Diffusion Sampling for Precipitation Forecast Interventions

Ayumu Ueyama, Kazuhiko Kawamoto, Hiroshi Kera

Extreme precipitation causes severe societal and economic damage, and weather control has long been discussed as a potential mitigation strategy. However, to the best of our knowledge, perturbation-based interventions for weather control using data-driven weather forecasting models have not yet been explored. While adversarial attacks also generate perturbations that alter forecasts, they aim to exploit model artifacts and do not account for physical plausibility. In this paper, we propose a gradient-based guidance framework for precipitation-reduction interventions through diffusion sampling in diffusion-based weather forecasting models. Instead of directly perturbing atmospheric states, our method steers the diffusion sampling trajectory, enabling precipitation reduction while maintaining consistency with the atmospheric distribution. To assess physical plausibility, we evaluate from three perspectives: (i) vertical and variable-wise perturbation profiles, (ii) latent-space trajectory deviation, and (iii) cross-model transferability. Experiments on extreme precipitation events from WeatherBench2 demonstrate that our method achieves effective precipitation reduction while yielding more physically plausible interventions than adversarial perturbations.

CVJan 8, 2024Code
Identifying Important Group of Pixels using Interactions

Kosuke Sumiyasu, Kazuhiko Kawamoto, Hiroshi Kera

To better understand the behavior of image classifiers, it is useful to visualize the contribution of individual pixels to the model prediction. In this study, we propose a method, MoXI ($\textbf{Mo}$del e$\textbf{X}$planation by $\textbf{I}$nteractions), that efficiently and accurately identifies a group of pixels with high prediction confidence. The proposed method employs game-theoretic concepts, Shapley values and interactions, taking into account the effects of individual pixels and the cooperative influence of pixels on model confidence. Theoretical analysis and experiments demonstrate that our method better identifies the pixels that are highly contributing to the model outputs than widely-used visualization by Grad-CAM, Attention rollout, and Shapley value. While prior studies have suffered from the exponential computational cost in the computation of Shapley value and interactions, we show that this can be reduced to quadratic cost for our task. The code is available at https://github.com/KosukeSumiyasu/MoXI.

CVMay 17, 2025Code
Cross-Model Transfer of Task Vectors via Few-Shot Orthogonal Alignment

Kazuhiko Kawamoto, Atsuhiro Endo, Hiroshi Kera

Task arithmetic enables efficient model editing by representing task-specific changes as vectors in parameter space. Task arithmetic typically assumes that the source and target models are initialized from the same pre-trained parameters. This assumption limits its applicability in cross-model transfer settings, where models are independently pre-trained on different datasets. To address this challenge, we propose a method based on few-shot orthogonal alignment, which aligns task vectors to the parameter space of a differently pre-trained target model. These transformations preserve key properties of task vectors, such as norm and rank, and are learned using only a small number of labeled examples. We evaluate the method using two Vision Transformers pre-trained on YFCC100M and LAION400M, and test on eight classification datasets. Experimental results show that our method improves transfer accuracy over direct task vector application and achieves performance comparable to few-shot fine-tuning, while maintaining the modularity and reusability of task vectors. Our code is available at https://github.com/kawakera-lab/CrossModelTransfer.

LGMay 8
Learning Large-Scale Modular Addition with an Auxiliary Modulus

Hanato Kikuchi, Ryosuke Masuya, Kazuhiko Kawamoto et al.

Learning parity functions, more general modular addition, is a challenging machine learning task due to its input sensitivity. A recent study substantially scaled modular addition learning in both the number of summands and the modulus. Its key idea is to increase zeros in training sequences, reducing the effective number of summands and thus controlling training difficulty; however, this induces covariate shift between training and test input distributions. This study theoretically and empirically analyzes this side effect and proposes a covariate-shift-free method for modular addition. Specifically, we introduce an auxiliary modulus $Kq$ during training, which reduces wrap-around frequency and problem difficulty while preserving the same input distribution across training and testing. Experiments show strong scalability and sample efficiency: even for large input length $N$, large modulus $q$, and small datasets -- where the sparse method fails to learn -- our method achieves equal or better match accuracy and relaxed $τ$-accuracy. For example, at $N=64$ and $q=974269$, our method trained on 100K samples achieves $97.0\%$ $τ$-accuracy at $τ=0.05$, while the sparse method achieves only $9.5\%$ with the same data size and $93.9\%$ even when extended to 1M samples.

CVApr 1
Towards Viewpoint-Robust End-to-End Autonomous Driving with 3D Foundation Model Priors

Hiroki Hashimoto, Hiromichi Goto, Hiroyuki Sugai et al.

Robust trajectory planning under camera viewpoint changes is important for scalable end-to-end autonomous driving. However, existing models often depend heavily on the camera viewpoints seen during training. We investigate an augmentation-free approach that leverages geometric priors from a 3D foundation model. The method injects per-pixel 3D positions derived from depth estimates as positional embeddings and fuses intermediate geometric features through cross-attention. Experiments on the VR-Drive camera viewpoint perturbation benchmark show reduced performance degradation under most perturbation conditions, with clear improvements under pitch and height perturbations. Gains under longitudinal translation are smaller, suggesting that more viewpoint-agnostic integration is needed for robustness to camera viewpoint changes.

CVDec 8, 2025
Zero-Shot Textual Explanations via Translating Decision-Critical Features

Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto

Textual explanations make image classifier decisions transparent by describing the prediction rationale in natural language. Large vision-language models can generate captions but are designed for general visual understanding, not classifier-specific reasoning. Existing zero-shot explanation methods align global image features with language, producing descriptions of what is visible rather than what drives the prediction. We propose TEXTER, which overcomes this limitation by isolating decision-critical features before alignment. TEXTER identifies the neurons contributing to the prediction and emphasizes the features encoded in those neurons -- i.e., the decision-critical features. It then maps these emphasized features into the CLIP feature space to retrieve textual explanations that reflect the model's reasoning. A sparse autoencoder further improves interpretability, particularly for Transformer architectures. Extensive experiments show that TEXTER generates more faithful and interpretable explanations than existing methods. The code will be publicly released.

CVDec 1, 2024
Explaining Object Detectors via Collective Contribution of Pixels

Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto

Visual explanations for object detectors are crucial for enhancing their reliability. Since object detectors identify and localize instances by assessing multiple features collectively, generating explanations that capture these collective contributions is critical. However, existing methods focus solely on individual pixel contributions, ignoring the collective contribution of multiple pixels. To address this, we proposed a method for object detectors that considers the collective contribution of multiple pixels. Our approach leverages game-theoretic concepts, specifically Shapley values and interactions, to provide explanations. These explanations cover both bounding box generation and class determination, considering both individual and collective pixel contributions. Extensive quantitative and qualitative experiments demonstrate that the proposed method more accurately identifies important regions in detection results compared to current state-of-the-art methods. The code will be publicly available soon.

RODec 25, 2024
Robustness Evaluation of Offline Reinforcement Learning for Robot Control Against Action Perturbations

Shingo Ayabe, Takuto Otomo, Hiroshi Kera et al.

Offline reinforcement learning, which learns solely from datasets without environmental interaction, has gained attention. This approach, similar to traditional online deep reinforcement learning, is particularly promising for robot control applications. Nevertheless, its robustness against real-world challenges, such as joint actuator faults in robots, remains a critical concern. This study evaluates the robustness of existing offline reinforcement learning methods using legged robots from OpenAI Gym based on average episodic rewards. For robustness evaluation, we simulate failures by incorporating both random and adversarial perturbations, representing worst-case scenarios, into the joint torque signals. Our experiments show that existing offline reinforcement learning methods exhibit significant vulnerabilities to these action perturbations and are more vulnerable than online reinforcement learning methods, highlighting the need for more robust approaches in this field.

ROOct 15, 2025
Adversarial Fine-tuning in Offline-to-Online Reinforcement Learning for Robust Robot Control

Shingo Ayabe, Hiroshi Kera, Kazuhiko Kawamoto

Offline reinforcement learning enables sample-efficient policy acquisition without risky online interaction, yet policies trained on static datasets remain brittle under action-space perturbations such as actuator faults. This study introduces an offline-to-online framework that trains policies on clean data and then performs adversarial fine-tuning, where perturbations are injected into executed actions to induce compensatory behavior and improve resilience. A performance-aware curriculum further adjusts the perturbation probability during training via an exponential-moving-average signal, balancing robustness and stability throughout the learning process. Experiments on continuous-control locomotion tasks demonstrate that the proposed method consistently improves robustness over offline-only baselines and converges faster than training from scratch. Matching the fine-tuning and evaluation conditions yields the strongest robustness to action-space perturbations, while the adaptive curriculum strategy mitigates the degradation of nominal performance observed with the linear curriculum strategy. Overall, the results show that adversarial fine-tuning enables adaptive and robust control under uncertain environments, bridging the gap between offline efficiency and online adaptability.

LGJun 30, 2025
Chain of Thought in Order: Discovering Learning-Friendly Orders for Arithmetic

Yuta Sato, Kazuhiko Kawamoto, Hiroshi Kera

The chain of thought is fundamental in Transformers, which is to perform step-by-step reasoning. Besides what intermediate steps work, the order of these steps critically affects the difficulty of the reasoning. This study addresses a novel task of unraveling chain of thought - reordering decoder input tokens to a learning-friendly sequence for Transformers to learn arithmetic tasks. The proposed pipeline first trains a Transformer on a mixture of target sequences arranged in different orders and then identifies benign orders as those with fast loss drops in the early stage. As the search space grows factorially with sequence length, we propose a two-stage hierarchical approach for inter- and intra-block reordering. Experiments on four order-sensitive arithmetic tasks show that our method identifies a learning-friendly order out of a few billion candidates. Notably, on the multiplication task, it recovered the reverse-digit order reported in prior studies.

LGJun 25, 2025
Learning Moderately Input-Sensitive Functions: A Case Study in QR Code Decoding

Kazuki Yoda, Kazuhiko Kawamoto, Hiroshi Kera

The hardness of learning a function that attains a target task relates to its input-sensitivity. For example, image classification tasks are input-insensitive as minor corruptions should not affect the classification results, whereas arithmetic and symbolic computation, which have been recently attracting interest, are highly input-sensitive as each input variable connects to the computation results. This study presents the first learning-based Quick Response (QR) code decoding and investigates learning functions of medium sensitivity. Our experiments reveal that Transformers can successfully decode QR codes, even beyond the theoretical error-correction limit, by learning the structure of embedded texts. They generalize from English-rich training data to other languages and even random strings. Moreover, we observe that the Transformer-based QR decoder focuses on data bits while ignoring error-correction bits, suggesting a decoding mechanism distinct from standard QR code readers.

IVMar 4, 2025
Undertrained Image Reconstruction for Realistic Degradation in Blind Image Super-Resolution

Ru Ito, Supatta Viriyavisuthisakul, Kazuhiko Kawamoto et al.

Most super-resolution (SR) models struggle with real-world low-resolution (LR) images. This issue arises because the degradation characteristics in the synthetic datasets differ from those in real-world LR images. Since SR models are trained on pairs of high-resolution (HR) and LR images generated by downsampling, they are optimized for simple degradation. However, real-world LR images contain complex degradation caused by factors such as the imaging process and JPEG compression. Due to these differences in degradation characteristics, most SR models perform poorly on real-world LR images. This study proposes a dataset generation method using undertrained image reconstruction models. These models have the property of reconstructing low-quality images with diverse degradation from input images. By leveraging this property, this study generates LR images with diverse degradation from HR images to construct the datasets. Fine-tuning pre-trained SR models on our generated datasets improves noise removal and blur reduction, enhancing performance on real-world LR images. Furthermore, an analysis of the datasets reveals that degradation diversity contributes to performance improvements, whereas color differences between HR and LR images may degrade performance. 11 pages, (11 figures and 2 tables)

LGJun 28, 2024
VarteX: Enhancing Weather Forecast through Distributed Variable Representation

Ayumu Ueyama, Kazuhiko Kawamoto, Hiroshi Kera

Weather forecasting is essential for various human activities. Recent data-driven models have outperformed numerical weather prediction by utilizing deep learning in forecasting performance. However, challenges remain in efficiently handling multiple meteorological variables. This study proposes a new variable aggregation scheme and an efficient learning framework for that challenge. Experiments show that VarteX outperforms the conventional model in forecast performance, requiring significantly fewer parameters and resources. The effectiveness of learning through multiple aggregations and regional split training is demonstrated, enabling more efficient and accurate deep learning-based weather forecasting.

CVMar 13, 2024
Matching Non-Identical Objects

Yusuke Marumo, Kazuhiko Kawamoto, Satomi Tanaka et al.

Not identical but similar objects are ubiquitous in our world, ranging from four-legged animals such as dogs and cats to cars of different models and flowers of various colors. This study addresses a novel task of matching such non-identical objects at the pixel level. We propose a weighting scheme of descriptors that incorporates semantic information from object detectors into existing sparse image matching methods, extending their targets from identical objects captured from different perspectives to semantically similar objects. The experiments show successful matching between non-identical objects in various cases, including in-class design variations, class discrepancy, and domain shifts (e.g., photo--drawing and image corruptions).

CVMay 29, 2023
Fourier Analysis on Robustness of Graph Convolutional Neural Networks for Skeleton-based Action Recognition

Nariki Tanaka, Hiroshi Kera, Kazuhiko Kawamoto

Using Fourier analysis, we explore the robustness and vulnerability of graph convolutional neural networks (GCNs) for skeleton-based action recognition. We adopt a joint Fourier transform (JFT), a combination of the graph Fourier transform (GFT) and the discrete Fourier transform (DFT), to examine the robustness of adversarially-trained GCNs against adversarial attacks and common corruptions. Experimental results with the NTU RGB+D dataset reveal that adversarial training does not introduce a robustness trade-off between adversarial attacks and low-frequency perturbations, which typically occurs during image classification based on convolutional neural networks. This finding indicates that adversarial training is a practical approach to enhancing robustness against adversarial attacks and common corruptions in skeleton-based action recognition. Furthermore, we find that the Fourier approach cannot explain vulnerability against skeletal part occlusion corruption, which highlights its limitations. These findings extend our understanding of the robustness of GCNs, potentially guiding the development of more robust learning methods for skeleton-based action recognition.

CVMay 15, 2023
Exploiting Frequency Spectrum of Adversarial Images for General Robustness

Chun Yang Tan, Kazuhiko Kawamoto, Hiroshi Kera

In recent years, there has been growing concern over the vulnerability of convolutional neural networks (CNNs) to image perturbations. However, achieving general robustness against different types of perturbations remains challenging, in which enhancing robustness to some perturbations (e.g., adversarial perturbations) may degrade others (e.g., common corruptions). In this paper, we demonstrate that adversarial training with an emphasis on phase components significantly improves model performance on clean, adversarial, and common corruption accuracies. We propose a frequency-based data augmentation method, Adversarial Amplitude Swap, that swaps the amplitude spectrum between clean and adversarial images to generate two novel training images: adversarial amplitude and adversarial phase images. These images act as substitutes for adversarial images and can be implemented in various adversarial training setups. Through extensive experiments, we demonstrate that our method enables the CNNs to gain general robustness against different types of perturbations and results in a uniform performance against all types of common corruptions.

RONov 19, 2021
Reinforcement Learning with Adaptive Curriculum Dynamics Randomization for Fault-Tolerant Robot Control

Wataru Okamoto, Hiroshi Kera, Kazuhiko Kawamoto

This study is aimed at addressing the problem of fault tolerance of quadruped robots to actuator failure, which is critical for robots operating in remote or extreme environments. In particular, an adaptive curriculum reinforcement learning algorithm with dynamics randomization (ACDR) is established. The ACDR algorithm can adaptively train a quadruped robot in random actuator failure conditions and formulate a single robust policy for fault-tolerant robot control. It is noted that the hard2easy curriculum is more effective than the easy2hard curriculum for quadruped robot locomotion. The ACDR algorithm can be used to build a robot system that does not require additional modules for detecting actuator failures and switching policies. Experimental results show that the ACDR algorithm outperforms conventional algorithms in terms of the average reward and walking distance.

CVSep 13, 2021
Conditional MoCoGAN for Zero-Shot Video Generation

Shun Kimura, Kazuhiko Kawamoto

We propose a conditional generative adversarial network (GAN) model for zero-shot video generation. In this study, we have explored zero-shot conditional generation setting. In other words, we generate unseen videos from training samples with missing classes. The task is an extension of conditional data generation. The key idea is to learn disentangled representations in the latent space of a GAN. To realize this objective, we base our model on the motion and content decomposed GAN and conditional GAN for image generation. We build the model to find better-disentangled representations and to generate good-quality videos. We demonstrate the effectiveness of our proposed model through experiments on the Weizmann action database and the MUG facial expression database.

CVSep 13, 2021
Adversarial Bone Length Attack on Action Recognition

Nariki Tanaka, Hiroshi Kera, Kazuhiko Kawamoto

Skeleton-based action recognition models have recently been shown to be vulnerable to adversarial attacks. Compared to adversarial attacks on images, perturbations to skeletons are typically bounded to a lower dimension of approximately 100 per frame. This lower-dimensional setting makes it more difficult to generate imperceptible perturbations. Existing attacks resolve this by exploiting the temporal structure of the skeleton motion so that the perturbation dimension increases to thousands. In this paper, we show that adversarial attacks can be performed on skeleton-based action recognition models, even in a significantly low-dimensional setting without any temporal manipulation. Specifically, we restrict the perturbations to the lengths of the skeleton's bones, which allows an adversary to manipulate only approximately 30 effective dimensions. We conducted experiments on the NTU RGB+D and HDM05 datasets and demonstrate that the proposed attack successfully deceived models with sometimes greater than 90% success rate by small perturbations. Furthermore, we discovered an interesting phenomenon: in our low-dimensional setting, the adversarial training with the bone length attack shares a similar property with data augmentation, and it not only improves the adversarial robustness but also improves the classification accuracy on the original data. This is an interesting counterexample of the trade-off between adversarial robustness and clean accuracy, which has been widely observed in studies on adversarial training in the high-dimensional regime.

CVSep 13, 2021
Adversarially Trained Object Detector for Unsupervised Domain Adaptation

Kazuma Fujii, Hiroshi Kera, Kazuhiko Kawamoto

Unsupervised domain adaptation, which involves transferring knowledge from a label-rich source domain to an unlabeled target domain, can be used to substantially reduce annotation costs in the field of object detection. In this study, we demonstrate that adversarial training in the source domain can be employed as a new approach for unsupervised domain adaptation. Specifically, we establish that adversarially trained detectors achieve improved detection performance in target domains that are significantly shifted from source domains. This phenomenon is attributed to the fact that adversarially trained detectors can be used to extract robust features that are in alignment with human perception and worth transferring across domains while discarding domain-specific non-robust features. In addition, we propose a method that combines adversarial training and feature alignment to ensure the improved alignment of robust features with the target domain. We conduct experiments on four benchmark datasets and confirm the effectiveness of our proposed approach on large domain shifts from real to artistic images. Compared to the baseline models, the adversarially trained detectors improve the mean average precision by up to 7.7%, and further by up to 11.8% when feature alignments are incorporated. Although our method degrades performance for small domain shifts, quantification of the domain shift based on the Frechet distance allows us to determine whether adversarial training should be conducted.