CVJul 12, 2024
Layer-Wise Relevance Propagation with Conservation Property for ResNetSeitaro Otsuki, Tsumugi Iida, Félix Doublet et al.
The transparent formulation of explanation methods is essential for elucidating the predictions of neural networks, which are typically black-box models. Layer-wise Relevance Propagation (LRP) is a well-established method that transparently traces the flow of a model's prediction backward through its architecture by backpropagating relevance scores. However, the conventional LRP does not fully consider the existence of skip connections, and thus its application to the widely used ResNet architecture has not been thoroughly explored. In this study, we extend LRP to ResNet models by introducing Relevance Splitting at points where the output from a skip connection converges with that from a residual block. Our formulation guarantees the conservation property throughout the process, thereby preserving the integrity of the generated explanations. To evaluate the effectiveness of our approach, we conduct experiments on ImageNet and the Caltech-UCSD Birds-200-2011 dataset. Our method achieves superior performance to that of baseline methods on standard evaluation metrics such as the Insertion-Deletion score while maintaining its conservation property. We will release our code for further research at https://5ei74r0.github.io/lrp-for-resnet.page/
CVSep 12, 2022
Data Augmentation by Selecting Mixed Classes Considering Distance Between ClassesShungo Fujii, Yasunori Ishii, Kazuki Kozuka et al.
Data augmentation is an essential technique for improving recognition accuracy in object recognition using deep learning. Methods that generate mixed data from multiple data sets, such as mixup, can acquire new diversity that is not included in the training data, and thus contribute significantly to accuracy improvement. However, since the data selected for mixing are randomly sampled throughout the training process, there are cases where appropriate classes or data are not selected. In this study, we propose a data augmentation method that calculates the distance between classes based on class probabilities and can select data from suitable classes to be mixed in the training process. Mixture data is dynamically adjusted according to the training trend of each class to facilitate training. The proposed method is applied in combination with conventional methods for generating mixed data. Evaluation experiments show that the proposed method improves recognition performance on general and long-tailed image recognition datasets.
LGJun 24, 2023
Action Q-Transformer: Visual Explanation in Deep Reinforcement Learning with Encoder-Decoder Model using Action QueryHidenori Itaya, Tsubasa Hirakawa, Takayoshi Yamashita et al.
The excellent performance of Transformer in supervised learning has led to growing interest in its potential application to deep reinforcement learning (DRL) to achieve high performance on a wide variety of problems. However, the decision making of a DRL agent is a black box, which greatly hinders the application of the agent to real-world problems. To address this problem, we propose the Action Q-Transformer (AQT), which introduces a transformer encoder-decoder structure to Q-learning based DRL methods. In AQT, the encoder calculates the state value function and the decoder calculates the advantage function to promote the acquisition of different attentions indicating the agent's decision-making. The decoder in AQT utilizes action queries, which represent the information of each action, as queries. This enables us to obtain the attentions for the state value and for each action. By acquiring and visualizing these attentions that detail the agent's decision-making, we achieve a DRL model with high interpretability. In this paper, we show that visualization of attention in Atari 2600 games enables detailed analysis of agents' decision-making in various game tasks. Further, experimental results demonstrate that our method can achieve higher performance than the baseline in some games.
AIJun 4, 2023
Learning from AI: An Interactive Learning Method Using a DNN Model Incorporating Expert Knowledge as a TeacherKohei Hattori, Tsubasa Hirakawa, Takayoshi Yamashita et al.
Visual explanation is an approach for visualizing the grounds of judgment by deep learning, and it is possible to visually interpret the grounds of a judgment for a certain input by visualizing an attention map. As for deep-learning models that output erroneous decision-making grounds, a method that incorporates expert human knowledge in the model via an attention map in a manner that improves explanatory power and recognition accuracy is proposed. In this study, based on a deep-learning model that incorporates the knowledge of experts, a method by which a learner "learns from AI" the grounds for its decisions is proposed. An "attention branch network" (ABN), which has been fine-tuned with attention maps modified by experts, is prepared as a teacher. By using an interactive editing tool for the fine-tuned ABN and attention maps, the learner learns by editing the attention maps and changing the inference results. By repeatedly editing the attention maps and making inferences so that the correct recognition results are output, the learner can acquire the grounds for the expert's judgments embedded in the ABN. The results of an evaluation experiment with subjects show that learning using the proposed method is more efficient than the conventional method.
CVFeb 16, 2023
Masking and Mixing Adversarial TrainingHiroki Adachi, Tsubasa Hirakawa, Takayoshi Yamashita et al.
While convolutional neural networks (CNNs) have achieved excellent performances in various computer vision tasks, they often misclassify with malicious samples, a.k.a. adversarial examples. Adversarial training is a popular and straightforward technique to defend against the threat of adversarial examples. Unfortunately, CNNs must sacrifice the accuracy of standard samples to improve robustness against adversarial examples when adversarial training is used. In this work, we propose Masking and Mixing Adversarial Training (M2AT) to mitigate the trade-off between accuracy and robustness. We focus on creating diverse adversarial examples during training. Specifically, our approach consists of two processes: 1) masking a perturbation with a binary mask and 2) mixing two partially perturbed images. Experimental results on CIFAR-10 dataset demonstrate that our method achieves better robustness against several adversarial attacks than previous methods.
CVJul 27, 2022
Object-ABN: Learning to Generate Sharp Attention Maps for Action RecognitionTomoya Nitta, Tsubasa Hirakawa, Hironobu Fujiyoshi et al.
In this paper we propose an extension of the Attention Branch Network (ABN) by using instance segmentation for generating sharper attention maps for action recognition. Methods for visual explanation such as Grad-CAM usually generate blurry maps which are not intuitive for humans to understand, particularly in recognizing actions of people in videos. Our proposed method, Object-ABN, tackles this issue by introducing a new mask loss that makes the generated attention maps close to the instance segmentation result. Further the PC loss and multiple attention maps are introduced to enhance the sharpness of the maps and improve the performance of classification. Experimental results with UCF101 and SSv2 shows that the generated maps by the proposed method are much clearer qualitatively and quantitatively than those of the original ABN.
CVOct 11, 2024Code
DeBiFormer: Vision Transformer with Deformable Agent Bi-level Routing AttentionNguyen Huu Bao Long, Chenyu Zhang, Yuzhi Shi et al.
Vision Transformers with various attention modules have demonstrated superior performance on vision tasks. While using sparsity-adaptive attention, such as in DAT, has yielded strong results in image classification, the key-value pairs selected by deformable points lack semantic relevance when fine-tuning for semantic segmentation tasks. The query-aware sparsity attention in BiFormer seeks to focus each query on top-k routed regions. However, during attention calculation, the selected key-value pairs are influenced by too many irrelevant queries, reducing attention on the more important ones. To address these issues, we propose the Deformable Bi-level Routing Attention (DBRA) module, which optimizes the selection of key-value pairs using agent queries and enhances the interpretability of queries in attention maps. Based on this, we introduce the Deformable Bi-level Routing Attention Transformer (DeBiFormer), a novel general-purpose vision transformer built with the DBRA module. DeBiFormer has been validated on various computer vision tasks, including image classification, object detection, and semantic segmentation, providing strong evidence of its effectiveness.Code is available at {https://github.com/maclong01/DeBiFormer}
CVMay 16
P2GS: Physical Prior-guided Gaussian Splatting for Photometrically Consistent Urban ReconstructionKota Shimomura, Hidehisa Arai, Tsubasa Takahashi et al.
3D Gaussian Splatting (3DGS) has recently emerged as a powerful explicit representation enabling fast, high-fidelity rendering, making it a promising foundation for closed-loop simulators and perception models in autonomous driving. However, conventional 3DGS implicitly assumes consistent exposure and tone mapping across views. Real driving data violates this assumption due to heterogeneous camera pipelines and dynamic outdoor illumination, baking exposure discrepancies and sensor noise into the radiance field and producing artifacts and inconsistent illumination especially in static backgrounds crucial for realistic simulation. These issues are amplified in autonomous driving, where sparse viewpoints, varying exposures, and outdoor lighting interact, while prior work mainly targets dynamic-object reconstruction and overlooks cross-view photometric consistency. To address this limitation, we introduce P2GS, a physically consistent Gaussian Splatting framework that jointly decomposes a view-invariant linear HDR radiance field, per-view exposure scales, and tone-mapping functions from only LDR images without HDR supervision. P2GS employs a unified optimization strategy grounded in the physical image-formation process, enforcing relative-exposure consistency and HDR-domain radiance regularization. This yields a radiance field robust to inter-camera illumination differences while preserving the real-time efficiency of standard 3DGS. Experiments across real and simulated driving environments show that P2GS matches or surpasses prior methods in LDR reconstruction while providing substantially improved photometric consistency, reliable exposure normalization, and physically coherent illumination across diverse scenes.
CVDec 25, 2018Code
Attention Branch Network: Learning of Attention Mechanism for Visual ExplanationHiroshi Fukui, Tsubasa Hirakawa, Takayoshi Yamashita et al.
Visual explanation enables human to understand the decision making of Deep Convolutional Neural Network (CNN), but it is insufficient to contribute the performance improvement. In this paper, we focus on the attention map for visual explanation, which represents high response value as the important region in image recognition. This region significantly improves the performance of CNN by introducing an attention mechanism that focuses on a specific region in an image. In this work, we propose Attention Branch Network (ABN), which extends the top-down visual explanation model by introducing a branch structure with an attention mechanism. ABN can be applicable to several image recognition tasks by introducing a branch for attention mechanism and is trainable for the visual explanation and image recognition in end-to-end manner. We evaluate ABN on several image recognition tasks such as image classification, fine-grained recognition, and multiple facial attributes recognition. Experimental results show that ABN can outperform the accuracy of baseline models on these image recognition tasks while generating an attention map for visual explanation. Our code is available at https://github.com/machine-perception-robotics-group/attention_branch_network.
CVOct 29, 2021
ST-ABN: Visual Explanation Taking into Account Spatio-temporal Information for Video RecognitionMasahiro Mitsuhara, Tsubasa Hirakawa, Takayoshi Yamashita et al.
It is difficult for people to interpret the decision-making in the inference process of deep neural networks. Visual explanation is one method for interpreting the decision-making of deep learning. It analyzes the decision-making of 2D CNNs by visualizing an attention map that highlights discriminative regions. Visual explanation for interpreting the decision-making process in video recognition is more difficult because it is necessary to consider not only spatial but also temporal information, which is different from the case of still images. In this paper, we propose a visual explanation method called spatio-temporal attention branch network (ST-ABN) for video recognition. It enables visual explanation for both spatial and temporal information. ST-ABN acquires the importance of spatial and temporal information during network inference and applies it to recognition processing to improve recognition performance and visual explainability. Experimental results with Something-Something datasets V1 \& V2 demonstrated that ST-ABN enables visual explanation that takes into account spatial and temporal information simultaneously and improves recognition performance.
CVMar 27, 2021
Deep Ensemble Collaborative Learning by using Knowledge-transfer Graph for Fine-grained Object ClassificationNaoki Okamoto, Soma Minami, Tsubasa Hirakawa et al.
Mutual learning, in which multiple networks learn by sharing their knowledge, improves the performance of each network. However, the performance of ensembles of networks that have undergone mutual learning does not improve significantly from that of normal ensembles without mutual learning, even though the performance of each network has improved significantly. This may be due to the relationship between the knowledge in mutual learning and the individuality of the networks in the ensemble. In this study, we propose an ensemble method using knowledge transfer to improve the accuracy of ensembles by introducing a loss design that promotes diversity among networks in mutual learning. We use an attention map as knowledge, which represents the probability distribution and information in the middle layer of a network. There are many ways to combine networks and loss designs for knowledge transfer methods. Therefore, we use the automatic optimization of knowledge-transfer graphs to consider a variety of knowledge-transfer methods by graphically representing conventional mutual-learning and distillation methods and optimizing each element through hyperparameter search. The proposed method consists of a mechanism for constructing an ensemble in a knowledge-transfer graph, attention loss, and a loss design that promotes diversity among networks. We explore optimal ensemble learning by optimizing a knowledge-transfer graph to maximize ensemble accuracy. From exploration of graphs and evaluation experiments using the datasets of Stanford Dogs, Stanford Cars, and CUB-200-2011, we confirm that the proposed method is more accurate than a conventional ensemble method.
LGMar 6, 2021
Visual Explanation using Attention Mechanism in Actor-Critic-based Deep Reinforcement LearningHidenori Itaya, Tsubasa Hirakawa, Takayoshi Yamashita et al.
Deep reinforcement learning (DRL) has great potential for acquiring the optimal action in complex environments such as games and robot control. However, it is difficult to analyze the decision-making of the agent, i.e., the reasons it selects the action acquired by learning. In this work, we propose Mask-Attention A3C (Mask A3C), which introduces an attention mechanism into Asynchronous Advantage Actor-Critic (A3C), which is an actor-critic-based DRL method, and can analyze the decision-making of an agent in DRL. A3C consists of a feature extractor that extracts features from an image, a policy branch that outputs the policy, and a value branch that outputs the state value. In this method, we focus on the policy and value branches and introduce an attention mechanism into them. The attention mechanism applies a mask processing to the feature maps of each branch using mask-attention that expresses the judgment reason for the policy and state value with a heat map. We visualized mask-attention maps for games on the Atari 2600 and found we could easily analyze the reasons behind an agent's decision-making in various game tasks. Furthermore, experimental results showed that the agent could achieve a higher performance by introducing the attention mechanism.
ROFeb 12, 2021
Predicting and Attending to Damaging Collisions for Placing Everyday Objects in Photo-Realistic SimulationsAly Magassouba, Komei Sugiura, Angelica Nakayama et al.
Placing objects is a fundamental task for domestic service robots (DSRs). Thus, inferring the collision-risk before a placing motion is crucial for achieving the requested task. This problem is particularly challenging because it is necessary to predict what happens if an object is placed in a cluttered designated area. We show that a rule-based approach that uses plane detection, to detect free areas, performs poorly. To address this, we develop PonNet, which has multimodal attention branches and a self-attention mechanism to predict damaging collisions, based on RGBD images. Our method can visualize the risk of damaging collisions, which is convenient because it enables the user to understand the risk. For this purpose, we build and publish an original dataset that contains 12,000 photo-realistic images of specific placing areas, with daily life objects, in home environments. The experimental results show that our approach improves accuracy compared with the baseline methods.
CVJul 9, 2020
Alleviating the Burden of Labeling: Sentence Generation by Attention Branch Encoder-Decoder NetworkTadashi Ogura, Aly Magassouba, Komei Sugiura et al.
Domestic service robots (DSRs) are a promising solution to the shortage of home care workers. However, one of the main limitations of DSRs is their inability to interact naturally through language. Recently, data-driven approaches have been shown to be effective for tackling this limitation; however, they often require large-scale datasets, which is costly. Based on this background, we aim to perform automatic sentence generation of fetching instructions: for example, "Bring me a green tea bottle on the table." This is particularly challenging because appropriate expressions depend on the target object, as well as its surroundings. In this paper, we propose the attention branch encoder--decoder network (ABEN), to generate sentences from visual inputs. Unlike other approaches, the ABEN has multimodal attention branches that use subword-level attention and generate sentences based on subword embeddings. In experiments, we compared the ABEN with a baseline method using four standard metrics in image captioning. Results show that the ABEN outperformed the baseline in terms of these metrics.
CVSep 10, 2019
Knowledge Transfer Graph for Deep Collaborative LearningSoma Minami, Tsubasa Hirakawa, Takayoshi Yamashita et al.
Knowledge transfer among multiple networks using their outputs or intermediate activations have evolved through extensive manual design from a simple teacher-student approach (knowledge distillation) to a bidirectional cohort one (deep mutual learning). The key factors of such knowledge transfer involve the network size, the number of networks, the transfer direction, and the design of the loss function. However, because these factors are enormous when combined and become intricately entangled, the methods of conventional knowledge transfer have explored only limited combinations. In this paper, we propose a new graph-based approach for more flexible and diverse combinations of knowledge transfer. To achieve the knowledge transfer, we propose a novel graph representation called knowledge transfer graph that provides a unified view of the knowledge transfer and has the potential to represent diverse knowledge transfer patterns. We also propose four gate functions that are introduced into loss functions. The four gates, which control the gradient, can deliver diverse combinations of knowledge transfer. Searching the graph structure enables us to discover more effective knowledge transfer methods than a manually designed one. Experimental results on the CIFAR-10, -100, and Tiny-ImageNet datasets show that the proposed method achieved significant performance improvements and was able to find remarkable graph structures.
CVMay 9, 2019
Embedding Human Knowledge into Deep Neural Network via Attention MapMasahiro Mitsuhara, Hiroshi Fukui, Yusuke Sakashita et al.
In this work, we aim to realize a method for embedding human knowledge into deep neural networks. While the conventional method to embed human knowledge has been applied for non-deep machine learning, it is challenging to apply it for deep learning models due to the enormous number of model parameters. To tackle this problem, we focus on the attention mechanism of an attention branch network (ABN). In this paper, we propose a fine-tuning method that utilizes a single-channel attention map which is manually edited by a human expert. Our fine-tuning method can train a network so that the output attention map corresponds to the edited ones. As a result, the fine-tuned network can output an attention map that takes into account human knowledge. Experimental results with ImageNet, CUB-200-2010, and IDRiD demonstrate that it is possible to obtain a clear attention map for a visual explanation and improve the classification performance. Our findings can be a novel framework for optimizing networks through human intuitive editing via a visual interface and suggest new possibilities for human-machine cooperation in addition to the improvement of visual explanations.
CVNov 1, 2018
Survey on Vision-based Path PredictionTsubasa Hirakawa, Takayoshi Yamashita, Toru Tamaki et al.
Path prediction is a fundamental task for estimating how pedestrians or vehicles are going to move in a scene. Because path prediction as a task of computer vision uses video as input, various information used for prediction, such as the environment surrounding the target and the internal state of the target, need to be estimated from the video in addition to predicting paths. Many prediction approaches that include understanding the environment and the internal state have been proposed. In this survey, we systematically summarize methods of path prediction that take video as input and and extract features from the video. Moreover, we introduce datasets used to evaluate path prediction methods quantitatively.
CVOct 30, 2017
Denoising random forestsMasaya Hibino, Akisato Kimura, Takayoshi Yamashita et al.
This paper proposes a novel type of random forests called a denoising random forests that are robust against noises contained in test samples. Such noise-corrupted samples cause serious damage to the estimation performances of random forests, since unexpected child nodes are often selected and the leaf nodes that the input sample reaches are sometimes far from those for a clean sample. Our main idea for tackling this problem originates from a binary indicator vector that encodes a traversal path of a sample in the forest. Our proposed method effectively employs this vector by introducing denoising autoencoders into random forests. A denoising autoencoder can be trained with indicator vectors produced from clean and noisy input samples, and non-leaf nodes where incorrect decisions are made can be identified by comparing the input and output of the trained denoising autoencoder. Multiple traversal paths with respect to the nodes with incorrect decisions caused by the noises can then be considered for the estimation.
CVSep 14, 2017
Binary-decomposed DCNN for accelerating computation and compressing model without retrainingRyuji Kamiya, Takayoshi Yamashita, Mitsuru Ambai et al.
Recent trends show recognition accuracy increasing even more profoundly. Inference process of Deep Convolutional Neural Networks (DCNN) has a large number of parameters, requires a large amount of computation, and can be very slow. The large number of parameters also require large amounts of memory. This is resulting in increasingly long computation times and large model sizes. To implement mobile and other low performance devices incorporating DCNN, model sizes must be compressed and computation must be accelerated. To that end, this paper proposes Binary-decomposed DCNN, which resolves these issues without the need for retraining. Our method replaces real-valued inner-product computations with binary inner-product computations in existing network models to accelerate computation of inference and decrease model size without the need for retraining. Binary computations can be done at high speed using logical operators such as XOR and AND, together with bit counting. In tests using AlexNet with the ImageNet classification task, speed increased by a factor of 1.79, models were compressed by approximately 80%, and increase in error rate was limited to 1.20%. With VGG-16, speed increased by a factor of 2.07, model sizes decreased by 81%, and error increased by only 2.16%.