Kazuhiro Hotta

CV
h-index1
32papers
277citations
Novelty50%
AI Score54

32 Papers

CVOct 14, 2022
Reconstructed Student-Teacher and Discriminative Networks for Anomaly Detection

Shinji Yamada, Satoshi Kamiya, Kazuhiro Hotta

Anomaly detection is an important problem in computer vision; however, the scarcity of anomalous samples makes this task difficult. Thus, recent anomaly detection methods have used only normal images with no abnormal areas for training. In this work, a powerful anomaly detection method is proposed based on student-teacher feature pyramid matching (STPM), which consists of a student and teacher network. Generative models are another approach to anomaly detection. They reconstruct normal images from an input and compute the difference between the predicted normal and the input. Unfortunately, STPM does not have the ability to generate normal images. To improve the accuracy of STPM, this work uses a student network, as in generative models, to reconstruct normal features. This improves the accuracy; however, the anomaly maps for normal images are not clean because STPM does not use anomaly images for training, which decreases the accuracy of the image-level anomaly detection. To further improve accuracy, a discriminative network trained with pseudo-anomalies from anomaly maps is used in our method, which consists of two pairs of student-teacher networks and a discriminative network. The method displayed high accuracy on the MVTec anomaly detection dataset.

CVJun 15, 2023
Enlarged Large Margin Loss for Imbalanced Classification

Sota Kato, Kazuhiro Hotta

We propose a novel loss function for imbalanced classification. LDAM loss, which minimizes a margin-based generalization bound, is widely utilized for class-imbalanced image classification. Although, by using LDAM loss, it is possible to obtain large margins for the minority classes and small margins for the majority classes, the relevance to a large margin, which is included in the original softmax cross entropy loss, is not be clarified yet. In this study, we reconvert the formula of LDAM loss using the concept of the large margin softmax cross entropy loss based on the softplus function and confirm that LDAM loss includes a wider large margin than softmax cross entropy loss. Furthermore, we propose a novel Enlarged Large Margin (ELM) loss, which can further widen the large margin of LDAM loss. ELM loss utilizes the large margin for the maximum logit of the incorrect class in addition to the basic margin used in LDAM loss. Through experiments conducted on imbalanced CIFAR datasets and large-scale datasets with long-tailed distribution, we confirmed that classification accuracy was much improved compared with LDAM loss and conventional losses for imbalanced classification.

CVAug 23, 2023
Lite-HRNet Plus: Fast and Accurate Facial Landmark Detection

Sota Kato, Kazuhiro Hotta, Yuhki Hatakeyama et al.

Facial landmark detection is an essential technology for driver status tracking and has been in demand for real-time estimations. As a landmark coordinate prediction, heatmap-based methods are known to achieve a high accuracy, and Lite-HRNet can achieve a fast estimation. However, with Lite-HRNet, the problem of a heavy computational cost of the fusion block, which connects feature maps with different resolutions, has yet to be solved. In addition, the strong output module used in HRNetV2 is not applied to Lite-HRNet. Given these problems, we propose a novel architecture called Lite-HRNet Plus. Lite-HRNet Plus achieves two improvements: a novel fusion block based on a channel attention and a novel output module with less computational intensity using multi-resolution feature maps. Through experiments conducted on two facial landmark datasets, we confirmed that Lite-HRNet Plus further improved the accuracy in comparison with conventional methods, and achieved a state-of-the-art accuracy with a computational complexity with the range of 10M FLOPs.

CVApr 17, 2023
One-shot and Partially-Supervised Cell Image Segmentation Using Small Visual Prompt

Sota Kato, Kazuhiro Hotta

Semantic segmentation of microscopic cell images using deep learning is an important technique, however, it requires a large number of images and ground truth labels for training. To address the above problem, we consider an efficient learning framework with as little data as possible, and we propose two types of learning strategies: One-shot segmentation which can learn with only one training sample, and Partially-supervised segmentation which assigns annotations to only a part of images. Furthermore, we introduce novel segmentation methods using the small prompt images inspired by prompt learning in recent studies. Our proposed methods use a pre-trained model based on only cell images and teach the information of the prompt pairs to the target image to be segmented by the attention mechanism, which allows for efficient learning while reducing the burden of annotation costs. Through experiments conducted on three types of microscopic cell image datasets, we confirmed that the proposed method improved the Dice score coefficient (DSC) in comparison with the conventional methods.

CVSep 17, 2024
Genetic Information Analysis of Age-Related Macular Degeneration Fellow Eye Using Multi-Modal Selective ViT

Yoichi Furukawa, Satoshi Kamiya, Yoichi Sakurada et al.

In recent years, there has been significant development in the analysis of medical data using machine learning. It is believed that the onset of Age-related Macular Degeneration (AMD) is associated with genetic polymorphisms. However, genetic analysis is costly, and artificial intelligence may offer assistance. This paper presents a method that predict the presence of multiple susceptibility genes for AMD using fundus and Optical Coherence Tomography (OCT) images, as well as medical records. Experimental results demonstrate that integrating information from multiple modalities can effectively predict the presence of susceptibility genes with over 80$\%$ accuracy.

CVApr 21, 2023
DeformableFormer: Classification of Endoscopic Ultrasound Guided Fine Needle Biopsy in Pancreatic Diseases

Taiji Kurami, Takuya Ishikawa, Kazuhiro Hotta

Endoscopic Ultrasound-Fine Needle Aspiration (EUS-FNA) is used to examine pancreatic cancer. EUS-FNA is an examination using EUS to insert a thin needle into the tumor and collect pancreatic tissue fragments. Then collected pancreatic tissue fragments are then stained to classify whether they are pancreatic cancer. However, staining and visual inspection are time consuming. In addition, if the pancreatic tissue fragment cannot be examined after staining, the collection must be done again on the other day. Therefore, our purpose is to classify from an unstained image whether it is available for examination or not, and to exceed the accuracy of visual classification by specialist physicians. Image classification before staining can reduce the time required for staining and the burden of patients. However, the images of pancreatic tissue fragments used in this study cannot be successfully classified by processing the entire image because the pancreatic tissue fragments are only a part of the image. Therefore, we propose a DeformableFormer that uses Deformable Convolution in MetaFormer framework. The architecture consists of a generalized model of the Vision Transformer, and we use Deformable Convolution in the TokenMixer part. In contrast to existing approaches, our proposed DeformableFormer is possible to perform feature extraction more locally and dynamically by Deformable Convolution. Therefore, it is possible to perform suitable feature extraction for classifying target. To evaluate our method, we classify two categories of pancreatic tissue fragments; available and unavailable for examination. We demonstrated that our method outperformed the accuracy by specialist physicians and conventional methods.

IVMar 20, 2022
Adversarial Mutual Leakage Network for Cell Image Segmentation

Hiroki Tsuda, Kazuhiro Hotta

We propose three segmentation methods using GAN and information leakage between generator and discriminator. First, we propose an Adversarial Training Attention Module (ATA-Module) that uses an attention mechanism from the discriminator to the generator to enhance and leak important information in the discriminator. ATA-Module transmits important information to the generator from the discriminator. Second, we propose a Top-Down Pixel-wise Difficulty Attention Module (Top-Down PDA-Module) that leaks an attention map based on pixel-wise difficulty in the generator to the discriminator. The generator trains to focus on pixel-wise difficulty, and the discriminator uses the difficulty information leaked from the generator for classification. Finally, we propose an Adversarial Mutual Leakage Network (AML-Net) that mutually leaks the information each other between the generator and the discriminator. By using the information of the other network, it is able to train more efficiently than ordinary segmentation models. Three proposed methods have been evaluated on two datasets for cell image segmentation. The experimental results show that the segmentation accuracy of AML-Net was much improved in comparison with conventional methods.

CVAug 22, 2024
Generalized SAM: Efficient Fine-Tuning of SAM for Variable Input Image Sizes

Sota Kato, Hinako Mitsuoka, Kazuhiro Hotta

There has been a lot of recent research on improving the efficiency of fine-tuning foundation models. In this paper, we propose a novel efficient fine-tuning method that allows the input image size of Segment Anything Model (SAM) to be variable. SAM is a powerful foundational model for image segmentation trained on huge datasets, but it requires fine-tuning to recognize arbitrary classes. The input image size of SAM is fixed at 1024 x 1024, resulting in substantial computational demands during training. Furthermore, the fixed input image size may result in the loss of image information, e.g. due to fixed aspect ratios. To address this problem, we propose Generalized SAM (GSAM). Different from the previous methods, GSAM is the first to apply random cropping during training with SAM, thereby significantly reducing the computational cost of training. Experiments on datasets of various types and various pixel counts have shown that GSAM can train more efficiently than SAM and other fine-tuning methods for SAM, achieving comparable or higher accuracy.

31.4CVApr 8Code
Novel Anomaly Detection Scenarios and Evaluation Metrics to Address the Ambiguity in the Definition of Normal Samples

Reiji Saito, Satoshi Kamiya, Kazuhiro Hotta

In conventional anomaly detection, training data consist of only normal samples. However, in real-world scenarios, the definition of a normal sample is often ambiguous. For example, there are cases where a sample has small scratches or stains but is still acceptable for practical usage. On the other hand, higher precision is required when manufacturing equipment is upgraded. In such cases, normal samples may include small scratches, tiny dust particles, or a foreign object that we would prefer to classify as an anomaly. Such cases frequently occur in industrial settings, yet they have not been discussed until now. Thus, we propose novel scenarios and an evaluation metric to accommodate specification changes in real-world applications. Furthermore, to address the ambiguity of normal samples, we propose the RePaste, which enhances learning by re-pasting regions with high anomaly scores from the previous step into the input for the next step. On our scenarios using the MVTec AD benchmark, RePaste achieved the state-of-the-art performance with respect to the proposed evaluation metric, while maintaining high AUROC and PRO scores. Code: https://github.com/ReijiSoftmaxSaito/Scenario

CVAug 23, 2024
Accuracy Improvement of Cell Image Segmentation Using Feedback Former

Hinako Mitsuoka, Kazuhiro Hotta

Semantic segmentation of microscopy cell images by deep learning is a significant technique. We considered that the Transformers, which have recently outperformed CNNs in image recognition, could also be improved and developed for cell image segmentation. Transformers tend to focus more on contextual information than on detailed information. This tendency leads to a lack of detailed information for segmentation. Therefore, to supplement or reinforce the missing detailed information, we hypothesized that feedback processing in the human visual cortex should be effective. Our proposed Feedback Former is a novel architecture for semantic segmentation, in which Transformers is used as an encoder and has a feedback processing mechanism. Feature maps with detailed information are fed back to the lower layers from near the output of the model to compensate for the lack of detailed information which is the weakness of Transformers and improve the segmentation accuracy. By experiments on three cell image datasets, we confirmed that our method surpasses methods without feedback, demonstrating its superior accuracy in cell image segmentation. Our method achieved higher segmentation accuracy while consuming less computational cost than conventional feedback approaches. Moreover, our method offered superior precision without simply increasing the model size of Transformer encoder, demonstrating higher accuracy with lower computational cost.

10.1CVMay 7
Prompt-Free and Efficient SAM2 Adaptation for Biomedical Semantic Segmentation via Dual Adapters

Hinako Mitsuoka, Kazuhiro Hotta

Segment Anything Model 2 (SAM2) demonstrated impressive zero-shot capabilities on natural images but faces challenges in biomedical segmentation due to significant domain shifts and prompt dependency. To address these limitations, we propose a prompt-free, parameter-efficient fine-tuning framework designed for multi-class segmentation on variable-sized inputs. We introduce a convolutional Positional Encoding Generator to adapt effectively to arbitrary aspect ratios and present a dual-adapter strategy: High-Performance Adapter utilizing deformable convolutions for precise boundary modeling and Lightweight Adapter employing structural re-parameterization to minimize inference latency. Experiments on ISBI 2012, Kvasir-SEG, Synapse, and ACDC datasets demonstrate that our approach significantly outperforms strong adaptation baselines. Specifically, our method improved segmentation accuracy by up to 19.66\% over the vanilla SAM2, while reducing computational costs by approximately 87\% compared to heavyweight medical SAM adaptations, establishing a superior trade-off between accuracy and efficiency.

CVSep 17, 2024
Reducing Catastrophic Forgetting in Online Class Incremental Learning Using Self-Distillation

Kotaro Nagata, Hiromu Ono, Kazuhiro Hotta

In continual learning, there is a serious problem of catastrophic forgetting, in which previous knowledge is forgotten when a model learns new tasks. Various methods have been proposed to solve this problem. Replay methods which replay data from previous tasks in later training, have shown good accuracy. However, replay methods have a generalizability problem from a limited memory buffer. In this paper, we tried to solve this problem by acquiring transferable knowledge through self-distillation using highly generalizable output in shallow layer as a teacher. Furthermore, when we deal with a large number of classes or challenging data, there is a risk of learning not converging and not experiencing overfitting. Therefore, we attempted to achieve more efficient and thorough learning by prioritizing the storage of easily misclassified samples through a new method of memory update. We confirmed that our proposed method outperformed conventional methods by experiments on CIFAR10, CIFAR100, and MiniimageNet datasets.

5.8CVMay 5
Dynamic Distillation and Gradient Consistency for Robust Long-Tailed Incremental Learning

Taigo Sakai, Kazuhiro Hotta

The task of Long-tailed Class Incremental Learning (LT-CIL) addresses the sequential learning of new classes from datasets with imbalanced class distributions. This scenario intensifies the fundamental problem of catastrophic forgetting, inherent to continual learning, with the dual challenges of under-learning minority classes and overfitting majority classes. To tackle these combined issues, this paper proposes two main techniques. First, we introduce gradient consistency regularization, which leverages the moving average of gradients to suppress abrupt fluctuations and stabilize the training process. Second, we dynamically adjust the weight of the distillation loss by measuring the degree of class imbalance with normalized entropy. This adaptive weighting establishes an optimal balance between retaining old knowledge and acquiring new information. Experiments on the CIFAR-100-LT, ImageNetSubset-LT, and Food101-LT benchmarks show that our method achieves consistent accuracy improvements of up to 5.0\%. Furthermore, we demonstrate dramatic gains in the challenging 'In-ordered' setting, where tasks progress from majority to minority classes, highlighting our method's robustness in mitigating forgetting under unfavorable learning dynamics. This enhanced performance is achieved without a significant increase in computational overhead, demonstrating the practicality of our framework.

CVSep 5, 2024
Gr-IoU: Ground-Intersection over Union for Robust Multi-Object Tracking with 3D Geometric Constraints

Keisuke Toida, Naoki Kato, Osamu Segawa et al.

We propose a Ground IoU (Gr-IoU) to address the data association problem in multi-object tracking. When tracking objects detected by a camera, it often occurs that the same object is assigned different IDs in consecutive frames, especially when objects are close to each other or overlapping. To address this issue, we introduce Gr-IoU, which takes into account the 3D structure of the scene. Gr-IoU transforms traditional bounding boxes from the image space to the ground plane using the vanishing point geometry. The IoU calculated with these transformed bounding boxes is more sensitive to the front-to-back relationships of objects, thereby improving data association accuracy and reducing ID switches. We evaluated our Gr-IoU method on the MOT17 and MOT20 datasets, which contain diverse tracking scenarios including crowded scenes and sequences with frequent occlusions. Experimental results demonstrated that Gr-IoU outperforms conventional real-time methods without appearance features.

9.5CVApr 2
Combining Boundary Supervision and Segment-Level Regularization for Fine-Grained Action Segmentation

Hinako Mitsuoka, Kazuhiro Hotta

Recent progress in Temporal Action Segmentation (TAS) has increasingly relied on complex architectures, which can hinder practical deployment. We present a lightweight dual-loss training framework that improves fine-grained segmentation quality with only one additional output channel and two auxiliary loss terms, requiring minimal architectural modification. Our approach combines a boundary-regression loss that promotes accurate temporal localization via a single-channel boundary prediction and a CDF-based segment-level regularization loss that encourages coherent within-segment structure by matching cumulative distributions over predicted and ground-truth segments. The framework is architecture-agnostic and can be integrated into existing TAS models (e.g., MS-TCN, C2F-TCN, FACT) as a training-time loss function. Across three benchmark datasets, the proposed method improves segment-level consistency and boundary quality, yielding higher F1 and Edit scores across three different models. Frame-wise accuracy remains largely unchanged, highlighting that precise segmentation can be achieved through simple loss design rather than heavier architectures or inference-time refinements.

6.7CVApr 8
Multiple Domain Generalization Using Category Information Independent of Domain Differences

Reiji Saito, Kazuhiro Hotta

Domain generalization is a technique aimed at enabling models to maintain high accuracy when applied to new environments or datasets (unseen domains) that differ from the datasets used in training. Generally, the accuracy of models trained on a specific dataset (source domain) often decreases significantly when evaluated on different datasets (target domain). This issue arises due to differences in domains caused by varying environmental conditions such as imaging equipment and staining methods. Therefore, we undertook two initiatives to perform segmentation that does not depend on domain differences. We propose a method that separates category information independent of domain differences from the information specific to the source domain. By using information independent of domain differences, our method enables learning the segmentation targets (e.g., blood vessels and cell nuclei). Although we extract independent information of domain differences, this cannot completely bridge the domain gap between training and test data. Therefore, we absorb the domain gap using the quantum vectors in Stochastically Quantized Variational AutoEncoder (SQ-VAE). In experiments, we evaluated our method on datasets for vascular segmentation and cell nucleus segmentation. Our methods improved the accuracy compared to conventional methods.

11.1CVApr 8
Accuracy Improvement of Semi-Supervised Segmentation Using Supervised ClassMix and Sup-Unsup Feature Discriminator

Takahiro Mano, Reiji Saito, Kazuhiro Hotta

In semantic segmentation, the creation of pixel-level labels for training data incurs significant costs. To address this problem, semi-supervised learning, which utilizes a small number of labeled images alongside unlabeled images to enhance the performance, has gained attention. A conventional semi-supervised learning method, ClassMix, pastes class labels predicted from unlabeled images onto other images. However, since ClassMix performs operations using pseudo-labels obtained from unlabeled images, there is a risk of handling inaccurate labels. Additionally, there is a gap in data quality between labeled and unlabeled images, which can impact the feature maps. This study addresses these two issues. First, we propose a method where class labels from labeled images, along with the corresponding image regions, are pasted onto unlabeled images and their pseudo-labeled images. Second, we introduce a method that trains the model to make predictions on unlabeled images more similar to those on labeled images. Experiments on the Chase and COVID-19 datasets demonstrated an average improvement of 2.07% in mIoU compared to conventional semi-supervised learning methods.

CVFeb 2
SVD-ViT: Does SVD Make Vision Transformers Attend More to the Foreground?

Haruhiko Murata, Kazuhiro Hotta

Vision Transformers (ViT) have been established as large-scale foundation models. However, because self-attention operates globally, they lack an explicit mechanism to distinguish foreground from background. As a result, ViT may learn unnecessary background features and artifacts, leading to degraded classification performance. To address this issue, we propose SVD-ViT, which leverages singular value decomposition (SVD) to prioritize the learning of foreground features. SVD-ViT consists of three components-\textbf{SPC module}, \textbf{SSVA}, and \textbf{ID-RSVD}-and suppresses task-irrelevant factors such as background noise and artifacts by extracting and aggregating singular vectors that capture object foreground information. Experimental results demonstrate that our method improves classification accuracy and effectively learns informative foreground representations while reducing the impact of background noise.

CVOct 14, 2025
Multiplicative Loss for Enhancing Semantic Segmentation in Medical and Cellular Images

Yuto Yokoi, Kazuhiro Hotta

We propose two novel loss functions, Multiplicative Loss and Confidence-Adaptive Multiplicative Loss, for semantic segmentation in medical and cellular images. Although Cross Entropy and Dice Loss are widely used, their additive combination is sensitive to hyperparameters and often performs suboptimally, especially with limited data. Medical images suffer from data scarcity due to privacy, ethics, and costly annotations, requiring robust and efficient training objectives. Our Multiplicative Loss combines Cross Entropy and Dice losses multiplicatively, dynamically modulating gradients based on prediction confidence. This reduces penalties for confident correct predictions and amplifies gradients for incorrect overconfident ones, stabilizing optimization. Building on this, Confidence-Adaptive Multiplicative Loss applies a confidence-driven exponential scaling inspired by Focal Loss, integrating predicted probabilities and Dice coefficients to emphasize difficult samples. This enhances learning under extreme data scarcity by strengthening gradients when confidence is low. Experiments on cellular and medical segmentation benchmarks show our framework consistently outperforms tuned additive and existing loss functions, offering a simple, effective, and hyperparameter-free mechanism for robust segmentation under challenging data limitations.

CVSep 10, 2025
Sparse BEV Fusion with Self-View Consistency for Multi-View Detection and Tracking

Keisuke Toida, Taigo Sakai, Naoki Kato et al.

Multi-View Multi-Object Tracking (MVMOT) is essential for applications such as surveillance, autonomous driving, and sports analytics. However, maintaining consistent object identities across multiple cameras remains challenging due to viewpoint changes, lighting variations, and occlusions, which often lead to tracking errors.Recent methods project features from multiple cameras into a unified Bird's-Eye-View (BEV) space to improve robustness against occlusion. However, this projection introduces feature distortion and non-uniform density caused by variations in object scale with distance. These issues degrade the quality of the fused representation and reduce detection and tracking accuracy.To address these problems, we propose SCFusion, a framework that combines three techniques to improve multi-view feature integration. First, it applies a sparse transformation to avoid unnatural interpolation during projection. Next, it performs density-aware weighting to adaptively fuse features based on spatial confidence and camera distance. Finally, it introduces a multi-view consistency loss that encourages each camera to learn discriminative features independently before fusion.Experiments show that SCFusion achieves state-of-the-art performance, reaching an IDF1 score of 95.9% on WildTrack and a MODP of 89.2% on MultiviewX, outperforming the baseline method TrackTacular. These results demonstrate that SCFusion effectively mitigates the limitations of conventional BEV projection and provides a robust and accurate solution for multi-view object detection and tracking.

LGJul 14, 2025
Long-Tailed Data Classification by Increasing and Decreasing Neurons During Training

Taigo Sakai, Kazuhiro Hotta

In conventional deep learning, the number of neurons typically remains fixed during training. However, insights from biology suggest that the human hippocampus undergoes continuous neuron generation and pruning of neurons over the course of learning, implying that a flexible allocation of capacity can contribute to enhance performance. Real-world datasets often exhibit class imbalance situations where certain classes have far fewer samples than others, leading to significantly reduce recognition accuracy for minority classes when relying on fixed size networks.To address the challenge, we propose a method that periodically adds and removes neurons during training, thereby boosting representational power for minority classes. By retaining critical features learned from majority classes while selectively increasing neurons for underrepresented classes, our approach dynamically adjusts capacity during training. Importantly, while the number of neurons changes throughout training, the final network size and structure remain unchanged, ensuring efficiency and compatibility with deployment.Furthermore, by experiments on three different datasets and five representative models, we demonstrate that the proposed method outperforms fixed size networks and shows even greater accuracy when combined with other imbalance-handling techniques. Our results underscore the effectiveness of dynamic, biologically inspired network designs in improving performance on class-imbalanced data.

CVApr 9, 2025
Domain Generalization through Attenuation of Domain-Specific Information

Reiji Saito, Kazuhiro Hotta

In this paper, we propose a new evaluation metric called Domain Independence (DI) and Attenuation of Domain-Specific Information (ADSI) which is specifically designed for domain-generalized semantic segmentation in automotive images. DI measures the presence of domain-specific information: a lower DI value indicates strong domain dependence, while a higher DI value suggests greater domain independence. This makes it roughly where domain-specific information exists and up to which frequency range it is present. As a result, it becomes possible to effectively suppress only the regions in the image that contain domain-specific information, enabling feature extraction independent of the domain. ADSI uses a Butterworth filter to remove the low-frequency components of images that contain inherent domain-specific information such as sensor characteristics and lighting conditions. However, since low-frequency components also contain important information such as color, we should not remove them completely. Thus, a scalar value (ranging from 0 to 1) is multiplied by the low-frequency components to retain essential information. This helps the model learn more domain-independent features. In experiments, GTA5 (synthetic dataset) was used as training images, and a real-world dataset was used for evaluation, and the proposed method outperformed conventional approaches. Similarly, in experiments that the Cityscapes (real-world dataset) was used for training and various environment datasets such as rain and nighttime were used for evaluation, the proposed method demonstrated its robustness under nighttime conditions.

CVAug 23, 2024
Growing Deep Neural Network Considering with Similarity between Neurons

Taigo Sakai, Kazuhiro Hotta

Deep learning has excelled in image recognition tasks through neural networks inspired by the human brain. However, the necessity for large models to improve prediction accuracy introduces significant computational demands and extended training times.Conventional methods such as fine-tuning, knowledge distillation, and pruning have the limitations like potential accuracy drops. Drawing inspiration from human neurogenesis, where neuron formation continues into adulthood, we explore a novel approach of progressively increasing neuron numbers in compact models during training phases, thereby managing computational costs effectively. We propose a method that reduces feature extraction biases and neuronal redundancy by introducing constraints based on neuron similarity distributions. This approach not only fosters efficient learning in new neurons but also enhances feature extraction relevancy for given tasks. Results on CIFAR-10 and CIFAR-100 datasets demonstrated accuracy improvement, and our method pays more attention to whole object to be classified in comparison with conventional method through Grad-CAM visualizations. These results suggest that our method's potential to decision-making processes.

IVDec 3, 2021
Localized Feature Aggregation Module for Semantic Segmentation

Ryouichi Furukawa, Kazuhiro Hotta

We propose a new information aggregation method which called Localized Feature Aggregation Module based on the similarity between the feature maps of an encoder and a decoder. The proposed method recovers positional information by emphasizing the similarity between decoder's feature maps with superior semantic information and encoder's feature maps with superior positional information. The proposed method can learn positional information more efficiently than conventional concatenation in the U-net and attention U-net. Additionally, the proposed method also uses localized attention range to reduce the computational cost. Two innovations contributed to improve the segmentation accuracy with lower computational cost. By experiments on the Drosophila cell image dataset and COVID-19 image dataset, we confirmed that our method outperformed conventional methods.

CVNov 30, 2021
Reconstruction Student with Attention for Student-Teacher Pyramid Matching

Shinji Yamada, Kazuhiro Hotta

Anomaly detection and localization are important problems in computer vision. Recently, Convolutional Neural Network (CNN) has been used for visual inspection. In particular, the scarcity of anomalous samples increases the difficulty of this task, and unsupervised leaning based methods are attracting attention. We focus on Student-Teacher Feature Pyramid Matching (STPM) which can be trained from only normal images with small number of epochs. Here we proposed a powerful method which compensates for the shortcomings of STPM. Proposed method consists of two students and two teachers that a pair of student-teacher network is the same as STPM. The other student-teacher network has a role to reconstruct the features of normal products. By reconstructing the features of normal products from an abnormal image, it is possible to detect abnormalities with higher accuracy by taking the difference between them. The new student-teacher network uses attention modules and different teacher network from the original STPM. Attention mechanism acts to successfully reconstruct the normal regions in an input image. Different teacher network prevents looking at the same regions as the original STPM. Six anomaly maps obtained from the two student-teacher networks are used to calculate the final anomaly map. Student-teacher network for reconstructing features improved AUC scores for pixel level and image level in comparison with the original STPM.

IVAug 30, 2021
Automatic Preprocessing and Ensemble Learning for Low Quality Cell Image Segmentation

Sota Kato, Kazuhiro Hotta

We propose an automatic preprocessing and ensemble learning for segmentation of cell images with low quality. It is difficult to capture cells with strong light. Therefore, the microscopic images of cells tend to have low image quality but these images are not good for semantic segmentation. Here we propose a method to translate an input image to the images that are easy to recognize by deep learning. The proposed method consists of two deep neural networks. The first network is the usual training for semantic segmentation, and penultimate feature maps of the first network are used as filters to translate an input image to the images that emphasize each class. This is the automatic preprocessing and translated cell images are easily classified. The input cell image with low quality is translated by the feature maps in the first network, and the translated images are fed into the second network for semantic segmentation. Since the outputs of the second network are multiple segmentation results, we conduct the weighted ensemble of those segmentation images. Two networks are trained by end-to-end manner, and we do not need to prepare images with high quality for the translation. We confirmed that our proposed method can translate cell images with low quality to the images that are easy to segment, and segmentation accuracy has improved using the weighted ensemble learning.

CVJul 6, 2021
MSE Loss with Outlying Label for Imbalanced Classification

Sota Kato, Kazuhiro Hotta

In this paper, we propose mean squared error (MSE) loss with outlying label for class imbalanced classification. Cross entropy (CE) loss, which is widely used for image recognition, is learned so that the probability value of true class is closer to one by back propagation. However, for imbalanced datasets, the learning is insufficient for the classes with a small number of samples. Therefore, we propose a novel classification method using the MSE loss that can be learned the relationships of all classes no matter which image is input. Unlike CE loss, MSE loss is possible to equalize the number of back propagation for all classes and to learn the feature space considering the relationships between classes as metric learning. Furthermore, instead of the usual one-hot teacher label, we use a novel teacher label that takes the number of class samples into account. This induces the outlying label which depends on the number of samples in each class, and the class with a small number of samples has outlying margin in a feature space. It is possible to create the feature space for separating high-difficulty classes and low-difficulty classes. By the experiments on imbalanced classification and semantic segmentation, we confirmed that the proposed method was much improved in comparison with standard CE loss and conventional methods, even though only the loss and teacher labels were changed.

IVJan 20, 2021
Cell image segmentation by Feature Random Enhancement Module

Takamasa Ando, Kazuhiro Hotta

It is important to extract good features using an encoder to realize semantic segmentation with high accuracy. Although loss function is optimized in training deep neural network, far layers from the layers for computing loss function are difficult to train. Skip connection is effective for this problem but there are still far layers from the loss function. In this paper, we propose the Feature Random Enhancement Module which enhances the features randomly in only training. By emphasizing the features at far layers from loss function, we can train those layers well and the accuracy was improved. In experiments, we evaluated the proposed module on two kinds of cell image datasets, and our module improved the segmentation accuracy without increasing computational cost in test phase.

CVJan 20, 2021
Feature Sharing Cooperative Network for Semantic Segmentation

Ryota Ikedo, Kazuhiro Hotta

In recent years, deep neural networks have achieved high ac-curacy in the field of image recognition. By inspired from human learning method, we propose a semantic segmentation method using cooperative learning which shares the information resembling a group learning. We use two same networks and paths for sending feature maps between two networks. Two networks are trained simultaneously. By sharing feature maps, one of two networks can obtain the information that cannot be obtained by a single network. In addition, in order to enhance the degree of cooperation, we propose two kinds of methods that connect only the same layer and multiple layers. We evaluated our proposed idea on two kinds of networks. One is Dual Attention Network (DANet) and the other one is DeepLabv3+. The proposed method achieved better segmentation accuracy than the conventional single network and ensemble of networks.

CVAug 14, 2020
Feedback Attention for Cell Image Segmentation

Hiroki Tsuda, Eisuke Shibuya, Kazuhiro Hotta

In this paper, we address cell image segmentation task by Feedback Attention mechanism like feedback processing. Unlike conventional neural network models of feedforward processing, we focused on the feedback processing in human brain and assumed that the network learns like a human by connecting feature maps from deep layers to shallow layers. We propose some Feedback Attentions which imitate human brain and feeds back the feature maps of output layer to close layer to the input. U-Net with Feedback Attention showed better result than the conventional methods using only feedforward processing.

CVApr 30, 2020
Feedback U-net for Cell Image Segmentation

Eisuke Shibuya, Kazuhiro Hotta

Human brain is a layered structure, and performs not only a feedforward process from a lower layer to an upper layer but also a feedback process from an upper layer to a lower layer. The layer is a collection of neurons, and neural network is a mathematical model of the function of neurons. Although neural network imitates the human brain, everyone uses only feedforward process from the lower layer to the upper layer, and feedback process from the upper layer to the lower layer is not used. Therefore, in this paper, we propose Feedback U-Net using Convolutional LSTM which is the segmentation method using Convolutional LSTM and feedback process. The output of U-net gave feedback to the input, and the second round is performed. By using Convolutional LSTM, the features in the second round are extracted based on the features acquired in the first round. On both of the Drosophila cell image and Mouse cell image datasets, our method outperformed conventional U-Net which uses only feedforward process.

CVMar 28, 2017
Mixture of Counting CNNs: Adaptive Integration of CNNs Specialized to Specific Appearance for Crowd Counting

Shohei Kumagai, Kazuhiro Hotta, Takio Kurita

This paper proposes a crowd counting method. Crowd counting is difficult because of large appearance changes of a target which caused by density and scale changes. Conventional crowd counting methods generally utilize one predictor (e,g., regression and multi-class classifier). However, such only one predictor can not count targets with large appearance changes well. In this paper, we propose to predict the number of targets using multiple CNNs specialized to a specific appearance, and those CNNs are adaptively selected according to the appearance of a test image. By integrating the selected CNNs, the proposed method has the robustness to large appearance changes. In experiments, we confirm that the proposed method can count crowd with lower counting error than a CNN and integration of CNNs with fixed weights. Moreover, we confirm that each predictor automatically specialized to a specific appearance.