Masakazu Iwamura

12papers

281citations

Novelty47%

AI Score45

Ranked #67,535 of 205,806 authors (top 33%)#23,907 in CV (top 41%)

12 Papers

CVFeb 13Code

CBEN -- A Multimodal Machine Learning Dataset for Cloud Robust Remote Sensing Image Understanding

Marco Stricker, Masakazu Iwamura, Koichi Kise

Clouds are a common phenomenon that distorts optical satellite imagery, which poses a challenge for remote sensing. However, in the literature cloudless analysis is often performed where cloudy images are excluded from machine learning datasets and methods. Such an approach cannot be applied to time sensitive applications, e.g., during natural disasters. A possible solution is to apply cloud removal as a preprocessing step to ensure that cloudfree solutions are not failing under such conditions. But cloud removal methods are still actively researched and suffer from drawbacks, such as generated visual artifacts. Therefore, it is desirable to develop cloud robust methods that are less affected by cloudy weather. Cloud robust methods can be achieved by combining optical data with radar, a modality unaffected by clouds. While many datasets for machine learning combine optical and radar data, most researchers exclude cloudy images. We identify this exclusion from machine learning training and evaluation as a limitation that reduces applicability to cloudy scenarios. To investigate this, we assembled a dataset, named CloudyBigEarthNet (CBEN), of paired optical and radar images with cloud occlusion for training and evaluation. Using average precision (AP) as the evaluation metric, we show that state-of-the-art methods trained on combined clear-sky optical and radar imagery suffer performance drops of 23-33 percentage points when evaluated on cloudy images. We then adapt these methods to cloudy optical data during training, achieving relative improvement of 17.2-28.7 percentage points on cloudy test cases compared with the original approaches. Code and dataset are publicly available at: https://github.com/mstricker13/CBEN

71.1LGMar 24

CDMT-EHR: A Continuous-Time Diffusion Framework for Generating Mixed-Type Time-Series Electronic Health Records

Shaonan Liu, Yuichiro Iwashita, Soichiro Nakako et al.

Electronic health records (EHRs) are invaluable for clinical research, yet privacy concerns severely restrict data sharing. Synthetic data generation offers a promising solution, but EHRs present unique challenges: they contain both numerical and categorical features that evolve over time. While diffusion models have demonstrated strong performance in EHR synthesis, existing approaches predominantly rely on discrete-time formulations, which suffer from finite-step approximation errors and coupled training-sampling step counts. We propose a continuous-time diffusion framework for generating mixed-type time-series EHRs with three contributions: (1) continuous-time diffusion with a bidirectional gated recurrent unit backbone for capturing temporal dependencies, (2) unified Gaussian diffusion via learnable continuous embeddings for categorical variables, enabling joint cross-feature modeling, and (3) a factorized learnable noise schedule that adapts to per-feature-per-timestep learning difficulties. Experiments on two large-scale intensive care unit datasets demonstrate that our method outperforms existing approaches in downstream task performance, distribution fidelity, and discriminability, while requiring only 50 sampling steps compared to 1,000 for baseline methods. Classifier-free guidance further enables effective conditional generation for class-imbalanced clinical scenarios.

CVMay 4, 2016Code

A Generic Method for Automatic Ground Truth Generation of Camera-captured Documents

Sheraz Ahmed, Muhammad Imran Malik, Muhammad Zeshan Afzal et al.

The contribution of this paper is fourfold. The first contribution is a novel, generic method for automatic ground truth generation of camera-captured document images (books, magazines, articles, invoices, etc.). It enables us to build large-scale (i.e., millions of images) labeled camera-captured/scanned documents datasets, without any human intervention. The method is generic, language independent and can be used for generation of labeled documents datasets (both scanned and cameracaptured) in any cursive and non-cursive language, e.g., English, Russian, Arabic, Urdu, etc. To assess the effectiveness of the presented method, two different datasets in English and Russian are generated using the presented method. Evaluation of samples from the two datasets shows that 99:98% of the images were correctly labeled. The second contribution is a large dataset (called C3Wi) of camera-captured characters and words images, comprising 1 million word images (10 million character images), captured in a real camera-based acquisition. This dataset can be used for training as well as testing of character recognition systems on camera-captured documents. The third contribution is a novel method for the recognition of cameracaptured document images. The proposed method is based on Long Short-Term Memory and outperforms the state-of-the-art methods for camera based OCRs. As a fourth contribution, various benchmark tests are performed to uncover the behavior of commercial (ABBYY), open source (Tesseract), and the presented camera-based OCR using the presented C3Wi dataset. Evaluation results reveal that the existing OCRs, which already get very high accuracies on scanned documents, have limited performance on camera-captured document images; where ABBYY has an accuracy of 75%, Tesseract an accuracy of 50.22%, while the presented character recognition system has an accuracy of 95.10%.

CVJun 30, 2021

Word-level Sign Language Recognition with Multi-stream Neural Networks Focusing on Local Regions and Skeletal Information

Mizuki Maruyama, Shrey Singh, Katsufumi Inoue et al.

Word-level sign language recognition (WSLR) has attracted attention because it is expected to overcome the communication barrier between people with speech impairment and those who can hear. In the WSLR problem, a method designed for action recognition has achieved the state-of-the-art accuracy. Indeed, it sounds reasonable for an action recognition method to perform well on WSLR because sign language is regarded as an action. However, a careful evaluation of the tasks reveals that the tasks of action recognition and WSLR are inherently different. Hence, in this paper, we propose a novel WSLR method that takes into account information specifically useful for the WSLR problem. We realize it as a multi-stream neural network (MSNN), which consist of three streams: 1) base stream, 2) local image stream, and 3) skeleton stream. Each stream is designed to handle different types of information. The base stream deals with quick and detailed movements of the hands and body, the local image stream focuses on handshapes and facial expressions, and the skeleton stream captures the relative positions of the body and both hands. This approach allows us to combine various types of data for more comprehensive gesture analysis. Experimental results on the WLASL and MS-ASL datasets show the effectiveness of the proposed method; it achieved an improvement of approximately 10\%--15\% in Top-1 accuracy when compared with conventional methods.

HCDec 7, 2020

Self-supervised Deep Learning for Reading Activity Classification

Md. Rabiul Islam, Shuji Sakamoto, Yoshihiro Yamada et al.

Reading analysis can give important information about a user's confidence and habits and can be used to construct feedback to improve a user's reading behavior. A lack of labeled data inhibits the effective application of fully-supervised Deep Learning (DL) for automatic reading analysis. In this paper, we propose a self-supervised DL method for reading analysis and evaluate it on two classification tasks. We first evaluate the proposed self-supervised DL method on a four-class classification task on reading detection using electrooculography (EOG) glasses datasets, followed by an evaluation of a two-class classification task of confidence estimation on answers of multiple-choice questions (MCQs) using eye-tracking datasets. Fully-supervised DL and support vector machines (SVMs) are used to compare the performance of the proposed self-supervised DL method. The results show that the proposed self-supervised DL method is superior to the fully-supervised DL and SVM for both tasks, especially when training data is scarce. This result indicates that the proposed self-supervised DL method is the superior choice for reading analysis tasks. The results of this study are important for informing the design and implementation of automatic reading analysis platforms.

SDOct 13, 2020

End-to-end Triplet Loss based Emotion Embedding System for Speech Emotion Recognition

Puneet Kumar, Sidharth Jain, Balasubramanian Raman et al.

In this paper, an end-to-end neural embedding system based on triplet loss and residual learning has been proposed for speech emotion recognition. The proposed system learns the embeddings from the emotional information of the speech utterances. The learned embeddings are used to recognize the emotions portrayed by given speech samples of various lengths. The proposed system implements Residual Neural Network architecture. It is trained using softmax pre-training and triplet loss function. The weights between the fully connected and embedding layers of the trained network are used to calculate the embedding values. The embedding representations of various emotions are mapped onto a hyperplane, and the angles among them are computed using the cosine similarity. These angles are utilized to classify a new speech sample into its appropriate emotion class. The proposed system has demonstrated 91.67% and 64.44% accuracy while recognizing emotions for RAVDESS and IEMOCAP dataset, respectively.

CVAug 28, 2020

Distortion-Adaptive Grape Bunch Counting for Omnidirectional Images

Ryota Akai, Yuzuko Utsumi, Yuka Miwa et al.

This paper proposes the first object counting method for omnidirectional images. Because conventional object counting methods cannot handle the distortion of omnidirectional images, we propose to process them using stereographic projection, which enables conventional methods to obtain a good approximation of the density function. However, the images obtained by stereographic projection are still distorted. Hence, to manage this distortion, we propose two methods. One is a new data augmentation method designed for the stereographic projection of omnidirectional images. The other is a distortion-adaptive Gaussian kernel that generates a density map ground truth while taking into account the distortion of stereographic projection. Using the counting of grape bunches as a case study, we constructed an original grape-bunch image dataset consisting of omnidirectional images and conducted experiments to evaluate the proposed method. The results show that the proposed method performs better than a direct application of the conventional method, improving mean absolute error by 14.7% and mean squared error by 10.5%.

CVDec 13, 2018

Advances of Scene Text Datasets

Masakazu Iwamura

This article introduces publicly available datasets in scene text detection and recognition. The information is as of 2017.

CVFeb 7, 2018

ShakeDrop Regularization for Deep Residual Learning

Yoshihiro Yamada, Masakazu Iwamura, Takuya Akiba et al.

Overfitting is a crucial problem in deep neural networks, even in the latest network architectures. In this paper, to relieve the overfitting effect of ResNet and its improvements (i.e., Wide ResNet, PyramidNet, and ResNeXt), we propose a new regularization method called ShakeDrop regularization. ShakeDrop is inspired by Shake-Shake, which is an effective regularization method, but can be applied to ResNeXt only. ShakeDrop is more effective than Shake-Shake and can be applied not only to ResNeXt but also ResNet, Wide ResNet, and PyramidNet. An important key is to achieve stability of training. Because effective regularization often causes unstable training, we introduce a training stabilizer, which is an unusual use of an existing regularizer. Through experiments under various conditions, we demonstrate the conditions under which ShakeDrop works well.

CVJan 20, 2017

Automatic Generation of Typographic Font from a Small Font Subset

Tomo Miyazaki, Tatsunori Tsuchiya, Yoshihiro Sugaya et al.

This paper addresses the automatic generation of a typographic font from a subset of characters. Specifically, we use a subset of a typographic font to extrapolate additional characters. Consequently, we obtain a complete font containing a number of characters sufficient for daily use. The automated generation of Japanese fonts is in high demand because a Japanese font requires over 1,000 characters. Unfortunately, professional typographers create most fonts, resulting in significant financial and time investments for font generation. The proposed method can be a great aid for font creation because designers do not need to create the majority of the characters for a new font. The proposed method uses strokes from given samples for font generation. The strokes, from which we construct characters, are extracted by exploiting a character skeleton dataset. This study makes three main contributions: a novel method of extracting strokes from characters, which is applicable to both standard fonts and their variations; a fully automated approach for constructing characters; and a selection method for sample characters. We demonstrate our proposed method by generating 2,965 characters in 47 fonts. Objective and subjective evaluations verify that the generated characters are similar to handmade characters.

CVDec 5, 2016

Deep Pyramidal Residual Networks with Separated Stochastic Depth

Yoshihiro Yamada, Masakazu Iwamura, Koichi Kise

On general object recognition, Deep Convolutional Neural Networks (DCNNs) achieve high accuracy. In particular, ResNet and its improvements have broken the lowest error rate records. In this paper, we propose a method to successfully combine two ResNet improvements, ResDrop and PyramidNet. We confirmed that the proposed network outperformed the conventional methods; on CIFAR-100, the proposed network achieved an error rate of 16.18% in contrast to PiramidNet achieving that of 18.29% and ResNeXt 17.31%.

CVMar 29, 2016

Scalable Solution for Approximate Nearest Subspace Search

Masakazu Iwamura, Masataka Konishi, Koichi Kise

Finding the nearest subspace is a fundamental problem and influential to many applications. In particular, a scalable solution that is fast and accurate for a large problem has a great impact. The existing methods for the problem are, however, useless in a large-scale problem with a large number of subspaces and high dimensionality of the feature space. A cause is that they are designed based on the traditional idea to represent a subspace by a single point. In this paper, we propose a scalable solution for the approximate nearest subspace search (ANSS) problem. Intuitively, the proposed method represents a subspace by multiple points unlike the existing methods. This makes a large-scale ANSS problem tractable. In the experiment with 3036 subspaces in the 1024-dimensional space, we confirmed that the proposed method was 7.3 times faster than the previous state-of-the-art without loss of accuracy.