CVSep 20, 2024
Formula-Supervised Visual-Geometric Pre-trainingRyosuke Yamada, Kensho Hara, Hirokatsu Kataoka et al.
Throughout the history of computer vision, while research has explored the integration of images (visual) and point clouds (geometric), many advancements in image and 3D object recognition have tended to process these modalities separately. We aim to bridge this divide by integrating images and point clouds on a unified transformer model. This approach integrates the modality-specific properties of images and point clouds and achieves fundamental downstream tasks in image and 3D object recognition on a unified transformer model by learning visual-geometric representations. In this work, we introduce Formula-Supervised Visual-Geometric Pre-training (FSVGP), a novel synthetic pre-training method that automatically generates aligned synthetic images and point clouds from mathematical formulas. Through cross-modality supervision, we enable supervised pre-training between visual and geometric modalities. FSVGP also reduces reliance on real data collection, cross-modality alignment, and human annotation. Our experimental results show that FSVGP pre-trains more effectively than VisualAtom and PC-FractalDB across six tasks: image and 3D object classification, detection, and segmentation. These achievements demonstrate FSVGP's superior generalization in image and 3D object recognition and underscore the potential of synthetic pre-training in visual-geometric representation learning. Our project website is available at https://ryosuke-yamada.github.io/fdsl-fsvgp/.
CVAug 8, 2023
Estimation of Human Condition at Disaster Site Using Aerial Drone ImagesTomoki Arai, Kenji Iwata, Kensho Hara et al.
Drones are being used to assess the situation in various disasters. In this study, we investigate a method to automatically estimate the damage status of people based on their actions in aerial drone images in order to understand disaster sites faster and save labor. We constructed a new dataset of aerial images of human actions in a hypothetical disaster that occurred in an urban area, and classified the human damage status using 3D ResNet. The results showed that the status with characteristic human actions could be classified with a recall rate of more than 80%, while other statuses with similar human actions could only be classified with a recall rate of about 50%. In addition, a cloud-based VR presentation application suggested the effectiveness of using drones to understand the disaster site and estimate the human condition.
CVNov 27, 2017Code
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?Kensho Hara, Hirokatsu Kataoka, Yutaka Satoh
The purpose of this study is to determine whether current video datasets have sufficient data for training very deep convolutional neural networks (CNNs) with spatio-temporal three-dimensional (3D) kernels. Recently, the performance levels of 3D CNNs in the field of action recognition have improved significantly. However, to date, conventional research has only explored relatively shallow 3D architectures. We examine the architectures of various 3D CNNs from relatively shallow to very deep ones on current video datasets. Based on the results of those experiments, the following conclusions could be obtained: (i) ResNet-18 training resulted in significant overfitting for UCF-101, HMDB-51, and ActivityNet but not for Kinetics. (ii) The Kinetics dataset has sufficient data for training of deep 3D CNNs, and enables training of up to 152 ResNets layers, interestingly similar to 2D ResNets on ImageNet. ResNeXt-101 achieved 78.4% average accuracy on the Kinetics test set. (iii) Kinetics pretrained simple 3D architectures outperforms complex 2D architectures, and the pretrained ResNeXt-101 achieved 94.5% and 70.2% on UCF-101 and HMDB-51, respectively. The use of 2D CNNs trained on ImageNet has produced significant progress in various tasks in image. We believe that using deep 3D CNNs together with Kinetics will retrace the successful history of 2D CNNs and ImageNet, and stimulate advances in computer vision for videos. The codes and pretrained models used in this study are publicly available. https://github.com/kenshohara/3D-ResNets-PyTorch
CVAug 25, 2017Code
Learning Spatio-Temporal Features with 3D Residual Networks for Action RecognitionKensho Hara, Hirokatsu Kataoka, Yutaka Satoh
Convolutional neural networks with spatio-temporal 3D kernels (3D CNNs) have an ability to directly extract spatio-temporal features from videos for action recognition. Although the 3D kernels tend to overfit because of a large number of their parameters, the 3D CNNs are greatly improved by using recent huge video databases. However, the architecture of 3D CNNs is relatively shallow against to the success of very deep neural networks in 2D-based CNNs, such as residual networks (ResNets). In this paper, we propose a 3D CNNs based on ResNets toward a better action representation. We describe the training procedure of our 3D ResNets in details. We experimentally evaluate the 3D ResNets on the ActivityNet and Kinetics datasets. The 3D ResNets trained on the Kinetics did not suffer from overfitting despite the large number of parameters of the model, and achieved better performance than relatively shallow networks, such as C3D. Our code and pretrained models (e.g. Kinetics and ActivityNet) are publicly available at https://github.com/kenshohara/3D-ResNets.
CVApr 21, 2025
Guidelines for External Disturbance Factors in the Use of OCR in Real-World EnvironmentsKenji Iwata, Eiki Ishidera, Toshifumi Yamaai et al.
The performance of OCR has improved with the evolution of AI technology. As OCR continues to broaden its range of applications, the increased likelihood of interference introduced by various usage environments can prevent it from achieving its inherent performance. This results in reduced recognition accuracy under certain conditions, and makes the quality control of recognition devices more challenging. Therefore, to ensure that users can properly utilize OCR, we compiled the real-world external disturbance factors that cause performance degradation, along with the resulting image degradation phenomena, into an external disturbance factor table and, by also indicating how to make use of it, organized them into guidelines.
CVMar 25, 2021
Describing and Localizing Multiple Changes with TransformersYue Qiu, Shintaro Yamamoto, Kodai Nakashima et al.
Change captioning tasks aim to detect changes in image pairs observed before and after a scene change and generate a natural language description of the changes. Existing change captioning studies have mainly focused on a single change.However, detecting and describing multiple changed parts in image pairs is essential for enhancing adaptability to complex scenarios. We solve the above issues from three aspects: (i) We propose a simulation-based multi-change captioning dataset; (ii) We benchmark existing state-of-the-art methods of single change captioning on multi-change captioning; (iii) We further propose Multi-Change Captioning transformers (MCCFormers) that identify change regions by densely correlating different regions in image pairs and dynamically determines the related change regions with words in sentences. The proposed method obtained the highest scores on four conventional change captioning evaluation metrics for multi-change captioning. Additionally, our proposed method can separate attention maps for each change and performs well with respect to change localization. Moreover, the proposed framework outperformed the previous state-of-the-art methods on an existing change captioning benchmark, CLEVR-Change, by a large margin (+6.1 on BLEU-4 and +9.7 on CIDEr scores), indicating its general ability in change captioning tasks.
CVJan 21, 2021
Pre-training without Natural ImagesHirokatsu Kataoka, Kazushige Okayasu, Asato Matsumoto et al.
Is it possible to use convolutional neural networks pre-trained without any natural images to assist natural image understanding? The paper proposes a novel concept, Formula-driven Supervised Learning. We automatically generate image patterns and their category labels by assigning fractals, which are based on a natural law existing in the background knowledge of the real world. Theoretically, the use of automatically generated images instead of natural images in the pre-training phase allows us to generate an infinite scale dataset of labeled images. Although the models pre-trained with the proposed Fractal DataBase (FractalDB), a database without natural images, does not necessarily outperform models pre-trained with human annotated datasets at all settings, we are able to partially surpass the accuracy of ImageNet/Places pre-trained models. The image representation with the proposed FractalDB captures a unique feature in the visualization of convolutional layers and attentions.
CVApr 10, 2020
Would Mega-scale Datasets Further Enhance Spatiotemporal 3D CNNs?Hirokatsu Kataoka, Tenga Wakamiya, Kensho Hara et al.
How can we collect and use a video dataset to further improve spatiotemporal 3D Convolutional Neural Networks (3D CNNs)? In order to positively answer this open question in video recognition, we have conducted an exploration study using a couple of large-scale video datasets and 3D CNNs. In the early era of deep neural networks, 2D CNNs have been better than 3D CNNs in the context of video recognition. Recent studies revealed that 3D CNNs can outperform 2D CNNs trained on a large-scale video dataset. However, we heavily rely on architecture exploration instead of dataset consideration. Therefore, in the present paper, we conduct exploration study in order to improve spatiotemporal 3D CNNs as follows: (i) Recently proposed large-scale video datasets help improve spatiotemporal 3D CNNs in terms of video classification accuracy. We reveal that a carefully annotated dataset (e.g., Kinetics-700) effectively pre-trains a video representation for a video classification task. (ii) We confirm the relationships between #category/#instance and video classification accuracy. The results show that #category should initially be fixed, and then #instance is increased on a video dataset in case of dataset construction. (iii) In order to practically extend a video dataset, we simply concatenate publicly available datasets, such as Kinetics-700 and Moments in Time (MiT) datasets. Compared with Kinetics-700 pre-training, we further enhance spatiotemporal 3D CNNs with the merged dataset, e.g., +0.9, +3.4, and +1.1 on UCF-101, HMDB-51, and ActivityNet datasets, respectively, in terms of fine-tuning. (iv) In terms of recognition architecture, the Kinetics-700 and merged dataset pre-trained models increase the recognition performance to 200 layers with the Residual Network (ResNet), while the Kinetics-400 pre-trained model cannot successfully optimize the 200-layer architecture.
CVMar 27, 2020
Weakly Supervised Dataset Collection for Robust Person DetectionMunetaka Minoguchi, Ken Okayama, Yutaka Satoh et al.
To construct an algorithm that can provide robust person detection, we present a dataset with over 8 million images that was produced in a weakly supervised manner. Through labor-intensive human annotation, the person detection research community has produced relatively small datasets containing on the order of 100,000 images, such as the EuroCity Persons dataset, which includes 240,000 bounding boxes. Therefore, we have collected 8.7 million images of persons based on a two-step collection process, namely person detection with an existing detector and data refinement for false positive suppression. According to the experimental results, the Weakly Supervised Person Dataset (WSPD) is simple yet effective for person detection pre-training. In the context of pre-trained person detection algorithms, our WSPD pre-trained model has 13.38 and 6.38% better accuracy than the same model trained on the fully supervised ImageNet and EuroCity Persons datasets, respectively, when verified with the Caltech Pedestrian.
CVApr 8, 2018
Anticipating Traffic Accidents with Adaptive Loss and Large-scale Incident DBTomoyuki Suzuki, Hirokatsu Kataoka, Yoshimitsu Aoki et al.
In this paper, we propose a novel approach for traffic accident anticipation through (i) Adaptive Loss for Early Anticipation (AdaLEA) and (ii) a large-scale self-annotated incident database for anticipation. The proposed AdaLEA allows a model to gradually learn an earlier anticipation as training progresses. The loss function adaptively assigns penalty weights depending on how early the model can an- ticipate a traffic accident at each epoch. Additionally, we construct a Near-miss Incident DataBase for anticipation. This database contains an enormous number of traffic near- miss incident videos and annotations for detail evaluation of two tasks, risk anticipation and risk-factor anticipation. In our experimental results, we found our proposal achieved the highest scores for risk anticipation (+6.6% better on mean average precision (mAP) and 2.36 sec earlier than previous work on the average time-to-collision (ATTC)) and risk-factor anticipation (+4.3% better on mAP and 0.70 sec earlier than previous work on ATTC).
CVApr 7, 2018
Drive Video Analysis for the Detection of Traffic Near-Miss IncidentsHirokatsu Kataoka, Teppei Suzuki, Shoko Oikawa et al.
Because of their recent introduction, self-driving cars and advanced driver assistance system (ADAS) equipped vehicles have had little opportunity to learn, the dangerous traffic (including near-miss incident) scenarios that provide normal drivers with strong motivation to drive safely. Accordingly, as a means of providing learning depth, this paper presents a novel traffic database that contains information on a large number of traffic near-miss incidents that were obtained by mounting driving recorders in more than 100 taxis over the course of a decade. The study makes the following two main contributions: (i) In order to assist automated systems in detecting near-miss incidents based on database instances, we created a large-scale traffic near-miss incident database (NIDB) that consists of video clip of dangerous events captured by monocular driving recorders. (ii) To illustrate the applicability of NIDB traffic near-miss incidents, we provide two primary database-related improvements: parameter fine-tuning using various near-miss scenes from NIDB, and foreground/background separation into motion representation. Then, using our new database in conjunction with a monocular driving recorder, we developed a near-miss recognition method that provides automated systems with a performance level that is comparable to a human-level understanding of near-miss incidents (64.5% vs. 68.4% at near-miss recognition, 61.3% vs. 78.7% at near-miss detection).
CVMay 10, 2017
Collaborative Descriptors: Convolutional Maps for PreprocessingHirokatsu Kataoka, Kaori Abe, Akio Nakamura et al.
The paper presents a novel concept for collaborative descriptors between deeply learned and hand-crafted features. To achieve this concept, we apply convolutional maps for pre-processing, namely the convovlutional maps are used as input of hand-crafted features. We recorded an increase in the performance rate of +17.06 % (multi-class object recognition) and +24.71 % (car detection) from grayscale input to convolutional maps. Although the framework is straight-forward, the concept should be inherited for an improved representation.
CVMar 23, 2017
Changing Fashion CulturesKaori Abe, Teppei Suzuki, Shunya Ueta et al.
The paper presents a novel concept that analyzes and visualizes worldwide fashion trends. Our goal is to reveal cutting-edge fashion trends without displaying an ordinary fashion style. To achieve the fashion-based analysis, we created a new fashion culture database (FCDB), which consists of 76 million geo-tagged images in 16 cosmopolitan cities. By grasping a fashion trend of mixed fashion styles,the paper also proposes an unsupervised fashion trend descriptor (FTD) using a fashion descriptor, a codeword vetor, and temporal analysis. To unveil fashion trends in the FCDB, the temporal analysis in FTD effectively emphasizes consecutive features between two different times. In experiments, we clearly show the analysis of fashion trends and fashion-based city similarity. As the result of large-scale data collection and an unsupervised analyzer, the proposed approach achieves world-level fashion visualization in a time series. The code, model, and FCDB will be publicly available after the construction of the project page.
CVAug 30, 2016
Motion Representation with Acceleration ImagesHirokatsu Kataoka, Yun He, Soma Shirakabe et al.
Information of time differentiation is extremely important cue for a motion representation. We have applied first-order differential velocity from a positional information, moreover we believe that second-order differential acceleration is also a significant feature in a motion representation. However, an acceleration image based on a typical optical flow includes motion noises. We have not employed the acceleration image because the noises are too strong to catch an effective motion feature in an image sequence. On one hand, the recent convolutional neural networks (CNN) are robust against input noises. In this paper, we employ acceleration-stream in addition to the spatial- and temporal-stream based on the two-stream CNN. We clearly show the effectiveness of adding the acceleration stream to the two-stream CNN.
CVAug 29, 2016
Human Action Recognition without HumanHirokatsu Kataoka, Kensho Hara, Yutaka Satoh
The objective of this paper is to evaluate "human action recognition without human". Motion representation is frequently discussed in human action recognition. We have examined several sophisticated options, such as dense trajectories (DT) and the two-stream convolutional neural network (CNN). However, some features from the background could be too strong, as shown in some recent studies on human action recognition. Therefore, we considered whether a background sequence alone can classify human actions in current large-scale action datasets (e.g., UCF101). In this paper, we propose a novel concept for human action analysis that is named "human action recognition without human". An experiment clearly shows the effect of a background sequence for understanding an action label.
CVMay 1, 2016
Dominant Codewords Selection with Topic Model for Action RecognitionHirokatsu Kataoka, Masaki Hayashi, Kenji Iwata et al.
In this paper, we propose a framework for recognizing human activities that uses only in-topic dominant codewords and a mixture of intertopic vectors. Latent Dirichlet allocation (LDA) is used to develop approximations of human motion primitives; these are mid-level representations, and they adaptively integrate dominant vectors when classifying human activities. In LDA topic modeling, action videos (documents) are represented by a bag-of-words (input from a dictionary), and these are based on improved dense trajectories. The output topics correspond to human motion primitives, such as finger moving or subtle leg motion. We eliminate the impurities, such as missed tracking or changing light conditions, in each motion primitive. The assembled vector of motion primitives is an improved representation of the action. We demonstrate our method on four different datasets.
CVApr 26, 2016
Semantic Change Detection with HypermapsTeppei Suzuki, Soma Shirakabe, Yudai Miyashita et al.
Change detection is the study of detecting changes between two different images of a scene taken at different times. By the detected change areas, however, a human cannot understand how different the two images. Therefore, a semantic understanding is required in the change detection research such as disaster investigation. The paper proposes the concept of semantic change detection, which involves intuitively inserting semantic meaning into detected change areas. We mainly focus on the novel semantic segmentation in addition to a conventional change detection approach. In order to solve this problem and obtain a high-level of performance, we propose an improvement to the hypercolumns representation, hereafter known as hypermaps, which effectively uses convolutional maps obtained from convolutional neural networks (CNNs). We also employ multi-scale feature representation captured by different image patches. We applied our method to the TSUNAMI Panoramic Change Detection dataset, and re-annotated the changed areas of the dataset via semantic classes. The results show that our multi-scale hypermaps provided outstanding performance on the re-annotated TSUNAMI dataset.
CVSep 25, 2015
Feature Evaluation of Deep Convolutional Neural Networks for Object Recognition and DetectionHirokatsu Kataoka, Kenji Iwata, Yutaka Satoh
In this paper, we evaluate convolutional neural network (CNN) features using the AlexNet architecture and very deep convolutional network (VGGNet) architecture. To date, most CNN researchers have employed the last layers before output, which were extracted from the fully connected feature layers. However, since it is unlikely that feature representation effectiveness is dependent on the problem, this study evaluates additional convolutional layers that are adjacent to fully connected layers, in addition to executing simple tuning for feature concatenation (e.g., layer 3 + layer 5 + layer 7) and transformation, using tools such as principal component analysis. In our experiments, we carried out detection and classification tasks using the Caltech 101 and Daimler Pedestrian Benchmark Datasets.