Hirokatsu Kataoka

CV
h-index28
65papers
4,055citations
Novelty42%
AI Score55

65 Papers

CVSep 29, 2023Code
SegRCDB: Semantic Segmentation via Formula-Driven Supervised Learning

Risa Shinoda, Ryo Hayamizu, Kodai Nakashima et al.

Pre-training is a strong strategy for enhancing visual models to efficiently train them with a limited number of labeled images. In semantic segmentation, creating annotation masks requires an intensive amount of labor and time, and therefore, a large-scale pre-training dataset with semantic labels is quite difficult to construct. Moreover, what matters in semantic segmentation pre-training has not been fully investigated. In this paper, we propose the Segmentation Radial Contour DataBase (SegRCDB), which for the first time applies formula-driven supervised learning for semantic segmentation. SegRCDB enables pre-training for semantic segmentation without real images or any manual semantic labels. SegRCDB is based on insights about what is important in pre-training for semantic segmentation and allows efficient pre-training. Pre-training with SegRCDB achieved higher mIoU than the pre-training with COCO-Stuff for fine-tuning on ADE-20k and Cityscapes with the same number of training images. SegRCDB has a high potential to contribute to semantic segmentation pre-training and investigation by enabling the creation of large datasets without manual annotation. The SegRCDB dataset will be released under a license that allows research and commercial use. Code is available at: https://github.com/dahlian00/SegRCDB

CVJul 29, 2022Code
Neural Density-Distance Fields

Itsuki Ueda, Yoshihiro Fukuhara, Hirokatsu Kataoka et al.

The success of neural fields for 3D vision tasks is now indisputable. Following this trend, several methods aiming for visual localization (e.g., SLAM) have been proposed to estimate distance or density fields using neural fields. However, it is difficult to achieve high localization performance by only density fields-based methods such as Neural Radiance Field (NeRF) since they do not provide density gradient in most empty regions. On the other hand, distance field-based methods such as Neural Implicit Surface (NeuS) have limitations in objects' surface shapes. This paper proposes Neural Density-Distance Field (NeDDF), a novel 3D representation that reciprocally constrains the distance and density fields. We extend distance field formulation to shapes with no explicit boundary surface, such as fur or smoke, which enable explicit conversion from distance field to density field. Consistent distance and density fields realized by explicit conversion enable both robustness to initial values and high-quality registration. Furthermore, the consistency between fields allows fast convergence from sparse point clouds. Experiments show that NeDDF can achieve high localization performance while providing comparable results to NeRF on novel view synthesis. The code is available at https://github.com/ueda0319/neddf.

CVJun 18, 2022
Pre-training Vision Transformers with Formula-driven Supervised Learning

Hirokatsu Kataoka, Sora Takashima, Ryo Hayamizu et al.

In the present work, we show that the performance of formula-driven supervised learning (FDSL) can match or even exceed that of ImageNet-21k and can approach that of the JFT-300M dataset without the use of real images, human supervision, or self-supervision during the pre-training of vision transformers (ViTs). For example, ViT-Base pre-trained on ImageNet-21k and JFT-300M showed 83.0 and 84.1% top-1 accuracy when fine-tuned on ImageNet-1k, and FDSL showed 83.8% top-1 accuracy when pre-trained under comparable conditions (hyperparameters and number of epochs). Especially, the ExFractalDB-21k pre-training was calculated with x14.2 fewer images compared with JFT-300M. Images generated by formulas avoid privacy and copyright issues, labeling costs and errors, and biases that real images suffer from, and thus have tremendous potential for pre-training general models. To understand the performance of the synthetic images, we tested two hypotheses, namely (i) object contours are what matter in FDSL datasets and (ii) an increased number of parameters for label creation improves performance in FDSL pre-training. To test the former hypothesis, we constructed a dataset that consisted of simple object contour combinations. We found that this dataset matched the performance of fractal databases. For the latter hypothesis, we found that increasing the difficulty of the pre-training task generally leads to better fine-tuning accuracy.

CVMar 2, 2023
Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves

Sora Takashima, Ryo Hayamizu, Nakamasa Inoue et al.

Formula-driven supervised learning (FDSL) has been shown to be an effective method for pre-training vision transformers, where ExFractalDB-21k was shown to exceed the pre-training effect of ImageNet-21k. These studies also indicate that contours mattered more than textures when pre-training vision transformers. However, the lack of a systematic investigation as to why these contour-oriented synthetic datasets can achieve the same accuracy as real datasets leaves much room for skepticism. In the present work, we develop a novel methodology based on circular harmonics for systematically investigating the design space of contour-oriented synthetic datasets. This allows us to efficiently search the optimal range of FDSL parameters and maximize the variety of synthetic images in the dataset, which we found to be a critical factor. When the resulting new dataset VisualAtom-21k is used for pre-training ViT-Base, the top-1 accuracy reached 83.7% when fine-tuning on ImageNet-1k. This is close to the top-1 accuracy (84.2%) achieved by JFT-300M pre-training, while the number of images is 1/14. Unlike JFT-300M which is a static dataset, the quality of synthetic datasets will continue to improve, and the current work is a testament to this possibility. FDSL is also free of the common issues associated with real images, e.g. privacy/copyright issues, labeling costs/errors, and ethical biases.

CVMar 24Code
FDIF: Formula-Driven supervised Learning with Implicit Functions for 3D Medical Image Segmentation

Yukinori Yamamoto, Kazuya Nishimura, Tsukasa Fukusato et al.

Deep learning-based 3D medical image segmentation methods relies on large-scale labeled datasets, yet acquiring such data is difficult due to privacy constraints and the high cost of expert annotation. Formula-Driven Supervised Learning (FDSL) offers an appealing alternative by generating training data and labels directly from mathematical formulas. However, existing voxel-based approaches are limited in geometric expressiveness and cannot synthesize realistic textures. We introduce Formula-Driven supervised learning with Implicit Functions (FDIF), a framework that enables scalable pre-training without using any real data and medical expert annotations. FDIF introduces an implicit-function representation based on signed distance functions (SDFs), enabling compact modeling of complex geometries while exploiting the surface representation of SDFs to support controllable synthesis of both geometric and intensity textures. Across three medical image segmentation benchmarks (AMOS, ACDC, and KiTS) and three architectures (SwinUNETR, nnUNet ResEnc-L, and nnUNet Primus-M), FDIF consistently improves over a formula-driven method, and achieves performance comparable to self-supervised approaches pre-trained on large-scale real datasets. We further show that FDIF pre-training also benefits 3D classification tasks, highlighting implicit-function-based formula supervision as a promising paradigm for data-free representation learning. Code is available at https://github.com/yamanoko/FDIF.

CVAug 1, 2024
Scaling Backwards: Minimal Synthetic Pre-training?

Ryo Nakamura, Ryu Tadokoro, Ryosuke Yamada et al.

Pre-training and transfer learning are an important building block of current computer vision systems. While pre-training is usually performed on large real-world image datasets, in this paper we ask whether this is truly necessary. To this end, we search for a minimal, purely synthetic pre-training dataset that allows us to achieve performance similar to the 1 million images of ImageNet-1k. We construct such a dataset from a single fractal with perturbations. With this, we contribute three main findings. (i) We show that pre-training is effective even with minimal synthetic images, with performance on par with large-scale pre-training datasets like ImageNet-1k for full fine-tuning. (ii) We investigate the single parameter with which we construct artificial categories for our dataset. We find that while the shape differences can be indistinguishable to humans, they are crucial for obtaining strong performances. (iii) Finally, we investigate the minimal requirements for successful pre-training. Surprisingly, we find that a substantial reduction of synthetic images from 1k to 1 can even lead to an increase in pre-training performance, a motivation to further investigate ''scaling backwards''. Finally, we extend our method from synthetic images to real images to see if a single real image can show similar pre-training effect through shape augmentation. We find that the use of grayscale images and affine transformations allows even real images to ''scale backwards''.

CVJul 27, 2023
Pre-training Vision Transformers with Very Limited Synthesized Images

Ryo Nakamura, Hirokatsu Kataoka, Sora Takashima et al.

Formula-driven supervised learning (FDSL) is a pre-training method that relies on synthetic images generated from mathematical formulae such as fractals. Prior work on FDSL has shown that pre-training vision transformers on such synthetic datasets can yield competitive accuracy on a wide range of downstream tasks. These synthetic images are categorized according to the parameters in the mathematical formula that generate them. In the present work, we hypothesize that the process for generating different instances for the same category in FDSL, can be viewed as a form of data augmentation. We validate this hypothesis by replacing the instances with data augmentation, which means we only need a single image per category. Our experiments shows that this one-instance fractal database (OFDB) performs better than the original dataset where instances were explicitly generated. We further scale up OFDB to 21,000 categories and show that it matches, or even surpasses, the model pre-trained on ImageNet-21k in ImageNet-1k fine-tuning. The number of images in OFDB is 21k, whereas ImageNet-21k has 14M. This opens new possibilities for pre-training vision transformers with much smaller datasets.

GRSep 26, 2023
Diffusion-based Holistic Texture Rectification and Synthesis

Guoqing Hao, Satoshi Iizuka, Kensho Hara et al.

We present a novel framework for rectifying occlusions and distortions in degraded texture samples from natural images. Traditional texture synthesis approaches focus on generating textures from pristine samples, which necessitate meticulous preparation by humans and are often unattainable in most natural images. These challenges stem from the frequent occlusions and distortions of texture samples in natural images due to obstructions and variations in object surface geometry. To address these issues, we propose a framework that synthesizes holistic textures from degraded samples in natural images, extending the applicability of exemplar-based texture synthesis techniques. Our framework utilizes a conditional Latent Diffusion Model (LDM) with a novel occlusion-aware latent transformer. This latent transformer not only effectively encodes texture features from partially-observed samples necessary for the generation process of the LDM, but also explicitly captures long-range dependencies in samples with large occlusions. To train our model, we introduce a method for generating synthetic data by applying geometric transformations and free-form mask generation to clean textures. Experimental results demonstrate that our framework significantly outperforms existing methods both quantitatively and quantitatively. Furthermore, we conduct comprehensive ablation studies to validate the different components of our proposed framework. Results are corroborated by a perceptual user study which highlights the efficiency of our proposed approach.

CVSep 1, 2024
Rethinking Image Super-Resolution from Training Data Perspectives

Go Ohtani, Ryu Tadokoro, Ryosuke Yamada et al.

In this work, we investigate the understudied effect of the training data used for image super-resolution (SR). Most commonly, novel SR methods are developed and benchmarked on common training datasets such as DIV2K and DF2K. However, we investigate and rethink the training data from the perspectives of diversity and quality, {thereby addressing the question of ``How important is SR training for SR models?''}. To this end, we propose an automated image evaluation pipeline. With this, we stratify existing high-resolution image datasets and larger-scale image datasets such as ImageNet and PASS to compare their performances. We find that datasets with (i) low compression artifacts, (ii) high within-image diversity as judged by the number of different objects, and (iii) a large number of images from ImageNet or PASS all positively affect SR performance. We hope that the proposed simple-yet-effective dataset curation pipeline will inform the construction of SR datasets in the future and yield overall better models.

CVSep 20, 2024
Formula-Supervised Visual-Geometric Pre-training

Ryosuke Yamada, Kensho Hara, Hirokatsu Kataoka et al.

Throughout the history of computer vision, while research has explored the integration of images (visual) and point clouds (geometric), many advancements in image and 3D object recognition have tended to process these modalities separately. We aim to bridge this divide by integrating images and point clouds on a unified transformer model. This approach integrates the modality-specific properties of images and point clouds and achieves fundamental downstream tasks in image and 3D object recognition on a unified transformer model by learning visual-geometric representations. In this work, we introduce Formula-Supervised Visual-Geometric Pre-training (FSVGP), a novel synthetic pre-training method that automatically generates aligned synthetic images and point clouds from mathematical formulas. Through cross-modality supervision, we enable supervised pre-training between visual and geometric modalities. FSVGP also reduces reliance on real data collection, cross-modality alignment, and human annotation. Our experimental results show that FSVGP pre-trains more effectively than VisualAtom and PC-FractalDB across six tasks: image and 3D object classification, detection, and segmentation. These achievements demonstrate FSVGP's superior generalization in image and 3D object recognition and underscore the potential of synthetic pre-training in visual-geometric representation learning. Our project website is available at https://ryosuke-yamada.github.io/fdsl-fsvgp/.

CVOct 3, 2023
Constructing Image-Text Pair Dataset from Books

Yamato Okamoto, Haruto Toyonaga, Yoshihisa Ijiri et al.

Digital archiving is becoming widespread owing to its effectiveness in protecting valuable books and providing knowledge to many people electronically. In this paper, we propose a novel approach to leverage digital archives for machine learning. If we can fully utilize such digitized data, machine learning has the potential to uncover unknown insights and ultimately acquire knowledge autonomously, just like humans read books. As a first step, we design a dataset construction pipeline comprising an optical character reader (OCR), an object detector, and a layout analyzer for the autonomous extraction of image-text pairs. In our experiments, we apply our pipeline on old photo books to construct an image-text pair dataset, showing its effectiveness in image-text retrieval and insight extraction.

CVDec 28, 2025
3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds

Ryousuke Yamada, Kohsuke Ide, Yoshihiro Fukuhara et al.

Despite recent progress in 3D self-supervised learning, collecting large-scale 3D scene scans remains expensive and labor-intensive. In this work, we investigate whether 3D representations can be learned from unlabeled videos recorded without any real 3D sensors. We present Laplacian-Aware Multi-level 3D Clustering with Sinkhorn-Knopp (LAM3C), a self-supervised framework that learns from video-generated point clouds from unlabeled videos. We first introduce RoomTours, a video-generated point cloud dataset constructed by collecting room-walkthrough videos from the web (e.g., real-estate tours) and generating 49,219 scenes using an off-the-shelf feed-forward reconstruction model. We also propose a noise-regularized loss that stabilizes representation learning by enforcing local geometric smoothness and ensuring feature stability under noisy point clouds. Remarkably, without using any real 3D scans, LAM3C achieves higher performance than the previous self-supervised methods on indoor semantic and instance segmentation. These results suggest that unlabeled videos represent an abundant source of data for 3D self-supervised learning.

CVMar 6, 2023
Scapegoat Generation for Privacy Protection from Deepfake

Gido Kato, Yoshihiro Fukuhara, Mariko Isogawa et al.

To protect privacy and prevent malicious use of deepfake, current studies propose methods that interfere with the generation process, such as detection and destruction approaches. However, these methods suffer from sub-optimal generalization performance to unseen models and add undesirable noise to the original image. To address these problems, we propose a new problem formulation for deepfake prevention: generating a ``scapegoat image'' by modifying the style of the original input in a way that is recognizable as an avatar by the user, but impossible to reconstruct the real face. Even in the case of malicious deepfake, the privacy of the users is still protected. To achieve this, we introduce an optimization-based editing method that utilizes GAN inversion to discourage deepfake models from generating similar scapegoats. We validate the effectiveness of our proposed method through quantitative and user studies.

CVJan 8, 2024Code
Primitive Geometry Segment Pre-training for 3D Medical Image Segmentation

Ryu Tadokoro, Ryosuke Yamada, Kodai Nakashima et al.

The construction of 3D medical image datasets presents several issues, including requiring significant financial costs in data collection and specialized expertise for annotation, as well as strict privacy concerns for patient confidentiality compared to natural image datasets. Therefore, it has become a pressing issue in 3D medical image segmentation to enable data-efficient learning with limited 3D medical data and supervision. A promising approach is pre-training, but improving its performance in 3D medical image segmentation is difficult due to the small size of existing 3D medical image datasets. We thus present the Primitive Geometry Segment Pre-training (PrimGeoSeg) method to enable the learning of 3D semantic features by pre-training segmentation tasks using only primitive geometric objects for 3D medical image segmentation. PrimGeoSeg performs more accurate and efficient 3D medical image segmentation without manual data collection and annotation. Further, experimental results show that PrimGeoSeg on SwinUNETR improves performance over learning from scratch on BTCV, MSD (Task06), and BraTS datasets by 3.7%, 4.4%, and 0.3%, respectively. Remarkably, the performance was equal to or better than state-of-the-art self-supervised learning despite the equal number of pre-training data. From experimental results, we conclude that effective pre-training can be achieved by looking at primitive geometric objects only. Code and dataset are available at https://github.com/SUPER-TADORY/PrimGeoSeg.

CVNov 30, 2025
HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics

Masatoshi Tateno, Gido Kato, Hirokatsu Kataoka et al.

Hand-object interaction (HOI) inherently involves dynamics where human manipulations produce distinct spatio-temporal effects on objects. However, existing semantic HOI benchmarks focused either on manipulation or on the resulting effects at a coarse level, lacking fine-grained spatio-temporal reasoning to capture the underlying dynamics in HOI. We introduce HanDyVQA, a fine-grained video question-answering benchmark that comprehensively covers both the manipulation and effect aspects of HOI. HanDyVQA comprises six complementary question types (Action, Process, Objects, Location, State Change, and Object Parts), totalling 11.1K multiple-choice QA pairs. Collected QA pairs recognizing manipulation styles, hand/object motions, and part-level state changes. HanDyVQA also includes 10.3K segmentation masks for Objects and Object Parts questions, enabling the evaluation of object/part-level reasoning in video object segmentation. We evaluated recent video foundation models on our benchmark and found that even the best-performing model, Gemini-2.5-Pro, reached only 73% average accuracy, which is far from human performance (97%). Further analysis shows the remaining challenges in spatial relationship, motion, and part-level geometric understanding. We also found that integrating explicit HOI-related cues into visual features improves performance, offering insights for developing future models with a deeper understanding of HOI dynamics.

CVNov 7, 2023
Image Generation and Learning Strategy for Deep Document Forgery Detection

Yamato Okamoto, Osada Genki, Iu Yahiro et al.

In recent years, document processing has flourished and brought numerous benefits. However, there has been a significant rise in reported cases of forged document images. Specifically, recent advancements in deep neural network (DNN) methods for generative tasks may amplify the threat of document forgery. Traditional approaches for forged document images created by prevalent copy-move methods are unsuitable against those created by DNN-based methods, as we have verified. To address this issue, we construct a training dataset of document forgery images, named FD-VIED, by emulating possible attacks, such as text addition, removal, and replacement with recent DNN-methods. Additionally, we introduce an effective pre-training approach through self-supervised learning with both natural images and document images. In our experiments, we demonstrate that our approach enhances detection performance.

CVJul 28, 2025Code
AgroBench: Vision-Language Model Benchmark in Agriculture

Risa Shinoda, Nakamasa Inoue, Hirokatsu Kataoka et al.

Precise automated understanding of agricultural tasks such as disease identification is essential for sustainable crop production. Recent advances in vision-language models (VLMs) are expected to further expand the range of agricultural tasks by facilitating human-model interaction through easy, text-based communication. Here, we introduce AgroBench (Agronomist AI Benchmark), a benchmark for evaluating VLM models across seven agricultural topics, covering key areas in agricultural engineering and relevant to real-world farming. Unlike recent agricultural VLM benchmarks, AgroBench is annotated by expert agronomists. Our AgroBench covers a state-of-the-art range of categories, including 203 crop categories and 682 disease categories, to thoroughly evaluate VLM capabilities. In our evaluation on AgroBench, we reveal that VLMs have room for improvement in fine-grained identification tasks. Notably, in weed identification, most open-source VLMs perform close to random. With our wide range of topics and expert-annotated categories, we analyze the types of errors made by VLMs and suggest potential pathways for future VLM development. Our dataset and code are available at https://dahlian00.github.io/AgroBenchPage/ .

CVMar 17, 2022
Community-Driven Comprehensive Scientific Paper Summarization: Insight from cvpaper.challenge

Shintaro Yamamoto, Hirokatsu Kataoka, Ryota Suzuki et al.

The present paper introduces a group activity involving writing summaries of conference proceedings by volunteer participants. The rapid increase in scientific papers is a heavy burden for researchers, especially non-native speakers, who need to survey scientific literature. To alleviate this problem, we organized a group of non-native English speakers to write summaries of papers presented at a computer vision conference to share the knowledge of the papers read by the group. We summarized a total of 2,000 papers presented at the Conference on Computer Vision and Pattern Recognition, a top-tier conference on computer vision, in 2019 and 2020. We quantitatively analyzed participants' selection regarding which papers they read among the many available papers. The experimental results suggest that we can summarize a wide range of papers without asking participants to read papers unrelated to their interests.

CVOct 23, 2023
Leveraging Image-Text Similarity and Caption Modification for the DataComp Challenge: Filtering Track and BYOD Track

Shuhei Yokoo, Peifei Zhu, Yuchi Ishikawa et al.

Large web crawl datasets have already played an important role in learning multimodal features with high generalization capabilities. However, there are still very limited studies investigating the details or improvements of data design. Recently, a DataComp challenge has been designed to propose the best training data with the fixed models. This paper presents our solution to both filtering track and BYOD track of the DataComp challenge. Our solution adopts large multimodal models CLIP and BLIP-2 to filter and modify web crawl data, and utilize external datasets along with a bag of tricks to improve the data quality. Experiments show our solution significantly outperforms DataComp baselines (filtering track: 6.6% improvement, BYOD track: 48.5% improvement).

CVNov 27, 2017Code
Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?

Kensho Hara, Hirokatsu Kataoka, Yutaka Satoh

The purpose of this study is to determine whether current video datasets have sufficient data for training very deep convolutional neural networks (CNNs) with spatio-temporal three-dimensional (3D) kernels. Recently, the performance levels of 3D CNNs in the field of action recognition have improved significantly. However, to date, conventional research has only explored relatively shallow 3D architectures. We examine the architectures of various 3D CNNs from relatively shallow to very deep ones on current video datasets. Based on the results of those experiments, the following conclusions could be obtained: (i) ResNet-18 training resulted in significant overfitting for UCF-101, HMDB-51, and ActivityNet but not for Kinetics. (ii) The Kinetics dataset has sufficient data for training of deep 3D CNNs, and enables training of up to 152 ResNets layers, interestingly similar to 2D ResNets on ImageNet. ResNeXt-101 achieved 78.4% average accuracy on the Kinetics test set. (iii) Kinetics pretrained simple 3D architectures outperforms complex 2D architectures, and the pretrained ResNeXt-101 achieved 94.5% and 70.2% on UCF-101 and HMDB-51, respectively. The use of 2D CNNs trained on ImageNet has produced significant progress in various tasks in image. We believe that using deep 3D CNNs together with Kinetics will retrace the successful history of 2D CNNs and ImageNet, and stimulate advances in computer vision for videos. The codes and pretrained models used in this study are publicly available. https://github.com/kenshohara/3D-ResNets-PyTorch

CVAug 25, 2017Code
Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition

Kensho Hara, Hirokatsu Kataoka, Yutaka Satoh

Convolutional neural networks with spatio-temporal 3D kernels (3D CNNs) have an ability to directly extract spatio-temporal features from videos for action recognition. Although the 3D kernels tend to overfit because of a large number of their parameters, the 3D CNNs are greatly improved by using recent huge video databases. However, the architecture of 3D CNNs is relatively shallow against to the success of very deep neural networks in 2D-based CNNs, such as residual networks (ResNets). In this paper, we propose a 3D CNNs based on ResNets toward a better action representation. We describe the training procedure of our 3D ResNets in details. We experimentally evaluate the 3D ResNets on the ActivityNet and Kinetics datasets. The 3D ResNets trained on the Kinetics did not suffer from overfitting despite the large number of parameters of the model, and achieved better performance than relatively shallow networks, such as C3D. Our code and pretrained models (e.g. Kinetics and ActivityNet) are publicly available at https://github.com/kenshohara/3D-ResNets.

CVMar 26
MoireMix: A Formula-Based Data Augmentation for Improving Image Classification Robustness

Yuto Matsuo, Yoshihiro Fukuhara, Yuki M. Asano et al.

Data augmentation is a key technique for improving the robustness of image classification models. However, many recent approaches rely on diffusion-based synthesis or complex feature mixing strategies, which introduce substantial computational overhead or require external datasets. In this work, we explore a different direction: procedural augmentation based on analytic interference patterns. Unlike conventional augmentation methods that rely on stochastic noise, feature mixing, or generative models, our approach exploits Moire interference to generate structured perturbations spanning a wide range of spatial frequencies. We propose a lightweight augmentation method that procedurally generates Moire textures on-the-fly using a closed-form mathematical formulation. The patterns are synthesized directly in memory with negligible computational cost (0.0026 seconds per image), mixed with training images during training, and immediately discarded, enabling a storage-free augmentation pipeline without external data. Extensive experiments with Vision Transformers demonstrate that the proposed method consistently improves robustness across multiple benchmarks, including ImageNet-C, ImageNet-R, and adversarial benchmarks, outperforming standard augmentation baselines and existing external-data-free augmentation approaches. These results suggest that analytic interference patterns provide a practical and efficient alternative to data-driven generative augmentation methods.

CVApr 15, 2024
Watermark-embedded Adversarial Examples for Copyright Protection against Diffusion Models

Peifei Zhu, Tsubasa Takahashi, Hirokatsu Kataoka

Diffusion Models (DMs) have shown remarkable capabilities in various image-generation tasks. However, there are growing concerns that DMs could be used to imitate unauthorized creations and thus raise copyright issues. To address this issue, we propose a novel framework that embeds personal watermarks in the generation of adversarial examples. Such examples can force DMs to generate images with visible watermarks and prevent DMs from imitating unauthorized images. We construct a generator based on conditional adversarial networks and design three losses (adversarial loss, GAN loss, and perturbation loss) to generate adversarial examples that have subtle perturbation but can effectively attack DMs to prevent copyright violations. Training a generator for a personal watermark by our method only requires 5-10 samples within 2-3 minutes, and once the generator is trained, it can generate adversarial examples with that watermark significantly fast (0.2s per image). We conduct extensive experiments in various conditional image-generation scenarios. Compared to existing methods that generate images with chaotic textures, our method adds visible watermarks on the generated images, which is a more straightforward way to indicate copyright violations. We also observe that our adversarial examples exhibit good transferability across unknown generative models. Therefore, this work provides a simple yet powerful way to protect copyright from DM-based imitation.

CVJan 22, 2025
Can masking background and object reduce static bias for zero-shot action recognition?

Takumi Fukuzawa, Kensho Hara, Hirokatsu Kataoka et al.

In this paper, we address the issue of static bias in zero-shot action recognition. Action recognition models need to represent the action itself, not the appearance. However, some fully-supervised works show that models often rely on static appearances, such as the background and objects, rather than human actions. This issue, known as static bias, has not been investigated for zero-shot. Although CLIP-based zero-shot models are now common, it remains unclear if they sufficiently focus on human actions, as CLIP primarily captures appearance features related to languages. In this paper, we investigate the influence of static bias in zero-shot action recognition with CLIP-based models. Our approach involves masking backgrounds, objects, and people differently during training and validation. Experiments with masking background show that models depend on background bias as their performance decreases for Kinetics400. However, for Mimetics, which has a weak background bias, masking the background leads to improved performance even if the background is masked during validation. Furthermore, masking both the background and objects in different colors improves performance for SSv2, which has a strong object bias. These results suggest that masking the background or objects during training prevents models from overly depending on static bias and makes them focus more on human action.

CVJul 27, 2025
AnimalClue: Recognizing Animals by their Traces

Risa Shinoda, Nakamasa Inoue, Iro Laina et al.

Wildlife observation plays an important role in biodiversity conservation, necessitating robust methodologies for monitoring wildlife populations and interspecies interactions. Recent advances in computer vision have significantly contributed to automating fundamental wildlife observation tasks, such as animal detection and species identification. However, accurately identifying species from indirect evidence like footprints and feces remains relatively underexplored, despite its importance in contributing to wildlife monitoring. To bridge this gap, we introduce AnimalClue, the first large-scale dataset for species identification from images of indirect evidence. Our dataset consists of 159,605 bounding boxes encompassing five categories of indirect clues: footprints, feces, eggs, bones, and feathers. It covers 968 species, 200 families, and 65 orders. Each image is annotated with species-level labels, bounding boxes or segmentation masks, and fine-grained trait information, including activity patterns and habitat preferences. Unlike existing datasets primarily focused on direct visual features (e.g., animal appearances), AnimalClue presents unique challenges for classification, detection, and instance segmentation tasks due to the need for recognizing more detailed and subtle visual features. In our experiments, we extensively evaluate representative vision models and identify key challenges in animal identification from their traces. Our dataset and code are available at https://dahlian00.github.io/AnimalCluePage/

SDApr 6, 2025
Formula-Supervised Sound Event Detection: Pre-Training Without Real Data

Yuto Shibata, Keitaro Tanaka, Yoshiaki Bando et al.

In this paper, we propose a novel formula-driven supervised learning (FDSL) framework for pre-training an environmental sound analysis model by leveraging acoustic signals parametrically synthesized through formula-driven methods. Specifically, we outline detailed procedures and evaluate their effectiveness for sound event detection (SED). The SED task, which involves estimating the types and timings of sound events, is particularly challenged by the difficulty of acquiring a sufficient quantity of accurately labeled training data. Moreover, it is well known that manually annotated labels often contain noises and are significantly influenced by the subjective judgment of annotators. To address these challenges, we propose a novel pre-training method that utilizes a synthetic dataset, Formula-SED, where acoustic data are generated solely based on mathematical formulas. The proposed method enables large-scale pre-training by using the synthesis parameters applied at each time step as ground truth labels, thereby eliminating label noise and bias. We demonstrate that large-scale pre-training with Formula-SED significantly enhances model accuracy and accelerates training, as evidenced by our results in the DESED dataset used for DCASE2023 Challenge Task 4. The project page is at https://yutoshibata07.github.io/Formula-SED/

CVMar 31, 2025
Pre-training with 3D Synthetic Data: Learning 3D Point Cloud Instance Segmentation from 3D Synthetic Scenes

Daichi Otsuka, Shinichi Mae, Ryosuke Yamada et al.

In the recent years, the research community has witnessed growing use of 3D point cloud data for the high applicability in various real-world applications. By means of 3D point cloud, this modality enables to consider the actual size and spatial understanding. The applied fields include mechanical control of robots, vehicles, or other real-world systems. Along this line, we would like to improve 3D point cloud instance segmentation which has emerged as a particularly promising approach for these applications. However, the creation of 3D point cloud datasets entails enormous costs compared to 2D image datasets. To train a model of 3D point cloud instance segmentation, it is necessary not only to assign categories but also to provide detailed annotations for each point in the large-scale 3D space. Meanwhile, the increase of recent proposals for generative models in 3D domain has spurred proposals for using a generative model to create 3D point cloud data. In this work, we propose a pre-training with 3D synthetic data to train a 3D point cloud instance segmentation model based on generative model for 3D scenes represented by point cloud data. We directly generate 3D point cloud data with Point-E for inserting a generated data into a 3D scene. More recently in 2025, although there are other accurate 3D generation models, even using the Point-E as an early 3D generative model can effectively support the pre-training with 3D synthetic data. In the experimental section, we compare our pre-training method with baseline methods indicated improved performance, demonstrating the efficacy of 3D generative models for 3D point cloud instance segmentation.

CVNov 28, 2025
PowerCLIP: Powerset Alignment for Contrastive Pre-Training

Masaki Kawamura, Nakamasa Inoue, Rintaro Yanagi et al.

Contrastive vision-language pre-training frameworks such as CLIP have demonstrated impressive zero-shot performance across a range of vision-language tasks. Recent studies have shown that aligning individual text tokens with specific image patches or regions enhances fine-grained compositional understanding. However, it remains challenging to capture compositional semantics that span multiple image regions. To address this limitation, we propose PowerCLIP, a novel contrastive pre-training framework enhanced by powerset alignment, which exhaustively optimizes region-to-phrase alignments by minimizing the loss defined between powersets of image regions and textual parse trees. Since the naive powerset construction incurs exponential computational cost due to the combinatorial explosion in the number of region subsets, we introduce efficient non-linear aggregators (NLAs) that reduce complexity from O(2^M) to O(M) with respect to the number of regions M, while approximating the exact loss value with arbitrary precision. Our extensive experiments demonstrate that PowerCLIP outperforms state-of-the-art methods in zero-shot classification and retrieval tasks, underscoring the compositionality and robustness of our approach. Our code will be made publicly available.

CVOct 24, 2025
S3OD: Towards Generalizable Salient Object Detection with Synthetic Data

Orest Kupyn, Hirokatsu Kataoka, Christian Rupprecht

Salient object detection exemplifies data-bounded tasks where expensive pixel-precise annotations force separate model training for related subtasks like DIS and HR-SOD. We present a method that dramatically improves generalization through large-scale synthetic data generation and ambiguity-aware architecture. We introduce S3OD, a dataset of over 139,000 high-resolution images created through our multi-modal diffusion pipeline that extracts labels from diffusion and DINO-v3 features. The iterative generation framework prioritizes challenging categories based on model performance. We propose a streamlined multi-mask decoder that naturally handles the inherent ambiguity in salient object detection by predicting multiple valid interpretations. Models trained solely on synthetic data achieve 20-50% error reduction in cross-dataset generalization, while fine-tuned versions reach state-of-the-art performance across DIS and HR-SOD benchmarks.

LGOct 9, 2025
Approximate Domain Unlearning for Vision-Language Models

Kodai Kawamura, Yuta Goto, Rintaro Yanagi et al.

Pre-trained Vision-Language Models (VLMs) exhibit strong generalization capabilities, enabling them to recognize a wide range of objects across diverse domains without additional training. However, they often retain irrelevant information beyond the requirements of specific downstream tasks, raising concerns about computational efficiency and potential information leakage. This has motivated growing interest in approximate unlearning, which aims to selectively remove unnecessary knowledge while preserving overall model performance. Existing approaches to approximate unlearning have primarily focused on class unlearning, where a VLM is retrained to fail to recognize specified object classes while maintaining accuracy for others. However, merely forgetting object classes is often insufficient in practical applications. For instance, an autonomous driving system should accurately recognize real cars while avoiding misrecognition of illustrated cars depicted in roadside advertisements as real cars, which could be hazardous. In this paper, we introduce Approximate Domain Unlearning (ADU), a novel problem setting that requires reducing recognition accuracy for images from specified domains (e.g., illustration) while preserving accuracy for other domains (e.g., real). ADU presents new technical challenges: due to the strong domain generalization capability of pre-trained VLMs, domain distributions are highly entangled in the feature space, making naive approaches based on penalizing target domains ineffective. To tackle this limitation, we propose a novel approach that explicitly disentangles domain distributions and adaptively captures instance-specific domain information. Extensive experiments show that our approach outperforms baselines built upon VLM tuning techniques, paving the way for practical and fine-grained unlearning in VLMs. Code: https://kodaikawamura.github.io/Domain_Unlearning/.

CVJun 28, 2025
How Semantically Informative is an Image?: Measuring the Covariance-Weighted Norm of Contrastive Learning Embeddings

Fumiya Uchiyama, Rintaro Yanagi, Shohei Taniguchi et al.

Contrastive learning has the capacity to model multimodal probability distributions by embedding and aligning visual representations with semantics from captions. This approach enables the estimation of relational semantic similarity; however, it remains unclear whether it can also represent absolute semantic informativeness. In this work, we introduce a semantic informativeness metric for an image calculated from text samples via a contrastive learning model; similarly, the informativeness of a text is calculated from image samples. We propose a redefinition of the concept of Information Gain, a concept previously explored in natural language processing, extending its application to the domains of vision and language. Our metric quantifies how conditioning on an image distorts the distribution of associated texts, and vice versa for text conditioning on image distributions. In OpenCLIP's empirical results, we observe that images with the lowest Information Gain scores often correspond to placeholder icons such as "image not found." Furthermore, we propose to measure a norm-based metric of the embedding to estimate the Information Gain, following the theoretical results for Skip-Gram with Negative Sampling (SGNS) word embedding. Information Gain can be measured using either CLIP or SigLIP, and the results demonstrate a strong correlation with a coefficient of determination ranging from 0.98 to 1.00. After obtaining the mean and the covariance of the sample embedding, the computational cost of this method is independent of the sample size, and it is compatible with publicly available, open-weight models.

CVMay 19, 2025
Industrial Synthetic Segment Pre-training

Shinichi Mae, Ryousuke Yamada, Hirokatsu Kataoka

Pre-training on real-image datasets has been widely proven effective for improving instance segmentation. However, industrial applications face two key challenges: (1) legal and ethical restrictions, such as ImageNet's prohibition of commercial use, and (2) limited transferability due to the domain gap between web images and industrial imagery. Even recent vision foundation models, including the segment anything model (SAM), show notable performance degradation in industrial settings. These challenges raise critical questions: Can we build a vision foundation model for industrial applications without relying on real images or manual annotations? And can such models outperform even fine-tuned SAM on industrial datasets? To address these questions, we propose the Instance Core Segmentation Dataset (InsCore), a synthetic pre-training dataset based on formula-driven supervised learning (FDSL). InsCore generates fully annotated instance segmentation images that reflect key characteristics of industrial data, including complex occlusions, dense hierarchical masks, and diverse non-rigid shapes, distinct from typical web imagery. Unlike previous methods, InsCore requires neither real images nor human annotations. Experiments on five industrial datasets show that models pre-trained with InsCore outperform those trained on COCO and ImageNet-21k, as well as fine-tuned SAM, achieving an average improvement of 6.2 points in instance segmentation performance. This result is achieved using only 100k synthetic images, more than 100 times fewer than the 11 million images in SAM's SA-1B dataset, demonstrating the data efficiency of our approach. These findings position InsCore as a practical and license-free vision foundation model for industrial applications.

CVApr 30, 2025
Simple Visual Artifact Detection in Sora-Generated Videos

Misora Sugiyama, Hirokatsu Kataoka

The December 2024 release of OpenAI's Sora, a powerful video generation model driven by natural language prompts, highlights a growing convergence between large language models (LLMs) and video synthesis. As these multimodal systems evolve into video-enabled LLMs (VidLLMs), capable of interpreting, generating, and interacting with visual content, understanding their limitations and ensuring their safe deployment becomes essential. This study investigates visual artifacts frequently found and reported in Sora-generated videos, which can compromise quality, mislead viewers, or propagate disinformation. We propose a multi-label classification framework targeting four common artifact label types: label 1: boundary / edge defects, label 2: texture / noise issues, label 3: movement / joint anomalies, and label 4: object mismatches / disappearances. Using a dataset of 300 manually annotated frames extracted from 15 Sora-generated videos, we trained multiple 2D CNN architectures (ResNet-50, EfficientNet-B3 / B4, ViT-Base). The best-performing model trained by ResNet-50 achieved an average multi-label classification accuracy of 94.14%. This work supports the broader development of VidLLMs by contributing to (1) the creation of datasets for video quality evaluation, (2) interpretable artifact-based analysis beyond language metrics, and (3) the identification of visual risks relevant to factuality and safety.

CVMar 27, 2025
Leveraging LLMs with Iterative Loop Structure for Enhanced Social Intelligence in Video Question Answering

Erika Mori, Yue Qiu, Hirokatsu Kataoka et al.

Social intelligence, the ability to interpret emotions, intentions, and behaviors, is essential for effective communication and adaptive responses. As robots and AI systems become more prevalent in caregiving, healthcare, and education, the demand for AI that can interact naturally with humans grows. However, creating AI that seamlessly integrates multiple modalities, such as vision and speech, remains a challenge. Current video-based methods for social intelligence rely on general video recognition or emotion recognition techniques, often overlook the unique elements inherent in human interactions. To address this, we propose the Looped Video Debating (LVD) framework, which integrates Large Language Models (LLMs) with visual information, such as facial expressions and body movements, to enhance the transparency and reliability of question-answering tasks involving human interaction videos. Our results on the Social-IQ 2.0 benchmark show that LVD achieves state-of-the-art performance without fine-tuning. Furthermore, supplementary human annotations on existing datasets provide insights into the model's accuracy, guiding future improvements in AI-driven social intelligence.

CVMar 4, 2025
On the Relationship Between Double Descent of CNNs and Shape/Texture Bias Under Learning Process

Shun Iwase, Shuya Takahashi, Nakamasa Inoue et al.

The double descent phenomenon, which deviates from the traditional bias-variance trade-off theory, attracts considerable research attention; however, the mechanism of its occurrence is not fully understood. On the other hand, in the study of convolutional neural networks (CNNs) for image recognition, methods are proposed to quantify the bias on shape features versus texture features in images, determining which features the CNN focuses on more. In this work, we hypothesize that there is a relationship between the shape/texture bias in the learning process of CNNs and epoch-wise double descent, and we conduct verification. As a result, we discover double descent/ascent of shape/texture bias synchronized with double descent of test error under conditions where epoch-wise double descent is observed. Quantitative evaluations confirm this correlation between the test errors and the bias values from the initial decrease to the full increase in test error. Interestingly, double descent/ascent of shape/texture bias is observed in some cases even in conditions without label noise, where double descent is thought not to occur. These experimental results are considered to contribute to the understanding of the mechanisms behind the double descent phenomenon and the learning process of CNNs in image recognition.

CVFeb 3, 2025
MoireDB: Formula-generated Interference-fringe Image Dataset

Yuto Matsuo, Ryo Hayamizu, Hirokatsu Kataoka et al.

Image recognition models have struggled to treat recognition robustness to real-world degradations. In this context, data augmentation methods like PixMix improve robustness but rely on generative arts and feature visualizations (FVis), which have copyright, drawing cost, and scalability issues. We propose MoireDB, a formula-generated interference-fringe image dataset for image augmentation enhancing robustness. MoireDB eliminates copyright concerns, reduces dataset assembly costs, and enhances robustness by leveraging illusory patterns. Experiments show that MoireDB augmented images outperforms traditional Fractal arts and FVis-based augmentations, making it a scalable and effective solution for improving model robustness against real-world degradations.

CVJan 16, 2025
Text-guided Synthetic Geometric Augmentation for Zero-shot 3D Understanding

Kohei Torimi, Ryosuke Yamada, Daichi Otsuka et al.

Zero-shot recognition models require extensive training data for generalization. However, in zero-shot 3D classification, collecting 3D data and captions is costly and laborintensive, posing a significant barrier compared to 2D vision. Recent advances in generative models have achieved unprecedented realism in synthetic data production, and recent research shows the potential for using generated data as training data. Here, naturally raising the question: Can synthetic 3D data generated by generative models be used as expanding limited 3D datasets? In response, we present a synthetic 3D dataset expansion method, Textguided Geometric Augmentation (TeGA). TeGA is tailored for language-image-3D pretraining, which achieves SoTA in zero-shot 3D classification, and uses a generative textto-3D model to enhance and extend limited 3D datasets. Specifically, we automatically generate text-guided synthetic 3D data and introduce a consistency filtering strategy to discard noisy samples where semantics and geometric shapes do not match with text. In the experiment to double the original dataset size using TeGA, our approach demonstrates improvements over the baselines, achieving zeroshot performance gains of 3.0% on Objaverse-LVIS, 4.6% on ScanObjectNN, and 8.7% on ModelNet40. These results demonstrate that TeGA effectively bridges the 3D data gap, enabling robust zero-shot 3D classification even with limited real training data and paving the way for zero-shot 3D vision application.

CVNov 19, 2024
Constant Rate Scheduling: Constant-Rate Distributional Change for Efficient Training and Sampling in Diffusion Models

Shuntaro Okada, Kenji Doi, Ryota Yoshihashi et al.

We propose a general approach to optimize noise schedules for training and sampling in diffusion models. Our approach optimizes the noise schedules to ensure a constant rate of change in the probability distribution of diffused data throughout the diffusion process. Any distance metric for measuring the probability-distributional change is applicable to our approach, and we introduce three distance metrics. We evaluated the effectiveness of our approach on unconditional and class-conditional image-generation tasks using the LSUN (Horse, Bedroom, Church), ImageNet, FFHQ, and CIFAR10 datasets. Through extensive experiments, we confirmed that our approach broadly improves the performance of pixel-space and latent-space diffusion models regardless of the dataset, sampler, and number of function evaluations ranging from 5 to 250. Notably, by using our approach for optimizing both training and sampling schedules, we achieved a state-of-the-art FID score of 2.03 without sacrificing mode coverage on LSUN Horse 256 $\times$ 256.

CVDec 17, 2023
Traffic Incident Database with Multiple Labels Including Various Perspective Environmental Information

Shota Nishiyama, Takuma Saito, Ryo Nakamura et al.

A large dataset of annotated traffic accidents is necessary to improve the accuracy of traffic accident recognition using deep learning models. Conventional traffic accident datasets provide annotations on traffic accidents and other teacher labels, improving traffic accident recognition performance. However, the labels annotated in conventional datasets need to be more comprehensive to describe traffic accidents in detail. Therefore, we propose V-TIDB, a large-scale traffic accident recognition dataset annotated with various environmental information as multi-labels. Our proposed dataset aims to improve the performance of traffic accident recognition by annotating ten types of environmental information as teacher labels in addition to the presence or absence of traffic accidents. V-TIDB is constructed by collecting many videos from the Internet and annotating them with appropriate environmental information. In our experiments, we compare the performance of traffic accident recognition when only labels related to the presence or absence of traffic accidents are trained and when environmental information is added as a multi-label. In the second experiment, we compare the performance of the training with only contact level, which represents the severity of the traffic accident, and the performance with environmental information added as a multi-label. The results showed that 6 out of 10 environmental information labels improved the performance of recognizing the presence or absence of traffic accidents. In the experiment on the degree of recognition of traffic accidents, the performance of recognition of car wrecks and contacts was improved for all environmental information. These experiments show that V-TIDB can be used to learn traffic accident recognition models that take environmental information into account in detail and can be used for appropriate traffic accident analysis.

CVSep 4, 2023
Exploring Limits of Diffusion-Synthetic Training with Weakly Supervised Semantic Segmentation

Ryota Yoshihashi, Yuya Otsuka, Kenji Doi et al.

The advance of generative models for images has inspired various training techniques for image recognition utilizing synthetic images. In semantic segmentation, one promising approach is extracting pseudo-masks from attention maps in text-to-image diffusion models, which enables real-image-and-annotation-free training. However, the pioneering training method using the diffusion-synthetic images and pseudo-masks, i.e., DiffuMask has limitations in terms of mask quality, scalability, and ranges of applicable domains. To overcome these limitations, this work introduces three techniques for diffusion-synthetic semantic segmentation training. First, reliability-aware robust training, originally used in weakly supervised learning, helps segmentation with insufficient synthetic mask quality. %Second, large-scale pretraining of whole segmentation models, not only backbones, on synthetic ImageNet-1k-class images with pixel-labels benefits downstream segmentation tasks. Second, we introduce prompt augmentation, data augmentation to the prompt text set to scale up and diversify training images with a limited text resources. Finally, LoRA-based adaptation of Stable Diffusion enables the transfer to a distant domain, e.g., auto-driving images. Experiments in PASCAL VOC, ImageNet-S, and Cityscapes show that our method effectively closes gap between real and synthetic training in semantic segmentation.

CVMar 25, 2021
Describing and Localizing Multiple Changes with Transformers

Yue Qiu, Shintaro Yamamoto, Kodai Nakashima et al.

Change captioning tasks aim to detect changes in image pairs observed before and after a scene change and generate a natural language description of the changes. Existing change captioning studies have mainly focused on a single change.However, detecting and describing multiple changed parts in image pairs is essential for enhancing adaptability to complex scenarios. We solve the above issues from three aspects: (i) We propose a simulation-based multi-change captioning dataset; (ii) We benchmark existing state-of-the-art methods of single change captioning on multi-change captioning; (iii) We further propose Multi-Change Captioning transformers (MCCFormers) that identify change regions by densely correlating different regions in image pairs and dynamically determines the related change regions with words in sentences. The proposed method obtained the highest scores on four conventional change captioning evaluation metrics for multi-change captioning. Additionally, our proposed method can separate attention maps for each change and performs well with respect to change localization. Moreover, the proposed framework outperformed the previous state-of-the-art methods on an existing change captioning benchmark, CLEVR-Change, by a large margin (+6.1 on BLEU-4 and +9.7 on CIDEr scores), indicating its general ability in change captioning tasks.

CVMar 24, 2021
Can Vision Transformers Learn without Natural Images?

Kodai Nakashima, Hirokatsu Kataoka, Asato Matsumoto et al.

Can we complete pre-training of Vision Transformers (ViT) without natural images and human-annotated labels? Although a pre-trained ViT seems to heavily rely on a large-scale dataset and human-annotated labels, recent large-scale datasets contain several problems in terms of privacy violations, inadequate fairness protection, and labor-intensive annotation. In the present paper, we pre-train ViT without any image collections and annotation labor. We experimentally verify that our proposed framework partially outperforms sophisticated Self-Supervised Learning (SSL) methods like SimCLRv2 and MoCov2 without using any natural images in the pre-training phase. Moreover, although the ViT pre-trained without natural images produces some different visualizations from ImageNet pre-trained ViT, it can interpret natural image datasets to a large extent. For example, the performance rates on the CIFAR-10 dataset are as follows: our proposal 97.6 vs. SimCLRv2 97.4 vs. ImageNet 98.0.

CVJan 21, 2021
Pre-training without Natural Images

Hirokatsu Kataoka, Kazushige Okayasu, Asato Matsumoto et al.

Is it possible to use convolutional neural networks pre-trained without any natural images to assist natural image understanding? The paper proposes a novel concept, Formula-driven Supervised Learning. We automatically generate image patterns and their category labels by assigning fractals, which are based on a natural law existing in the background knowledge of the real world. Theoretically, the use of automatically generated images instead of natural images in the pre-training phase allows us to generate an infinite scale dataset of labeled images. Although the models pre-trained with the proposed Fractal DataBase (FractalDB), a database without natural images, does not necessarily outperform models pre-trained with human annotated datasets at all settings, we are able to partially surpass the accuracy of ImageNet/Places pre-trained models. The image representation with the proposed FractalDB captures a unique feature in the visualization of convolutional layers and attentions.

CVJan 19, 2021
Initialization Using Perlin Noise for Training Networks with a Limited Amount of Data

Nakamasa Inoue, Eisuke Yamagata, Hirokatsu Kataoka

We propose a novel network initialization method using Perlin noise for training image classification networks with a limited amount of data. Our main idea is to initialize the network parameters by solving an artificial noise classification problem, where the aim is to classify Perlin noise samples into their noise categories. Specifically, the proposed method consists of two steps. First, it generates Perlin noise samples with category labels defined based on noise complexity. Second, it solves a classification problem, in which network parameters are optimized to classify the generated noise samples. This method produces a reasonable set of initial weights (filters) for image classification. To the best of our knowledge, this is the first work to initialize networks by solving an artificial optimization problem without using any real-world images. Our experiments show that the proposed method outperforms conventional initialization methods on four image classification datasets.

CVJul 14, 2020
Alleviating Over-segmentation Errors by Detecting Action Boundaries

Yuchi Ishikawa, Seito Kasai, Yoshimitsu Aoki et al.

We propose an effective framework for the temporal action segmentation task, namely an Action Segment Refinement Framework (ASRF). Our model architecture consists of a long-term feature extractor and two branches: the Action Segmentation Branch (ASB) and the Boundary Regression Branch (BRB). The long-term feature extractor provides shared features for the two branches with a wide temporal receptive field. The ASB classifies video frames with action classes, while the BRB regresses the action boundary probabilities. The action boundaries predicted by the BRB refine the output from the ASB, which results in a significant performance improvement. Our contributions are three-fold: (i) We propose a framework for temporal action segmentation, the ASRF, which divides temporal action segmentation into frame-wise action classification and action boundary regression. Our framework refines frame-level hypotheses of action classes using predicted action boundaries. (ii) We propose a loss function for smoothing the transition of action probabilities, and analyze combinations of various loss functions for temporal action segmentation. (iii) Our framework outperforms state-of-the-art methods on three challenging datasets, offering an improvement of up to 13.7% in terms of segmental edit distance and up to 16.1% in terms of segmental F1 score. Our code will be publicly available soon.

CVMay 19, 2020
Retrieving and Highlighting Action with Spatiotemporal Reference

Seito Kasai, Yuchi Ishikawa, Masaki Hayashi et al.

In this paper, we present a framework that jointly retrieves and spatiotemporally highlights actions in videos by enhancing current deep cross-modal retrieval methods. Our work takes on the novel task of action highlighting, which visualizes where and when actions occur in an untrimmed video setting. Action highlighting is a fine-grained task, compared to conventional action recognition tasks which focus on classification or window-based localization. Leveraging weak supervision from annotated captions, our framework acquires spatiotemporal relevance maps and generates local embeddings which relate to the nouns and verbs in captions. Through experiments, we show that our model generates various maps conditioned on different actions, in which conventional visual reasoning methods only go as far as to show a single deterministic saliency map. Also, our model improves retrieval recall over our baseline without alignment by 2-3% on the MSR-VTT dataset.

CVApr 10, 2020
Would Mega-scale Datasets Further Enhance Spatiotemporal 3D CNNs?

Hirokatsu Kataoka, Tenga Wakamiya, Kensho Hara et al.

How can we collect and use a video dataset to further improve spatiotemporal 3D Convolutional Neural Networks (3D CNNs)? In order to positively answer this open question in video recognition, we have conducted an exploration study using a couple of large-scale video datasets and 3D CNNs. In the early era of deep neural networks, 2D CNNs have been better than 3D CNNs in the context of video recognition. Recent studies revealed that 3D CNNs can outperform 2D CNNs trained on a large-scale video dataset. However, we heavily rely on architecture exploration instead of dataset consideration. Therefore, in the present paper, we conduct exploration study in order to improve spatiotemporal 3D CNNs as follows: (i) Recently proposed large-scale video datasets help improve spatiotemporal 3D CNNs in terms of video classification accuracy. We reveal that a carefully annotated dataset (e.g., Kinetics-700) effectively pre-trains a video representation for a video classification task. (ii) We confirm the relationships between #category/#instance and video classification accuracy. The results show that #category should initially be fixed, and then #instance is increased on a video dataset in case of dataset construction. (iii) In order to practically extend a video dataset, we simply concatenate publicly available datasets, such as Kinetics-700 and Moments in Time (MiT) datasets. Compared with Kinetics-700 pre-training, we further enhance spatiotemporal 3D CNNs with the merged dataset, e.g., +0.9, +3.4, and +1.1 on UCF-101, HMDB-51, and ActivityNet datasets, respectively, in terms of fine-tuning. (iv) In terms of recognition architecture, the Kinetics-700 and merged dataset pre-trained models increase the recognition performance to 200 layers with the Residual Network (ResNet), while the Kinetics-400 pre-trained model cannot successfully optimize the 200-layer architecture.

CVMar 27, 2020
Weakly Supervised Dataset Collection for Robust Person Detection

Munetaka Minoguchi, Ken Okayama, Yutaka Satoh et al.

To construct an algorithm that can provide robust person detection, we present a dataset with over 8 million images that was produced in a weakly supervised manner. Through labor-intensive human annotation, the person detection research community has produced relatively small datasets containing on the order of 100,000 images, such as the EuroCity Persons dataset, which includes 240,000 bounding boxes. Therefore, we have collected 8.7 million images of persons based on a two-step collection process, namely person detection with an existing detector and data refinement for false positive suppression. According to the experimental results, the Weakly Supervised Person Dataset (WSPD) is simple yet effective for person detection pre-training. In the context of pre-trained person detection algorithms, our WSPD pre-trained model has 13.38 and 6.38% better accuracy than the same model trained on the fully supervised ImageNet and EuroCity Persons datasets, respectively, when verified with the Caltech Pedestrian.

CVFeb 29, 2020
Augmented Cyclic Consistency Regularization for Unpaired Image-to-Image Translation

Takehiko Ohkawa, Naoto Inoue, Hirokatsu Kataoka et al.

Unpaired image-to-image (I2I) translation has received considerable attention in pattern recognition and computer vision because of recent advancements in generative adversarial networks (GANs). However, due to the lack of explicit supervision, unpaired I2I models often fail to generate realistic images, especially in challenging datasets with different backgrounds and poses. Hence, stabilization is indispensable for GANs and applications of I2I translation. Herein, we propose Augmented Cyclic Consistency Regularization (ACCR), a novel regularization method for unpaired I2I translation. Our main idea is to enforce consistency regularization originating from semi-supervised learning on the discriminators leveraging real, fake, reconstructed, and augmented samples. We regularize the discriminators to output similar predictions when fed pairs of original and perturbed images. We qualitatively clarify why consistency regularization on fake and reconstructed samples works well. Quantitatively, our method outperforms the consistency regularized GAN (CR-GAN) in real-world translations and demonstrates efficacy against several data augmentation variants and cycle-consistent constraints.

CVMay 19, 2019
What Do Adversarially Robust Models Look At?

Takahiro Itazuri, Yoshihiro Fukuhara, Hirokatsu Kataoka et al.

In this paper, we address the open question: "What do adversarially robust models look at?" Recently, it has been reported in many works that there exists the trade-off between standard accuracy and adversarial robustness. According to prior works, this trade-off is rooted in the fact that adversarially robust and standard accurate models might depend on very different sets of features. However, it has not been well studied what kind of difference actually exists. In this paper, we analyze this difference through various experiments visually and quantitatively. Experimental results show that adversarially robust models look at things at a larger scale than standard models and pay less attention to fine textures. Furthermore, although it has been claimed that adversarially robust features are not compatible with standard accuracy, there is even a positive effect by using them as pre-trained models particularly in low resolution datasets.