Koichi Ito

CV
h-index: 21
Papers: 15
Citations: 274
Novelty: 49%
AI Score: 52

15 Papers

CV · Apr 20
PlankFormer: Robust Plankton Instance Segmentation via MAE-Pretrained Vision Transformers and Pseudo Community Image Generation

Masaharu Miyazaki, Yurie Otake, Koichi Ito et al.

Plankton monitoring is essential for assessing aquatic ecosystems but is limited by the labor-intensive nature of manual microscopic analysis. Automating the segmentation of plankton from crowded images is crucial; however, it faces two major challenges: (i) the scarcity of pixel-level annotated datasets and (ii) the difficulty of distinguishing plankton from debris and overlapping individuals using conventional CNN-based methods. To address these issues, we propose PlankFormer, a novel framework for plankton instance segmentation. First, to overcome the data shortage, we introduce a method to generate labeled Pseudo Community Images (PCI) by synthesizing individual plankton images onto diverse backgrounds, including those created by generative models. Second, we propose a segmentation model utilizing a Vision Transformer (ViT) backbone with a Mask2Former decoder. To robustly capture the global structural features of plankton against occlusion and debris, we employ a Masked Autoencoder (MAE) for self-supervised pre-training on unlabeled individual images. Experimental results on real-world datasets demonstrate that our method significantly outperforms conventional methods, such as Mask R-CNN, particularly in challenging environments with high debris density. We demonstrate that our synthetic training strategy and MAE-based architecture enable high-precision segmentation while requiring fewer manual annotations for individual plankton images.
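
The abstract does not spell out how Pseudo Community Images are assembled, so the following is only a minimal sketch of the general idea: paste individual crops onto a background and record an instance mask as the label. The function name `paste_instances`, the grayscale list-of-lists image format, and the rule that nonzero crop pixels count as foreground are all assumptions made for illustration, not the authors' method.

```python
import random
from typing import List, Tuple

Image = List[List[int]]  # grayscale image stored as rows of pixel values


def paste_instances(background: Image, crops: List[Image], seed: int = 0) -> Tuple[Image, Image]:
    """Paste each crop at a random position and return (image, instance_mask).

    Assumption of this sketch: nonzero crop pixels are foreground.
    The mask holds 0 for background and k for the k-th pasted crop.
    Every crop must fit inside the background.
    """
    rng = random.Random(seed)
    height, width = len(background), len(background[0])
    image = [row[:] for row in background]          # copy, leave the input intact
    mask = [[0] * width for _ in range(height)]
    for k, crop in enumerate(crops, start=1):
        crop_h, crop_w = len(crop), len(crop[0])
        top = rng.randint(0, height - crop_h)       # randint bounds are inclusive
        left = rng.randint(0, width - crop_w)
        for y in range(crop_h):
            for x in range(crop_w):
                if crop[y][x] != 0:
                    image[top + y][left + x] = crop[y][x]
                    mask[top + y][left + x] = k     # later crops occlude earlier ones
    return image, mask


# Toy example: a flat 6x6 background and two tiny "plankton" crops.
toy_background = [[10] * 6 for _ in range(6)]
toy_crops = [[[200, 200], [0, 200]], [[150]]]
toy_image, toy_mask = paste_instances(toy_background, toy_crops, seed=0)
for row in toy_mask:
    print(row)
```

Because the mask is produced at the same time as the image, every synthetic image comes with pixel-level instance labels at no extra annotation cost, which is the point of training on synthesized data.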

CV · Sep 14, 2024
LabellessFace: Fair Metric Learning for Face Recognition without Attribute Labels

Tetsushi Ohki, Yuya Sato, Masakatsu Nishigaki et al.

Demographic bias is one of the major challenges for face recognition systems. The majority of existing studies on demographic biases are heavily dependent on specific demographic groups or demographic classifiers, making it difficult to address performance for unrecognised groups. This paper introduces "LabellessFace", a novel framework that mitigates demographic bias in face recognition without the demographic group labels typically required for fairness considerations. We propose a novel fairness enhancement metric called the class favoritism level, which assesses the extent of favoritism towards specific classes across the dataset. Leveraging this metric, we introduce the fair class margin penalty, an extension of existing margin-based metric learning. This method dynamically adjusts learning parameters based on class favoritism levels, promoting fairness across all attributes. By treating each class as an individual in facial recognition systems, we facilitate learning that minimizes biases in authentication accuracy among individuals. Comprehensive experiments have demonstrated that our proposed method is effective for enhancing fairness while maintaining authentication accuracy.

CV · Apr 21
Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing

Mika Feng, Pierre Gallin-Martel, Koichi Ito et al.

Face Anti-Spoofing (FAS) remains challenging due to the requirement for robust domain generalization across unseen environments. While recent trends leverage Vision-Language Models (VLMs) for semantic supervision, these multimodal approaches often demand prohibitive computational resources and exhibit high inference latency. Furthermore, their efficacy is inherently limited by the quality of the underlying visual features. This paper revisits the potential of vision-only foundation models to establish a highly efficient and robust baseline for FAS. We conduct a systematic benchmarking of 15 pre-trained models, such as supervised CNNs, supervised ViTs, and self-supervised ViTs, under severe cross-domain scenarios including the MICO and Limited Source Domains (LSD) protocols. Our comprehensive analysis reveals that self-supervised vision models, particularly DINOv2 with Registers, significantly suppress attention artifacts and capture critical, fine-grained spoofing cues. Combined with Face Anti-Spoofing Data Augmentation (FAS-Aug), Patch-wise Data Augmentation (PDA) and Attention-weighted Patch Loss (APL), our proposed vision-only baseline achieves state-of-the-art performance in the MICO protocol. This baseline outperforms existing methods under the data-constrained LSD protocol while maintaining superior computational efficiency. This work provides a definitive vision-only baseline for FAS, demonstrating that optimized self-supervised vision transformers can serve as a backbone for both vision-only and future multimodal FAS systems. The project page is available at: https://gsisaoki.github.io/FAS-VFMbenchmark-CVPRW2026/ .

CV · Dec 24, 2024 · Code
ZenSVI: An Open-Source Software for the Integrated Acquisition, Processing and Analysis of Street View Imagery Towards Scalable Urban Science

Koichi Ito, Yihan Zhu, Mahmoud Abdelrahman et al.

Street view imagery (SVI) has been instrumental in many studies in the past decade to understand and characterize street features and the built environment. Researchers across a variety of domains, such as transportation, health, architecture, human perception, and infrastructure have employed different methods to analyze SVI. However, these applications and image-processing procedures have not been standardized, and solutions have been implemented in isolation, often making it difficult for others to reproduce existing work and carry out new research. Using SVI for research requires multiple technical steps: accessing APIs for scalable data collection, preprocessing images to standardize formats, implementing computer vision models for feature extraction, and conducting spatial analysis. These technical requirements create barriers for researchers in urban studies, particularly those without extensive programming experience. We developed ZenSVI, a free and open-source Python package that integrates and implements the entire process of SVI analysis, supporting a wide range of use cases. Its end-to-end pipeline includes downloading SVI from multiple platforms (e.g., Mapillary and KartaView) efficiently, analyzing metadata of SVI, applying computer vision models to extract target features, transforming SVI into different projections (e.g., fish-eye and perspective) and different formats (e.g., depth map and point cloud), visualizing analyses with maps and plots, and exporting outputs to other software tools. We demonstrated its use in Singapore through a case study of data quality assessment and clustering analysis in a streamlined manner. Our software improves the transparency, reproducibility, and scalability of research relying on SVI and supports researchers in conducting urban analyses efficiently. Its modular design facilitates extensions of the package for new use cases.

CV · Sep 30, 2024
Multibiometrics Using a Single Face Image

Koichi Ito, Taito Tonosaki, Takafumi Aoki et al.

Multibiometrics, which uses multiple biometric traits to improve recognition performance instead of using only one biometric trait to authenticate individuals, has been investigated. Previous studies have combined individually acquired biometric traits or have not fully considered the convenience of the system. Focusing on a single face image, we propose a novel multibiometric method that combines five biometric traits (face, iris, periocular region, nose, and eyebrow) that can be extracted from a single face image. The proposed method does not sacrifice the convenience of biometrics since only a single face image is used as input. Through a variety of experiments using the CASIA Iris Distance database, we demonstrate the effectiveness of the proposed multibiometric method.

CV · Dec 19, 2025
It is not always greener on the other side: Greenery perception across demographics and personalities in multiple cities

Matias Quintana, Fangqi Liu, Jussi Torkko et al.

Quantifying and assessing urban greenery is consequential for planning and development, reflecting the everlasting importance of green spaces for multiple climate and well-being dimensions of cities. Evaluation can be broadly grouped into objective (e.g., measuring the amount of greenery) and subjective (e.g., polling the perception of people) approaches, which may differ: what people see and feel about how green a place is might not match the measurements of the actual amount of vegetation. In this work, we advance the state of the art by measuring such differences and explaining them through human, geographic, and spatial dimensions. The experiments rely on contextual information extracted from street view imagery and a comprehensive urban visual perception survey collected from 1,000 people across five countries with their extensive demographic and personality information. We analyze the discrepancies between objective measures (e.g., Green View Index (GVI)) and subjective scores (e.g., pairwise ratings), examining whether they can be explained by a variety of human and visual factors such as age group and spatial variation of greenery in the scene. The findings reveal that such discrepancies are comparable around the world and that demographics and personality do not play a significant role in perception. Further, while perceived and measured greenery correlate consistently across geographies (both where people and where imagery are from), where people live plays a significant role in explaining perceptual differences, with these two ranking highest among the seven features that influence perceived greenery. This location influence suggests that cultural, environmental, and experiential factors substantially shape how individuals observe greenery in cities.
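
As a rough illustration of the objective measure mentioned above, the Green View Index is commonly computed as the share of image pixels that a semantic segmentation model labels as vegetation. The sketch below assumes the segmentation result is already available as a grid of integer labels; the label id `VEGETATION = 1` and the function name `green_view_index` are hypothetical and not taken from the paper.

```python
from typing import Sequence

# Hypothetical label id for the vegetation class; real segmentation
# models define their own label maps.
VEGETATION = 1


def green_view_index(label_map: Sequence[Sequence[int]],
                     vegetation_label: int = VEGETATION) -> float:
    """Return the fraction of pixels labelled as vegetation (0.0 to 1.0)."""
    total = sum(len(row) for row in label_map)
    if total == 0:
        raise ValueError("label_map must contain at least one pixel")
    green = sum(1 for row in label_map for label in row
                if label == vegetation_label)
    return green / total


# Toy 2x4 segmentation result: 3 of the 8 pixels are vegetation.
toy_labels = [
    [0, 1, 1, 0],
    [0, 0, 1, 2],
]
print(f"GVI = {green_view_index(toy_labels):.3f}")  # prints "GVI = 0.375"
```

A score like this captures only how much vegetation is visible, which is why it can diverge from how green people report a scene to be.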

CV · Oct 20, 2025
Optimizing DINOv2 with Registers for Face Anti-Spoofing

Mika Feng, Pierre Gallin-Martel, Koichi Ito et al.

Face recognition systems are designed to be robust against variations in head pose, illumination, and image blur during capture. However, malicious actors can exploit these systems by presenting a face photo of a registered user, potentially bypassing the authentication process. Such spoofing attacks must be detected prior to face recognition. In this paper, we propose a DINOv2-based spoofing attack detection method to discern minute differences between live and spoofed face images. Specifically, we employ DINOv2 with registers to extract generalizable features and to suppress perturbations in the attention mechanism, which enables focused attention on essential and minute features. We demonstrate the effectiveness of the proposed method through experiments conducted on the dataset provided by "The 6th Face Anti-Spoofing Workshop: Unified Physical-Digital Attacks Detection@ICCV2025" and the SiW dataset.

CV · Sep 3, 2025
Backdoor Poisoning Attack Against Face Spoofing Attack Detection Methods

Shota Iwamatsu, Koichi Ito, Takafumi Aoki

Face recognition systems are robust against environmental changes and noise, and thus may be vulnerable to illegal authentication attempts using user face photos, such as spoofing attacks. To prevent such spoofing attacks, it is crucial to discriminate whether the input image is a live user image or a spoofed image prior to the face recognition process. Most existing spoofing attack detection methods utilize deep learning, which necessitates a substantial amount of training data. Consequently, if malicious data is injected into a portion of the training dataset, a specific spoofing attack may be erroneously classified as live, leading to false positives. In this paper, we propose a novel backdoor poisoning attack method to demonstrate the latent threat of backdoor poisoning within face anti-spoofing detection. The proposed method enables certain spoofing attacks to bypass detection by embedding features extracted from the spoofing attack's face image into a live face image without inducing any perceptible visual alterations. Through experiments conducted on public datasets, we demonstrate that the proposed method constitutes a realistic threat to existing spoofing attack detection systems.

CV · May 30, 2025
Leveraging Intermediate Features of Vision Transformer for Face Anti-Spoofing

Mika Feng, Koichi Ito, Takafumi Aoki et al.

Face recognition systems are designed to be robust against changes in head pose, illumination, and blurring during image capture. If a malicious person presents a face photo of the registered user, they may bypass the authentication process illegally. Such spoofing attacks need to be detected before face recognition. In this paper, we propose a spoofing attack detection method based on Vision Transformer (ViT) to detect minute differences between live and spoofed face images. The proposed method utilizes the intermediate features of ViT, which have a good balance between local and global features that are important for spoofing attack detection, for calculating loss in training and score in inference. The proposed method also introduces two data augmentation methods: face anti-spoofing data augmentation and patch-wise data augmentation, to improve the accuracy of spoofing attack detection. We demonstrate the effectiveness of the proposed method through experiments using the OULU-NPU and SiW datasets. The project page is available at: https://gsisaoki.github.io/FAS-ViT-CVPRW/ .

CV · May 27, 2025
Stereo Radargrammetry Using Deep Learning from Airborne SAR Images

Tatsuya Sasayama, Shintaro Ito, Koichi Ito et al.

In this paper, we propose a stereo radargrammetry method using deep learning from airborne Synthetic Aperture Radar (SAR) images. Deep learning-based methods are considered to suffer less from geometric image modulation, while there is no public SAR image dataset used to train such methods. We create a SAR image dataset and perform fine-tuning of a deep learning-based image correspondence method. The proposed method suppresses the degradation of image quality by pixel interpolation without ground projection of the SAR image and divides the SAR image into patches for processing, which makes it possible to apply deep learning. Through a set of experiments, we demonstrate that the proposed method exhibits a wider range and more accurate elevation measurements compared to conventional methods. The project web page is available at: https://gsisaoki.github.io/IGARSS2025_sasayama/

CV · May 19, 2025
Global urban visual perception varies across demographics and personalities

Matias Quintana, Youlong Gu, Xiucheng Liang et al.

Understanding people's preferences is crucial for urban planning, yet current approaches often combine responses from multi-cultural populations, obscuring demographic differences and risking amplifying biases. We conducted a large-scale urban visual perception survey of streetscapes worldwide using street view imagery, examining how demographics (including gender, age, income, education, race and ethnicity, and personality traits) shape perceptions among 1,000 participants with balanced demographics from five countries and 45 nationalities. This dataset, Street Perception Evaluation Considering Socioeconomics (SPECS), reveals demographic- and personality-based differences across six traditional indicators (safe, lively, wealthy, beautiful, boring, depressing) and four new ones (live nearby, walk, cycle, green). Location-based sentiments further shape these preferences. Machine learning models trained on existing global datasets tend to overestimate positive indicators and underestimate negative ones compared to human responses, underscoring the need for local context. Our study aspires to rectify the myopic treatment of street perception, which rarely considers demographics or personality traits.

CV · Dec 11, 2024
Enhancing Remote Adversarial Patch Attacks on Face Detectors with Tiling and Scaling

Masora Okano, Koichi Ito, Masakatsu Nishigaki et al.

This paper discusses the attack feasibility of Remote Adversarial Patch (RAP) targeting face detectors. The RAP that targets face detectors is similar to the RAP that targets general object detectors, but the former has multiple issues in the attack process that the latter does not. (1) It is possible to detect objects of various scales. In particular, the area of small objects that are convolved during feature extraction by CNN is small, so the area that affects the inference results is also small. (2) It is a two-class classification, so there is a large gap in characteristics between the classes. This makes it difficult to attack the inference results by directing them to a different class. In this paper, we propose a new patch placement method and loss function for each problem. The patches targeting the proposed face detector showed superior detection-obstruction effects compared to the patches targeting the general object detector.

CV · May 13, 2021
Assessing bikeability with street view imagery and computer vision

Koichi Ito, Filip Biljecki

Studies evaluating bikeability usually compute spatial indicators shaping cycling conditions and conflate them in a quantitative index. Much research involves site visits or conventional geospatial approaches, and few studies have leveraged street view imagery (SVI) for conducting virtual audits. These have assessed a limited range of aspects, and not all have been automated using computer vision (CV). Furthermore, studies have not yet zeroed in on gauging the usability of these technologies thoroughly. We investigate, with experiments at a fine spatial scale and across multiple geographies (Singapore and Tokyo), whether we can use SVI and CV to assess bikeability comprehensively. Extending related work, we develop an exhaustive index of bikeability composed of 34 indicators. The results suggest that SVI and CV are adequate to evaluate bikeability in cities comprehensively. As they outperformed non-SVI counterparts by a wide margin, SVI indicators are also found to be superior in assessing urban bikeability, and potentially can be used independently, replacing traditional techniques. However, the paper exposes some limitations, suggesting that the best way forward is combining both SVI and non-SVI approaches. The new bikeability index presents a contribution in transportation and urban analytics, and it is scalable to assess cycling appeal widely.

CV · Aug 27, 2020
Fingerprint Feature Extraction by Combining Texture, Minutiae, and Frequency Spectrum Using Multi-Task CNN

Ai Takahashi, Yoshinori Koda, Koichi Ito et al.

Although most fingerprint matching methods utilize minutia points and/or texture of fingerprint images as fingerprint features, the frequency spectrum is also a useful feature since a fingerprint is composed of ridge patterns with its inherent frequency band. We propose a novel CNN-based method for extracting fingerprint features from texture, minutiae, and frequency spectrum. In order to extract effective texture features from local regions around the minutiae, the minutia attention module is introduced to the proposed method. We also propose new data augmentation methods, which take into account the characteristics of fingerprint images, to increase the number of images during training, since we use only a public dataset in training, which includes a few fingerprint classes. Through a set of experiments using FVC2004 DB1 and DB2, we demonstrated that the proposed method exhibits efficient performance on fingerprint verification compared with commercial fingerprint matching software and the conventional method.

CV · Dec 28, 2019
Silhouette-Net: 3D Hand Pose Estimation from Silhouettes

Kuo-Wei Lee, Shih-Hung Liu, Hwann-Tzong Chen et al.

3D hand pose estimation has received a lot of attention for its wide range of applications and has made great progress owing to the development of deep learning. Existing approaches mainly consider different input modalities and settings, such as monocular RGB, multi-view RGB, depth, or point cloud, to provide sufficient cues for resolving variations caused by self occlusion and viewpoint change. In contrast, this work aims to address the less-explored idea of using minimal information to estimate 3D hand poses. We present a new architecture that automatically learns a guidance from implicit depth perception and solves the ambiguity of hand pose through end-to-end training. The experimental results show that 3D hand poses can be accurately estimated from solely hand silhouettes without using depth maps. Extensive evaluations on the 2017 Hands In the Million Challenge (HIM2017) benchmark dataset further demonstrate that our method achieves comparable or even better performance than recent depth-based approaches and serves as the state-of-the-art of its own kind on estimating 3D hand poses from silhouettes.