Takatsugu Hirayama

HC
8 papers
89 citations
Novelty: 34%
AI Score: 33

8 Papers

CV · Jul 18, 2023
MVA2023 Small Object Detection Challenge for Spotting Birds: Dataset, Methods, and Results

Yuki Kondo, Norimichi Ukita, Takayuki Yamaguchi et al.

Small Object Detection (SOD) is an important machine vision topic because (i) a variety of real-world applications require object detection for distant objects and (ii) SOD is a challenging task due to the noisy, blurred, and less-informative image appearances of small objects. This paper proposes a new SOD dataset consisting of 39,070 images including 137,121 bird instances, which is called the Small Object Detection for Spotting Birds (SOD4SB) dataset. The details of the challenge with the SOD4SB dataset are introduced in this paper. In total, 223 participants joined this challenge. This paper briefly introduces the award-winning methods. The dataset, the baseline code, and the website for evaluation on the public test set are publicly available.

MM · Mar 6, 2023
IPA-CLIP: Integrating Phonetic Priors into Vision and Language Pretraining

Chihaya Matsuhira, Marc A. Kastner, Takahiro Komamizu et al.

Recently, large-scale Vision and Language (V&L) pretraining has become the standard backbone of many multimedia systems. While it has shown remarkable performance even in unseen situations, it often performs in ways not intuitive to humans. In particular, such models usually do not consider the pronunciation of the input, which humans would utilize to understand language, especially when it comes to unknown words. Thus, this paper inserts a phonetic prior into Contrastive Language-Image Pretraining (CLIP), one of the V&L pretrained models, to make it consider the pronunciation similarity among its pronunciation inputs. To achieve this, we first propose a phoneme embedding that utilizes the phoneme relationships provided by the International Phonetic Alphabet (IPA) chart as a phonetic prior. Next, by distilling the frozen CLIP text encoder, we train a pronunciation encoder employing the IPA-based embedding. The proposed model, named IPA-CLIP, comprises this pronunciation encoder and the original CLIP encoders (image and text). Quantitative evaluation reveals that the phoneme distribution on the embedding space represents phonetic relationships more accurately when using the proposed phoneme embedding. Furthermore, in some multimodal retrieval tasks, we confirm that the proposed pronunciation encoder enhances the performance of the text encoder and that the pronunciation encoder handles nonsense words in a more phonetic manner than the text encoder. Finally, qualitative evaluation verifies the correlation between the pronunciation encoder and human perception regarding pronunciation similarity.

CV · Nov 27, 2025
Small Object Detection for Birds with Swin Transformer

Da Huo, Marc A. Kastner, Tingwei Liu et al.

Object detection is the task of detecting objects in an image. In this task, the detection of small objects is particularly difficult. Beyond the small size, it is also made harder by blur, occlusion, and so on. Current small object detection methods are tailored to small and dense situations, such as pedestrians in a crowd or distant objects in remote sensing scenarios. However, when the target object is small and sparse, there is a lack of objects available for training, making it more difficult to learn effective features. In this paper, we propose a specialized method for detecting a specific category of small objects: birds. In particular, we improve the features learned by the neck, the sub-network between the backbone and the prediction head, to learn more effective features with a hierarchical design. We employ Swin Transformer to upsample the image features. Moreover, we change the shifted window size to adapt to small objects. Experiments show that the proposed Swin Transformer-based neck combined with CenterNet can lead to good performance by changing the window sizes. We further find that smaller window sizes (default 2) benefit mAPs for small object detection.
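As a rough illustration of what "window size" means here, the sketch below splits a feature map into non-overlapping square windows, the unit within which window-based attention is computed. It is a minimal pure-Python sketch under stated assumptions: the function name `partition_windows` and the toy 4x4 grid are hypothetical, and it covers only the partition step, not the shifted windows, the attention itself, or the authors' neck design.

```python
def partition_windows(feature_map, window_size):
    """Split a 2D grid (list of rows) into non-overlapping square windows."""
    height = len(feature_map)
    width = len(feature_map[0])
    if height % window_size or width % window_size:
        raise ValueError("grid size must be divisible by window_size")
    windows = []
    # Walk the grid in steps of window_size, row-major over windows.
    for top in range(0, height, window_size):
        for left in range(0, width, window_size):
            window = [row[left:left + window_size]
                      for row in feature_map[top:top + window_size]]
            windows.append(window)
    return windows


# Hypothetical 4x4 single-channel feature map; values are just cell indices.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
windows = partition_windows(grid, window_size=2)
```

With a window size of 2, each window covers only four cells, so attention stays local to a very small neighborhood, which is the regime the abstract reports as beneficial for small objects.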

CV · May 9, 2021
Interaction Detection Between Vehicles and Vulnerable Road Users: A Deep Generative Approach with Attention

Hao Cheng, Li Feng, Hailong Liu et al.

Intersections where vehicles are permitted to turn and interact with vulnerable road users (VRUs) like pedestrians and cyclists are among the most challenging locations for automated and accurate recognition of road users' behavior. In this paper, we propose a deep conditional generative model for interaction detection at such locations. It aims to automatically analyze massive video data about the continuity of road users' behavior. This task is essential for many intelligent transportation systems such as traffic safety control and self-driving cars that depend on the understanding of road users' locomotion. A Conditional Variational Auto-Encoder based model with Gaussian latent variables is trained to encode road users' behavior and perform probabilistic and diverse predictions of interactions. The model takes as input the information of road users' type, position and motion automatically extracted by a deep learning object detector and optical flow from videos, and generates frame-wise probabilities that represent the dynamics of interactions between a turning vehicle and any VRUs involved. The model's efficacy was validated by testing on real-world datasets acquired from two different intersections. It achieved an F1-score above 0.96 at a right-turn intersection in Germany and 0.89 at a left-turn intersection in Japan, both with very busy traffic flows.
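To make the reported metric concrete, the following sketch shows how an F1-score can be computed from frame-wise interaction probabilities. It is an illustration under stated assumptions, not the authors' evaluation code: the 0.5 decision threshold, the function name `f1_from_probabilities`, and the toy inputs are all hypothetical.

```python
def f1_from_probabilities(probabilities, labels, threshold=0.5):
    """F1-score for binary frame labels (1 = interaction, 0 = none).

    The threshold is an assumption; the abstract does not state one.
    """
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical probabilities for five frames and their ground-truth labels.
toy_probabilities = [0.9, 0.4, 0.2, 0.7, 0.8]
toy_labels = [1, 1, 0, 0, 1]
toy_f1 = f1_from_probabilities(toy_probabilities, toy_labels)
```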

HC · Feb 16, 2021
Importance of Instruction for Pedestrian-Automated Driving Vehicle Interaction with an External Human Machine Interface: Effects on Pedestrians' Situation Awareness, Trust, Perceived Risks and Decision Making

Hailong Liu, Takatsugu Hirayama, Masaya Watanabe

Compared to a manual driving vehicle (MV), an automated driving vehicle (AV) lacks a way to communicate with the pedestrian through the driver when it interacts with the pedestrian because the driver usually does not participate in driving tasks. Thus, an external human machine interface (eHMI) can be viewed as a novel explicit communication method for providing the driving intentions of an AV to pedestrians when they need to negotiate in an interaction, e.g., an encountering scene. However, the eHMI may not guarantee that the pedestrians will fully recognize the intention of the AV. In this paper, we propose that instruction on the eHMI's rationale can help pedestrians correctly understand the driving intentions and predict the behavior of the AV, and thus their subjective feelings (i.e., feeling of danger, trust in the AV, and feeling of relief) and decision-making are also improved. The results of an interaction experiment in a road-crossing scene indicate that it was more difficult for the participants to be aware of the situation when they encountered an AV without an eHMI than when they encountered an MV; further, the participants' subjective feelings and hesitation in decision-making also deteriorated significantly. When the eHMI was used in the AV, the situational awareness, subjective feelings and decision-making of the participants regarding the AV with an eHMI were improved. After the instruction, it was easier for the participants to understand the driving intention and predict the driving behavior of the AV with an eHMI. Further, the subjective feelings and the hesitation related to decision-making were improved and reached the same level as for the MV.

HC · Mar 2, 2020
What Timing for an Automated Vehicle to Make Pedestrians Understand Its Driving Intentions for Improving Their Perception of Safety?

Hailong Liu, Takatsugu Hirayama, Luis Yoichi Morales et al.

Although automated driving systems have been used frequently, they are still unpopular in society. To increase the popularity of automated vehicles (AVs), assisting pedestrians to accurately understand the driving intentions and improving their perception of safety when interacting with AVs are considered effective. Therefore, the AV should send information about its driving intention to pedestrians when they interact with each other. However, the following questions should be answered regarding how the AV sends the information to them: 1) What timing for an AV to make pedestrians understand its driving intentions after being noticed by them? 2) What timing for an AV to make pedestrians feel safe after being noticed by them? Thirteen participants were invited to interact with a manually driven vehicle and an AV in an experiment. The participants' gaze information and a subjective evaluation of their understanding of the driving intention as well as their perception of safety were collected. By analyzing the participants' gaze duration on the vehicle with their subjective evaluations, we found that the AV should enable the pedestrian to accurately understand its driving intention within 0.5 to 6.5 s and make the pedestrian feel safe within 0.5 to 8.0 s while the pedestrian is gazing at it.

HC · Jan 6, 2020
What Is the Gaze Behavior of Pedestrians in Interactions with an Automated Vehicle When They Do Not Understand Its Intentions?

Hailong Liu, Takatsugu Hirayama, Luis Yoichi Morales et al.

Interactions between pedestrians and automated vehicles (AVs) will increase significantly with the popularity of AVs. However, pedestrians often do not have enough trust in AVs, particularly when they are confused about an AV's intention in an interaction. This study seeks to evaluate whether pedestrians clearly understand the driving intentions of AVs in interactions and presents experimental research on the relationship between the gaze behaviors of pedestrians and their understanding of the intentions of the AV. The hypothesis investigated in this study was that the less the pedestrian understands the driving intentions of the AV, the longer the duration of their gazing behavior will be. A pedestrian-vehicle interaction experiment was designed to verify the proposed hypothesis. A robotic wheelchair was used as the manual driving vehicle (MV) and AV for interacting with pedestrians while pedestrians' gaze data and their subjective evaluation of the driving intentions were recorded. The experimental results supported our hypothesis as there was a negative correlation between the pedestrians' gaze duration on the AV and their understanding of the driving intentions of the AV. Moreover, the gaze duration of most of the pedestrians on the MV was shorter than that on the AV. Therefore, we conclude with two recommendations to designers of external human-machine interfaces (eHMI): (1) when a pedestrian is engaged in an interaction with an AV, the driving intentions of the AV should be provided; (2) if the pedestrian still gazes at the AV after the AV displays its driving intentions, the AV should provide clearer information about its driving intentions.
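The negative correlation described above can be illustrated with a small sketch. The abstract does not say which correlation coefficient was used, so Pearson's r is an assumption here, and the function name `pearson_r` and the toy values are hypothetical placeholders, not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical toy values, NOT study data: understanding scores per trial
# and the corresponding gaze durations in seconds.
toy_understanding = [1, 2, 3, 4]
toy_gaze_duration = [8, 6, 4, 2]
toy_r = pearson_r(toy_understanding, toy_gaze_duration)
```

A coefficient below zero means that longer gaze durations go together with lower understanding scores, which is the direction the hypothesis predicts.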

HC · May 13, 2019
Saliency difference based objective evaluation method for a superimposed screen of the HUD with various background

Hailong Liu, Toshihiro Hiraoka, Takatsugu Hirayama et al.

The head-up display (HUD) is an emerging device which can project information on a transparent screen. The HUD has been used in airplanes and vehicles, and it is usually placed in front of the operator's view. In the case of the vehicle, the driver can see not only various information on the HUD but also the background (driving environment) through the HUD. However, the projected information on the HUD may interfere with the colors in the background because the HUD is transparent. For example, a red message on the HUD will be less noticeable when it overlaps with the red brake light of the vehicle in front. As the first step to solve this issue, it is important to determine how to evaluate the mutual interference between the information on the HUD and the background. Therefore, this paper proposes a method to evaluate the mutual interference based on saliency. The interference is evaluated by comparing the HUD part cut from a saliency map of a measured image with the HUD image.
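The comparison step in the last sentence can be sketched as follows. The abstract does not specify the comparison metric, so the mean absolute difference used here is an assumption, as are the function names, the HUD region coordinates, and the toy values; computing the saliency map itself is out of scope for this sketch.

```python
def crop(image, top, left, height, width):
    """Cut a rectangular region out of a 2D grid (list of rows)."""
    return [row[left:left + width] for row in image[top:top + height]]


def mean_abs_difference(a, b):
    """Mean absolute per-pixel difference between two same-sized grids."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for value_a, value_b in zip(row_a, row_b):
            total += abs(value_a - value_b)
            count += 1
    return total / count


# Hypothetical saliency map of a measured scene, values in [0, 1].
toy_saliency_map = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.75, 0.5, 0.0],
    [0.0, 0.25, 0.0, 0.0],
]
# Hypothetical HUD image as intended: 1.0 where information is displayed.
toy_hud_image = [
    [1.0, 1.0],
    [0.0, 0.0],
]
# Assumed HUD location inside the measured image.
hud_part = crop(toy_saliency_map, top=1, left=1, height=2, width=2)
interference = mean_abs_difference(hud_part, toy_hud_image)
```

Under this assumed metric, a larger value means the HUD region of the saliency map departs further from the intended HUD image, suggesting stronger interference from the background.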