LG Jan 4, 2023
Learning Decorrelated Representations Efficiently Using Fast Fourier Transform
Yutaro Shigeto, Masashi Shimbo, Yuya Yoshikawa et al.
Barlow Twins and VICReg are self-supervised representation learning models that use regularizers to decorrelate features. Although these models are as effective as conventional representation learning models, their training can be computationally demanding if the dimension d of the projected embeddings is high. As the regularizers are defined in terms of individual elements of a cross-correlation or covariance matrix, computing the loss for n samples takes O(n d^2) time. In this paper, we propose a relaxed decorrelating regularizer that can be computed in O(n d log d) time using the Fast Fourier Transform. We also propose an inexpensive technique to mitigate undesirable local minima that develop with the relaxation. The proposed regularizer exhibits accuracy comparable to that of existing regularizers in downstream tasks, while its training requires less memory and is faster for large d. The source code is available.
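The complexity gap described above can be illustrated with a small NumPy sketch. It is not the authors' regularizer. It only assumes, for illustration, that a relaxed loss needs the sums of the wrapped diagonals of the cross-correlation matrix rather than its individual elements. Those sums equal a circular cross-correlation of the embeddings, which the FFT yields without forming the d x d matrix. The function names and the random inputs are hypothetical.

```python
import numpy as np


def diagonal_sums_naive(a, b):
    """Sum each wrapped diagonal of C = a.T @ b / n; forming C costs O(n d^2)."""
    n, d = a.shape
    c = a.T @ b / n  # explicit d x d cross-correlation matrix
    idx = np.arange(d)
    # Entry j is the sum over i of C[i, (i + j) mod d].
    return np.array([c[idx, (idx + j) % d].sum() for j in range(d)])


def diagonal_sums_fft(a, b):
    """Same sums via per-sample circular cross-correlation, O(n d log d)."""
    n, d = a.shape
    fa = np.fft.rfft(a, axis=1)
    fb = np.fft.rfft(b, axis=1)
    # Correlation theorem for real inputs: F{a corr b} = conj(F{a}) * F{b}.
    # Summing over samples in the frequency domain needs one inverse FFT.
    spectrum = (np.conj(fa) * fb).sum(axis=0)
    return np.fft.irfft(spectrum, n=d) / n


# Hypothetical embeddings for two augmented views: n = 8 samples, d = 16.
rng = np.random.default_rng(0)
a = rng.standard_normal((8, 16))
b = rng.standard_normal((8, 16))
assert np.allclose(diagonal_sums_naive(a, b), diagonal_sums_fft(a, b))
```

The offset-0 entry is the trace of the cross-correlation matrix, and the remaining entries aggregate off-diagonal elements. A loss built on such sums is therefore a relaxation of one defined on individual elements.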
CV Aug 15, 2023
Action Class Relation Detection and Classification Across Multiple Video Datasets
Yuya Yoshikawa, Yutaro Shigeto, Masashi Shimbo et al.
The Meta Video Dataset (MetaVD) provides annotated relations between action classes in major datasets for human action recognition in videos. Although these annotated relations enable dataset augmentation, such augmentation is applicable only to the datasets covered by MetaVD. For an external dataset to enjoy the same benefit, the relations between its action classes and those in MetaVD need to be determined. To address this issue, we consider two new machine learning tasks: action class relation detection and classification. We propose a unified model to predict relations between action classes, using language and visual information associated with classes. Experimental results show that (i) recent pre-trained neural network models for texts and videos contribute to high predictive performance, (ii) relation prediction based on action label texts is more accurate than that based on videos, and (iii) a blending approach that combines predictions from both modalities can further improve the predictive performance in some cases.
CL Mar 10, 2020
Video Caption Dataset for Describing Human Actions in Japanese
Yutaro Shigeto, Yuya Yoshikawa, Jiaqing Lin et al.
In recent years, automatic video caption generation has attracted considerable attention. This paper focuses on the generation of Japanese captions for describing human actions. While most currently available video caption datasets have been constructed for English, there is no equivalent Japanese dataset. To address this, we constructed a large-scale Japanese video caption dataset consisting of 79,822 videos and 399,233 captions. Each caption in our dataset describes a video in the form of "who does what and where." To describe human actions, it is important to identify the details of a person, place, and action. Indeed, when we describe human actions, we usually mention the scene, person, and action. In our experiments, we evaluated two caption generation methods to obtain benchmark results. Further, we investigated whether those generation methods could specify "who does what and where."
CV Apr 12, 2018
STAIR Actions: A Video Dataset of Everyday Home Actions
Yuya Yoshikawa, Jiaqing Lin, Akikazu Takeuchi
A new large-scale video dataset for human action recognition, called STAIR Actions, is introduced. STAIR Actions contains 100 categories of action labels representing fine-grained everyday home actions so that it can be applied to research in various home tasks such as nursing, caring, and security. In STAIR Actions, each video has a single action label. Moreover, for each action category, there are around 1,000 videos that were obtained from YouTube or produced by crowdsourcing workers. Most videos are five to six seconds long. The total number of videos is 102,462. We explain how we constructed STAIR Actions and show the characteristics of STAIR Actions compared to existing datasets for human action recognition. Experiments with three major models for action recognition show that STAIR Actions can train large models and achieve good performance. STAIR Actions can be downloaded from http://actions.stair.center
CL May 2, 2017
STAIR Captions: Constructing a Large-Scale Japanese Image Caption Dataset
Yuya Yoshikawa, Yutaro Shigeto, Akikazu Takeuchi
In recent years, automatic generation of image descriptions (captions), that is, image captioning, has attracted a great deal of attention. In this paper, we particularly consider generating Japanese captions for images. Since most available caption datasets have been constructed for the English language, there are few datasets for Japanese. To tackle this problem, we construct a large-scale Japanese image caption dataset based on images from MS-COCO, which is called STAIR Captions. STAIR Captions consists of 820,310 Japanese captions for 164,062 images. In the experiment, we show that a neural network trained using STAIR Captions can generate more natural and better Japanese captions than those obtained by first generating English captions and then applying English-Japanese machine translation.