Yuji Sato

CV
h-index: 8
3 papers
14 citations
Novelty: 28%
AI Score: 26

3 Papers

CV · Jul 4, 2023
Technical Report for Ego4D Long Term Action Anticipation Challenge 2023

Tatsuya Ishibashi, Kosuke Ono, Noriyuki Kugo et al.

In this report, we describe the technical details of our approach to the Ego4D Long-Term Action Anticipation Challenge 2023. The task is to predict a sequence of future actions that will take place at an arbitrary time or later, given an input video. The baseline model consists of an encoder that generates clip-level features from the video, an aggregator that integrates multiple clip-level features, and a decoder that outputs Z future actions. We introduce three improvements to this baseline: 1) a model ensemble of SlowFast and SlowFast-CLIP; 2) label smoothing to relax order constraints for future actions; and 3) constraining the predicted action class (verb, noun) based on word co-occurrence. Our method outperformed the baseline and placed second on the public leaderboard.
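
The co-occurrence constraint in item 3 could look roughly like the sketch below. The abstract does not say how the constraint is applied, so this is only one plausible reading: it assumes per-class verb and noun scores are available as dictionaries and that a set of (verb, noun) pairs observed in training is used as a hard filter. The function name `constrained_action` and the product scoring rule are hypothetical, not taken from the report.

```python
from itertools import product


def constrained_action(verb_scores, noun_scores, allowed_pairs):
    """Return the best (verb, noun) pair among pairs seen together in training.

    verb_scores / noun_scores: dicts mapping class name -> score (assumed inputs).
    allowed_pairs: set of (verb, noun) tuples that co-occurred in training data.
    Returns None if no candidate pair is allowed.
    """
    best_pair, best_score = None, float("-inf")
    for verb, noun in product(verb_scores, noun_scores):
        if (verb, noun) not in allowed_pairs:
            continue  # skip combinations never observed together
        # Joint score as a simple product (an illustrative assumption).
        score = verb_scores[verb] * noun_scores[noun]
        if score > best_score:
            best_pair, best_score = (verb, noun), score
    return best_pair


verb_scores = {"cut": 0.6, "pour": 0.4}
noun_scores = {"water": 0.7, "onion": 0.3}
allowed = {("cut", "onion"), ("pour", "water")}
print(constrained_action(verb_scores, noun_scores, allowed))
```

Without the filter, the top verb and top noun would combine into ("cut", "water"), a pair that never co-occurs in this toy data; the filter rules such combinations out.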

CV · Jul 4, 2024
VDMA: Video Question Answering with Dynamically Generated Multi-Agents

Noriyuki Kugo, Tatsuya Ishibashi, Kosuke Ono et al.

This technical report provides a detailed description of our approach to the EgoSchema Challenge 2024. The EgoSchema Challenge aims to identify the most appropriate response to a question about a given video clip. We propose Video Question Answering with Dynamically Generated Multi-Agents (VDMA), which complements existing response generation systems by employing a multi-agent system with dynamically generated expert agents, with the goal of providing the most accurate and contextually appropriate responses. This report details the stages of our approach, the tools employed, and the results of our experiments.

CV · Aug 1, 2025
Bidirectional Action Sequence Learning for Long-term Action Anticipation with Large Language Models

Yuji Sato, Yasunori Ishii, Takayoshi Yamashita

Video-based long-term action anticipation is crucial for early risk detection in areas such as automated driving and robotics. Conventional approaches extract features from past actions using encoders and predict future events with decoders, which limits performance due to their unidirectional nature. These methods struggle to capture semantically distinct sub-actions within a scene. The proposed method, BiAnt, addresses this limitation by combining forward prediction with backward prediction using a large language model. Experimental results on Ego4D demonstrate that BiAnt improves performance in terms of edit distance compared to baseline methods.
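
To make the edit-distance metric concrete, here is a minimal sketch that compares a predicted action sequence with a ground-truth one. It uses plain Levenshtein distance normalized by the longer sequence length. This is an assumption for illustration: the exact variant and normalization used in the Ego4D evaluation may differ, and the function names are hypothetical.

```python
def edit_distance(pred, target):
    """Levenshtein distance between two sequences of action labels."""
    prev = list(range(len(target) + 1))  # distances from the empty prefix
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, t in enumerate(target, start=1):
            cost = 0 if p == t else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, target):
    """Edit distance divided by the longer sequence length (0.0 if both are empty)."""
    longest = max(len(pred), len(target))
    return edit_distance(pred, target) / longest if longest else 0.0


pred = ["take", "cut", "put"]
target = ["take", "wash", "cut", "put"]
print(normalized_edit_distance(pred, target))
```

Lower values mean the predicted sequence is closer to the ground truth; in this toy example a single missing action separates the two sequences.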