CV Aug 23, 2023

RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D

arXiv:2308.12035v214.926 citations h-index: 8 Has Code

Originality Synthesis-oriented

AI Analysis

This work addresses the need for agents such as glass devices or autonomous robots to localize objects in real-world scenes, but it is incremental in that it builds on existing datasets and methods.

The authors tackled the problem of grounding textual expressions to objects in first-person videos by constructing RefEgo, a dataset based on Ego4D with over 12k video clips and 41 hours of annotations, and achieved video-wise referred object tracking even in challenging conditions like out-of-frame objects or multiple similar objects.

Grounding textual expressions on scene objects from first-person views is a truly demanding capability in developing agents that are aware of their surroundings and behave following intuitive text instructions. Such a capability is necessary for glass devices or autonomous robots to localize referred objects in the real world. In conventional image-based referring expression comprehension tasks, however, datasets are mostly constructed from web-crawled data and do not reflect the diverse real-world structures involved in grounding textual expressions to diverse objects in the real world. Recently, Ego4D, a massive-scale egocentric video dataset, was proposed. Ego4D covers diverse real-world scenes from around the world, including numerous indoor and outdoor situations such as shopping, cooking, walking, talking, manufacturing, etc. Based on the egocentric videos of Ego4D, we constructed a broad-coverage video-based referring expression comprehension dataset: RefEgo. Our dataset includes more than 12k video clips and 41 hours of video-based referring expression comprehension annotation. In experiments, we combine state-of-the-art 2D referring expression comprehension models with an object tracking algorithm, achieving video-wise referred object tracking even in difficult conditions: when the referred object goes out of frame in the middle of the video, or when multiple similar objects are present in the video. Code is available at https://github.com/shuheikurita/RefEgo
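To make the idea of combining per-frame 2D grounding with tracking concrete, here is a minimal sketch. It is not the authors' implementation. It assumes a hypothetical per-frame model has already produced candidate boxes with confidence scores for the expression, and it links them with a simple IoU gate. The names `iou` and `link_track` and the thresholds `min_score` and `min_iou` are illustrative assumptions.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_track(
    frames: List[List[Tuple[Box, float]]],
    min_score: float = 0.5,  # assumed confidence threshold
    min_iou: float = 0.3,    # assumed overlap gate between consecutive frames
) -> List[Optional[Box]]:
    """Pick one box per frame, or None when the object is judged absent.

    `frames[t]` holds (box, score) candidates that a hypothetical 2D
    grounding model produced for the expression in frame t.
    """
    track: List[Optional[Box]] = []
    for candidates in frames:
        prev = track[-1] if track else None
        best: Optional[Box] = None
        best_score = -1.0
        for box, score in candidates:
            if score < min_score:
                continue  # low confidence: treat as "object not visible"
            if prev is not None and iou(prev, box) < min_iou:
                continue  # far from the previous box: likely a similar object
            if score > best_score:
                best, best_score = box, score
        # None marks an out-of-frame step; the IoU gate is then skipped
        # on the next frame so the object can be re-acquired anywhere.
        track.append(best)
    return track
```

Called on a list of per-frame candidates, `link_track` returns one box or `None` per frame. The IoU gate is one simple way to avoid jumping to a similar-looking object, and emitting `None` is one way to represent frames where the referred object has left the view.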
