CV AI CL LG RO Jun 25, 2024

ET tu, CLIP? Addressing Common Object Errors for Unseen Environments

Ye Won Byun, Cathy Jiao, Shahriar Noroozizadeh, Jimin Sun, Rosa Vitiello

arXiv:2406.17876v12.0

Originality: Incremental advance

AI Analysis

This paper addresses generalization issues in embodied AI tasks such as ALFRED, though the contribution appears incremental because it builds on the existing Episodic Transformer architecture.

The paper tackles the problem of poor model generalization in unseen environments for the ALFRED task by using pre-trained CLIP encoders as an auxiliary module for object detection, resulting in improved task performance on the unseen validation set.

We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task. In contrast to previous literature where CLIP replaces the visual encoder, we suggest using CLIP as an additional module through an auxiliary object detection objective. We validate our method on the recently proposed Episodic Transformer architecture and demonstrate that incorporating CLIP improves task performance on the unseen validation set. Additionally, our analysis results support that CLIP especially helps with leveraging object descriptions, detecting small objects, and interpreting rare words.
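The auxiliary-objective idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the loss form, so the cosine-similarity scoring, the temperature, the weighting factor `aux_weight`, and every function name below are assumptions made for illustration. The vectors are toy stand-ins, not real CLIP embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one example, computed stably via log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def auxiliary_object_loss(predicted_embedding, class_embeddings, target_index,
                          temperature=0.07):
    """Hypothetical auxiliary loss: score the model's predicted object
    embedding against one frozen embedding per candidate object class
    (assumed here to come from a pre-trained text encoder), then apply
    cross-entropy against the ground-truth class."""
    logits = [cosine(predicted_embedding, c) / temperature
              for c in class_embeddings]
    return cross_entropy(logits, target_index)


def total_loss(action_loss, aux_loss, aux_weight=0.5):
    """Main task loss plus a weighted auxiliary term (weight is an assumption)."""
    return action_loss + aux_weight * aux_loss


# Toy example: two candidate object classes with made-up 3-d embeddings.
class_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
predicted = [0.9, 0.1, 0.0]  # closer to class 0

aux_correct = auxiliary_object_loss(predicted, class_embeddings, target_index=0)
aux_wrong = auxiliary_object_loss(predicted, class_embeddings, target_index=1)
combined = total_loss(action_loss=1.0, aux_loss=aux_correct)
```

In this sketch the auxiliary term only adds a training signal alongside the main action loss; the existing visual encoder is left in place, which matches the abstract's contrast with approaches where CLIP replaces the visual encoder.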
