LG | CV | Jun 13, 2025

Visual Pre-Training on Unlabeled Images using Reinforcement Learning

Berkeley
arXiv:2506.11967v1 | h-index: 18 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of learning effective visual features from unlabeled data for computer vision applications. Its approach can use curated images or weakly labeled captions when they are available. The advance is incremental: it combines RL with existing self-supervised methods.

The paper frames visual pre-training on unlabeled images as a reinforcement learning task: a general value function is trained in a dynamical system where an agent transforms an image by changing the view or adding augmentations. The authors report improved representations across diverse datasets, including EpicKitchens, COCO, and CC12M.

In reinforcement learning (RL), value-based algorithms learn to associate each observation with the states and rewards that are likely to be reached from it. We observe that many self-supervised image pre-training methods bear similarity to this formulation: learning features that associate crops of images with those of nearby views, e.g., by taking a different crop or color augmentation. In this paper, we complete this analogy and explore a method that directly casts pre-training on unlabeled image data like web crawls and video frames as an RL problem. We train a general value function in a dynamical system where an agent transforms an image by changing the view or adding image augmentations. Learning in this way resembles crop-consistency self-supervision, but through the reward function, offers a simple lever to shape feature learning using curated images or weakly labeled captions when they exist. Our experiments demonstrate improved representations when training on unlabeled images in the wild, including video data like EpicKitchens, scene data like COCO, and web-crawl data like CC12M.
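To make the value-function framing above concrete, here is a minimal toy sketch. It is not the paper's implementation: it assumes a tabular setting instead of a learned network, treats the "image" as a 4x4 grid of crop positions, and uses a hypothetical reward that marks one corner crop as "curated". The names `crop_actions`, `step`, `td0_values`, and `reward` are illustrative only. The agent shifts its crop at random, and TD(0) learns how much discounted reward is reachable from each crop.

```python
import random


def crop_actions():
    # Hypothetical action set: shift the crop window by one grid cell.
    return [(-1, 0), (1, 0), (0, -1), (0, 1)]


def step(state, action, grid):
    """Move the crop window, clamping it to the image bounds."""
    x, y = state
    dx, dy = action
    nx = min(max(x + dx, 0), grid - 1)
    ny = min(max(y + dy, 0), grid - 1)
    return (nx, ny)


def reward(state):
    # Assumed reward: 1 when the crop lands on a "curated" region
    # (here, arbitrarily, the bottom-right corner), else 0.
    return 1.0 if state == (3, 3) else 0.0


def td0_values(reward_fn, grid=4, gamma=0.9, alpha=0.1, steps=20000, seed=0):
    """Tabular TD(0) under a uniform-random crop-shifting policy."""
    rng = random.Random(seed)
    values = {(x, y): 0.0 for x in range(grid) for y in range(grid)}
    state = (0, 0)
    for _ in range(steps):
        action = rng.choice(crop_actions())
        nxt = step(state, action, grid)
        # Bootstrapped target: reward at the next view plus discounted value.
        target = reward_fn(nxt) + gamma * values[nxt]
        values[state] += alpha * (target - values[state])
        state = nxt
    return values


values = td0_values(reward)
```

In this toy setting, crops near the rewarded corner end up with higher values than distant ones, so the value table encodes which views lie close to one another. Changing `reward` changes what the values emphasize, which loosely mirrors the "lever" role the abstract assigns to the reward function.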

Code Implementations: 1 repo
