CV · LG · NE · Sep 23, 2021

How much human-like visual experience do current self-supervised learning algorithms need in order to achieve human-level object recognition?

arXiv:2109.11523v3 · 5 citations
Originality: Incremental advance
AI Analysis

The results point to a large gap between the data efficiency of current AI visual learning and that of humans, a bottleneck for achieving human-like object recognition.

The paper estimates how much human-like visual experience current self-supervised learning algorithms would need to reach human-level object recognition on ImageNet. Depending on the algorithm, the estimate is on the order of a million to a billion years, several orders of magnitude longer than a human lifetime.

This paper addresses a fundamental question: how good are our current self-supervised visual representation learning algorithms relative to humans? More concretely, how much "human-like" natural visual experience would these algorithms need in order to reach human-level performance in a complex, realistic visual object recognition task such as ImageNet? Using a scaling experiment, here we estimate that the answer is several orders of magnitude longer than a human lifetime: typically on the order of a million to a billion years of natural visual experience (depending on the algorithm used). We obtain even larger estimates for achieving human-level performance in ImageNet-derived robustness benchmarks. The exact values of these estimates are sensitive to some underlying assumptions; however, even in the most optimistic scenarios they remain orders of magnitude larger than a human lifetime. We discuss the main caveats surrounding our estimates and the implications of these surprising results.

Code Implementations: 1 repo