Self-supervised learning through the eyes of a child
This work addresses a question of fundamental interest to researchers in cognitive development, but its contribution is incremental because it builds on existing self-supervised methods and an existing dataset.
The paper asks how much of children's early visual knowledge can be learned from sensory data through generic mechanisms, and how much requires innate biases. It approaches this question by applying self-supervised deep learning to longitudinal egocentric videos recorded from the perspective of three young children. The results show that powerful, high-level visual representations emerge from these developmentally realistic videos.
Within months of birth, children develop meaningful expectations about the world around them. How much of this early knowledge can be explained through generic learning mechanisms applied to sensory data, and how much of it requires more substantive innate inductive biases? Addressing this fundamental question in its full generality is currently infeasible, but we can hope to make real progress in more narrowly defined domains, such as the development of high-level visual categories, thanks to improvements in data collecting technology and recent progress in deep learning. In this paper, our goal is precisely to achieve such progress by utilizing modern self-supervised deep learning methods and a recent longitudinal, egocentric video dataset recorded from the perspective of three young children (Sullivan et al., 2020). Our results demonstrate the emergence of powerful, high-level visual representations from developmentally realistic natural videos using generic self-supervised learning objectives.