CV · May 14

Characterizing the visual representation of objects from the child's view

Jane Yang, Tarun Sepuri, Alvin Wei Ming Tan, Khai Loong Aw, Michael C. Frank, Bria Long

arXiv:2605.14990 · 64.3

Predicted impact: top 46% in CV · last 90 days
Originality: Incremental advance

AI Analysis

For researchers studying visual category learning in children, this work provides a detailed characterization of naturalistic visual input, revealing challenges and structure that models must address.

The authors analyzed 868 hours of first-person video from 31 children (ages 5 to 36 months) to characterize the visual input available for object category learning. They found that category exposure is highly skewed and that exemplars are highly variable (unusual angles, clutter, occlusion, depictions), yet the detected categories show stronger superordinate groupings than canonical photographs of the same categories.

Children acquire object category representations from their everyday experiences in the first few years of life. What do the inputs to this learning process look like? We analyzed first-person videos of young children's visual experience at home from the BabyView dataset ($N$ = 31 participants, 868 hours, ages 5--36 months), using a supervised object detection model to extract common object categories from more than 3 million frames. We found that children's object category exposure was highly skewed: a few categories (e.g., cups, chairs) dominated children's visual experiences while most categories appeared rarely, replicating previous findings from a more restricted set of contexts. Category exemplars were highly variable: children encountered objects from unusual angles, in highly cluttered scenes, and partially occluded views; many categories (especially animals) were most frequently viewed as depictions. Surprisingly, despite this variability, detected categories (e.g., giraffes, apples) showed stronger groupings within superordinate categories (e.g., animals, food) relative to groupings derived from canonical photographs of these categories. We found this same pattern when using high-dimensional embeddings from both self-supervised visual and multimodal models; this effect was also recapitulated in densely sampled data from individual children. Understanding the robustness and efficiency of visual category learning will require the development of models that can exploit strong superordinate structure and learn from non-canonical, sparse, and variable exemplars.
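The two quantitative ideas in the abstract, skewed category exposure and superordinate grouping in embedding space, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `top_k_share` and `grouping_score`, the toy counts, and the two-dimensional toy embeddings are all hypothetical, and the grouping measure used here (mean within-group cosine similarity minus mean between-group cosine similarity) is one simple choice assumed for illustration.

```python
import math
from itertools import combinations


def top_k_share(counts, k):
    """Fraction of all detections accounted for by the k most frequent categories."""
    ordered = sorted(counts.values(), reverse=True)
    return sum(ordered[:k]) / sum(ordered)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def grouping_score(embeddings, superordinate):
    """Mean within-group similarity minus mean between-group similarity.

    embeddings: category name -> embedding vector (e.g., a per-category mean)
    superordinate: category name -> superordinate label
    A larger value means categories cluster more tightly by superordinate label.
    """
    within, between = [], []
    for a, b in combinations(sorted(embeddings), 2):
        sim = cosine(embeddings[a], embeddings[b])
        if superordinate[a] == superordinate[b]:
            within.append(sim)
        else:
            between.append(sim)
    return sum(within) / len(within) - sum(between) / len(between)


# Hypothetical toy data, not values from the paper.
toy_counts = {"cup": 50, "chair": 30, "apple": 10, "giraffe": 5, "banana": 5}
toy_embeddings = {
    "giraffe": [1.0, 0.1],
    "dog": [0.9, 0.2],
    "apple": [0.1, 1.0],
    "banana": [0.2, 0.9],
}
toy_superordinate = {
    "giraffe": "animals",
    "dog": "animals",
    "apple": "food",
    "banana": "food",
}

share = top_k_share(toy_counts, 2)
score = grouping_score(toy_embeddings, toy_superordinate)
print(f"top-2 share: {share:.2f}, grouping score: {score:.3f}")
```

Under these assumptions, comparing `grouping_score` on embeddings of detected crops against the same score on embeddings of canonical photographs would be one way to express the contrast the abstract describes.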

