Learning More by Seeing Less: Structure First Learning for Efficient, Transferable, and Human-Aligned Vision
This work addresses the need for more efficient, transferable, and human-aligned vision systems. The approach is novel in its choice of training modality but incremental in method, since it applies existing techniques to that new modality.
The paper tackles inefficient, poorly generalizing visual recognition by proposing a structure-first learning paradigm that uses line drawings as the initial training modality. The resulting models show a stronger shape bias, greater data efficiency, and lower intrinsic dimensionality across tasks such as classification and detection.
Despite remarkable progress in computer vision, modern recognition systems remain fundamentally limited by their dependence on rich, redundant visual inputs. In contrast, humans can effortlessly understand sparse, minimal representations like line drawings, suggesting that structure, rather than appearance, underlies efficient visual understanding. In this work, we propose a novel structure-first learning paradigm that uses line drawings as an initial training modality to induce more compact and generalizable visual representations. We demonstrate that models trained with this approach develop a stronger shape bias, more focused attention, and greater data efficiency across classification, detection, and segmentation tasks. Notably, these models also exhibit lower intrinsic dimensionality, requiring significantly fewer principal components to capture representational variance, which mirrors observations of low-dimensional, efficient representations in the human brain. Beyond performance improvements, structure-first learning produces more compressible representations, enabling better distillation into lightweight student models. Students distilled from teachers trained on line drawings consistently outperform those trained from color-supervised teachers, highlighting the benefits of structurally compact knowledge. Together, our results support the view that structure-first visual learning fosters efficiency, generalization, and human-aligned inductive biases, offering a simple yet powerful strategy for building more robust and adaptable vision systems.
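The intrinsic-dimensionality claim above can be made concrete with a small sketch. The abstract does not specify how the measurement is carried out, so the following is only one plausible reading: count the principal components needed to reach a chosen fraction of the total variance in a feature matrix. The function name `components_for_variance`, the 0.9 default threshold, and the use of NumPy's SVD are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def components_for_variance(features: np.ndarray, threshold: float = 0.9) -> int:
    """Hypothetical helper: smallest number of principal components whose
    cumulative explained variance reaches `threshold`.

    `features` is an (n_samples, n_dims) array, e.g. penultimate-layer
    activations collected over an evaluation set (an assumption here).
    """
    # Center each feature dimension, as PCA requires.
    centered = features - features.mean(axis=0, keepdims=True)
    # Squared singular values of the centered matrix are proportional
    # to the variance captured by each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    total = variances.sum()
    if total == 0:
        return 0  # constant features carry no variance
    cumulative = np.cumsum(variances) / total
    # First index where the cumulative ratio meets the threshold, as a count.
    count = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against floating-point round-off when threshold is close to 1.
    return min(count, len(variances))
```

Under this reading, a lower count at the same threshold would indicate a more compact representation, so the comparison of interest would be between features from a line-drawing-pretrained model and features from a color-trained one, extracted from the same layer on the same images.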