Learning Hierarchical Visual Representations in Deep Neural Networks Using Hierarchical Linguistic Labels
This work addresses a limitation of AI visual recognition, with the aim of more human-like learning, though the contribution is incremental because it builds on existing CNN methods.
The authors tackle the bias that single-label training introduces in CNNs, which leaves their representations misaligned with human hierarchical categorization. They train CNNs with multiple hierarchical labels per image so that the networks better match human generalization patterns.
Modern convolutional neural networks (CNNs) are able to achieve human-level object classification accuracy on specific tasks, and currently outperform competing models in explaining complex human visual representations. However, the categorization problem is posed differently for these networks than for humans: the accuracy of these networks is evaluated by their ability to identify single labels assigned to each image. These labels often cut arbitrarily across natural psychological taxonomies (e.g., dogs are separated into breeds, but never jointly categorized as "dogs"), and bias the resulting representations. By contrast, it is common for children to hear both "dog" and "Dalmatian" to describe the same stimulus, helping to group perceptually disparate objects (e.g., breeds) into a common mental class. In this work, we train CNN classifiers with multiple labels for each image that correspond to different levels of abstraction, and use this framework to reproduce classic patterns that appear in human generalization behavior.
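To make the idea of several labels per image concrete, the following is a minimal sketch of one possible way to build such training targets. It is not the authors' implementation. The toy hierarchy, the names `PARENT`, `LABELS`, `ancestors_and_self`, `multi_hot_target`, and `binary_cross_entropy`, and the choice of an independent sigmoid loss per label are all assumptions made for illustration.

```python
import math

# Hypothetical toy hierarchy (child -> parent); not taken from the paper.
PARENT = {
    "dalmatian": "dog",
    "poodle": "dog",
    "dog": "animal",
    "tabby": "cat",
    "cat": "animal",
}

# Fixed label vocabulary covering every level of abstraction.
LABELS = sorted(set(PARENT) | set(PARENT.values()))


def ancestors_and_self(label):
    """Return the label followed by all of its ancestors, most specific first."""
    chain = [label]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def multi_hot_target(label):
    """Encode a fine-grained label as a multi-hot vector over all levels."""
    active = set(ancestors_and_self(label))
    return [1.0 if name in active else 0.0 for name in LABELS]


def binary_cross_entropy(logits, targets):
    """Mean sigmoid cross-entropy, one independent binary decision per label.

    Assumed loss for this sketch; the source does not specify one.
    """
    total = 0.0
    for z, t in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z * t + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)
```

In this sketch an image of a Dalmatian is marked positive for "dalmatian", "dog", and "animal" at once, so a classifier's output layer (one logit per entry of `LABELS`) is penalized for missing any level of the hierarchy rather than only the breed.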