Video-based Contrastive Learning on Decision Trees: from Action Recognition to Autism Diagnosis
This work addresses scalable action recognition for applications such as medical diagnosis, though it appears incremental because it combines existing methods such as contrastive learning and decision trees.
The paper proposes a contrastive learning framework that recasts multi-class action recognition as a series of binary tasks on a decision tree, using skeleton graphs and an interaction adjacent matrix. It reports promising results in applications such as autism diagnosis on the CalTech database.
How can we teach a computer to recognize 10,000 different actions? Deep learning has evolved from supervised and unsupervised to self-supervised approaches. In this paper, we present a new contrastive learning-based framework for decision tree-based classification of actions, including human-human interactions (HHI) and human-object interactions (HOI). The key idea is to translate the original multi-class action recognition problem into a series of binary classification tasks on a pre-constructed decision tree. Within this contrastive learning framework, we design an interaction adjacent matrix (IAM), with skeleton graphs as the backbone, to model action-related attributes such as periodicity and symmetry. By constructing various pretext tasks, we obtain a series of binary classification nodes on the decision tree that can be combined to support higher-level recognition tasks. Experiments ranging from interaction recognition to symmetry detection support the potential of our approach in real-world applications. In particular, we demonstrate promising performance on video-based autism spectrum disorder (ASD) diagnosis using the CalTech interview video database.
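The key idea, replacing one multi-class decision with a chain of binary decisions on a pre-constructed tree, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the node names, the attribute keys (`interaction`, `human_partner`, `periodic`), the leaf labels, and the 0.5 threshold are hypothetical, and each thresholded score merely stands in for a binary classifier that the paper would obtain through a contrastive pretext task.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

# Hypothetical per-clip attribute scores in [0, 1]; in the paper these
# decisions would come from learned binary classifiers, not hand-set scores.
Features = Dict[str, float]


@dataclass
class Leaf:
    label: str  # final action class


@dataclass
class Node:
    name: str                          # human-readable name of the binary task
    test: Callable[[Features], bool]   # stand-in for a learned binary classifier
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]


def above(key: str, threshold: float = 0.5) -> Callable[[Features], bool]:
    """Build a toy binary test: is the named score above the threshold?"""
    return lambda feats: feats.get(key, 0.0) > threshold


def classify(tree: Union[Node, Leaf], feats: Features) -> Tuple[str, List[Tuple[str, bool]]]:
    """Walk from the root to a leaf, recording each binary decision taken."""
    path: List[Tuple[str, bool]] = []
    node = tree
    while isinstance(node, Node):
        outcome = node.test(feats)
        path.append((node.name, outcome))
        node = node.if_true if outcome else node.if_false
    return node.label, path


def build_example_tree() -> Node:
    """A four-class toy tree (illustrative labels only)."""
    interaction_branch = Node(
        "partner_is_human", above("human_partner"),
        if_true=Leaf("handshake"),   # human-human interaction (HHI)
        if_false=Leaf("drinking"),   # human-object interaction (HOI)
    )
    solo_branch = Node(
        "is_periodic", above("periodic"),
        if_true=Leaf("walking"),
        if_false=Leaf("sitting down"),
    )
    return Node("has_interaction", above("interaction"), interaction_branch, solo_branch)


if __name__ == "__main__":
    clip = {"interaction": 0.9, "human_partner": 0.8, "periodic": 0.1}
    label, path = classify(build_example_tree(), clip)
    print(label, path)
```

In this toy tree each clip is resolved with two binary decisions instead of one four-way decision. If a tree over 10,000 actions were balanced, a clip would pass through about 14 binary nodes, since 2^14 = 16,384 is the smallest power of two that covers 10,000 leaves.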