Forced Spatial Attention for Driver Foot Activity Classification
This work addresses spatially dependent classification in computer vision, specifically for driver monitoring systems. It is incremental, since it builds on existing loss functions.
The paper tackles image classification tasks in which the output class depends on the spatial location of objects. It proposes a Forced Spatial Attention (FSA) loss that compels the network to attend to specific regions. For driver foot activity classification, this approach significantly improves accuracy, enhances generalization, and increases robustness against noise, while reducing the need for large datasets.
This paper provides a simple solution for reliably solving image classification tasks tied to the spatial locations of salient objects in the scene. Conventional image classification approaches are designed to be invariant to translations of objects in the scene. In contrast, we focus on tasks where the output class varies with where an object of interest is situated within an image. To handle this variant of the image classification task, we propose augmenting the standard cross-entropy (classification) loss with a domain-dependent Forced Spatial Attention (FSA) loss, which in essence compels the network to attend to specific regions in the image associated with the desired output class. To demonstrate the utility of this loss function, we consider the task of driver foot activity classification, where each activity is strongly correlated with where the driver's foot is in the scene. Training with our proposed loss function results in significantly improved accuracies, better generalization, and robustness against noise, while obviating the need for very large datasets.
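To make the idea of "cross-entropy augmented with a spatial attention loss" concrete, here is a minimal sketch under stated assumptions. It is not the paper's formulation. The abstract does not specify the FSA term, so this code swaps in one plausible choice: the fraction of non-negative attention mass that falls outside a per-class region mask. The attention map, the per-class masks (`class_masks`), the weight `lam`, and all function names are hypothetical illustrations. In a real network, the attention map would come from the model, for example a class activation or saliency map, and the loss would be written in an autodiff framework so gradients flow through it. This pure-Python version only shows the arithmetic.

```python
import math
from typing import Dict, List, Sequence

Grid = List[List[float]]  # a 2-D map stored as a list of rows


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Softmax cross-entropy for one example, using the log-sum-exp trick for stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def outside_region_penalty(attention: Grid, mask: Grid, eps: float = 1e-8) -> float:
    """Hypothetical FSA-style term (an assumption, not the paper's definition):
    the fraction of attention mass lying outside the target class's region mask.
    Mask entries are 1 inside the allowed region and 0 outside it."""
    total = 0.0
    outside = 0.0
    for att_row, mask_row in zip(attention, mask):
        for a, m in zip(att_row, mask_row):
            a = max(a, 0.0)  # treat attention values as non-negative mass
            total += a
            outside += a * (1.0 - m)
    return outside / (total + eps)  # eps avoids division by zero for an all-zero map


def combined_loss(
    logits: Sequence[float],
    target: int,
    attention: Grid,
    class_masks: Dict[int, Grid],
    lam: float = 1.0,  # hypothetical weight balancing the two terms
) -> float:
    """Cross-entropy plus a weighted penalty for attending outside the class region."""
    ce = cross_entropy(logits, target)
    fsa = outside_region_penalty(attention, class_masks[target])
    return ce + lam * fsa


if __name__ == "__main__":
    # Toy 2x2 attention map concentrated in the bottom row (hypothetical values).
    attention = [[0.0, 0.0], [1.0, 3.0]]
    # Hypothetical masks: class 0 "expects" attention in the top row, class 1 in the bottom row.
    class_masks = {0: [[1, 1], [0, 0]], 1: [[0, 0], [1, 1]]}
    logits = [0.0, 0.0]
    print(combined_loss(logits, 1, attention, class_masks))  # attention matches class 1: no penalty
    print(combined_loss(logits, 0, attention, class_masks))  # attention outside class 0 region: penalized
```

In this sketch, the penalty is zero when all attention lies inside the target class's region and approaches 1 when it lies entirely outside. The network is therefore pushed toward the spatial cue that defines the class, rather than learning a translation-invariant feature. Normalizing by total attention mass keeps the term on a fixed scale regardless of the map's magnitude. This is a design choice of the sketch, not something the abstract states.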