CVMay 4, 2022
Attention-based Knowledge Distillation in Multi-attention Tasks: The Impact of a DCT-driven LossAlejandro López-Cifuentes, Marcos Escudero-Viñolo, Jesús Bescós et al.
Knowledge Distillation (KD) is a strategy for the definition of a set of transferability gangways to improve the efficiency of Convolutional Neural Networks. Feature-based Knowledge Distillation is a subfield of KD that relies on intermediate network representations, either unaltered or depth-reduced via maximum activation maps, as the source knowledge. In this paper, we propose and analyse the use of a 2D frequency transform of the activation maps before transferring them. We pose that\textemdash by using global image cues rather than pixel estimates, this strategy enhances knowledge transferability in tasks such as scene recognition, defined by strong spatial and contextual relationships between multiple and varied concepts. To validate the proposed method, an extensive evaluation of the state-of-the-art in scene recognition is presented. Experimental results provide strong evidences that the proposed strategy enables the student network to better focus on the relevant image areas learnt by the teacher network, hence leading to better descriptive features and higher transferred performance than every other state-of-the-art alternative. We publicly release the training and evaluation framework used along this paper at http://www-vpu.eps.uam.es/publications/DCTBasedKDForSceneRecognition.
CVSep 5, 2019Code
Semantic-Aware Scene RecognitionAlejandro López-Cifuentes, Marcos Escudero-Viñolo, Jesús Bescós et al.
Scene recognition is currently one of the top-challenging research fields in computer vision. This may be due to the ambiguity between classes: images of several scene classes may share similar objects, which causes confusion among them. The problem is aggravated when images of a particular scene class are notably different. Convolutional Neural Networks (CNNs) have significantly boosted performance in scene recognition, albeit it is still far below from other recognition tasks (e.g., object or image recognition). In this paper, we describe a novel approach for scene recognition based on an end-to-end multi-modal CNN that combines image and context information by means of an attention module. Context information, in the shape of semantic segmentation, is used to gate features extracted from the RGB image by leveraging on information encoded in the semantic representation: the set of scene objects and stuff, and their relative locations. This gating process reinforces the learning of indicative scene content and enhances scene disambiguation by refocusing the receptive fields of the CNN towards them. Experimental results on four publicly available datasets show that the proposed approach outperforms every other state-of-the-art method while significantly reducing the number of network parameters. All the code and data used along this paper is available at https://github.com/vpulab/Semantic-Aware-Scene-Recognition
CVAug 26, 2020
A Prospective Study on Sequence-Driven Temporal Sampling and Ego-Motion Compensation for Action Recognition in the EPIC-Kitchens DatasetAlejandro López-Cifuentes, Marcos Escudero-Viñolo, Jesús Bescós
Action recognition is currently one of the top-challenging research fields in computer vision. Convolutional Neural Networks (CNNs) have significantly boosted its performance but rely on fixed-size spatio-temporal windows of analysis, reducing CNNs temporal receptive fields. Among action recognition datasets, egocentric recorded sequences have become of important relevance while entailing an additional challenge: ego-motion is unavoidably transferred to these sequences. The proposed method aims to cope with it by estimating this ego-motion or camera motion. The estimation is used to temporally partition video sequences into motion-compensated temporal \textit{chunks} showing the action under stable backgrounds and allowing for a content-driven temporal sampling. A CNN trained in an end-to-end fashion is used to extract temporal features from each \textit{chunk}, which are late fused. This process leads to the extraction of features from the whole temporal range of an action, increasing the temporal receptive field of the network.
CVDec 27, 2018
Semantic Driven Multi-Camera Pedestrian DetectionAlejandro López-Cifuentes, Marcos Escudero-Viñolo, Jesús Bescós et al.
In the current worldwide situation, pedestrian detection has reemerged as a pivotal tool for intelligent video-based systems aiming to solve tasks such as pedestrian tracking, social distancing monitoring or pedestrian mass counting. Pedestrian detection methods, even the top performing ones, are highly sensitive to occlusions among pedestrians, which dramatically degrades their performance in crowded scenarios. The generalization of multi-camera set-ups permits to better confront occlusions by combining information from different viewpoints. In this paper, we present a multi-camera approach to globally combine pedestrian detections leveraging automatically extracted scene context. Contrarily to the majority of the methods of the state-of-the-art, the proposed approach is scene-agnostic, not requiring a tailored adaptation to the target scenario\textemdash e.g., via fine-tunning. This noteworthy attribute does not require \textit{ad hoc} training with labelled data, expediting the deployment of the proposed method in real-world situations. Context information, obtained via semantic segmentation, is used 1) to automatically generate a common Area of Interest for the scene and all the cameras, avoiding the usual need of manually defining it; and 2) to obtain detections for each camera by solving a global optimization problem that maximizes coherence of detections both in each 2D image and in the 3D scene. This process yields tightly-fitted bounding boxes that circumvent occlusions or miss-detections. Experimental results on five publicly available datasets show that the proposed approach outperforms state-of-the-art multi-camera pedestrian detectors, even some specifically trained on the target scenario, signifying the versatility and robustness of the proposed method without requiring ad-hoc annotations nor human-guided configuration.