CV | Oct 30, 2022 | Code
Combining Attention Module and Pixel Shuffle for License Plate Super-Resolution
Valfride Nascimento, Rayson Laroca, Jorge de A. Lambert et al.
The License Plate Recognition (LPR) field has made impressive advances in the last decade due to novel deep learning approaches combined with the increased availability of training data. However, it still has some open issues, especially when the data come from low-resolution (LR) and low-quality images/videos, as in surveillance systems. This work focuses on license plate (LP) reconstruction in LR and low-quality images. We present a Single-Image Super-Resolution (SISR) approach that extends the attention/transformer module concept by exploiting the capabilities of PixelShuffle layers and that has an improved loss function based on LPR predictions. For training the proposed architecture, we use synthetic images generated by applying heavy Gaussian noise in terms of Structural Similarity Index Measure (SSIM) to the original high-resolution (HR) images. In our experiments, the proposed method outperformed the baselines both quantitatively and qualitatively. The datasets we created for this work are publicly available to the research community at https://github.com/valfride/lpr-rsr/
CV | Aug 27, 2024 | Code
Enhancing License Plate Super-Resolution: A Layout-Aware and Character-Driven Approach
Valfride Nascimento, Rayson Laroca, Rafael O. Ribeiro et al.
Despite significant advancements in License Plate Recognition (LPR) through deep learning, most improvements rely on high-resolution images with clear characters. This scenario does not reflect real-world conditions where traffic surveillance often captures low-resolution and blurry images. Under these conditions, characters tend to blend with the background or neighboring characters, making accurate LPR challenging. To address this issue, we introduce a novel loss function, Layout and Character Oriented Focal Loss (LCOFL), which considers factors such as resolution, texture, and structural details, as well as the performance of the LPR task itself. We enhance character feature learning using deformable convolutions and shared weights in an attention module and employ a GAN-based training approach with an Optical Character Recognition (OCR) model as the discriminator to guide the super-resolution process. Our experimental results show significant improvements in character reconstruction quality, outperforming two state-of-the-art methods in both quantitative and qualitative measures. Our code is publicly available at https://github.com/valfride/lpsr-lacd
CV | Nov 1, 2023
Open-Set Face Recognition with Maximal Entropy and Objectosphere Loss
Rafael Henrique Vareto, Yu Linghu, Terrance E. Boult et al.
Open-set face recognition characterizes a scenario where unknown individuals, unseen during the training and enrollment stages, appear at operation time. This work concentrates on watchlists, an open-set task that is expected to operate at a low False Positive Identification Rate and generally includes only a few enrollment samples per identity. We introduce a compact adapter network that benefits from additional negative face images when combined with distinct cost functions, such as Objectosphere Loss (OS) and the proposed Maximal Entropy Loss (MEL). MEL modifies the traditional Cross-Entropy loss in favor of increasing the entropy for negative samples and attaches a penalty to known target classes in pursuance of gallery specialization. The proposed approach adopts pre-trained deep neural networks (DNNs) for face recognition as feature extractors. Then, the adapter network takes deep feature representations and acts as a substitute for the output layer of the pre-trained DNN in exchange for an agile domain adaptation. Promising results have been achieved following open-set protocols for three different datasets: LFW, IJB-C, and UCCS as well as state-of-the-art performance when supplementary negative data is properly selected to fine-tune the adapter network.
LG | Nov 3, 2023
The Potential of Wearable Sensors for Assessing Patient Acuity in Intensive Care Unit (ICU)
Jessica Sena, Mohammad Tahsin Mostafiz, Jiaqing Zhang et al.
Acuity assessments are vital in critical care settings to provide timely interventions and fair resource allocation. Traditional acuity scores rely on manual assessments and documentation of physiological states, which can be time-consuming, intermittent, and difficult to use for healthcare providers. Furthermore, such scores do not incorporate granular information such as patients' mobility level, which can indicate recovery or deterioration in the ICU. We hypothesized that existing acuity scores could be potentially improved by employing Artificial Intelligence (AI) techniques in conjunction with Electronic Health Records (EHR) and wearable sensor data. In this study, we evaluated the impact of integrating mobility data collected from wrist-worn accelerometers with clinical data obtained from EHR for developing an AI-driven acuity assessment score. Accelerometry data were collected from 86 patients wearing accelerometers on their wrists in an academic hospital setting. The data were analyzed using five deep neural network models: VGG, ResNet, MobileNet, SqueezeNet, and a custom Transformer network. These models outperformed a rule-based clinical score (SOFA, the Sequential Organ Failure Assessment) used as a baseline, particularly regarding the precision, sensitivity, and F1 score. The results showed that while a model relying solely on accelerometer data achieved limited performance (AUC 0.50, Precision 0.61, and F1-score 0.68), including demographic information with the accelerometer data led to a notable enhancement in performance (AUC 0.69, Precision 0.75, and F1-score 0.67). This work shows that the combination of mobility and patient information can successfully differentiate between stable and unstable states in critically ill patients.
CV | Aug 23, 2023
Open-set Face Recognition with Neural Ensemble, Maximal Entropy Loss and Feature Augmentation
Rafael Henrique Vareto, Manuel Günther, William Robson Schwartz
Open-set face recognition refers to a scenario in which biometric systems have incomplete knowledge of all existing subjects. Therefore, they are expected to prevent face samples of unregistered subjects from being identified as previously enrolled identities. This watchlist context adds an arduous requirement that calls for the dismissal of irrelevant faces by focusing mainly on subjects of interest. As a response, this work introduces a novel method that associates an ensemble of compact neural networks with a margin-based cost function that explores additional samples. Supplementary negative samples can be obtained from external databases or synthetically built at the representation level in training time with a new mix-up feature augmentation approach. Deep neural networks pre-trained on large face datasets serve as the preliminary feature extraction module. We carry out experiments on well-known LFW and IJB-C datasets where results show that the approach is able to boost closed and open-set identification rates.
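As a rough illustration of building synthetic negatives at the representation level, the sketch below applies the standard mixup interpolation to two feature vectors. The abstract describes its augmentation as new, so the paper's variant may differ; the function name, the `alpha` default, and the toy vectors are all assumptions made for this example.

```python
import random


def mixup_features(feat_a, feat_b, alpha=0.4, rng=random):
    """Blend two feature vectors with a Beta-distributed weight (standard mixup).

    Hypothetical helper: the name and the alpha default are illustrative only.
    Returns the mixed vector and the weight that was drawn.
    """
    if len(feat_a) != len(feat_b):
        raise ValueError("feature vectors must have the same length")
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(feat_a, feat_b)]
    return mixed, lam


# Toy deep features from two different identities (made-up values).
identity_a = [0.9, 0.1, 0.0]
identity_b = [0.0, 0.2, 0.8]
negative, weight = mixup_features(identity_a, identity_b, rng=random.Random(0))
print(weight, negative)
```

Because the blended vector matches neither enrolled identity, it can plausibly be treated as a negative sample during training.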
CV | Aug 14, 2023
Open-set Face Recognition using Ensembles trained on Clustered Data
Rafael Henrique Vareto, William Robson Schwartz
Open-set face recognition describes a scenario where unknown subjects, unseen during the training stage, appear at test time. Not only does it require methods that accurately identify individuals of interest, but it also demands approaches that effectively deal with unfamiliar faces. This work details a scalable open-set face identification approach to galleries composed of hundreds and thousands of subjects. It is composed of clustering and an ensemble of binary learning algorithms that estimates when query face samples belong to the face gallery and then retrieves their correct identity. The approach selects the most suitable gallery subjects and uses the ensemble to improve prediction performance. We carry out experiments on well-known LFW and YTF benchmarks. Results show that competitive performance can be achieved even when targeting scalability.
6.4 | CV | May 15
Attention-Aware Transformer-Based Aggregation Network for Video Periocular Recognition
Luiz G F Carreira, Breno A Mariano, Victor H C de Melo et al.
Video periocular recognition is the task of recognizing an individual's identity based on the region around an individual's eyes. The periocular area is one of the most discriminative regions of the human face, making it suitable for recognition tasks. Its use as a biometric modality has emerged as an alternative, especially in surveillance scenarios where conventional biometric traits such as face or iris recognition become unfeasible due to unconstrained acquisition conditions. This paper proposes an attention-aware approach for video-based periocular recognition in surveillance environments. The framework consists of two main modules: feature embedding and aggregation. The feature embedding module is a deep convolutional neural network that maps periocular data to feature vectors. The aggregation module is an encoder-only transformer that adaptively learns to aggregate frame-level features into a single video representation and a feature vector for the still reference image. Experiments on the publicly available COX Face dataset show the robustness of the proposed method, consistently outperforming naive aggregation schemes. In the best scenario, the approach achieves $99.8\%$ of TPR@$1e^{-1}$ and $96.6\%$ of Rank-5.
CV | Sep 11, 2024
Watchlist Challenge: 3rd Open-set Face Detection and Identification
Furkan Kasım, Terrance E. Boult, Rensso Mora et al.
In the current landscape of biometrics and surveillance, the ability to accurately recognize faces in uncontrolled settings is paramount. The Watchlist Challenge addresses this critical need by focusing on face detection and open-set identification in real-world surveillance scenarios. This paper presents a comprehensive evaluation of participating algorithms, using the enhanced UnConstrained College Students (UCCS) dataset with new evaluation protocols. In total, four participants submitted four face detection and nine open-set face recognition systems. The evaluation demonstrates that while detection capabilities are generally robust, closed-set identification performance varies significantly, with models pre-trained on large-scale datasets showing superior performance. However, open-set scenarios require further improvement, especially at higher true positive identification rates, i.e., lower thresholds.
CV | Oct 23, 2025 | Code
VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models
Jesimon Barreto, Carlos Caetano, André Araujo et al.
Foundation models have advanced computer vision by enabling strong performance across diverse tasks through large-scale pretraining and supervised fine-tuning. However, they may underperform in domains with distribution shifts and scarce labels, where supervised fine-tuning may be infeasible. While continued self-supervised learning for model adaptation is common for generative language models, this strategy has not proven effective for vision-centric encoder models. To address this challenge, we introduce a novel formulation of self-supervised fine-tuning for vision foundation models, where the model is adapted to a new domain without requiring annotations, leveraging only short multi-view object-centric videos. Our method is referred to as VESSA: Video-based objEct-centric Self-Supervised Adaptation for visual foundation models. VESSA's training technique is based on a self-distillation paradigm, where it is critical to carefully tune prediction heads and deploy parameter-efficient adaptation techniques - otherwise, the model may quickly forget its pretrained knowledge and reach a degraded state. VESSA benefits significantly from multi-view object observations sourced from different frames in an object-centric video, efficiently learning robustness to varied capture conditions, without the need of annotations. Through comprehensive experiments with 3 vision foundation models on 2 datasets, VESSA demonstrates consistent improvements in downstream classification tasks, compared to the base models and previous adaptation methods. Code is publicly available at https://github.com/jesimonbarreto/VESSA.
CV | May 27, 2023 | Code
Super-Resolution of License Plate Images Using Attention Modules and Sub-Pixel Convolution Layers
Valfride Nascimento, Rayson Laroca, Jorge de A. Lambert et al.
Recent years have seen significant developments in the field of License Plate Recognition (LPR) through the integration of deep learning techniques and the increasing availability of training data. Nevertheless, reconstructing license plates (LPs) from low-resolution (LR) surveillance footage remains challenging. To address this issue, we introduce a Single-Image Super-Resolution (SISR) approach that integrates attention and transformer modules to enhance the detection of structural and textural features in LR images. Our approach incorporates sub-pixel convolution layers (also known as PixelShuffle) and a loss function that uses an Optical Character Recognition (OCR) model for feature extraction. We trained the proposed architecture on synthetic images created by applying heavy Gaussian noise to high-resolution LP images from two public datasets, followed by bicubic downsampling. As a result, the generated images have a Structural Similarity Index Measure (SSIM) of less than 0.10. Our results show that our approach for reconstructing these low-resolution synthesized images outperforms existing ones in both quantitative and qualitative measures. Our code is publicly available at https://github.com/valfride/lpr-rsr-ext/
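The sub-pixel (PixelShuffle) rearrangement mentioned above can be sketched in plain Python. This covers only the channel-to-space reshuffle in its commonly used layout, operating on nested lists, not the authors' network; the function name and the toy input are assumptions.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Uses the usual sub-pixel layout: output pixel (c, h*r + i, w*r + j)
    is read from input channel c*r*r + i*r + j at position (h, w).
    """
    channels_in = len(x)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    height, width = len(x[0]), len(x[0][0])
    out = [[[0] * (width * r) for _ in range(height * r)] for _ in range(c_out)]
    for c in range(c_out):
        for h in range(height):
            for w in range(width):
                for i in range(r):
                    for j in range(r):
                        src = c * r * r + i * r + j
                        out[c][h * r + i][w * r + j] = x[src][h][w]
    return out


# Four 1x1 feature maps become a single 2x2 map (upscale factor 2).
low_res = [[[1]], [[2]], [[3]], [[4]]]
print(pixel_shuffle(low_res, 2))  # [[[1, 2], [3, 4]]]
```

The layer itself has no learnable parameters: the preceding convolution produces the extra channels, and the shuffle only moves them into spatial positions.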
CV | Oct 17, 2018 | Code
Pruning Deep Neural Networks using Partial Least Squares
Artur Jordao, Ricardo Kloss, Fernando Yamada et al.
Modern pattern recognition methods are based on convolutional networks since they are able to learn complex patterns that benefit the classification. However, convolutional networks are computationally expensive and require a considerable amount of memory, which limits their deployment on low-power and resource-constrained systems. To handle these problems, recent approaches have proposed pruning strategies that find and remove unimportant neurons (i.e., filters) in these networks. Despite achieving remarkable results, existing pruning approaches are ineffective since the accuracy of the original network is degraded. In this work, we propose a novel approach to efficiently remove filters from convolutional networks. Our approach estimates the filter importance based on its relationship with the class label on a low-dimensional space. This relationship is computed using Partial Least Squares (PLS) and Variable Importance in Projection (VIP). Our method is able to reduce up to 67% of the floating point operations (FLOPs) without penalizing the network accuracy. With a negligible drop in accuracy, we can reduce up to 90% of FLOPs. Additionally, sometimes the method is even able to improve the accuracy compared to the original, unpruned network. We show that employing PLS+VIP as the criterion for detecting the filters to be removed is better than recent feature selection techniques, which have been employed by state-of-the-art pruning methods. Finally, we show that the proposed method achieves the highest FLOPs reduction and the smallest drop in accuracy when compared to state-of-the-art pruning approaches. Codes are available at: https://github.com/arturjordao/PruningNeuralNetworks
CV | May 9, 2025
Toward Advancing License Plate Super-Resolution in Real-World Scenarios: A Dataset and Benchmark
Valfride Nascimento, Gabriel E. Lima, Rafael O. Ribeiro et al.
Recent advancements in super-resolution for License Plate Recognition (LPR) have sought to address challenges posed by low-resolution (LR) and degraded images in surveillance, traffic monitoring, and forensic applications. However, existing studies have relied on private datasets and simplistic degradation models. To address this gap, we introduce UFPR-SR-Plates, a novel dataset containing 10,000 tracks with 100,000 paired low and high-resolution license plate images captured under real-world conditions. We establish a benchmark using multiple sequential LR and high-resolution (HR) images per vehicle -- five of each -- and two state-of-the-art models for super-resolution of license plates. We also investigate three fusion strategies to evaluate how combining predictions from a leading Optical Character Recognition (OCR) model for multiple super-resolved license plates enhances overall performance. Our findings demonstrate that super-resolution significantly boosts LPR performance, with further improvements observed when applying majority vote-based fusion techniques. Specifically, the Layout-Aware and Character-Driven Network (LCDNet) model combined with the Majority Vote by Character Position (MVCP) strategy led to the highest recognition rates, increasing from 1.7% with low-resolution images to 31.1% with super-resolution, and up to 44.7% when combining OCR outputs from five super-resolved images. These findings underscore the critical role of super-resolution and temporal information in enhancing LPR accuracy under real-world, adverse conditions. The proposed dataset is publicly available to support further research and can be accessed at: https://valfride.github.io/nascimento2024toward/
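A minimal sketch of voting by character position follows. It is one plausible reading of the MVCP strategy named above, not the authors' implementation; it assumes all OCR outputs have the same length, and the function name and sample strings are made up for illustration.

```python
from collections import Counter


def majority_vote_by_position(predictions):
    """Fuse equal-length OCR strings by taking the most frequent character
    at each position. Hypothetical helper; on a tie, whichever character
    Counter.most_common returns first is kept.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    voted = []
    for chars in zip(*predictions):  # characters sharing one position
        voted.append(Counter(chars).most_common(1)[0][0])
    return "".join(voted)


# Five made-up OCR readings of the same plate, four of them with one error.
readings = ["ABC1234", "A8C1234", "ABC1Z34", "ABO1234", "ABC12B4"]
print(majority_vote_by_position(readings))  # ABC1234
```

Since the errors fall at different positions in different readings, the per-position vote recovers a string that most individual readings got wrong.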
CV | May 14, 2021
Face Attributes as Cues for Deep Face Recognition Understanding
Matheus Alves Diniz, William Robson Schwartz
Deeply learned representations are the state-of-the-art descriptors for face recognition methods. These representations encode latent features that are difficult to explain, compromising the confidence and interpretability of their predictions. Most attempts to explain deep features are visualization techniques that are often open to interpretation. Instead of relying only on visualizations, we use the outputs of hidden layers to predict face attributes. The obtained performance is an indicator of how well the attribute is implicitly learned in that layer of the network. Using a variable selection technique, we also analyze how these semantic concepts are distributed inside each layer, establishing the precise location of relevant neurons for each attribute. According to our experiments, gender, eyeglasses and hat usage can be predicted with over 96% accuracy even when only a single neural output is used to predict each attribute. These performances are less than 3 percentage points lower than the ones achieved by deep supervised face attribute networks. In summary, our experiments show that, inside DCNNs optimized for face identification, there exist latent neurons encoding face attributes almost as accurately as DCNNs optimized for these attributes.
CV | Apr 23, 2020
Stage-Wise Neural Architecture Search
Artur Jordao, Fernando Akio, Maiko Lie et al.
Modern convolutional networks such as ResNet and NASNet have achieved state-of-the-art results in many computer vision applications. These architectures consist of stages, which are sets of layers that operate on representations in the same resolution. It has been demonstrated that increasing the number of layers in each stage improves the prediction ability of the network. However, the resulting architecture becomes computationally expensive in terms of floating point operations, memory requirements and inference time. Thus, significant human effort is necessary to evaluate different trade-offs between depth and performance. To handle this problem, recent works have proposed to automatically design high-performance architectures, mainly by means of neural architecture search (NAS). Current NAS strategies analyze a large set of possible candidate architectures and, hence, require vast computational resources and take many GPU days. Motivated by this, we propose a NAS approach to efficiently design accurate and low-cost convolutional architectures and demonstrate that an efficient strategy for designing these architectures is to learn the depth stage-by-stage. For this purpose, our approach increases depth incrementally in each stage taking into account its importance, such that stages with low importance are kept shallow while stages with high importance become deeper. We conduct experiments on the CIFAR and different versions of ImageNet datasets, where we show that architectures discovered by our approach achieve better accuracy and efficiency than human-designed architectures. Additionally, we show that architectures discovered on CIFAR-10 can be successfully transferred to large datasets. Compared to previous NAS approaches, our method is substantially more efficient, as it evaluates one order of magnitude fewer models and yields architectures on par with the state-of-the-art.
CV | Oct 21, 2019
The SWAX Benchmark: Attacking Biometric Systems with Wax Figures
Rafael Henrique Vareto, Araceli Marcia Sandanha, William Robson Schwartz
A face spoofing attack occurs when an intruder attempts to impersonate someone who carries a gainful authentication clearance. It is a trending topic due to the increasing demand for biometric authentication on mobile devices, high-security areas, among others. This work introduces a new database named Sense Wax Attack dataset (SWAX), comprised of real human and wax figure images and videos that endorse the problem of face spoofing detection. The dataset consists of more than 1800 face images and 110 videos of 55 people/waxworks, arranged in training, validation and test sets with a large range in expression, illumination and pose variations. Experiments performed with baseline methods show that despite the progress in recent years, advanced spoofing methods are still vulnerable to high-quality violation attempts.
CV | Oct 5, 2019
Covariance-free Partial Least Squares: An Incremental Dimensionality Reduction Method
Artur Jordao, Maiko Lie, Victor Hugo Cunha de Melo et al.
Dimensionality reduction plays an important role in computer vision problems since it reduces computational cost and is often capable of yielding more discriminative data representation. In this context, Partial Least Squares (PLS) has presented notable results in tasks such as image classification and neural network optimization. However, PLS is infeasible on large datasets, such as ImageNet, because it requires all the data to be in memory in advance, which is often impractical due to hardware limitations. Additionally, this requirement prevents us from employing PLS on streaming applications where the data are being continuously generated. Motivated by this, we propose a novel incremental PLS, named Covariance-free Incremental Partial Least Squares (CIPLS), which learns a low-dimensional representation of the data using a single sample at a time. In contrast to other state-of-the-art approaches, instead of adopting a partially-discriminative or SGD-based model, we extend Nonlinear Iterative Partial Least Squares (NIPALS) -- the standard algorithm used to compute PLS -- for incremental processing. Among the advantages of this approach are the preservation of discriminative information across all components, the possibility of employing its score matrices for feature selection, and its computational efficiency. We validate CIPLS on face verification and image classification tasks, where it outperforms several other incremental dimensionality reduction techniques. In the context of feature selection, CIPLS achieves comparable results when compared to state-of-the-art techniques.
CV | Sep 11, 2019
Skeleton Image Representation for 3D Action Recognition based on Tree Structure and Reference Joints
Carlos Caetano, François Brémond, William Robson Schwartz
In recent years, the computer vision research community has studied how to model temporal dynamics in videos for 3D human action recognition. To that end, two main baseline approaches have been researched: (i) Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM); and (ii) skeleton image representations used as input to a Convolutional Neural Network (CNN). Although RNN approaches present excellent results, such methods lack the ability to efficiently learn the spatial relations between the skeleton joints. On the other hand, the representations used to feed CNN approaches present the advantage of having the natural ability of learning structural information from 2D arrays (i.e., they learn spatial relations from the skeleton joints). To further improve such representations, we introduce the Tree Structure Reference Joints Image (TSRJI), a novel skeleton image representation to be used as input to CNNs. The proposed representation has the advantage of combining the use of reference joints and a tree structure skeleton. While the former incorporates different spatial relationships between the joints, the latter preserves important spatial relations by traversing a skeleton tree with a depth-first order algorithm. Experimental results demonstrate the effectiveness of the proposed representation for 3D action recognition on two datasets, achieving state-of-the-art results on the recent NTU RGB+D 120 dataset.
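To make the depth-first ordering concrete, the sketch below walks a toy skeleton tree in pre-order so that joints connected in the body stay close together in the resulting sequence. The joint names, the tree, and the choice of a plain pre-order walk without revisiting joints are assumptions; the exact TSRJI traversal may differ.

```python
def depth_first_joint_order(tree, root):
    """Return joints in depth-first (pre-order) sequence.

    `tree` maps each joint to a list of its child joints; joints missing
    from the mapping are treated as leaves.
    """
    order = []

    def visit(joint):
        order.append(joint)
        for child in tree.get(joint, []):
            visit(child)

    visit(root)
    return order


# Hypothetical 11-joint skeleton used only for illustration.
skeleton = {
    "torso": ["neck", "left_hip", "right_hip"],
    "neck": ["head", "left_shoulder", "right_shoulder"],
    "left_shoulder": ["left_hand"],
    "right_shoulder": ["right_hand"],
    "left_hip": ["left_foot"],
    "right_hip": ["right_foot"],
}
print(depth_first_joint_order(skeleton, "torso"))
```

Each joint in this order could then index one row (or column) of the skeleton image, with the other axis running over frames.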
CV | Sep 4, 2019
An Efficient and Layout-Independent Automatic License Plate Recognition System Based on the YOLO detector
Rayson Laroca, Luiz A. Zanlorensi, Gabriel R. Gonçalves et al.
This paper presents an efficient and layout-independent Automatic License Plate Recognition (ALPR) system based on the state-of-the-art YOLO object detector that contains a unified approach for license plate (LP) detection and layout classification to improve the recognition results using post-processing rules. The system is conceived by evaluating and optimizing different models, aiming at achieving the best speed/accuracy trade-off at each stage. The networks are trained using images from several datasets, with the addition of various data augmentation techniques, so that they are robust under different conditions. The proposed system achieved an average end-to-end recognition rate of 96.9% across eight public datasets (from five different regions) used in the experiments, outperforming both previous works and commercial systems in the ChineseLP, OpenALPR-EU, SSIG-SegPlate and UFPR-ALPR datasets. In the other datasets, the proposed approach achieved competitive results to those attained by the baselines. Our system also achieved impressive frames per second (FPS) rates on a high-end GPU, being able to perform in real time even when there are four vehicles in the scene. An additional contribution is that we manually labeled 38,351 bounding boxes on 6,239 images from public datasets and made the annotations publicly available to the research community.
CV | Jul 30, 2019
SkeleMotion: A New Representation of Skeleton Joint Sequences Based on Motion Information for 3D Action Recognition
Carlos Caetano, Jessica Sena, François Brémond et al.
Due to the availability of large-scale skeleton datasets, 3D human action recognition has recently attracted the attention of the computer vision community. Many works have focused on encoding skeleton data as skeleton image representations based on the spatial structure of the skeleton joints, in which the temporal dynamics of the sequence are encoded as variations in columns and the spatial structure of each frame is represented as rows of a matrix. To further improve such representations, we introduce a novel skeleton image representation to be used as input of Convolutional Neural Networks (CNNs), named SkeleMotion. The proposed approach encodes the temporal dynamics by explicitly computing the magnitude and orientation values of the skeleton joints. Different temporal scales are employed to compute motion values to aggregate more temporal dynamics to the representation, making it able to capture long-range joint interactions involved in actions as well as to filter noisy motion values. Experimental results demonstrate the effectiveness of the proposed representation on 3D action recognition, outperforming the state-of-the-art on the NTU RGB+D 120 dataset.
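The magnitude and orientation computation can be sketched as follows: for each joint, take its displacement between frame t and frame t + scale, then derive a magnitude and per-plane angles. The angle definition (one `atan2` per coordinate plane), the function name, and the toy data are assumptions, not the paper's exact formulation.

```python
import math


def joint_motion(frames, scale=1):
    """Compute per-joint motion between frames t and t + scale.

    `frames` is a list of frames, each a list of (x, y, z) joint positions.
    Returns (magnitudes, orientations), both indexed as [t][joint]; each
    orientation is a tuple of angles in the xy, yz and zx planes.
    """
    magnitudes, orientations = [], []
    for t in range(len(frames) - scale):
        mag_row, ori_row = [], []
        for (x0, y0, z0), (x1, y1, z1) in zip(frames[t], frames[t + scale]):
            dx, dy, dz = x1 - x0, y1 - y0, z1 - z0
            mag_row.append(math.sqrt(dx * dx + dy * dy + dz * dz))
            ori_row.append((math.atan2(dy, dx),
                            math.atan2(dz, dy),
                            math.atan2(dx, dz)))
        magnitudes.append(mag_row)
        orientations.append(ori_row)
    return magnitudes, orientations


# One joint tracked over three frames (made-up coordinates).
sequence = [[(0, 0, 0)], [(3, 4, 0)], [(3, 4, 12)]]
short_mag, _ = joint_motion(sequence, scale=1)
long_mag, _ = joint_motion(sequence, scale=2)
print(short_mag)  # [[5.0], [12.0]]
print(long_mag)   # [[13.0]]
```

Running the same function with several `scale` values gives the multiple temporal scales the abstract refers to: larger scales smooth out frame-to-frame jitter and expose slower motion.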
CV | Feb 25, 2019
Convolutional Neural Networks for Automatic Meter Reading
Rayson Laroca, Victor Barroso, Matheus A. Diniz et al.
In this paper, we tackle Automatic Meter Reading (AMR) by leveraging the high capability of Convolutional Neural Networks (CNNs). We design a two-stage approach that employs the Fast-YOLO object detector for counter detection and evaluates three different CNN-based approaches for counter recognition. In the AMR literature, most datasets are not available to the research community since the images belong to a service company. In this sense, we introduce a new public dataset, called UFPR-AMR dataset, with 2,000 fully and manually annotated images. This dataset is, to the best of our knowledge, three times larger than the largest public dataset found in the literature and contains a well-defined evaluation protocol to assist the development and evaluation of AMR methods. Furthermore, we propose the use of a data augmentation technique to generate a balanced training set with many more examples to train the CNN models for counter recognition. In the proposed dataset, impressive results were obtained and a detailed speed/accuracy trade-off evaluation of each model was performed. In a public dataset, state-of-the-art results were achieved using less than 200 images for training.
CV | Jun 13, 2018
Human Activity Recognition Based on Wearable Sensor Data: A Standardization of the State-of-the-Art
Artur Jordao, Antonio C. Nazare, Jessica Sena et al.
Human activity recognition based on wearable sensor data has been an attractive research topic due to its application in areas such as healthcare and smart environments. In this context, many works have presented remarkable results using accelerometer, gyroscope and magnetometer data to represent the activity categories. However, current studies do not consider important issues that lead to skewed results, making it hard to assess the quality of sensor-based human activity recognition and preventing a direct comparison of previous works. These issues include the sample generation processes and the validation protocols used. We emphasize that in other research areas, such as image classification and object detection, these issues are already well-defined, which brings more efforts towards the application. Inspired by this, we conduct an extensive set of experiments that analyze different sample generation processes and validation protocols to indicate the vulnerable points in human activity recognition based on wearable sensor data. For this purpose, we implement and evaluate several top-performance methods, ranging from handcrafted-based approaches to convolutional neural networks. According to our study, most of the experimental evaluations that are currently employed are not adequate to perform the activity recognition in the context of wearable sensor data, in which the recognition accuracy drops considerably when compared to an appropriate evaluation approach. To the best of our knowledge, this is the first study that tackles essential issues that compromise the understanding of the performance in human activity recognition based on wearable sensor data.
CV | Jun 8, 2018
A Content-Based Late Fusion Approach Applied to Pedestrian Detection
Jessica Sena, Artur Jordao, William Robson Schwartz
The variety of pedestrian detectors proposed in recent years has encouraged some works to fuse pedestrian detectors to achieve a more accurate detection. The underlying intuition is to combine the detectors based on their spatial consensus. We propose a novel method called Content-Based Spatial Consensus (CSBC), which, in addition to relying on spatial consensus, considers the content of the detection windows to learn a weighted-fusion of pedestrian detectors. The result is a reduction in false alarms and an enhancement in the detection. In this work, we also demonstrate that the feature used to learn the contents of the windows of each detector has little influence, which enables our method to be efficient even when employing simple features. The CSBC outperforms state-of-the-art fusion methods on the ETH dataset and on the Caltech dataset. In particular, our method is more efficient since fewer detectors are necessary to achieve significant results.
CVFeb 26, 2018
A Robust Real-Time Automatic License Plate Recognition Based on the YOLO DetectorRayson Laroca, Evair Severo, Luiz A. Zanlorensi et al.
Automatic License Plate Recognition (ALPR) has been a frequent topic of research due to many practical applications. However, many of the current solutions are still not robust in real-world situations, commonly depending on many constraints. This paper presents a robust and efficient ALPR system based on the state-of-the-art YOLO object detector. The Convolutional Neural Networks (CNNs) are trained and fine-tuned for each ALPR stage so that they are robust under different conditions (e.g., variations in camera, lighting, and background). Especially for character segmentation and recognition, we design a two-stage approach employing simple data augmentation tricks such as inverted License Plates (LPs) and flipped characters. The resulting ALPR approach achieved impressive results on two datasets. First, on the SSIG dataset, composed of 2,000 frames from 101 vehicle videos, our system achieved a recognition rate of 93.53% and 47 Frames Per Second (FPS), performing better than both Sighthound and OpenALPR commercial systems (89.80% and 93.03%, respectively) and considerably outperforming previous results (81.80%). Second, targeting a more realistic scenario, we introduce a larger public dataset, called the UFPR-ALPR dataset, designed for ALPR. This dataset contains 150 videos and 4,500 frames captured when both camera and vehicles are moving and also contains different types of vehicles (cars, motorcycles, buses and trucks). On our proposed dataset, the trial versions of commercial systems achieved recognition rates below 70%. On the other hand, our system performed better, with a recognition rate of 78.33% and 35 FPS.
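The two augmentation tricks mentioned in this abstract can be sketched in a few lines. The abstract does not define them precisely, so this sketch assumes that "inverted" means the color negative of the plate image and "flipped" means a horizontal mirror of a character crop. The function names and the tiny arrays are hypothetical.

```python
import numpy as np

def invert_plate(plate):
    """Return the negative of an 8-bit grayscale plate image (assumed meaning of 'inverted')."""
    return 255 - plate

def flip_character(char_img):
    """Mirror a character crop left to right (assumed meaning of 'flipped')."""
    return np.fliplr(char_img)

# Hypothetical 2x3 grayscale crop standing in for a plate or character image.
plate = np.array([[0, 50, 255], [10, 200, 30]], dtype=np.uint8)
negative = invert_plate(plate)
mirrored = flip_character(plate)
```

A mirrored crop keeps its label only for glyphs that look the same after the flip, so a real pipeline would presumably restrict this augmentation to such characters.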
CVNov 7, 2017
Latent hypernet: Exploring all Layers from Convolutional Neural NetworksArtur Jordao, Ricardo Kloss, William Robson Schwartz
Since Convolutional Neural Networks (ConvNets) are able to simultaneously learn features and classifiers to discriminate different categories of activities, recent works have employed ConvNet approaches to perform human activity recognition (HAR) based on wearable sensors, allowing the removal of expensive human work and expert knowledge. However, these approaches have their power of discrimination limited mainly by the large number of parameters that compose the network and the reduced number of samples available for training. Inspired by this, we propose an accurate and robust approach, referred to as Latent HyperNet (LHN). The LHN uses feature maps from early layers (hyper) and projects them, individually, onto a low-dimensional space (latent). Then, these latent features are concatenated and presented to a classifier. To demonstrate the robustness and accuracy of the LHN, we evaluate it using four different network architectures on five publicly available HAR datasets based on wearable sensors, which vary in the sampling rate and number of activities. Our experiments demonstrate that the proposed LHN is able to produce rich information, improving on the results of the original ConvNets. Furthermore, the method outperforms existing state-of-the-art methods.
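The pipeline described above (project each layer's feature maps individually, then concatenate) can be sketched as below. The abstract does not name the projection method, so this sketch uses a PCA-style projection via SVD purely as a stand-in. The function names and the random "feature maps" are hypothetical.

```python
import numpy as np

def project(features, n_components):
    """Project (samples, dims) features onto their top principal directions.

    PCA via SVD is a stand-in; the abstract does not specify the projection.
    """
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def latent_hyper_features(layer_maps, n_components):
    """Flatten each layer's maps, project each layer on its own, then concatenate."""
    latents = []
    for maps in layer_maps:
        flat = maps.reshape(maps.shape[0], -1)
        latents.append(project(flat, n_components))
    return np.concatenate(latents, axis=1)

# Hypothetical feature maps from two layers for 32 samples.
rng = np.random.default_rng(0)
layer_maps = [rng.normal(size=(32, 8, 16)), rng.normal(size=(32, 4, 24))]
features = latent_hyper_features(layer_maps, n_components=5)
```

The resulting matrix has one row per sample and `n_components` columns per layer, and would then be passed to a classifier. In practice the projection would be fitted on training data only and reused for test data; the sketch fits and transforms the same array for brevity.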
CVAug 22, 2017
Activity Recognition based on a Magnitude-Orientation Stream NetworkCarlos Caetano, Victor H. C. de Melo, Jefersson A. dos Santos et al.
The temporal component of videos provides an important clue for activity recognition, as a number of activities can be reliably recognized based on the motion information. In view of that, this work proposes a novel temporal stream for two-stream convolutional networks based on images computed from the optical flow magnitude and orientation, named Magnitude-Orientation Stream (MOS), to learn the motion in a better and richer manner. Our method applies simple nonlinear transformations on the vertical and horizontal components of the optical flow to generate input images for the temporal stream. Experimental results, carried out on two well-known datasets (HMDB51 and UCF101), demonstrate that using our proposed temporal stream as input to existing neural network architectures can improve their performance for activity recognition. Results demonstrate that our temporal stream provides complementary information able to improve the classical two-stream methods, indicating the suitability of our approach to be used as a temporal video representation.
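As an illustration of how magnitude and orientation images can be derived from the two optical flow components, here is a minimal sketch. The abstract only mentions "simple nonlinear transformations", so the standard Cartesian-to-polar conversion and the rescaling to 8-bit images used here are assumptions, as are the function name and the toy flow field.

```python
import numpy as np

def magnitude_orientation_images(u, v):
    """Turn horizontal (u) and vertical (v) flow into two 8-bit images.

    Magnitude is sqrt(u^2 + v^2); orientation is atan2(v, u) in [-pi, pi].
    The rescaling to [0, 255] is an assumed choice, not taken from the paper.
    """
    magnitude = np.hypot(u, v)
    orientation = np.arctan2(v, u)
    peak = magnitude.max()
    mag_scaled = magnitude / peak * 255 if peak > 0 else np.zeros_like(magnitude)
    ori_scaled = (orientation + np.pi) / (2 * np.pi) * 255
    return np.rint(mag_scaled).astype(np.uint8), np.rint(ori_scaled).astype(np.uint8)

# Hypothetical 2x2 flow field.
u = np.array([[3.0, 0.0], [0.0, -1.0]])
v = np.array([[4.0, 0.0], [2.0, 0.0]])
mag_img, ori_img = magnitude_orientation_images(u, v)
```

The two resulting images could then be stacked as input channels for a temporal stream.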
CVNov 21, 2016
Kernel Cross-View Collaborative Representation based Classification for Person Re-IdentificationRaphael Prates, William Robson Schwartz
Person re-identification aims at maintaining a global identity as a person moves among non-overlapping surveillance cameras. It is a hard task due to different illumination conditions, viewpoints and the small number of annotated individuals from each pair of cameras (small-sample-size problem). Collaborative Representation based Classification (CRC) has been employed successfully to address the small-sample-size problem in computer vision. However, the original CRC formulation is not well-suited for person re-identification since it does not consider that probe and gallery samples are from different cameras. Furthermore, it is a linear model, while appearance changes caused by different camera conditions indicate a strong nonlinear transition between cameras. To overcome such limitations, we propose the Kernel Cross-View Collaborative Representation based Classification (Kernel X-CRC) that represents probe and gallery images by balancing representativeness and similarity nonlinearly. It assumes that a probe and its corresponding gallery image are represented with similar coding vectors using individuals from the training set. Experimental results demonstrate that our assumption holds when using a high-dimensional feature vector and becomes more compelling when dealing with a low-dimensional and discriminative representation computed using a common subspace learning method. We achieve state-of-the-art rank-1 matching rates on two person re-identification datasets (PRID450S and GRID) and the second-best results on the VIPeR and CUHK01 datasets.
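For readers unfamiliar with CRC, the following sketch shows the standard linear formulation that this abstract takes as its starting point: a probe is coded over all training samples with a ridge-regularized least-squares fit, and the class with the smallest regularized residual wins. This is not the Kernel X-CRC method, whose formulation is not given in the abstract. The function name, `lam` value, and toy data are hypothetical.

```python
import numpy as np

def crc_classify(X, labels, y, lam=0.01):
    """Standard linear CRC (not the kernel cross-view variant).

    X: (dims, n_train) matrix whose columns are training samples.
    labels: class label of each column of X.
    y: (dims,) probe vector.
    """
    labels = np.asarray(labels)
    n_train = X.shape[1]
    # Ridge-regularized coding vector: (X^T X + lam I)^-1 X^T y.
    alpha = np.linalg.solve(X.T @ X + lam * np.eye(n_train), X.T @ y)
    best_label, best_score = None, np.inf
    for label in np.unique(labels):
        mask = labels == label
        residual = np.linalg.norm(y - X[:, mask] @ alpha[mask])
        score = residual / (np.linalg.norm(alpha[mask]) + 1e-12)
        if score < best_score:
            best_label, best_score = label, score
    return best_label, alpha

# Hypothetical training set: two samples per class in a 3-D feature space.
X = np.array([[1.0, 0.9, 0.0, 0.1],
              [0.0, 0.1, 1.0, 0.9],
              [0.0, 0.0, 0.0, 0.0]])
labels = [0, 0, 1, 1]
predicted, alpha = crc_classify(X, labels, np.array([1.0, 0.05, 0.0]))
```

Because the model is linear in the training samples, it cannot capture the nonlinear appearance changes between cameras that the abstract describes, which is the limitation the kernel formulation targets.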
CVNov 7, 2016
Meat adulteration detection through digital image analysis of histological cuts using LBPJoão J. de Macedo Neto, Jefersson A. dos Santos, William Robson Schwartz
Food fraud has been an area of great concern due to its risk to public health, the reduction of food quality or nutritional value, and its economic consequences. For this reason, it has been the object of regulation in many countries (e.g. [1], [2]). One type of food that has frequently been the object of fraud through the addition of water or an aqueous solution is bovine meat. The traditional methods used to detect this kind of fraud are expensive, time-consuming and depend on physicochemical analyses that require complex laboratory techniques, specific to each added substance. In this paper, based on digital images of histological cuts of adulterated and non-adulterated (normal) bovine meat, we evaluate the use of digital image analysis methods to identify the aforementioned kind of fraud, with a focus on the Local Binary Pattern (LBP) algorithm.
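As background on the LBP descriptor mentioned above, here is a minimal sketch of the basic 3x3 variant: each interior pixel is compared with its eight neighbours, the comparison results form an 8-bit code, and the normalized histogram of codes serves as a texture feature. The abstract does not state which LBP variant or parameters were used, so this basic form, the bit ordering, and the function names are assumptions.

```python
import numpy as np

def lbp_3x3(image):
    """Basic 8-neighbour LBP codes for the interior pixels of a 2-D grayscale image."""
    image = np.asarray(image)
    height, width = image.shape
    center = image[1:-1, 1:-1]
    # Neighbour offsets in clockwise order starting at the top-left (assumed bit order).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = image[1 + dy:height - 1 + dy, 1 + dx:width - 1 + dx]
        codes |= (neighbour >= center).astype(np.uint8) << bit
    return codes

def lbp_histogram(image):
    """Normalized 256-bin histogram of LBP codes, usable as a texture descriptor."""
    hist = np.bincount(lbp_3x3(image).ravel(), minlength=256)
    return hist / hist.sum()

# Hypothetical 3x3 patch: neighbours 9, 7 and 6 are >= the centre value 6.
patch = np.array([[5, 9, 1], [4, 6, 7], [2, 6, 3]])
codes = lbp_3x3(patch)
hist = lbp_histogram(patch)
```

For this patch the set bits are 1, 3 and 5, giving the code 2 + 8 + 32 = 42. A classifier would then be trained on such histograms computed from the histological images.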
CVJul 11, 2016
Benchmark for License Plate Character SegmentationGabriel Resende Gonçalves, Sirlene Pio Gomes da Silva, David Menotti et al.
Automatic License Plate Recognition (ALPR) has been the focus of much research in recent years. In general, ALPR is divided into the following problems: detection of on-track vehicles, license plate detection, segmentation of license plate characters and optical character recognition (OCR). Even though commercial solutions are available for controlled acquisition conditions, e.g., the entrance of a parking lot, ALPR is still an open problem when dealing with data acquired from uncontrolled environments, such as roads and highways, when relying only on imaging sensors. Due to the multiple orientations and scales of the license plates captured by the camera, a very challenging task in ALPR is the License Plate Character Segmentation (LPCS) step, whose effectiveness must be (near) optimal to achieve a high recognition rate by the OCR. To tackle the LPCS problem, this work proposes a novel benchmark composed of a dataset designed to focus specifically on the character segmentation step of the ALPR within an evaluation protocol. Furthermore, we propose the Jaccard-Centroid coefficient, a new evaluation measure more suitable than the Jaccard coefficient regarding the location of the bounding box within the ground-truth annotation. The dataset is composed of 2,000 Brazilian license plates consisting of 14,000 alphanumeric symbols and their corresponding bounding box annotations. We also present a new straightforward approach to perform LPCS efficiently. Finally, we provide an experimental evaluation for the dataset based on four LPCS approaches and demonstrate the importance of character segmentation for achieving an accurate OCR.
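The motivation for a location-aware measure can be illustrated with a small example. The sketch below computes the plain Jaccard coefficient and, separately, the distance between box centroids. It does not implement the Jaccard-Centroid coefficient, whose definition is not given in the abstract. The function names and the boxes are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient (intersection over union) of boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def centroid_distance(a, b):
    """Euclidean distance between the centres of two boxes."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

ground_truth = (0, 0, 10, 10)
shifted = (0, 0, 5, 10)        # covers the left half of the ground truth
centered = (2.5, 0, 7.5, 10)   # covers the middle half of the ground truth

same_overlap = (jaccard(ground_truth, shifted), jaccard(ground_truth, centered))
offsets = (centroid_distance(ground_truth, shifted), centroid_distance(ground_truth, centered))
```

Both predictions have a Jaccard coefficient of 0.5, yet one is centered on the ground-truth box and the other is offset by 2.5 pixels, so the Jaccard coefficient alone cannot tell them apart.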
CVMay 12, 2016
A Mid-level Video Representation based on Binary Descriptors: A Case Study for Pornography DetectionCarlos Caetano, Sandra Avila, William Robson Schwartz et al.
With the growing amount of inappropriate content on the Internet, such as pornography, the need arises to detect and filter such material. This is because such content is often prohibited in certain environments (e.g., schools and workplaces) or for certain audiences (e.g., children). In recent years, many works have focused mainly on detecting pornographic images and videos based on visual content, particularly on the detection of skin color. Although these approaches provide good results, they generally have the disadvantage of a high false positive rate, since not all images with large areas of skin exposure are pornographic (e.g., images of people wearing swimsuits or images related to sports). Local feature based approaches with Bag-of-Words (BoW) models have been successfully applied to visual recognition tasks in the context of pornography detection. Even though existing methods provide promising results, they use local feature descriptors that require a high computational processing time and yield high-dimensional vectors. In this work, we propose an approach for pornography detection based on local binary feature extraction and the BossaNova image representation, a BoW model extension that preserves the visual information more richly. Moreover, we propose two approaches for video description based on the combination of mid-level representations, namely BossaNova Video Descriptor (BNVD) and BoW Video Descriptor (BoW-VD). The proposed techniques are promising, achieving an accuracy of 92.40%, thus reducing the classification error by 16% over the current state-of-the-art local features approach on the Pornography dataset.
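To show how binary local descriptors fit into a BoW model, here is a minimal sketch of a plain hard-assignment BoW histogram that uses the Hamming distance to match descriptors to codewords. It does not implement BossaNova, BNVD, or BoW-VD, whose details are not given in the abstract. The codebook, descriptors, and function names are hypothetical.

```python
import numpy as np

def hamming_distances(descriptors, codebook):
    """Pairwise Hamming distances between rows of two 0/1 bit matrices."""
    return (descriptors[:, None, :] != codebook[None, :, :]).sum(axis=2)

def bow_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest codeword and return normalized counts."""
    nearest = hamming_distances(descriptors, codebook).argmin(axis=1)
    hist = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# Hypothetical 4-bit codebook with two codewords, and four local descriptors.
codebook = np.array([[0, 0, 0, 0], [1, 1, 1, 1]])
descriptors = np.array([[0, 0, 0, 1], [1, 1, 1, 0], [1, 1, 0, 1], [1, 1, 1, 1]])
distances = hamming_distances(descriptors, codebook)
histogram = bow_histogram(descriptors, codebook)
```

Comparing bit strings this way is cheap, which is consistent with the abstract's motivation for binary descriptors over costlier real-valued ones.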
CVOct 8, 2014
Deep Representations for Iris, Face, and Fingerprint Spoofing DetectionDavid Menotti, Giovani Chiachia, Allan Pinto et al.
Biometric systems have significantly improved person identification and authentication, playing an important role in personal, national, and global security. However, these systems might be deceived (or "spoofed") and, despite the recent advances in spoofing detection, current solutions often rely on domain knowledge, specific biometric reading systems, and attack types. We assume very limited knowledge about biometric spoofing at the sensor to derive outstanding spoofing detection systems for iris, face, and fingerprint modalities based on two deep learning approaches. The first approach consists of learning suitable convolutional network architectures for each domain, while the second approach focuses on learning the weights of the network via back-propagation. We consider nine biometric spoofing benchmarks (each one containing real and fake samples of a given biometric modality and attack type) and learn deep representations for each benchmark by combining and contrasting the two learning approaches. This strategy not only provides better comprehension of how these approaches interplay, but also creates systems that exceed the best known results in eight out of the nine benchmarks. The results strongly indicate that spoofing detection systems based on convolutional networks can be robust to attacks already known and possibly adapted, with little effort, to image-based attacks that are yet to come.