LGOct 20, 2022
Improving Data Quality with Training Dynamics of Gradient Boosting Decision TreesMoacir Antonelli Ponti, Lucas de Angelis Oliveira, Mathias Esteban et al.
Real world datasets contain incorrectly labeled instances that hamper the performance of the model and, in particular, the ability to generalize out of distribution. Also, each example might have different contribution towards learning. This motivates studies to better understanding of the role of data instances with respect to their contribution in good metrics in models. In this paper we propose a method based on metrics computed from training dynamics of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each training example. We focus on datasets containing mostly tabular or structured data, for which the use of Decision Trees ensembles are still the state-of-the-art in terms of performance. Our methods achieved the best results overall when compared with confident learning, direct heuristics and a robust boosting algorithm. We show results on detecting noisy labels in order clean datasets, improving models' metrics in synthetic and real public datasets, as well as on a industry case in which we deployed a model based on the proposed solution.
ASMar 29, 2022
ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversionEdresson Casanova, Christopher Shulby, Alexander Korolev et al.
We explore cross-lingual multi-speaker speech synthesis and cross-lingual voice conversion applied to data augmentation for automatic speech recognition (ASR) systems in low/medium-resource scenarios. Through extensive experiments, we show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training. We also managed to close the gap between ASR models trained with synthesized versus human speech compared to other works that use many speakers. Finally, we show that it is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
CVMar 29, 2023
Sketch-an-Anchor: Sub-epoch Fast Model Adaptation for Zero-shot Sketch-based Image RetrievalLeo Sampaio Ferraz Ribeiro, Moacir Antonelli Ponti
Sketch-an-Anchor is a novel method to train state-of-the-art Zero-shot Sketch-based Image Retrieval (ZSSBIR) models in under an epoch. Most studies break down the problem of ZSSBIR into two parts: domain alignment between images and sketches, inherited from SBIR, and generalization to unseen data, inherent to the zero-shot protocol. We argue one of these problems can be considerably simplified and re-frame the ZSSBIR problem around the already-stellar yet underexplored Zero-shot Image-based Retrieval performance of off-the-shelf models. Our fast-converging model keeps the single-domain performance while learning to extract similar representations from sketches. To this end we introduce our Semantic Anchors -- guiding embeddings learned from word-based semantic spaces and features from off-the-shelf models -- and combine them with our novel Anchored Contrastive Loss. Empirical evidence shows we can achieve state-of-the-art performance on all benchmark datasets while training for 100x less iterations than other methods.
LGNov 28, 2023
Dendrogram distance: an evaluation metric for generative networks using hierarchical clusteringGustavo Sutter Carvalho, Moacir Antonelli Ponti
We present a novel metric for generative modeling evaluation, focusing primarily on generative networks. The method uses dendrograms to represent real and fake data, allowing for the divergence between training and generated samples to be computed. This metric focus on mode collapse, targeting generators that are not able to capture all modes in the training set. To evaluate the proposed method it is introduced a validation scheme based on sampling from real datasets, therefore the metric is evaluated in a controlled environment and proves to be competitive with other state-of-the-art approaches.
CVDec 3, 2025
On the Temporality for Sketch Representation LearningMarcelo Isaias de Moraes Junior, Moacir Antonelli Ponti
Sketches are simple human hand-drawn abstractions of complex scenes and real-world objects. Although the field of sketch representation learning has advanced significantly, there is still a gap in understanding the true relevance of the temporal aspect to the quality of these representations. This work investigates whether it is indeed justifiable to treat sketches as sequences, as well as which internal orders play a more relevant role. The results indicate that, although the use of traditional positional encodings is valid for modeling sketches as sequences, absolute coordinates consistently outperform relative ones. Furthermore, non-autoregressive decoders outperform their autoregressive counterparts. Finally, the importance of temporality was shown to depend on both the order considered and the task evaluated.
CVOct 22, 2025
Transformed Multi-view 3D Shape Features with Contrastive LearningMárcus Vinícius Lobo Costa, Sherlon Almeida da Silva, Bárbara Caroline Benato et al.
This paper addresses the challenges in representation learning of 3D shape features by investigating state-of-the-art backbones paired with both contrastive supervised and self-supervised learning objectives. Computer vision methods struggle with recognizing 3D objects from 2D images, often requiring extensive labeled data and relying on Convolutional Neural Networks (CNNs) that may overlook crucial shape relationships. Our work demonstrates that Vision Transformers (ViTs) based architectures, when paired with modern contrastive objectives, achieve promising results in multi-view 3D analysis on our downstream tasks, unifying contrastive and 3D shape understanding pipelines. For example, supervised contrastive losses reached about 90.6% accuracy on ModelNet10. The use of ViTs and contrastive learning, leveraging ViTs' ability to understand overall shapes and contrastive learning's effectiveness, overcomes the need for extensive labeled data and the limitations of CNNs in capturing crucial shape relationships. The success stems from capturing global shape semantics via ViTs and refining local discriminative features through contrastive optimization. Importantly, our approach is empirical, as it is grounded on extensive experimental evaluation to validate the effectiveness of combining ViTs with contrastive objectives for 3D representation learning.
CVAug 15, 2025
Towards Understanding 3D Vision: the Role of Gaussian CurvatureSherlon Almeida da Silva, Davi Geiger, Luiz Velho et al.
Recent advances in computer vision have predominantly relied on data-driven approaches that leverage deep learning and large-scale datasets. Deep neural networks have achieved remarkable success in tasks such as stereo matching and monocular depth reconstruction. However, these methods lack explicit models of 3D geometry that can be directly analyzed, transferred across modalities, or systematically modified for controlled experimentation. We investigate the role of Gaussian curvature in 3D surface modeling. Besides Gaussian curvature being an invariant quantity under change of observers or coordinate systems, we demonstrate using the Middlebury stereo dataset that it offers a sparse and compact description of 3D surfaces. Furthermore, we show a strong correlation between the performance rank of top state-of-the-art stereo and monocular methods and the low total absolute Gaussian curvature. We propose that this property can serve as a geometric prior to improve future 3D reconstruction algorithms.
CVJun 26, 2025
The Role of Cyclopean-Eye in Stereo VisionSherlon Almeida da Silva, Davi Geiger, Luiz Velho et al.
This work investigates the geometric foundations of modern stereo vision systems, with a focus on how 3D structure and human-inspired perception contribute to accurate depth reconstruction. We revisit the Cyclopean Eye model and propose novel geometric constraints that account for occlusions and depth discontinuities. Our analysis includes the evaluation of stereo feature matching quality derived from deep learning models, as well as the role of attention mechanisms in recovering meaningful 3D surfaces. Through both theoretical insights and empirical studies on real datasets, we demonstrate that combining strong geometric priors with learned features provides internal abstractions for understanding stereo vision systems.
CVFeb 28, 2025
Back to the Future Cyclopean Stereo: a human perception approach combining deep and geometric constraintsSherlon Almeida da Silva, Davi Geiger, Luiz Velho et al.
We innovate in stereo vision by explicitly providing analytical 3D surface models as viewed by a cyclopean eye model that incorporate depth discontinuities and occlusions. This geometrical foundation combined with learned stereo features allows our system to benefit from the strengths of both approaches. We also invoke a prior monocular model of surfaces to fill in occlusion regions or texture-less regions where data matching is not sufficient. Our results already are on par with the state-of-the-art purely data-driven methods and are of much better visual quality, emphasizing the importance of the 3D geometrical model to capture critical visual information. Such qualitative improvements may find applicability in virtual reality, for a better human experience, as well as in robotics, for reducing critical errors. Our approach aims to demonstrate that understanding and modeling geometrical properties of 3D surfaces is beneficial to computer vision research.
LGJan 30, 2025
Real-Time Anomaly Detection with Synthetic Anomaly Monitoring (SAM)Emanuele Luzio, Moacir Antonelli Ponti
Anomaly detection is essential for identifying rare and significant events across diverse domains such as finance, cybersecurity, and network monitoring. This paper presents Synthetic Anomaly Monitoring (SAM), an innovative approach that applies synthetic control methods from causal inference to improve both the accuracy and interpretability of anomaly detection processes. By modeling normal behavior through the treatment of each feature as a control unit, SAM identifies anomalies as deviations within this causal framework. We conducted extensive experiments comparing SAM with established benchmark models, including Isolation Forest, Local Outlier Factor (LOF), k-Nearest Neighbors (kNN), and One-Class Support Vector Machine (SVM), across five diverse datasets, including Credit Card Fraud, HTTP Dataset CSIC 2010, and KDD Cup 1999, among others. Our results demonstrate that SAM consistently delivers robust performance, highlighting its potential as a powerful tool for real-time anomaly detection in dynamic and complex environments.
LGJan 10, 2024
Decoupling Decision-Making in Fraud Prevention through Classifier Calibration for Business Logic ActionEmanuele Luzio, Moacir Antonelli Ponti, Christian Ramirez Arevalo et al.
Machine learning models typically focus on specific targets like creating classifiers, often based on known population feature distributions in a business context. However, models calculating individual features adapt over time to improve precision, introducing the concept of decoupling: shifting from point evaluation to data distribution. We use calibration strategies as strategy for decoupling machine learning (ML) classifiers from score-based actions within business logic frameworks. To evaluate these strategies, we perform a comparative analysis using a real-world business scenario and multiple ML models. Our findings highlight the trade-offs and performance implications of the approach, offering valuable insights for practitioners seeking to optimize their decoupling efforts. In particular, the Isotonic and Beta calibration methods stand out for scenarios in which there is shift between training and testing data.
SDJan 6, 2022
Implementing simple spectral denoising for environmental audio recordingsFábio Felix Dias, Moacir Antonelli Ponti, Rosane Minghim
This technical report details changes applied to a noise filter to facilitate its application and improve its results. The filter is applied to denoise natural sounds recorded in the wild and to generate an acoustic index used in soundscape analysis.
SDDec 4, 2021
YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyoneEdresson Casanova, Julian Weber, Christopher Shulby et al.
YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity and with reasonable quality. This is important to allow synthesis for speakers with a very different voice or recording characteristics from those seen during training.
LGSep 6, 2021
Training Deep Networks from Zero to Hero: avoiding pitfalls and going beyondMoacir Antonelli Ponti, Fernando Pereira dos Santos, Leo Sampaio Ferraz Ribeiro et al.
Training deep neural networks may be challenging in real world data. Using models as black-boxes, even with transfer learning, can result in poor generalization or inconclusive results when it comes to small datasets or specific applications. This tutorial covers the basic steps as well as more recent options to improve models, in particular, but not restricted to, supervised learning. It can be particularly useful in datasets that are not as well-prepared as those in challenges, and also under scarce annotation and/or small data. We describe basic procedures: as data preparation, optimization and transfer learning, but also recent architectural choices such as use of transformer modules, alternative convolutional layers, activation functions, wide and deep networks, as well as training procedures including as curriculum, contrastive and self-supervised learning.
ASApr 2, 2021
SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech ModelEdresson Casanova, Christopher Shulby, Eren Gölge et al.
In this paper, we propose SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech model that improves similarity for speakers unseen during training. We propose a speaker-conditional architecture that explores a flow-based decoder that works in a zero-shot scenario. As text encoders, we explore a dilated residual convolutional-based encoder, gated convolutional-based encoder, and transformer-based encoder. Additionally, we have shown that adjusting a GAN-based vocoder for the spectrograms predicted by the TTS model on the training dataset can significantly improve the similarity and speech quality for new speakers. Our model converges using only 11 speakers, reaching state-of-the-art results for similarity with new speakers, as well as high speech quality.
ASMay 11, 2020
TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian PortugueseEdresson Casanova, Arnaldo Candido Junior, Christopher Shulby et al.
Speech provides a natural way for human-computer interaction. In particular, speech synthesis systems are popular in different applications, such as personal assistants, GPS applications, screen readers and accessibility tools. However, not all languages are on the same level when in terms of resources and systems for speech synthesis. This work consists of creating publicly available resources for Brazilian Portuguese in the form of a novel dataset along with deep learning models for end-to-end speech synthesis. Such dataset has 10.5 hours from a single speaker, from which a Tacotron 2 model with the RTISI-LA vocoder presented the best performance, achieving a 4.03 MOS value. The obtained results are comparable to related works covering English language and the state-of-the-art in Portuguese.
CLFeb 25, 2020
Speech2Phone: A Novel and Efficient Method for Training Speaker Recognition ModelsEdresson Casanova, Arnaldo Candido Junior, Christopher Shulby et al.
In this paper we present an efficient method for training models for speaker recognition using small or under-resourced datasets. This method requires less data than other SOTA (State-Of-The-Art) methods, e.g. the Angular Prototypical and GE2E loss functions, while achieving similar results to those methods. This is done using the knowledge of the reconstruction of a phoneme in the speaker's voice. For this purpose, a new dataset was built, composed of 40 male speakers, who read sentences in Portuguese, totaling approximately 3h. We compare the three best architectures trained using our method to select the best one, which is the one with a shallow architecture. Then, we compared this model with the SOTA method for the speaker recognition task: the Fast ResNet-34 trained with approximately 2,000 hours, using the loss functions Angular Prototypical and GE2E. Three experiments were carried out with datasets in different languages. Among these three experiments, our model achieved the second best result in two experiments and the best result in one of them. This highlights the importance of our method, which proved to be a great competitor to SOTA speaker recognition models, with 500x less data and a simpler approach.
CVJan 28, 2019
Generalization of feature embeddings transferred from different video anomaly detection domainsFernando Pereira dos Santos, Leonardo Sampaio Ferraz Ribeiro, Moacir Antonelli Ponti
Detecting anomalous activity in video surveillance often involves using only normal activity data in order to learn an accurate detector. Due to lack of annotated data for some specific target domain, one could employ existing data from a source domain to produce better predictions. Hence, transfer learning presents itself as an important tool. But how to analyze the resulting data space? This paper investigates video anomaly detection, in particular feature embeddings of pre-trained CNN that can be used with non-fully supervised data. By proposing novel cross-domain generalization measures, we study how source features can generalize for different target video domains, as well as analyze unsupervised transfer learning. The proposed generalization measures are not only a theorical approach, but show to be useful in practice as a way to understand which datasets can be used or transferred to describe video frames, which it is possible to better discriminate between normal and anomalous activity.
CVNov 1, 2018
Unsupervised representation learning using convolutional and stacked auto-encoders: a domain and cross-domain feature space analysisGabriel B. Cavallari, Leonardo Sampaio Ferraz Ribeiro, Moacir Antonelli Ponti
A feature learning task involves training models that are capable of inferring good representations (transformations of the original space) from input data alone. When working with limited or unlabelled data, and also when multiple visual domains are considered, methods that rely on large annotated datasets, such as Convolutional Neural Networks (CNNs), cannot be employed. In this paper we investigate different auto-encoder (AE) architectures, which require no labels, and explore training strategies to learn representations from images. The models are evaluated considering both the reconstruction error of the images and the feature spaces in terms of their discriminative power. We study the role of dense and convolutional layers on the results, as well as the depth and capacity of the networks, since those are shown to affect both the dimensionality reduction and the capability of generalising for different visual domains. Classification results with AE features were as discriminative as pre-trained CNN features. Our findings can be used as guidelines for the design of unsupervised representation learning methods within and across domains.
LGJun 20, 2018
Como funciona o Deep LearningMoacir Antonelli Ponti, Gabriel B. Paranhos da Costa
Deep Learning methods are currently the state-of-the-art in many problems which can be tackled via machine learning, in particular classification problems. However there is still lack of understanding on how those methods work, why they work and what are the limitations involved in using them. In this chapter we will describe in detail the transition from shallow to deep networks, include examples of code on how to implement them, as well as the main issues one faces when training a deep network. Afterwards, we introduce some theoretical background behind the use of deep models, and discuss their limitations.
LGMay 7, 2018
Computing the Shattering Coefficient of Supervised Learning AlgorithmsRodrigo Fernandes de Mello, Moacir Antonelli Ponti, Carlos Henrique Grossi Ferreira
The Statistical Learning Theory (SLT) provides the theoretical guarantees for supervised machine learning based on the Empirical Risk Minimization Principle (ERMP). Such principle defines an upper bound to ensure the uniform convergence of the empirical risk Remp(f), i.e., the error measured on a given data sample, to the expected value of risk R(f) (a.k.a. actual risk), which depends on the Joint Probability Distribution P(X x Y) mapping input examples x in X to class labels y in Y. The uniform convergence is only ensured when the Shattering coefficient N(F,2n) has a polynomial growing behavior. This paper proves the Shattering coefficient for any Hilbert space H containing the input space X and discusses its effects in terms of learning guarantees for supervised machine algorithms.
LGNov 28, 2017
Providing theoretical learning guarantees to Deep Learning NetworksRodrigo Fernandes de Mello, Martha Dais Ferreira, Moacir Antonelli Ponti
Deep Learning (DL) is one of the most common subjects when Machine Learning and Data Science approaches are considered. There are clearly two movements related to DL: the first aggregates researchers in quest to outperform other algorithms from literature, trying to win contests by considering often small decreases in the empirical risk; and the second investigates overfitting evidences, questioning the learning capabilities of DL classifiers. Motivated by such opposed points of view, this paper employs the Statistical Learning Theory (SLT) to study the convergence of Deep Neural Networks, with particular interest in Convolutional Neural Networks. In order to draw theoretical conclusions, we propose an approach to estimate the Shattering coefficient of those classification algorithms, providing a lower bound for the complexity of their space of admissible functions, a.k.a. algorithm bias. Based on such estimator, we generalize the complexity of network biases, and, next, we study AlexNet and VGG16 architectures in the point of view of their Shattering coefficients, and number of training examples required to provide theoretical learning guarantees. From our theoretical formulation, we show the conditions which Deep Neural Networks learn as well as point out another issue: DL benchmarks may be strictly driven by empirical risks, disregarding the complexity of algorithms biases.