CV May 25, 2022
A Low Memory Footprint Quantized Neural Network for Depth Completion of Very Sparse Time-of-Flight Depth Maps
Xiaowen Jiang, Valerio Cambareri, Gianluca Agresti et al.
Sparse active illumination enables precise time-of-flight depth sensing as it maximizes signal-to-noise ratio for low power budgets. However, depth completion is required to produce dense depth maps for 3D perception. We address this task with realistic illumination and sensor resolution constraints by simulating ToF datasets for indoor 3D perception with challenging sparsity levels. We propose a quantized convolutional encoder-decoder network for this task. Our model achieves optimal depth map quality by means of input pre-processing and carefully tuned training with a geometry-preserving loss function. We also achieve low memory footprint for weights and activations by means of mixed precision quantization-at-training techniques. The resulting quantized models are comparable to the state of the art in terms of quality, but they require very low GPU times and achieve up to 14-fold memory size reduction for the weights w.r.t. their floating point counterpart with minimal impact on quality metrics.
CV Jul 3, 2024
SAFT: Towards Out-of-Distribution Generalization in Fine-Tuning
Bac Nguyen, Stefan Uhlich, Fabien Cardinaux et al.
Handling distribution shifts from training data, known as out-of-distribution (OOD) generalization, poses a significant challenge in the field of machine learning. While a pre-trained vision-language model like CLIP has demonstrated remarkable zero-shot performance, further adaptation of the model to downstream tasks leads to undesirable degradation for OOD data. In this work, we introduce Sparse Adaptation for Fine-Tuning (SAFT), a method that prevents fine-tuning from forgetting the general knowledge in the pre-trained model. SAFT only updates a small subset of important parameters whose gradient magnitude is large, while keeping the other parameters frozen. SAFT is straightforward to implement and conceptually simple. Extensive experiments show that with only 0.1% of the model parameters, SAFT can significantly improve the performance of CLIP. It consistently outperforms baseline methods across several benchmarks. On the few-shot learning benchmark of ImageNet and its variants, SAFT gives a gain of 5.15% on average over the conventional fine-tuning method in OOD settings.
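The selection rule described in this abstract, updating only the parameters with the largest gradient magnitude and keeping the rest frozen, can be illustrated with a minimal sketch. The function names, the flat parameter list, the plain SGD update, and the 25% fraction are assumptions made for illustration; the abstract does not state how SAFT accumulates gradients or applies updates.

```python
def topk_gradient_mask(grads, fraction):
    """Return a 0/1 mask selecting the `fraction` of entries with the largest |gradient|."""
    k = max(1, int(len(grads) * fraction))
    # Rank indices by decreasing gradient magnitude and keep the first k.
    ranked = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)
    selected = set(ranked[:k])
    return [1 if i in selected else 0 for i in range(len(grads))]


def masked_sgd_step(params, grads, mask, lr):
    """Update only the selected parameters; all others stay frozen (hypothetical SGD step)."""
    return [p - lr * g * m for p, g, m in zip(params, grads, mask)]


params = [0.5, -1.0, 2.0, 0.1]
grads = [0.01, -0.9, 0.05, 0.3]
mask = topk_gradient_mask(grads, fraction=0.25)  # selects 1 of the 4 parameters
new_params = masked_sgd_step(params, grads, mask, lr=0.1)
```

In this toy setup only the second parameter moves; the mask would normally be computed once and reused for the whole fine-tuning run, which is also an assumption here.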
LG Dec 13, 2022
A Statistical Model for Predicting Generalization in Few-Shot Classification
Yassir Bendou, Vincent Gripon, Bastien Pasdeloup et al.
The estimation of the generalization error of classifiers often relies on a validation set. Such a set is hardly available in few-shot learning scenarios, a highly disregarded shortcoming in the field. In these scenarios, it is common to rely on features extracted from pre-trained neural networks combined with distance-based classifiers such as nearest class mean. In this work, we introduce a Gaussian model of the feature distribution. By estimating the parameters of this model, we are able to predict the generalization error on new classification tasks with few samples. We observe that accurate distance estimates between class-conditional densities are the key to accurate estimates of the generalization performance. Therefore, we propose an unbiased estimator for these distances and integrate it in our numerical analysis. We empirically show that our approach outperforms alternatives such as the leave-one-out cross-validation strategy.
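For readers unfamiliar with the nearest class mean classifier mentioned in this abstract, the following is a minimal sketch. It assumes features are plain lists of floats and uses Euclidean distance; the function names and toy data are hypothetical, and it does not cover the Gaussian model or the distance estimator proposed in the paper.

```python
import math


def class_means(features, labels):
    """Average the feature vectors of each class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict_ncm(means, x):
    """Assign x to the class whose mean is closest in Euclidean distance."""
    return min(means, key=lambda y: math.dist(means[y], x))


support = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
support_labels = ["a", "a", "b", "b"]
means = class_means(support, support_labels)
prediction = predict_ncm(means, [1.0, 1.0])
```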
SD Mar 21, 2022
AutoTTS: End-to-End Text-to-Speech Synthesis through Differentiable Duration Modeling
Bac Nguyen, Fabien Cardinaux, Stefan Uhlich
Parallel text-to-speech (TTS) models have recently enabled fast and highly-natural speech synthesis. However, they typically require external alignment models, which are not necessarily optimized for the decoder as they are not jointly trained. In this paper, we propose a differentiable duration method for learning monotonic alignments between input and output sequences. Our method is based on a soft-duration mechanism that optimizes a stochastic process in expectation. Using this differentiable duration method, we introduce AutoTTS, a direct text-to-waveform speech synthesis model. AutoTTS enables high-fidelity speech synthesis through a combination of adversarial training and matching the total ground-truth duration. Experimental results show that our model obtains competitive results while enjoying a much simpler training pipeline. Audio samples are available online.
SD Jun 2, 2023
Towards Robust FastSpeech 2 by Modelling Residual Multimodality
Fabian Kögel, Bac Nguyen, Fabien Cardinaux
State-of-the-art non-autoregressive text-to-speech (TTS) models based on FastSpeech 2 can efficiently synthesise high-fidelity and natural speech. For expressive speech datasets, however, we observe characteristic audio distortions. We demonstrate that such artefacts are introduced to the vocoder reconstruction by over-smooth mel-spectrogram predictions, which are induced by the choice of mean-squared-error (MSE) loss for training the mel-spectrogram decoder. With MSE loss, FastSpeech 2 is limited to learning conditional averages of the training distribution, which might not lie close to a natural sample if the distribution still appears multimodal after all conditioning signals. To alleviate this problem, we introduce TVC-GMM, a mixture model of Trivariate-Chain Gaussian distributions, to model the residual multimodality. TVC-GMM reduces spectrogram smoothness and improves perceptual audio quality, in particular for expressive datasets, as shown by both objective and subjective evaluation.
CV Nov 24, 2023
Inferring Latent Class Statistics from Text for Robust Visual Few-Shot Learning
Yassir Bendou, Vincent Gripon, Bastien Pasdeloup et al.
In the realm of few-shot learning, foundation models like CLIP have proven effective but exhibit limitations in cross-domain robustness especially in few-shot settings. Recent works add text as an extra modality to enhance the performance of these models. Most of these approaches treat text as an auxiliary modality without fully exploring its potential to elucidate the underlying class visual features distribution. In this paper, we present a novel approach that leverages text-derived statistics to predict the mean and covariance of the visual feature distribution for each class. This predictive framework enriches the latent space, yielding more robust and generalizable few-shot learning models. We demonstrate the efficacy of incorporating both mean and covariance statistics in improving few-shot classification performance across various datasets. Our method shows that we can use text to predict the mean and covariance of the distribution offering promising improvements in few-shot learning scenarios.
AI Mar 19
D5P4: Partition Determinantal Point Process for Diversity in Parallel Discrete Diffusion Decoding
Jonathan Lys, Vincent Gripon, Bastien Pasdeloup et al.
Discrete diffusion models are promising alternatives to autoregressive approaches for text generation, yet their decoding methods remain under-studied. Standard decoding methods for autoregressive models, such as beam search, do not directly apply to iterative denoising, and existing diffusion decoding techniques provide limited control over in-batch diversity. To bridge this gap, we introduce a generalized beam-search framework for discrete diffusion that generates candidates in parallel and supports modular beam-selection objectives. As a diversity-focused instantiation, we propose D5P4, which formulates the selection step as MAP inference over a Determinantal Point Process. Leveraging a scalable greedy solver, D5P4 maintains multi-GPU compatibility and enables an explicit trade-off between model probability and target diversity with near-zero compute overhead. Experiments on free-form generation and question answering demonstrate that D5P4 improves diversity over strong baselines while maintaining competitive generation quality.
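To make "MAP inference over a Determinantal Point Process" with a greedy solver more concrete, here is a naive sketch. It assumes a kernel of the form L[i][j] = q_i * s_ij * q_j, where q_i is a per-candidate quality score (for example a model probability) and s_ij a pairwise similarity; this kernel, the function names, and the toy numbers are illustrative assumptions, not the construction used by D5P4. The sketch recomputes determinants from scratch, whereas a scalable solver would update them incrementally.

```python
def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def greedy_dpp_map(quality, similarity, k):
    """Greedily pick up to k items, each time maximizing det of the selected sub-kernel."""
    n = len(quality)
    kernel = [[quality[i] * similarity[i][j] * quality[j] for j in range(n)]
              for i in range(n)]
    selected = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sub = [[kernel[r][c] for c in idx] for r in idx]
            d = determinant(sub)
            if d > best_det:
                best, best_det = i, d
        if best is None:  # no remaining item adds volume
            break
        selected.append(best)
    return selected


quality = [1.0, 0.95, 0.6]
similarity = [[1.0, 0.99, 0.1],
              [0.99, 1.0, 0.1],
              [0.1, 0.1, 1.0]]
chosen = greedy_dpp_map(quality, similarity, k=2)  # [0, 2]
```

Candidates 0 and 1 have the highest quality but are nearly identical, so the determinant objective prefers the lower-quality, dissimilar candidate 2 as the second pick.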
CL Feb 16
Residual Connections and the Causal Shift: Uncovering a Structural Misalignment in Transformers
Jonathan Lys, Vincent Gripon, Bastien Pasdeloup et al.
Large Language Models (LLMs) are trained with next-token prediction, implemented in autoregressive Transformers via causal masking for parallelism. This creates a subtle misalignment: residual connections tie activations to the current token, while supervision targets the next token, potentially propagating mismatched information if the current token is not the most informative for prediction. In this work, we empirically localize this input-output alignment shift in pretrained LLMs, using decoding trajectories over tied embedding spaces and similarity-based metrics. Our experiments reveal that the hidden token representations switch from input alignment to output alignment deep within the network. Motivated by this observation, we propose a lightweight residual-path mitigation based on residual attenuation, implemented either as a fixed-layer intervention or as a learnable gating mechanism. Experiments on multiple benchmarks show that these strategies alleviate the representation misalignment and yield improvements, providing an efficient and general architectural enhancement for autoregressive Transformers.
LG Feb 16
Inner Loop Inference for Pretrained Transformers: Unlocking Latent Capabilities Without Training
Jonathan Lys, Vincent Gripon, Bastien Pasdeloup et al.
Deep Learning architectures, and in particular Transformers, are conventionally viewed as a composition of layers. These layers are actually often obtained as the sum of two contributions: a residual path that copies the input and the output of a Transformer block. As a consequence, the inner representations (i.e. the input of these blocks) can be interpreted as iterative refinement of a propagated latent representation. Under this lens, many works suggest that the inner space is shared across layers, meaning that tokens can be decoded at early stages. Mechanistic interpretability even goes further by conjecturing that some layers act as refinement layers. Following this path, we propose inference-time inner looping, which prolongs refinement in pretrained off-the-shelf language models by repeatedly re-applying a selected block range. Across multiple benchmarks, inner looping yields modest but consistent accuracy improvements. Analyses of the resulting latent trajectories suggest more stable state evolution and continued semantic refinement. Overall, our results suggest that additional refinement can be obtained through simple test-time looping, extending computation in frozen pretrained models.
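The idea of repeatedly re-applying a selected block range at inference time can be sketched with toy residual blocks acting on a scalar state. All names and the toy update functions are hypothetical; the abstract does not specify how the block range or the number of repetitions is chosen.

```python
def make_residual_block(update):
    """Wrap `update` so the block returns its input plus a correction (residual path)."""
    return lambda hidden: hidden + update(hidden)


def forward_with_inner_loop(blocks, hidden, loop_start, loop_end, num_loops):
    """Run the blocks in order, applying blocks[loop_start:loop_end] num_loops times in total."""
    for block in blocks[:loop_start]:
        hidden = block(hidden)
    for _ in range(num_loops):
        for block in blocks[loop_start:loop_end]:
            hidden = block(hidden)
    for block in blocks[loop_end:]:
        hidden = block(hidden)
    return hidden


blocks = [
    make_residual_block(lambda h: 1.0),              # toy "early" block
    make_residual_block(lambda h: 0.5 * (4.0 - h)),  # toy refinement block pulling h toward 4
    make_residual_block(lambda h: 0.0),              # toy "late" block, leaves h unchanged
]
standard = forward_with_inner_loop(blocks, 0.0, loop_start=1, loop_end=2, num_loops=1)
looped = forward_with_inner_loop(blocks, 0.0, loop_start=1, loop_end=2, num_loops=2)
```

With `num_loops=1` this is the ordinary forward pass; with `num_loops=2` the toy refinement block moves the state closer to its fixed point without any change to the block parameters.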
LG Sep 7, 2023
DBsurf: A Discrepancy Based Method for Discrete Stochastic Gradient Estimation
Pau Mulet Arabi, Alec Flowers, Lukas Mauch et al.
Computing gradients of an expectation with respect to the distributional parameters of a discrete distribution is a problem arising in many fields of science and engineering. Typically, this problem is tackled using Reinforce, which frames the problem of gradient estimation as a Monte Carlo simulation. Unfortunately, the Reinforce estimator is especially sensitive to discrepancies between the true probability distribution and the drawn samples, a common issue in low sampling regimes that results in inaccurate gradient estimates. In this paper, we introduce DBsurf, a reinforce-based estimator for discrete distributions that uses a novel sampling procedure to reduce the discrepancy between the samples and the actual distribution. To assess the performance of our estimator, we subject it to a diverse set of tasks. Among existing estimators, DBsurf attains the lowest variance in a least squares problem commonly used in the literature for benchmarking. Furthermore, DBsurf achieves the best results for training variational auto-encoders (VAE) across different datasets and sampling setups. Finally, we apply DBsurf to build a simple and efficient Neural Architecture Search (NAS) algorithm with state-of-the-art performance.
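As background for the Reinforce baseline discussed in this abstract, the sketch below estimates the gradient of an expectation over a categorical distribution by Monte Carlo sampling and compares it with the exact gradient. It assumes a softmax parametrization by logits; the function names and toy objective are illustrative, and DBsurf's sampling procedure is not reproduced here.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(logits, f, num_samples, rng):
    """Monte Carlo estimate of d/d(logits) of E[f(x)] with x ~ softmax(logits)."""
    probs = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    samples = rng.choices(range(n), weights=probs, k=num_samples)
    for x in samples:
        # For a softmax categorical, d log p(x) / d logit_j = 1[x == j] - p_j.
        for j in range(n):
            grad[j] += f(x) * ((1.0 if x == j else 0.0) - probs[j])
    return [g / num_samples for g in grad]


def exact_gradient(logits, f):
    """Closed form: d E[f] / d logit_j = p_j * (f(j) - E[f])."""
    probs = softmax(logits)
    mean_f = sum(p * f(x) for x, p in enumerate(probs))
    return [p * (f(j) - mean_f) for j, p in enumerate(probs)]


values = [1.0, 3.0, 0.0]
objective = lambda x: values[x]
logits = [0.0, 1.0, -1.0]
estimate = reinforce_gradient(logits, objective, num_samples=20000, rng=random.Random(0))
exact = exact_gradient(logits, objective)
```

With few samples the empirical frequencies of the drawn categories can deviate noticeably from the true probabilities, which is the discrepancy the abstract refers to.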
CV Mar 31, 2024
LLM meets Vision-Language Models for Zero-Shot One-Class Classification
Yassir Bendou, Giulia Lioi, Bastien Pasdeloup et al.
We consider the problem of zero-shot one-class visual classification, extending traditional one-class classification to scenarios where only the label of the target class is available. This method aims to discriminate between positive and negative query samples without requiring examples from the target class. We propose a two-step solution that first queries large language models for visually confusing objects and then relies on vision-language pre-trained models (e.g., CLIP) to perform classification. By adapting large-scale vision benchmarks, we demonstrate the ability of the proposed method to outperform adapted off-the-shelf alternatives in this setting. Namely, we propose a realistic benchmark where negative query samples are drawn from the same original dataset as positive ones, including a granularity-controlled version of iNaturalist, where negative samples are at a fixed distance in the taxonomy tree from the positive ones. To our knowledge, we are the first to demonstrate the ability to discriminate a single category from other semantically related ones using only its label.
LG Oct 22, 2025
GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters
Anand Choudhary, Yasser Sulaıman, Lukas Mauch et al.
Sparse fine-tuning techniques adapt LLMs to downstream tasks by only tuning a sparse subset of model parameters. However, the effectiveness of sparse adaptation depends on optimally selecting the model parameters to be fine-tuned. In this work, we introduce a novel sparse fine-tuning technique named GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters, which fine-tunes only those model parameters which have the largest gradient magnitudes on downstream tasks and the smallest pre-trained magnitudes, intuitively prioritizing parameters that are highly task-relevant, but minimally disruptive to pre-trained knowledge. Our experimentation with LLaMA3 8B and Gemma 2B as base models shows that GaLLoP consistently improves or matches the in-distribution as well as out-of-distribution performance obtained via the usage of other leading parameter-efficient fine-tuning techniques, including LoRA, DoRA, and SAFT. Our analysis demonstrates that GaLLoP mitigates catastrophic forgetting and memorization of task data, as important pre-trained parameters remain unchanged, and stabilizes performance relative to other fine-tuning techniques, robustly generalizing across most random seeds.
CV Jan 20, 2024
A Novel Benchmark for Few-Shot Semantic Segmentation in the Era of Foundation Models
Reda Bensaid, Vincent Gripon, François Leduc-Primeau et al.
Few-shot semantic segmentation (FSS) is a crucial challenge in computer vision, driving extensive research into a diverse range of methods, from advanced meta-learning techniques to simple transfer learning baselines. With the emergence of vision foundation models (VFM) serving as generalist feature extractors, we seek to explore the adaptation of these models for FSS. While current FSS benchmarks focus on adapting pre-trained models to new tasks with few images, they emphasize in-domain generalization, making them less suitable for VFM trained on large-scale web datasets. To address this, we propose a novel realistic benchmark with a simple and straightforward adaptation process tailored for this task. Using this benchmark, we conduct a comprehensive comparative analysis of prominent VFM and semantic segmentation models. To evaluate their effectiveness, we leverage various adaptation methods, ranging from linear probing to parameter-efficient fine-tuning (PEFT) and full fine-tuning. Our findings show that models designed for segmentation can be outperformed by self-supervised (SSL) models. On the other hand, while PEFT methods yield competitive performance, they provide little discrepancy in the obtained results compared to other methods, highlighting the critical role of the feature extractor in determining results. To our knowledge, this is the first study on the adaptation of VFM for FSS.
SD Jun 2, 2021
NVC-Net: End-to-End Adversarial Voice Conversion
Bac Nguyen, Fabien Cardinaux
Voice conversion has gained increasing popularity in many applications of speech synthesis. The idea is to change the voice identity from one speaker into another while keeping the linguistic content unchanged. Many voice conversion approaches rely on the use of a vocoder to reconstruct the speech from acoustic features, and as a consequence, the speech quality heavily depends on such a vocoder. In this paper, we propose NVC-Net, an end-to-end adversarial network, which performs voice conversion directly on the raw audio waveform of arbitrary length. By disentangling the speaker identity from the speech content, NVC-Net is able to perform non-parallel traditional many-to-many voice conversion as well as zero-shot voice conversion from a short utterance of an unseen target speaker. Importantly, NVC-Net is non-autoregressive and fully convolutional, achieving fast inference. Our model is capable of producing samples at a rate of more than 3600 kHz on an NVIDIA V100 GPU, being orders of magnitude faster than state-of-the-art methods under the same hardware configurations. Objective and subjective evaluations on non-parallel many-to-many voice conversion tasks show that NVC-Net obtains competitive results with significantly fewer parameters.
CV Mar 24, 2021
DNN Quantization with Attention
Ghouthi Boukli Hacene, Lukas Mauch, Stefan Uhlich et al.
Low-bit quantization of network weights and activations can drastically reduce the memory footprint, complexity, energy consumption and latency of Deep Neural Networks (DNNs). However, low-bit quantization can also cause a considerable drop in accuracy, in particular when we apply it to complex learning tasks or lightweight DNN architectures. In this paper, we propose a training procedure that relaxes the low-bit quantization. We call this procedure DNN Quantization with Attention (DQA). The relaxation is achieved by using a learnable linear combination of high, medium and low-bit quantizations. Our learning procedure converges step by step to a low-bit quantization using an attention mechanism with temperature scheduling. In experiments, our approach outperforms other low-bit quantization techniques on various object recognition benchmarks such as CIFAR10, CIFAR100 and ImageNet ILSVRC 2012, achieves almost the same accuracy as a full precision DNN, and considerably reduces the accuracy drop when quantizing lightweight DNN architectures.
LG Feb 12, 2021
Neural Network Libraries: A Deep Learning Framework Designed from Engineers' Perspectives
Takuya Narihira, Javier Alonsogarcia, Fabien Cardinaux et al.
While there exists a plethora of deep learning tools and frameworks, the fast-growing complexity of the field brings new demands and challenges, such as more flexible network design, fast computation in distributed settings, and compatibility between different tools. In this paper, we introduce Neural Network Libraries (https://nnabla.org), a deep learning framework designed from an engineer's perspective, with emphasis on usability and compatibility as its core design principles. We elaborate on each of our design principles and its merits, and validate our attempts via experiments.
LG Nov 24, 2020
Efficient Sampling for Predictor-Based Neural Architecture Search
Lukas Mauch, Stephen Tiedemann, Javier Alonso Garcia et al.
Recently, predictor-based algorithms emerged as a promising approach for neural architecture search (NAS). For NAS, we typically have to calculate the validation accuracy of a large number of Deep Neural Networks (DNNs), which is computationally expensive. Predictor-based NAS algorithms address this problem. They train a proxy model that can infer the validation accuracy of DNNs directly from their network structure. During optimization, the proxy can be used to narrow down the number of architectures for which the true validation accuracy must be computed, which makes predictor-based algorithms sample efficient. Usually, we compute the proxy for all DNNs in the network search space and pick those that maximize the proxy as candidates for optimization. However, that is intractable in practice, because the search spaces are often very large and contain billions of network architectures. The contributions of this paper are threefold: 1) We define a sample efficiency gain to compare different predictor-based NAS algorithms. 2) We conduct experiments on the NASBench-101 dataset and show that the sample efficiency of predictor-based algorithms decreases dramatically if the proxy is only computed for a subset of the search space. 3) We show that if we choose the subset of the search space on which the proxy is evaluated in a smart way, the sample efficiency of the original predictor-based algorithm that has access to the full search space can be regained. This is an important step to make predictor-based NAS algorithms useful in practice.
AS May 15, 2020
Unsupervised Cross-Domain Speech-to-Speech Conversion with Time-Frequency Consistency
Mohammad Asif Khan, Fabien Cardinaux, Stefan Uhlich et al.
In recent years, generative adversarial network (GAN) based models have been successfully applied for unsupervised speech-to-speech conversion. The rich compact harmonic view of the magnitude spectrogram is considered a suitable choice for training these models with audio data. To reconstruct the speech signal, a magnitude spectrogram is first generated by the neural network, which is then utilized by methods like the Griffin-Lim algorithm to reconstruct a phase spectrogram. This procedure bears the problem that the generated magnitude spectrogram may not be consistent, which is required for finding a phase such that the full spectrogram has a natural-sounding speech waveform. In this work, we approach this problem by proposing a condition encouraging spectrogram consistency during the adversarial training procedure. We demonstrate our approach on the task of translating the voice of a male speaker to that of a female speaker, and vice versa. Our experimental results on the Librispeech corpus show that the model trained with the TF consistency provides a perceptually better quality of speech-to-speech conversion.
LG Nov 12, 2019
Iteratively Training Look-Up Tables for Network Quantization
Fabien Cardinaux, Stefan Uhlich, Kazuki Yoshiyama et al.
Operating deep neural networks (DNNs) on devices with limited resources requires the reduction of their memory as well as computational footprint. Popular reduction methods are network quantization or pruning, which either reduce the word length of the network parameters or remove weights from the network if they are not needed. In this article we discuss a general framework for network reduction which we call "Look-Up Table Quantization" (LUT-Q). For each layer, we learn a value dictionary and an assignment matrix to represent the network weights. We propose a special solver which combines gradient descent and a one-step k-means update to learn both the value dictionaries and assignment matrices iteratively. This method is very flexible: by constraining the value dictionary, many different reduction problems such as non-uniform network quantization, training of multiplierless networks, network pruning or simultaneous quantization and pruning can be implemented without changing the solver. This flexibility of the LUT-Q method allows us to use the same method to train networks for different hardware capabilities.
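The value dictionary, the assignment of each weight to a dictionary entry, and the one-step k-means update described in this abstract can be sketched for a single layer with scalar weights. This is a simplified illustration under stated assumptions: the function names and toy values are hypothetical, and the gradient-descent update of the full-precision weights that the solver interleaves with these steps is omitted.

```python
def assign_to_dictionary(weights, dictionary):
    """Assignment step: index of the nearest dictionary value for each weight."""
    return [min(range(len(dictionary)), key=lambda k: abs(w - dictionary[k]))
            for w in weights]


def kmeans_update(weights, assignments, dictionary):
    """One k-means step: move each dictionary value to the mean of its assigned weights."""
    new_dictionary = list(dictionary)
    for k in range(len(dictionary)):
        members = [w for w, a in zip(weights, assignments) if a == k]
        if members:  # leave unused entries unchanged
            new_dictionary[k] = sum(members) / len(members)
    return new_dictionary


def quantize(assignments, dictionary):
    """Reconstruct the quantized weights from the look-up table."""
    return [dictionary[a] for a in assignments]


weights = [-1.0, -0.5, 0.5, 1.5]
dictionary = [-1.0, 1.0]  # two entries, i.e. 1 bit per weight index
assignments = assign_to_dictionary(weights, dictionary)
dictionary = kmeans_update(weights, assignments, dictionary)
quantized = quantize(assignments, dictionary)
```

Constraining the dictionary, for example by pinning one entry to zero or by rounding entries to powers of two after the update, would be one way to obtain the pruning and multiplierless variants the abstract mentions; the exact constraints used by the authors are not given here.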
LG May 27, 2019
Mixed Precision DNNs: All you need is a good parametrization
Stefan Uhlich, Lukas Mauch, Fabien Cardinaux et al.
Efficient deep neural network (DNN) inference on mobile or embedded devices typically involves quantization of the network parameters and activations. In particular, mixed precision networks achieve better performance than networks with homogeneous bitwidth for the same size constraint. Since choosing the optimal bitwidths is not straightforward, training methods that can learn them are desirable. Differentiable quantization with straight-through gradients allows the quantizer's parameters to be learned using gradient methods. We show that a suitable parametrization of the quantizer is the key to achieving stable training and good final performance. Specifically, we propose to parametrize the quantizer with the step size and dynamic range. The bitwidth can then be inferred from them. Other parametrizations, which explicitly use the bitwidth, consistently perform worse. We confirm our findings with experiments on CIFAR-10 and ImageNet and we obtain mixed precision DNNs with learned quantization parameters, achieving state-of-the-art performance.
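To illustrate how a bitwidth can be inferred from a step size and a dynamic range, the sketch below assumes a symmetric signed uniform quantizer whose levels are integer multiples of the step size, clipped to [-q_max, q_max]. Under that assumption q_max = (2^(b-1) - 1) * step_size, so b = log2(q_max / step_size + 1) + 1. The function names are hypothetical, and the straight-through gradient used during training is not shown.

```python
import math


def uniform_quantize(x, step_size, q_max):
    """Clip x to [-q_max, q_max] and round to the nearest multiple of step_size."""
    clipped = max(-q_max, min(q_max, x))
    return step_size * round(clipped / step_size)


def inferred_bitwidth(step_size, q_max):
    """Bitwidth implied by the step size and dynamic range of a signed uniform quantizer."""
    # Levels run from -q_max to q_max in steps of step_size, which fits into b bits
    # when q_max = (2 ** (b - 1) - 1) * step_size.
    return math.log2(q_max / step_size + 1) + 1


step_size, q_max = 0.25, 1.75
bits = inferred_bitwidth(step_size, q_max)
inside = uniform_quantize(0.3, step_size, q_max)
clipped = uniform_quantize(-5.0, step_size, q_max)
```

Here the two learned quantities are continuous, and the bitwidth is a derived value that can be rounded up after training.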
LG Nov 13, 2018
Iteratively Training Look-Up Tables for Network Quantization
Fabien Cardinaux, Stefan Uhlich, Kazuki Yoshiyama et al.
Operating deep neural networks on devices with limited resources requires the reduction of their memory footprints and computational requirements. In this paper we introduce a training method, called look-up table quantization, LUT-Q, which learns a dictionary and assigns each weight to one of the dictionary's values. We show that this method is very flexible and that many other techniques can be seen as special cases of LUT-Q. For example, we can constrain the dictionary trained with LUT-Q to generate networks with pruned weight matrices or restrict the dictionary to powers-of-two to avoid the need for multiplications. In order to obtain fully multiplier-less networks, we also introduce a multiplier-less version of batch normalization. Extensive experiments on image recognition and object detection tasks show that LUT-Q consistently achieves better performance than other methods with the same quantization bitwidth.
SD Jul 7, 2018
Improving DNN-based Music Source Separation using Phase Features
Joachim Muth, Stefan Uhlich, Nathanael Perraudin et al.
Music source separation with deep neural networks typically relies only on amplitude features. In this paper we show that additional phase features can improve the separation performance. Using the theoretical relationship between STFT phase and amplitude, we conjecture that derivatives of the phase are a good feature representation, as opposed to the raw phase. We verify this conjecture experimentally and propose a new DNN architecture which combines amplitude and phase. This joint approach achieves a better signal-to-distortion ratio on the DSD100 dataset for all instruments compared to a network that uses only amplitude features. The bass instrument in particular benefits from the phase information.