Romain Hérault

CV
h-index 23
19 papers
737 citations
Novelty 44%
AI Score 39

19 Papers

CV · Sep 12, 2023
SoccerNet 2023 Challenges Results

Anthony Cioppa, Silvio Giancola, Vladimir Somers et al. · pku

The SoccerNet 2023 challenges were the third annual video understanding challenges organized by the SoccerNet team. For this third edition, the challenges were composed of seven vision-based tasks split into three main themes. The first theme, broadcast video understanding, is composed of three high-level tasks related to describing events occurring in the video broadcasts: (1) action spotting, focusing on retrieving all timestamps related to global actions in soccer, (2) ball action spotting, focusing on retrieving all timestamps related to the soccer ball change of state, and (3) dense video captioning, focusing on describing the broadcast with natural language and anchored timestamps. The second theme, field understanding, relates to the single task of (4) camera calibration, focusing on retrieving the intrinsic and extrinsic camera parameters from images. The third and last theme, player understanding, is composed of three low-level tasks related to extracting information about the players: (5) re-identification, focusing on retrieving the same players across multiple views, (6) multiple object tracking, focusing on tracking players and the ball through unedited video streams, and (7) jersey number recognition, focusing on recognizing the jersey number of players from tracklets. Compared to the previous editions of the SoccerNet challenges, tasks (2), (3), and (7) are novel, including new annotations and data, task (4) was enhanced with more data and annotations, and task (6) now focuses on end-to-end approaches. More information on the tasks, challenges, and leaderboards is available on https://www.soccer-net.org. Baselines and development kits can be found on https://github.com/SoccerNet.

CV · Jun 15, 2022
Physically-admissible polarimetric data augmentation for road-scene analysis

Cyprien Ruffino, Rachel Blin, Samia Ainouz et al.

Polarimetric imaging, along with deep learning, has shown improved performance on different tasks including scene analysis. However, its robustness may be questioned because of the small size of the training datasets. Though the issue could be solved by data augmentation, polarization modalities are subject to physical feasibility constraints unaddressed by classical data augmentation techniques. To address this issue, we propose to use CycleGAN, an image translation technique based on deep generative models that solely relies on unpaired data, to transfer large labeled road scene datasets to the polarimetric domain. We design several auxiliary loss terms that, alongside the CycleGAN losses, deal with the physical constraints of polarimetric images. The efficiency of this solution is demonstrated on road scene object detection tasks, where the generated realistic polarimetric images improve car and pedestrian detection performance by up to 9%. The resulting constrained CycleGAN is publicly released, allowing anyone to generate their own polarimetric images.

CV · Dec 21, 2022
Similarity Contrastive Estimation for Image and Video Soft Contrastive Self-Supervised Learning

Julien Denize, Jaonary Rabarisoa, Astrid Orcesi et al.

Contrastive representation learning has proven to be an effective self-supervised learning method for images and videos. Most successful approaches are based on Noise Contrastive Estimation (NCE) and use different views of an instance as positives that should be contrasted with other instances, called negatives, that are considered as noise. However, several instances in a dataset are drawn from the same distribution and share underlying semantic information. A good data representation should contain relations between the instances, or semantic similarity and dissimilarity, that contrastive learning harms by considering all negatives as noise. To circumvent this issue, we propose a novel formulation of contrastive learning using semantic similarity between instances called Similarity Contrastive Estimation (SCE). Our training objective is a soft contrastive one that brings the positives closer and estimates a continuous distribution to push or pull negative instances based on their learned similarities. We empirically validate our approach on both image and video representation learning. We show that SCE performs competitively with the state of the art on the ImageNet linear evaluation protocol for fewer pretraining epochs and that it generalizes to several downstream image tasks. We also show that SCE reaches state-of-the-art results for pretraining video representation and that the learned representation can generalize to video downstream tasks.

CV · Nov 29, 2021 · Code
Similarity Contrastive Estimation for Self-Supervised Soft Contrastive Learning

Julien Denize, Jaonary Rabarisoa, Astrid Orcesi et al.

Contrastive representation learning has proven to be an effective self-supervised learning method. Most successful approaches are based on Noise Contrastive Estimation (NCE) and use different views of an instance as positives that should be contrasted with other instances, called negatives, that are considered as noise. However, several instances in a dataset are drawn from the same distribution and share underlying semantic information. A good data representation should contain relations, or semantic similarity, between the instances. Contrastive learning implicitly learns relations but considering all negatives as noise harms the quality of the learned relations. To circumvent this issue, we propose a novel formulation of contrastive learning using semantic similarity between instances called Similarity Contrastive Estimation (SCE). Our training objective is a soft contrastive learning one. Instead of hard classifying positives and negatives, we estimate from one view of a batch a continuous distribution to push or pull instances based on their semantic similarities. This target similarity distribution is sharpened to eliminate noisy relations. The model predicts for each instance, from another view, the target distribution while contrasting its positive with negatives. Experimental results show that SCE is Top-1 on the ImageNet linear evaluation protocol at 100 pretraining epochs with 72.1% accuracy and is competitive with state-of-the-art algorithms by reaching 75.4% for 200 epochs with multi-crop. We also show that SCE is able to generalize to several tasks. Source code is available here: https://github.com/CEA-LIST/SCE.
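The soft contrastive objective described above can be illustrated with a minimal sketch in plain Python. It is not the released CEA-LIST/SCE implementation: the function name `soft_contrastive_loss`, the mixing weight `lam`, and the temperatures `tau` and `tau_target` are assumptions chosen for illustration. The target distribution mixes a one-hot positive with a sharpened similarity distribution computed from one view, and the other view is trained to predict it through a cross-entropy.

```python
import math


def softmax(scores, temperature):
    """Numerically stable softmax of scores divided by a temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def soft_contrastive_loss(online_view, target_view, lam=0.5, tau=0.1, tau_target=0.05):
    """Illustrative soft contrastive loss (assumed form, not the authors' code).

    online_view, target_view: lists of L2-normalised embeddings, one per
    instance, coming from two augmented views of the same batch (size >= 2).
    """
    n = len(online_view)
    total = 0.0
    for i in range(n):
        # Relations between instance i and the other instances, estimated
        # inside the target view and sharpened with a small temperature.
        others = [j for j in range(n) if j != i]
        relations = softmax([dot(target_view[i], target_view[j]) for j in others], tau_target)

        # Target = lam * one-hot positive + (1 - lam) * sharpened relations.
        target = [0.0] * n
        for j, r in zip(others, relations):
            target[j] = (1.0 - lam) * r
        target[i] = lam

        # Prediction: instance i of the online view against the target view.
        pred = softmax([dot(online_view[i], target_view[j]) for j in range(n)], tau)

        # Cross-entropy between the soft target and the prediction.
        total -= sum(t * math.log(p) for t, p in zip(target, pred) if t > 0.0)
    return total / n
```

With `lam=1.0` the target collapses to the one-hot positive and the sketch reduces to a standard hard contrastive cross-entropy, which makes the role of the soft relations easy to isolate.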

LG · Sep 6, 2017 · Code
Neural Networks Regularization Through Class-wise Invariant Representation Learning

Soufiane Belharbi, Clément Chatelain, Romain Hérault et al.

Training deep neural networks is known to require a large number of training samples. However, in many applications only a few training samples are available. In this work, we tackle the issue of training neural networks for classification tasks when few training samples are available. We attempt to solve this issue by proposing a new regularization term that constrains the hidden layers of a network to learn class-wise invariant representations. In our regularization framework, learning invariant representations is generalized to the class membership where samples with the same class should have the same representation. Numerical experiments over MNIST and its variants showed that our proposal helps improve the generalization of neural networks, particularly when trained with few samples. We provide the source code of our framework at https://github.com/sbelharbi/learning-class-invariant-features.
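As a rough illustration of a class-wise invariance regularizer, the sketch below penalizes the squared distance between hidden representations of samples that share a label. The function name and the choice of a mean squared Euclidean distance are assumptions for illustration; the exact dissimilarity measure used in the paper is not given in this abstract.

```python
def classwise_invariance_penalty(hidden, labels):
    """Mean squared distance between hidden vectors of same-class pairs.

    hidden: list of hidden-layer activations (one list of floats per sample).
    labels: list of class labels aligned with `hidden`.
    Returns 0.0 when the batch contains no same-class pair.
    """
    total, pairs = 0.0, 0
    for i in range(len(hidden)):
        for j in range(i + 1, len(hidden)):
            if labels[i] == labels[j]:
                total += sum((a - b) ** 2 for a, b in zip(hidden[i], hidden[j]))
                pairs += 1
    return total / pairs if pairs else 0.0
```

In training, such a term would typically be added to the supervised loss with a weighting coefficient, so that samples of the same class are pulled toward a common representation.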

LG · Apr 28, 2015 · Code
Deep Neural Networks Regularization for Structured Output Prediction

Soufiane Belharbi, Romain Hérault, Clément Chatelain et al.

A deep neural network model is a powerful framework for learning representations. Usually, it is used to learn the relation $x \to y$ by exploiting the regularities in the input $x$. In structured output prediction problems, $y$ is multi-dimensional and structural relations often exist between the dimensions. The motivation of this work is to learn the output dependencies that may lie in the output data in order to improve the prediction accuracy. Unfortunately, feedforward networks are unable to exploit the relations between the outputs. In order to overcome this issue, we propose in this paper a regularization scheme for training neural networks for these particular tasks using a multi-task framework. Our scheme aims at incorporating the learning of the output representation $y$ in the training process in an unsupervised fashion while learning the supervised mapping function $x \to y$. We evaluate our framework on a facial landmark detection problem, which is a typical structured output task. We show on two challenging public datasets (LFPW and HELEN) that our regularization scheme improves the generalization of deep neural networks and accelerates their training. The use of unlabeled data and label-only data is also explored, showing an additional improvement of the results. We provide an open-source implementation (https://github.com/sbelharbi/structured-output-ae) of our framework.

CV · Oct 13, 2025
Pre to Post-Treatment Glioblastoma MRI Prediction using a Latent Diffusion Model

Alexandre G. Leclercq, Sébastien Bougleux, Noémie N. Moreau et al.

Glioblastoma (GBM) is an aggressive primary brain tumor with a median survival of approximately 15 months. In clinical practice, the Stupp protocol serves as the standard first-line treatment. However, patients exhibit highly heterogeneous therapeutic responses, which require at least two months before the first visual impact can be observed, typically with MRI. Early prediction of treatment response is crucial for advancing personalized medicine. Disease Progression Modeling (DPM) aims to capture the trajectory of disease evolution, while Treatment Response Prediction (TRP) focuses on assessing the impact of therapeutic interventions. Whereas most TRP approaches primarily rely on time-series data, we consider the problem of early visual TRP as a slice-to-slice translation model generating post-treatment MRI from a pre-treatment MRI, thus reflecting the tumor evolution. To address this problem we propose a Latent Diffusion Model with a concatenation-based conditioning from the pre-treatment MRI and the tumor localization, and a classifier-free guidance to enhance generation quality using survival information, in particular post-treatment tumor evolution. Our model was trained and tested on a local dataset consisting of 140 GBM patients collected at Centre François Baclesse. For each patient we collected pre- and post-treatment T1-Gd MRI, tumor localization manually delineated in the pre-treatment MRI by medical experts, and survival information.
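For readers unfamiliar with classifier-free guidance, the sketch below shows the standard way the conditional and unconditional noise predictions are combined at sampling time. It is the generic formulation, not the authors' code: the function name and `guidance_scale` are illustrative, and how survival information is encoded as the condition is not specified in this abstract.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine two noise predictions of a diffusion model.

    eps_uncond: prediction with the condition dropped (list of floats).
    eps_cond:   prediction with the condition supplied (list of floats).
    guidance_scale: 0 gives the unconditional prediction, 1 the conditional
    one, and values above 1 extrapolate further toward the condition.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```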

CV · Oct 11, 2024
TD-Paint: Faster Diffusion Inpainting Through Time Aware Pixel Conditioning

Tsiry Mayet, Pourya Shamsolmoali, Simon Bernard et al.

Diffusion models have emerged as highly effective techniques for inpainting; however, they remain constrained by slow sampling rates. While recent advances have enhanced generation quality, they have also increased sampling time, thereby limiting scalability in real-world applications. We investigate the generative sampling process of diffusion-based inpainting models and observe that these models make minimal use of the input condition during the initial sampling steps. As a result, the sampling trajectory deviates from the data manifold, requiring complex synchronization mechanisms to realign the generation process. To address this, we propose Time-aware Diffusion Paint (TD-Paint), a novel approach that adapts the diffusion process by modeling variable noise levels at the pixel level. This technique allows the model to efficiently use known pixel values from the start, guiding the generation process toward the target manifold. By embedding this information early in the diffusion process, TD-Paint significantly accelerates sampling without compromising image quality. Unlike conventional diffusion-based inpainting models, which require a dedicated architecture or an expensive generation loop, TD-Paint achieves faster sampling times without architectural modifications. Experimental results across three datasets show that TD-Paint outperforms state-of-the-art diffusion models while maintaining lower complexity.
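One way to picture pixel-level noise levels is a forward noising step in which each pixel carries its own timestep: known pixels stay at timestep 0 (clean) while unknown pixels are noised at the current step. The sketch below is an interpretation under stated assumptions, not the TD-Paint implementation: the cosine-style schedule `alpha_bar`, the function names, and the flat-list image layout are all hypothetical.

```python
import math
import random


def alpha_bar(t, num_steps):
    """Hypothetical cumulative signal level: 1 at t = 0, about 0 at t = num_steps."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2


def noise_with_pixel_times(image, known_mask, t, num_steps, rng):
    """Noise an image with a per-pixel timestep map.

    image:      flat list of pixel values.
    known_mask: flat list of booleans, True where the pixel value is given.
    Returns the noised pixels and the per-pixel timesteps that a time-aware
    denoiser would receive alongside them.
    """
    noisy, times = [], []
    for value, known in zip(image, known_mask):
        t_pixel = 0 if known else t  # known pixels are kept noise-free
        a = alpha_bar(t_pixel, num_steps)
        noisy.append(math.sqrt(a) * value + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0))
        times.append(t_pixel)
    return noisy, times
```

Because the known pixels are never corrupted, the condition is available to the model from the very first sampling step, which matches the motivation given in the abstract.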

CV · Dec 12, 2023
Adversarial Semi-Supervised Domain Adaptation for Semantic Segmentation: A New Role for Labeled Target Samples

Marwa Kechaou, Mokhtar Z. Alaya, Romain Hérault et al.

Adversarial learning baselines for domain adaptation (DA) approaches in the context of semantic segmentation are underexplored in the semi-supervised framework. These baselines involve solely the available labeled target samples in the supervision loss. In this work, we propose to enhance their usefulness on both semantic segmentation and the single domain classifier neural networks. We design new training objective losses for cases when labeled target data behave as source samples or as real target samples. The underlying rationale is that considering the set of labeled target samples as part of the source domain helps reduce the domain discrepancy and, hence, improves the contribution of the adversarial loss. To support our approach, we consider a complementary method that mixes source and labeled target data, then applies the same adaptation process. We further propose an unsupervised selection procedure using entropy to optimize the choice of labeled target samples for adaptation. We illustrate our findings through extensive experiments on the benchmarks GTA5, SYNTHIA, and Cityscapes. The empirical evaluation highlights the competitive performance of our proposed approach.

CV · Sep 3, 2023
COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action Spotting using Transformers

Julien Denize, Mykola Liashuha, Jaonary Rabarisoa et al.

We present COMEDIAN, a novel pipeline to initialize spatiotemporal transformers for action spotting, which involves self-supervised learning and knowledge distillation. Action spotting is a timestamp-level temporal action detection task. Our pipeline consists of three steps, with two initialization stages. First, we perform self-supervised initialization of a spatial transformer using short videos as input. Additionally, we initialize a temporal transformer that enhances the spatial transformer's outputs with global context through knowledge distillation from a pre-computed feature bank aligned with each short video segment. In the final step, we fine-tune the transformers to the action spotting task. The experiments, conducted on the SoccerNet-v2 dataset, demonstrate state-of-the-art performance and validate the effectiveness of COMEDIAN's pretraining paradigm. Our results highlight several advantages of our pretraining pipeline, including improved performance and faster convergence compared to non-pretrained models.

LG · Oct 2, 2020
Open Set Domain Adaptation using Optimal Transport

Marwa Kechaou, Romain Hérault, Mokhtar Z. Alaya et al.

We present a 2-step optimal transport approach that performs a mapping from a source distribution to a target distribution. Here, the target has the particularity to present new classes not present in the source domain. The first step of the approach aims at rejecting the samples issued from these new classes using an optimal transport plan. The second step solves the target (class ratio) shift still as an optimal transport problem. We develop a dual approach to solve the optimization problem involved at each step and we prove that our results outperform recent state-of-the-art performances. We further apply the approach to the setting where the source and target distributions present both a label-shift and an increasing covariate (features) shift to show its robustness.

IV · Feb 4, 2020
Pixel-wise Conditioned Generative Adversarial Networks for Image Synthesis and Completion

Cyprien Ruffino, Romain Hérault, Eric Laloy et al.

Generative Adversarial Networks (GANs) have proven successful for unsupervised image generation. Several works have extended GANs to image inpainting by conditioning the generation with parts of the image to be reconstructed. Despite their success, these methods have limitations in settings where only a small subset of the image pixels is known beforehand. In this paper we investigate the effectiveness of conditioning GANs when very few pixel values are provided. We propose a modelling framework which results in adding an explicit cost term to the GAN objective function to enforce pixel-wise conditioning. We investigate the influence of this regularization term on the quality of the generated images and the fulfillment of the given pixel constraints. Using the recent PacGAN technique, we ensure that we keep diversity in the generated samples. Conducted experiments on FashionMNIST show that the regularization term effectively controls the trade-off between quality of the generated images and the conditioning. Experimental evaluation on the CIFAR-10 and CelebA datasets evidences that our method achieves accurate results both visually and quantitatively in terms of Fréchet Inception Distance, while still enforcing the pixel conditioning. We also evaluate our method on a texture image generation task using fully-convolutional networks. As a final contribution, we apply the method to a classical geological simulation application.

CV · Nov 2, 2019
Pixel-wise Conditioning of Generative Adversarial Networks

Cyprien Ruffino, Romain Hérault, Eric Laloy et al.

Generative Adversarial Networks (GANs) have proven successful for unsupervised image generation. Several works extended GANs to image inpainting by conditioning the generation with parts of the image one wants to reconstruct. However, these methods have limitations in settings where only a small subset of the image pixels is known beforehand. In this paper, we study the effectiveness of conditioning GANs by adding an explicit regularization term to enforce pixel-wise conditions when very few pixel values are provided. We also investigate the influence of this regularization term on the quality of the generated images and the satisfaction of the conditions. Conducted experiments on MNIST and FashionMNIST show evidence that this regularization term allows for controlling the trade-off between quality of the generated images and constraint satisfaction.
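A pixel-wise regularization term of this kind can be sketched as a masked squared error added to the generator's adversarial loss. The names `pixel_constraint_penalty`, `generator_objective`, and `weight` are illustrative assumptions, and the squared-error form is one plausible choice rather than the formulation confirmed by this abstract.

```python
def pixel_constraint_penalty(generated, targets, mask):
    """Squared error restricted to the pixels whose value is imposed.

    generated: flat list of generated pixel values.
    targets:   flat list of imposed values (ignored where mask is 0).
    mask:      flat list of 0/1 flags, 1 where a pixel value is given.
    """
    return sum(m * (g - y) ** 2 for g, y, m in zip(generated, targets, mask))


def generator_objective(adversarial_loss, generated, targets, mask, weight):
    """Adversarial loss plus the weighted pixel-constraint penalty."""
    return adversarial_loss + weight * pixel_constraint_penalty(generated, targets, mask)
```

Raising `weight` favors constraint satisfaction, while lowering it leaves more freedom to the adversarial term, which is the trade-off the abstract refers to.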

CV · May 15, 2019
Dilated Spatial Generative Adversarial Networks for Ergodic Image Generation

Cyprien Ruffino, Romain Hérault, Eric Laloy et al.

Generative models have recently received renewed attention as a result of adversarial learning. Generative adversarial networks consist of a sample generation model and a discrimination model able to distinguish between genuine and synthetic samples. In combination with convolutional (for the discriminator) and de-convolutional (for the generator) layers, they are particularly suitable for image generation, especially of natural scenes. However, the presence of fully connected layers adds global dependencies in the generated images. This may lead to high and global variations in the generated sample for small local variations in the input noise. In this work we propose to use architectures based on fully convolutional networks (including, among others, dilated layers), architectures specifically designed to generate globally ergodic images, that is, images without global dependencies. Conducted experiments reveal that these architectures are well suited for generating natural textures such as geologic structures.

CV · Dec 12, 2018
An efficient supervised dictionary learning method for audio signal recognition

Imad Rida, Romain Hérault, Gilles Gasso

Machine hearing or listening represents an emerging area. Conventional approaches rely on the design of handcrafted features specialized to a specific audio task and that can hardly be generalized to other audio fields. For example, Mel-Frequency Cepstral Coefficients (MFCCs) and their variants were successfully applied to computational auditory scene recognition, while Chroma vectors are good at music chord recognition. Unfortunately, these predefined features may be of variable discrimination power when extended to other tasks or even within the same task due to the different nature of clips. Motivated by this need for a principled framework across domain applications for machine listening, we propose a generic and data-driven representation learning approach. To this end, a novel and efficient supervised dictionary learning method is presented. The method learns dissimilar dictionaries, one per class, in order to extract heterogeneous information for classification. In other words, we are seeking to minimize the intra-class homogeneity and maximize class separability. This is made possible by promoting pairwise orthogonality between class-specific dictionaries and controlling the sparsity structure of the audio clip's decomposition over these dictionaries. The resulting optimization problem is non-convex and solved using a proximal gradient descent method. Experiments are performed on both computational auditory scene (East Anglia and Rouen) and synthetic music chord recognition datasets. Obtained results show that our method is able to match state-of-the-art hand-crafted features for both applications.
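The pairwise orthogonality between class-specific dictionaries can be pictured as a penalty on the inner products between atoms of two different dictionaries, which equals the squared Frobenius norm of their cross-Gram matrix. The sketch below is an illustration under that assumption; the function name and the list-of-atoms layout are hypothetical, and the exact penalty used in the paper is not stated in this abstract.

```python
def cross_dictionary_coherence(dict_a, dict_b):
    """Sum of squared inner products between atoms of two dictionaries.

    dict_a, dict_b: lists of atoms, each atom being a list of floats of the
    same dimension. The value is 0.0 exactly when every atom of dict_a is
    orthogonal to every atom of dict_b.
    """
    return sum(
        sum(x * y for x, y in zip(atom_a, atom_b)) ** 2
        for atom_a in dict_a
        for atom_b in dict_b
    )
```

Summing this quantity over all pairs of distinct classes and adding it to the reconstruction objective would push each class dictionary toward its own subspace.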

ML · Oct 25, 2017
Inversion using a new low-dimensional representation of complex binary geological media based on a deep neural network

Eric Laloy, Romain Hérault, John Lee et al.

Efficient and high-fidelity prior sampling and inversion for complex geological media is still a largely unsolved challenge. Here, we use a deep neural network of the variational autoencoder type to construct a parametric low-dimensional base model parameterization of complex binary geological media. For inversion purposes, it has the attractive feature that random draws from an uncorrelated standard normal distribution yield model realizations with spatial characteristics that are in agreement with the training set. In comparison with the most commonly used parametric representations in probabilistic inversion, we find that our dimensionality reduction (DR) approach outperforms principal component analysis (PCA), optimization-PCA (OPCA) and discrete cosine transform (DCT) DR techniques for unconditional geostatistical simulation of a channelized prior model. For the considered examples, important compression ratios (200-500) are achieved. Given that the construction of our parameterization requires a training set of several tens of thousands of prior model realizations, our DR approach is more suited for probabilistic (or deterministic) inversion than for unconditional (or point-conditioned) geostatistical simulation. Probabilistic inversions of 2D steady-state and 3D transient hydraulic tomography data are used to demonstrate the DR-based inversion. For the 2D case study, the performance is superior compared to current state-of-the-art multiple-point statistics inversion by sequential geostatistical resampling (SGR). Inversion results for the 3D application are also encouraging.

ML · Aug 16, 2017
Training-image based geostatistical inversion using a spatial generative adversarial neural network

Eric Laloy, Romain Hérault, Diederik Jacques et al.

Probabilistic inversion within a multiple-point statistics framework is often computationally prohibitive for high-dimensional problems. To partly address this, we introduce and evaluate a new training-image based inversion approach for complex geologic media. Our approach relies on a deep neural network of the generative adversarial network (GAN) type. After training using a training image (TI), our proposed spatial GAN (SGAN) can quickly generate 2D and 3D unconditional realizations. A key characteristic of our SGAN is that it defines a (very) low-dimensional parameterization, thereby allowing for efficient probabilistic inversion using state-of-the-art Markov chain Monte Carlo (MCMC) methods. In addition, available direct conditioning data can be incorporated within the inversion. Several 2D and 3D categorical TIs are first used to analyze the performance of our SGAN for unconditional geostatistical simulation. Training our deep network can take several hours. After training, realizations containing a few million pixels/voxels can be produced in a matter of seconds. This makes it especially useful for simulating many thousands of realizations (e.g., for MCMC inversion) as the relative cost of the training per realization diminishes with the considered number of realizations. Synthetic inversion case studies involving 2D steady-state flow and 3D transient hydraulic tomography with and without direct conditioning data are used to illustrate the effectiveness of our proposed SGAN-based inversion. For the 2D case, the inversion rapidly explores the posterior model distribution. For the 3D case, the inversion recovers model realizations that fit the data close to the target level and visually resemble the true model well.

AP · Jun 23, 2015
Automatic sensor-based detection and classification of climbing activities

Jérémie Boulanger, Ludovic Seifert, Romain Hérault et al.

This article presents a method to automatically detect and classify climbing activities using inertial measurement units (IMUs) attached to the wrists, feet and pelvis of the climber. The IMUs record limb acceleration and angular velocity. Detection requires a learning phase with manual annotation to construct the statistical models used in the cusum algorithm. Full-body activity is then classified based on the detection of each IMU.
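To make the detection step concrete, here is a minimal one-sided CUSUM detector for an upward shift in the mean of a signal. It is the textbook mean-shift form, not the authors' method: the abstract says the statistical models are learned from annotated data, whereas this sketch assumes a known `baseline_mean`, and the names `drift` and `threshold` are illustrative.

```python
def cusum_detect(signal, baseline_mean, drift, threshold):
    """Return the index of the first detected upward change, or None.

    The cumulative statistic grows when samples exceed baseline_mean + drift
    and is reset to zero whenever it would become negative. A change is
    reported as soon as the statistic exceeds `threshold`.
    """
    statistic = 0.0
    for index, value in enumerate(signal):
        statistic = max(0.0, statistic + value - baseline_mean - drift)
        if statistic > threshold:
            return index
    return None
```

For example, applied to an acceleration magnitude that jumps from rest to sustained movement, the detector fires a few samples after the jump, with `drift` and `threshold` trading detection delay against false alarms.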

ML · Jan 7, 2014
Key point selection and clustering of swimmer coordination through Sparse Fisher-EM

John Komar, Romain Hérault, Ludovic Seifert

To investigate whether optimal swimmer learning/teaching strategies exist, this work introduces a two-level clustering to analyze the temporal dynamics of motor learning in breaststroke swimming. Each level has been performed through Sparse Fisher-EM, an unsupervised framework that can be applied efficiently to large and correlated datasets. The induced sparsity selects key points of the coordination phase without any prior knowledge.