CVOct 30, 2024Code
Bringing NeRFs to the Latent Space: Inverse Graphics AutoencoderAntoine Schnepf, Karim Kassab, Jean-Yves Franceschi et al.
While pre-trained image autoencoders are increasingly utilized in computer vision, the application of inverse graphics in 2D latent spaces has been under-explored. Yet, besides reducing the training and rendering complexity, applying inverse graphics in the latent space enables a valuable interoperability with other latent-based 2D methods. The major challenge is that inverse graphics cannot be directly applied to such image latent spaces because they lack an underlying 3D geometry. In this paper, we propose an Inverse Graphics Autoencoder (IG-AE) that specifically addresses this issue. To this end, we regularize an image autoencoder with 3D-geometry by aligning its latent space with jointly trained latent 3D scenes. We utilize the trained IG-AE to bring NeRFs to the latent space with a latent NeRF training pipeline, which we implement in an open-source extension of the Nerfstudio framework, thereby unlocking latent scene learning for its supported methods. We experimentally confirm that Latent NeRFs trained with IG-AE present an improved quality compared to a standard autoencoder, all while exhibiting training and rendering accelerations with respect to NeRFs trained in the image space. Our project page can be found at https://ig-ae.github.io .
CVOct 31, 2024Code
Fused-Planes: Improving Planar Representations for Learning Large Sets of 3D ScenesKarim Kassab, Antoine Schnepf, Jean-Yves Franceschi et al.
To learn large sets of scenes, Tri-Planes are commonly employed for their planar structure that enables an interoperability with image models, and thus diverse 3D applications. However, this advantage comes at the cost of resource efficiency, as Tri-Planes are not the most computationally efficient option. In this paper, we introduce Fused-Planes, a new planar architecture that improves Tri-Planes resource-efficiency in the framework of learning large sets of scenes, which we call "multi-scene inverse graphics". To learn a large set of scenes, our method divides it into two subsets and operates as follows: (i) we train the first subset of scenes jointly with a compression model, (ii) we use that compression model to learn the remaining scenes. This compression model consists of a 3D-aware latent space in which Fused-Planes are learned, enabling a reduced rendering resolution, and shared structures across scenes that reduce scene representation complexity. Fused-Planes present competitive resource costs in multi-scene inverse graphics, while preserving Tri-Planes rendering quality, and maintaining their widely favored planar structure. Our codebase is publicly available as open-source. Our project page can be found at https://fused-planes.github.io .
IRAug 12, 2025
Closing the Performance Gap in Generative Recommenders with Collaborative Tokenization and Efficient ModelingSimon Lepage, Jeremie Mary, David Picard
Recent work has explored generative recommender systems as an alternative to traditional ID-based models, reframing item recommendation as a sequence generation task over discrete item tokens. While promising, such methods often underperform in practice compared to well-tuned ID-based baselines like SASRec. In this paper, we identify two key limitations holding back generative approaches: the lack of collaborative signal in item tokenization, and inefficiencies in the commonly used encoder-decoder architecture. To address these issues, we introduce COSETTE, a contrastive tokenization method that integrates collaborative information directly into the learned item representations, jointly optimizing for both content reconstruction and recommendation relevance. Additionally, we propose MARIUS, a lightweight, audio-inspired generative model that decouples timeline modeling from item decoding. MARIUS reduces inference cost while improving recommendation accuracy. Experiments on standard sequential recommendation benchmarks show that our approach narrows, or even eliminates, the performance gap between generative and modern ID-based models, while retaining the benefits of the generative paradigm.
LGAug 5, 2025
Markov Chain Estimation with In-Context LearningSimon Lepage, Jeremie Mary, David Picard
We investigate the capacity of transformers to learn algorithms involving their context while solely being trained using next token prediction. We set up Markov chains with random transition matrices and we train transformers to predict the next token. Matrices used during training and test are different and we show that there is a threshold in transformer size and in training set size above which the model is able to learn to estimate the transition probabilities from its context instead of memorizing the training patterns. Additionally, we show that more involved encoding of the states enables more robust prediction for Markov chains with structures different than those seen during training.
CVMar 18, 2024
Exploring 3D-aware Latent Spaces for Efficiently Learning Numerous ScenesAntoine Schnepf, Karim Kassab, Jean-Yves Franceschi et al.
We present a method enabling the scaling of NeRFs to learn a large number of semantically-similar scenes. We combine two techniques to improve the required training time and memory cost per scene. First, we learn a 3D-aware latent space in which we train Tri-Plane scene representations, hence reducing the resolution at which scenes are learned. Moreover, we present a way to share common information across scenes, hence allowing for a reduction of model complexity to learn a particular scene. Our method reduces effective per-scene memory costs by 44% and per-scene time costs by 86% when training 1000 scenes. Our project page can be found at https://3da-ae.github.io .
LGOct 19, 2021
Latent reweighting, an almost free improvement for GANsThibaut Issenhuth, Ugo Tanielian, David Picard et al.
Standard formulations of GANs, where a continuous function deforms a connected latent space, have been shown to be misspecified when fitting different classes of images. In particular, the generator will necessarily sample some low-quality images in between the classes. Rather than modifying the architecture, a line of works aims at improving the sampling quality from pre-trained generators at the expense of increased computational cost. Building on this, we introduce an additional network to predict latent importance weights and two associated sampling methods to avoid the poorest samples. This idea has several advantages: 1) it provides a way to inject disconnectedness into any GAN architecture, 2) since the rejection happens in the latent space, it avoids going through both the generator and the discriminator, saving computation time, 3) this importance weights formulation provides a principled way to reduce the Wasserstein's distance to the target distribution. We demonstrate the effectiveness of our method on several datasets, both synthetic and high-dimensional.
MLJun 8, 2020
Learning disconnected manifolds: a no GANs landUgo Tanielian, Thibaut Issenhuth, Elvis Dohmatob et al.
Typical architectures of Generative AdversarialNetworks make use of a unimodal latent distribution transformed by a continuous generator. Consequently, the modeled distribution always has connected support which is cumbersome when learning a disconnected set of manifolds. We formalize this problem by establishing a no free lunch theorem for the disconnected manifold learning stating an upper bound on the precision of the targeted distribution. This is done by building on the necessary existence of a low-quality region where the generator continuously samples data between two disconnected modes. Finally, we derive a rejection sampling method based on the norm of generators Jacobian and show its efficiency on several generators including BigGAN.
CLMar 15, 2017
End-to-end optimization of goal-driven and visually grounded dialogue systemsFlorian Strub, Harm de Vries, Jeremie Mary et al.
End-to-end design of dialogue systems has recently become a popular research topic thanks to powerful tools such as encoder-decoder architectures for sequence-to-sequence learning. Yet, most current approaches cast human-machine dialogue management as a supervised learning problem, aiming at predicting the next utterance of a participant given the full history of the dialogue. This vision is too simplistic to render the intrinsic planning problem inherent to dialogue as well as its grounded nature, making the context of a dialogue larger than the sole history. This is why only chit-chat and question answering tasks have been addressed so far using end-to-end architectures. In this paper, we introduce a Deep Reinforcement Learning method to optimize visually grounded task-oriented dialogues, based on the policy gradient algorithm. This approach is tested on a dataset of 120k dialogues collected through Mechanical Turk and provides encouraging results at solving both the problem of generating natural dialogues and the task of discovering a specific object in a complex picture.
IRMar 2, 2016
Hybrid Collaborative Filtering with AutoencodersFlorian Strub, Jeremie Mary, Romaric Gaudel
Collaborative Filtering aims at exploiting the feedback of users to provide personalised recommendations. Such algorithms look for latent variables in a large sparse matrix of ratings. They can be enhanced by adding side information to tackle the well-known cold start problem. While Neu-ral Networks have tremendous success in image and speech recognition, they have received less attention in Collaborative Filtering. This is all the more surprising that Neural Networks are able to discover latent variables in large and heterogeneous datasets. In this paper, we introduce a Collaborative Filtering Neural network architecture aka CFN which computes a non-linear Matrix Factorization from sparse rating inputs and side information. We show experimentally on the MovieLens and Douban dataset that CFN outper-forms the state of the art and benefits from side information. We provide an implementation of the algorithm as a reusable plugin for Torch, a popular Neural Network framework.