Yuxin Hou

CV
h-index47
10papers
135citations
Novelty54%
AI Score39

10 Papers

CVNov 1, 2022
Expansion of Visual Hints for Improved Generalization in Stereo Matching

Andrea Pilzer, Yuxin Hou, Niki Loppi et al.

We introduce visual hints expansion for guiding stereo matching to improve generalization. Our work is motivated by the robustness of Visual Inertial Odometry (VIO) in computer vision and robotics, where a sparse and unevenly distributed set of feature points characterizes a scene. To improve stereo matching, we propose to elevate 2D hints to 3D points. These sparse and unevenly distributed 3D visual hints are expanded using a 3D random geometric graph, which enhances the learning and inference process. We evaluate our proposal on multiple widely adopted benchmarks and show improved performance without access to additional sensors other than the image sequence. To highlight practical applicability and symbiosis with visual odometry, we demonstrate how our methods run on embedded hardware.

LGOct 8, 2025Code
Generative World Modelling for Humanoids: 1X World Model Challenge Technical Report

Riccardo Mereu, Aidan Scannell, Yuxin Hou et al.

World models are a powerful paradigm in AI and robotics, enabling agents to reason about the future by predicting visual observations or compact latent states. The 1X World Model Challenge introduces an open-source benchmark of real-world humanoid interaction, with two complementary tracks: sampling, focused on forecasting future image frames, and compression, focused on predicting future discrete latent codes. For the sampling track, we adapt the video generation foundation model Wan-2.2 TI2V-5B to video-state-conditioned future frame prediction. We condition the video generation on robot states using AdaLN-Zero, and further post-train the model using LoRA. For the compression track, we train a Spatio-Temporal Transformer model from scratch. Our models achieve 23.0 dB PSNR in the sampling task and a Top-500 CE of 6.6386 in the compression task, securing 1st place in both challenges.

LGFeb 26, 2025
Efficient Reinforcement Learning by Guiding Generalist World Models with Non-Curated Data

Yi Zhao, Aidan Scannell, Wenshuai Zhao et al.

Leveraging offline data is a promising way to improve the sample efficiency of online reinforcement learning (RL). This paper expands the pool of usable data for offline-to-online RL by leveraging abundant non-curated data that is reward-free, of mixed quality, and collected across multiple embodiments. Although learning a world model appears promising for utilizing such data, we find that naive fine-tuning fails to accelerate RL training on many tasks. Through careful investigation, we attribute this failure to the distributional shift between offline and online data during fine-tuning. To address this issue and effectively use the offline data, we propose two essential techniques: \emph{i)} experience rehearsal and \emph{ii)} execution guidance. With these modifications, the non-curated offline data substantially improves RL's sample efficiency. Under limited sample budgets, our method achieves a 102.8\% relative improvement in aggregate score over learning-from-scratch baselines across 72 visuomotor tasks spanning 6 embodiments. On challenging tasks such as locomotion and robotic manipulation, it outperforms prior methods that utilize offline data by a decent margin.

CVJan 5, 2021
Novel View Synthesis via Depth-guided Skip Connections

Yuxin Hou, Arno Solin, Juho Kannala

We introduce a principled approach for synthesizing new views of a scene given a single source image. Previous methods for novel view synthesis can be divided into image-based rendering methods (e.g. flow prediction) or pixel generation methods. Flow predictions enable the target view to re-use pixels directly, but can easily lead to distorted results. Directly regressing pixels can produce structurally consistent results but generally suffer from the lack of low-level details. In this paper, we utilize an encoder-decoder architecture to regress pixels of a target view. In order to maintain details, we couple the decoder aligned feature maps with skip connections, where the alignment is guided by predicted depth map of the target view. Our experimental results show that our method does not suffer from distortions and successfully preserves texture details with aligned skip connections.

CVOct 18, 2020
Movement-induced Priors for Deep Stereo

Yuxin Hou, Muhammad Kamran Janjua, Juho Kannala et al.

We propose a method for fusing stereo disparity estimation with movement-induced prior information. Instead of independent inference frame-by-frame, we formulate the problem as a non-parametric learning task in terms of a temporal Gaussian process prior with a movement-driven kernel for inter-frame reasoning. We present a hierarchy of three Gaussian process kernels depending on the availability of motion information, where our main focus is on a new gyroscope-driven kernel for handheld devices with low-quality MEMS sensors, thus also relaxing the requirement of having full 6D camera poses available. We show how our method can be combined with two state-of-the-art deep stereo methods. The method either work in a plug-and-play fashion with pre-trained deep stereo networks, or further improved by jointly training the kernels together with encoder-decoder architectures, leading to consistent improvement.

LGDec 21, 2019
Deep Automodulators

Ari Heljakka, Yuxin Hou, Juho Kannala et al.

We introduce a new category of generative autoencoders called automodulators. These networks can faithfully reproduce individual real-world input images like regular autoencoders, but also generate a fused sample from an arbitrary combination of several such images, allowing instantaneous 'style-mixing' and other new applications. An automodulator decouples the data flow of decoder operations from statistical properties thereof and uses the latent vector to modulate the former by the latter, with a principled approach for mutual disentanglement of decoder layers. Prior work has explored similar decoder architecture with GANs, but their focus has been on random sampling. A corresponding autoencoder could operate on real input images. For the first time, we show how to train such a general-purpose model with sharp outputs in high resolution, using novel training techniques, demonstrated on four image data sets. Besides style-mixing, we show state-of-the-art results in autoencoder comparison, and visual image quality nearly indistinguishable from state-of-the-art GANs. We expect the automodulator variants to become a useful building block for image applications and other data domains.

MLDec 6, 2019
Gaussian Process Priors for View-Aware Inference

Yuxin Hou, Ari Heljakka, Arno Solin

While frame-independent predictions with deep neural networks have become the prominent solutions to many computer vision tasks, the potential benefits of utilizing correlations between frames have received less attention. Even though probabilistic machine learning provides the ability to encode correlation as prior knowledge for inference, there is a tangible gap between the theory and practice of applying probabilistic methods to modern vision problems. For this, we derive a principled framework to combine information coupling between camera poses (translation and orientation) with deep models. We proposed a novel view kernel that generalizes the standard periodic kernel in $\mathrm{SO}(3)$. We show how this soft-prior knowledge can aid several pose-related vision tasks like novel view synthesis and predict arbitrary points in the latent space of generative models, pointing towards a range of new applications for inter-frame reasoning.

CVJun 2, 2019
Iterative Path Reconstruction for Large-Scale Inertial Navigation on Smartphones

Santiago Cortés Reina, Yuxin Hou, Juho Kannala et al.

Modern smartphones have all the sensing capabilities required for accurate and robust navigation and tracking. In specific environments some data streams may be absent, less reliable, or flat out wrong. In particular, the GNSS signal can become flawed or silent inside buildings or in streets with tall buildings. In this application paper, we aim to advance the current state-of-the-art in motion estimation using inertial measurements in combination with partial GNSS data on standard smartphones. We show how iterative estimation methods help refine the positioning path estimates in retrospective use cases that can cover both fixed-interval and fixed-lag scenarios. We compare estimation results provided by global iterated Kalman filtering methods to those of a visual-inertial tracking scheme (Apple ARKit). The practical applicability is demonstrated on real-world use cases on empirical data acquired from both smartphones and tablet devices.

CVApr 12, 2019
Multi-View Stereo by Temporal Nonparametric Fusion

Yuxin Hou, Juho Kannala, Arno Solin

We propose a novel idea for depth estimation from multi-view image-pose pairs, where the model has capability to leverage information from previous latent-space encodings of the scene. This model uses pairs of images and poses, which are passed through an encoder--decoder model for disparity estimation. The novelty lies in soft-constraining the bottleneck layer by a nonparametric Gaussian process prior. We propose a pose-kernel structure that encourages similar poses to have resembling latent spaces. The flexibility of the Gaussian process (GP) prior provides adapting memory for fusing information from previous views. We train the encoder--decoder and the GP hyperparameters jointly end-to-end. In addition to a batch method, we derive a lightweight estimation scheme that circumvents standard pitfalls in scaling Gaussian process inference, and demonstrate how our scheme can run in real-time on smart devices.

CVFeb 6, 2019
Unstructured Multi-View Depth Estimation Using Mask-Based Multiplane Representation

Yuxin Hou, Arno Solin, Juho Kannala

This paper presents a novel method, MaskMVS, to solve depth estimation for unstructured multi-view image-pose pairs. In the plane-sweep procedure, the depth planes are sampled by histogram matching that ensures covering the depth range of interest. Unlike other plane-sweep methods, we do not rely on a cost metric to explicitly build the cost volume, but instead infer a multiplane mask representation which regularizes the learning. Compared to many previous approaches, we show that our method is lightweight and generalizes well without requiring excessive training. We outperform the current state-of-the-art and show results on the sun3d, scenes11, MVS, and RGBD test data sets.