Jiaxin Xie

CV
h-index19
11papers
325citations
Novelty52%
AI Score50

11 Papers

CVNov 28, 2022
High-fidelity 3D GAN Inversion by Pseudo-multi-view Optimization

Jiaxin Xie, Hao Ouyang, Jingtan Piao et al.

We present a high-fidelity 3D generative adversarial network (GAN) inversion framework that can synthesize photo-realistic novel views while preserving specific details of the input image. High-fidelity 3D GAN inversion is inherently challenging due to the geometry-texture trade-off in 3D inversion, where overfitting to a single view input image often damages the estimated geometry during the latent optimization. To solve this challenge, we propose a novel pipeline that builds on the pseudo-multi-view estimation with visibility analysis. We keep the original textures for the visible parts and utilize generative priors for the occluded parts. Extensive experiments show that our approach achieves advantageous reconstruction and novel view synthesis quality over state-of-the-art methods, even for images with out-of-distribution textures. The proposed pipeline also enables image attribute editing with the inverted latent code and 3D-aware texture modification. Our approach enables high-fidelity 3D rendering from a single image, which is promising for various applications of AI-generated 3D content.

83.5NAMay 28
On subspace-constrained preconditioning for randomized iterative methods

Yonghan Sun, Hou-Duo Qi, Deren Han et al.

In this paper, we further investigate and refine the subspace-constrained preconditioning technique to enhance the theoretical and numerical convergence properties of randomized iterative methods for solving linear systems. In particular, we design a QR-like factorization that transforms the original linear system into an equivalent block-orthogonal form, thus avoiding the full-rank assumptions adopted in existing work. Moreover, this reformulation reduces the problem to solving a smaller linear system with a favorable singular value distribution, provided an appropriate initial point is employed. The proposed framework can be implemented implicitly within the iteration and does not require explicitly constructing either a preconditioner matrix or a preconditioned linear system, which eliminates the prohibitive cost of forming a fully preconditioned system. Furthermore, we construct orthogonalized search directions from stochastic gradients and develop accelerated variants of the framework. We prove that the proposed algorithmic framework converges linearly in expectation. Numerical experiments demonstrate the benefits of the proposed preconditioning strategy.

27.5NAMay 24
Randomized conjugate gradient least squares

Yun Zeng, Jian-Feng Cai, Deren Han et al.

We develop a novel randomized conjugate gradient least squares (RCGLS) method for solving least-squares problems, in which iterative sketching is employed at each step to reduce the dimension and hence the computational cost. In particular, we propose a new perspective on the classical CGLS method, where the next descent direction is determined via a constraint correction problem associated with the gradient. Based on this insight, we replace the gradient with a randomized coordinate gradient that naturally satisfies the variance reduction property, leading directly to the proposed RCGLS method. We prove that RCGLS converges linearly in expectation, with a better convergence bound compared to the randomized coordinate descent method. Furthermore, we investigate an implementation of the method that avoids full-dimensional vector operations, which are the major bottleneck of vanilla RCGLS for sparse matrices and render it impractical. We also show how to apply the RCGLS method to solve the ridge regression problem, yielding a lightweight, parallelizable, and accelerated method for such problems. Numerical experiments are provided to confirm our results.

CVDec 21, 2021Code
Shape from Polarization for Complex Scenes in the Wild

Chenyang Lei, Chenyang Qi, Jiaxin Xie et al.

We present a new data-driven approach with physics-based priors to scene-level normal estimation from a single polarization image. Existing shape from polarization (SfP) works mainly focus on estimating the normal of a single object rather than complex scenes in the wild. A key barrier to high-quality scene-level SfP is the lack of real-world SfP data in complex scenes. Hence, we contribute the first real-world scene-level SfP dataset with paired input polarization images and ground-truth normal maps. Then we propose a learning-based framework with a multi-head self-attention module and viewing encoding, which is designed to handle increasing polarization ambiguities caused by complex materials and non-orthographic projection in scene-level SfP. Our trained model can be generalized to far-field outdoor scenes as the relationship between polarized light and surface normals is not affected by distance. Experimental results demonstrate that our approach significantly outperforms existing SfP models on two datasets. Our dataset and source code will be publicly available at https://github.com/ChenyangLEI/sfp-wild

CVDec 23, 2024
Large Motion Video Autoencoding with Cross-modal Video VAE

Yazhou Xing, Yang Fei, Yingqing He et al.

Learning a robust video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Directly applying image VAEs to individual frames in isolation can result in temporal inconsistencies and suboptimal compression rates due to a lack of temporal compression. Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. In this paper, we present a novel and powerful video autoencoder capable of high-fidelity video encoding. First, we observe that entangling spatial and temporal compression by merely extending the image VAE to a 3D VAE can introduce motion blur and detail distortion artifacts. Thus, we propose temporal-aware spatial compression to better encode and decode the spatial information. Additionally, we integrate a lightweight motion compression model for further temporal compression. Second, we propose to leverage the textual information inherent in text-to-video datasets and incorporate text guidance into our model. This significantly enhances reconstruction quality, particularly in terms of detail preservation and temporal stability. Third, we further improve the versatility of our model through joint training on both images and videos, which not only enhances reconstruction quality but also enables the model to perform both image and video autoencoding. Extensive evaluations against strong recent baselines demonstrate the superior performance of our method. The project website can be found at~\href{https://yzxing87.github.io/vae/}{https://yzxing87.github.io/vae/}.

CVDec 25, 2024
ModelGrow: Continual Text-to-Video Pre-training with Model Expansion and Language Understanding Enhancement

Zhefan Rao, Liya Ji, Yazhou Xing et al.

Text-to-video (T2V) generation has gained significant attention recently. However, the costs of training a T2V model from scratch remain persistently high, and there is considerable room for improving the generation performance, especially under limited computation resources. This work explores the continual general pre-training of text-to-video models, enabling the model to "grow" its abilities based on a pre-trained foundation, analogous to how humans acquire new knowledge based on past experiences. There is a lack of extensive study of the continual pre-training techniques in T2V generation. In this work, we take the initial step toward exploring this task systematically and propose ModelGrow. Specifically, we break this task into two key aspects: increasing model capacity and improving semantic understanding. For model capacity, we introduce several novel techniques to expand the model size, enabling it to store new knowledge and improve generation performance. For semantic understanding, we propose a method that leverages large language models as advanced text encoders, integrating them into T2V models to enhance language comprehension and guide generation results according to detailed prompts. This approach enables the model to achieve better semantic alignment, particularly in response to complex user prompts. Extensive experiments demonstrate the effectiveness of our method across various metrics. The source code and the model of ModelGrow will be publicly available.

38.6NAApr 7
Linear convergence of Gearhart-Koshy accelerated Kaczmarz methods for tensor linear systems

Yijie Wang, Yonghan Sun, Deren Han et al.

The generalized Gearhart-Koshy acceleration is a recent exact affine search technique designed for the method of cyclic projections onto hyperplanes, i.e., the Kaczmarz method. However, its convergence properties, particularly the linear convergence rate, have not been thoroughly established. In this paper, we systematically establish the linear convergence of the generalized Gearhart-Koshy accelerated Kaczmarz method for tensor linear systems, proving that it converges linearly to the unique least-norm solution. Our analysis is general and applies to several popular Kaczmarz variants, including incremental, shuffle-once, and random-reshuffling schemes, and demonstrates that this acceleration approach yields a better convergence upper bound compared to the plain Kaczmarz method. We also propose an efficient Gram-Schmidt-based implementation that computes the next iterate in linear time. Building on this implementation, we establish a connection between this acceleration framework and Arnoldi-type Krylov subspace methods, further highlighting its efficiency and potential. Our theoretical results are supported by numerical experiments.

CVMar 31, 2025
Pan-LUT: Efficient Pan-sharpening via Learnable Look-Up Tables

Zhongnan Cai, Yingying Wang, Yunlong Lin et al.

Recently, deep learning-based pan-sharpening algorithms have achieved notable advancements over traditional methods. However, many deep learning-based approaches incur substantial computational overhead during inference, especially with high-resolution images. This excessive computational demand limits the applicability of these methods in real-world scenarios, particularly in the absence of dedicated computing devices such as GPUs and TPUs. To address these challenges, we propose Pan-LUT, a novel learnable look-up table (LUT) framework for pan-sharpening that strikes a balance between performance and computational efficiency for high-resolution remote sensing images. To finely control the spectral transformation, we devise the PAN-guided look-up table (PGLUT) for channel-wise spectral mapping. To effectively capture fine-grained spatial details and adaptively learn local contexts, we introduce the spatial details look-up table (SDLUT) and adaptive aggregation look-up table (AALUT). Our proposed method contains fewer than 300K parameters and processes a 8K resolution image in under 1 ms using a single NVIDIA GeForce RTX 2080 Ti GPU, demonstrating significantly faster performance compared to other methods. Experiments reveal that Pan-LUT efficiently processes large remote sensing images in a lightweight manner, bridging the gap to real-world applications. Furthermore, our model surpasses SOTA methods in full-resolution scenes under real-world conditions, highlighting its effectiveness and efficiency.

CVSep 3, 2021
Dual-Camera Super-Resolution with Aligned Attention Modules

Tengfei Wang, Jiaxin Xie, Wenxiu Sun et al.

We present a novel approach to reference-based super-resolution (RefSR) with the focus on dual-camera super-resolution (DCSR), which utilizes reference images for high-quality and high-fidelity results. Our proposed method generalizes the standard patch-based feature matching with spatial alignment operations. We further explore the dual-camera super-resolution that is one promising application of RefSR, and build a dataset that consists of 146 image pairs from the main and telephoto cameras in a smartphone. To bridge the domain gaps between real-world images and the training images, we propose a self-supervised domain adaptation strategy for real-world images. Extensive experiments on our dataset and a public benchmark demonstrate clear improvement achieved by our method over state of the art in both quantitative evaluation and visual comparisons.

CVApr 7, 2020
Depth Sensing Beyond LiDAR Range

Kai Zhang, Jiaxin Xie, Noah Snavely et al.

Depth sensing is a critical component of autonomous driving technologies, but today's LiDAR- or stereo camera-based solutions have limited range. We seek to increase the maximum range of self-driving vehicles' depth perception modules for the sake of better safety. To that end, we propose a novel three-camera system that utilizes small field of view cameras. Our system, along with our novel algorithm for computing metric depth, does not require full pre-calibration and can output dense depth maps with practically acceptable accuracy for scenes and objects at long distances not well covered by most commercial LiDARs.

CVDec 30, 2019
Video Depth Estimation by Fusing Flow-to-Depth Proposals

Jiaxin Xie, Chenyang Lei, Zhuwen Li et al.

Depth from a monocular video can enable billions of devices and robots with a single camera to see the world in 3D. In this paper, we present an approach with a differentiable flow-to-depth layer for video depth estimation. The model consists of a flow-to-depth layer, a camera pose refinement module, and a depth fusion network. Given optical flow and camera pose, our flow-to-depth layer generates depth proposals and the corresponding confidence maps by explicitly solving an epipolar geometry optimization problem. Our flow-to-depth layer is differentiable, and thus we can refine camera poses by maximizing the aggregated confidence in the camera pose refinement module. Our depth fusion network can utilize depth proposals and their confidence maps inferred from different adjacent frames to produce the final depth map. Furthermore, the depth fusion network can additionally take the depth proposals generated by other methods to improve the results further. The experiments on three public datasets show that our approach outperforms state-of-the-art depth estimation methods, and has reasonable cross dataset generalization capability: our model trained on KITTI still performs well on the unseen Waymo dataset.