CV | AI | IV | Oct 17, 2023

FocDepthFormer: Transformer with latent LSTM for Depth Estimation from Focal Stack

arXiv:2310.11178v3 | 3 citations | h-index: 40
Originality: Incremental advance
AI Analysis

The paper addresses a generalization problem in computer vision that matters for applications like robotics and AR/VR, though the advance is incremental because it builds on existing Transformer and LSTM techniques.

The paper tackles depth estimation from focal stacks of varying lengths by proposing FocDepthFormer, a Transformer-based network with an LSTM module, which the authors report outperforms state-of-the-art methods on benchmark datasets.

Most existing methods for depth estimation from a focal stack of images employ convolutional neural networks (CNNs) using 2D or 3D convolutions over a fixed set of images. However, their effectiveness is constrained by the local properties of CNN kernels, and they can only process focal stacks with a fixed number of images during both training and inference. This limitation hampers their ability to generalize to stacks of arbitrary lengths. To overcome these limitations, we present a novel Transformer-based network, FocDepthFormer, which integrates a Transformer with an LSTM module and a CNN decoder. The Transformer's self-attention mechanism allows the network to learn more informative spatial features by implicitly performing non-local cross-referencing. The LSTM module is designed to integrate representations across image stacks of varying lengths. Additionally, we employ multi-scale convolutional kernels in an early-stage encoder to capture low-level features at different degrees of focus/defocus. By incorporating the LSTM, FocDepthFormer can be pre-trained on large-scale monocular RGB depth estimation datasets, improving visual pattern learning and reducing reliance on difficult-to-obtain focal stack data. Extensive experiments on diverse focal stack benchmark datasets demonstrate that our model outperforms state-of-the-art approaches across multiple evaluation metrics.
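
The abstract's claim that a recurrent module lets the network accept focal stacks of any length can be illustrated with a small sketch. This is not the authors' implementation: it is a plain-Python LSTM cell with random, untrained weights, and the per-image feature vectors are placeholders standing in for whatever a Transformer encoder would produce. The names `TinyLSTMCell` and `fuse_stack` are hypothetical. The point is only that the fused state has the same size whether the stack holds 2 images or 7.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTMCell:
    """Minimal LSTM cell with random weights (illustrative, untrained)."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.hidden_size = hidden_size
        n = input_size + hidden_size
        # One weight matrix and bias per gate:
        # i = input, f = forget, g = candidate, o = output.
        self.W = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                   for _ in range(hidden_size)]
            for gate in "ifgo"
        }
        self.b = {gate: [0.0] * hidden_size for gate in "ifgo"}

    def _affine(self, gate, z):
        return [sum(w * v for w, v in zip(row, z)) + bias
                for row, bias in zip(self.W[gate], self.b[gate])]

    def step(self, x, h, c):
        z = x + h  # concatenate current input and previous hidden state
        i = [sigmoid(a) for a in self._affine("i", z)]
        f = [sigmoid(a) for a in self._affine("f", z)]
        cand = [math.tanh(a) for a in self._affine("g", z)]
        o = [sigmoid(a) for a in self._affine("o", z)]
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, cand)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def fuse_stack(cell, stack_features):
    """Fold a stack of per-image feature vectors into one hidden state."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    for x in stack_features:  # one step per image in the focal stack
        h, c = cell.step(x, h, c)
    return h


cell = TinyLSTMCell(input_size=4, hidden_size=3)
rng = random.Random(1)
# Placeholder features for a 2-image stack and a 7-image stack.
short_stack = [[rng.random() for _ in range(4)] for _ in range(2)]
long_stack = [[rng.random() for _ in range(4)] for _ in range(7)]

fused_short = fuse_stack(cell, short_stack)
fused_long = fuse_stack(cell, long_stack)
print(len(fused_short), len(fused_long))  # both have hidden_size entries
```

Because the same cell weights are reused at every step, nothing in the model depends on the stack length. A fixed-depth 3D convolution, by contrast, ties its kernel shape to the number of images. A single-image input is simply a stack of length one, which is consistent with the abstract's remark about pre-training on monocular data.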

