Lingtong Kong

h-index8

12papers

360citations

Novelty52%

AI Score43

Ranked #52,605 of 194,257 authors (top 27%)#18,301 in CV (top 31%)

12 Papers

29.7CVMay 29, 2022Code

IFRNet: Intermediate Feature Refine Network for Efficient Frame Interpolation

Lingtong Kong, Boyuan Jiang, Donghao Luo et al. · tencent-ai

Prevailing video frame interpolation algorithms, that generate the intermediate frames from consecutive inputs, typically rely on complex model architectures with heavy parameters or large delay, hindering them from diverse real-time applications. In this work, we devise an efficient encoder-decoder based network, termed IFRNet, for fast intermediate frame synthesizing. It first extracts pyramid features from given inputs, and then refines the bilateral intermediate flow fields together with a powerful intermediate feature until generating the desired output. The gradually refined intermediate feature can not only facilitate intermediate flow estimation, but also compensate for contextual details, making IFRNet do not need additional synthesis or refinement module. To fully release its potential, we further propose a novel task-oriented optical flow distillation loss to focus on learning the useful teacher knowledge towards frame synthesizing. Meanwhile, a new geometry consistency regularization term is imposed on the gradually refined intermediate features to keep better structure layout. Experiments on various benchmarks demonstrate the excellent performance and fast inference speed of proposed approaches. Code is available at https://github.com/ltkong218/IFRNet.

5.9CVSep 11, 2023Code

Towards Better Data Exploitation in Self-Supervised Monocular Depth Estimation

Jinfeng Liu, Lingtong Kong, Jie Yang et al.

Depth estimation plays an important role in the robotic perception system. Self-supervised monocular paradigm has gained significant attention since it can free training from the reliance on depth annotations. Despite recent advancements, existing self-supervised methods still underutilize the available training data, limiting their generalization ability. In this paper, we take two data augmentation techniques, namely Resizing-Cropping and Splitting-Permuting, to fully exploit the potential of training datasets. Specifically, the original image and the generated two augmented images are fed into the training pipeline simultaneously and we leverage them to conduct self-distillation. Additionally, we introduce the detail-enhanced DepthNet with an extra full-scale branch in the encoder and a grid decoder to enhance the restoration of fine details in depth maps. Experimental results demonstrate our method can achieve state-of-the-art performance on the KITTI benchmark, with both raw ground truth and improved ground truth. Moreover, our models also show superior generalization performance when transferring to Make3D and NYUv2 datasets. Our codes are available at https://github.com/Sauf4896/BDEdepth.

3.9CVSep 7, 2023Code

Dynamic Frame Interpolation in Wavelet Domain

Lingtong Kong, Boyuan Jiang, Donghao Luo et al. · tencent-ai

Video frame interpolation is an important low-level vision task, which can increase frame rate for more fluent visual experience. Existing methods have achieved great success by employing advanced motion models and synthesis networks. However, the spatial redundancy when synthesizing the target frame has not been fully explored, that can result in lots of inefficient computation. On the other hand, the computation compression degree in frame interpolation is highly dependent on both texture distribution and scene motion, which demands to understand the spatial-temporal information of each input frame pair for a better compression degree selection. In this work, we propose a novel two-stage frame interpolation framework termed WaveletVFI to address above problems. It first estimates intermediate optical flow with a lightweight motion perception network, and then a wavelet synthesis network uses flow aligned context features to predict multi-scale wavelet coefficients with sparse convolution for efficient target frame reconstruction, where the sparse valid masks that control computation in each scale are determined by a crucial threshold ratio. Instead of setting a fixed value like previous methods, we find that embedding a classifier in the motion perception network to learn a dynamic threshold for each sample can achieve more computation reduction with almost no loss of accuracy. On the common high resolution and animation frame interpolation benchmarks, proposed WaveletVFI can reduce computation up to 40% while maintaining similar accuracy, making it perform more efficiently against other state-of-the-arts. Code is available at https://github.com/ltkong218/WaveletVFI.

10.6CVNov 11, 2022Code

MDFlow: Unsupervised Optical Flow Learning by Reliable Mutual Knowledge Distillation

Lingtong Kong, Jie Yang

Recent works have shown that optical flow can be learned by deep networks from unlabelled image pairs based on brightness constancy assumption and smoothness prior. Current approaches additionally impose an augmentation regularization term for continual self-supervision, which has been proved to be effective on difficult matching regions. However, this method also amplify the inevitable mismatch in unsupervised setting, blocking the learning process towards optimal solution. To break the dilemma, we propose a novel mutual distillation framework to transfer reliable knowledge back and forth between the teacher and student networks for alternate improvement. Concretely, taking estimation of off-the-shelf unsupervised approach as pseudo labels, our insight locates at defining a confidence selection mechanism to extract relative good matches, and then add diverse data augmentation for distilling adequate and reliable knowledge from teacher to student. Thanks to the decouple nature of our method, we can choose a stronger student architecture for sufficient learning. Finally, better student prediction is adopted to transfer knowledge back to the efficient teacher without additional costs in real deployment. Rather than formulating it as a supervised task, we find that introducing an extra unsupervised term for multi-target learning achieves best final results. Extensive experiments show that our approach, termed MDFlow, achieves state-of-the-art real-time accuracy and generalization ability on challenging benchmarks. Code is available at https://github.com/ltkong218/MDFlow.

10.5CVJul 19, 2024Code

Mono-ViFI: A Unified Learning Framework for Self-supervised Single- and Multi-frame Monocular Depth Estimation

Jinfeng Liu, Lingtong Kong, Bo Li et al.

Self-supervised monocular depth estimation has gathered notable interest since it can liberate training from dependency on depth annotations. In monocular video training case, recent methods only conduct view synthesis between existing camera views, leading to insufficient guidance. To tackle this, we try to synthesize more virtual camera views by flow-based video frame interpolation (VFI), termed as temporal augmentation. For multi-frame inference, to sidestep the problem of dynamic objects encountered by explicit geometry-based methods like ManyDepth, we return to the feature fusion paradigm and design a VFI-assisted multi-frame fusion module to align and aggregate multi-frame features, using motion and occlusion information obtained by the flow-based VFI model. Finally, we construct a unified self-supervised learning framework, named Mono-ViFI, to bilaterally connect single- and multi-frame depth. In this framework, spatial data augmentation through image affine transformation is incorporated for data diversity, along with a triplet depth consistency loss for regularization. The single- and multi-frame models can share weights, making our framework compact and memory-efficient. Extensive experiments demonstrate that our method can bring significant improvements to current advanced architectures. Source code is available at https://github.com/LiuJF1226/Mono-ViFI.

7.3CVAug 1, 2022

ATCA: an Arc Trajectory Based Model with Curvature Attention for Video Frame Interpolation

Jinfeng Liu, Lingtong Kong, Jie Yang

Video frame interpolation is a classic and challenging low-level computer vision task. Recently, deep learning based methods have achieved impressive results, and it has been proven that optical flow based methods can synthesize frames with higher quality. However, most flow-based methods assume a line trajectory with a constant velocity between two input frames. Only a little work enforces predictions with curvilinear trajectory, but this requires more than two frames as input to estimate the acceleration, which takes more time and memory to execute. To address this problem, we propose an arc trajectory based model (ATCA), which learns motion prior from only two consecutive frames and also is lightweight. Experiments show that our approach performs better than many SOTA methods with fewer parameters and faster inference speed.

3.7CVNov 11, 2022

Progressive Motion Context Refine Network for Efficient Video Frame Interpolation

Lingtong Kong, Jinfeng Liu, Jie Yang

Recently, flow-based frame interpolation methods have achieved great success by first modeling optical flow between target and input frames, and then building synthesis network for target frame generation. However, above cascaded architecture can lead to large model size and inference delay, hindering them from mobile and real-time applications. To solve this problem, we propose a novel Progressive Motion Context Refine Network (PMCRNet) to predict motion fields and image context jointly for higher efficiency. Different from others that directly synthesize target frame from deep feature, we explore to simplify frame interpolation task by borrowing existing texture from adjacent input frames, which means that decoder in each pyramid level of our PMCRNet only needs to update easier intermediate optical flow, occlusion merge mask and image residual. Moreover, we introduce a new annealed multi-scale reconstruction loss to better guide the learning process of this efficient PMCRNet. Experiments on multiple benchmarks show that proposed approaches not only achieve favorable quantitative and qualitative results but also reduces current model size and running time significantly.

6.2CVOct 21, 2025

Mono4DGS-HDR: High Dynamic Range 4D Gaussian Splatting from Alternating-exposure Monocular Videos

Jinfeng Liu, Lingtong Kong, Mi Zhou et al.

We introduce Mono4DGS-HDR, the first system for reconstructing renderable 4D high dynamic range (HDR) scenes from unposed monocular low dynamic range (LDR) videos captured with alternating exposures. To tackle such a challenging problem, we present a unified framework with two-stage optimization approach based on Gaussian Splatting. The first stage learns a video HDR Gaussian representation in orthographic camera coordinate space, eliminating the need for camera poses and enabling robust initial HDR video reconstruction. The second stage transforms video Gaussians into world space and jointly refines the world Gaussians with camera poses. Furthermore, we propose a temporal luminance regularization strategy to enhance the temporal consistency of the HDR appearance. Since our task has not been studied before, we construct a new evaluation benchmark using publicly available datasets for HDR video reconstruction. Extensive experiments demonstrate that Mono4DGS-HDR significantly outperforms alternative solutions adapted from state-of-the-art methods in both rendering quality and speed.

12.6CVMar 8, 2021Code

FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation

Lingtong Kong, Chunhua Shen, Jie Yang

Dense optical flow estimation plays a key role in many robotic vision tasks. In the past few years, with the advent of deep learning, we have witnessed great progress in optical flow estimation. However, current networks often consist of a large number of parameters and require heavy computation costs, largely hindering its application on low power-consumption devices such as mobile phones. In this paper, we tackle this challenge and design a lightweight model for fast and accurate optical flow prediction. Our proposed FastFlowNet follows the widely-used coarse-to-fine paradigm with following innovations. First, a new head enhanced pooling pyramid (HEPP) feature extractor is employed to intensify high-resolution pyramid features while reducing parameters. Second, we introduce a new center dense dilated correlation (CDDC) layer for constructing compact cost volume that can keep large search radius with reduced computation burden. Third, an efficient shuffle block decoder (SBD) is implanted into each pyramid level to accelerate flow estimation with marginal drops in accuracy. Experiments on both synthetic Sintel data and real-world KITTI datasets demonstrate the effectiveness of the proposed approach, which needs only 1/10 computation of comparable networks to achieve on par accuracy. In particular, FastFlowNet only contains 1.37M parameters; and can execute at 90 FPS (with a single GTX 1080Ti) or 5.7 FPS (embedded Jetson TX2 GPU) on a pair of Sintel images of resolution 1024x436.

2.6CVMar 5, 2021

Unsupervised Motion Representation Enhanced Network for Action Recognition

Xiaohang Yang, Lingtong Kong, Jie Yang

Learning reliable motion representation between consecutive frames, such as optical flow, has proven to have great promotion to video understanding. However, the TV-L1 method, an effective optical flow solver, is time-consuming and expensive in storage for caching the extracted optical flow. To fill the gap, we propose UF-TSN, a novel end-to-end action recognition approach enhanced with an embedded lightweight unsupervised optical flow estimator. UF-TSN estimates motion cues from adjacent frames in a coarse-to-fine manner and focuses on small displacement for each level by extracting pyramid of feature and warping one to the other according to the estimated flow of the last level. Due to the lack of labeled motion for action datasets, we constrain the flow prediction with multi-scale photometric consistency and edge-aware smoothness. Compared with state-of-the-art unsupervised motion representation learning methods, our model achieves better accuracy while maintaining efficiency, which is competitive with some supervised or more complicated approaches.

2.6CVJan 31, 2021

OAS-Net: Occlusion Aware Sampling Network for Accurate Optical Flow

Lingtong Kong, Xiaohang Yang, Jie Yang

Optical flow estimation is an essential step for many real-world computer vision tasks. Existing deep networks have achieved satisfactory results by mostly employing a pyramidal coarse-to-fine paradigm, where a key process is to adopt warped target feature based on previous flow prediction to correlate with source feature for building 3D matching cost volume. However, the warping operation can lead to troublesome ghosting problem that results in ambiguity. Moreover, occluded areas are treated equally with non occluded regions in most existing works, which may cause performance degradation. To deal with these challenges, we propose a lightweight yet efficient optical flow network, named OAS-Net (occlusion aware sampling network) for accurate optical flow. First, a new sampling based correlation layer is employed without noisy warping operation. Second, a novel occlusion aware module is presented to make raw cost volume conscious of occluded regions. Third, a shared flow and occlusion awareness decoder is adopted for structure compactness. Experiments on Sintel and KITTI datasets demonstrate the effectiveness of proposed approaches.

2.3CVJun 22, 2020

FDFlowNet: Fast Optical Flow Estimation using a Deep Lightweight Network

Lingtong Kong, Jie Yang

Significant progress has been made for estimating optical flow using deep neural networks. Advanced deep models achieve accurate flow estimation often with a considerable computation complexity and time-consuming training processes. In this work, we present a lightweight yet effective model for real-time optical flow estimation, termed FDFlowNet (fast deep flownet). We achieve better or similar accuracy on the challenging KITTI and Sintel benchmarks while being about 2 times faster than PWC-Net. This is achieved by a carefully-designed structure and newly proposed components. We first introduce an U-shape network for constructing multi-scale feature which benefits upper levels with global receptive field compared with pyramid network. In each scale, a partial fully connected structure with dilated convolution is proposed for flow estimation that obtains a good balance among speed, accuracy and number of parameters compared with sequential connected and dense connected structures. Experiments demonstrate that our model achieves state-of-the-art performance while being fast and lightweight.