CV | Aug 25, 2023 | Code
AccFlow: Backward Accumulation for Long-Range Optical Flow
Guangyang Wu, Xiaohong Liu, Kunming Luo et al.
Recent deep learning-based optical flow estimators have exhibited impressive performance in generating local flows between consecutive frames. However, the estimation of long-range flows between distant frames, particularly under complex object deformation and large motion occlusion, remains a challenging task. One promising solution is to accumulate local flows explicitly or implicitly to obtain the desired long-range flow. Nevertheless, accumulation errors and flow misalignment can hinder the effectiveness of this approach. This paper proposes a novel recurrent framework called AccFlow, which recursively backward accumulates local flows using a deformable module called AccPlus. In addition, an adaptive blending module is designed along with AccPlus to alleviate the occlusion effect by backward accumulation and rectify the accumulation error. Notably, we demonstrate the superiority of backward accumulation over conventional forward accumulation, which to the best of our knowledge has not been explicitly established before. To train and evaluate the proposed AccFlow, we have constructed a large-scale high-quality dataset named CVO, which provides ground-truth optical flow labels between adjacent and distant frames. Extensive experiments validate the effectiveness of AccFlow in handling long-range optical flow estimation. Code is available at https://github.com/mulns/AccFlow .
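To make the accumulation idea concrete, the following is a minimal sketch of explicit flow composition on a 1-D signal, applied from the last frame pair backward. It is not the AccPlus module, which the abstract describes as a learned deformable module with adaptive blending. It only illustrates the plain composition rule F(a→c)(x) = F(a→b)(x) + F(b→c)(x + F(a→b)(x)). The function names, the 1-D setting, and the linear-interpolation sampler are all assumptions made for illustration.

```python
def sample(flow, pos):
    """Linearly interpolate a 1-D flow field at a fractional position (clamped to the border)."""
    pos = min(max(pos, 0.0), len(flow) - 1.0)
    lo = int(pos)
    hi = min(lo + 1, len(flow) - 1)
    w = pos - lo
    return (1.0 - w) * flow[lo] + w * flow[hi]


def compose(flow_ab, flow_bc):
    """Flow a->c from a->b and b->c: follow a->b, then read b->c at the landing point."""
    return [f + sample(flow_bc, x + f) for x, f in enumerate(flow_ab)]


def accumulate_backward(local_flows):
    """local_flows[i] maps frame i to frame i+1 (hypothetical layout).

    Start with the last local flow and prepend earlier flows one at a time,
    so the target frame stays fixed while the source frame moves backward.
    """
    acc = local_flows[-1]
    for flow in reversed(local_flows[:-1]):
        acc = compose(flow, acc)
    return acc


# Three constant local flows; their displacements simply add up.
long_range = accumulate_backward([[1.0] * 8, [2.0] * 8, [0.5] * 8])
```

In this toy case every pixel ends up with a displacement of 3.5. Real flows are spatially varying and occluded, which is where the accumulation error and misalignment mentioned in the abstract arise.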
CV | Jul 22, 2022 | Code
RealFlow: EM-based Realistic Optical Flow Dataset Generation from Videos
Yunhui Han, Kunming Luo, Ao Luo et al.
Obtaining ground-truth labels from a video is challenging since the manual annotation of pixel-wise flow labels is prohibitively expensive and laborious. Besides, existing approaches try to adapt models trained on synthetic datasets to authentic videos, which inevitably suffers from domain discrepancy and hinders performance in real-world applications. To solve these problems, we propose RealFlow, an Expectation-Maximization based framework that can create large-scale optical flow datasets directly from any unlabeled realistic videos. Specifically, we first estimate optical flow between a pair of video frames, and then synthesize a new image from this pair based on the predicted flow. Thus the new image pairs and their corresponding flows can be regarded as a new training set. Besides, we design a Realistic Image Pair Rendering (RIPR) module that adopts softmax splatting and bi-directional hole filling techniques to alleviate the artifacts of the image synthesis. In the E-step, RIPR renders new images to create a large quantity of training data. In the M-step, we utilize the generated training data to train an optical flow network, which can be used to estimate optical flows in the next E-step. During the iterative learning steps, the capability of the flow network gradually improves, as do the accuracy of the flow and the quality of the synthesized dataset. Experimental results show that RealFlow outperforms previous dataset generation methods by a considerable margin. Moreover, based on the generated dataset, our approach achieves state-of-the-art performance on two standard benchmarks compared with both supervised and unsupervised optical flow methods. Our code and dataset are available at https://github.com/megvii-research/RealFlow
CV | Mar 29 | Code
LongCat-Next: Lexicalizing Modalities as Discrete Tokens
Meituan LongCat Team, Bin Xiao, Chao Wang et al.
The prevailing Next-Token Prediction (NTP) paradigm has driven the success of large language models through discrete autoregressive modeling. However, contemporary multimodal systems remain language-centric, often treating non-linguistic modalities as external attachments, leading to fragmented architectures and suboptimal integration. To transcend this limitation, we introduce Discrete Native Autoregressive (DiNA), a unified framework that represents multimodal information within a shared discrete space, enabling a consistent and principled autoregressive modeling across modalities. A key innovation is the Discrete Native Any-resolution Visual Transformer (dNaViT), which performs tokenization and de-tokenization at arbitrary resolutions, transforming continuous visual signals into hierarchical discrete tokens. Building on this foundation, we develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal modality-specific design. As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks. In particular, LongCat-Next addresses the long-standing performance ceiling of discrete vision modeling on understanding tasks and provides a unified approach to effectively reconcile the conflict between understanding and generation. As an attempt toward native multimodality, we open-source the LongCat-Next and its tokenizers, hoping to foster further research and development in the community. GitHub: https://github.com/meituan-longcat/LongCat-Next
CV | Mar 20, 2023
Learning Optical Flow from Event Camera with Rendered Dataset
Xinglong Luo, Kunming Luo, Ao Luo et al.
We study the problem of estimating optical flow from event cameras. One important issue is how to build a high-quality event-flow dataset with accurate event values and flow labels. Previous datasets are created by either capturing real scenes with event cameras or synthesizing from images with pasted foreground objects. The former case can produce real event values but with calculated flow labels, which are sparse and inaccurate. The latter case can generate dense flow labels but the interpolated events are prone to errors. In this work, we propose to render a physically correct event-flow dataset using computer graphics models. In particular, we first create indoor and outdoor 3D scenes in Blender with rich scene content variations. Second, diverse camera motions are included for the virtual capturing, producing images and accurate flow labels. Third, we render high-framerate videos between images for accurate events. The rendered dataset can adjust the density of events, based on which we further introduce an adaptive density module (ADM). Experiments show that our proposed dataset can facilitate event-flow learning: previous approaches, when trained on our dataset, consistently improve their performance by a relatively large margin. In addition, event-flow pipelines equipped with our ADM can further improve performance.
CV | Jul 19, 2024 | Code
PointRegGPT: Boosting 3D Point Cloud Registration using Generative Point-Cloud Pairs for Training
Suyi Chen, Hao Xu, Haipeng Li et al.
Data plays a crucial role in training learning-based methods for 3D point cloud registration. However, the real-world dataset is expensive to build, while rendering-based synthetic data suffers from domain gaps. In this work, we present PointRegGPT, boosting 3D point cloud registration using generative point-cloud pairs for training. Given a single depth map, we first apply a random camera motion to re-project it into a target depth map. Converting them to point clouds gives a training pair. To enhance the data realism, we formulate a generative model as a depth inpainting diffusion to process the target depth map with the re-projected source depth map as the condition. Also, we design a depth correction module to alleviate artifacts caused by point penetration during the re-projection. To our knowledge, this is the first generative approach that explores realistic data generation for indoor point cloud registration. When equipped with our approach, several recent algorithms can improve their performance significantly and achieve SOTA consistently on two common benchmarks. The code and dataset will be released on https://github.com/Chen-Suyi/PointRegGPT.
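The first step described above, turning one depth map into a source/target pair with a known rigid motion, can be sketched as follows. This is only the geometric skeleton under a pinhole camera assumption. It transforms the back-projected points directly instead of re-projecting into a target depth map, and it omits the diffusion inpainting and depth correction stages entirely. The intrinsics `fx, fy, cx, cy` and all function names are hypothetical.

```python
def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (list of rows) into 3-D points with a pinhole model.

    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def transform(points, R, t):
    """Apply a rigid motion p -> R p + t, with R a 3x3 rotation and t a translation."""
    return [
        tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points
    ]


# A tiny 2x2 depth map with one invalid pixel, and a pure translation along z.
depth = [[1.0, 1.0], [1.0, 0.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
source = depth_to_points(depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
target = transform(source, identity, (0.0, 0.0, 1.0))
# (source, target, identity, translation) is one registration training pair
# whose ground-truth transform is known by construction.
```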
CV | Jan 23, 2023
GyroFlow+: Gyroscope-Guided Unsupervised Deep Homography and Optical Flow Learning
Haipeng Li, Kunming Luo, Bing Zeng et al.
Existing homography and optical flow methods are erroneous in challenging scenes, such as fog, rain, night, and snow because the basic assumptions such as brightness and gradient constancy are broken. To address this issue, we present an unsupervised learning approach that fuses gyroscope into homography and optical flow learning. Specifically, we first convert gyroscope readings into motion fields named gyro field. Second, we design a self-guided fusion module (SGF) to fuse the background motion extracted from the gyro field with the optical flow and guide the network to focus on motion details. Meanwhile, we propose a homography decoder module (HD) to combine gyro field and intermediate results of SGF to produce the homography. To the best of our knowledge, this is the first deep learning framework that fuses gyroscope data and image content for both deep homography and optical flow learning. To validate our method, we propose a new dataset that covers regular and challenging scenes. Experiments show that our method outperforms the state-of-the-art methods in both regular and challenging scenes.
CV | Jul 22, 2023
Fast and Stable Diffusion Inverse Solver with History Gradient Update
Linchao He, Hongyu Yan, Mengting Luo et al.
Diffusion models have recently been recognised as efficient inverse problem solvers due to their ability to produce high-quality reconstruction results without relying on pairwise data training. Existing diffusion-based solvers utilize a gradient descent strategy to get an optimal sample solution. However, these solvers only calculate the current gradient and do not utilize any history information from the sampling process, resulting in unstable optimization and suboptimal solutions. To address this issue, we propose to utilize the history information of the diffusion-based inverse solvers. In this paper, we first prove that, in previous work, using the gradient descent method to optimize the data fidelity term is convergent. Building on this, we incorporate historical gradients into this optimization process, an approach termed History Gradient Update (HGU). We also provide theoretical evidence that HGU ensures the convergence of the entire algorithm. It is worth noting that HGU is applicable to both pixel-based and latent-based diffusion model solvers. Experimental results demonstrate that, compared to previous sampling algorithms, sampling algorithms with HGU achieve state-of-the-art results in medical image reconstruction, surpassing even supervised learning methods. Additionally, they achieve competitive results on natural images.
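The abstract does not give the HGU update rule, so the sketch below substitutes a classical momentum-style update (an exponential moving average of past gradients) to illustrate what "using history gradients" on a data fidelity term can look like. It minimizes 0.5 * ||A x - y||^2 for a small linear operator and involves no diffusion model. The names `fidelity_grad` and `solve_with_history`, the step size, and the decay factor `beta` are assumptions, not values from the paper.

```python
def fidelity_grad(A, x, y):
    """Gradient of 0.5 * ||A x - y||^2 with respect to x, i.e. A^T (A x - y)."""
    residual = [sum(a * xi for a, xi in zip(row, x)) - yi for row, yi in zip(A, y)]
    return [sum(A[i][j] * residual[i] for i in range(len(A))) for j in range(len(x))]


def solve_with_history(A, y, steps=300, lr=0.1, beta=0.9):
    """Gradient descent where each step follows a running average of past gradients."""
    x = [0.0] * len(A[0])
    history = [0.0] * len(x)
    for _ in range(steps):
        grad = fidelity_grad(A, x, y)
        # Blend the current gradient into the accumulated history (stand-in for HGU).
        history = [beta * h + (1.0 - beta) * g for h, g in zip(history, grad)]
        x = [xi - lr * h for xi, h in zip(x, history)]
    return x


# Recover x from y = A x for a small diagonal operator.
estimate = solve_with_history([[1.0, 0.0], [0.0, 2.0]], [3.0, 4.0])
```

Setting `beta=0.0` reduces the loop to plain gradient descent on the current gradient, which is the baseline behaviour the abstract attributes to existing solvers.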
CV | Oct 5, 2023
Ctrl-Room: Controllable Text-to-3D Room Meshes Generation with Layout Constraints
Chuan Fang, Yuan Dong, Kunming Luo et al.
Text-driven 3D indoor scene generation is useful for gaming, the film industry, and AR/VR applications. However, existing methods cannot faithfully capture the room layout, nor do they allow flexible editing of individual objects in the room. To address these problems, we present Ctrl-Room, which can generate convincing 3D rooms with designer-style layouts and high-fidelity textures from just a text prompt. Moreover, Ctrl-Room enables versatile interactive editing operations such as resizing or moving individual furniture items. Our key insight is to separate the modeling of layouts and appearance. Our proposed method consists of two stages: a Layout Generation Stage and an Appearance Generation Stage. The Layout Generation Stage trains a text-conditional diffusion model to learn the layout distribution with our holistic scene code parameterization. Next, the Appearance Generation Stage employs a fine-tuned ControlNet to produce a vivid panoramic image of the room guided by the 3D scene layout and text prompt. We thus achieve a high-quality 3D room generation with convincing layouts and lively textures. Benefiting from the scene code parameterization, we can easily edit the generated room model through our mask-guided editing module, without expensive edit-specific training. Extensive experiments on the Structured3D dataset demonstrate that our method outperforms existing methods in producing more reasonable, view-consistent, and editable 3D rooms from natural language prompts.
CV | Mar 27
AutoWeather4D: Autonomous Driving Video Weather Conversion via G-Buffer Dual-Pass Editing
Tianyu Liu, Weitao Xiong, Kunming Luo et al.
Generative video models have significantly advanced the photorealistic synthesis of adverse weather for autonomous driving; however, they consistently demand massive datasets to learn rare weather scenarios. While 3D-aware editing methods alleviate these data constraints by augmenting existing video footage, they are fundamentally bottlenecked by costly per-scene optimization and suffer from inherent geometric and illumination entanglement. In this work, we introduce AutoWeather4D, a feed-forward 3D-aware weather editing framework designed to explicitly decouple geometry and illumination. At the core of our approach is a G-buffer Dual-pass Editing mechanism. The Geometry Pass leverages explicit structural foundations to enable surface-anchored physical interactions, while the Light Pass analytically resolves light transport, accumulating the contributions of local illuminants into the global illumination to enable dynamic 3D local relighting. Extensive experiments demonstrate that AutoWeather4D achieves comparable photorealism and structural consistency to generative baselines while enabling fine-grained parametric physical control, serving as a practical data engine for autonomous driving.
CV | May 29, 2025 | Code
UniTEX: Universal High Fidelity Generative Texturing for 3D Shapes
Yixun Liang, Kunming Luo, Xiao Chen et al.
We present UniTEX, a novel two-stage 3D texture generation framework to create high-quality, consistent textures for 3D assets. Existing approaches predominantly rely on UV-based inpainting to refine textures after reprojecting the generated multi-view images onto the 3D shapes, which introduces challenges related to topological ambiguity. To address this, we propose to bypass the limitations of UV mapping by operating directly in a unified 3D functional space. Specifically, we first lift texture generation into 3D space via Texture Functions (TFs), a continuous, volumetric representation that maps any 3D point to a texture value based solely on surface proximity, independent of mesh topology. Then, we propose to predict these TFs directly from images and geometry inputs using a transformer-based Large Texturing Model (LTM). To further enhance texture quality and leverage powerful 2D priors, we develop an advanced LoRA-based strategy for efficiently adapting large-scale Diffusion Transformers (DiTs) for high-quality multi-view texture synthesis as our first stage. Extensive experiments demonstrate that UniTEX achieves superior visual quality and texture integrity compared to existing approaches, offering a generalizable and scalable solution for automated 3D texture generation. Code will be available at: https://github.com/YixunLiang/UniTEX.
CV | Oct 5, 2025 | Code
Learning Efficient Meshflow and Optical Flow from Event Cameras
Xinglong Luo, Ao Luo, Kunming Luo et al.
In this paper, we explore the problem of event-based meshflow estimation, a novel task that involves predicting a spatially smooth sparse motion field from event cameras. To start, we review the state-of-the-art in event-based flow estimation, highlighting two key areas for further research: i) the lack of meshflow-specific event datasets and methods, and ii) the underexplored challenge of event data density. First, we generate a large-scale High-Resolution Event Meshflow (HREM) dataset, which showcases its superiority by encompassing the merits of high resolution at 1280x720, handling dynamic objects and complex motion patterns, and offering both optical flow and meshflow labels. These aspects have not been fully explored in previous works. Besides, we propose Efficient Event-based MeshFlow (EEMFlow) network, a lightweight model featuring a specially crafted encoder-decoder architecture to facilitate swift and accurate meshflow estimation. Furthermore, we upgrade EEMFlow network to support dense event optical flow, in which a Confidence-induced Detail Completion (CDC) module is proposed to preserve sharp motion boundaries. We conduct comprehensive experiments to show the exceptional performance and runtime efficiency (30x faster) of our EEMFlow model compared to the recent state-of-the-art flow method. As an extension, we expand HREM into HREM+, a multi-density event dataset contributing to a thorough study of the robustness of existing methods across data with varying densities, and propose an Adaptive Density Module (ADM) to adjust the density of input event data to a more optimal range, enhancing the model's generalization ability. We empirically demonstrate that ADM helps to significantly improve the performance of EEMFlow and EEMFlow+ by 8% and 10%, respectively. Code and dataset are released at https://github.com/boomluo02/EEMFlowPlus.
CV | Mar 25, 2021 | Code
GyroFlow: Gyroscope-Guided Unsupervised Optical Flow Learning
Haipeng Li, Kunming Luo, Shuaicheng Liu
Existing optical flow methods are error-prone in challenging scenes, such as fog, rain, and night, because basic optical flow assumptions such as brightness and gradient constancy are broken. To address this problem, we present an unsupervised learning approach that fuses gyroscope data into optical flow learning. Specifically, we first convert gyroscope readings into a motion field named the gyro field. Second, we design a self-guided fusion module to fuse the background motion extracted from the gyro field with the optical flow and guide the network to focus on motion details. To the best of our knowledge, this is the first deep learning-based framework that fuses gyroscope data and image content for optical flow learning. To validate our method, we propose a new dataset that covers regular and challenging scenes. Experiments show that our method outperforms state-of-the-art methods in both regular and challenging scenes. Code and dataset are available at https://github.com/megvii-research/GyroFlow.
CV | May 9
Contour-Native Bridge Defect Detection and Compact Digital Archiving with Frequency-Supervised Fourier Contours
Jin Liu, Wang Wang, Hongxu Pu et al.
AI-assisted bridge defect inspection often produces bounding boxes with crude geometry or raster masks that are costly to store, transmit, and reuse. This study investigates how detected defects can be represented as compact, recoverable contour-level vector records in image space. We propose Frequency-Supervised Fourier Series Detection (FS-FSD), which directly regresses Fourier contour descriptors and evaluates boxes, masks, and contours under a unified polygon-space protocol. On 3,767 UAV-collected bridge images with 42,346 defect instances, FS-FSD achieves higher polygon-space accuracy and better matched-TP geometric quality than representative detection, segmentation, and contour baselines. These results show that, compared with bounding boxes and raster masks, Fourier contour records preserve defect-boundary geometry in a more compact, recoverable, and shareable form for engineering review and downstream information workflows. Future work will study the modeling of multi-region, fragmented, and adjacent bridge-defect boundaries and extend the framework toward long-term bridge-defect tracking and lifecycle-oriented management.
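To illustrate why a Fourier contour record is compact and recoverable, here is a minimal sketch that encodes a closed contour as a handful of complex Fourier coefficients and reconstructs it. It uses the standard complex Fourier descriptor of a closed curve. FS-FSD regresses its descriptors with a network, and the abstract does not state its exact parameterization, so the function names and the choice of `order` are assumptions.

```python
import cmath
import math


def contour_to_fourier(points, order):
    """Coefficients c_k, k = -order..order, of a closed contour sampled as (x, y) points."""
    n = len(points)
    z = [complex(x, y) for x, y in points]
    coeffs = {}
    for k in range(-order, order + 1):
        total = sum(zi * cmath.exp(-2j * math.pi * k * idx / n) for idx, zi in enumerate(z))
        coeffs[k] = total / n
    return coeffs


def fourier_to_contour(coeffs, num_points):
    """Rebuild contour samples from the stored coefficients."""
    points = []
    for idx in range(num_points):
        t = idx / num_points
        z = sum(c * cmath.exp(2j * math.pi * k * t) for k, c in coeffs.items())
        points.append((z.real, z.imag))
    return points


# A circle of radius 2 centred at (5, 3), sampled at 64 points.
circle = [
    (5 + 2 * math.cos(2 * math.pi * i / 64), 3 + 2 * math.sin(2 * math.pi * i / 64))
    for i in range(64)
]
record = contour_to_fourier(circle, order=2)  # 5 complex numbers instead of 64 points
restored = fourier_to_contour(record, 64)
```

For this circle only the `k = 0` term (the centre) and the `k = 1` term (the radius) are non-zero, so the contour is recovered from a record far smaller than the sampled polygon or a raster mask.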
IV | Feb 25, 2024
Diffusion Posterior Proximal Sampling for Image Restoration
Hongjie Wu, Linchao He, Mingqin Zhang et al.
Diffusion models have demonstrated remarkable efficacy in generating high-quality samples. Existing diffusion-based image restoration algorithms exploit pre-trained diffusion models to leverage data priors, yet they still preserve elements inherited from the unconditional generation paradigm. These strategies initiate the denoising process with pure white noise and incorporate random noise at each generative step, leading to over-smoothed results. In this paper, we present a refined paradigm for diffusion-based image restoration. Specifically, we opt for a sample consistent with the measurement identity at each generative step, exploiting the sampling selection as an avenue for output stability and enhancement. The number of candidate samples used for selection is adaptively determined based on the signal-to-noise ratio of the timestep. Additionally, we start the restoration process with an initialization combined with the measurement signal, providing supplementary information to better align the generative process. Extensive experimental results and analyses validate that our proposed method significantly enhances image restoration performance while consuming negligible additional computational resources.
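The selection step described above can be sketched as picking, among several candidate samples, the one whose simulated measurement is closest to the observed one. This is an illustrative stand-in under a linear degradation assumption. The operator `A`, the squared-error criterion, and the function names are hypothetical, and the SNR-based rule for choosing the number of candidates is not reproduced because the abstract does not specify it.

```python
def measure(A, x):
    """Apply a linear degradation operator A (list of rows) to a sample x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def measurement_error(A, x, y):
    """Squared distance between the simulated measurement A x and the observation y."""
    return sum((m - yi) ** 2 for m, yi in zip(measure(A, x), y))


def select_consistent(candidates, A, y):
    """Return the candidate sample most consistent with the measurement y."""
    return min(candidates, key=lambda x: measurement_error(A, x, y))


# One observed value (the sum of two unknowns) and three candidate samples.
A = [[1.0, 1.0]]
y = [2.0]
chosen = select_consistent([[0.0, 0.0], [1.0, 0.9], [3.0, 3.0]], A, y)
```

In a diffusion sampler the candidates would be several stochastic draws of the next state at one generative step, with the chosen draw carried forward.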
CV | Apr 3, 2024
GenN2N: Generative NeRF2NeRF Translation
Xiangyue Liu, Han Xue, Kunming Luo et al.
We present GenN2N, a unified NeRF-to-NeRF translation framework for various NeRF translation tasks such as text-driven NeRF editing, colorization, super-resolution, inpainting, etc. Unlike previous methods designed for individual translation tasks with task-specific schemes, GenN2N achieves all these NeRF editing tasks by employing a plug-and-play image-to-image translator to perform editing in the 2D domain and lifting 2D edits into the 3D NeRF space. Since the 3D consistency of 2D edits may not be assured, we propose to model the distribution of the underlying 3D edits through a generative model that can cover all possible edited NeRFs. To model the distribution of 3D edited NeRFs from 2D edited images, we carefully design a VAE-GAN that encodes images while decoding NeRFs. The latent space is trained to align with a Gaussian distribution and the NeRFs are supervised through an adversarial loss on its renderings. To ensure the latent code does not depend on 2D viewpoints but truly reflects the 3D edits, we also regularize the latent code through a contrastive learning scheme. Extensive experiments on various editing tasks show GenN2N, as a universal framework, performs as well or better than task-specific specialists while possessing flexible generative power. More results on our project page: https://xiangyueliu.github.io/GenN2N/
CV | Mar 23, 2025
SymmCompletion: High-Fidelity and High-Consistency Point Cloud Completion with Symmetry Guidance
Hongyu Yan, Zijun Li, Kunming Luo et al.
Point cloud completion aims to recover a complete point shape from a partial point cloud. Although existing methods can form satisfactory point clouds in global completeness, they often lose the original geometry details and face the problem of geometric inconsistency between existing point clouds and reconstructed missing parts. To tackle this problem, we introduce SymmCompletion, a highly effective completion method based on symmetry guidance. Our method comprises two primary components: a Local Symmetry Transformation Network (LSTNet) and a Symmetry-Guidance Transformer (SGFormer). First, LSTNet efficiently estimates point-wise local symmetry transformations to transform key geometries of partial inputs into missing regions, thereby generating geometry-aligned partial-missing pairs and initial point clouds. Second, SGFormer leverages the geometric features of partial-missing pairs as explicit symmetric guidance that can constrain the refinement process for initial point clouds. As a result, SGFormer can exploit the provided priors to form high-fidelity and geometry-consistent final point clouds. Qualitative and quantitative evaluations on several benchmark datasets demonstrate that our method outperforms state-of-the-art completion networks.
CV | Jan 17, 2025
GaussianAvatar-Editor: Photorealistic Animatable Gaussian Head Avatar Editor
Xiangyue Liu, Kunming Luo, Heng Li et al.
We introduce GaussianAvatar-Editor, an innovative framework for text-driven editing of animatable Gaussian head avatars that can be fully controlled in expression, pose, and viewpoint. Unlike static 3D Gaussian editing, editing animatable 4D Gaussian avatars presents challenges related to motion occlusion and spatial-temporal inconsistency. To address these issues, we propose the Weighted Alpha Blending Equation (WABE). This function enhances the blending weight of visible Gaussians while suppressing the influence on non-visible Gaussians, effectively handling motion occlusion during editing. Furthermore, to improve editing quality and ensure 4D consistency, we incorporate conditional adversarial learning into the editing process. This strategy helps to refine the edited results and maintain consistency throughout the animation. By integrating these methods, our GaussianAvatar-Editor achieves photorealistic and consistent results in animatable 4D Gaussian editing. We conduct comprehensive experiments across various subjects to validate the effectiveness of our proposed techniques, which demonstrates the superiority of our approach over existing methods. More results and code are available at: https://xiangyueliu.github.io/GaussianAvatar-Editor/
CV | Jun 26, 2025
PoseMaster: Generating 3D Characters in Arbitrary Poses from a Single Image
Hongyu Yan, Kunming Luo, Weiyu Li et al.
3D characters play a crucial role in our daily entertainment. To improve the efficiency of 3D character modeling, recent image-based methods use two separate models to achieve pose standardization and 3D reconstruction of the A-pose character. However, these methods are prone to generating distorted and degraded images in the pose standardization stage due to self-occlusion and viewpoints, which further affects the geometric quality of the subsequent reconstruction process. To tackle these problems, we propose PoseMaster, an end-to-end controllable 3D character generation framework. Specifically, we unify pose transformation and 3D character generation into a flow-based 3D native generation framework. To achieve accurate arbitrary-pose control, we propose to leverage the 3D body bones existing in the skeleton of an animatable character as the pose condition. Furthermore, considering the specificity of multi-condition control, we randomly empty the pose condition and the image condition during training to improve the effectiveness and generalizability of pose control. Finally, we create a high-quality pose-control dataset derived from realistic character animation data to make the model learn the implicit relationships between skeleton and skinning weights. Extensive experiments show that PoseMaster outperforms current state-of-the-art techniques in both qualitative and quantitative evaluations for A-pose character generation while demonstrating its powerful ability to achieve precise control for arbitrary poses.
CV | Sep 28, 2025
DiffPCN: Latent Diffusion Model Based on Multi-view Depth Images for Point Cloud Completion
Zijun Li, Hongyu Yan, Shijie Li et al.
Latent diffusion models (LDMs) have demonstrated remarkable generative capabilities across various low-level vision tasks. However, their potential for point cloud completion remains underexplored due to the unstructured and irregular nature of point clouds. In this work, we propose DiffPCN, a novel diffusion-based coarse-to-fine framework for point cloud completion. Our approach comprises two stages: an initial stage for generating coarse point clouds, and a refinement stage that improves their quality through point denoising and upsampling. Specifically, we first project the unordered and irregular partial point cloud into structured depth images, which serve as conditions for a well-designed DepthLDM to synthesize completed multi-view depth images that are used to form coarse point clouds. In this way, our DiffPCN can yield high-quality and high-completeness coarse point clouds by leveraging LDM's powerful generation and comprehension capabilities. Then, since LDMs inevitably introduce outliers into the generated depth maps, we design a Point Denoising Network to remove artifacts from the coarse point cloud by predicting a per-point distance score. Finally, we devise an Association-Aware Point Upsampler, which guides the upsampling process by leveraging local association features between the input point cloud and the corresponding coarse points, further yielding a dense and high-fidelity output. Experimental results demonstrate that our DiffPCN achieves state-of-the-art performance in geometric accuracy and shape completeness, significantly improving the robustness and consistency of point cloud completion.
CV | Feb 8, 2022
Learning Optical Flow with Adaptive Graph Reasoning
Ao Luo, Fan Yang, Kunming Luo et al.
Estimating per-pixel motion between video frames, known as optical flow, is a long-standing problem in video understanding and analysis. Most contemporary optical flow techniques largely focus on addressing the cross-image matching with feature similarity, with few methods considering how to explicitly reason over the given scene for achieving a holistic motion understanding. In this work, taking a fresh perspective, we introduce a novel graph-based approach, called adaptive graph reasoning for optical flow (AGFlow), to emphasize the value of scene/context information in optical flow. Our key idea is to decouple the context reasoning from the matching procedure, and exploit scene information to effectively assist motion estimation by learning to reason over the adaptive graph. The proposed AGFlow can effectively exploit the context information and incorporate it within the matching procedure, producing more robust and accurate results. On both Sintel clean and final passes, our AGFlow achieves the best accuracy with EPE of 1.43 and 2.47 pixels, outperforming state-of-the-art approaches by 11.2% and 13.6%, respectively.
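The EPE figures quoted above refer to end-point error, the mean Euclidean distance between predicted and ground-truth flow vectors. A minimal sketch of the metric follows. The flat list-of-tuples layout and the function name are assumptions for illustration, and benchmark-specific details such as valid-pixel masks are left out.

```python
import math


def endpoint_error(pred, gt):
    """Average end-point error between two flow fields given as lists of (u, v) vectors."""
    if len(pred) != len(gt):
        raise ValueError("flow fields must have the same number of vectors")
    total = sum(math.hypot(pu - gu, pv - gv) for (pu, pv), (gu, gv) in zip(pred, gt))
    return total / len(pred)


# Two pixels: one is off by the vector (3, 4), the other is exact.
epe = endpoint_error([(3.0, 4.0), (1.0, 1.0)], [(0.0, 0.0), (1.0, 1.0)])
```

Here the first pixel contributes an error of 5 pixels and the second contributes 0, giving a mean of 2.5.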
CV | Apr 8, 2021
ASFlow: Unsupervised Optical Flow Learning with Adaptive Pyramid Sampling
Kunming Luo, Ao Luo, Chuan Wang et al.
We present an unsupervised optical flow estimation method by proposing adaptive pyramid sampling in the deep pyramid network. Specifically, in the pyramid downsampling, we propose a Content Aware Pooling (CAP) module, which promotes local feature gathering by avoiding cross-region pooling, so that the learned features become more representative. In the pyramid upsampling, we propose an Adaptive Flow Upsampling (AFU) module, where cross-edge interpolation can be avoided, producing sharp motion boundaries. Equipped with these two modules, our method achieves the best performance for unsupervised optical flow estimation on multiple leading benchmarks, including MPI-Sintel, KITTI 2012 and KITTI 2015. In particular, we achieve EPE=1.5 on KITTI 2012 and F1=9.67% on KITTI 2015, which outperform the previous state-of-the-art methods by 16.7% and 13.1%, respectively.
CV | Dec 1, 2020
UPFlow: Upsampling Pyramid for Unsupervised Optical Flow Learning
Kunming Luo, Chuan Wang, Shuaicheng Liu et al.
We present an unsupervised learning approach for optical flow estimation by improving the upsampling and learning of the pyramid network. We design a self-guided upsample module to tackle the interpolation blur problem caused by bilinear upsampling between pyramid levels. Moreover, we propose a pyramid distillation loss to add supervision for intermediate levels via distilling the finest flow as pseudo labels. By integrating these two components together, our method achieves the best performance for unsupervised optical flow learning on multiple leading benchmarks, including MPI-Sintel, KITTI 2012 and KITTI 2015. In particular, we achieve EPE=1.4 on KITTI 2012 and F1=9.38% on KITTI 2015, which outperform the previous state-of-the-art methods by 22.2% and 15.7%, respectively.
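The pyramid distillation idea, supervising a coarse level with a downsampled copy of the finest flow, can be sketched on a 1-D flow field as below. The abstract does not specify the downsampling operator or the loss norm, so average pooling, rescaling displacements to coarse-pixel units, and an L1 penalty are all assumptions, as are the function names.

```python
def downsample_flow(flow, factor=2):
    """Average-pool a 1-D flow field and rescale displacements to coarse-pixel units."""
    pooled = []
    for start in range(0, len(flow) - factor + 1, factor):
        window = flow[start:start + factor]
        # A motion of d fine pixels corresponds to d / factor coarse pixels.
        pooled.append(sum(window) / factor / factor)
    return pooled


def distillation_loss(finest_flow, coarse_pred, factor=2):
    """Mean L1 distance between a coarse prediction and the pseudo label from the finest flow."""
    pseudo_label = downsample_flow(finest_flow, factor)
    if len(pseudo_label) != len(coarse_pred):
        raise ValueError("coarse prediction does not match the downsampled resolution")
    return sum(abs(p - t) for p, t in zip(coarse_pred, pseudo_label)) / len(pseudo_label)


finest = [4.0, 4.0, 8.0, 8.0]
pseudo = downsample_flow(finest)                 # pseudo label for the next-coarser level
loss = distillation_loss(finest, [3.0, 4.0])     # the first coarse value is off by 1
```

In a training loop the finest flow would be treated as a fixed target so that gradients update only the intermediate-level prediction.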
CV | Jun 30, 2020
OccInpFlow: Occlusion-Inpainting Optical Flow Estimation by Unsupervised Learning
Kunming Luo, Chuan Wang, Nianjin Ye et al.
Occlusion is an inevitable and critical problem in unsupervised optical flow learning. Existing methods either treat occlusions equally as non-occluded regions or simply remove them to avoid incorrectness. However, the occlusion regions can provide effective information for optical flow learning. In this paper, we present OccInpFlow, an occlusion-inpainting framework to make full use of occlusion regions. Specifically, a new appearance-flow network is proposed to inpaint occluded flows based on the image content. Moreover, a boundary warp is proposed to deal with occlusions caused by displacement beyond image border. We conduct experiments on multiple leading flow benchmark data sets such as Flying Chairs, KITTI and MPI-Sintel, which demonstrate that the performance is significantly improved by our proposed occlusion handling framework.