Daqi Liu

CV
h-index9
15papers
245citations
Novelty52%
AI Score33

15 Papers

CVMay 14, 2022
Importance Weighted Structure Learning for Scene Graph Generation

Daqi Liu, Miroslaw Bober, Josef Kittler

Scene graph generation is a structured prediction task aiming to explicitly model objects and their relationships via constructing a visually-grounded scene graph for an input image. Currently, the message passing neural network based mean field variational Bayesian methodology is the ubiquitous solution for such a task, in which the variational inference objective is often assumed to be the classical evidence lower bound. However, the variational approximation inferred from such loose objective generally underestimates the underlying posterior, which often leads to inferior generation performance. In this paper, we propose a novel importance weighted structure learning method aiming to approximate the underlying log-partition function with a tighter importance weighted lower bound, which is computed from multiple samples drawn from a reparameterizable Gumbel-Softmax sampler. A generic entropic mirror descent algorithm is applied to solve the resulting constrained variational inference task. The proposed method achieves the state-of-the-art performance on various popular scene graph generation benchmarks.

CVMar 2, 2022
Asynchronous Optimisation for Event-based Visual Odometry

Daqi Liu, Alvaro Parra, Yasir Latif et al.

Event cameras open up new possibilities for robotic perception due to their low latency and high dynamic range. On the other hand, developing effective event-based vision algorithms that fully exploit the beneficial properties of event cameras remains work in progress. In this paper, we focus on event-based visual odometry (VO). While existing event-driven VO pipelines have adopted continuous-time representations to asynchronously process event data, they either assume a known map, restrict the camera to planar trajectories, or integrate other sensors into the system. Towards map-free event-only monocular VO in SE(3), we propose an asynchronous structure-from-motion optimisation back-end. Our formulation is underpinned by a principled joint optimisation problem involving non-parametric Gaussian Process motion modelling and incremental maximum a posteriori inference. A high-performance incremental computation engine is employed to reason about the camera trajectory with every incoming event. We demonstrate the robustness of our asynchronous back-end in comparison to frame-based methods which depend on accurate temporal accumulation of measurements.

CVSep 27, 2022
Globally Optimal Event-Based Divergence Estimation for Ventral Landing

Sofia McLeod, Gabriele Meoni, Dario Izzo et al.

Event sensing is a major component in bio-inspired flight guidance and control systems. We explore the usage of event cameras for predicting time-to-contact (TTC) with the surface during ventral landing. This is achieved by estimating divergence (inverse TTC), which is the rate of radial optic flow, from the event stream generated during landing. Our core contributions are a novel contrast maximisation formulation for event-based divergence estimation, and a branch-and-bound algorithm to exactly maximise contrast and find the optimal divergence value. GPU acceleration is conducted to speed up the global algorithm. Another contribution is a new dataset containing real event streams from ventral landing that was employed to test and benchmark our method. Owing to global optimisation, our algorithm is much more capable at recovering the true divergence, compared to other heuristic divergence estimators or event-based optic flow methods. With GPU acceleration, our method also achieves competitive runtimes.

CVMar 20, 2025Code
MiLA: Multi-view Intensive-fidelity Long-term Video Generation World Model for Autonomous Driving

Haiguang Wang, Daqi Liu, Hongwei Xie et al.

In recent years, data-driven techniques have greatly advanced autonomous driving systems, but the need for rare and diverse training data remains a challenge, requiring significant investment in equipment and labor. World models, which predict and generate future environmental states, offer a promising solution by synthesizing annotated video data for training. However, existing methods struggle to generate long, consistent videos without accumulating errors, especially in dynamic scenes. To address this, we propose MiLA, a novel framework for generating high-fidelity, long-duration videos up to one minute. MiLA utilizes a Coarse-to-Re(fine) approach to both stabilize video generation and correct distortion of dynamic objects. Additionally, we introduce a Temporal Progressive Denoising Scheduler and Joint Denoising and Correcting Flow modules to improve the quality of generated videos. Extensive experiments on the nuScenes dataset show that MiLA achieves state-of-the-art performance in video generation quality. For more information, visit the project website: https://github.com/xiaomi-mlab/mila.github.io.

CVMar 21, 2024
SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field

Lizhe Liu, Bohua Wang, Hongwei Xie et al.

Vision-centric 3D environment understanding is both vital and challenging for autonomous driving systems. Recently, object-free methods have attracted considerable attention. Such methods perceive the world by predicting the semantics of discrete voxel grids but fail to construct continuous and accurate obstacle surfaces. To this end, in this paper, we propose SurroundSDF to implicitly predict the signed distance field (SDF) and semantic field for the continuous perception from surround images. Specifically, we introduce a query-based approach and utilize SDF constrained by the Eikonal formulation to accurately describe the surfaces of obstacles. Furthermore, considering the absence of precise SDF ground truth, we propose a novel weakly supervised paradigm for SDF, referred to as the Sandwich Eikonal formulation, which emphasizes applying correct and dense constraints on both sides of the surface, thereby enhancing the perceptual accuracy of the surface. Experiments suggest that our method achieves SOTA for both occupancy prediction and 3D scene reconstruction tasks on the nuScenes dataset.

CVMar 24, 2024
Self-Supervised Multi-Frame Neural Scene Flow

Dongrui Liu, Daqi Liu, Xueqian Li et al.

Neural Scene Flow Prior (NSFP) and Fast Neural Scene Flow (FNSF) have shown remarkable adaptability in the context of large out-of-distribution autonomous driving. Despite their success, the underlying reasons for their astonishing generalization capabilities remain unclear. Our research addresses this gap by examining the generalization capabilities of NSFP through the lens of uniform stability, revealing that its performance is inversely proportional to the number of input point clouds. This finding sheds light on NSFP's effectiveness in handling large-scale point cloud scene flow estimation tasks. Motivated by such theoretical insights, we further explore the improvement of scene flow estimation by leveraging historical point clouds across multiple frames, which inherently increases the number of point clouds. Consequently, we propose a simple and effective method for multi-frame point cloud scene flow estimation, along with a theoretical evaluation of its generalization abilities. Our analysis confirms that the proposed method maintains a limited generalization error, suggesting that adding multiple frames to the scene flow optimization process does not detract from its generalizability. Extensive experimental results on large-scale autonomous driving Waymo Open and Argoverse lidar datasets demonstrate that the proposed method achieves state-of-the-art performance.

CVMar 10, 2025
Just Functioning as a Hook for Two-Stage Referring Multi-Object Tracking

Weize Li, Yunhao Du, Qixiang Yin et al.

Referring Multi-Object Tracking (RMOT) aims to localize target trajectories in videos specified by natural language expressions. Despite recent progress, the intrinsic relationship between the two subtasks of tracking and referring in RMOT has not been fully studied. In this paper, we present a systematic analysis of their interdependence, revealing that current two-stage Referring-by-Tracking (RBT) frameworks remain fundamentally limited by insufficient modeling of subtask interactions and inflexible reliance on semantic alignment modules like CLIP. To this end, we propose JustHook, a novel two-stage RBT framework where a Hook module is firstly designed to redefine the linkage between subtasks. The Hook is built centered on grid sampling at the feature-level and is used for context-aware target feature extraction. Moreover, we propose a Parallel Combined Decoder (PCD) that learns in a unified joint feature space rather than relying on pre-defined cross-modal embeddings. Our design not only enhances the interpretability and modularity but also significantly improves the generalization. Extensive experiments on Refer-KITTI, Refer-KITTI-V2, and Refer-Dance demonstrate that JustHook achieves state-of-the-art performance, improving the HOTA by +6.9\% on Refer-KITTI-V2 with superior efficiency. Code will be available soon.

CVMar 10, 2025
Learning A Zero-shot Occupancy Network from Vision Foundation Models via Self-supervised Adaptation

Sihao Lin, Daqi Liu, Ruochong Fu et al.

Estimating the 3D world from 2D monocular images is a fundamental yet challenging task due to the labour-intensive nature of 3D annotations. To simplify label acquisition, this work proposes a novel approach that bridges 2D vision foundation models (VFMs) with 3D tasks by decoupling 3D supervision into an ensemble of image-level primitives, e.g., semantic and geometric components. As a key motivator, we leverage the zero-shot capabilities of vision-language models for image semantics. However, due to the notorious ill-posed problem - multiple distinct 3D scenes can produce identical 2D projections, directly inferring metric depth from a monocular image in a zero-shot manner is unsuitable. In contrast, 2D VFMs provide promising sources of relative depth, which theoretically aligns with metric depth when properly scaled and offset. Thus, we adapt the relative depth derived from VFMs into metric depth by optimising the scale and offset using temporal consistency, also known as novel view synthesis, without access to ground-truth metric depth. Consequently, we project the semantics into 3D space using the reconstructed metric depth, thereby providing 3D supervision. Extensive experiments on nuScenes and SemanticKITTI demonstrate the effectiveness of our framework. For instance, the proposed method surpasses the current state-of-the-art by 3.34% mIoU on nuScenes for voxel occupancy prediction.

CVJun 22, 2022
Doubly Reparameterized Importance Weighted Structure Learning for Scene Graph Generation

Daqi Liu, Miroslaw Bober, Josef Kittler

As a structured prediction task, scene graph generation, given an input image, aims to explicitly model objects and their relationships by constructing a visually-grounded scene graph. In the current literature, such task is universally solved via a message passing neural network based mean field variational Bayesian methodology. The classical loose evidence lower bound is generally chosen as the variational inference objective, which could induce oversimplified variational approximation and thus underestimate the underlying complex posterior. In this paper, we propose a novel doubly reparameterized importance weighted structure learning method, which employs a tighter importance weighted lower bound as the variational inference objective. It is computed from multiple samples drawn from a reparameterizable Gumbel-Softmax sampler and the resulting constrained variational inference task is solved by a generic entropic mirror descent algorithm. The resulting doubly reparameterized gradient estimator reduces the variance of the corresponding derivatives with a beneficial impact on learning. The proposed method achieves the state-of-the-art performance on various popular scene graph generation benchmarks.

CVJan 27, 2022
Constrained Structure Learning for Scene Graph Generation

Daqi Liu, Miroslaw Bober, Josef Kittler

As a structured prediction task, scene graph generation aims to build a visually-grounded scene graph to explicitly model objects and their relationships in an input image. Currently, the mean field variational Bayesian framework is the de facto methodology used by the existing methods, in which the unconstrained inference step is often implemented by a message passing neural network. However, such formulation fails to explore other inference strategies, and largely ignores the more general constrained optimization models. In this paper, we present a constrained structure learning method, for which an explicit constrained variational inference objective is proposed. Instead of applying the ubiquitous message-passing strategy, a generic constrained optimization method - entropic mirror descent - is utilized to solve the constrained variational inference step. We validate the proposed generic model on various popular scene graph generation benchmarks and show that it outperforms the state-of-the-art methods.

CVDec 10, 2021
Neural Belief Propagation for Scene Graph Generation

Daqi Liu, Miroslaw Bober, Josef Kittler

Scene graph generation aims to interpret an input image by explicitly modelling the potential objects and their relationships, which is predominantly solved by the message passing neural network models in previous methods. Currently, such approximation models generally assume the output variables are totally independent and thus ignore the informative structural higher-order interactions. This could lead to the inconsistent interpretations for an input image. In this paper, we propose a novel neural belief propagation method to generate the resulting scene graph. It employs a structural Bethe approximation rather than the mean field approximation to infer the associated marginals. To find a better bias-variance trade-off, the proposed model not only incorporates pairwise interactions but also higher order interactions into the associated scoring function. It achieves the state-of-the-art performance on various popular scene graph generation benchmarks.

CVMar 10, 2021
Spatiotemporal Registration for Event-based Visual Odometry

Daqi Liu, Alvaro Parra, Tat-Jun Chin

A useful application of event sensing is visual odometry, especially in settings that require high-temporal resolution. The state-of-the-art method of contrast maximisation recovers the motion from a batch of events by maximising the contrast of the image of warped events. However, the cost scales with image resolution and the temporal resolution can be limited by the need for large batch sizes to yield sufficient structure in the contrast image. In this work, we propose spatiotemporal registration as a compelling technique for event-based rotational motion estimation. We theoretcally justify the approach and establish its fundamental and practical advantages over contrast maximisation. In particular, spatiotemporal registration also produces feature tracks as a by-product, which directly supports an efficient visual odometry pipeline with graph-based optimisation for motion averaging. The simplicity of our visual odometry pipeline allows it to process more than 1 M events/second. We also contribute a new event dataset for visual odometry, where motion sequences with large velocity variations were acquired using a high-precision robot arm.

CVMar 21, 2020
Topological Sweep for Multi-Target Detection of Geostationary Space Objects

Daqi Liu, Bo Chen, Tat-Jun Chin et al.

Conducting surveillance of the Earth's orbit is a key task towards achieving space situational awareness (SSA). Our work focuses on the optical detection of man-made objects (e.g., satellites, space debris) in Geostationary orbit (GEO), which is home to major space assets such as telecommunications and navigational satellites. GEO object detection is challenging due to the distance of the targets, which appear as small dim points among a clutter of bright stars. In this paper, we propose a novel multi-target detection technique based on topological sweep, to find GEO objects from a short sequence of optical images. Our topological sweep technique exploits the geometric duality that underpins the approximately linear trajectory of target objects across the input sequence, to extract the targets from significant clutter and noise. Unlike standard multi-target methods, our algorithm deterministically solves a combinatorial problem to ensure high-recall rates without requiring accurate initializations. The usage of geometric duality also yields an algorithm that is computationally efficient and suitable for online processing.

CVFeb 25, 2020
Globally Optimal Contrast Maximisation for Event-based Motion Estimation

Daqi Liu, Álvaro Parra, Tat-Jun Chin

Contrast maximisation estimates the motion captured in an event stream by maximising the sharpness of the motion compensated event image. To carry out contrast maximisation, many previous works employ iterative optimisation algorithms, such as conjugate gradient, which require good initialisation to avoid converging to bad local minima. To alleviate this weakness, we propose a new globally optimal event-based motion estimation algorithm. Based on branch-and-bound (BnB), our method solves rotational (3DoF) motion estimation on event streams, which supports practical applications such as video stabilisation and attitude estimation. Underpinning our method are novel bounding functions for contrast maximisation, whose theoretical validity is rigorously established. We show concrete examples from public datasets where globally optimal solutions are vital to the success of contrast maximisation. Despite its exact nature, our algorithm is currently able to process a 50,000 event input in 300 seconds (a locally optimal solver takes 30 seconds on the same input), and has the potential to be further speeded-up using GPUs.

CVMar 13, 2019
Visual Semantic Information Pursuit: A Survey

Daqi Liu, Miroslaw Bober, Josef Kittler

Visual semantic information comprises two important parts: the meaning of each visual semantic unit and the coherent visual semantic relation conveyed by these visual semantic units. Essentially, the former one is a visual perception task while the latter one corresponds to visual context reasoning. Remarkable advances in visual perception have been achieved due to the success of deep learning. In contrast, visual semantic information pursuit, a visual scene semantic interpretation task combining visual perception and visual context reasoning, is still in its early stage. It is the core task of many different computer vision applications, such as object detection, visual semantic segmentation, visual relationship detection or scene graph generation. Since it helps to enhance the accuracy and the consistency of the resulting interpretation, visual context reasoning is often incorporated with visual perception in current deep end-to-end visual semantic information pursuit methods. However, a comprehensive review for this exciting area is still lacking. In this survey, we present a unified theoretical paradigm for all these methods, followed by an overview of the major developments and the future trends in each potential direction. The common benchmark datasets, the evaluation metrics and the comparisons of the corresponding methods are also introduced.