Haiguang Wang

CV
h-index: 26
9 papers
327 citations
Novelty: 45%
AI Score: 49

9 Papers

CV · Aug 18, 2023 · Code
SparseBEV: High-Performance Sparse 3D Object Detection from Multi-Camera Videos

Haisong Liu, Yao Teng, Tao Lu et al.

Camera-based 3D object detection in BEV (Bird's Eye View) space has drawn great attention over the past few years. Dense detectors typically follow a two-stage pipeline by first constructing a dense BEV feature and then performing object detection in BEV space, which suffers from complex view transformations and high computation cost. On the other hand, sparse detectors follow a query-based paradigm without explicit dense BEV feature construction, but achieve worse performance than their dense counterparts. In this paper, we find that the key to mitigating this performance gap is the adaptability of the detector in both BEV and image space. To achieve this goal, we propose SparseBEV, a fully sparse 3D object detector that outperforms its dense counterparts. SparseBEV contains three key designs: (1) scale-adaptive self attention to aggregate features with an adaptive receptive field in BEV space, (2) adaptive spatio-temporal sampling to generate sampling locations under the guidance of queries, and (3) adaptive mixing to decode the sampled features with dynamic weights from the queries. On the test split of nuScenes, SparseBEV achieves state-of-the-art performance of 67.5 NDS. On the val split, SparseBEV achieves 55.8 NDS while maintaining a real-time inference speed of 23.5 FPS. Code is available at https://github.com/MCG-NJU/SparseBEV.

CV · Mar 20, 2025 · Code
MiLA: Multi-view Intensive-fidelity Long-term Video Generation World Model for Autonomous Driving

Haiguang Wang, Daqi Liu, Hongwei Xie et al.

In recent years, data-driven techniques have greatly advanced autonomous driving systems, but the need for rare and diverse training data remains a challenge, requiring significant investment in equipment and labor. World models, which predict and generate future environmental states, offer a promising solution by synthesizing annotated video data for training. However, existing methods struggle to generate long, consistent videos without accumulating errors, especially in dynamic scenes. To address this, we propose MiLA, a novel framework for generating high-fidelity, long-duration videos up to one minute. MiLA utilizes a Coarse-to-Re(fine) approach to both stabilize video generation and correct distortion of dynamic objects. Additionally, we introduce a Temporal Progressive Denoising Scheduler and Joint Denoising and Correcting Flow modules to improve the quality of generated videos. Extensive experiments on the nuScenes dataset show that MiLA achieves state-of-the-art performance in video generation quality. For more information, visit the project website: https://github.com/xiaomi-mlab/mila.github.io.

CL · Sep 3, 2025 · Code
ResearchPulse: Building Method-Experiment Chains through Multi-Document Scientific Inference

Qi Chen, Jingxuan Wei, Zhuoya Yao et al.

Understanding how scientific ideas evolve requires more than summarizing individual papers; it demands structured, cross-document reasoning over thematically related research. In this work, we formalize multi-document scientific inference, a new task that extracts and aligns motivation, methodology, and experimental results across related papers to reconstruct research development chains. This task introduces key challenges, including temporally aligning loosely structured methods and standardizing heterogeneous experimental tables. We present ResearchPulse, an agent-based framework that integrates instruction planning, scientific content extraction, and structured visualization. It consists of three coordinated agents: a Plan Agent for task decomposition, a Mmap-Agent that constructs motivation-method mind maps, and a Lchart-Agent that synthesizes experimental line charts. To support this task, we introduce ResearchPulse-Bench, a citation-aware benchmark of annotated paper clusters. Experiments show that our system, despite using 7B-scale agents, consistently outperforms strong baselines like GPT-4o in semantic alignment, structural consistency, and visual fidelity. The dataset is available at https://huggingface.co/datasets/ResearchPulse/ResearchPulse-Bench.

CV · Dec 28, 2023
Fully Sparse 3D Occupancy Prediction

Haisong Liu, Yang Chen, Haiguang Wang et al.

Occupancy prediction plays a pivotal role in autonomous driving. Previous methods typically construct dense 3D volumes, neglecting the inherent sparsity of the scene and suffering from high computational costs. To bridge the gap, we introduce a novel fully sparse occupancy network, termed SparseOcc. SparseOcc first reconstructs a sparse 3D representation from camera-only inputs and then predicts semantic/instance occupancy from this sparse 3D representation using sparse queries. A mask-guided sparse sampling is designed to enable sparse queries to interact with 2D features in a fully sparse manner, thereby circumventing costly dense features or global attention. Additionally, we design a ray-based evaluation metric, namely RayIoU, to solve the inconsistency penalty along the depth axis raised in traditional voxel-level mIoU criteria. SparseOcc demonstrates its effectiveness by achieving a RayIoU of 34.0, while maintaining a real-time inference speed of 17.3 FPS, with 7 history frames as input. By increasing the number of preceding frames to 15, SparseOcc further improves its performance to 35.1 RayIoU without bells and whistles.
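For readers unfamiliar with the voxel-level mIoU that the abstract contrasts with RayIoU, the following is a minimal sketch of the conventional per-class intersection-over-union averaged across classes. It is not RayIoU and not the authors' evaluation code; the function name `voxel_miou`, the flat-list input layout, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration.

```python
def voxel_miou(pred, target, num_classes):
    """Mean IoU over classes for flat sequences of per-voxel class labels.

    Hypothetical helper for illustration only: real benchmarks often add
    ignore labels or visibility masks, which are omitted here.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of voxels")

    ious = []
    for c in range(num_classes):
        # Voxels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Voxels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that appear in neither input
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [0, 0, 1, 1, 2, 2]
    target = [0, 1, 1, 1, 2, 0]
    print(voxel_miou(pred, target, num_classes=3))
```

Because each voxel is scored independently, a surface predicted one voxel too deep counts as a full miss under this metric, which is the kind of depth-axis penalty the abstract says RayIoU is designed to address.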

LG · Feb 6
Achieving Better Local Regret Bound for Online Non-Convex Bilevel Optimization

Tingkai Jia, Haiguang Wang, Cheng Chen

Online bilevel optimization (OBO) has emerged as a powerful framework for many machine learning problems. Prior works have developed several algorithms that minimize the standard bilevel local regret or the window-averaged bilevel local regret of the OBO problem, but the optimality of existing regret bounds remains unclear. In this work, we establish optimal regret bounds for both settings. For standard bilevel local regret, we propose an algorithm that achieves the optimal regret $Ω(1+V_T)$ with at most $O(T\log T)$ total inner-level gradient evaluations. We further develop a fully single-loop algorithm whose regret bound includes an additional gradient-variation term. For the window-averaged bilevel local regret, we design an algorithm that captures sublinear environmental variation through a window-based analysis and achieves the optimal regret $Ω(T/W^2)$. Experiments validate our theoretical findings and demonstrate the practical effectiveness of the proposed methods.

CV · Oct 1, 2025
Arbitrary Generative Video Interpolation

Guozhen Zhang, Haiguang Wang, Chunyu Wang et al.

Video frame interpolation (VFI), which generates intermediate frames from given start and end frames, has become a fundamental function in video generation applications. However, existing generative VFI methods are constrained to synthesize a fixed number of intermediate frames, lacking the flexibility to adjust generated frame rates or total sequence duration. In this work, we present ArbInterp, a novel generative VFI framework that enables efficient interpolation at any timestamp and of any length. Specifically, to support interpolation at any timestamp, we propose the Timestamp-aware Rotary Position Embedding (TaRoPE), which modulates positions in temporal RoPE to align generated frames with target normalized timestamps. This design enables fine-grained control over frame timestamps, addressing the inflexibility of fixed-position paradigms in prior work. For any-length interpolation, we decompose long-sequence generation into segment-wise frame synthesis. We further design a novel appearance-motion decoupled conditioning strategy: it leverages prior segment endpoints to enforce appearance consistency and temporal semantics to maintain motion coherence, ensuring seamless spatiotemporal transitions across segments. Experimentally, we build comprehensive benchmarks for multi-scale frame interpolation (2x to 32x) to assess generalizability across arbitrary interpolation factors. Results show that ArbInterp outperforms prior methods across all scenarios with higher fidelity and more seamless spatiotemporal continuity. Project website: https://mcg-nju.github.io/ArbInterp-Web/.
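As a small illustration of what "target normalized timestamps" means for the 2x to 32x benchmarks mentioned above, the sketch below lists the timestamps of the frames inserted between a start frame at t = 0 and an end frame at t = 1. It assumes uniformly spaced intermediate frames, which is only one case of the arbitrary-timestamp setting the abstract describes, and it does not implement TaRoPE or any part of ArbInterp; the function name `intermediate_timestamps` is hypothetical.

```python
from fractions import Fraction


def intermediate_timestamps(factor: int) -> list[Fraction]:
    """Normalized timestamps in (0, 1) for `factor`x interpolation.

    Assumes uniform spacing: a factor of k inserts k - 1 frames at i / k.
    """
    if factor < 2:
        raise ValueError("interpolation factor must be at least 2")
    return [Fraction(i, factor) for i in range(1, factor)]


if __name__ == "__main__":
    for factor in (2, 4, 32):
        stamps = intermediate_timestamps(factor)
        print(f"{factor}x: {len(stamps)} intermediate frames, "
              f"first at t={stamps[0]}, last at t={stamps[-1]}")
```

Exact fractions are used instead of floats so that timestamps such as 1/32 stay exact when compared or accumulated across segments.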

CV · Jun 16, 2024
LAIP: Learning Local Alignment from Image-Phrase Modeling for Text-based Person Search

Haiguang Wang, Yu Wu, Mengxia Wu et al.

Text-based person search aims at retrieving images of a particular person based on a given textual description. A common solution for this task is to directly match entire images and texts, i.e., global alignment, which fails to discern the specific details that distinguish appearance-similar people. As a result, some works shift their attention towards local alignment. One group matches fine-grained parts using forward attention weights of the transformer yet underutilizes information. Another implicitly conducts local alignment by reconstructing masked parts based on unmasked context yet with a biased masking strategy. Both limit performance improvement. This paper proposes the Local Alignment from Image-Phrase modeling (LAIP) framework, with Bidirectional Attention-weighted local alignment (BidirAtt) and Mask Phrase Modeling (MPM) modules. BidirAtt goes beyond the typical forward attention by considering the gradient of the transformer as backward attention, utilizing two-sided information for local alignment. MPM focuses on mask reconstruction within the noun phrase rather than the entire text, ensuring an unbiased masking strategy. Extensive experiments conducted on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets demonstrate the superiority of the LAIP framework over existing methods.

CR · Jan 7, 2022
Towards Trustworthy DeFi Oracles: Past, Present and Future

Yinjie Zhao, Xin Kang, Tieyan Li et al.

With the rapid development of blockchain technology in recent years, all kinds of blockchain-based applications have emerged. Among them, decentralized finance (DeFi) is one of the most successful applications and is regarded as the future of finance. The great success of DeFi relies on real-world data that is not directly available on the blockchain. Moreover, due to its deterministic nature, the blockchain cannot directly obtain non-deterministic data from the outside world (off-chain). Thus, oracles have appeared as a viable solution to feed off-chain data to blockchain applications. In this paper, we carry out a comprehensive study on oracles, especially DeFi oracles. We first briefly introduce the application scenarios of DeFi oracles, and then discuss the past of DeFi oracles by categorizing them into several types based on their design features. After that, we introduce five popular DeFi oracles currently in use (such as Chainlink and Band Protocol), focusing on their system architecture, data validation process, and incentive mechanisms. We compare these present DeFi oracles in terms of their data trustworthiness, data source trustworthiness, and overall trust models. Finally, we propose a set of metrics for designing trustworthy DeFi oracles, along with a potential trust architecture and a few promising techniques for building trustworthy oracles.

CR · Jun 14, 2021
On the Trust and Trust Modelling for the Future Fully-Connected Digital World: A Comprehensive Study

Hannah Lim Jing Ting, Xin Kang, Tieyan Li et al.

With the fast development of digital technologies, we are moving into a digital world. The relationships among people and the connections among things are becoming more and more complex, and new challenges arise. To tackle these challenges, trust, a soft security mechanism, is considered a promising technology. Thus, in this survey, we conduct a comprehensive study of trust and trust modelling for the future digital world. We revisit the definitions and properties of trust, analyze the trust theories, and discuss their impact on digital trust modelling. We analyze the digital world and its corresponding environment, where people, things, and infrastructure connect with each other. We detail the challenges that require trust in these digital scenarios. Based on our analysis of trust and the digital world, we define different types of trust relationships and identify the factors that are needed to ensure a fully representative model. Next, to meet the challenges of digital trust modelling, comprehensive trust model evaluation criteria are proposed, and potential security and privacy issues of trust modelling are analyzed. Finally, we provide a wide-ranging analysis of different methodologies and mathematical theories, and of how they can be applied to trust modelling.