Ji Zhao

CV
h-index35
19papers
510citations
Novelty51%
AI Score49

19 Papers

CVJun 22, 2023Code
Affine Correspondences between Multi-Camera Systems for Relative Pose Estimation

Banglei Guan, Ji Zhao

We present a novel method to compute the relative pose of multi-camera systems using two affine correspondences (ACs). Existing solutions to the multi-camera relative pose estimation are either restricted to special cases of motion, have too high computational complexity, or require too many point correspondences (PCs). Thus, these solvers impede an efficient or accurate relative pose estimation when applying RANSAC as a robust estimator. This paper shows that the 6DOF relative pose estimation problem using ACs permits a feasible minimal solution, when exploiting the geometric constraints between ACs and multi-camera systems using a special parameterization. We present a problem formulation based on two ACs that encompass two common types of ACs across two views, i.e., inter-camera and intra-camera. Moreover, the framework for generating the minimal solvers can be extended to solve various relative pose estimation problems, e.g., 5DOF relative pose estimation with known rotation angle prior. Experiments on both virtual and real multi-camera systems prove that the proposed solvers are more efficient than the state-of-the-art algorithms, while resulting in a better relative pose accuracy. Source code is available at https://github.com/jizhaox/relpose-mcs-depth.

CRFeb 17Code
SecCodeBench-V2 Technical Report

Longfei Chen, Ji Zhao, Lanxiao Cui et al.

We introduce SecCodeBench-V2, a publicly released benchmark for evaluating Large Language Model (LLM) copilots' capabilities of generating secure code. SecCodeBench-V2 comprises 98 generation and fix scenarios derived from Alibaba Group's industrial productions, where the underlying security issues span 22 common CWE (Common Weakness Enumeration) categories across five programming languages: Java, C, Python, Go, and JavaScript. SecCodeBench-V2 adopts a function-level task formulation: each scenario provides a complete project scaffold and requires the model to implement or patch a designated target function under fixed interfaces and dependencies. For each scenario, SecCodeBench-V2 provides executable proof-of-concept (PoC) test cases for both functional validation and security verification. All test cases are authored and double-reviewed by security experts, ensuring high fidelity, broad coverage, and reliable ground truth. Beyond the benchmark itself, we build a unified evaluation pipeline that assesses models primarily via dynamic execution. For most scenarios, we compile and run model-generated artifacts in isolated environments and execute PoC test cases to validate both functional correctness and security properties. For scenarios where security issues cannot be adjudicated with deterministic test cases, we additionally employ an LLM-as-a-judge oracle. To summarize performance across heterogeneous scenarios and difficulty levels, we design a Pass@K-based scoring protocol with principled aggregation over scenarios and severity, enabling holistic and comparable evaluation across models. Overall, SecCodeBench-V2 provides a rigorous and reproducible foundation for assessing the security posture of AI coding assistants, with results and artifacts released at https://alibaba.github.io/sec-code-bench. The benchmark is publicly available at https://github.com/alibaba/sec-code-bench.

CVSep 1, 2024Code
Enhancing Vectorized Map Perception with Historical Rasterized Maps

Xiaoyu Zhang, Guangwei Liu, Zihao Liu et al.

In autonomous driving, there is growing interest in end-to-end online vectorized map perception in bird's-eye-view (BEV) space, with an expectation that it could replace traditional high-cost offline high-definition (HD) maps. However, the accuracy and robustness of these methods can be easily compromised in challenging conditions, such as occlusion or adverse weather, when relying only on onboard sensors. In this paper, we propose HRMapNet, leveraging a low-cost Historical Rasterized Map to enhance online vectorized map perception. The historical rasterized map can be easily constructed from past predicted vectorized results and provides valuable complementary information. To fully exploit a historical map, we propose two novel modules to enhance BEV features and map element queries. For BEV features, we employ a feature aggregation module to encode features from both onboard images and the historical map. For map element queries, we design a query initialization module to endow queries with priors from the historical map. The two modules contribute to leveraging map information in online perception. Our HRMapNet can be integrated with most online vectorized map perception methods. We integrate it in two state-of-the-art methods, significantly improving their performance on both the nuScenes and Argoverse 2 datasets. The source code is released at https://github.com/HXMap/HRMapNet.

CVFeb 27, 2024Code
Leveraging Enhanced Queries of Point Sets for Vectorized Map Construction

Zihao Liu, Xiaoyu Zhang, Guangwei Liu et al.

In autonomous driving, the high-definition (HD) map plays a crucial role in localization and planning. Recently, several methods have facilitated end-to-end online map construction in DETR-like frameworks. However, little attention has been paid to the potential capabilities of exploring the query mechanism for map elements. This paper introduces MapQR, an end-to-end method with an emphasis on enhancing query capabilities for constructing online vectorized maps. To probe desirable information efficiently, MapQR utilizes a novel query design, called scatter-and-gather query, which is modelled by separate content and position parts explicitly. The base map instance queries are scattered to different reference points and added with positional embeddings to probe information from BEV features. Then these scatted queries are gathered back to enhance information within each map instance. Together with a simple and effective improvement of a BEV encoder, the proposed MapQR achieves the best mean average precision (mAP) and maintains good efficiency on both nuScenes and Argoverse 2. In addition, integrating our query design into other models can boost their performance significantly. The source code is available at https://github.com/HXMap/MapQR.

CVMar 5, 2025Code
Full-DoF Egomotion Estimation for Event Cameras Using Geometric Solvers

Ji Zhao, Banglei Guan, Zibin Liu et al.

For event cameras, current sparse geometric solvers for egomotion estimation assume that the rotational displacements are known, such as those provided by an IMU. Thus, they can only recover the translational motion parameters. Recovering full-DoF motion parameters using a sparse geometric solver is a more challenging task, and has not yet been investigated. In this paper, we propose several solvers to estimate both rotational and translational velocities within a unified framework. Our method leverages event manifolds induced by line segments. The problem formulations are based on either an incidence relation for lines or a novel coplanarity relation for normal vectors. We demonstrate the possibility of recovering full-DoF egomotion parameters for both angular and linear velocities without requiring extra sensor measurements or motion priors. To achieve efficient optimization, we exploit the Adam framework with a first-order approximation of rotations for quick initialization. Experiments on both synthetic and real-world data demonstrate the effectiveness of our method. The code is available at https://github.com/jizhaox/relpose-event.

CVFeb 28, 2024Code
Six-Point Method for Multi-Camera Systems with Reduced Solution Space

Banglei Guan, Ji Zhao, Laurent Kneip

Relative pose estimation using point correspondences (PC) is a widely used technique. A minimal configuration of six PCs is required for two views of generalized cameras. In this paper, we present several minimal solvers that use six PCs to compute the 6DOF relative pose of multi-camera systems, including a minimal solver for the generalized camera and two minimal solvers for the practical configuration of two-camera rigs. The equation construction is based on the decoupling of rotation and translation. Rotation is represented by Cayley or quaternion parametrization, and translation can be eliminated by using the hidden variable technique. Ray bundle constraints are found and proven when a subset of PCs relate the same cameras across two views. This is the key to reducing the number of solutions and generating numerically stable solvers. Moreover, all configurations of six-point problems for multi-camera systems are enumerated. Extensive experiments demonstrate the superior accuracy and efficiency of our solvers compared to state-of-the-art six-point methods. The code is available at https://github.com/jizhaox/relpose-6pt

CLFeb 5
Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better

Ji Zhao, Yufei Gu, Shitong Shao et al.

As Large Language Models (LLMs) achieve remarkable empirical success through scaling model and data size, pretraining has become increasingly critical yet computationally prohibitive, hindering rapid development. Despite the availability of numerous pretrained LLMs developed at significant computational expense, a fundamental real-world question remains underexplored: \textit{Can we leverage existing small pretrained models to accelerate the training of larger models?} In this paper, we propose a Late-to-Early Training (LET) paradigm that enables LLMs to explicitly learn later knowledge in earlier steps and earlier layers. The core idea is to guide the early layers of an LLM during early training using representations from the late layers of a pretrained (i.e. late training phase) model. We identify two key mechanisms that drive LET's effectiveness: late-to-early-step learning and late-to-early-layer learning. These mechanisms significantly accelerate training convergence while robustly enhancing both language modeling capabilities and downstream task performance, enabling faster training with superior performance. Extensive experiments on 1.4B and 7B parameter models demonstrate LET's efficiency and effectiveness. Notably, when training a 1.4B LLM on the Pile dataset, our method achieves up to 1.6$\times$ speedup with nearly 5\% improvement in downstream task accuracy compared to standard training, even when using a pretrained model with 10$\times$ fewer parameters than the target model.

ROMar 11, 2025
General-Purpose Aerial Intelligent Agents Empowered by Large Language Models

Ji Zhao, Xiao Lin

The emergence of large language models (LLMs) opens new frontiers for unmanned aerial vehicle (UAVs), yet existing systems remain confined to predefined tasks due to hardware-software co-design challenges. This paper presents the first aerial intelligent agent capable of open-world task execution through tight integration of LLM-based reasoning and robotic autonomy. Our hardware-software co-designed system addresses two fundamental limitations: (1) Onboard LLM operation via an edge-optimized computing platform, achieving 5-6 tokens/sec inference for 14B-parameter models at 220W peak power; (2) A bidirectional cognitive architecture that synergizes slow deliberative planning (LLM task planning) with fast reactive control (state estimation, mapping, obstacle avoidance, and motion planning). Validated through preliminary results using our prototype, the system demonstrates reliable task planning and scene understanding in communication-constrained environments, such as sugarcane monitoring, power grid inspection, mine tunnel exploration, and biological observation applications. This work establishes a novel framework for embodied aerial artificial intelligence, bridging the gap between task planning and robotic autonomy in open environments.

CVDec 15, 2024
Adapter-Enhanced Semantic Prompting for Continual Learning

Baocai Yin, Ji Zhao, Huajie Jiang et al.

Continual learning (CL) enables models to adapt to evolving data streams. A major challenge of CL is catastrophic forgetting, where new knowledge will overwrite previously acquired knowledge. Traditional methods usually retain the past data for replay or add additional branches in the model to learn new knowledge, which has high memory requirements. In this paper, we propose a novel lightweight CL framework, Adapter-Enhanced Semantic Prompting (AESP), which integrates prompt tuning and adapter techniques. Specifically, we design semantic-guided prompts to enhance the generalization ability of visual features and utilize adapters to efficiently fuse the semantic information, aiming to learn more adaptive features for the continual learning task. Furthermore, to choose the right task prompt for feature adaptation, we have developed a novel matching mechanism for prompt selection. Extensive experiments on three CL datasets demonstrate that our approach achieves favorable performance across multiple metrics, showing its potential for advancing CL.

CVFeb 24, 2021
On Relative Pose Recovery for Multi-Camera Systems

Ji Zhao, Banglei Guan

The point correspondence (PC) and affine correspondence (AC) are widely used for relative pose estimation. An AC consists of a PC across two views and an affine transformation between the small patches around this PC. Previous work demonstrates that one AC generally provides three independent constraints for relative pose estimation. For multi-camera systems, there is still not any AC-based minimal solver for general relative pose estimation. To deal with this problem, we propose a complete solution to relative pose estimation from two ACs for multi-camera systems, consisting of a series of minimal solvers. The solver generation in our solution is based on Cayley or quaternion parameterization for rotation and hidden variable technique to eliminate translation. This solver generation method is also naturally applied to relative pose estimation from PCs, resulting in a new six-point method for multi-camera systems. A few extensions are made, including relative pose estimation with known rotation angle and/or with unknown focal lengths. Extensive experiments demonstrate that the proposed AC-based solvers and PC-based solvers are effective and efficient on synthetic and real-world datasets.

CVJan 22, 2021
Hybrid Rotation Averaging: A Fast and Robust Rotation Averaging Approach

Yu Chen, Ji Zhao, Laurent Kneip

We address rotation averaging (RA) and its application to real-world 3D reconstruction. Local optimisation based approaches are the de facto choice, though they only guarantee a local optimum. Global optimisers ensure global optimality in low noise conditions, but they are inefficient and may easily deviate under the influence of outliers or elevated noise levels. We push the envelope of rotation averaging by leveraging the advantages of a global RA method and a local RA method. Combined with a fast view graph filtering as preprocessing, the proposed hybrid approach is robust to outliers. We further apply the proposed hybrid rotation averaging approach to incremental Structure from Motion (SfM), the accuracy and robustness of SfM are both improved by adding the resulting global rotations as regularisers to bundle adjustment. Overall, we demonstrate high practicality of the proposed method as bad camera poses are effectively corrected and drift is reduced.

CVJul 21, 2020
Minimal Cases for Computing the Generalized Relative Pose using Affine Correspondences

Banglei Guan, Ji Zhao, Daniel Barath et al.

We propose three novel solvers for estimating the relative pose of a multi-camera system from affine correspondences (ACs). A new constraint is derived interpreting the relationship of ACs and the generalized camera model. Using the constraint, we demonstrate efficient solvers for two types of motions assumed. Considering that the cameras undergo planar motion, we propose a minimal solution using a single AC and a solver with two ACs to overcome the degenerate case. Also, we propose a minimal solution using two ACs with known vertical direction, e.g., from an IMU. Since the proposed methods require significantly fewer correspondences than state-of-the-art algorithms, they can be efficiently used within RANSAC for outlier removal and initial motion estimation. The solvers are tested both on synthetic data and on real-world scenes from the KITTI odometry benchmark. It is shown that the accuracy of the estimated poses is superior to the state-of-the-art techniques.

CVDec 23, 2019
Minimal Solutions for Relative Pose with a Single Affine Correspondence

Banglei Guan, Ji Zhao, Zhang Li et al.

In this paper we present four cases of minimal solutions for two-view relative pose estimation by exploiting the affine transformation between feature points and we demonstrate efficient solvers for these cases. It is shown, that under the planar motion assumption or with knowledge of a vertical direction, a single affine correspondence is sufficient to recover the relative camera pose. The four cases considered are two-view planar relative motion for calibrated cameras as a closed-form and a least-squares solution, a closed-form solution for unknown focal length and the case of a known vertical direction. These algorithms can be used efficiently for outlier detection within a RANSAC loop and for initial motion estimation. All the methods are evaluated on both synthetic data and real-world datasets from the KITTI benchmark. The experimental results demonstrate that our methods outperform comparable state-of-the-art methods in accuracy with the benefit of a reduced number of needed RANSAC iterations.

CVAug 26, 2019
An Evaluation of Feature Matchers for Fundamental Matrix Estimation

Jia-Wang Bian, Yu-Huan Wu, Ji Zhao et al.

Matching two images while estimating their relative geometry is a key step in many computer vision applications. For decades, a well-established pipeline, consisting of SIFT, RANSAC, and 8-point algorithm, has been used for this task. Recently, many new approaches were proposed and shown to outperform previous alternatives on standard benchmarks, including the learned features, correspondence pruning algorithms, and robust estimators. However, whether it is beneficial to incorporate them into the classic pipeline is less-investigated. To this end, we are interested in i) evaluating the performance of these recent algorithms in the context of image matching and epipolar geometry estimation, and ii) leveraging them to design more practical registration systems. The experiments are conducted in four large-scale datasets using strictly defined evaluation metrics, and the promising results provide insight into which algorithms suit which scenarios. According to this, we propose three high-quality matching systems and a Coarse-to-Fine RANSAC estimator. They show remarkable performances and have potentials to a large part of computer vision tasks. To facilitate future research, the full evaluation pipeline and the proposed methods are made publicly available.

CVMar 21, 2019
An Efficient Solution to Non-Minimal Case Essential Matrix Estimation

Ji Zhao

Finding relative pose between two calibrated images is a fundamental task in computer vision. Given five point correspondences, the classical five-point methods can be used to calculate the essential matrix efficiently. For the case of $N$ ($N > 5$) inlier point correspondences, which is called $N$-point problem, existing methods are either inefficient or prone to local minima. In this paper, we propose a certifiably globally optimal and efficient solver for the $N$-point problem. First we formulate the problem as a quadratically constrained quadratic program (QCQP). Then a certifiably globally optimal solution to this problem is obtained by semidefinite relaxation. This allows us to obtain certifiably globally optimal solutions to the original non-convex QCQPs in polynomial time. The theoretical guarantees of the semidefinite relaxation are also provided, including tightness and local stability. To deal with outliers, we propose a robust $N$-point method using M-estimators. Though global optimality cannot be guaranteed for the overall robust framework, the proposed robust $N$-point method can achieve good performance when the outlier ratio is not high. Extensive experiments on synthetic and real-world datasets demonstrated that our $N$-point method is $2\sim3$ orders of magnitude faster than state-of-the-art methods. Moreover, our robust $N$-point method outperforms state-of-the-art methods in terms of robustness and accuracy.

IRMar 18, 2019
POI Semantic Model with a Deep Convolutional Structure

Ji Zhao, Meiyu Yu, Huan Chen et al.

When using the electronic map, POI retrieval is the initial and important step, whose quality directly affects the user experience. Similarity between user query and POI information is the most critical feature in POI retrieval. An accurate similarity calculation is challenging since the mismatch between a query and a retrieval text may exist in the case of a mistyped query or an alias inquiry. In this paper, we propose a POI latent semantic model based on deep networks, which can effectively extract query features and POI information features for the similarity calculation. Our model describes the semantic information of complex texts at multiple layers, and achieves multi-field matches by modeling POI's name and detailed address respectively. Our model is evaluated by the POI retrieval ranking datasets, including the labeled data of relevance and real-world user click data in POI retrieval. Results show that our model significantly outperforms our competitors in POI retrieval ranking tasks. The proposed algorithm has become a critical component of an online system serving millions of people everyday.

CVOct 23, 2014
Density-Based Region Search with Arbitrary Shape for Object Localization

Ji Zhao, Deyu Meng, Jiayi Ma

Region search is widely used for object localization. Typically, the region search methods project the score of a classifier into an image plane, and then search the region with the maximal score. The recently proposed region search methods, such as efficient subwindow search and efficient region search, %which localize objects from the score distribution on an image are much more efficient than sliding window search. However, for some classifiers and tasks, the projected scores are nearly all positive, and hence maximizing the score of a region results in localizing nearly the entire images as objects, which is meaningless. In this paper, we observe that the large scores are mainly concentrated on or around objects. Based on this observation, we propose a method, named level set maximum-weight connected subgraph (LS-MWCS), which localizes objects with arbitrary shapes by searching regions with the densest score rather than the maximal score. The region density can be controlled by a parameter flexibly. And we prove an important property of the proposed LS-MWCS, which guarantees that the region with the densest score can be searched. Moreover, the LS-MWCS can be efficiently optimized by belief propagation. The method is evaluated on the problem of weakly-supervised object localization, and the quantitative results demonstrate the superiorities of our LS-MWCS compared to other state-of-the-art methods.

CVJul 20, 2014
Feature and Region Selection for Visual Learning

Ji Zhao, Liantao Wang, Ricardo Cabral et al.

Visual learning problems such as object classification and action recognition are typically approached using extensions of the popular bag-of-words (BoW) model. Despite its great success, it is unclear what visual features the BoW model is learning: Which regions in the image or video are used to discriminate among classes? Which are the most discriminative visual words? Answering these questions is fundamental for understanding existing BoW models and inspiring better models for visual recognition. To answer these questions, this paper presents a method for feature selection and region selection in the visual BoW model. This allows for an intermediate visualization of the features and regions that are important for visual learning. The main idea is to assign latent weights to the features or regions, and jointly optimize these latent variables with the parameters of a classifier (e.g., support vector machine). There are four main benefits of our approach: (1) Our approach accommodates non-linear additive kernels such as the popular $χ^2$ and intersection kernel; (2) our approach is able to handle both regions in images and spatio-temporal regions in videos in a unified way; (3) the feature selection problem is convex, and both problems can be solved using a scalable reduced gradient method; (4) we point out strong connections with multiple kernel learning and multiple instance learning approaches. Experimental results in the PASCAL VOC 2007, MSR Action Dataset II and YouTube illustrate the benefits of our approach.

AIMay 12, 2014
FastMMD: Ensemble of Circular Discrepancy for Efficient Two-Sample Test

Ji Zhao, Deyu Meng

The maximum mean discrepancy (MMD) is a recently proposed test statistic for two-sample test. Its quadratic time complexity, however, greatly hampers its availability to large-scale applications. To accelerate the MMD calculation, in this study we propose an efficient method called FastMMD. The core idea of FastMMD is to equivalently transform the MMD with shift-invariant kernels into the amplitude expectation of a linear combination of sinusoid components based on Bochner's theorem and Fourier transform (Rahimi & Recht, 2007). Taking advantage of sampling of Fourier transform, FastMMD decreases the time complexity for MMD calculation from $O(N^2 d)$ to $O(L N d)$, where $N$ and $d$ are the size and dimension of the sample set, respectively. Here $L$ is the number of basis functions for approximating kernels which determines the approximation accuracy. For kernels that are spherically invariant, the computation can be further accelerated to $O(L N \log d)$ by using the Fastfood technique (Le et al., 2013). The uniform convergence of our method has also been theoretically proved in both unbiased and biased estimates. We have further provided a geometric explanation for our method, namely ensemble of circular discrepancy, which facilitates us to understand the insight of MMD, and is hopeful to help arouse more extensive metrics for assessing two-sample test. Experimental results substantiate that FastMMD is with similar accuracy as exact MMD, while with faster computation speed and lower variance than the existing MMD approximation methods.