Yijun Cao

CV
h-index: 11
5 papers
38 citations
Novelty: 47%
AI Score: 29

5 Papers

CV · Oct 2, 2022
Unsupervised Visual Odometry and Action Integration for PointGoal Navigation in Indoor Environment

Yijun Cao, Xianshi Zhang, Fuya Luo et al.

PointGoal navigation in indoor environments is a fundamental task in which a personal robot navigates to a specified point. Recent studies have solved this task with a near-perfect success rate in photo-realistic simulated environments, under the assumptions of noiseless actuation and, most importantly, perfect localization with GPS and compass sensors. However, an accurate GPS signal is difficult to obtain in real indoor environments. To improve PointGoal navigation accuracy without a GPS signal, we use visual odometry (VO) and propose a novel action integration module (AIM) trained in an unsupervised manner. Specifically, unsupervised VO computes the relative pose of the agent from the re-projection error of two adjacent frames, and path integration over these relative poses then replaces the accurate GPS signal. The pseudo position estimated by VO is used to train the action integration module, which helps the agent update its internal perception of location and improves the navigation success rate. Training and inference use only RGB, depth, collision, and self-action information. Experiments show that the proposed system achieves satisfactory results and outperforms partially supervised learning algorithms on the popular Gibson dataset.
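The path-integration step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a planar (x, y, heading) pose and relative motions expressed in the agent's own frame; the function names `integrate_pose` and `goal_in_agent_frame` are hypothetical and are not taken from the paper.

```python
import math

def integrate_pose(pose, delta):
    """Compose a global planar pose with one relative motion.

    pose  = (x, y, theta) in the world frame (assumed convention).
    delta = (dx, dy, dtheta) measured in the agent's current frame,
            e.g. one relative pose estimated between adjacent frames.
    """
    x, y, theta = pose
    dx, dy, dtheta = delta
    # Rotate the local translation into the world frame, then add it.
    new_x = x + math.cos(theta) * dx - math.sin(theta) * dy
    new_y = y + math.sin(theta) * dx + math.cos(theta) * dy
    # Wrap the heading into [-pi, pi).
    new_theta = (theta + dtheta + math.pi) % (2 * math.pi) - math.pi
    return (new_x, new_y, new_theta)

def goal_in_agent_frame(pose, goal):
    """Express a fixed world-frame goal (gx, gy) relative to the agent."""
    x, y, theta = pose
    gx, gy = goal
    dx, dy = gx - x, gy - y
    # Apply the inverse rotation (by -theta) to the world-frame offset.
    rel_x = math.cos(theta) * dx + math.sin(theta) * dy
    rel_y = -math.sin(theta) * dx + math.cos(theta) * dy
    return (rel_x, rel_y)

# Move forward 1 unit, turn left 90 degrees, move forward 1 unit.
pose = (0.0, 0.0, 0.0)
for step in [(1.0, 0.0, 0.0), (0.0, 0.0, math.pi / 2), (1.0, 0.0, 0.0)]:
    pose = integrate_pose(pose, step)
relative_goal = goal_in_agent_frame(pose, (1.0, 3.0))
```

In this example the accumulated pose ends near (1, 1) with a heading of 90 degrees, so a goal at (1, 3) lies about 2 units straight ahead of the agent. Any error in the per-step estimates accumulates in the integrated pose, which is why the quality of the relative-pose estimate matters.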

CV · Jun 5, 2025
Toward Better SSIM Loss for Unsupervised Monocular Depth Estimation

Yijun Cao, Fuya Luo, Yongjie Li

Unsupervised monocular depth learning generally relies on the photometric relation among temporally adjacent images. Most previous works use both the mean absolute error (MAE) and the structural similarity index measure (SSIM) in its conventional form as the training loss. However, they ignore how the different components of the SSIM function, and the corresponding hyperparameters, affect training. To address these issues, this work proposes a new form of SSIM. Compared with the original SSIM function, the proposed form uses addition rather than multiplication to combine the luminance, contrast, and structural similarity components. A loss function constructed with this scheme yields smoother gradients and achieves higher performance on unsupervised depth estimation. We conduct extensive experiments to determine a relatively optimal combination of parameters for the new SSIM. Built on the popular MonoDepth approach, the optimized SSIM loss function remarkably outperforms the baseline on the KITTI-2015 outdoor dataset.
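The difference between multiplying and adding the SSIM components can be sketched on a single patch. The three component formulas below follow the standard SSIM definition; the additive combination is only an assumed weighted average with hypothetical equal weights, since the abstract does not give the exact form or hyperparameters the authors use.

```python
import math

def ssim_components(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Luminance, contrast and structure terms for two equal-length patches.

    x and y are flat lists of intensities in [0, 1]; c1 and c2 are the
    usual stabilising constants for that range.
    """
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    c3 = c2 / 2
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov + c3) / (sd_x * sd_y + c3)
    return luminance, contrast, structure

def ssim_multiplicative(x, y):
    """Conventional SSIM: product of the three components."""
    l, c, s = ssim_components(x, y)
    return l * c * s

def ssim_additive(x, y, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Assumed additive variant: weighted sum of the three components.

    The weights are illustrative placeholders, not values from the paper.
    """
    l, c, s = ssim_components(x, y)
    wl, wc, ws = weights
    return wl * l + wc * c + ws * s

patch = [0.1, 0.4, 0.7, 0.9]
brighter = [v + 0.05 for v in patch]  # same structure, shifted brightness
mult_score = ssim_multiplicative(patch, brighter)
add_score = ssim_additive(patch, brighter)
```

In the product form, the gradient with respect to one component is scaled by the other two, whereas in a sum each component contributes its gradient independently. This is one plausible reading of the smoother gradients the abstract reports, not a claim about the authors' analysis.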

CV · May 14, 2024
Vector-Symbolic Architecture for Event-Based Optical Flow

Hongzhi You, Yijun Cao, Wei Yuan et al.

From the perspective of feature matching, optical flow estimation for event cameras involves identifying event correspondences by comparing feature similarity across accompanying event frames. In this work, we introduce an effective and robust high-dimensional (HD) feature descriptor for event frames, utilizing Vector Symbolic Architectures (VSA). The topological similarity among neighboring variables within VSA enhances the representation similarity of feature descriptors for flow-matching points, while its structured symbolic representation capacity facilitates feature fusion from both event polarities and multiple spatial scales. Based on this HD feature descriptor, we propose a novel feature matching framework for event-based optical flow, encompassing both model-based (VSA-Flow) and self-supervised learning (VSA-SM) methods. In VSA-Flow, accurate optical flow estimation validates the effectiveness of HD feature descriptors. In VSA-SM, a novel similarity maximization method based on the HD feature descriptor is proposed to learn optical flow in a self-supervised way from events alone, eliminating the need for auxiliary grayscale images. Evaluation results demonstrate that our VSA-based method achieves superior accuracy compared with both model-based and self-supervised learning methods on the DSEC benchmark, while remaining competitive with both types of methods on the MVSEC benchmark. This contribution marks a significant advancement in event-based optical flow within the feature matching methodology.
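For readers unfamiliar with VSA, the sketch below shows the two generic operations that this kind of feature fusion relies on: binding, which attaches a role to a feature, and bundling, which superimposes several bound pairs. It assumes a simple bipolar (+1/-1) VSA with made-up role names. The abstract does not say which VSA variant the paper uses, so this illustrates the general idea rather than the authors' descriptor.

```python
import random

def random_hv(dim, rng):
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(dim)]

def bind(a, b):
    """Binding by element-wise product; self-inverse for bipolar vectors."""
    return [x * y for x, y in zip(a, b)]

def bundle(*vectors):
    """Bundling by element-wise majority vote (ties resolve to +1)."""
    return [1 if sum(column) >= 0 else -1 for column in zip(*vectors)]

def similarity(a, b):
    """Normalised dot product in [-1, 1]."""
    return sum(x * y for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
dim = 2048
# Hypothetical role vectors for the sources being fused.
role_pos, role_neg, role_coarse = (random_hv(dim, rng) for _ in range(3))
# Hypothetical feature vectors for one pixel from each source.
feat_pos, feat_neg, feat_coarse = (random_hv(dim, rng) for _ in range(3))

# Fuse the three role-feature pairs into one descriptor.
descriptor = bundle(
    bind(role_pos, feat_pos),
    bind(role_neg, feat_neg),
    bind(role_coarse, feat_coarse),
)
# Unbinding with a role recovers a noisy copy of the matching feature.
recovered = bind(descriptor, role_pos)
match = similarity(recovered, feat_pos)
mismatch = similarity(recovered, feat_neg)
```

With high-dimensional random vectors, `match` comes out clearly above `mismatch`, which stays near zero. That separation is what allows a single fused descriptor to be compared across frames.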

CV · Mar 21, 2024
Weak Supervision with Arbitrary Single Frame for Micro- and Macro-expression Spotting

Wang-Wang Yu, Xian-Shi Zhang, Fu-Ya Luo et al.

Frame-level micro- and macro-expression spotting methods require time-consuming frame-by-frame observation during annotation. Meanwhile, video-level spotting lacks sufficient information about the location and number of expressions during training, resulting in significantly inferior performance compared with fully-supervised spotting. To bridge this gap, we propose a point-level weakly-supervised expression spotting (PWES) framework, where each expression needs to be annotated with only one random frame (i.e., a point). To mitigate the issue of sparse label distribution, the prevailing solution is pseudo-label mining, which, however, introduces new problems: localizing contextual background snippets results in inaccurate boundaries, and discarding foreground snippets leads to fragmentary predictions. Therefore, we design two strategies, multi-refined pseudo label generation (MPLG) and distribution-guided feature contrastive learning (DFCL), to address these problems. Specifically, MPLG generates more reliable pseudo labels by merging class-specific probabilities, attention scores, fused features, and point-level labels. DFCL enhances feature similarity within the same category and feature variability across different categories while capturing global representations across the entire dataset. Extensive experiments on the CAS(ME)^2, CAS(ME)^3, and SAMM-LV datasets demonstrate that PWES achieves promising performance comparable to that of recent fully-supervised methods.

CV · Nov 22, 2021
Learning Generalized Visual Odometry Using Position-Aware Optical Flow and Geometric Bundle Adjustment

Yijun Cao, Xianshi Zhang, Fuya Luo et al.

Recent visual odometry (VO) methods that incorporate geometric algorithms into deep-learning architectures have shown outstanding performance on the challenging monocular VO task. Despite these encouraging results, previous methods ignore the need for generalization in noisy environments and across varied scenes. To address this challenging issue, this work first proposes a novel optical flow network (PANet). Unlike previous methods, which predict optical flow as a direct regression task, PANet predicts an optical flow probability volume over a discrete position space and then converts that volume to optical flow. Next, we improve the bundle adjustment module to fit the self-supervised training pipeline by introducing multiple sampling, ego-motion initialization, dynamic damping factor adjustment, and Jacobian matrix weighting. In addition, a novel normalized photometric loss function is proposed to improve depth estimation accuracy. Experiments show that the proposed system not only achieves performance comparable with other state-of-the-art self-supervised learning-based methods on the KITTI dataset, but also significantly improves generalization compared with geometry-based, learning-based, and hybrid VO systems on the noisy KITTI and the challenging outdoor (KAIST) scenes.
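One common way to turn a probability volume over discrete positions into a continuous flow vector is to take the probability-weighted average of the candidate displacements (a soft-argmax). The abstract does not specify how PANet performs the conversion, so the sketch below is an assumed illustration for a single pixel, with hypothetical function names.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def candidate_displacements(radius):
    """All integer (du, dv) offsets in a (2r+1) x (2r+1) search window."""
    return [(du, dv)
            for dv in range(-radius, radius + 1)
            for du in range(-radius, radius + 1)]

def expected_flow(scores, radius):
    """Flow for one pixel as the expectation over discrete displacements.

    scores holds one raw matching score per candidate, in the order
    produced by candidate_displacements(radius).
    """
    candidates = candidate_displacements(radius)
    if len(scores) != len(candidates):
        raise ValueError("expected one score per candidate displacement")
    probs = softmax(scores)
    u = sum(p * du for p, (du, _) in zip(probs, candidates))
    v = sum(p * dv for p, (_, dv) in zip(probs, candidates))
    return (u, v)

# A 3x3 window whose evidence is split evenly between offsets
# (0, 0) and (1, 0) yields a sub-pixel estimate halfway between them.
scores = [-20.0] * 9
scores[4] = 5.0  # candidate (0, 0)
scores[5] = 5.0  # candidate (1, 0)
flow = expected_flow(scores, radius=1)
```

Here `flow` lands near (0.5, 0.0), showing how a distribution over integer positions can still produce sub-pixel flow. A design consequence is that the prediction is bounded by the search window, unlike direct regression.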