Songhao Piao

AI
h-index4
8papers
4,249citations
Novelty57%
AI Score41

8 Papers

CVJul 24, 2025
DSFormer: A Dual-Scale Cross-Learning Transformer for Visual Place Recognition

Haiyang Jiang, Songhao Piao, Chao Gao et al.

Visual Place Recognition (VPR) is crucial for robust mobile robot localization, yet it faces significant challenges in maintaining reliable performance under varying environmental conditions and viewpoints. To address this, we propose a novel framework that integrates Dual-Scale-Former (DSFormer), a Transformer-based cross-learning module, with an innovative block clustering strategy. DSFormer enhances feature representation by enabling bidirectional information transfer between dual-scale features extracted from the final two CNN layers, capturing both semantic richness and spatial details through self-attention for long-range dependencies within each scale and shared cross-attention for cross-scale learning. Complementing this, our block clustering strategy repartitions the widely used San Francisco eXtra Large (SF-XL) training dataset from multiple distinct perspectives, optimizing data organization to further bolster robustness against viewpoint variations. Together, these innovations not only yield a robust global embedding adaptable to environmental changes but also reduce the required training data volume by approximately 30\% compared to previous partitioning methods. Comprehensive experiments demonstrate that our approach achieves state-of-the-art performance across most benchmark datasets, surpassing advanced reranking methods like DELG, Patch-NetVLAD, TransVPR, and R2Former as a global retrieval solution using 512-dim global descriptors, while significantly improving computational efficiency.

CVJun 15, 2021
BEiT: BERT Pre-Training of Image Transformers

Hangbo Bao, Li Dong, Songhao Piao et al.

We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e, image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and fed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%). The code and pretrained models are available at https://aka.ms/beit.

AIJun 1, 2020
A novel approach for multi-agent cooperative pursuit to capture grouped evaders

Muhammad Zuhair Qadir, Songhao Piao, Haiyang Jiang et al.

An approach of mobile multi-agent pursuit based on application of self-organizing feature map (SOFM) and along with that reinforcement learning based on agent group role membership function (AGRMF) model is proposed. This method promotes dynamic organization of the pursuers' groups and also makes pursuers' group evader according to their desire based on SOFM and AGRMF techniques. This helps to overcome the shortcomings of the pursuers that they cannot fully reorganize when the goal is too independent in process of AGRMF models operation. Besides, we also discuss a new reward function. After the formation of the group, reinforcement learning is applied to get the optimal solution for each agent. The results of each step in capturing process will finally affect the AGR membership function to speed up the convergence of the competitive neural network. The experiments result shows that this approach is more effective for the mobile agents to capture evaders.

CLFeb 28, 2020
UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training

Hangbo Bao, Li Dong, Furu Wei et al.

We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks using a novel training procedure, referred to as a pseudo-masked language model (PMLM). Given an input text with masked tokens, we rely on conventional masks to learn inter-relations between corrupted tokens and context via autoencoding, and pseudo masks to learn intra-relations between masked spans via partially autoregressive modeling. With well-designed position embeddings and self-attention masks, the context encodings are reused to avoid redundant computation. Moreover, conventional masks used for autoencoding provide global masking information, so that all the position embeddings are accessible in partially autoregressive language modeling. In addition, the two tasks pre-train a unified language model as a bidirectional encoder and a sequence-to-sequence decoder, respectively. Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks across several widely used benchmarks.

CLSep 12, 2018
Neural Melody Composition from Lyrics

Hangbo Bao, Shaohan Huang, Furu Wei et al.

In this paper, we study a novel task that learns to compose music from natural language. Given the lyrics as input, we propose a melody composition model that generates lyrics-conditional melody as well as the exact alignment between the generated melody and the given lyrics simultaneously. More specifically, we develop the melody composition model based on the sequence-to-sequence framework. It consists of two neural encoders to encode the current lyrics and the context melody respectively, and a hierarchical decoder to jointly produce musical notes and the corresponding alignment. Experimental results on lyrics-melody pairs of 18,451 pop songs demonstrate the effectiveness of our proposed methods. In addition, we apply a singing voice synthesizer software to synthesize the "singing" of the lyrics and melodies for human evaluation. Results indicate that our generated melodies are more melodious and tuneful compared with the baseline method.

ROApr 13, 2018
Tightly-coupled Monocular Visual-odometric SLAM using Wheels and a MEMS Gyroscope

Meixiang Quan, Songhao Piao, Minglang Tan et al.

In this paper, we present a novel tightly-coupled probabilistic monocular visual-odometric Simultaneous Localization and Mapping algorithm using wheels and a MEMS gyroscope, which can provide accurate, robust and long-term localization for the ground robot moving on a plane. Firstly, we present an odometer preintegration theory that integrates the wheel encoder measurements and gyroscope measurements to a local frame. The preintegration theory properly addresses the manifold structure of the rotation group SO(3) and carefully deals with uncertainty propagation and bias correction. Then the novel odometer error term is formulated using the odometer preintegration model and it is tightly integrated into the visual optimization framework. Furthermore, we introduce a complete tracking framework to provide different strategies for motion tracking when (1) both measurements are available, (2) visual measurements are not available, and (3) wheel encoder experiences slippage, which leads the system to be accurate and robust. Finally, the proposed algorithm is evaluated by performing extensive experiments, the experimental results demonstrate the superiority of the proposed system.

AIJul 17, 2017
Coalition formation for Multi-agent Pursuit based on Neural Network and AGRMF Model

Zhaoyi Pei, Songhao Piao, Mohammed Ei Souidi

An approach for coalition formation of multi-agent pursuit based on neural network and AGRMF model is proposed.This paper constructs a novel neural work called AGRMF-ANN which consists of feature extraction part and group generation part. On one hand,The convolutional layers of feature extraction part can abstract the features of agent group role membership function(AGRMF) for all of the groups,on the other hand,those features will be fed to the group generation part based on self-organizing map(SOM) layer which is used to group the pursuers with similar features in the same group. Besides, we also come up the group attractiveness function(GAF) to evaluate the quality of groups and the pursuers contribution in order to adjust the main ability indicators of AGRMF and other weight of all neural network. The simulation experiment showed that this proposal can improve the effectiveness of coalition formation for multi-agent pursuit and ability to adopt pursuit-evasion problem with the scale of pursuer team growing.

ROJun 12, 2017
Accurate Monocular Visual-inertial SLAM using a Map-assisted EKF Approach

Meixiang Quan, Songhao Piao, Minglang Tan et al.

This paper presents a novel tightly-coupled monocular visual-inertial Simultaneous Localization and Mapping algorithm, which provides accurate and robust localization within the globally consistent map in real time on a standard CPU. This is achieved by firstly performing the visual-inertial extended kalman filter(EKF) to provide motion estimate at a high rate. However the filter becomes inconsistent due to the well known linearization issues. So we perform a keyframe-based visual-inertial bundle adjustment to improve the consistency and accuracy of the system. In addition, a loop closure detection and correction module is also added to eliminate the accumulated drift when revisiting an area. Finally, the optimized motion estimates and map are fed back to the EKF-based visual-inertial odometry module, thus the inconsistency and estimation error of the EKF estimator are reduced. In this way, the system can continuously provide reliable motion estimates for the long-term operation. The performance of the algorithm is validated on public datasets and real-world experiments, which proves the superiority of the proposed algorithm.