Yanyan Fang

CV · Oct 22, 2021

Cross-domain Trajectory Prediction with CTP-Net

Pingxuan Huang, Zhenhua Cui, Jing Li et al.

Most pedestrian trajectory prediction methods rely on large amounts of annotated trajectories, which are time-consuming and expensive to collect. Moreover, a well-trained model may not generalize effectively to a new scenario captured by another camera. Therefore, it is desirable to adapt a model trained on an annotated source domain to the target domain. To achieve domain adaptation for trajectory prediction, we propose a Cross-domain Trajectory Prediction Network (CTP-Net). In this framework, encoders are used in both domains to encode the observed trajectories, and their features are then aligned by a cross-domain feature discriminator. Further, considering the consistency between the observed and the predicted trajectories, a target domain offset discriminator is used to adversarially regularize the future trajectory predictions so that they are in line with the observed trajectories. Extensive experiments demonstrate the effectiveness of our method on domain adaptation for pedestrian trajectory prediction.
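The feature alignment described above can be illustrated with a generic adversarial objective. The sketch below is not taken from the paper: it assumes a hypothetical logistic discriminator (`discriminate`, with illustrative `weights` and `bias`) and standard GAN-style cross-entropy losses, whereas the abstract does not state which losses CTP-Net actually uses.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def discriminate(feature, weights, bias):
    """Hypothetical logistic discriminator: P(feature is from the source domain)."""
    return sigmoid(sum(w * f for w, f in zip(weights, feature)) + bias)

def discriminator_loss(src_feats, tgt_feats, weights, bias):
    """Cross-entropy for telling source features (label 1) from target features (label 0)."""
    src = [-math.log(discriminate(f, weights, bias)) for f in src_feats]
    tgt = [-math.log(1.0 - discriminate(f, weights, bias)) for f in tgt_feats]
    return sum(src) / len(src) + sum(tgt) / len(tgt)

def encoder_adversarial_loss(tgt_feats, weights, bias):
    """Loss for the target encoder: low when target features look like source features."""
    losses = [-math.log(discriminate(f, weights, bias)) for f in tgt_feats]
    return sum(losses) / len(losses)

# Toy 1-D features: the two domains are well separated, so the
# discriminator loss is small and the encoder's adversarial loss is large.
source_features = [[2.0]]
target_features = [[-2.0]]
d_loss = discriminator_loss(source_features, target_features, [1.0], 0.0)
e_loss = encoder_adversarial_loss(target_features, [1.0], 0.0)
```

In a training loop under these assumptions, the discriminator would be updated to reduce `d_loss` while the target encoder is updated to reduce `e_loss`, which pushes the two feature distributions together.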

CV · Jul 18, 2019

Locality-constrained Spatial Transformer Network for Video Crowd Counting

Yanyan Fang, Biyun Zhan, Wandi Cai et al.

Compared with single-image crowd counting, video provides spatial-temporal information about the crowd that can help improve the robustness of crowd counting. However, translation, rotation and scaling of people change the density map of heads between neighbouring frames. Meanwhile, people walking in or out of the scene, or being occluded in dynamic scenes, change the head count. To alleviate these issues in video crowd counting, a Locality-constrained Spatial Transformer Network (LSTN) is proposed. Specifically, we first leverage a Convolutional Neural Network to estimate the density map for each frame. Then, to relate the density maps between neighbouring frames, a Locality-constrained Spatial Transformer (LST) module is introduced to estimate the density map of the next frame from that of the current frame. To facilitate performance evaluation, a large-scale video crowd counting dataset is collected, which contains 15K frames with about 394K annotated heads captured from 13 different scenes. As far as we know, it is the largest video crowd counting dataset. Extensive experiments on our dataset and other crowd counting datasets validate the effectiveness of our LSTN for crowd counting.
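The idea of estimating the next frame's density map from the current one can be sketched as a plain affine warp with bilinear sampling. This is only an illustration of a generic spatial-transformer-style warp, not the authors' LST module: the function names (`warp_density`, `head_count`) and the fixed 2x3 matrix `theta` are assumptions, and the locality constraint and the learning of the transformation are not shown.

```python
import math

def warp_density(density, theta):
    """Warp a 2D density map with a 2x3 affine matrix using bilinear sampling.

    theta maps each output pixel (x, y) to the source location it samples from.
    Samples that fall outside the map contribute zero.
    """
    h, w = len(density), len(density[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            sy = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            x0, y0 = math.floor(sx), math.floor(sy)
            fx, fy = sx - x0, sy - y0
            val = 0.0
            # Blend the four neighbouring source pixels.
            for yy, wy in ((y0, 1 - fy), (y0 + 1, fy)):
                for xx, wx in ((x0, 1 - fx), (x0 + 1, fx)):
                    if 0 <= yy < h and 0 <= xx < w:
                        val += wy * wx * density[yy][xx]
            out[y][x] = val
    return out

def head_count(density):
    """The estimated count is the sum over the density map."""
    return sum(map(sum, density))

# One head at row 1, column 1 of a 4x4 map.
current = [[0.0] * 4 for _ in range(4)]
current[1][1] = 1.0

# Pure translation: each output pixel samples one pixel to its left,
# so the head appears one pixel to the right in the warped map.
theta = [[1.0, 0.0, -1.0],
         [0.0, 1.0, 0.0]]
predicted_next = warp_density(current, theta)
```

Under this toy translation the head moves inside the frame, so the summed count is unchanged; a warp that moved it past the border would lower the count, which mirrors the abstract's point about people walking out of the scene.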
