Beihao Xia

CV
h-index28
22papers
654citations
Novelty48%
AI Score49

22 Papers

CVSep 18, 2022Code
TODE-Trans: Transparent Object Depth Estimation with Transformer

Kang Chen, Shaochen Wang, Beihao Xia et al.

Transparent objects are widely used in industrial automation and daily life. However, robust visual recognition and perception of transparent objects have always been a major challenge. Currently, most commercial-grade depth cameras are still not good at sensing the surfaces of transparent objects due to the refraction and reflection of light. In this work, we present a transformer-based transparent object depth estimation approach from a single RGB-D input. We observe that the global characteristics of the transformer make it easier to extract contextual information to perform depth estimation of transparent areas. In addition, to better enhance the fine-grained features, a feature fusion module (FFM) is designed to assist coherent prediction. Our empirical evidence demonstrates that our model delivers significant improvements in recent popular datasets, e.g., 25% gain on RMSE and 21% gain on REL compared to previous state-of-the-art convolutional-based counterparts in ClearGrasp dataset. Extensive results show that our transformer-based model enables better aggregation of the object's RGB and inaccurate depth information to obtain a better depth representation. Our code and the pre-trained model will be available at https://github.com/yuchendoudou/TODE.

CRJun 14, 2023Code
Efficient Backdoor Attacks for Deep Neural Networks in Real-world Scenarios

Ziqiang Li, Hong Sun, Pengfei Xia et al.

Recent deep neural networks (DNNs) have came to rely on vast amounts of training data, providing an opportunity for malicious attackers to exploit and contaminate the data to carry out backdoor attacks. However, existing backdoor attack methods make unrealistic assumptions, assuming that all training data comes from a single source and that attackers have full access to the training data. In this paper, we introduce a more realistic attack scenario where victims collect data from multiple sources, and attackers cannot access the complete training data. We refer to this scenario as data-constrained backdoor attacks. In such cases, previous attack methods suffer from severe efficiency degradation due to the entanglement between benign and poisoning features during the backdoor injection process. To tackle this problem, we introduce three CLIP-based technologies from two distinct streams: Clean Feature Suppression and Poisoning Feature Augmentation.effective solution for data-constrained backdoor attacks. The results demonstrate remarkable improvements, with some settings achieving over 100% improvement compared to existing attacks in data-constrained scenarios. Code is available at https://github.com/sunh1113/Efficient-backdoor-attacks-for-deep-neural-networks-in-real-world-scenarios

CVOct 9, 2023
SocialCircle: Learning the Angle-based Social Interaction Representation for Pedestrian Trajectory Prediction

Conghao Wong, Beihao Xia, Ziqian Zou et al.

Analyzing and forecasting trajectories of agents like pedestrians and cars in complex scenes has become more and more significant in many intelligent systems and applications. The diversity and uncertainty in socially interactive behaviors among a rich variety of agents make this task more challenging than other deterministic computer vision tasks. Researchers have made a lot of efforts to quantify the effects of these interactions on future trajectories through different mathematical models and network structures, but this problem has not been well solved. Inspired by marine animals that localize the positions of their companions underwater through echoes, we build a new anglebased trainable social interaction representation, named SocialCircle, for continuously reflecting the context of social interactions at different angular orientations relative to the target agent. We validate the effect of the proposed SocialCircle by training it along with several newly released trajectory prediction models, and experiments show that the SocialCircle not only quantitatively improves the prediction performance, but also qualitatively helps better simulate social interactions when forecasting pedestrian trajectories in a way that is consistent with human intuitions.

CRJun 14, 2023
A Proxy Attack-Free Strategy for Practically Improving the Poisoning Efficiency in Backdoor Attacks

Ziqiang Li, Hong Sun, Pengfei Xia et al.

Poisoning efficiency is crucial in poisoning-based backdoor attacks, as attackers aim to minimize the number of poisoning samples while maximizing attack efficacy. Recent studies have sought to enhance poisoning efficiency by selecting effective samples. However, these studies typically rely on a proxy backdoor injection task to identify an efficient set of poisoning samples. This proxy attack-based approach can lead to performance degradation if the proxy attack settings differ from those of the actual victims, due to the shortcut nature of backdoor learning. Furthermore, proxy attack-based methods are extremely time-consuming, as they require numerous complete backdoor injection processes for sample selection. To address these concerns, we present a Proxy attack-Free Strategy (PFS) designed to identify efficient poisoning samples based on the similarity between clean samples and their corresponding poisoning samples, as well as the diversity of the poisoning set. The proposed PFS is motivated by the observation that selecting samples with high similarity between clean and corresponding poisoning samples results in significantly higher attack success rates compared to using samples with low similarity. Additionally, we provide theoretical foundations to explain the proposed PFS. We comprehensively evaluate the proposed strategy across various datasets, triggers, poisoning rates, architectures, and training hyperparameters. Our experimental results demonstrate that PFS enhances backdoor attack efficiency while also offering a remarkable speed advantage over previous proxy attack-based selection methodologies.

CVApr 11, 2023
Another Vertical View: A Hierarchical Network for Heterogeneous Trajectory Prediction via Spectrums

Beihao Xia, Conghao Wong, Duanquan Xu et al.

With the fast development of AI-related techniques, the applications of trajectory prediction are no longer limited to easier scenes and trajectories. More and more trajectories with different forms, such as coordinates, bounding boxes, and even high-dimensional human skeletons, need to be analyzed and forecasted. Among these heterogeneous trajectories, interactions between different elements within a frame of trajectory, which we call ``Dimension-wise Interactions'', would be more complex and challenging. However, most previous approaches focus mainly on a specific form of trajectories, and potential dimension-wise interactions are less concerned. In this work, we expand the trajectory prediction task by introducing the trajectory dimensionality $M$, thus extending its application scenarios to heterogeneous trajectories. We first introduce the Haar transform as an alternative to Fourier transform to better capture the time-frequency properties of each trajectory-dimension. Then, we adopt the bilinear structure to model and fuse two factors simultaneously, including the time-frequency response and the dimension-wise interaction, to forecast heterogeneous trajectories via trajectory spectrums hierarchically in a generic way. Experiments show that the proposed model outperforms most state-of-the-art methods on ETH-UCY, SDD, nuScenes, and Human3.6M with heterogeneous trajectories, including 2D coordinates, 2D/3D bounding boxes, and 3D human skeletons.

CVApr 18, 2022
A Comprehensive Survey on Data-Efficient GANs in Image Generation

Ziqiang Li, Beihao Xia, Jing Zhang et al.

Generative Adversarial Networks (GANs) have achieved remarkable achievements in image synthesis. These successes of GANs rely on large scale datasets, requiring too much cost. With limited training data, how to stable the training process of GANs and generate realistic images have attracted more attention. The challenges of Data-Efficient GANs (DE-GANs) mainly arise from three aspects: (i) Mismatch Between Training and Target Distributions, (ii) Overfitting of the Discriminator, and (iii) Imbalance Between Latent and Data Spaces. Although many augmentation and pre-training strategies have been proposed to alleviate these issues, there lacks a systematic survey to summarize the properties, challenges, and solutions of DE-GANs. In this paper, we revisit and define DE-GANs from the perspective of distribution optimization. We conclude and analyze the challenges of DE-GANs. Meanwhile, we propose a taxonomy, which classifies the existing methods into three categories: Data Selection, GANs Optimization, and Knowledge Sharing. Last but not the least, we attempt to highlight the current problems and the future directions.

CVSep 23, 2024
SocialCircle+: Learning the Angle-based Conditioned Interaction Representation for Pedestrian Trajectory Prediction

Conghao Wong, Beihao Xia, Ziqian Zou et al.

Trajectory prediction is a crucial aspect of understanding human behaviors. Researchers have made efforts to represent socially interactive behaviors among pedestrians and utilize various networks to enhance prediction capability. Unfortunately, they still face challenges not only in fully explaining and measuring how these interactive behaviors work to modify trajectories but also in modeling pedestrians' preferences to plan or participate in social interactions in response to the changeable physical environments as extra conditions. This manuscript mainly focuses on the above explainability and conditionality requirements for trajectory prediction networks. Inspired by marine animals perceiving other companions and the environment underwater by echolocation, this work constructs an angle-based conditioned social interaction representation SocialCircle+ to represent the socially interactive context and its corresponding conditions. It employs a social branch and a conditional branch to describe how pedestrians are positioned in prediction scenes socially and physically in angle-based-cyclic-sequence forms. Then, adaptive fusion is applied to fuse the above conditional clues onto the social ones to learn the final interaction representation. Experiments demonstrate the superiority of SocialCircle+ with different trajectory prediction backbones. Moreover, counterfactual interventions have been made to simultaneously verify the modeling capacity of causalities among interactive variables and the conditioning capability.

CVNov 14, 2025
Reverberation: Learning the Latencies Before Forecasting Trajectories

Conghao Wong, Ziqian Zou, Beihao Xia et al.

Bridging the past to the future, connecting agents both spatially and temporally, lies at the core of the trajectory prediction task. Despite great efforts, it remains challenging to explicitly learn and predict latencies, the temporal delays with which agents respond to different trajectory-changing events and adjust their future paths, whether on their own or interactively. Different agents may exhibit distinct latency preferences for noticing, processing, and reacting to any specific trajectory-changing event. The lack of consideration of such latencies may undermine the causal continuity of the forecasting system and also lead to implausible or unintended trajectories. Inspired by the reverberation curves in acoustics, we propose a new reverberation transform and the corresponding Reverberation (short for Rev) trajectory prediction model, which simulates and predicts different latency preferences of each agent as well as their stochasticity by using two explicit and learnable reverberation kernels, allowing for the controllable trajectory prediction based on these forecasted latencies. Experiments on multiple datasets, whether pedestrians or vehicles, demonstrate that Rev achieves competitive accuracy while revealing interpretable latency dynamics across agents and scenarios. Qualitative analyses further verify the properties of the proposed reverberation transform, highlighting its potential as a general latency modeling approach.

CVJul 1, 2025Code
UAVD-Mamba: Deformable Token Fusion Vision Mamba for Multimodal UAV Detection

Wei Li, Jiaman Tang, Yang Li et al.

Unmanned Aerial Vehicle (UAV) object detection has been widely used in traffic management, agriculture, emergency rescue, etc. However, it faces significant challenges, including occlusions, small object sizes, and irregular shapes. These challenges highlight the necessity for a robust and efficient multimodal UAV object detection method. Mamba has demonstrated considerable potential in multimodal image fusion. Leveraging this, we propose UAVD-Mamba, a multimodal UAV object detection framework based on Mamba architectures. To improve geometric adaptability, we propose the Deformable Token Mamba Block (DTMB) to generate deformable tokens by incorporating adaptive patches from deformable convolutions alongside normal patches from normal convolutions, which serve as the inputs to the Mamba Block. To optimize the multimodal feature complementarity, we design two separate DTMBs for the RGB and infrared (IR) modalities, with the outputs from both DTMBs integrated into the Mamba Block for feature extraction and into the Fusion Mamba Block for feature fusion. Additionally, to improve multiscale object detection, especially for small objects, we stack four DTMBs at different scales to produce multiscale feature representations, which are then sent to the Detection Neck for Mamba (DNM). The DNM module, inspired by the YOLO series, includes modifications to the SPPF and C3K2 of YOLOv11 to better handle the multiscale features. In particular, we employ cross-enhanced spatial attention before the DTMB and cross-channel attention after the Fusion Mamba Block to extract more discriminative features. Experimental results on the DroneVehicle dataset show that our method outperforms the baseline OAFA method by 3.6% in the mAP metric. Codes will be released at https://github.com/GreatPlum-hnu/UAVD-Mamba.git.

CVJul 29, 2021Code
FREE: Feature Refinement for Generalized Zero-Shot Learning

Shiming Chen, Wenjie Wang, Beihao Xia et al.

Generalized zero-shot learning (GZSL) has achieved significant progress, with many efforts dedicated to overcoming the problems of visual-semantic domain gap and seen-unseen bias. However, most existing methods directly use feature extraction models trained on ImageNet alone, ignoring the cross-dataset bias between ImageNet and GZSL benchmarks. Such a bias inevitably results in poor-quality visual features for GZSL tasks, which potentially limits the recognition performance on both seen and unseen classes. In this paper, we propose a simple yet effective GZSL method, termed feature refinement for generalized zero-shot learning (FREE), to tackle the above problem. FREE employs a feature refinement (FR) module that incorporates \textit{semantic$\rightarrow$visual} mapping into a unified generative model to refine the visual features of seen and unseen class samples. Furthermore, we propose a self-adaptive margin center loss (SAMC-loss) that cooperates with a semantic cycle-consistency loss to guide FR to learn class- and semantically-relevant representations, and concatenate the features in FR to extract the fully refined features. Extensive experiments on five benchmark datasets demonstrate the significant performance gain of FREE over its baseline and current state-of-the-art methods. Our codes are available at https://github.com/shiming-chen/FREE .

CVNov 13, 2025
TubeRMC: Tube-conditioned Reconstruction with Mutual Constraints for Weakly-supervised Spatio-Temporal Video Grounding

Jinxuan Li, Yi Zhang, Jian-Fang Hu et al.

Spatio-Temporal Video Grounding (STVG) aims to localize a spatio-temporal tube that corresponds to a given language query in an untrimmed video. This is a challenging task since it involves complex vision-language understanding and spatiotemporal reasoning. Recent works have explored weakly-supervised setting in STVG to eliminate reliance on fine-grained annotations like bounding boxes or temporal stamps. However, they typically follow a simple late-fusion manner, which generates tubes independent of the text description, often resulting in failed target identification and inconsistent target tracking. To address this limitation, we propose a Tube-conditioned Reconstruction with Mutual Constraints (\textbf{TubeRMC}) framework that generates text-conditioned candidate tubes with pre-trained visual grounding models and further refine them via tube-conditioned reconstruction with spatio-temporal constraints. Specifically, we design three reconstruction strategies from temporal, spatial, and spatio-temporal perspectives to comprehensively capture rich tube-text correspondences. Each strategy is equipped with a Tube-conditioned Reconstructor, utilizing spatio-temporal tubes as condition to reconstruct the key clues in the query. We further introduce mutual constraints between spatial and temporal proposals to enhance their quality for reconstruction. TubeRMC outperforms existing methods on two public benchmarks VidSTG and HCSTVG. Further visualization shows that TubeRMC effectively mitigates both target identification errors and inconsistent tracking.

CVJan 13, 2025
Pedestrian Trajectory Prediction Based on Social Interactions Learning With Random Weights

Jiajia Xie, Sheng Zhang, Beihao Xia et al.

Pedestrian trajectory prediction is a critical technology in the evolution of self-driving cars toward complete artificial intelligence. Over recent years, focusing on the trajectories of pedestrians to model their social interactions has surged with great interest in more accurate trajectory predictions. However, existing methods for modeling pedestrian social interactions rely on pre-defined rules, struggling to capture non-explicit social interactions. In this work, we propose a novel framework named DTGAN, which extends the application of Generative Adversarial Networks (GANs) to graph sequence data, with the primary objective of automatically capturing implicit social interactions and achieving precise predictions of pedestrian trajectory. DTGAN innovatively incorporates random weights within each graph to eliminate the need for pre-defined interaction rules. We further enhance the performance of DTGAN by exploring diverse task loss functions during adversarial training, which yields improvements of 16.7\% and 39.3\% on metrics ADE and FDE, respectively. The effectiveness and accuracy of our framework are verified on two public datasets. The experimental results show that our proposed DTGAN achieves superior performance and is well able to understand pedestrians' intentions.

CVMar 21, 2024
Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels

Tianming Liang, Chaolei Tan, Beihao Xia et al.

This paper focuses on open-ended video question answering, which aims to find the correct answers from a large answer set in response to a video-related question. This is essentially a multi-label classification task, since a question may have multiple answers. However, due to annotation costs, the labels in existing benchmarks are always extremely insufficient, typically one answer per question. As a result, existing works tend to directly treat all the unlabeled answers as negative labels, leading to limited ability for generalization. In this work, we introduce a simple yet effective ranking distillation framework (RADI) to mitigate this problem without additional manual annotation. RADI employs a teacher model trained with incomplete labels to generate rankings for potential answers, which contain rich knowledge about label priority as well as label-associated visual cues, thereby enriching the insufficient labeling information. To avoid overconfidence in the imperfect teacher model, we further present two robust and parameter-free ranking distillation approaches: a pairwise approach which introduces adaptive soft margins to dynamically refine the optimization constraints on various pairwise rankings, and a listwise approach which adopts sampling-based partial listwise learning to resist the bias in teacher ranking. Extensive experiments on five popular benchmarks consistently show that both our pairwise and listwise RADIs outperform state-of-the-art methods. Further analysis demonstrates the effectiveness of our methods on the insufficient labeling problem.

CVJul 7, 2025
LTMSformer: A Local Trend-Aware Attention and Motion State Encoding Transformer for Multi-Agent Trajectory Prediction

Yixin Yan, Yang Li, Yuanfan Wang et al.

It has been challenging to model the complex temporal-spatial dependencies between agents for trajectory prediction. As each state of an agent is closely related to the states of adjacent time steps, capturing the local temporal dependency is beneficial for prediction, while most studies often overlook it. Besides, learning the high-order motion state attributes is expected to enhance spatial interaction modeling, but it is rarely seen in previous works. To address this, we propose a lightweight framework, LTMSformer, to extract temporal-spatial interaction features for multi-modal trajectory prediction. Specifically, we introduce a Local Trend-Aware Attention mechanism to capture the local temporal dependency by leveraging a convolutional attention mechanism with hierarchical local time boxes. Next, to model the spatial interaction dependency, we build a Motion State Encoder to incorporate high-order motion state attributes, such as acceleration, jerk, heading, etc. To further refine the trajectory prediction, we propose a Lightweight Proposal Refinement Module that leverages Multi-Layer Perceptrons for trajectory embedding and generates the refined trajectories with fewer model parameters. Experiment results on the Argoverse 1 dataset demonstrate that our method outperforms the baseline HiVT-64, reducing the minADE by approximately 4.35%, the minFDE by 8.74%, and the MR by 20%. We also achieve higher accuracy than HiVT-128 with a 68% reduction in model size.

CVDec 3, 2024
Resonance: Learning to Predict Social-Aware Pedestrian Trajectories as Co-Vibrations

Conghao Wong, Ziqian Zou, Beihao Xia et al.

Learning to forecast trajectories of intelligent agents has caught much more attention recently. However, it remains a challenge to accurately account for agents' intentions and social behaviors when forecasting, and in particular, to simulate the unique randomness within each of those components in an explainable and decoupled way. Inspired by vibration systems and their resonance properties, we propose the Resonance (short for Re) model to encode and forecast pedestrian trajectories in the form of ``co-vibrations''. It decomposes trajectory modifications and randomnesses into multiple vibration portions to simulate agents' reactions to each single cause, and forecasts trajectories as the superposition of these independent vibrations separately. Also, benefiting from such vibrations and their spectral properties, representations of social interactions can be learned by emulating the resonance phenomena, further enhancing its explainability. Experiments on multiple datasets have verified its usefulness both quantitatively and qualitatively.

CVDec 3, 2024
Who Walks With You Matters: Perceiving Social Interactions with Groups for Pedestrian Trajectory Prediction

Ziqian Zou, Conghao Wong, Beihao Xia et al.

Understanding and anticipating human movement has become more critical and challenging in diverse applications such as autonomous driving and surveillance. The complex interactions brought by different relations between agents are a crucial reason that poses challenges to this task. Researchers have put much effort into designing a system using rule-based or data-based models to extract and validate the patterns between pedestrian trajectories and these interactions, which has not been adequately addressed yet. Inspired by how humans perceive social interactions with different level of relations to themself, this work proposes the GrouP ConCeption (short for GPCC) model composed of the Group method, which categorizes nearby agents into either group members or non-group members based on a long-term distance kernel function, and the Conception module, which perceives both visual and acoustic information surrounding the target agent. Evaluated across multiple datasets, the GPCC model demonstrates significant improvements in trajectory prediction accuracy, validating its effectiveness in modeling both social and individual dynamics. The qualitative analysis also indicates that the GPCC framework successfully leverages grouping and perception cues human-like intuitively to validate the proposed model's explainability in pedestrian trajectory forecasting.

CVFeb 17, 2022
CSCNet: Contextual Semantic Consistency Network for Trajectory Prediction in Crowded Spaces

Beihao Xia, Conghao Wong, Qinmu Peng et al.

Trajectory prediction aims to predict the movement trend of the agents like pedestrians, bikers, vehicles. It is helpful to analyze and understand human activities in crowded spaces and widely applied in many areas such as surveillance video analysis and autonomous driving systems. Thanks to the success of deep learning, trajectory prediction has made significant progress. The current methods are dedicated to studying the agents' future trajectories under the social interaction and the sceneries' physical constraints. Moreover, how to deal with these factors still catches researchers' attention. However, they ignore the \textbf{Semantic Shift Phenomenon} when modeling these interactions in various prediction sceneries. There exist several kinds of semantic deviations inner or between social and physical interactions, which we call the "\textbf{Gap}". In this paper, we propose a \textbf{C}ontextual \textbf{S}emantic \textbf{C}onsistency \textbf{Net}work (\textbf{CSCNet}) to predict agents' future activities with powerful and efficient context constraints. We utilize a well-designed context-aware transfer to obtain the intermediate representations from the scene images and trajectories. Then we eliminate the differences between social and physical interactions by aligning activity semantics and scene semantics to cross the Gap. Experiments demonstrate that CSCNet performs better than most of the current methods quantitatively and qualitatively.

CVOct 14, 2021
View Vertically: A Hierarchical Network for Trajectory Prediction via Fourier Spectrums

Conghao Wong, Beihao Xia, Ziming Hong et al.

Understanding and forecasting future trajectories of agents are critical for behavior analysis, robot navigation, autonomous cars, and other related applications. Previous methods mostly treat trajectory prediction as time sequence generation. Different from them, this work studies agents' trajectories in a "vertical" view, i.e., modeling and forecasting trajectories from the spectral domain. Different frequency bands in the trajectory spectrums could hierarchically reflect agents' motion preferences at different scales. The low-frequency and high-frequency portions could represent their coarse motion trends and fine motion variations, respectively. Accordingly, we propose a hierarchical network V$^2$-Net, which contains two sub-networks, to hierarchically model and predict agents' trajectories with trajectory spectrums. The coarse-level keypoints estimation sub-network first predicts the "minimal" spectrums of agents' trajectories on several "key" frequency portions. Then the fine-level spectrum interpolation sub-network interpolates the spectrums to reconstruct the final predictions. Experimental results display the competitiveness and superiority of V$^2$-Net on both ETH-UCY benchmark and the Stanford Drone Dataset.

CVJul 2, 2021
MSN: Multi-Style Network for Trajectory Prediction

Conghao Wong, Beihao Xia, Qinmu Peng et al.

Trajectory prediction aims to forecast agents' possible future locations considering their observations along with the video context. It is strongly needed by many autonomous platforms like tracking, detection, robot navigation, and self-driving cars. Whether it is agents' internal personality factors, interactive behaviors with the neighborhood, or the influence of surroundings, they all impact agents' future planning. However, many previous methods model and predict agents' behaviors with the same strategy or feature distribution, making them challenging to make predictions with sufficient style differences. This paper proposes the Multi-Style Network (MSN), which utilizes style proposal and stylized prediction using two sub-networks, to provide multi-style predictions in a novel categorical way adaptively. The proposed network contains a series of style channels, and each channel is bound to a unique and specific behavior style. We use agents' end-point plannings and their interaction context as the basis for the behavior classification, so as to adaptively learn multiple diverse behavior styles through these channels. Then, we assume that the target agents may plan their future behaviors according to each of these categorized styles, thus utilizing different style channels to make predictions with significant style differences in parallel. Experiments show that the proposed MSN outperforms current state-of-the-art methods up to 10% quantitatively on two widely used datasets, and presents better multi-style characteristics qualitatively.

CVOct 8, 2020
BGM: Building a Dynamic Guidance Map without Visual Images for Trajectory Prediction

Beihao Xia, Conghao Wong, Heng Li et al.

Visual images usually contain the informative context of the environment, thereby helping to predict agents' behaviors. However, they hardly impose the dynamic effects on agents' actual behaviors due to the respectively fixed semantics. To solve this problem, we propose a deterministic model named BGM to construct a guidance map to represent the dynamic semantics, which circumvents to use visual images for each agent to reflect the difference of activities in different periods. We first record all agents' activities in the scene within a period close to the current to construct a guidance map and then feed it to a Context CNN to obtain their context features. We adopt a Historical Trajectory Encoder to extract the trajectory features and then combine them with the context feature as the input of the social energy based trajectory decoder, thus obtaining the prediction that meets the social rules. Experiments demonstrate that BGM achieves state-of-the-art prediction accuracy on the two widely used ETH and UCY datasets and handles more complex scenarios.

CVAug 21, 2020
CDE-GAN: Cooperative Dual Evolution Based Generative Adversarial Network

Shiming Chen, Wenjie Wang, Beihao Xia et al.

Generative adversarial networks (GANs) have been a popular deep generative model for real-world applications. Despite many recent efforts on GANs that have been contributed, mode collapse and instability of GANs are still open problems caused by their adversarial optimization difficulties. In this paper, motivated by the cooperative co-evolutionary algorithm, we propose a Cooperative Dual Evolution based Generative Adversarial Network (CDE-GAN) to circumvent these drawbacks. In essence, CDE-GAN incorporates dual evolution with respect to the generator(s) and discriminators into a unified evolutionary adversarial framework to conduct effective adversarial multi-objective optimization. Thus it exploits the complementary properties and injects dual mutation diversity into training to steadily diversify the estimated density in capturing multi-modes and improve generative performance. Specifically, CDE-GAN decomposes the complex adversarial optimization problem into two subproblems (generation and discrimination), and each subproblem is solved with a separated subpopulation (E-Generator} and E-Discriminators), evolved by its own evolutionary algorithm. Additionally, we further propose a Soft Mechanism to balance the trade-off between E-Generators and E-Discriminators to conduct steady training for CDE-GAN. Extensive experiments on one synthetic dataset and three real-world benchmark image datasets demonstrate that the proposed CDE-GAN achieves a competitive and superior performance in generating good quality and diverse samples over baselines. The code and more generated results are available at our project homepage: https://shiming-chen.github.io/CDE-GAN-website/CDE-GAN.html.

CVMar 13, 2020
A Spatial-Temporal Attentive Network with Spatial Continuity for Trajectory Prediction

Beihao Xia, Conghao Wang, Qinmu Peng et al.

It remains challenging to automatically predict the multi-agent trajectory due to multiple interactions including agent to agent interaction and scene to agent interaction. Although recent methods have achieved promising performance, most of them just consider spatial influence of the interactions and ignore the fact that temporal influence always accompanies spatial influence. Moreover, those methods based on scene information always require extra segmented scene images to generate multiple socially acceptable trajectories. To solve these limitations, we propose a novel model named spatial-temporal attentive network with spatial continuity (STAN-SC). First, spatial-temporal attention mechanism is presented to explore the most useful and important information. Second, we conduct a joint feature sequence based on the sequence and instant state information to make the generative trajectories keep spatial continuity. Experiments are performed on the two widely used ETH-UCY datasets and demonstrate that the proposed model achieves state-of-the-art prediction accuracy and handles more complex scenarios.