Ni Wang

CV
h-index: 6
10 papers
141 citations
Novelty: 46%
AI Score: 45

10 Papers

AI · Feb 10, 2023
Learning cooperative behaviours in adversarial multi-agent systems

Ni Wang, Gautham P. Das, Alan G. Millard

This work extends an existing virtual multi-agent platform called RoboSumo to create TripleSumo, a platform for investigating multi-agent cooperative behaviors in continuous action spaces, with physical contact in an adversarial environment. In this paper we investigate a scenario in which two agents, namely 'Bug' and 'Ant', must team up and push another agent, 'Spider', out of the arena. To tackle this goal, the newly added agent 'Bug' is trained during an ongoing match between 'Ant' and 'Spider'. 'Bug' must develop awareness of the other agents' actions, infer the strategy of both sides, and eventually learn an action policy to cooperate. The reinforcement learning algorithm Deep Deterministic Policy Gradient (DDPG) is implemented with a hybrid reward structure combining dense and sparse rewards. The cooperative behavior is quantitatively evaluated by the mean probability of winning the match and the mean number of steps needed to win.
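
A hybrid reward of the kind the abstract mentions can be sketched as below. This is only an illustration: the abstract does not give the reward terms, so the function name, the distance-based dense term, and all weights are assumptions, not the authors' design.

```python
import math


def hybrid_reward(opponent_xy, arena_radius, match_won, match_lost,
                  dense_weight=0.1, win_bonus=10.0, loss_penalty=-10.0):
    """Hypothetical hybrid reward: dense shaping term plus sparse terminal term."""
    # Dense term (assumed): grows as the opponent moves away from the
    # arena centre at (0, 0), capped once it reaches the arena edge.
    distance = math.hypot(opponent_xy[0], opponent_xy[1])
    dense = dense_weight * min(distance / arena_radius, 1.0)
    # Sparse term (assumed): non-zero only on the step where the match ends.
    if match_won:
        sparse = win_bonus
    elif match_lost:
        sparse = loss_penalty
    else:
        sparse = 0.0
    return dense + sparse
```

The dense term gives a learning signal on every step, while the sparse term ties the policy to the outcome that is actually evaluated (winning the match).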

CV · Jan 14
Hybrid guided variational autoencoder for visual place recognition

Ni Wang, Zihan You, Emre Neftci et al.

Autonomous agents such as cars, robots and drones need to precisely localize themselves in diverse environments, including GPS-denied indoor environments. One approach for precise localization is visual place recognition (VPR), which estimates the place of an image based on previously seen places. State-of-the-art VPR models require large amounts of memory, making them unwieldy for mobile deployment, while more compact models lack robustness and generalization capabilities. This work overcomes these limitations for robotics using a combination of event-based vision sensors and a novel event-based guided variational autoencoder (VAE). The encoder part of our model is based on a spiking neural network model which is compatible with power-efficient, low-latency neuromorphic hardware. The VAE successfully disentangles the visual features of 16 distinct places in our new indoor VPR dataset with a classification performance comparable to other state-of-the-art approaches, while also showing robust performance under various illumination conditions. When tested with novel visual inputs from unknown scenes, our model can distinguish between these places, which demonstrates a high generalization capability by learning the essential features of location. Our compact and robust guided VAE with generalization capabilities is a promising model for visual place recognition that can significantly enhance mobile robot navigation in known and unknown indoor environments.

CV · Sep 22, 2025 · Code
MVP: Motion Vector Propagation for Zero-Shot Video Object Detection

Binhua Huang, Ni Wang, Wendong Yao et al.

Running a large open-vocabulary (Open-vocab) detector on every video frame is accurate but expensive. We introduce a training-free pipeline that invokes OWLv2 only on fixed-interval keyframes and propagates detections to intermediate frames using compressed-domain motion vectors (MV). A simple 3x3 grid aggregation of motion vectors provides translation and uniform-scale updates, augmented with an area-growth check and an optional single-class switch. The method requires no labels, no fine-tuning, and uses the same prompt list for all open-vocabulary methods. On ILSVRC2015-VID (validation dataset), our approach (MVP) attains mAP@0.5=0.609 and mAP@[0.5:0.95]=0.316. At loose intersection-over-union (IoU) thresholds it remains close to framewise OWLv2-Large (0.747/0.721 at 0.2/0.3 versus 0.784/0.780), reflecting that coarse localization is largely preserved. Under the same keyframe schedule, MVP outperforms tracker-based propagation (MOSSE, KCF, CSRT) at mAP@0.5. A supervised reference (YOLOv12x) reaches 0.631 at mAP@0.5 but requires labeled training, whereas our method remains label-free and open-vocabulary. These results indicate that compressed-domain propagation is a practical way to reduce detector invocations while keeping strong zero-shot coverage in videos. Our code and models are available at https://github.com/microa/MVP.
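
The propagation step can be illustrated with a minimal sketch. It is not the code from the linked repository: the function name `propagate_box`, the `mv_at` callback (returning a motion vector at a pixel position), the least-squares scale fit, and the `max_area_growth` threshold are all assumptions made for illustration.

```python
def propagate_box(box, mv_at, max_area_growth=1.5):
    """Move a box (x1, y1, x2, y2) using motion sampled on a 3x3 grid.

    mv_at(x, y) is a hypothetical callback returning the (dx, dy) motion
    vector at pixel (x, y), e.g. looked up from decoded codec blocks.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = x2 - x1, y2 - y1
    # Centres of the 3x3 grid cells, as offsets from the box centre.
    offsets = (-1.0 / 3.0, 0.0, 1.0 / 3.0)
    samples = []
    for oy in offsets:
        for ox in offsets:
            rx, ry = ox * w, oy * h
            dx, dy = mv_at(cx + rx, cy + ry)
            samples.append((rx, ry, dx, dy))
    # Translation: mean motion over the nine cells.
    tx = sum(s[2] for s in samples) / 9.0
    ty = sum(s[3] for s in samples) / 9.0
    # Uniform scale: least-squares fit of v = t + (scale - 1) * r.
    num = sum(rx * (dx - tx) + ry * (dy - ty) for rx, ry, dx, dy in samples)
    den = sum(rx * rx + ry * ry for rx, ry, _, _ in samples)
    scale = 1.0 + num / den if den > 0 else 1.0
    # Area-growth guard (assumed form): drop implausible scale changes.
    if scale <= 0 or scale * scale > max_area_growth:
        scale = 1.0
    half_w, half_h = w * scale / 2.0, h * scale / 2.0
    ncx, ncy = cx + tx, cy + ty
    return (ncx - half_w, ncy - half_h, ncx + half_w, ncy + half_h)
```

With a constant motion field the box is only shifted. With a field that points away from the box centre, the fitted scale grows the box, unless the guard rejects the change.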

CV · Sep 21, 2025 · Code
MoCLIP-Lite: Efficient Video Recognition by Fusing CLIP with Motion Vectors

Binhua Huang, Ni Wang, Arjun Pakrashi et al.

Video action recognition is a fundamental task in computer vision, but state-of-the-art models are often computationally expensive and rely on extensive video pre-training. In parallel, large-scale vision-language models like Contrastive Language-Image Pre-training (CLIP) offer powerful zero-shot capabilities on static images, while motion vectors (MV) provide highly efficient temporal information directly from compressed video streams. To synergize the strengths of these paradigms, we propose MoCLIP-Lite, a simple yet powerful two-stream late fusion framework for efficient video recognition. Our approach combines features from a frozen CLIP image encoder with features from a lightweight, supervised network trained on raw MV. During fusion, both backbones are frozen, and only a tiny Multi-Layer Perceptron (MLP) head is trained, ensuring extreme efficiency. Through comprehensive experiments on the UCF101 dataset, our method achieves a remarkable 89.2% Top-1 accuracy, significantly outperforming strong zero-shot (65.0%) and MV-only (66.5%) baselines. Our work provides a new, highly efficient baseline for video understanding that effectively bridges the gap between large static models and dynamic, low-cost motion cues. Our code and models are available at https://github.com/microa/MoCLIP-Lite.
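
As a rough sketch of two-stream late fusion, the forward pass of a small MLP head over two precomputed feature vectors might look like this. The abstract does not say how the features are combined, so concatenation, the single hidden layer, and every name below are assumptions. The weights would be learned while both backbones stay frozen.

```python
def fusion_head(clip_feat, mv_feat, w1, b1, w2, b2):
    """Forward pass of a hypothetical one-hidden-layer MLP fusion head.

    clip_feat and mv_feat are feature vectors from the two frozen
    backbones. w1/b1 and w2/b2 are the only trainable parameters.
    """
    # Late fusion by concatenation (an assumption, not stated in the abstract).
    fused = list(clip_feat) + list(mv_feat)
    # Hidden layer with ReLU activation.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, fused)) + b)
              for row, b in zip(w1, b1)]
    # Output layer: one logit per action class.
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]
```

Because only the head's parameters are updated, training cost depends on the size of the head, not on either backbone.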

CV · Oct 16, 2024
MambaBEV: An efficient 3D detection model with Mamba2

Zihan You, Ni Wang, Hao Wang et al.

Accurate 3D object detection in autonomous driving relies on Bird's Eye View (BEV) perception and effective temporal fusion. However, existing fusion strategies based on convolutional layers or deformable self-attention struggle with global context modeling in BEV space, leading to lower accuracy for large objects. To address this, we introduce MambaBEV, a novel BEV-based 3D object detection model that leverages Mamba2, an advanced state space model (SSM) optimized for long-sequence processing. Our key contribution is TemporalMamba, a temporal fusion module that enhances global awareness by introducing a BEV feature discrete rearrangement mechanism tailored for Mamba's sequential processing. Additionally, we propose a Mamba-based DETR as the detection head to improve multi-object representation. Evaluations on the nuScenes dataset demonstrate that MambaBEV base achieves an NDS of 51.7% and an mAP of 42.7%. Furthermore, an end-to-end autonomous driving paradigm validates its effectiveness in motion forecasting and planning. Our results highlight the potential of SSMs in autonomous driving perception, particularly in enhancing global context understanding and large object detection.

AI · May 11, 2025
Embodied Intelligence: The Key to Unblocking Generalized Artificial Intelligence

Jinhao Jiang, Changlin Chen, Shile Feng et al.

The ultimate goal of artificial intelligence (AI) is to achieve Artificial General Intelligence (AGI). Embodied Artificial Intelligence (EAI), which involves intelligent systems with physical presence and real-time interaction with the environment, has emerged as a key research direction in pursuit of AGI. While advancements in deep learning, reinforcement learning, large-scale language models, and multimodal technologies have significantly contributed to the progress of EAI, most existing reviews focus on specific technologies or applications. A systematic overview, particularly one that explores the direct connection between EAI and AGI, remains scarce. This paper examines EAI as a foundational approach to AGI, systematically analyzing its four core modules: perception, intelligent decision-making, action, and feedback. We provide a detailed discussion of how each module contributes to the six core principles of AGI. Additionally, we discuss future trends, challenges, and research directions in EAI, emphasizing its potential as a cornerstone for AGI development. Our findings suggest that EAI's integration of dynamic learning and real-world interaction is essential for bridging the gap between narrow AI and AGI.

CV · Jun 23, 2024
Multi-Scale Temporal Difference Transformer for Video-Text Retrieval

Ni Wang, Dongliang Liao, Xing Xu

Many current methods for video-text retrieval are transformer-based. Most of them stack frame features, regard frames as tokens, and then use transformers for video temporal modeling. However, they commonly overlook the transformer's weak ability to model local temporal information. To tackle this problem, we propose a transformer variant named Multi-Scale Temporal Difference Transformer (MSTDT). MSTDT mainly addresses the limited ability of the traditional transformer to capture local temporal information. In addition, to better model detailed dynamic information, we make use of the difference feature between frames, which reflects the dynamic movement of a video. We extract the inter-frame difference feature and integrate the difference and frame features with the multi-scale temporal transformer. Overall, our proposed MSTDT consists of a short-term multi-scale temporal difference transformer and a long-term temporal transformer. The former focuses on modeling local temporal information, while the latter aims at modeling global temporal information. Finally, we propose a new loss to narrow the distance between similar samples. Extensive experiments show that a backbone such as CLIP combined with MSTDT attains a new state-of-the-art result.
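
The inter-frame difference feature can be illustrated in its simplest form: subtract each frame's feature vector from the next one. The function name and the choice of adjacent-frame subtraction are assumptions, since the abstract does not define the exact operation.

```python
def frame_differences(frame_features):
    """Return element-wise differences between consecutive frame features.

    frame_features is a list of equal-length feature vectors, one per
    frame. The result has one fewer entry than the input.
    """
    return [
        [nxt_value - prev_value for prev_value, nxt_value in zip(prev, nxt)]
        for prev, nxt in zip(frame_features, frame_features[1:])
    ]
```

For example, `frame_differences([[1, 2], [3, 5], [3, 4]])` yields `[[2, 3], [0, -1]]`. Entries far from zero mark the places where the content changes between frames.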

RO · Jun 17, 2021
Towards bio-inspired unsupervised representation learning for indoor aerial navigation

Ni Wang, Ozan Catal, Tim Verbelen et al.

Aerial navigation in GPS-denied indoor environments is still an open challenge. Drones can perceive the environment from a richer set of viewpoints, while having more stringent compute and energy constraints than other autonomous platforms. To tackle this problem, this research presents a biologically inspired deep-learning algorithm for simultaneous localization and mapping (SLAM) and its application in a drone navigation system. We propose an unsupervised representation learning method that yields low-dimensional latent state descriptors, mitigates the sensitivity to perceptual aliasing, and works on power-efficient, embedded hardware. The designed algorithm is evaluated on a dataset collected in an indoor warehouse environment, and initial results show the feasibility of robust indoor aerial navigation.

CL · Apr 10, 2020
A Natural Language Processing Pipeline of Chinese Free-text Radiology Reports for Liver Cancer Diagnosis

Honglei Liu, Yan Xu, Zhiqiang Zhang et al.

Despite the rapid development of natural language processing (NLP) implementation in electronic medical records (EMRs), processing Chinese EMRs remains challenging due to the limited corpus and specific grammatical characteristics, especially for radiology reports. In this study, we designed an NLP pipeline for the direct extraction of clinically relevant features from Chinese radiology reports, which is the first key step in computer-aided radiologic diagnosis. The pipeline comprised named entity recognition, synonym normalization, and relationship extraction to derive the radiological features composed of one or more terms. In named entity recognition, we incorporated a lexicon into the deep learning model bidirectional long short-term memory-conditional random field (BiLSTM-CRF), and the model achieved an F1 score of 93.00%. With the extracted radiological features, least absolute shrinkage and selection operator and machine learning methods (support vector machine, random forest, decision tree, and logistic regression) were used to build the classifiers for liver cancer prediction. Random forest had the highest predictive performance in liver cancer diagnosis (F1 score 86.97%, precision 87.71%, and recall 86.25%). This work was a comprehensive NLP study focusing on Chinese radiology reports and the application of NLP in cancer risk prediction. The proposed NLP pipeline for radiological feature extraction could be easily implemented in other kinds of Chinese clinical texts and other disease predictive tasks.
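
As a quick consistency check, the reported F1 score for random forest can be recomputed from the reported precision and recall using the standard harmonic-mean definition. This assumes the three figures describe the same single class and are not averaged across classes, which the abstract does not state.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
print(round(f1_score(87.71, 86.25), 2))  # prints 86.97
```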

LG · Dec 11, 2018
Deep Density-based Image Clustering

Yazhou Ren, Ni Wang, Mingxia Li et al.

Recently, deep clustering, which is able to perform feature learning that favors clustering tasks via deep neural networks, has achieved remarkable performance in image clustering applications. However, the existing deep clustering algorithms generally need the number of clusters in advance, which is usually unknown in real-world tasks. In addition, the initial cluster centers in the learned feature space are generated by k-means. This only works well on spherical clusters and probably leads to unstable clustering results. In this paper, we propose a two-stage deep density-based image clustering (DDC) framework to address these issues. The first stage is to train a deep convolutional autoencoder (CAE) to extract low-dimensional feature representations from high-dimensional image data, and then apply t-SNE to further reduce the data to a 2-dimensional space favoring density-based clustering algorithms. The second stage is to apply the developed density-based clustering technique on the 2-dimensional embedded data to automatically recognize an appropriate number of clusters with arbitrary shapes. Concretely, a number of local clusters are generated to capture the local structures of clusters, and then are merged via their density relationship to form the final clustering result. Experiments demonstrate that the proposed DDC achieves comparable or even better clustering performance than state-of-the-art deep clustering methods, even though the number of clusters is not given.