Junfeng Wu

CV
h-index: 19
32 papers
634 citations
Novelty: 52%
AI Score: 60

32 Papers

RO · Nov 1, 2025 · Code
SonarSweep: Fusing Sonar and Vision for Robust 3D Reconstruction via Plane Sweeping

Lingpeng Chen, Jiakun Tang, Apple Pui-Yi Chui et al. · apple-ml

Accurate 3D reconstruction in visually-degraded underwater environments remains a formidable challenge. Single-modality approaches are insufficient: vision-based methods fail due to poor visibility and geometric constraints, while sonar is crippled by inherent elevation ambiguity and low resolution. Consequently, prior fusion techniques rely on heuristics and flawed geometric assumptions, leading to significant artifacts and an inability to model complex scenes. In this paper, we introduce SonarSweep, a novel, end-to-end deep learning framework that overcomes these limitations by adapting the principled plane sweep algorithm for cross-modal fusion between sonar and visual data. Extensive experiments in both high-fidelity simulation and real-world environments demonstrate that SonarSweep consistently generates dense and accurate depth maps, significantly outperforming state-of-the-art methods across challenging conditions, particularly in high turbidity. To foster further research, we will publicly release our code and a novel dataset featuring synchronized stereo-camera and sonar data, the first of its kind.

CV · Jul 21, 2022
In Defense of Online Models for Video Instance Segmentation

Junfeng Wu, Qihao Liu, Yi Jiang et al.

In recent years, video instance segmentation (VIS) has been largely advanced by offline models, while online models gradually attracted less attention possibly due to their inferior performance. However, online methods have their inherent advantage in handling long video sequences and ongoing videos while offline models fail due to the limit of computational resources. Therefore, it would be highly desirable if online models can achieve comparable or even better performance than offline models. By dissecting current online models and offline models, we demonstrate that the main cause of the performance gap is the error-prone association between frames caused by the similar appearance among different instances in the feature space. Observing this, we propose an online framework based on contrastive learning that is able to learn more discriminative instance embeddings for association and fully exploit history information for stability. Despite its simplicity, our method outperforms all online and offline methods on three benchmarks. Specifically, we achieve 49.5 AP on YouTube-VIS 2019, a significant improvement of 13.2 AP and 2.1 AP over the prior online and offline art, respectively. Moreover, we achieve 30.2 AP on OVIS, a more challenging dataset with significant crowding and occlusions, surpassing the prior art by 14.8 AP. The proposed method won first place in the video instance segmentation track of the 4th Large-scale Video Object Segmentation Challenge (CVPR2022). We hope the simplicity and effectiveness of our method, as well as our insight into current methods, could shed light on the exploration of VIS models.

SY · Apr 29, 2016
Infinite Horizon Optimal Transmission Power Control for Remote State Estimation over Fading Channels

Xiaoqiang Ren, Junfeng Wu, Karl H. Johansson et al.

Jointly optimal transmission power control and remote estimation over an infinite horizon is studied. A sensor observes a dynamic process and sends its observations to a remote estimator over a wireless fading channel characterized by a time-homogeneous Markov chain. The successful transmission probability depends on both the channel gains and the transmission power used by the sensor. The transmission power control rule and the remote estimator should be jointly designed, aiming to minimize an infinite-horizon cost consisting of the power usage and the remote estimation error. A first question one may ask is: Does this joint optimization problem have a solution? We formulate the joint optimization problem as an average cost belief-state Markov decision process and answer the question by proving that there exists an optimal deterministic and stationary policy. We then show that when the monitored dynamic process is scalar, the optimal remote estimates depend only on the most recently received sensor observation, and the optimal transmission power is symmetric and monotonically increasing with respect to the innovation error.

RO · May 27 · Code
Provably Guaranteed Polytopic Uncertainty Quantification for SLAM

Guangyang Zeng, Yulong Gao, Yuan Shen et al.

In safety-critical robotics applications, guaranteed and practical uncertainty quantification (UQ) in perception is vital. Many existing works either offer no formal containment guarantee, rely on restrictive modeling assumptions, or focus only on pose estimation rather than a complete SLAM pipeline. This paper presents provably guaranteed UQ algorithms for 3D-3D landmark-based SLAM. The algorithms consist of three basic UQ modules: forward UQ for mapping, backward UQ for pose tracking, and pose compound. Each module produces a certified uncertainty set; when the input uncertainty bounds are deterministic, the output sets inherit deterministic guarantees, i.e., they provably contain the true poses and landmarks. Specifically, we use polytopes to represent uncertainty sets, enabling tractable computations and a unified treatment of pose uncertainty. To enhance the algorithms' practical usability, we incorporate conformal prediction to calibrate measurement uncertainty from data with prescribed probability. Simulations and experiments demonstrate that the proposed algorithms provide both strong theoretical guarantees and practical usability. The code is open-sourced at https://github.com/LIAS-CUHKSZ/Polytopic-SLAM-Uncertainty-Quantification.
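The calibration step mentioned in this abstract can be illustrated with generic split conformal prediction. The sketch below is not taken from the paper or its repository: the function name `conformal_bound` and the use of scalar residuals are assumptions made for illustration. Given n calibration residuals, it returns the order statistic of rank ceil((n + 1)(1 - alpha)), which bounds a fresh exchangeable residual with probability at least 1 - alpha.

```python
import math

def conformal_bound(residuals, alpha):
    """Split-conformal bound on a scalar measurement residual.

    Hypothetical helper: returns a value that a new exchangeable residual
    stays below with probability at least 1 - alpha.
    """
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))   # finite-sample corrected rank
    if rank > n:
        return math.inf                        # too few calibration samples
    return sorted(residuals)[rank - 1]         # rank-th smallest residual

calibration = [0.1, 0.4, 0.2, 0.5, 0.3, 0.9, 0.6, 0.8, 0.7]
bound = conformal_bound(calibration, alpha=0.25)
```

A bound obtained this way could serve as the input measurement bound of a set-based UQ module, in which case the containment guarantee would hold with the prescribed probability rather than deterministically.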

SY · Dec 28, 2016
On Stochastic Sensor Network Scheduling for Multiple Processes

Duo Han, Junfeng Wu, Yilin Mo et al.

We consider the problem of multiple sensor scheduling for remote state estimation of multiple processes over a shared link. In this problem, a set of sensors monitor mutually independent dynamical systems in parallel, but only one sensor can access the shared channel at each time to transmit the data packet to the estimator. We propose a stochastic event-based sensor schedule in which each sensor makes transmission decisions based on both channel accessibility and distributed event-triggering conditions. The corresponding minimum mean squared error (MMSE) estimator is explicitly given. Considering information patterns accessed by sensor schedulers, time-based ones can be treated as a special case of the proposed one. By utilizing real-time information, the proposed schedule outperforms the time-based ones in terms of the estimation quality. By solving a Markov decision process (MDP) problem with an average cost criterion, we can find optimal parameters for the proposed schedule. For practical use, a greedy algorithm with rather low computational complexity is devised for parameter design. We also provide a method to quantify the performance gap between the schedule optimized via MDP and any other schedules.

CV · Mar 14, 2023
InstMove: Instance Motion for Object-centric Video Segmentation

Qihao Liu, Junfeng Wu, Yi Jiang et al.

Despite significant efforts, cutting-edge video segmentation methods still remain sensitive to occlusion and rapid movement, due to their reliance on the appearance of objects in the form of object embeddings, which are vulnerable to these disturbances. A common solution is to use optical flow to provide motion information, but essentially it only considers pixel-level motion, which still relies on appearance similarity and hence is often inaccurate under occlusion and fast movement. In this work, we study the instance-level motion and present InstMove, which stands for Instance Motion for Object-centric Video Segmentation. In comparison to pixel-wise motion, InstMove mainly relies on instance-level motion information that is free from image feature embeddings, and features physical interpretations, making it more accurate and robust toward occlusion and fast-moving objects. To better fit in with the video segmentation tasks, InstMove uses instance masks to model the physical presence of an object and learns the dynamic model through a memory network to predict its position and shape in the next frame. With only a few lines of code, InstMove can be integrated into current SOTA methods for three different video segmentation tasks and boost their performance. Specifically, we improve the previous arts by 1.5 AP on the OVIS dataset, which features heavy occlusions, and 4.9 AP on the YouTubeVIS-Long dataset, which mainly contains fast-moving objects. These results suggest that instance-level motion is robust and accurate, and hence serves as a powerful solution in complex scenarios for object-centric video segmentation.

SY · Nov 25, 2017
Kalman Filtering over Fading Channels: Zero-One Laws and Almost Sure Stabilities

Junfeng Wu, Guodong Shi, Brian D. O. Anderson et al.

In this paper, we investigate probabilistic stability of Kalman filtering over fading channels modeled by $\ast$-mixing random processes, where channel fading is allowed to generate non-stationary packet dropouts with temporal and/or spatial correlations. Upper/lower almost sure (a.s.) stabilities and absolutely upper/lower a.s. stabilities are defined for characterizing the sample-path behaviors of the Kalman filtering. We prove that both upper and lower a.s. stabilities follow a zero-one law, i.e., these stabilities must happen with a probability either zero or one, and when the filtering system is one-step observable, the absolutely upper and lower a.s. stabilities can also be interpreted using a zero-one law. We establish general stability conditions for (absolutely) upper and lower a.s. stabilities. In particular, with one-step observability, we show the equivalence between absolutely a.s. stabilities and a.s. ones, and necessary and sufficient conditions in terms of packet arrival rate are derived; for the so-called non-degenerate systems, we also manage to give a necessary and sufficient condition for upper a.s. stability.
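To make the setting concrete, here is a minimal scalar sketch of the Kalman error-covariance recursion under packet dropouts. It is illustrative only: the parameter values, the function names `covariance_step` and `simulate`, and the i.i.d. Bernoulli arrival model are simplifying assumptions, whereas the paper treats far more general correlated, non-stationary dropout processes.

```python
import random

def covariance_step(p, arrived, a=1.2, c=1.0, q=1.0, r=1.0):
    """One step of the scalar error-covariance recursion.

    a, c: system and measurement gains; q, r: process and measurement
    noise variances (all values here are assumed for illustration).
    """
    predicted = a * a * p + q                  # time update, always applied
    if not arrived:
        return predicted                       # packet dropped: no correction
    # Packet received: subtract the standard measurement-update term.
    return predicted - (a * c * p) ** 2 / (c * c * p + r)

def simulate(arrival_prob, steps, p0=1.0, seed=0):
    """Sample path of the covariance under i.i.d. Bernoulli packet arrivals."""
    rng = random.Random(seed)
    p = p0
    path = [p]
    for _ in range(steps):
        p = covariance_step(p, rng.random() < arrival_prob)
        path.append(p)
    return path

path = simulate(arrival_prob=0.7, steps=50)
```

Each sample path of `simulate` is one realization of the covariance sequence; the almost sure stability notions in the abstract concern the behavior of such individual paths rather than their average.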

SY · Feb 28, 2019
Distributed Parameter Estimation Under Event-triggered Communications

Xingkang He, Qian Liu, Junfeng Wu et al.

In this paper, we study a distributed parameter estimation problem with an asynchronous communication protocol over multi-agent systems. Different from traditional time-driven communication schemes, in this work, data can be transmitted between agents intermittently rather than in a steady stream. First, we propose a recursive distributed estimator based on an event-triggered communication scheme, through which each agent can decide whether the current estimate is sent out to its neighbors or not. With this scheme, considerable communications between agents can be effectively reduced. Then, under mild conditions including a collective observability, we provide a design principle of triggering thresholds to guarantee the asymptotic unbiasedness and strong consistency. Furthermore, under certain conditions, we prove that, with probability one, for every agent the time interval between two successive triggered instants goes to infinity as time goes to infinity. Finally, we provide a numerical simulation to validate the theoretical results of this paper.
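The idea of event-triggered communication can be sketched in a few lines. The rule below, which sends only when the current estimate has drifted from the last transmitted one by more than a fixed threshold, is a common textbook form and an assumption here; the paper's actual triggering condition and threshold design are not given in the abstract, and `event_triggered_sends` is a hypothetical name.

```python
def event_triggered_sends(estimates, threshold):
    """Return the time indices at which an agent would broadcast its estimate.

    Assumed rule: transmit at the first step, then only when the estimate
    differs from the last transmitted value by more than `threshold`.
    """
    sent_times = []
    last_sent = None
    for t, x in enumerate(estimates):
        if last_sent is None or abs(x - last_sent) > threshold:
            sent_times.append(t)
            last_sent = x          # neighbors now hold this value
    return sent_times

# Seven local estimates, of which only three are transmitted.
times = event_triggered_sends([0.0, 0.05, 0.3, 0.32, 0.9, 0.91, 0.92], threshold=0.2)
```

As the estimate settles, the gaps between triggered instants grow, which is the qualitative behavior the abstract describes for successive triggering times.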

CV · Sep 13, 2022
CPnP: Consistent Pose Estimator for Perspective-n-Point Problem with Bias Elimination

Guangyang Zeng, Shiyu Chen, Biqiang Mu et al.

The Perspective-n-Point (PnP) problem has been widely studied in both computer vision and photogrammetry societies. With the development of feature extraction techniques, a large number of feature points might be available in a single shot. It is promising to devise a consistent estimator, i.e., one whose estimate converges to the true camera pose as the number of points increases. To this end, we propose a consistent PnP solver, named CPnP, with bias elimination. Specifically, linear equations are constructed from the original projection model via measurement model modification and variable elimination, based on which a closed-form least-squares solution is obtained. We then analyze and subtract the asymptotic bias of this solution, resulting in a consistent estimate. Additionally, Gauss-Newton (GN) iterations are executed to refine the consistent solution. Our proposed estimator is computationally efficient: it has $O(n)$ computational complexity. Experimental tests on both synthetic data and real images show that our proposed estimator is superior to some well-known ones for images with dense visual features, in terms of estimation precision and computing time.

SY · Feb 19, 2019
Dynamical Privacy in Distributed Computing -- Part I: Privacy Loss and PPSC Mechanism

Yang Liu, Junfeng Wu, Ian R. Manchester et al.

A distributed computing protocol consists of three components: (i) Data Localization: a network-wide dataset is decomposed into local datasets separately preserved at a network of nodes; (ii) Node Communication: the nodes hold individual dynamical states and communicate with the neighbors about these dynamical states; (iii) Local Computation: state recursions are computed at each individual node. Information about the local datasets enters the computation process through the node-to-node communication and the local computations, which may be leaked to dynamics eavesdroppers having access to global or local node states. In this paper, we systematically investigate these potential computational privacy risks in distributed computing protocols in the form of structured system identification, and then propose and thoroughly analyze a Privacy-Preserving-Summation-Consistent (PPSC) mechanism as a generic privacy encryption subroutine for consensus-based distributed computations. The central idea is that the consensus manifold is where we can both hide node privacy and achieve computational accuracy. In this first part of the paper, we demonstrate the computational privacy risks in distributed algorithms against dynamics eavesdroppers and particularly in distributed linear equation solvers, and then propose the PPSC mechanism and illustrate its usefulness.

SY · Feb 19, 2019
Dynamical Privacy in Distributed Computing -- Part II: PPSC Gossip Algorithms

Yang Liu, Junfeng Wu, Ian Manchester et al.

In the first part of the paper, we have studied the computational privacy risks in distributed computing protocols against local or global dynamics eavesdroppers, and proposed a Privacy-Preserving-Summation-Consistent (PPSC) mechanism as a generic privacy encryption subroutine for consensus-based distributed computations. In this part of the paper, we show that the conventional deterministic and random gossip algorithms can be used to realize the PPSC mechanism over a given network. At each time step, a node is selected to interact with one of its neighbors via deterministic or random gossiping. The selected node generates a random number as its new state, and sends the difference between its current state and that random number to the neighbor; then the neighbor updates its state by adding the received value to its current state. We establish concrete privacy-preservation conditions by proving the impossibility of reconstructing the network input from the output of the gossip-based PPSC mechanism against eavesdroppers with full network knowledge, and by showing that the PPSC mechanism can achieve differential privacy at arbitrary privacy levels. The convergence is characterized explicitly and analytically for both deterministic and randomized gossiping, and is essentially achieved in a finite number of steps. Additionally, we illustrate that the proposed algorithms can be easily made adaptive in real-world applications by making real-time trade-offs between resilience against node dropout or communication failure and privacy preservation capabilities.
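The gossip update described in this abstract can be sketched directly. In the sketch below, the Gaussian distribution of the random number, the uniform edge selection, and the names `ppsc_gossip_step` and `run_gossip` are assumptions for illustration; the abstract only specifies the update itself. The point to observe is that every step changes individual states while leaving the network sum unchanged.

```python
import random

def ppsc_gossip_step(states, sender, receiver, rng):
    """One gossip step following the update described in the abstract."""
    new_state = rng.gauss(0.0, 1.0)        # random number (distribution assumed)
    message = states[sender] - new_state   # difference sent to the neighbor
    states[sender] = new_state             # sender adopts the random number
    states[receiver] += message            # neighbor adds the received value

def run_gossip(initial_states, edges, steps, seed=0):
    """Random gossiping over `edges`; returns the final node states."""
    rng = random.Random(seed)
    states = list(initial_states)
    for _ in range(steps):
        i, j = rng.choice(edges)           # pick a link uniformly (assumed)
        if rng.random() < 0.5:
            i, j = j, i                    # either endpoint may be the sender
        ppsc_gossip_step(states, i, j, rng)
    return states

initial = [1.0, 2.0, 3.0, 4.0]
final = run_gossip(initial, edges=[(0, 1), (1, 2), (2, 3)], steps=20)
```

Because the sender's loss equals the receiver's gain at every step, a consensus algorithm run on `final` would reach the same average as one run on `initial`, which is the summation-consistency property the mechanism is named after.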

CV · Nov 18, 2022
The Runner-up Solution for YouTube-VIS Long Video Challenge 2022

Junfeng Wu, Yi Jiang, Qihao Liu et al.

This technical report describes our 2nd-place solution for the ECCV 2022 YouTube-VIS Long Video Challenge. We adopt the previously proposed online video instance segmentation method IDOL for this challenge. In addition, we use pseudo labels to further help contrastive learning, so as to obtain more temporally consistent instance embedding to improve tracking performance between frames. The proposed method obtains 40.2 AP on the YouTube-VIS 2022 long video dataset and was ranked second place in this challenge. We hope our simple and effective method could benefit further research.

CV · Feb 27, 2025 · Code
UniTok: A Unified Tokenizer for Visual Generation and Understanding

Chuofan Ma, Yi Jiang, Junfeng Wu et al.

Visual generative and understanding models typically rely on distinct tokenizers to process images, presenting a key challenge for unifying them within a single framework. Recent studies attempt to address this by connecting the training of VQVAE (for autoregressive generation) and CLIP (for understanding) to build a unified tokenizer. However, directly combining these training objectives has been observed to cause severe loss conflicts. In this paper, we show that reconstruction and semantic supervision do not inherently conflict. Instead, the underlying bottleneck stems from the limited representational capacity of the discrete token space. Building on these insights, we introduce UniTok, a unified tokenizer featuring a novel multi-codebook quantization mechanism that effectively scales up the vocabulary size and bottleneck dimension. In terms of final performance, UniTok sets a new record of 0.38 rFID and 78.6% zero-shot accuracy on ImageNet. Besides, UniTok can be seamlessly integrated into MLLMs to unlock native visual generation capability, without compromising the understanding performance. Additionally, we show that UniTok favors cfg-free generation, reducing gFID from 14.6 to 2.5 on the ImageNet 256$\times$256 benchmark. GitHub: https://github.com/FoundationVision/UniTok.
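One way to picture multi-codebook quantization is to split a latent vector into equal chunks and quantize each chunk against its own small codebook. The sketch below assumes that reading; the chunking scheme, the squared-distance nearest-neighbor lookup, and the names `nearest_code` and `multi_codebook_quantize` are illustrative and not taken from the UniTok code.

```python
def nearest_code(chunk, codebook):
    """Index of the codebook entry closest to `chunk` in squared distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((x - y) ** 2 for x, y in zip(chunk, codebook[k])),
    )

def multi_codebook_quantize(vector, codebooks):
    """Quantize each equal-sized chunk of `vector` with its own codebook."""
    size = len(vector) // len(codebooks)
    indices, quantized = [], []
    for i, codebook in enumerate(codebooks):
        chunk = vector[i * size:(i + 1) * size]
        k = nearest_code(chunk, codebook)
        indices.append(k)                 # one discrete index per chunk
        quantized.extend(codebook[k])     # concatenate the chosen entries
    return indices, quantized

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # codebook for the first chunk
    [[0.0, 0.0], [2.0, 2.0]],   # codebook for the second chunk
]
indices, quantized = multi_codebook_quantize([0.9, 1.2, 0.1, -0.2], codebooks)
```

With n codebooks of K entries each, the number of distinct quantized vectors is K to the power n (here 2 x 2 = 4), so the effective vocabulary grows multiplicatively while each lookup stays small.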

CV · Jul 23, 2024
PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects

Junyi Li, Junfeng Wu, Weizhi Zhao et al.

We present PartGLEE, a part-level foundation model for locating and identifying both objects and parts in images. Through a unified framework, PartGLEE accomplishes detection, segmentation, and grounding of instances at any granularity in the open world scenario. Specifically, we propose a Q-Former to construct the hierarchical relationship between objects and parts, parsing every object into corresponding semantic parts. By incorporating a large amount of object-level data, the hierarchical relationships can be extended, enabling PartGLEE to recognize a rich variety of parts. We conduct comprehensive studies to validate the effectiveness of our method: PartGLEE achieves state-of-the-art performance across various part-level tasks and obtains competitive results on object-level tasks. The proposed PartGLEE significantly enhances hierarchical modeling capabilities and part-level perception over our previous GLEE model. Further analysis indicates that the hierarchical cognitive ability of PartGLEE is able to facilitate a detailed comprehension in images for mLLMs. The model and code will be released at https://provencestar.github.io/PartGLEE-Vision/ .

CV · Dec 5, 2024 · Code
Liquid: Language Models are Scalable and Unified Multi-modal Generators

Junfeng Wu, Yi Jiang, Chuofan Ma et al.

We present Liquid, an auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation by tokenizing images into discrete codes and learning these code embeddings alongside text tokens within a shared feature space for both vision and language. Unlike previous multimodal large language models (MLLMs), Liquid achieves this integration using a single large language model (LLM), eliminating the need for external pretrained visual embeddings such as CLIP. For the first time, Liquid uncovers a scaling law: the performance drop unavoidably brought by the unified training of visual and language tasks diminishes as the model size increases. Furthermore, the unified token space enables visual generation and comprehension tasks to mutually enhance each other, effectively removing the typical interference seen in earlier models. We show that existing LLMs can serve as strong foundations for Liquid, saving 100x in training costs while outperforming Chameleon in multimodal capabilities and maintaining language performance comparable to mainstream LLMs like LLAMA2. Liquid also outperforms models like SD v2.1 and SD-XL (FID of 5.47 on MJHQ-30K), excelling in both vision-language and text-only tasks. This work demonstrates that LLMs such as Qwen2.5 and GEMMA2 are powerful multimodal generators, offering a scalable solution for enhancing both vision-language understanding and generation. The code and models will be released at https://github.com/FoundationVision/Liquid.

CR · Sep 3, 2024
$S^2$NeRF: Privacy-preserving Training Framework for NeRF

Bokang Zhang, Yanglin Zhang, Zhikun Zhang et al.

Neural Radiance Fields (NeRF) have revolutionized 3D computer vision and graphics, facilitating novel view synthesis and influencing sectors like extended reality and e-commerce. However, NeRF's dependence on extensive data collection, including sensitive scene image data, introduces significant privacy risks when users upload this data for model training. To address this concern, we first propose SplitNeRF, a training framework that incorporates split learning (SL) techniques to enable privacy-preserving collaborative model training between clients and servers without sharing local data. Despite its benefits, we identify vulnerabilities in SplitNeRF by developing two attack methods, Surrogate Model Attack and Scene-aided Surrogate Model Attack, which exploit the shared gradient data and a few leaked scene images to reconstruct private scene information. To counter these threats, we introduce $S^2$NeRF, secure SplitNeRF that integrates effective defense mechanisms. By introducing decaying noise related to the gradient norm into the shared gradient information, $S^2$NeRF preserves privacy while maintaining a high utility of the NeRF model. Our extensive evaluations across multiple datasets demonstrate the effectiveness of $S^2$NeRF against privacy breaches, confirming its viability for secure NeRF training in sensitive applications.

CV · Dec 15, 2021 · Code
SeqFormer: Sequential Transformer for Video Instance Segmentation

Junfeng Wu, Yi Jiang, Song Bai et al.

In this work, we present SeqFormer for video instance segmentation. SeqFormer follows the principle of vision transformer that models instance relationships among video frames. Nevertheless, we observe that a stand-alone instance query suffices for capturing a time sequence of instances in a video, but attention mechanisms should be applied to each frame independently. To achieve this, SeqFormer locates an instance in each frame and aggregates temporal information to learn a powerful representation of a video-level instance, which is used to predict the mask sequences on each frame dynamically. Instance tracking is achieved naturally without tracking branches or post-processing. On YouTube-VIS, SeqFormer achieves 47.4 AP with a ResNet-50 backbone and 49.0 AP with a ResNet-101 backbone without bells and whistles. Such achievement significantly exceeds the previous state-of-the-art performance by 4.6 and 4.4, respectively. In addition, integrated with the recently-proposed Swin transformer, SeqFormer achieves a much higher AP of 59.3. We hope SeqFormer could be a strong baseline that fosters future research in video instance segmentation, and in the meantime, advances this field with a more robust, accurate, neat model. The code is available at https://github.com/wjf5203/SeqFormer.

CV · May 13, 2020 · Code
Super-Resolution Domain Adaptation Networks for Semantic Segmentation via Pixel and Output Level Aligning

Junfeng Wu, Zhenjie Tang, Congan Xu et al.

Recently, Unsupervised Domain Adaptation (UDA) has attracted increasing attention to address the domain shift problem in the semantic segmentation task. Although previous UDA methods have achieved promising performance, they still suffer from the distribution gaps between source and target domains, especially the resolution discrepancy in remote sensing images. To address this problem, this paper designs a novel end-to-end semantic segmentation network, namely Super-Resolution Domain Adaptation Network (SRDA-Net). SRDA-Net can simultaneously achieve the super-resolution task and the domain adaptation task, thus satisfying the requirement of semantic segmentation for remote sensing images, which usually involve images of various resolutions. The proposed SRDA-Net includes three parts: a Super-Resolution and Segmentation (SRS) model which focuses on recovering the high-resolution image and predicting the segmentation map, a Pixel-level Domain Classifier (PDC) for determining which domain the pixel belongs to, and an Output-space Domain Classifier (ODC) for distinguishing which domain the pixel contribution is from. By jointly optimizing SRS with the two classifiers, the proposed method not only eliminates the resolution difference between source and target domains, but also improves the performance of the semantic segmentation task. Experimental results on two remote sensing datasets with different resolutions demonstrate that SRDA-Net performs favorably against some state-of-the-art methods in terms of accuracy and visual quality. Code and models are available at https://github.com/tangzhenjie/SRDA-Net.

SI · Dec 1, 2025
Social Media Data Mining of Human Behaviour during Bushfire Evacuation

Junfeng Wu, Xiangmin Zhou, Erica Kuligowski et al.

Traditional data sources on bushfire evacuation behaviour, such as quantitative surveys and manual observations, have severe limitations. Mining social media data related to bushfire evacuations promises to close this gap by allowing the collection and processing of a large amount of behavioural data, which are low-cost, accurate, possibly including location information and rich contextual information. However, social media data have many limitations, such as being scattered, incomplete, informal, etc. Together, these limitations represent several challenges to their usefulness to better understand bushfire evacuation. To overcome these challenges and provide guidance on which and how social media data can be used, this scoping review of the literature reports on recent advances in relevant data mining techniques. In addition, future applications and open problems are discussed. We envision future applications such as evacuation model calibration and validation, emergency communication, personalised evacuation training, and resource allocation for evacuation preparedness. We identify open problems such as data quality, bias and representativeness, geolocation accuracy, contextual understanding, crisis-specific lexicon and semantics, and multimodal data interpretation.

CV · Dec 14, 2023
General Object Foundation Model for Images and Videos at Scale

Junfeng Wu, Yi Jiang, Qihao Liu et al.

We present GLEE in this work, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open world scenario for various object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data sources with varying supervision levels to formulate general object representations, excelling in zero-shot transfer to new data and tasks. Specifically, we employ an image encoder, text encoder, and visual prompter to handle multi-modal inputs, enabling it to simultaneously solve various object-centric downstream tasks while maintaining state-of-the-art performance. Demonstrated through extensive training on over five million images from diverse benchmarks, GLEE exhibits remarkable versatility and improved generalization performance, efficiently tackling downstream tasks without the need for task-specific adaptation. By integrating large volumes of automatically labeled data, we further enhance its zero-shot generalization capabilities. Additionally, GLEE is capable of being integrated into Large Language Models, serving as a foundational model to provide universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step in the development of efficient visual foundation models for AGI systems. The model and code will be released at https://glee-vision.github.io .

LG · Aug 30, 2022
Graph Distance Neural Networks for Predicting Multiple Drug Interactions

Haifan Zhou, Wenjing Zhou, Junfeng Wu

Since multidrug combination is widely applied, the accurate prediction of drug-drug interaction (DDI) is becoming more and more critical. In our method, we use a graph to represent drug-drug interactions: nodes represent drugs and edges represent drug-drug interactions. Based on this representation, we convert the prediction of DDI into a link prediction problem, utilizing known drug node characteristics and DDI types to predict unknown DDI types. This work proposes a Graph Distance Neural Network (GDNN) to predict drug-drug interactions. Firstly, GDNN generates initial features for nodes via the target point method, fully including the distance information in the graph. Secondly, GDNN adopts an improved message passing framework to better generate the embedded expression of each drug node, comprehensively considering the characteristics of nodes and edges synchronously. Thirdly, GDNN aggregates the embedded expressions and passes them through an MLP to generate the final predicted drug interaction type. GDNN achieved Test Hits@20=0.9037 on the ogb-ddi dataset, demonstrating that GDNN can predict DDI efficiently.
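The Hits@20 figure reported above is a ranking metric for link prediction. A minimal sketch of Hits@K follows, assuming the common convention in which a true (positive) edge counts as a hit when its score is strictly higher than the K-th best score among sampled negative edges; the function name `hits_at_k` and the toy scores are illustrative, not taken from the paper.

```python
def hits_at_k(pos_scores, neg_scores, k):
    """Fraction of positive edges scored above the k-th best negative edge."""
    if len(neg_scores) < k:
        return 1.0                          # fewer than k negatives: all hit
    threshold = sorted(neg_scores, reverse=True)[k - 1]
    hits = sum(1 for s in pos_scores if s > threshold)
    return hits / len(pos_scores)

# Toy scores: model outputs for true edges and for sampled non-edges.
pos = [0.9, 0.8, 0.3, 0.1]
neg = [0.7, 0.5, 0.2, 0.05]
score = hits_at_k(pos, neg, k=2)
```

Under this convention, a value of 0.9037 at K = 20 would mean that about 90% of held-out true interactions were scored above the 20th-ranked negative pair.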

CV · Apr 26
Deploy DINO with Many-to-Many Association

Haodong Jiang, Mingzhe Li, Junfeng Wu

Motivated by the limited generalization of supervised image matching models to unseen image domains, we explore the zero-shot deployment of DINO features for this task. The generalist visual representation extracted from DINO has inherent ambiguity when used to match feature points among semantically similar instances, prompting us to adopt a many-to-many (m-to-m) matching paradigm. However, the existing robust mechanism under m-to-m data association is computationally heavy, as it requires finding a maximum-cardinality matching in the inlier association graph for each parameter evaluation. To address this inefficiency, we introduce a novel likelihood perspective, which interprets the existing method as a zeroth-order approximation of an otherwise intractable likelihood calculation, and inspires us to propose a faster and finer-grained robust mechanism, termed Harmonic Consensus Maximization (HCM). Taking camera pose estimation as an exemplifying downstream task, we demonstrate that general-purpose visual features, used out of the box without any adaptation, can compete with specialized matching models on out-of-distribution datasets when mated with m-to-m association and the HCM mechanism.

CV · May 23, 2025
TokBench: Evaluating Your Visual Tokenizer before Visual Generation

Junfeng Wu, Dongliang Luo, Weizhi Zhao et al.

In this work, we reveal the limitations of visual tokenizers and VAEs in preserving fine-grained features, and propose a benchmark to evaluate reconstruction performance for two challenging visual contents: text and face. Visual tokenizers and VAEs have significantly advanced visual generation and multimodal modeling by providing more efficient compressed or quantized image representations. However, while helping production models reduce computational burdens, the information loss from image compression fundamentally limits the upper bound of visual generation quality. To evaluate this upper bound, we focus on assessing reconstructed text and facial features since they typically: 1) exist at smaller scales, 2) contain dense and rich textures, 3) are prone to collapse, and 4) are highly sensitive to human vision. We first collect and curate a diverse set of clear text and face images from existing datasets. Unlike approaches using VLM models, we employ established OCR and face recognition models for evaluation, ensuring accuracy while maintaining an exceptionally lightweight assessment process requiring just 2GB memory and 4 minutes to complete. Using our benchmark, we analyze text and face reconstruction quality across various scales for different image tokenizers and VAEs. Our results show modern visual tokenizers still struggle to preserve fine-grained features, especially at smaller scales. We further extend this evaluation framework to video, conducting comprehensive analysis of video tokenizers. Additionally, we demonstrate that traditional metrics fail to accurately reflect reconstruction performance for faces and text, while our proposed metrics serve as an effective complement.

LGDec 1, 2024
Online Poisoning Attack Against Reinforcement Learning under Black-box Environments

Jianhui Li, Bokang Zhang, Junfeng Wu

This paper proposes an online environment poisoning algorithm tailored for reinforcement learning agents operating in a black-box setting, where an adversary deliberately manipulates training data to lead the agent toward a mischievous policy. In contrast to prior studies that primarily investigate white-box settings, we focus on a scenario characterized by environment dynamics that are unknown to the attacker and a flexible reinforcement learning algorithm employed by the targeted agent. We first propose an attack scheme that is capable of poisoning the reward functions and state transitions. The poisoning task is formalized as a constrained optimization problem, following the framework of \cite{ma2019policy}. Given that the transition probabilities are unknown to the attacker in a black-box environment, we apply a stochastic gradient descent algorithm, where the exact gradients are approximated using sample-based estimates. A penalty-based method along with a bilevel reformulation is then employed to transform the problem into an unconstrained counterpart and to circumvent the double-sampling issue. The algorithm's effectiveness is validated through a maze environment.

CVMar 2, 2024
Consistent and Optimal Solution to Camera Motion Estimation

Guangyang Zeng, Qingcheng Zeng, Xinghan Li et al.

Given 2D point correspondences between an image pair, inferring the camera motion is a fundamental problem in computer vision. Existing works generally start from the epipolar constraint and estimate the essential matrix, which is not optimal in the maximum likelihood (ML) sense. In this paper, we return to the original measurement model with respect to the rotation matrix and normalized translation vector and formulate the ML problem. We then propose a two-step algorithm to solve it: in the first step, we estimate the variance of the measurement noise and devise a consistent estimator based on bias elimination; in the second step, we execute a single Gauss-Newton iteration on the manifold to refine the consistent estimate. We prove that the proposed estimate has the same asymptotic statistical properties as the ML estimate. The first is consistency, i.e., the estimate converges to the ground truth as the number of points increases. The second is asymptotic efficiency, i.e., the mean squared error of the estimate converges to the theoretical lower bound, the Cramér-Rao bound. In addition, we show that our algorithm has linear time complexity. These appealing characteristics give our estimator a great advantage in the case of dense point correspondences. Experiments on both synthetic data and real images demonstrate that when the number of points reaches the order of hundreds, our estimator outperforms state-of-the-art ones in terms of estimation accuracy and CPU time.

LGNov 18, 2025
SparseST: Exploiting Data Sparsity in Spatiotemporal Modeling and Prediction

Junfeng Wu, Hadjer Benmeziane, Kaoutar El Maghraoui et al.

Spatiotemporal data mining (STDM) has a wide range of applications in various complex physical systems (CPS), e.g., transportation, manufacturing, and healthcare. Among all the proposed methods, the Convolutional Long Short-Term Memory (ConvLSTM) has proved to be generalizable and extendable across applications, with multiple variants achieving state-of-the-art performance in various STDM applications. However, ConvLSTM and its variants are computationally expensive, which makes them impractical on edge devices with limited computational resources. With the emerging need for edge computing in CPS, efficient AI is essential to reduce the computational cost while preserving model performance. Common efficient-AI methods are developed to reduce redundancy in model capacity (e.g., model pruning and compression). However, spatiotemporal data mining naturally requires extensive model capacity, as the embedded dependencies in spatiotemporal data are complex and hard to capture, which limits the model redundancy. Instead, there is a fairly high level of data and feature redundancy that introduces an unnecessary computational burden, which has been largely overlooked in existing research. Therefore, we develop a novel framework, SparseST, that pioneers the exploitation of data sparsity to build an efficient spatiotemporal model. In addition, we explore and approximate the Pareto front between model performance and computational efficiency by designing a multi-objective composite loss function, which provides a practical guide for practitioners to adjust the model according to computational resource constraints and the performance requirements of downstream tasks.

CVMar 1, 2025
Interval Analysis for two spherical functions arising from robust Perspective-n-Lines problem

Xiang Zheng, Haodong Jiang, Junfeng Wu

This report presents a comprehensive interval analysis of two spherical functions derived from the robust Perspective-n-Lines (PnL) problem. The study is motivated by the application of a dimension-reduction technique to achieve global solutions for the robust PnL problem. We establish rigorous theoretical results, supported by detailed proofs, and validate our findings through extensive numerical simulations.

IRFeb 9, 2021
Real-time tracking of COVID-19 and coronavirus research updates through text mining

Yutong Jin, Jie Li, Xinyu Wang et al.

The novel coronavirus (SARS-CoV-2), which causes COVID-19, has led to an ongoing pandemic. Research is ongoing, with up to hundreds of publications uploaded to databases daily. We explore the use of artificial intelligence and natural language processing to efficiently sort through these publications. We demonstrate that clinical trial information, preclinical studies, and a general topic model can be used as text mining data intelligence tools that scientists all over the world can use as a resource for their own research. To evaluate our method, several metrics are used to measure the information extraction and clustering results. In addition, we demonstrate that our workflow has a use case not only for COVID-19 but for other disease areas as well. Overall, our system aims to allow scientists to research coronavirus more efficiently. Our automatically updating modules are available on our information portal at https://ghddi-ailab.github.io/Targeting2019-nCoV/ for public viewing.

BMFeb 8, 2021
ParaVS: A Simple, Fast, Efficient and Flexible Graph Neural Network Framework for Structure-Based Virtual Screening

Junfeng Wu, Dawei Leng, Lurong Pan

Structure-based virtual screening (SBVS) is a promising in silico technique that integrates computational methods into drug design. An extensively used method in SBVS is molecular docking. However, the docking process can hardly be computationally efficient and accurate simultaneously, because a classical-mechanics scoring function is used to approximate, but hardly reaches, quantum-mechanics precision. To reduce the computational cost of the protein-ligand scoring process and to use a data-driven approach to boost scoring-function accuracy, we introduce a docking-based SBVS method and, furthermore, a deep learning non-docking-based method that avoids the computational cost of the docking process. We then integrate these two methods into an easy-to-use framework, ParaVS, that offers both choices to researchers. A graph neural network (GNN) is employed in ParaVS, and we explain how our in-house GNN works and how to model ligands and molecular targets. To verify our approaches, cross-validation experiments are conducted on two datasets: an open dataset, Directory of Useful Decoys: Enhanced (DUD.E), and an in-house proprietary dataset without computationally generated artificial decoys (NoDecoy). On DUD.E we achieved a state-of-the-art AUC of 0.981 and a state-of-the-art enrichment factor at 2% of 36.2; on NoDecoy we achieved an AUC of 0.974. We further completed inference on an open database, the Enamine REAL Database (RDB), which comprises over 1.36 billion molecules, in 4050 core-hours using our ParaVS non-docking method (ParaVS-ND). The inference speed of ParaVS-ND is about 3.6e5 molecules per core-hour, whereas a conventional docking-based method processes around 20, making ParaVS-ND about 16000 times faster. The experiments indicate that ParaVS is accurate and computationally efficient, and can be generalized to different molecular targets.
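
The throughput comparison in this abstract can be roughly checked from the figures it quotes. The short Python sketch below only redoes that arithmetic; the variable and function names are illustrative and not taken from ParaVS. Because the abstract gives the molecule count only as a lower bound ("over 1.36 billion"), the result should match the quoted rate and speedup in order of magnitude rather than exactly.

```python
# Back-of-the-envelope check of the throughput figures quoted in the abstract.
# All names here are illustrative; only the three numbers come from the text.

molecules = 1.36e9     # "over 1.36 billion molecules" (a lower bound)
core_hours = 4050      # total compute reported for the ParaVS-ND run
docking_rate = 20      # molecules per core-hour for conventional docking


def throughput(n_molecules: float, hours: float) -> float:
    """Return molecules processed per core-hour."""
    return n_molecules / hours


nd_rate = throughput(molecules, core_hours)
speedup = nd_rate / docking_rate

print(f"ParaVS-ND rate: {nd_rate:.3g} molecules/core-hour")
print(f"Speedup over docking: about {speedup:.0f}x")
```

The computed rate is of the same order as the quoted 3.6e5 molecules per core-hour, and the ratio falls in the range of the quoted speedup of about 16000.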

SYApr 12, 2019
Event-Triggered Control for Consensus of Multi-Agent Systems with Nonlinear Output and Directed Topologies

Xinlei Yi, Shengjun Zhang, Tao Yang et al.

We propose a distributed event-triggered control law to solve the consensus problem for multi-agent systems with nonlinear output. Under the condition that the underlying digraph is strongly connected, we propose some sufficient conditions related to the nonlinear output function and initial states to guarantee that the event-triggered controller realizes consensus. Then the results are extended to the case where the underlying directed graph contains a directed spanning tree. These theoretical results are illustrated by numerical simulations.

GTNov 21, 2018
Learning Quadratic Games on Networks

Yan Leng, Xiaowen Dong, Junfeng Wu et al.

Individuals, or organizations, cooperate with or compete against one another in a wide range of practical situations. Such strategic interactions are often modeled as games played on networks, where an individual's payoff depends not only on her action but also on that of her neighbors. The current literature has largely focused on analyzing the characteristics of network games in the scenario where the structure of the network, which is represented by a graph, is known beforehand. It is often the case, however, that the actions of the players are readily observable while the underlying interaction network remains hidden. In this paper, we propose two novel frameworks for learning, from the observations on individual actions, network games with linear-quadratic payoffs, and in particular, the structure of the interaction network. Our frameworks are based on the Nash equilibrium of such games and involve solving a joint optimization problem for the graph structure and the individual marginal benefits. Both synthetic and real-world experiments demonstrate the effectiveness of the proposed frameworks, which have theoretical as well as practical implications for understanding strategic interactions in a network environment.
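
To make the role of the Nash equilibrium concrete, the sketch below computes the equilibrium of a small linear-quadratic network game by iterating best responses. The payoff form is an assumption: it is a common textbook parameterization, u_i = b_i a_i - a_i^2 / 2 + beta * a_i * sum_j G_ij a_j, and not necessarily the one used in the paper. The graph `G`, marginal benefits `b`, and strength `beta` are made-up illustrative values. The sketch shows only the forward map from (G, b) to actions. The paper addresses the inverse problem of recovering the graph and marginal benefits from observed actions.

```python
# Forward model of a linear-quadratic network game (assumed payoff form):
#   u_i(a) = b_i * a_i - 0.5 * a_i**2 + beta * a_i * sum_j G[i][j] * a_j
# The best response of player i is a_i = b_i + beta * sum_j G[i][j] * a_j,
# so the Nash equilibrium is the fixed point a = b + beta * G a.

def nash_equilibrium(G, b, beta, iters=500, tol=1e-12):
    """Iterate best responses until the actions stop changing.

    Converges when beta times the spectral radius of G is below 1.
    """
    n = len(b)
    a = list(b)
    for _ in range(iters):
        new_a = [
            b[i] + beta * sum(G[i][j] * a[j] for j in range(n))
            for i in range(n)
        ]
        if max(abs(x - y) for x, y in zip(new_a, a)) < tol:
            return new_a
        a = new_a
    return a


# Illustrative 3-player path graph: player 1 is linked to players 0 and 2.
G = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
b = [1.0, 1.0, 1.0]   # hypothetical marginal benefits
beta = 0.2            # hypothetical interaction strength

a_star = nash_equilibrium(G, b, beta)
print(a_star)
```

In this toy instance the central player, who has two neighbors, ends up with the largest equilibrium action. This dependence of actions on network position is what makes the graph identifiable from observed actions in principle.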

SYSep 4, 2016
Attack Allocation on Remote State Estimation in Multi-Systems: Structural Results and Asymptotic Solution

Xiaoqiang Ren, Junfeng Wu, Subhrakanti Dey et al.

This paper considers optimal attack attention allocation on remote state estimation in multi-systems. Suppose there are $\mathtt{M}$ independent systems, each of which has a remote sensor monitoring the system and sending its local estimates to a fusion center over a packet-dropping channel. An attacker may generate noises to exacerbate the communication channels between sensors and the fusion center. Due to capacity limitation, at each time the attacker can exacerbate at most $\mathtt{N}$ of the $\mathtt{M}$ channels. The goal of the attacker side is to seek an optimal policy maximizing the estimation error at the fusion center. The problem is formulated as a Markov decision process (MDP) problem, and the existence of an optimal deterministic and stationary policy is proved. We further show that the optimal policy has a threshold structure, by which the computational complexity is reduced significantly. Based on the threshold structure, a myopic policy is proposed for homogeneous models and its optimality is established. To overcome the curse of dimensionality of MDP algorithms for general heterogeneous models, we further provide an asymptotically (as $\mathtt{M}$ and $\mathtt{N}$ go to infinity) optimal solution, which is easy to compute and implement. Numerical examples are given to illustrate the main results.