LGMar 29, 2023Code
ALUM: Adversarial Data Uncertainty Modeling from Latent Model Uncertainty CompensationWei Wei, Jiahuan Zhou, Hongze Li et al. · pku
It is critical that the models pay attention not only to accuracy but also to the certainty of prediction. Uncertain predictions of deep models caused by noisy data raise significant concerns in trustworthy AI areas. To explore and handle uncertainty due to intrinsic data noise, we propose a novel method called ALUM to simultaneously handle the model uncertainty and data uncertainty in a unified scheme. Rather than solely modeling data uncertainty in the ultimate layer of a deep model based on randomly selected training data, we propose to explore mined adversarial triplets to facilitate data uncertainty modeling and non-parametric uncertainty estimations to compensate for the insufficiently trained latent model layers. Thus, the critical data uncertainty and model uncertainty caused by noisy data can be readily quantified for improving model robustness. Our proposed ALUM is model-agnostic which can be easily implemented into any existing deep model with little extra computation overhead. Extensive experiments on various noisy learning tasks validate the superior robustness and generalization ability of our method. The code is released at https://github.com/wwzjer/ALUM.
33.8AIMay 27
PetroBench: A Benchmark for Large Language Models in Petroleum EngineeringXiang Wang, Tingting Zhang, Sen Wang et al.
Large Language Models are increasingly applied in the petroleum industry, highlighting the need for a domain-specific evaluation framework. This study develops a benchmark for LLMs in petroleum engineering, including a three-stage process of data preprocessing, quality filtering, and multi-model validation. Using expert review, a standardized question bank with strong domain relevance and discriminative capability was constructed. The benchmark covers production, reservoir, and drilling engineering, with 1,200 questions across multiple-choice, true or false, term definition, and short-answer formats. Eight mainstream LLMs were evaluated under a unified API environment. Results show that models performed better on subjective than objective questions, indicating weaknesses in factual knowledge discrimination. The highest accuracies for multiple-choice and true or false questions were 65.3% and 74.3%, respectively. Gemini-3-Pro, Kimi-K2.5, and Claude-Opus-4.6-Thinking achieved the best overall scores of 72%-74%. Models performed best in production engineering and weakest in reservoir engineering. Chinese models showed advantages in multiple-choice questions, while international models performed slightly better in short-answer questions. The benchmark provides a reproducible and practical reference for evaluating and deploying LLMs in petroleum engineering.
LGMar 16, 2023
Adaptive Modeling of Uncertainties for Traffic ForecastingYing Wu, Yongchao Ye, Adnan Zeb et al.
Deep neural networks (DNNs) have emerged as a dominant approach for developing traffic forecasting models. These models are typically trained to minimize error on averaged test cases and produce a single-point prediction, such as a scalar value for traffic speed or travel time. However, single-point predictions fail to account for prediction uncertainty that is critical for many transportation management scenarios, such as determining the best- or worst-case arrival time. We present QuanTraffic, a generic framework to enhance the capability of an arbitrary DNN model for uncertainty modeling. QuanTraffic requires little human involvement and does not change the base DNN architecture during deployment. Instead, it automatically learns a standard quantile function during the DNN model training to produce a prediction interval for the single-point prediction. The prediction interval defines a range where the true value of the traffic prediction is likely to fall. Furthermore, QuanTraffic develops an adaptive scheme that dynamically adjusts the prediction interval based on the location and prediction window of the test input. We evaluated QuanTraffic by applying it to five representative DNN models for traffic forecasting across seven public datasets. We then compared QuanTraffic against five uncertainty quantification methods. Compared to the baseline uncertainty modeling techniques, QuanTraffic with base DNN architectures delivers consistently better and more robust performance than the existing ones on the reported datasets.
LGApr 24, 2022
An empirical study of the effect of background data size on the stability of SHapley Additive exPlanations (SHAP) for deep learning modelsHan Yuan, Mingxuan Liu, Lican Kang et al.
Nowadays, the interpretation of why a machine learning (ML) model makes certain inferences is as crucial as the accuracy of such inferences. Some ML models like the decision tree possess inherent interpretability that can be directly comprehended by humans. Others like artificial neural networks (ANN), however, rely on external methods to uncover the deduction mechanism. SHapley Additive exPlanations (SHAP) is one of such external methods, which requires a background dataset when interpreting ANNs. Generally, a background dataset consists of instances randomly sampled from the training dataset. However, the sampling size and its effect on SHAP remain to be unexplored. In our empirical study on the MIMIC-III dataset, we show that the two core explanations - SHAP values and variable rankings fluctuate when using different background datasets acquired from random sampling, indicating that users cannot unquestioningly trust the one-shot interpretation from SHAP. Luckily, such fluctuation decreases with the increase of the background dataset size. Also, we notice an U-shape in the stability assessment of SHAP variable rankings, demonstrating that SHAP is more reliable in ranking the most and least important variables compared to moderately important ones. Overall, our results suggest that users should take into account how background data affects SHAP results, with improved SHAP stability as the background sample size increases.
AIJul 12, 2024
FedVAE: Trajectory privacy preserving based on Federated Variational AutoEncoderYuchen Jiang, Ying Wu, Shiyao Zhang et al.
The use of trajectory data with abundant spatial-temporal information is pivotal in Intelligent Transport Systems (ITS) and various traffic system tasks. Location-Based Services (LBS) capitalize on this trajectory data to offer users personalized services tailored to their location information. However, this trajectory data contains sensitive information about users' movement patterns and habits, necessitating confidentiality and protection from unknown collectors. To address this challenge, privacy-preserving methods like K-anonymity and Differential Privacy have been proposed to safeguard private information in the dataset. Despite their effectiveness, these methods can impact the original features by introducing perturbations or generating unrealistic trajectory data, leading to suboptimal performance in downstream tasks. To overcome these limitations, we propose a Federated Variational AutoEncoder (FedVAE) approach, which effectively generates a new trajectory dataset while preserving the confidentiality of private information and retaining the structure of the original features. In addition, FedVAE leverages Variational AutoEncoder (VAE) to maintain the original feature space and generate new trajectory data, and incorporates Federated Learning (FL) during the training stage, ensuring that users' data remains locally stored to protect their personal information. The results demonstrate its superior performance compared to other existing methods, affirming FedVAE as a promising solution for enhancing data privacy and utility in location-based applications.
CVSep 14, 2023
Flexible Visual Recognition by Evidential Modeling of Confusion and IgnoranceLei Fan, Bo Liu, Haoxiang Li et al.
In real-world scenarios, typical visual recognition systems could fail under two major causes, i.e., the misclassification between known classes and the excusable misbehavior on unknown-class images. To tackle these deficiencies, flexible visual recognition should dynamically predict multiple classes when they are unconfident between choices and reject making predictions when the input is entirely out of the training distribution. Two challenges emerge along with this novel task. First, prediction uncertainty should be separately quantified as confusion depicting inter-class uncertainties and ignorance identifying out-of-distribution samples. Second, both confusion and ignorance should be comparable between samples to enable effective decision-making. In this paper, we propose to model these two sources of uncertainty explicitly with the theory of Subjective Logic. Regarding recognition as an evidence-collecting process, confusion is then defined as conflicting evidence, while ignorance is the absence of evidence. By predicting Dirichlet concentration parameters for singletons, comprehensive subjective opinions, including confusion and ignorance, could be achieved via further evidence combinations. Through a series of experiments on synthetic data analysis, visual recognition, and open-set detection, we demonstrate the effectiveness of our methods in quantifying two sources of uncertainties and dealing with flexible recognition.
CVOct 4, 2023
A Grammatical Compositional Model for Video Action DetectionZhijun Zhang, Xu Zou, Jiahuan Zhou et al. · pku
Analysis of human actions in videos demands understanding complex human dynamics, as well as the interaction between actors and context. However, these interaction relationships usually exhibit large intra-class variations from diverse human poses or object manipulations, and fine-grained inter-class differences between similar actions. Thus the performance of existing methods is severely limited. Motivated by the observation that interactive actions can be decomposed into actor dynamics and participating objects or humans, we propose to investigate the composite property of them. In this paper, we present a novel Grammatical Compositional Model (GCM) for action detection based on typical And-Or graphs. Our model exploits the intrinsic structures and latent relationships of actions in a hierarchical manner to harness both the compositionality of grammar models and the capability of expressing rich features of DNNs. The proposed model can be readily embodied into a neural network module for efficient optimization in an end-to-end manner. Extensive experiments are conducted on the AVA dataset and the Something-Else task to demonstrate the superiority of our model, meanwhile the interpretability is enhanced through an inference parsing procedure.
LGMar 29, 2023
Beyond Empirical Risk Minimization: Local Structure Preserving Regularization for Improving Adversarial RobustnessWei Wei, Jiahuan Zhou, Ying Wu · pku
It is broadly known that deep neural networks are susceptible to being fooled by adversarial examples with perturbations imperceptible by humans. Various defenses have been proposed to improve adversarial robustness, among which adversarial training methods are most effective. However, most of these methods treat the training samples independently and demand a tremendous amount of samples to train a robust network, while ignoring the latent structural information among these samples. In this work, we propose a novel Local Structure Preserving (LSP) regularization, which aims to preserve the local structure of the input space in the learned embedding space. In this manner, the attacking effect of adversarial samples lying in the vicinity of clean samples can be alleviated. We show strong empirical evidence that with or without adversarial training, our method consistently improves the performance of adversarial robustness on several image classification datasets compared to the baselines and some state-of-the-art approaches, thus providing promising direction for future research.
CVNov 23, 2023
Evidential Active Recognition: Intelligent and Prudent Open-World Embodied PerceptionLei Fan, Mingfu Liang, Yunxuan Li et al.
Active recognition enables robots to intelligently explore novel observations, thereby acquiring more information while circumventing undesired viewing conditions. Recent approaches favor learning policies from simulated or collected data, wherein appropriate actions are more frequently selected when the recognition is accurate. However, most recognition modules are developed under the closed-world assumption, which makes them ill-equipped to handle unexpected inputs, such as the absence of the target object in the current observation. To address this issue, we propose treating active recognition as a sequential evidence-gathering process, providing by-step uncertainty quantification and reliable prediction under the evidence combination theory. Additionally, the reward function developed in this paper effectively characterizes the merit of actions when operating in open-world environments. To evaluate the performance, we collect a dataset from an indoor simulator, encompassing various recognition challenges such as distance, occlusion levels, and visibility. Through a series of experiments on recognition and robustness analysis, we demonstrate the necessity of introducing uncertainties to active recognition and the superior performance of the proposed method.
CVNov 28, 2023
Active Open-Vocabulary Recognition: Let Intelligent Moving Mitigate CLIP LimitationsLei Fan, Jianxiong Zhou, Xiaoying Xing et al.
Active recognition, which allows intelligent agents to explore observations for better recognition performance, serves as a prerequisite for various embodied AI tasks, such as grasping, navigation and room arrangements. Given the evolving environment and the multitude of object classes, it is impractical to include all possible classes during the training stage. In this paper, we aim at advancing active open-vocabulary recognition, empowering embodied agents to actively perceive and classify arbitrary objects. However, directly adopting recent open-vocabulary classification models, like Contrastive Language Image Pretraining (CLIP), poses its unique challenges. Specifically, we observe that CLIP's performance is heavily affected by the viewpoint and occlusions, compromising its reliability in unconstrained embodied perception scenarios. Further, the sequential nature of observations in agent-environment interactions necessitates an effective method for integrating features that maintains discriminative strength for open-vocabulary classification. To address these issues, we introduce a novel agent for active open-vocabulary recognition. The proposed method leverages inter-frame and inter-concept similarities to navigate agent movements and to fuse features, without relying on class-specific knowledge. Compared to baseline CLIP model with 29.6% accuracy on ShapeNet dataset, the proposed agent could achieve 53.3% accuracy for open-vocabulary recognition, without any fine-tuning to the equipped CLIP model. Additional experiments conducted with the Habitat simulator further affirm the efficacy of our method.
CVMar 18, 2025Code
Where do Large Vision-Language Models Look at when Answering Questions?Xiaoying Xing, Chia-Wen Kuo, Li Fuxin et al.
Large Vision-Language Models (LVLMs) have shown promising performance in vision-language understanding and reasoning tasks. However, their visual understanding behaviors remain underexplored. A fundamental question arises: to what extent do LVLMs rely on visual input, and which image regions contribute to their responses? It is non-trivial to interpret the free-form generation of LVLMs due to their complicated visual architecture (e.g., multiple encoders and multi-resolution) and variable-length outputs. In this paper, we extend existing heatmap visualization methods (e.g., iGOS++) to support LVLMs for open-ended visual question answering. We propose a method to select visually relevant tokens that reflect the relevance between generated answers and input image. Furthermore, we conduct a comprehensive analysis of state-of-the-art LVLMs on benchmarks designed to require visual information to answer. Our findings offer several insights into LVLM behavior, including the relationship between focus region and answer correctness, differences in visual attention across architectures, and the impact of LLM scale on visual understanding. The code and data are available at https://github.com/bytedance/LVLM_Interpretation.
21.4ROApr 7
Tackling the Kidnapped Robot Problem via Sparse Feasible Hypothesis Sampling and Reliable Batched Multi-Stage InferenceMuhua Zhang, Lei Ma, Ying Wu et al.
This paper addresses the Kidnapped Robot Problem (KRP), a core localization challenge of relocalizing a robot in a known map without prior pose estimate upon localization loss or at SLAM initialization. For this purpose, a passive 2-D global relocalization framework is proposed. It estimates the global pose efficiently and reliably from a single LiDAR scan and an occupancy grid map while the robot remains stationary, thereby enhancing the long-term autonomy of mobile robots. The proposed framework casts global relocalization as a non-convex problem and solves it via the multi-hypothesis scheme with batched multi-stage inference and early termination, balancing completeness and efficiency. The Rapidly-exploring Random Tree (RRT), under traversability constraints, asymptotically covers the reachable space to generate sparse, uniformly distributed feasible positional hypotheses, fundamentally reducing the sampling space. The hypotheses are preliminarily ordered by the proposed Scan Mean Absolute Difference (SMAD), a coarse beam-error level metric that facilitates the early termination by prioritizing high-likelihood candidates. The SMAD computation is optimized for limited scan measurements. The Translation-Affinity Scan-to-Map Alignment Metric (TAM) is proposed for reliable orientation selection at hypothesized positions and accurate final global pose evaluation to mitigate degradation in conventional likelihood-field metrics under translational uncertainty induced by sparse hypotheses, as well as non-panoramic LiDAR scan and environmental changes. Real-world experiments on a resource-constrained mobile robot with non-panoramic LiDAR scans show that the proposed framework achieves competitive performance in success rate, robustness under measurement uncertainty, and computational efficiency.
LGOct 10, 2023
A New Causal Rule Learning Approach to Interpretable Estimation of Heterogeneous Treatment EffectYing Wu, Hanzhong Liu, Kai Ren et al.
Interpretability plays a crucial role in the application of statistical learning to estimate heterogeneous treatment effects (HTE) in complex diseases. In this study, we leverage a rule-based workflow, namely causal rule learning (CRL), to estimate and improve our understanding of HTE for atrial septal defect, addressing an overlooked question in the previous literature: what if an individual simultaneously belongs to multiple groups with different average treatment effects? The CRL process consists of three steps: rule discovery, which generates a set of causal rules with corresponding subgroup average treatment effects; rule selection, which identifies a subset of these rules to deconstruct individual-level treatment effects as a linear combination of subgroup-level effects; and rule analysis, which presents a detailed procedure for further analyzing each selected rule from multiple perspectives to identify the most promising rules for validation. Extensive simulation studies and real-world data analysis demonstrate that CRL outperforms other methods in providing interpretable estimates of HTE, especially when dealing with complex ground truth and sufficient sample sizes.
CVOct 10, 2021Code
Morphable Detector for Object Detection on DemandXiangyun Zhao, Xu Zou, Ying Wu
Many emerging applications of intelligent robots need to explore and understand new environments, where it is desirable to detect objects of novel classes on the fly with minimum online efforts. This is an object detection on demand (ODOD) task. It is challenging, because it is impossible to annotate a large number of data on the fly, and the embedded systems are usually unable to perform back-propagation which is essential for training. Most existing few-shot detection methods are confronted here as they need extra training. We propose a novel morphable detector (MD), that simply "morphs" some of its changeable parameters online estimated from the few samples, so as to detect novel classes without any extra training. The MD has two sets of parameters, one for the feature embedding and the other for class representation (called "prototypes"). Each class is associated with a hidden prototype to be learned by integrating the visual and semantic embeddings. The learning of the MD is based on the alternate learning of the feature embedding and the prototypes in an EM-like approach which allows the recovery of an unknown prototype from a few samples of a novel class. Once an MD is learned, it is able to use a few samples of a novel class to directly compute its prototype to fulfill the online morphing process. We have shown the superiority of the MD in Pascal, COCO and FSOD datasets. The code is available https://github.com/Zhaoxiangyun/Morphable-Detector.
CVAug 9, 2019Code
Recognizing Part Attributes with Insufficient DataXiangyun Zhao, Yi Yang, Feng Zhou et al.
Recognizing attributes of objects and their parts is important to many computer vision applications. Although great progress has been made to apply object-level recognition, recognizing the attributes of parts remains less applicable since the training data for part attributes recognition is usually scarce especially for internet-scale applications. Furthermore, most existing part attribute recognition methods rely on the part annotation which is more expensive to obtain. To solve the data insufficiency problem and get rid of dependence on the part annotation, we introduce a novel Concept Sharing Network (CSN) for part attribute recognition. A great advantage of CSN is its capability of recognizing the part attribute (a combination of part location and appearance pattern) that has insufficient or zero training data, by learning the part location and appearance pattern respectively from the training data that usually mix them in a single label. Extensive experiments on CUB-200-2011 [51], CelebA [35] and a newly proposed human attribute dataset demonstrate the effectiveness of CSN and its advantages over other methods, especially for the attributes with few training samples. Further experiments show that CSN can also perform zero-shot part attribute recognition. The code will be made available at https://github.com/Zhaoxiangyun/Concept-Sharing-Network.
CVMar 26, 2024
AIDE: An Automatic Data Engine for Object Detection in Autonomous DrivingMingfu Liang, Jong-Chyi Su, Samuel Schulter et al.
Autonomous vehicle (AV) systems rely on robust perception models as a cornerstone of safety assurance. However, objects encountered on the road exhibit a long-tailed distribution, with rare or unseen categories posing challenges to a deployed perception model. This necessitates an expensive process of continuously curating and annotating data with significant human effort. We propose to leverage recent advances in vision-language and large language models to design an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios. This process operates iteratively, allowing for continuous self-improvement of the model. We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
CYApr 1, 2025
From Intuition to Understanding: Using AI Peers to Overcome Physics MisconceptionsRuben Weijers, Denton Wu, Hannah Betts et al.
Generative AI has the potential to transform personalization and accessibility of education. However, it raises serious concerns about accuracy and helping students become independent critical thinkers. In this study, we designed a helpful AI "Peer" to help students correct fundamental physics misconceptions related to Newtonian mechanic concepts. In contrast to approaches that seek near-perfect accuracy to create an authoritative AI tutor or teacher, we directly inform students that this AI can answer up to 40% of questions incorrectly. In a randomized controlled trial with 165 students, those who engaged in targeted dialogue with the AI Peer achieved post-test scores that were, on average, 10.5 percentage points higher - with over 20 percentage points higher normalized gain - than a control group that discussed physics history. Qualitative feedback indicated that 91% of the treatment group's AI interactions were rated as helpful. Furthermore, by comparing student performance on pre- and post-test questions about the same concept, along with experts' annotations of the AI interactions, we find initial evidence suggesting the improvement in performance does not depend on the correctness of the AI. With further research, the AI Peer paradigm described here could open new possibilities for how we learn, adapt to, and grow with AI.
26.3CVApr 21
Feature Perturbation Pool-based Fusion Network for Unified Multi-Class Industrial Defect DetectionYuanchan Xu, Wenjun Zang, Ying Wu
Multi-class defect detection constitutes a critical yet challenging task in industrial quality inspection, where existing approaches typically suffer from two fundamental limitations: (i) the necessity of training separate models for each defect category, resulting in substantial computational and memory overhead, and (ii) degraded robustness caused by inter-class feature perturbation when heterogeneous defect categories are jointly modeled. In this paper, we present FPFNet, a Feature Perturbation Pool-based Fusion Network that synergistically integrates a stochastic feature perturbation pool with a multi-layer feature fusion strategy to address these challenges within a unified detection framework. The feature perturbation pool enriches the training distribution by randomly injecting diverse noise patterns -- including Gaussian noise, F-Noise, and F-Drop -- into the extracted feature representations, thereby strengthening the model's robustness against domain shifts and unseen defect morphologies. Concurrently, the multi-layer feature fusion module aggregates hierarchical feature representations from both the encoder and decoder through residual connections and normalization, enabling the network to capture complex cross-scale relationships while preserving fine-grained spatial details essential for precise defect localization. Built upon the UniAD architecture~\cite{you2022unified}, our method achieves state-of-the-art performance on two widely adopted benchmarks: 97.17\% image-level AUROC and 96.93\% pixel-level AUROC on MVTec-AD, and 91.08\% image-level AUROC and 99.08\% pixel-level AUROC on VisA, surpassing existing methods by notable margins while introducing no additional learnable parameters or computational complexity.
CVMar 26, 2025
Incremental Object Keypoint LearningMingfu Liang, Jiahuan Zhou, Xu Zou et al.
Existing progress in object keypoint estimation primarily benefits from the conventional supervised learning paradigm based on numerous data labeled with pre-defined keypoints. However, these well-trained models can hardly detect the undefined new keypoints in test time, which largely hinders their feasibility for diverse downstream tasks. To handle this, various solutions are explored but still suffer from either limited generalizability or transferability. Therefore, in this paper, we explore a novel keypoint learning paradigm in that we only annotate new keypoints in the new data and incrementally train the model, without retaining any old data, called Incremental object Keypoint Learning (IKL). A two-stage learning scheme as a novel baseline tailored to IKL is developed. In the first Knowledge Association stage, given the data labeled with only new keypoints, an auxiliary KA-Net is trained to automatically associate the old keypoints to these new ones based on their spatial and intrinsic anatomical relations. In the second Mutual Promotion stage, based on a keypoint-oriented spatial distillation loss, we jointly leverage the auxiliary KA-Net and the old model for knowledge consolidation to mutually promote the estimation of all old and new keypoints. Owing to the investigation of the correlations between new and old keypoints, our proposed method can not just effectively mitigate the catastrophic forgetting of old keypoints, but may even further improve the estimation of the old ones and achieve a positive transfer beyond anti-forgetting. Such an observation has been solidly verified by extensive experiments on different keypoint datasets, where our method exhibits superiority in alleviating the forgetting issue and boosting performance while enjoying labeling efficiency even under the low-shot data regime.
IVFeb 16, 2024
GAN-driven Electromagnetic Imaging of 2-D Dielectric ScatterersEhtasham Naseer, Ali Imran Sandhu, Muhammad Adnan Siddique et al.
Inverse scattering problems are inherently challenging, given the fact they are ill-posed and nonlinear. This paper presents a powerful deep learning-based approach that relies on generative adversarial networks to accurately and efficiently reconstruct randomly-shaped two-dimensional dielectric objects from amplitudes of multi-frequency scattered electric fields. An adversarial autoencoder (AAE) is trained to learn to generate the scatterer's geometry from a lower-dimensional latent representation constrained to adhere to the Gaussian distribution. A cohesive inverse neural network (INN) framework is set up comprising a sequence of appropriately designed dense layers, the already-trained generator as well as a separately trained forward neural network. The images reconstructed at the output of the inverse network are validated through comparison with outputs from the forward neural network, addressing the non-uniqueness challenge inherent to electromagnetic (EM) imaging problems. The trained INN demonstrates an enhanced robustness, evidenced by a mean binary cross-entropy (BCE) loss of $0.13$ and a structure similarity index (SSI) of $0.90$. The study not only demonstrates a significant reduction in computational load, but also marks a substantial improvement over traditional objective-function-based methods. It contributes both to the fields of machine learning and EM imaging by offering a real-time quantitative imaging approach. The results obtained with the simulated data, for both training and testing, yield promising results and may open new avenues for radio-frequency inverse imaging.
CVOct 23, 2025
AutoScape: Geometry-Consistent Long-Horizon Scene GenerationJiacheng Chen, Ziyu Jiang, Mingfu Liang et al.
This paper proposes AutoScape, a long-horizon driving scene generation framework. At its core is a novel RGB-D diffusion model that iteratively generates sparse, geometrically consistent keyframes, serving as reliable anchors for the scene's appearance and geometry. To maintain long-range geometric consistency, the model 1) jointly handles image and depth in a shared latent space, 2) explicitly conditions on the existing scene geometry (i.e., rendered point clouds) from previously generated keyframes, and 3) steers the sampling process with a warp-consistent guidance. Given high-quality RGB-D keyframes, a video diffusion model then interpolates between them to produce dense and coherent video frames. AutoScape generates realistic and geometrically consistent driving videos of over 20 seconds, improving the long-horizon FID and FVD scores over the prior state-of-the-art by 48.6\% and 43.0\%, respectively.
NEOct 4, 2025
Evolutionary Computation as Natural Generative AIYaxin Shi, Abhishek Gupta, Ying Wu et al.
Generative AI (GenAI) has achieved remarkable success across a range of domains, but its capabilities remain constrained to statistical models of finite training sets and learning based on local gradient signals. This often results in artifacts that are more derivative than genuinely generative. In contrast, Evolutionary Computation (EC) offers a search-driven pathway to greater diversity and creativity, expanding generative capabilities by exploring uncharted solution spaces beyond the limits of available data. This work establishes a fundamental connection between EC and GenAI, redefining EC as Natural Generative AI (NatGenAI) -- a generative paradigm governed by exploratory search under natural selection. We demonstrate that classical EC with parent-centric operators mirrors conventional GenAI, while disruptive operators enable structured evolutionary leaps, often within just a few generations, to generate out-of-distribution artifacts. Moreover, the methods of evolutionary multitasking provide an unparalleled means of integrating disruptive EC (with cross-domain recombination of evolved features) and moderated selection mechanisms (allowing novel solutions to survive), thereby fostering sustained innovation. By reframing EC as NatGenAI, we emphasize structured disruption and selection pressure moderation as essential drivers of creativity. This perspective extends the generative paradigm beyond conventional boundaries and positions EC as crucial to advancing exploratory design, innovation, scientific discovery, and open-ended generation in the GenAI era.
IMDec 9, 2024
StarWhisper Telescope: An AI framework for automating end-to-end astronomical observationsCunshi Wang, Yu Zhang, Yuyang Li et al.
The exponential growth of large-scale telescope arrays has boosted time-domain astronomy development but introduced operational bottlenecks, including labor-intensive observation planning, data processing, and real-time decision-making. Here we present the StarWhisper Telescope system, an AI agent framework automating end-to-end astronomical observations for surveys like the Nearby Galaxy Supernovae Survey. By integrating large language models with specialized function calls and modular workflows, StarWhisper Telescope autonomously generates site-specific observation lists, executes real-time image analysis via pipelines, and dynamically triggers follow-up proposals upon transient detection. The system reduces human intervention through automated observation planning, telescope controlling and data processing, while enabling seamless collaboration between amateur and professional astronomers. Deployed across Nearby Galaxy Supernovae Survey's network of 10 amateur telescopes, the StarWhisper Telescope has detected transients with promising response times relative to existing surveys. Furthermore, StarWhisper Telescope's scalable agent architecture provides a blueprint for future facilities like the Global Open Transient Telescope Array, where AI-driven autonomy will be critical for managing 60 telescopes.
CVJan 25, 2022
SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question AnsweringPeixi Xiong, Quanzeng You, Pei Yu et al.
Visual Question Answering (VQA) attracts much attention from both industry and academia. As a multi-modality task, it is challenging since it requires not only visual and textual understanding, but also the ability to align cross-modality representations. Previous approaches extensively employ entity-level alignments, such as the correlations between the visual regions and their semantic labels, or the interactions across question words and object features. These attempts aim to improve the cross-modality representations, while ignoring their internal relations. Instead, we propose to apply structured alignments, which work with graph representation of visual and textual content, aiming to capture the deep connections between the visual and textual modalities. Nevertheless, it is nontrivial to represent and integrate graphs for structured alignments. In this work, we attempt to solve this issue by first converting different modality entities into sequential nodes and the adjacency graph, then incorporating them for structured alignments. As demonstrated in our experimental results, such a structured alignment improves reasoning performance. In addition, our model also exhibits better interpretability for each generated answer. The proposed model, without any pretraining, outperforms the state-of-the-art methods on GQA dataset, and beats the non-pretrained state-of-the-art methods on VQA-v2 dataset.
LGSep 21, 2021
Toward a Fairness-Aware Scoring System for Algorithmic Decision-MakingYi Yang, Ying Wu, Mei Li et al.
Scoring systems, as a type of predictive model, have significant advantages in interpretability and transparency and facilitate quick decision-making. As such, scoring systems have been extensively used in a wide variety of industries such as healthcare and criminal justice. However, the fairness issues in these models have long been criticized, and the use of big data and machine learning algorithms in the construction of scoring systems heightens this concern. In this paper, we propose a general framework to create fairness-aware, data-driven scoring systems. First, we develop a social welfare function that incorporates both efficiency and group fairness. Then, we transform the social welfare maximization problem into the risk minimization task in machine learning, and derive a fairness-aware scoring system with the help of mixed integer programming. Lastly, several theoretical bounds are derived for providing parameter selection suggestions. Our proposed framework provides a suitable solution to address group fairness concerns in the development of scoring systems. It enables policymakers to set and customize their desired fairness requirements as well as other application-specific constraints. We test the proposed algorithm with several empirical data sets. Experimental evidence supports the effectiveness of the proposed scoring system in achieving the optimal welfare of stakeholders and in balancing the needs for interpretability, fairness, and efficiency.
CVJul 19, 2021
Unsupervised Embedding Learning from Uncertainty Momentum ModelingJiahuan Zhou, Yansong Tang, Bing Su et al.
Existing popular unsupervised embedding learning methods focus on enhancing the instance-level local discrimination of the given unlabeled images by exploring various negative data. However, the existed sample outliers which exhibit large intra-class divergences or small inter-class variations severely limit their learning performance. We justify that the performance limitation is caused by the gradient vanishing on these sample outliers. Moreover, the shortage of positive data and disregard for global discrimination consideration also pose critical issues for unsupervised learning but are always ignored by existing methods. To handle these issues, we propose a novel solution to explicitly model and directly explore the uncertainty of the given unlabeled learning samples. Instead of learning a deterministic feature point for each sample in the embedding space, we propose to represent a sample by a stochastic Gaussian with the mean vector depicting its space localization and covariance vector representing the sample uncertainty. We leverage such uncertainty modeling as momentum to the learning which is helpful to tackle the outliers. Furthermore, abundant positive candidates can be readily drawn from the learned instance-specific distributions which are further adopted to mitigate the aforementioned issues. Thorough rationale analyses and extensive experiments are presented to verify our superiority.
CVMar 24, 2021
Beyond Visual Attractiveness: Physically Plausible Single Image HDR Reconstruction for Spherical PanoramasWei Wei, Li Guan, Yue Liu et al.
HDR reconstruction is an important task in computer vision with many industrial needs. The traditional approaches merge multiple exposure shots to generate HDRs that correspond to the physical quantity of illuminance of the scene. However, the tedious capturing process makes such multi-shot approaches inconvenient in practice. In contrast, recent single-shot methods predict a visually appealing HDR from a single LDR image through deep learning. But it is not clear whether the previously mentioned physical properties would still hold, without training the network to explicitly model them. In this paper, we introduce the physical illuminance constraints to our single-shot HDR reconstruction framework, with a focus on spherical panoramas. By the proposed physical regularization, our method can generate HDRs which are not only visually appealing but also physically plausible. For evaluation, we collect a large dataset of LDR and HDR images with ground truth illuminance measures. Extensive experiments show that our HDR images not only maintain high visual quality but also top all baseline methods in illuminance prediction accuracy.
ROMar 7, 2021
MetaView: Few-shot Active Object RecognitionWei Wei, Haonan Yu, Haichao Zhang et al.
In robot sensing scenarios, instead of passively utilizing human captured views, an agent should be able to actively choose informative viewpoints of a 3D object as discriminative evidence to boost the recognition accuracy. This task is referred to as active object recognition. Recent works on this task rely on a massive amount of training examples to train an optimal view selection policy. But in realistic robot sensing scenarios, the large-scale training data may not exist and whether the intelligent view selection policy can be still learned from few object samples remains unclear. In this paper, we study this new problem which is extremely challenging but very meaningful in robot sensing -- Few-shot Active Object Recognition, i.e., to learn view selection policies from few object samples, which has not been considered and addressed before. We solve the proposed problem by adopting the framework of meta learning and name our method "MetaView". Extensive experiments on both category-level and instance-level classification tasks demonstrate that the proposed method can efficiently resolve issues that are hard for state-of-the-art active object recognition methods to handle, and outperform several baselines by large margins.
CVDec 13, 2020
Contrastive Learning for Label-Efficient Semantic SegmentationXiangyun Zhao, Raviteja Vemulapalli, Philip Mansfield et al.
Collecting labeled data for the task of semantic segmentation is expensive and time-consuming, as it requires dense pixel-level annotations. While recent Convolutional Neural Network (CNN) based semantic segmentation approaches have achieved impressive results by using large amounts of labeled training data, their performance drops significantly as the amount of labeled data decreases. This happens because deep CNNs trained with the de facto cross-entropy loss can easily overfit to small amounts of labeled data. To address this issue, we propose a simple and effective contrastive learning-based training strategy in which we first pretrain the network using a pixel-wise, label-based contrastive loss, and then fine-tune it using the cross-entropy loss. This approach increases intra-class compactness and inter-class separability, thereby resulting in a better pixel classifier. We demonstrate the effectiveness of the proposed training strategy using the Cityscapes and PASCAL VOC 2012 segmentation datasets. Our results show that pretraining with the proposed contrastive loss results in large performance gains (more than 20% absolute improvement in some settings) when the amount of labeled data is limited. In many settings, the proposed contrastive pretraining strategy, which does not use any additional data, is able to match or outperform the widely-used ImageNet pretraining strategy that uses more than a million additional labeled images.
CVAug 15, 2020
Object Detection with a Unified Label Space from Multiple DatasetsXiangyun Zhao, Samuel Schulter, Gaurav Sharma et al.
Given multiple datasets with different label spaces, the goal of this work is to train a single object detector predicting over the union of all the label spaces. The practical benefits of such an object detector are obvious and significant application-relevant categories can be picked and merged form arbitrary existing datasets. However, naive merging of datasets is not possible in this case, due to inconsistent object annotations. Consider an object category like faces that is annotated in one dataset, but is not annotated in another dataset, although the object itself appears in the latter images. Some categories, like face here, would thus be considered foreground in one dataset, but background in another. To address this challenge, we design a framework which works with such partial annotations, and we exploit a pseudo labeling approach that we adapt for our specific case. We propose loss functions that carefully integrate partial but correct annotations with complementary but noisy pseudo labels. Evaluation in the proposed novel setting requires full annotation on the test set. We collect the required annotations and define a new challenging experimental setup for this task based one existing public datasets. We show improved performances compared to competitive baselines and appropriate adaptations of existing work.
CVAug 11, 2020
Exposing Deep-faked Videos by Anomalous Co-motion Pattern DetectionGengxing Wang, Jiahuan Zhou, Ying Wu
Recent deep learning based video synthesis approaches, in particular with applications that can forge identities such as "DeepFake", have raised great security concerns. Therefore, corresponding deep forensic methods are proposed to tackle this problem. However, existing methods are either based on unexplainable deep networks which greatly degrades the principal interpretability factor to media forensic, or rely on fragile image statistics such as noise pattern, which in real-world scenarios can be easily deteriorated by data compression. In this paper, we propose an fully-interpretable video forensic method that is designed specifically to expose deep-faked videos. To enhance generalizability on videos with various content, we model the temporal motion of multiple specific spatial locations in the videos to extract a robust and reliable representation, called Co-Motion Pattern. Such kind of conjoint pattern is mined across local motion features which is independent of the video contents so that the instance-wise variation can also be largely alleviated. More importantly, our proposed co-motion pattern possesses both superior interpretability and sufficient robustness against data compression for deep-faked videos. We conduct extensive experiments to empirically demonstrate the superiority and effectiveness of our approach under both classification and anomaly detection evaluation settings against the state-of-the-art deep forensic methods.
CVJun 13, 2020
Uncertainty-aware Score Distribution Learning for Action Quality AssessmentYansong Tang, Zanlin Ni, Jiahuan Zhou et al.
Assessing action quality from videos has attracted growing attention in recent years. Most existing approaches usually tackle this problem based on regression algorithms, which ignore the intrinsic ambiguity in the score labels caused by multiple judges or their subjective appraisals. To address this issue, we propose an uncertainty-aware score distribution learning (USDL) approach for action quality assessment (AQA). Specifically, we regard an action as an instance associated with a score distribution, which describes the probability of different evaluated scores. Moreover, under the circumstance where fine-grained score labels are available (e.g., difficulty degree of an action or multiple scores from different judges), we further devise a multi-path uncertainty-aware score distributions learning (MUSDL) method to explore the disentangled components of a score. We conduct experiments on three AQA datasets containing various Olympic actions and surgical activities, where our approaches set new state-of-the-arts under the Spearman's Rank Correlation.
CVMay 16, 2019
Bimodal Stereo: Joint Shape and Pose Estimation from Color-Depth Image PairChi Zhang, Yuehu Liu, Ying Wu et al.
Mutual calibration between color and depth cameras is a challenging topic in multi-modal data registration. In this paper, we are confronted with a "Bimodal Stereo" problem, which aims to solve camera pose from a pair of an uncalibrated color image and a depth map from different views automatically. To address this problem, an iterative Shape-from-Shading (SfS) based framework is proposed to estimate shape and pose simultaneously. In the pipeline, the estimated shape is refined by the shape prior from the given depth map under the estimated pose. Meanwhile, the estimated pose is improved by the registration of estimated shape and shape from given depth map. We also introduce a shading based refinement in the pipeline to address noisy depth map with holes. Extensive experiments showed that through our method, both the depth map, the recovered shape as well as its pose can be desirably refined and recovered.
CVMar 16, 2019
Visual Query Answering by Entity-Attribute Graph Matching and ReasoningPeixi Xiong, Huayi Zhan, Xin Wang et al.
Visual Query Answering (VQA) is of great significance in offering people convenience: one can raise a question for details of objects, or high-level understanding about the scene, over an image. This paper proposes a novel method to address the VQA problem. In contrast to prior works, our method that targets single scene VQA, replies on graph-based techniques and involves reasoning. In a nutshell, our approach is centered on three graphs. The first graph, referred to as inference graph GI , is constructed via learning over labeled data. The other two graphs, referred to as query graph Q and entity-attribute graph GEA, are generated from natural language query Qnl and image Img, that are issued from users, respectively. As GEA often does not take sufficient information to answer Q, we develop techniques to infer missing information of GEA with GI . Based on GEA and Q, we provide techniques to find matches of Q in GEA, as the answer of Qnl in Img. Unlike commonly used VQA methods that are based on end-to-end neural networks, our graph-based method shows well-designed reasoning capability, and thus is highly interpretable. We also create a dataset on soccer match (Soccer-VQA) with rich annotations. The experimental results show that our approach outperforms the state-of-the-art method and has high potential for future investigation.
CVNov 27, 2018
A Compositional Textual Model for Recognition of Imperfect Word ImagesWei Tang, John Corring, Ying Wu et al.
Printed text recognition is an important problem for industrial OCR systems. Printed text is constructed in a standard procedural fashion in most settings. We develop a mathematical model for this process that can be applied to the backward inference problem of text recognition from an image. Through ablation experiments we show that this model is realistic and that a multi-task objective setting can help to stabilize estimation of its free parameters, enabling use of conventional deep learning methods. Furthermore, by directly modeling the geometric perturbations of text synthesis we show that our model can help recover missing characters from incomplete text regions, the bane of multicomponent OCR systems, enabling recognition even when the detection returns incomplete information.
CVJul 29, 2018
Semi-supervised Transfer Learning for Image Rain RemovalWei Wei, Deyu Meng, Qian Zhao et al.
Single image rain removal is a typical inverse problem in computer vision. The deep learning technique has been verified to be effective for this task and achieved state-of-the-art performance. However, previous deep learning methods need to pre-collect a large set of image pairs with/without synthesized rain for training, which tends to make the neural network be biased toward learning the specific patterns of the synthesized rain, while be less able to generalize to real test samples whose rain types differ from those in the training data. To this issue, this paper firstly proposes a semi-supervised learning paradigm toward this task. Different from traditional deep learning methods which only use supervised image pairs with/without synthesized rain, we further put real rainy images, without need of their clean ones, into the network training process. This is realized by elaborately formulating the residual between an input rainy image and its expected network output (clear image without rain) as a specific parametrized rain streaks distribution. The network is therefore trained to adapt real unsupervised diverse rain types through transferring from the supervised synthesized rain, and thus both the short-of-training-sample and bias-to-supervised-sample issues can be evidently alleviated. Experiments on synthetic and real data verify the superiority of our model compared to the state-of-the-arts.
CVJul 17, 2018
A Modulation Module for Multi-task Learning with Applications in Image RetrievalXiangyun Zhao, Haoxiang Li, Xiaohui Shen et al.
Multi-task learning has been widely adopted in many computer vision tasks to improve overall computation efficiency or boost the performance of individual tasks, under the assumption that those tasks are correlated and complementary to each other. However, the relationships between the tasks are complicated in practice, especially when the number of involved tasks scales up. When two tasks are of weak relevance, they may compete or even distract each other during joint training of shared parameters, and as a consequence undermine the learning of all the tasks. This will raise destructive interference which decreases learning efficiency of shared parameters and lead to low quality loss local optimum w.r.t. shared parameters. To address the this problem, we propose a general modulation module, which can be inserted into any convolutional neural network architecture, to encourage the coupling and feature sharing of relevant tasks while disentangling the learning of irrelevant tasks with minor parameters addition. Equipped with this module, gradient directions from different tasks can be enforced to be consistent for those shared parameters, which benefits multi-task joint training. The module is end-to-end learnable without ad-hoc design for specific tasks, and can naturally handle many tasks at the same time. We apply our approach on two retrieval tasks, face retrieval on the CelebA dataset [1] and product retrieval on the UT-Zappos50K dataset [2, 3], and demonstrate its advantage over other multi-task learning methods in both accuracy and storage efficiency.
LGMar 23, 2018
Fictitious GAN: Training GANs with Historical ModelsHao Ge, Yin Xia, Xu Chen et al.
Generative adversarial networks (GANs) are powerful tools for learning generative models. In practice, the training may suffer from lack of convergence. GANs are commonly viewed as a two-player zero-sum game between two neural networks. Here, we leverage this game theoretic view to study the convergence behavior of the training process. Inspired by the fictitious play learning process, a novel training method, referred to as Fictitious GAN, is introduced. Fictitious GAN trains the deep neural networks using a mixture of historical models. Specifically, the discriminator (resp. generator) is updated according to the best-response to the mixture outputs from a sequence of previously trained generators (resp. discriminators). It is shown that Fictitious GAN can effectively resolve some convergence issues that cannot be resolved by the standard training approach. It is proved that asymptotically the average of the generator outputs has the same distribution as the data samples.
CVMay 12, 2014
Cross-view Action Modeling, Learning and RecognitionJiang wang, Xiaohan Nie, Yin Xia et al.
Existing methods on video-based action recognition are generally view-dependent, i.e., performing recognition from the same views seen in the training data. We present a novel multiview spatio-temporal AND-OR graph (MST-AOG) representation for cross-view action recognition, i.e., the recognition is performed on the video from an unknown and unseen view. As a compositional model, MST-AOG compactly represents the hierarchical combinatorial structures of cross-view actions by explicitly modeling the geometry, appearance and motion variations. This paper proposes effective methods to learn the structure and parameters of MST-AOG. The inference based on MST-AOG enables action recognition from novel views. The training of MST-AOG takes advantage of the 3D human skeleton data obtained from Kinect cameras to avoid annotating enormous multi-view video frames, which is error-prone and time-consuming, but the recognition does not need 3D information and is based on 2D video input. A new Multiview Action3D dataset has been created and will be released. Extensive experiments have demonstrated that this new action representation significantly improves the accuracy and robustness for cross-view action recognition on 2D videos.
CVApr 17, 2014
Learning Fine-grained Image Similarity with Deep RankingJiang Wang, Yang song, Thomas Leung et al.
Learning fine-grained image similarity is a challenging task. It needs to capture between-class and within-class image differences. This paper proposes a deep ranking model that employs deep learning techniques to learn similarity metric directly from images.It has higher learning capability than models based on hand-crafted features. A novel multiscale network structure has been developed to describe the images effectively. An efficient triplet sampling algorithm is proposed to learn the model with distributed asynchronized stochastic gradient. Extensive experiments show that the proposed algorithm outperforms models based on hand-crafted visual features and deep classification models.
CVApr 15, 2014
Scalable Matting: A Sub-linear ApproachPhilip G. Lee, Ying Wu
Natural image matting, which separates foreground from background, is a very important intermediate step in recent computer vision algorithms. However, it is severely underconstrained and difficult to solve. State-of-the-art approaches include matting by graph Laplacian, which significantly improves the underconstrained nature by reducing the solution space. However, matting by graph Laplacian is still very difficult to solve and gets much harder as the image size grows: current iterative methods slow down as $\mathcal{O}\left(n^2 \right)$ in the resolution $n$. This creates uncomfortable practical limits on the resolution of images that we can matte. Current literature mitigates the problem, but they all remain super-linear in complexity. We expose properties of the problem that remain heretofore unexploited, demonstrating that an optimization technique originally intended to solve PDEs can be adapted to take advantage of this knowledge to solve the matting problem, not heuristically, but exactly and with sub-linear complexity. This makes ours the most efficient matting solver currently known by a very wide margin and allows matting finally to be practical and scalable in the future as consumer photos exceed many dozens of megapixels, and also relieves matting from being a bottleneck for vision algorithms that depend on it.