CVJul 3, 2024Code
A Radiometric Correction based Optical Modeling Approach to Removing Reflection Noise in TLS Point Clouds of Urban ScenesLi Fang, Tianyu Li, Yanghong Lin et al.
Point clouds are vital in computer vision tasks such as 3D reconstruction, autonomous driving, and robotics. However, TLS-acquired point clouds often contain virtual points from reflective surfaces, causing disruptions. This study presents a reflection noise elimination algorithm for TLS point clouds. Our innovative reflection plane detection algorithm, based on geometry-optical models and physical properties, identifies and categorizes reflection points per optical reflection theory. We've adapted the LSFH feature descriptor to retain reflection features, mitigating interference from symmetrical architectural structures. By incorporating the Hausdorff feature distance, the algorithm enhances resilience to ghosting and deformation, improving virtual point detection accuracy. Extensive experiments on the 3DRN benchmark dataset, featuring diverse urban environments with virtual TLS reflection noise, show our algorithm improves precision and recall rates for 3D points in reflective regions by 57.03\% and 31.80\%, respectively. Our method achieves a 9.17\% better outlier detection rate and 5.65\% higher accuracy than leading methods. Access the 3DRN dataset at (https://github.com/Tsuiky/3DRN).
OCFeb 7, 2023
Averaged Method of Multipliers for Bi-Level Optimization without Lower-Level Strong ConvexityRisheng Liu, Yaohua Liu, Wei Yao et al.
Gradient methods have become mainstream techniques for Bi-Level Optimization (BLO) in learning fields. The validity of existing works heavily rely on either a restrictive Lower-Level Strong Convexity (LLSC) condition or on solving a series of approximation subproblems with high accuracy or both. In this work, by averaging the upper and lower level objectives, we propose a single loop Bi-level Averaged Method of Multipliers (sl-BAMM) for BLO that is simple yet efficient for large-scale BLO and gets rid of the limited LLSC restriction. We further provide non-asymptotic convergence analysis of sl-BAMM towards KKT stationary points, and the comparative advantage of our analysis lies in the absence of strong gradient boundedness assumption, which is always required by others. Thus our theory safely captures a wider variety of applications in deep learning, especially where the upper-level objective is quadratic w.r.t. the lower-level variable. Experimental results demonstrate the superiority of our method.
CVJan 4, 2023
Classification of Single Tree Decay Stages from Combined Airborne LiDAR Data and CIR ImageryTsz Chung Wong, Abubakar Sani-Mohammed, Jinhong Wang et al.
Understanding forest health is of great importance for the conservation of the integrity of forest ecosystems. In this regard, evaluating the amount and quality of dead wood is of utmost interest as they are favorable indicators of biodiversity. Apparently, remote sensing-based machine learning techniques have proven to be more efficient and sustainable with unprecedented accuracy in forest inventory. This study, for the first time, automatically categorizing individual coniferous trees (Norway spruce) into five decay stages (live, declining, dead, loose bark, and clean) from combined airborne laser scanning (ALS) point clouds and color infrared (CIR) images using three different Machine Learning methods - 3D point cloud-based deep learning (KPConv), Convolutional Neural Network (CNN), and Random Forest (RF). First, CIR colorized point clouds are created by fusing the ALS point clouds and color infrared images. Then, individual tree segmentation is conducted, after which the results are further projected onto four orthogonal planes. Finally, the classification is conducted on the two datasets (3D multispectral point clouds and 2D projected images) based on the three Machine Learning algorithms. All models achieved promising results, reaching overall accuracy (OA) of up to 88.8%, 88.4% and 85.9% for KPConv, CNN and RF, respectively. The experimental results reveal that color information, 3D coordinates, and intensity of point clouds have significant impact on the promising classification performance. The performance of our models, therefore, shows the significance of machine/deep learning for individual tree decay stages classification and landscape-wide assessment of the dead wood amount and quality by using modern airborne remote sensing techniques. The proposed method can contribute as an important and reliable tool for monitoring biodiversity in forest ecosystems.
CVNov 23, 2022
One Class One Click: Quasi Scene-level Weakly Supervised Point Cloud Semantic Segmentation with Active LearningPuzuo Wang, Wei Yao, Jie Shao
Reliance on vast annotations to achieve leading performance severely restricts the practicality of large-scale point cloud semantic segmentation. For the purpose of reducing data annotation costs, effective labeling schemes are developed and contribute to attaining competitive results under weak supervision strategy. Revisiting current weak label forms, we introduce One Class One Click (OCOC), a low cost yet informative quasi scene-level label, which encapsulates point-level and scene-level annotations. An active weakly supervised framework is proposed to leverage scarce labels by involving weak supervision from global and local perspectives. Contextual constraints are imposed by an auxiliary scene classification task, respectively based on global feature embedding and point-wise prediction aggregation, which restricts the model prediction merely to OCOC labels. Furthermore, we design a context-aware pseudo labeling strategy, which effectively supplement point-level supervisory signals. Finally, an active learning scheme with a uncertainty measure - temporal output discrepancy is integrated to examine informative samples and provides guidance on sub-clouds query, which is conducive to quickly attaining desirable OCOC annotations and reduces the labeling cost to an extremely low extent. Extensive experimental analysis using three LiDAR benchmarks collected from airborne, mobile and ground platforms demonstrates that our proposed method achieves very promising results though subject to scarce labels. It considerably outperforms genuine scene-level weakly supervised methods by up to 25\% in terms of average F1 score and achieves competitive results against full supervision schemes. On terrestrial LiDAR dataset - Semantics3D, using approximately 2\textpertenthousand{} of labels, our method achieves an average F1 score of 85.2\%, which increases by 11.58\% compared to the baseline model.
CVOct 20, 2022
VideoPipe 2022 Challenge: Real-World Video Understanding for Urban Pipe InspectionYi Liu, Xuan Zhang, Ying Li et al.
Video understanding is an important problem in computer vision. Currently, the well-studied task in this research is human action recognition, where the clips are manually trimmed from the long videos, and a single class of human action is assumed for each clip. However, we may face more complicated scenarios in the industrial applications. For example, in the real-world urban pipe system, anomaly defects are fine-grained, multi-labeled, domain-relevant. To recognize them correctly, we need to understand the detailed video content. For this reason, we propose to advance research areas of video understanding, with a shift from traditional action recognition to industrial anomaly analysis. In particular, we introduce two high-quality video benchmarks, namely QV-Pipe and CCTV-Pipe, for anomaly inspection in the real-world urban pipe systems. Based on these new datasets, we will host two competitions including (1) Video Defect Classification on QV-Pipe and (2) Temporal Defect Localization on CCTV-Pipe. In this report, we describe the details of these benchmarks, the problem definitions of competition tracks, the evaluation metric, and the result summary. We expect that, this competition would bring new opportunities and challenges for video understanding in smart city and beyond. The details of our VideoPipe challenge can be found in https://videopipe.github.io.
IVApr 14, 2022
Information fusion approach for biomass estimation in a plateau mountainous forest using a synergistic system comprising UAS-based digital camera and LiDARRong Huang, Wei Yao, Zhong Xu et al.
Forest land plays a vital role in global climate, ecosystems, farming and human living environments. Therefore, forest biomass estimation methods are necessary to monitor changes in the forest structure and function, which are key data in natural resources research. Although accurate forest biomass measurements are important in forest inventory and assessments, high-density measurements that involve airborne light detection and ranging (LiDAR) at a low flight height in large mountainous areas are highly expensive. The objective of this study was to quantify the aboveground biomass (AGB) of a plateau mountainous forest reserve using a system that synergistically combines an unmanned aircraft system (UAS)-based digital aerial camera and LiDAR to leverage their complementary advantages. In this study, we utilized digital aerial photogrammetry (DAP), which has the unique advantages of speed, high spatial resolution, and low cost, to compensate for the deficiency of forestry inventory using UAS-based LiDAR that requires terrain-following flight for high-resolution data acquisition. Combined with the sparse LiDAR points acquired by using a high-altitude and high-speed UAS for terrain extraction, dense normalized DAP point clouds can be obtained to produce an accurate and high-resolution canopy height model (CHM). Based on the CHM and spectral attributes obtained from multispectral images, we estimated and mapped the AGB of the region of interest with considerable cost efficiency. Our study supports the development of predictive models for large-scale wall-to-wall AGB mapping by leveraging the complementarity between DAP and LiDAR measurements. This work also reveals the potential of utilizing a UAS-based digital camera and LiDAR synergistically in a plateau mountainous forest area.
CVApr 23, 2023
Urban GeoBIM construction by integrating semantic LiDAR point clouds with as-designed BIM modelsJie Shao, Wei Yao, Puzuo Wang et al.
Developments in three-dimensional real worlds promote the integration of geoinformation and building information models (BIM) known as GeoBIM in urban construction. Light detection and ranging (LiDAR) integrated with global navigation satellite systems can provide geo-referenced spatial information. However, constructing detailed urban GeoBIM poses challenges in terms of LiDAR data quality. BIM models designed from software are rich in geometrical information but often lack accurate geo-referenced locations. In this paper, we propose a complementary strategy that integrates LiDAR point clouds with as-designed BIM models for reconstructing urban scenes. A state-of-the-art deep learning framework and graph theory are first combined for LiDAR point cloud segmentation. A coarse-to-fine matching program is then developed to integrate object point clouds with corresponding BIM models. Results show the overall segmentation accuracy of LiDAR datasets reaches up to 90%, and average positioning accuracies of BIM models are 0.023 m for pole-like objects and 0.156 m for buildings, demonstrating the effectiveness of the method in segmentation and matching processes. This work offers a practical solution for rapid and accurate urban GeoBIM construction.
AIDec 23, 2025Code
A DeepSeek-Powered AI System for Automated Chest Radiograph Interpretation in Clinical PracticeYaowei Bai, Ruiheng Zhang, Yu Lei et al.
A global shortage of radiologists has been exacerbated by the significant volume of chest X-ray workloads, particularly in primary care. Although multimodal large language models show promise, existing evaluations predominantly rely on automated metrics or retrospective analyses, lacking rigorous prospective clinical validation. Janus-Pro-CXR (1B), a chest X-ray interpretation system based on DeepSeek Janus-Pro model, was developed and rigorously validated through a multicenter prospective trial (NCT07117266). Our system outperforms state-of-the-art X-ray report generation models in automated report generation, surpassing even larger-scale models including ChatGPT 4o (200B parameters), while demonstrating reliable detection of six clinically critical radiographic findings. Retrospective evaluation confirms significantly higher report accuracy than Janus-Pro and ChatGPT 4o. In prospective clinical deployment, AI assistance significantly improved report quality scores, reduced interpretation time by 18.3% (P < 0.001), and was preferred by a majority of experts in 54.3% of cases. Through lightweight architecture and domain-specific optimization, Janus-Pro-CXR improves diagnostic reliability and workflow efficiency, particularly in resource-constrained settings. The model architecture and implementation framework will be open-sourced to facilitate the clinical translation of AI-assisted radiology solutions.
OCMay 20, 2022
Towards Extremely Fast Bilevel Optimization with Self-governed Convergence GuaranteesRisheng Liu, Xuan Liu, Wei Yao et al.
Gradient methods have become mainstream techniques for Bi-Level Optimization (BLO) in learning and vision fields. The validity of existing works heavily relies on solving a series of approximation subproblems with extraordinarily high accuracy. Unfortunately, to achieve the approximation accuracy requires executing a large quantity of time-consuming iterations and computational burden is naturally caused. This paper is thus devoted to address this critical computational issue. In particular, we propose a single-level formulation to uniformly understand existing explicit and implicit Gradient-based BLOs (GBLOs). This together with our designed counter-example can clearly illustrate the fundamental numerical and theoretical issues of GBLOs and their naive accelerations. By introducing the dual multipliers as a new variable, we then establish Bilevel Alternating Gradient with Dual Correction (BAGDC), a general framework, which significantly accelerates different categories of existing methods by taking specific settings. A striking feature of our convergence result is that, compared to those original unaccelerated GBLO versions, the fast BAGDC admits a unified non-asymptotic convergence theory towards stationarity. A variety of numerical experiments have also been conducted to demonstrate the superiority of the proposed algorithmic framework.
CVJul 8, 2024
Test-time adaptation for geospatial point cloud semantic segmentation with distinct domain shiftsPuzuo Wang, Wei Yao, Jie Shao et al.
Domain adaptation (DA) techniques help deep learning models generalize across data shifts for point cloud semantic segmentation (PCSS). Test-time adaptation (TTA) allows direct adaptation of a pre-trained model to unlabeled data during inference stage without access to source data or additional training, avoiding privacy issues and large computational resources. We address TTA for geospatial PCSS by introducing three domain shift paradigms: photogrammetric to airborne LiDAR, airborne to mobile LiDAR, and synthetic to mobile laser scanning. We propose a TTA method that progressively updates batch normalization (BN) statistics with each testing batch. Additionally, a self-supervised learning module optimizes learnable BN affine parameters. Information maximization and reliability-constrained pseudo-labeling improve prediction confidence and supply supervisory signals. Experimental results show our method improves classification accuracy by up to 20\% mIoU, outperforming other methods. For photogrammetric (SensatUrban) to airborne (Hessigheim 3D) adaptation at the inference stage, our method achieves 59.46\% mIoU and 85.97\% OA without retraining or fine-turning.
SYApr 8
Knowledge-data fusion framework for frequency security assessment in low-inertia power systemsYurun Zhang, Wei Yao, Yutian Lan et al.
The integration of renewable energy via power electronics is transforming power grids into low-inertia systems, heightening the risks of frequency insecurity and widespread outages. Therefore, frequency security assessment (FSA) methods are urgently needed to ensure the reliable system operation. Recently, knowledge-data fusion models attempt to address the limitations of knowledge-driven (accuracy) and data-driven (generalization) FSA methods. However, current methods remain confined to shallow knowledge-data integration due to challenges in representing heterogeneous knowledge and establishing interactive mechanisms. Here, by classifing FSA domain knowledge into physics-guided and physics-constrained categories, we propose a guided learning-constrained network (GL-CN) framework, which deeply integrates domain knowledge across both network architecture and training process. In this framework, a data-driven model with dual input channels combining graph convolutional networks (GCN) and multilayer perceptrons (MLP) is proposed to extract both nodal and system-level power system features. Furthermore, guided learning enhances model generalization through data augmentation in pre-training utilizing physics-guided knowledge, while constrained network encodes physics-constrained knowledge into the network architecture and loss function to ensure physics-consistent and robust predictions. Validated on Yunnan Provincial Power Grid in China, our method reduces FSA time from days to seconds compared to traditional simulation, achieving 98% accuracy, robustness against 39.0% knowledge error, and generalization for 40%-60% renewable penetration. This provides a solid solution for mitigating blackouts caused by frequency insecurity and offers a generalizable paradigm for broader cross-domain problems.
CVMar 30Code
Industrial3D: A Terrestrial LiDAR Point Cloud Dataset and CrossParadigm Benchmark for Industrial InfrastructureChao Yin, Hongzhe Yue, Qing Han et al.
Automated semantic understanding of dense point clouds is a prerequisite for Scan-to-BIM pipelines, digital twin construction, and as-built verification--core tasks in the digital transformation of the construction industry. Yet for industrial mechanical, electrical, and plumbing (MEP) facilities, this challenge remains largely unsolved: TLS acquisitions of water treatment plants, chiller halls, and pumping stations exhibit extreme geometric ambiguity, severe occlusion, and extreme class imbalance that architectural benchmarks (e.g., S3DIS or ScanNet) cannot adequately represent. We present Industrial3D, a terrestrial LiDAR dataset comprising 612 million expertly labelled points at 6 mm resolution from 13 water treatment facilities. At 6.6x the scale of the closest comparable MEP dataset, Industrial3D provides the largest and most demanding testbed for industrial 3D scene understanding to date. We further establish the first industrial cross-paradigm benchmark, evaluating nine representative methods across fully supervised, weakly supervised, unsupervised, and foundation model settings under a unified benchmark protocol. The best supervised method achieves 55.74% mIoU, whereas zero-shot Point-SAM reaches only 15.79%--a 39.95 percentage-point gap that quantifies the unresolved domain-transfer challenge for industrial TLS data. Systematic analysis reveals that this gap originates from a dual crisis: statistical rarity (215:1 imbalance, 3.5x more severe than S3DIS) and geometric ambiguity (tail-class points share cylindrical primitives with head-class pipes) that frequency-based re-weighting alone cannot resolve. Industrial3D, along with benchmark code and pre-trained models, will be publicly available at https://github.com/pointcloudyc/Industrial3D.
SYNov 13, 2025
Optimized Design of the Generalized Bilinear Transformation for Discretizing Analog SystemsShen Chen, Yanlong Li, Jiamin Cui et al.
A common approach to digital system design involves transforming a continuous-time (s-domain) transfer function into the discrete-time (z-domain) using methods such as Euler or Tustin. These transformations are shown to be specific cases of the Generalized Bilinear Transformation (GBT), characterized by a design parameter, $α$, whose physical interpretation and optimal selection remain inadequately explored. In this paper, we propose an alternative derivation of the GBT derived by employing a new hexagonal shape to approximate the enclosed area of the error function, and we define the parameter $α$ as a shape factor. We reveal, for the first time, the physical meaning of $α$ as the backward rectangular ratio of the proposed hexagonal shape. Through domain mapping, the stable range of is rigorously established to be [0.5, 1]. Depending on the operating frequency and the chosen $α$, we observe two distinct distortion modes, i.e., the magnitude and phase distortion. We further develop an optimal design method for $α$ by minimizing a normalized magnitude or phase error objective function. The effectiveness of the proposed method is validated through the design and testing of a low-pass filter (LPF), demonstrating strong agreement between theoretical predictions and experimental results.
CVNov 29, 2023
W-HMR: Monocular Human Mesh Recovery in World Space with Weak-Supervised CalibrationWei Yao, Hongwen Zhang, Yunlian Sun et al.
Previous methods for 3D human motion recovery from monocular images often fall short due to reliance on camera coordinates, leading to inaccuracies in real-world applications. The limited availability and diversity of focal length labels further exacerbate misalignment issues in reconstructed 3D human bodies. To address these challenges, we introduce W-HMR, a weak-supervised calibration method that predicts "reasonable" focal lengths based on body distortion information, eliminating the need for precise focal length labels. Our approach enhances 2D supervision precision and recovery accuracy. Additionally, we present the OrientCorrect module, which corrects body orientation for plausible reconstructions in world space, avoiding the error accumulation associated with inaccurate camera rotation predictions. Our contributions include a novel weak-supervised camera calibration technique, an effective orientation correction module, and a decoupling strategy that significantly improves the generalizability and accuracy of human motion recovery in both camera and world coordinates. The robustness of W-HMR is validated through extensive experiments on various datasets, showcasing its superiority over existing methods. Codes and demos have been made available on the project page https://yw0208.github.io/w-hmr/.
LGOct 17, 2023
Understanding Fairness Surrogate Functions in Algorithmic FairnessWei Yao, Zhanke Zhou, Zhicong Li et al.
It has been observed that machine learning algorithms exhibit biased predictions against certain population groups. To mitigate such bias while achieving comparable accuracy, a promising approach is to introduce surrogate functions of the concerned fairness definition and solve a constrained optimization problem. However, it is intriguing in previous work that such fairness surrogate functions may yield unfair results and high instability. In this work, in order to deeply understand them, taking a widely used fairness definition--demographic parity as an example, we show that there is a surrogate-fairness gap between the fairness definition and the fairness surrogate function. Also, the theoretical analysis and experimental results about the gap motivate us that the fairness and stability will be affected by the points far from the decision boundary, which is the large margin points issue investigated in this paper. To address it, we propose the general sigmoid surrogate to simultaneously reduce both the surrogate-fairness gap and the variance, and offer a rigorous fairness and stability upper bound. Interestingly, the theory also provides insights into two important issues that deal with the large margin points as well as obtaining a more balanced dataset are beneficial to fairness and stability. Furthermore, we elaborate a novel and general algorithm called Balanced Surrogate, which iteratively reduces the gap to mitigate unfairness. Finally, we provide empirical evidence showing that our methods consistently improve fairness and stability while maintaining accuracy comparable to the baselines in three real-world datasets.
CVJan 3, 2024Code
STAF: 3D Human Mesh Recovery from Video with Spatio-Temporal Alignment FusionWei Yao, Hongwen Zhang, Yunlian Sun et al.
The recovery of 3D human mesh from monocular images has significantly been developed in recent years. However, existing models usually ignore spatial and temporal information, which might lead to mesh and image misalignment and temporal discontinuity. For this reason, we propose a novel Spatio-Temporal Alignment Fusion (STAF) model. As a video-based model, it leverages coherence clues from human motion by an attention-based Temporal Coherence Fusion Module (TCFM). As for spatial mesh-alignment evidence, we extract fine-grained local information through predicted mesh projection on the feature maps. Based on the spatial features, we further introduce a multi-stage adjacent Spatial Alignment Fusion Module (SAFM) to enhance the feature representation of the target frame. In addition to the above, we propose an Average Pooling Module (APM) to allow the model to focus on the entire input sequence rather than just the target frame. This method can remarkably improve the smoothness of recovery results from video. Extensive experiments on 3DPW, MPII3D, and H36M demonstrate the superiority of STAF. We achieve a state-of-the-art trade-off between precision and smoothness. Our code and more video results are on the project page https://yw0208.github.io/staf/
CLFeb 29, 2024Code
Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language ModelsChen Qian, Jie Zhang, Wei Yao et al.
Ensuring the trustworthiness of large language models (LLMs) is crucial. Most studies concentrate on fully pre-trained LLMs to better understand and improve LLMs' trustworthiness. In this paper, to reveal the untapped potential of pre-training, we pioneer the exploration of LLMs' trustworthiness during this period, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robustness. To begin with, we apply linear probing to LLMs. The high probing accuracy suggests that \textit{LLMs in early pre-training can already distinguish concepts in each trustworthiness dimension}. Therefore, to further uncover the hidden possibilities of pre-training, we extract steering vectors from a LLM's pre-training checkpoints to enhance the LLM's trustworthiness. Finally, inspired by~\citet{choi2023understanding} that mutual information estimation is bounded by linear probing accuracy, we also probe LLMs with mutual information to investigate the dynamics of trustworthiness during pre-training. We are the first to observe a similar two-phase phenomenon: fitting and compression~\citep{shwartz2017opening}. This research provides an initial exploration of trustworthiness modeling during LLM pre-training, seeking to unveil new insights and spur further developments in the field. We will make our code publicly accessible at \url{https://github.com/ChnQ/TracingLLM}.
CVJan 27
Resolving Primitive-Sharing Ambiguity in Long-Tailed Industrial Point Cloud Segmentation via Spatial Context ConstraintsChao Yin, Qing Han, Zhiwei Hou et al.
Industrial point cloud segmentation for Digital Twin construction faces a persistent challenge: safety-critical components such as reducers and valves are systematically misclassified. These failures stem from two compounding factors: such components are rare in training data, yet they share identical local geometry with dominant structures like pipes. This work identifies a dual crisis unique to industrial 3D data extreme class imbalance 215:1 ratio compounded by geometric ambiguity where most tail classes share cylindrical primitives with head classes. Existing frequency-based re-weighting methods address statistical imbalance but cannot resolve geometric ambiguity. We propose spatial context constraints that leverage neighborhood prediction consistency to disambiguate locally similar structures. Our approach extends the Class-Balanced (CB) Loss framework with two architecture-agnostic mechanisms: (1) Boundary-CB, an entropy-based constraint that emphasizes ambiguous boundaries, and (2) Density-CB, a density-based constraint that compensates for scan-dependent variations. Both integrate as plug-and-play modules without network modifications, requiring only loss function replacement. On the Industrial3D dataset (610M points from water treatment facilities), our method achieves 55.74% mIoU with 21.7% relative improvement on tail-class performance (29.59% vs. 24.32% baseline) while preserving head-class accuracy (88.14%). Components with primitive-sharing ambiguity show dramatic gains: reducer improves from 0% to 21.12% IoU; valve improves by 24.3% relative. This resolves geometric ambiguity without the typical head-tail trade-off, enabling reliable identification of safety-critical components for automated knowledge extraction in Digital Twin applications.
LGDec 9, 2022
AuE-IPA: An AU Engagement Based Infant Pain Assessment MethodMingze Sun, Haoxiang Wang, Wei Yao et al.
Recent studies have found that pain in infancy has a significant impact on infant development, including psychological problems, possible brain injury, and pain sensitivity in adulthood. However, due to the lack of specialists and the fact that infants are unable to express verbally their experience of pain, it is difficult to assess infant pain. Most existing infant pain assessment systems directly apply adult methods to infants ignoring the differences between infant expressions and adult expressions. Meanwhile, as the study of facial action coding system continues to advance, the use of action units (AUs) opens up new possibilities for expression recognition and pain assessment. In this paper, a novel AuE-IPA method is proposed for assessing infant pain by leveraging different engagement levels of AUs. First, different engagement levels of AUs in infant pain are revealed, by analyzing the class activation map of an end-to-end pain assessment model. The intensities of top-engaged AUs are then used in a regression model for achieving automatic infant pain assessment. The model proposed is trained and experimented on YouTube Immunization dataset, YouTube Blood Test dataset, and iCOPEVid dataset. The experimental results show that our AuE-IPA method is more applicable to infants and possesses stronger generalization ability than end-to-end assessment model and the classic PSPI metric.
CLMar 12
Beyond the Black Box: A Survey on the Theory and Mechanism of Large Language ModelsZeyu Gan, Ruifeng Ren, Wei Yao et al.
The rapid emergence of Large Language Models (LLMs) has precipitated a profound paradigm shift in Artificial Intelligence, delivering monumental engineering successes that increasingly impact modern society. However, a critical paradox persists within the current field: despite the empirical efficacy, our theoretical understanding of LLMs remains disproportionately nascent, forcing these systems to be treated largely as ``black boxes''. To address this theoretical fragmentation, this survey proposes a unified lifecycle-based taxonomy that organizes the research landscape into six distinct stages: Data Preparation, Model Preparation, Training, Alignment, Inference, and Evaluation. Within this framework, we provide a systematic review of the foundational theories and internal mechanisms driving LLM performance. Specifically, we analyze core theoretical issues such as the mathematical justification for data mixtures, the representational limits of various architectures, and the optimization dynamics of alignment algorithms. Moving beyond current best practices, we identify critical frontier challenges, including the theoretical limits of synthetic data self-improvement, the mathematical bounds of safety guarantees, and the mechanistic origins of emergent intelligence. By connecting empirical observations with rigorous scientific inquiry, this work provides a structured roadmap for transitioning LLM development from engineering heuristics toward a principled scientific discipline.
CVSep 12, 2024
Expansive Supervision for Neural Radiance FieldWeixiang Zhang, Shuzhao Xie, Shijia Ge et al.
Neural Radiance Field (NeRF) has achieved remarkable success in creating immersive media representations through its exceptional reconstruction capabilities. However, the computational demands of dense forward passes and volume rendering during training continue to challenge its real-world applications. In this paper, we introduce Expansive Supervision to reduce time and memory costs during NeRF training from the perspective of partial ray selection for supervision. Specifically, we observe that training errors exhibit a long-tail distribution correlated with image content. Based on this observation, our method selectively renders a small but crucial subset of pixels and expands their values to estimate errors across the entire area for each iteration. Compared to conventional supervision, our approach effectively bypasses redundant rendering processes, resulting in substantial reductions in both time and memory consumption. Experimental results demonstrate that integrating Expansive Supervision within existing state-of-the-art acceleration frameworks achieves 52% memory savings and 16% time savings while maintaining comparable visual quality.
OCMar 21
SPABA: A Single-Loop and Probabilistic Stochastic Bilevel Algorithm Achieving Optimal Sample ComplexityTianshu Chu, Dachuan Xu, Wei Yao et al.
While stochastic bilevel optimization methods have been extensively studied for addressing large-scale nested optimization problems in machine learning, it remains an open question whether the optimal complexity bounds for solving bilevel optimization are the same as those in single-level optimization. Our main result resolves this question: SPABA, an adaptation of the PAGE method for nonconvex optimization in (Li et al., 2021) to the bilevel setting, can achieve optimal sample complexity in both the finite-sum and expectation settings. We show the optimality of SPABA by proving that there is no gap in complexity analysis between stochastic bilevel and single-level optimization when implementing PAGE. Notably, as indicated by the results of (Dagréou et al., 2022), there might exist a gap in complexity analysis when implementing other stochastic gradient estimators, like SGD and SAGA. In addition to SPABA, we propose several other single-loop stochastic bilevel algorithms, that either match or improve the state-of-the-art sample complexity results, leveraging our convergence rate and complexity analysis. Numerical experiments demonstrate the superior practical performance of the proposed methods.
CLOct 24, 2024Code
From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical ProblemsA M Muntasir Rahman, Junyi Ye, Wei Yao et al.
Consider the math problem: "Lily received 3 cookies from her best friend yesterday and ate 5 for breakfast. Today, her friend gave her 3 more cookies. How many cookies does Lily have now?" Many large language models (LLMs) in previous research approach this problem by calculating the answer "1" using the equation "3 - 5 + 3." However, from a human perspective, we recognize the inherent flaw in this problem: Lily cannot eat 5 cookies if she initially only had 3. This discrepancy prompts a key question: Are current LLMs merely Blind Solver that apply mathematical operations without deeper reasoning, or can they function as Logical Thinker capable of identifying logical inconsistencies? To explore this question, we propose a benchmark dataset, FaultyMath, which includes faulty math problems of rich diversity: i) multiple mathematical categories, e.g., algebra, geometry, number theory, etc., ii) varying levels of difficulty, and iii) different origins of faultiness -- ranging from violations of common sense and ambiguous statements to mathematical contradictions and more. We evaluate a broad spectrum of LLMs, including open-source, closed-source, and math-specialized models, using FaultyMath across three dimensions: (i) How accurately can the models detect faulty math problems without being explicitly prompted to do so? (ii) When provided with hints -- either correct or misleading -- about the validity of the problems, to what extent do LLMs adapt to become reliable Logical Thinker? (iii) How trustworthy are the explanations generated by LLMs when they recognize a math problem as flawed? Through extensive experimentation and detailed analysis, our results demonstrate that existing LLMs largely function as Blind Solver and fall short of the reasoning capabilities required to perform as Logical Thinker.
CVApr 20
SynAgent: Generalizable Cooperative Humanoid Manipulation via Solo-to-Cooperative Agent SynergyWei Yao, Haohan Ma, Hongwen Zhang et al.
Controllable cooperative humanoid manipulation is a fundamental yet challenging problem for embodied intelligence, due to severe data scarcity, complexities in multi-agent coordination, and limited generalization across objects. In this paper, we present SynAgent, a unified framework that enables scalable and physically plausible cooperative manipulation by leveraging Solo-to-Cooperative Agent Synergy to transfer skills from single-agent human-object interaction to multi-agent human-object-human scenarios. To maintain semantic integrity during motion transfer, we introduce an interaction-preserving retargeting method based on an Interact Mesh constructed via Delaunay tetrahedralization, which faithfully maintains spatial relationships among humans and objects. Building upon this refined data, we propose a single-agent pretraining and adaptation paradigm that bootstraps synergistic collaborative behaviors from abundant single-human data through decentralized training and multi-agent PPO. Finally, we develop a trajectory-conditioned generative policy using a conditional VAE, trained via multi-teacher distillation from motion imitation priors to achieve stable and controllable object-level trajectory execution. Extensive experiments demonstrate that SynAgent significantly outperforms existing baselines in both cooperative imitation and trajectory-conditioned control, while generalizing across diverse object geometries. Codes and data will be available after publication. Project Page: http://yw0208.github.io/synagent
CVJun 20, 2023
Self-supervised Multi-task Learning Framework for Safety and Health-Oriented Connected Driving Environment Perception using Onboard CameraShaocheng Jia, Wei Yao
Cutting-edge connected vehicle (CV) technologies have drawn much attention in recent years. The real-time traffic data captured by a CV can be shared with other CVs and data centers so as to open new possibilities for solving diverse transportation problems. However, imagery captured by onboard cameras in a connected environment, are not sufficiently investigated, especially for safety and health-oriented visual perception. In this paper, a bidirectional process of image synthesis and decomposition (BPISD) approach is proposed, and thus a novel self-supervised multi-task learning framework, to simultaneously estimate depth map, atmospheric visibility, airlight, and PM2.5 mass concentration, in which depth map and visibility are considered highly associated with traffic safety, while airlight and PM2.5 mass concentration are directly correlated with human health. Both the training and testing phases of the proposed system solely require a single image as input. Due to the innovative training pipeline, the depth estimation network can manage various levels of visibility conditions and overcome inherent problems in current image-synthesis-based depth estimation, thereby generating high-quality depth maps even in low-visibility situations and further benefiting accurate estimations of visibility, airlight, and PM2.5 mass concentration. Extensive experiments on the synthesized data from the KITTI and real-world data collected in Beijing demonstrate that the proposed method can (1) achieve performance competitive in depth estimation as compared with state-of-the-art methods when taking clear images as input; (2) predict vivid depth map for images contaminated by various levels of haze; and (3) accurately estimate visibility, airlight, and PM2.5 mass concentrations. Beneficial applications can be developed based on the presented work to improve traffic safety, air quality, and public health.
CVJan 26
Cross-Domain Transfer with Self-Supervised Spectral-Spatial Modeling for Hyperspectral Image ClassificationJianshu Chao, Tianhua Lv, Qiqiong Ma et al.
Self-supervised learning has demonstrated considerable potential in hyperspectral representation, yet its application in cross-domain transfer scenarios remains under-explored. Existing methods, however, still rely on source domain annotations and are susceptible to distribution shifts, leading to degraded generalization performance in the target domain. To address this, this paper proposes a self-supervised cross-domain transfer framework that learns transferable spectral-spatial joint representations without source labels and achieves efficient adaptation under few samples in the target domain. During the self-supervised pre-training phase, a Spatial-Spectral Transformer (S2Former) module is designed. It adopts a dual-branch spatial-spectral transformer and introduces a bidirectional cross-attention mechanism to achieve spectral-spatial collaborative modeling: the spatial branch enhances structural awareness through random masking, while the spectral branch captures fine-grained differences. Both branches mutually guide each other to improve semantic consistency. We further propose a Frequency Domain Constraint (FDC) to maintain frequency-domain consistency through real Fast Fourier Transform (rFFT) and high-frequency magnitude loss, thereby enhancing the model's capability to discern fine details and boundaries. During the fine-tuning phase, we introduce a Diffusion-Aligned Fine-tuning (DAFT) distillation mechanism. This aligns semantic evolution trajectories through a teacher-student structure, enabling robust transfer learning under low-label conditions. Experimental results demonstrate stable classification performance and strong cross-domain adaptability across four hyperspectral datasets, validating the method's effectiveness under resource-constrained conditions.
LGMay 7
On the Blessing of Pre-training in Weak-to-Strong GeneralizationWei Yao, Wang Zhaoyang, Gengze Xu et al.
The paradigm of Weak-to-Strong Generalization (W2SG) suggests that a pre-trained strong model can surpass its weak supervisor, yet the decisive role of pre-training remains theoretically and empirically under-explored. In this work, we identify pre-training as the essential prerequisite for the emergence of W2SG. Theoretically, we formalize the W2SG problem within a high-dimensional single-index model framework using spiked Gaussian data, modeling pre-training as a spectral initialization step. Building upon prior impossibility results regarding the failure of learning under random initialization, we prove that W2SG is achievable when pre-training provides a geometric warm start that places the model within an "effective region" characterized by a perturbed strong-convexity geometry. Within this region, we derive a rigorous generalization bound that naturally captures the optimization dynamics: an initial performance improvement followed by a saturation bottleneck dictated by the weak supervisor's bias. Empirically, we first validate all our assumptions and theoretical insights through controlled synthetic simulations. Finally, through a massive-scale evaluation of hundreds of intermediate pre-training checkpoints from large language models, we demonstrate that W2SG is not an innate capability but emerges via a phase transition tightly coupled with the progression of pre-training.
LGJan 29, 2024
Constrained Bi-Level Optimization: Proximal Lagrangian Value function Approach and Hessian-free AlgorithmWei Yao, Chengming Yu, Shangzhi Zeng et al.
This paper presents a new approach and algorithm for solving a class of constrained Bi-Level Optimization (BLO) problems in which the lower-level problem involves constraints coupling both upper-level and lower-level variables. Such problems have recently gained significant attention due to their broad applicability in machine learning. However, conventional gradient-based methods unavoidably rely on computationally intensive calculations related to the Hessian matrix. To address this challenge, we begin by devising a smooth proximal Lagrangian value function to handle the constrained lower-level problem. Utilizing this construct, we introduce a single-level reformulation for constrained BLOs that transforms the original BLO problem into an equivalent optimization problem with smooth constraints. Enabled by this reformulation, we develop a Hessian-free gradient-based algorithm-termed proximal Lagrangian Value function-based Hessian-free Bi-level Algorithm (LV-HBA)-that is straightforward to implement in a single loop manner. Consequently, LV-HBA is especially well-suited for machine learning applications. Furthermore, we offer non-asymptotic convergence analysis for LV-HBA, eliminating the need for traditional strong convexity assumptions for the lower-level problem while also being capable of accommodating non-singleton scenarios. Empirical results substantiate the algorithm's superior practical performance.
CVJan 26
AI-enabled Satellite Edge Computing: A Single-Pixel Feature based Shallow Classification Model for Hyperspectral ImagingLi Fang, Tianyu Li, Yanghong Lin et al.
As the important component of the Earth observation system, hyperspectral imaging satellites provide high-fidelity and enriched information for the formulation of related policies due to the powerful spectral measurement capabilities. However, the transmission speed of the satellite downlink has become a major bottleneck in certain applications, such as disaster monitoring and emergency mapping, which demand a fast response ability. We propose an efficient AI-enabled Satellite Edge Computing paradigm for hyperspectral image classification, facilitating the satellites to attain autonomous decision-making. To accommodate the resource constraints of satellite platforms, the proposed method adopts a lightweight, non-deep learning framework integrated with a few-shot learning strategy. Moreover, onboard processing on satellites could be faced with sensor failure and scan pattern errors, which result in degraded image quality with bad/misaligned pixels and mixed noise. To address these challenges, we develop a novel two-stage pixel-wise label propagation scheme that utilizes only intrinsic spectral features at the single pixel level without the necessity to consider spatial structural information as requested by deep neural networks. In the first stage, initial pixel labels are obtained by propagating selected anchor labels through the constructed anchor-pixel affinity matrix. Subsequently, a top-k pruned sparse graph is generated by directly computing pixel-level similarities. In the second stage, a closed-form solution derived from the sparse graph is employed to replace iterative computations. Furthermore, we developed a rank constraint-based graph clustering algorithm to determine the anchor labels.
OCMay 16, 2024
Moreau Envelope for Nonconvex Bi-Level Optimization: A Single-loop and Hessian-free Solution StrategyRisheng Liu, Zhu Liu, Wei Yao et al.
This work focuses on addressing two major challenges in the context of large-scale nonconvex Bi-Level Optimization (BLO) problems, which are increasingly applied in machine learning due to their ability to model nested structures. These challenges involve ensuring computational efficiency and providing theoretical guarantees. While recent advances in scalable BLO algorithms have primarily relied on lower-level convexity simplification, our work specifically tackles large-scale BLO problems involving nonconvexity in both the upper and lower levels. We simultaneously address computational and theoretical challenges by introducing an innovative single-loop gradient-based algorithm, utilizing the Moreau envelope-based reformulation, and providing non-asymptotic convergence analysis for general nonconvex BLO problems. Notably, our algorithm relies solely on first-order gradient information, enhancing its practicality and efficiency, especially for large-scale BLO learning tasks. We validate our approach's effectiveness through experiments on various synthetic problems, two typical hyper-parameter learning tasks, and a real-world neural architecture search application, collectively demonstrating its superior performance.
CVMar 4, 2024
UB-FineNet: Urban Building Fine-grained Classification Network for Open-access Satellite ImagesZhiyi He, Wei Yao, Jie Shao et al.
Fine classification of city-scale buildings from satellite remote sensing imagery is a crucial research area with significant implications for urban planning, infrastructure development, and population distribution analysis. However, the task faces big challenges due to low-resolution overhead images acquired from high altitude space-borne platforms and the long-tail sample distribution of fine-grained urban building categories, leading to severe class imbalance problem. To address these issues, we propose a deep network approach to fine-grained classification of urban buildings using open-access satellite images. A Denoising Diffusion Probabilistic Model (DDPM) based super-resolution method is first introduced to enhance the spatial resolution of satellite images, which benefits from domain-adaptive knowledge distillation. Then, a new fine-grained classification network with Category Information Balancing Module (CIBM) and Contrastive Supervision (CS) technique is proposed to mitigate the problem of class imbalance and improve the classification robustness and accuracy. Experiments on Hong Kong data set with 11 fine building types revealed promising classification results with a mean Top-1 accuracy of 60.45\%, which is on par with street-view image based approaches. Extensive ablation study shows that CIBM and CS improve Top-1 accuracy by 2.6\% and 3.5\% compared to the baseline method, respectively. And both modules can be easily inserted into other classification networks and similar enhancements have been achieved. Our research contributes to the field of urban analysis by providing a practical solution for fine classification of buildings in challenging mega city scenarios solely using open-access satellite images. The proposed method can serve as a valuable tool for urban planners, aiding in the understanding of economic, industrial, and population distribution.
CVJan 26
Semi-Supervised Hyperspectral Image Classification with Edge-Aware Superpixel Label Propagation and Adaptive Pseudo-LabelingYunfei Qiu, Qiqiong Ma, Tianhua Lv et al.
Significant progress has been made in semi-supervised hyperspectral image (HSI) classification regarding feature extraction and classification performance. However, due to high annotation costs and limited sample availability, semi-supervised learning still faces challenges such as boundary label diffusion and pseudo-label instability. To address these issues, this paper proposes a novel semi-supervised hyperspectral classification framework integrating spatial prior information with a dynamic learning mechanism. First, we design an Edge-Aware Superpixel Label Propagation (EASLP) module. By integrating edge intensity penalty with neighborhood correction strategy, it mitigates label diffusion from superpixel segmentation while enhancing classification robustness in boundary regions. Second, we introduce a Dynamic History-Fused Prediction (DHP) method. By maintaining historical predictions and dynamically weighting them with current results, DHP smoothens pseudo-label fluctuations and improves temporal consistency and noise resistance. Concurrently, incorporating condifence and consistency measures, the Adaptive Tripartite Sample Categorization (ATSC) strategy implements hierarchical utilization of easy, ambiguous, and hard samples, leading to enhanced pseudo-label quality and learning efficiency. The Dynamic Reliability-Enhanced Pseudo-Label Framework (DREPL), composed of DHP and ATSC, strengthens pseudo-label stability across temporal and sample domains. Through synergizes operation with EASLP, it achieves spatio-temporal consistency optimization. Evaluations on four benchmark datasets demonstrate its capability to maintain superior classification performance.
LGDec 10, 2024
A Review of Human Emotion Synthesis Based on Generative TechnologyFei Ma, Yukan Li, Yifan Xie et al.
Human emotion synthesis is a crucial aspect of affective computing. It involves using computational methods to mimic and convey human emotions through various modalities, with the goal of enabling more natural and effective human-computer interactions. Recent advancements in generative models, such as Autoencoders, Generative Adversarial Networks, Diffusion Models, Large Language Models, and Sequence-to-Sequence Models, have significantly contributed to the development of this field. However, there is a notable lack of comprehensive reviews in this field. To address this problem, this paper aims to address this gap by providing a thorough and systematic overview of recent advancements in human emotion synthesis based on generative models. Specifically, this review will first present the review methodology, the emotion models involved, the mathematical principles of generative models, and the datasets used. Then, the review covers the application of different generative models to emotion synthesis based on a variety of modalities, including facial images, speech, and text. It also examines mainstream evaluation metrics. Additionally, the review presents some major findings and suggests future research directions, providing a comprehensive understanding of the role of generative technology in the nuanced domain of emotion synthesis.
CVDec 5, 2023
FG-MDM: Towards Zero-Shot Human Motion Generation via ChatGPT-Refined DescriptionsXu Shi, Wei Yao, Chuanchen Luo et al.
Recently, significant progress has been made in text-based motion generation, enabling the generation of diverse and high-quality human motions that conform to textual descriptions. However, generating motions beyond the distribution of original datasets remains challenging, i.e., zero-shot generation. By adopting a divide-and-conquer strategy, we propose a new framework named Fine-Grained Human Motion Diffusion Model (FG-MDM) for zero-shot human motion generation. Specifically, we first parse previous vague textual annotations into fine-grained descriptions of different body parts by leveraging a large language model. We then use these fine-grained descriptions to guide a transformer-based diffusion model, which further adopts a design of part tokens. FG-MDM can generate human motions beyond the scope of original datasets owing to descriptions that are closer to motion essence. Our experimental results demonstrate the superiority of FG-MDM over previous methods in zero-shot settings. We will release our fine-grained textual annotations for HumanML3D and KIT.
LGFeb 16, 2025
Revisiting Weak-to-Strong Generalization in Theory and Practice: Reverse KL vs. Forward KLWei Yao, Wenkai Yang, Ziqiao Wang et al.
As large language models advance toward superhuman performance, ensuring their alignment with human values and abilities grows increasingly complex. Weak-to-strong generalization offers a promising approach by leveraging predictions from weaker models to guide stronger systems, but its effectiveness could be constrained by the inherent noise and inaccuracies in these weak predictions. To address this, we propose a theoretically grounded approach that replaces forward KL divergence-whose mass-covering behavior risks overfitting to imperfect weak signals-with reverse KL divergence. Reverse KL divergence's zero-forcing effect prioritizes high-confidence predictions, effectively mitigating the influence of unreliable weak supervision. Theoretically, we extend existing bounds and derive tighter lower bounds for both forward and reverse KL divergence, establishing that reverse KL achieves at least comparable guarantees to forward KL. Notably, when a sufficiently pre-trained strong model is fine-tuned on the last linear layer, reverse KL guarantees that it outperforms its weak supervisor by the magnitude of their disagreement. Empirically, we demonstrate that reverse KL and reverse cross-entropy enable strong models to successfully outperform those trained with forward KL and standard cross-entropy across most settings, highlighting the practical advantages of these reverse losses.
CVJan 23, 2024
Unlocking the Potential: Multi-task Deep Learning for Spaceborne Quantitative Monitoring of Fugitive Methane PlumesGuoxin Si, Shiliang Fu, Wei Yao
As global warming intensifies, increased attention is being paid to monitoring fugitive methane emissions and detecting gas plumes from landfills. We have divided methane emission monitoring into three subtasks: methane concentration inversion, plume segmentation, and emission rate estimation. Traditional algorithms face certain limitations: methane concentration inversion typically employs the matched filter, which is sensitive to the global spectrum distribution and prone to significant noise. There is scant research on plume segmentation, with many studies depending on manual segmentation, which can be subjective. The estimation of methane emission rate frequently uses the IME algorithm, which necessitates meteorological measurement data. Utilizing the WENT landfill site in Hong Kong along with PRISMA hyperspectral satellite imagery, we introduce a novel deep learning-based framework for quantitative methane emission monitoring from remote sensing images that is grounded in physical simulation. We create simulated methane plumes using large eddy simulation (LES) and various concentration maps of fugitive emissions using the radiative transfer equation (RTE), while applying augmentation techniques to construct a simulated PRISMA dataset. We train a U-Net network for methane concentration inversion, a Mask R-CNN network for methane plume segmentation, and a ResNet-50 network for methane emission rate estimation. All three deep networks yield higher validation accuracy compared to traditional algorithms. Furthermore, we combine the first two subtasks and the last two subtasks to design multi-task learning models, MTL-01 and MTL-02, both of which outperform single-task models in terms of accuracy. Our research exemplifies the application of multi-task deep learning to quantitative methane monitoring and can be generalized to a wide array of methane monitoring tasks.
CLOct 16, 2025
Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM AgentsGuoqing Wang, Sunhao Dai, Guangze Ye et al.
Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are only provided at the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate two critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals, and (ii) lack of fine-grained credit assignment, where dependencies between turns are obscured, especially in long-horizon tasks. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy's probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model's own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward trajectories. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved sample efficiency.
LGAug 4, 2025
Graph Embedding in the Graph Fractional Fourier Transform DomainChangjie Sheng, Zhichao Zhang, Wei Yao
Spectral graph embedding plays a critical role in graph representation learning by generating low-dimensional vector representations from graph spectral information. However, the embedding space of traditional spectral embedding methods often exhibit limited expressiveness, failing to exhaustively capture latent structural features across alternative transform domains. To address this issue, we use the graph fractional Fourier transform to extend the existing state-of-the-art generalized frequency filtering embedding (GEFFE) into fractional domains, giving birth to the generalized fractional filtering embedding (GEFRFE), which enhances embedding informativeness via the graph fractional domain. The GEFRFE leverages graph fractional domain filtering and a nonlinear composition of eigenvector components derived from a fractionalized graph Laplacian. To dynamically determine the fractional order, two parallel strategies are introduced: search-based optimization and a ResNet18-based adaptive learning. Extensive experiments on six benchmark datasets demonstrate that the GEFRFE captures richer structural features and significantly enhance classification performance. Notably, the proposed method retains computational complexity comparable to GEFFE approaches.
GRJul 10, 2025
SD-GS: Structured Deformable 3D Gaussians for Efficient Dynamic Scene ReconstructionWei Yao, Shuzhao Xie, Letian Li et al.
Current 4D Gaussian frameworks for dynamic scene reconstruction deliver impressive visual fidelity and rendering speed, however, the inherent trade-off between storage costs and the ability to characterize complex physical motions significantly limits the practical application of these methods. To tackle these problems, we propose SD-GS, a compact and efficient dynamic Gaussian splatting framework for complex dynamic scene reconstruction, featuring two key contributions. First, we introduce a deformable anchor grid, a hierarchical and memory-efficient scene representation where each anchor point derives multiple 3D Gaussians in its local spatiotemporal region and serves as the geometric backbone of the 3D scene. Second, to enhance modeling capability for complex motions, we present a deformation-aware densification strategy that adaptively grows anchors in under-reconstructed high-dynamic regions while reducing redundancy in static areas, achieving superior visual quality with fewer anchors. Experimental results demonstrate that, compared to state-of-the-art methods, SD-GS achieves an average of 60\% reduction in model size and an average of 100\% improvement in FPS, significantly enhancing computational efficiency while maintaining or even surpassing visual quality.
LGMay 30, 2025
On the Emergence of Weak-to-Strong Generalization: A Bias-Variance PerspectiveGengze Xu, Wei Yao, Ziqiao Wang et al.
Weak-to-strong generalization (W2SG) refers to the phenomenon where a strong student model, trained on a dataset labeled by a weak teacher, ultimately outperforms the teacher on the target task. Recent studies attribute this performance gain to the prediction misfit between the student and teacher models. In this work, we theoretically investigate the emergence of W2SG through a generalized bias-variance decomposition of Bregman divergence. Specifically, we show that the expected population risk gap between the student and teacher is quantified by the expected misfit between the two models. While this aligns with previous results, our analysis removes several restrictive assumptions, most notably, the convexity of the student's hypothesis class, required in earlier works. Moreover, we show that W2SG is more likely to emerge when the student model approximates its posterior mean teacher, rather than mimicking an individual teacher. Using a concrete example, we demonstrate that if the student model size is sufficiently large, it can indeed converge to the posterior mean teacher in expectation. Our analysis also suggests that avoiding overfitting to the teacher's supervision and reducing the entropy of student's prediction further facilitate W2SG. In addition, we show that the reverse cross-entropy loss, unlike the standard forward cross-entropy, is less sensitive to the predictive uncertainty of the teacher. Finally, we empirically verify our theoretical insights and demonstrate that incorporating the reverse cross-entropy loss consistently improves student performance.
OCMay 4, 2025
Efficient Curvature-Aware Hypergradient Approximation for Bilevel OptimizationYouran Dong, Junfeng Yang, Wei Yao et al.
Bilevel optimization is a powerful tool for many machine learning problems, such as hyperparameter optimization and meta-learning. Estimating hypergradients (also known as implicit gradients) is crucial for developing gradient-based methods for bilevel optimization. In this work, we propose a computationally efficient technique for incorporating curvature information into the approximation of hypergradients and present a novel algorithmic framework based on the resulting enhanced hypergradient computation. We provide convergence rate guarantees for the proposed framework in both deterministic and stochastic scenarios, particularly showing improved computational complexity over popular gradient-based methods in the deterministic setting. This improvement in complexity arises from a careful exploitation of the hypergradient structure and the inexact Newton method. In addition to the theoretical speedup, numerical experiments demonstrate the significant practical performance benefits of incorporating curvature information.
LGFeb 3, 2025
The Capabilities and Limitations of Weak-to-Strong Generalization: Generalization and CalibrationWei Yao, Wenkai Yang, Gengze Xu et al.
Weak-to-strong generalization, where weakly supervised strong models outperform their weaker teachers, offers a promising approach to aligning superhuman models with human values. To deepen the understanding of this approach, we provide theoretical insights into its capabilities and limitations. First, in the classification setting, we establish upper and lower generalization error bounds for the strong model, identifying the primary limitations as stemming from the weak model's generalization error and the optimization objective itself. Additionally, we derive lower and upper bounds on the calibration error of the strong model. These theoretical bounds reveal two critical insights: (1) the weak model should demonstrate strong generalization performance and maintain well-calibrated predictions, and (2) the strong model's training process must strike a careful balance, as excessive optimization could undermine its generalization capability by over-relying on the weak supervision signals. Finally, in the regression setting, we extend the work of Charikar et al. (2024) to a loss function based on Kullback-Leibler (KL) divergence, offering guarantees that the strong student can outperform its weak teacher by at least the magnitude of their disagreement. We conduct sufficient experiments to validate our theory.
LGFeb 3, 2025
qNBO: quasi-Newton Meets Bilevel OptimizationSheng Fang, Yong-Jin Liu, Wei Yao et al.
Bilevel optimization, addressing challenges in hierarchical learning tasks, has gained significant interest in machine learning. The practical implementation of the gradient descent method to bilevel optimization encounters computational hurdles, notably the computation of the exact lower-level solution and the inverse Hessian of the lower-level objective. Although these two aspects are inherently connected, existing methods typically handle them separately by solving the lower-level problem and a linear system for the inverse Hessian-vector product. In this paper, we introduce a general framework to address these computational challenges in a coordinated manner. Specifically, we leverage quasi-Newton algorithms to accelerate the resolution of the lower-level problem while efficiently approximating the inverse Hessian-vector product. Furthermore, by exploiting the superlinear convergence properties of BFGS, we establish the non-asymptotic convergence analysis of the BFGS adaptation within our framework. Numerical experiments demonstrate the comparable or superior performance of the proposed algorithms in real-world learning tasks, including hyperparameter optimization, data hyper-cleaning, and few-shot meta-learning.
CVSep 5, 2025
TemporalFlowViz: Parameter-Aware Visual Analytics for Interpreting Scramjet Combustion EvolutionYifei Jia, Shiyu Cheng, Yu Dong et al.
Understanding the complex combustion dynamics within scramjet engines is critical for advancing high-speed propulsion technologies. However, the large scale and high dimensionality of simulation-generated temporal flow field data present significant challenges for visual interpretation, feature differentiation, and cross-case comparison. In this paper, we present TemporalFlowViz, a parameter-aware visual analytics workflow and system designed to support expert-driven clustering, visualization, and interpretation of temporal flow fields from scramjet combustion simulations. Our approach leverages hundreds of simulated combustion cases with varying initial conditions, each producing time-sequenced flow field images. We use pretrained Vision Transformers to extract high-dimensional embeddings from these frames, apply dimensionality reduction and density-based clustering to uncover latent combustion modes, and construct temporal trajectories in the embedding space to track the evolution of each simulation over time. To bridge the gap between latent representations and expert reasoning, domain specialists annotate representative cluster centroids with descriptive labels. These annotations are used as contextual prompts for a vision-language model, which generates natural-language summaries for individual frames and full simulation cases. The system also supports parameter-based filtering, similarity-based case retrieval, and coordinated multi-view exploration to facilitate in-depth analysis. We demonstrate the effectiveness of TemporalFlowViz through two expert-informed case studies and expert feedback, showing TemporalFlowViz enhances hypothesis generation, supports interpretable pattern discovery, and enhances knowledge discovery in large-scale scramjet combustion analysis.
LGAug 16, 2025
A Novel Unified Extended Matrix for Graph Signal Processing: Theory and ApplicationYunyan Zheng, Zhichao Zhang, Wei Yao
Graph signal processing has become an essential tool for analyzing data structured on irregular domains. While conventional graph shift operators (GSOs) are effective for certain tasks, they inherently lack flexibility in modeling dependencies between non-adjacent nodes, limiting their ability to represent complex graph structures. To address this limitation, this paper proposes the unified extended matrix (UEM) framework, which integrates the extended-adjacency matrix and the unified graph representation matrix through parametric design, so as to be able to flexibly adapt to different graph structures and reveal more graph signal information. Theoretical analysis of the UEM is conducted, demonstrating positive semi-definiteness and eigenvalue monotonicity under specific conditions. Then, we propose graph Fourier transform based on UEM (UEM-GFT), which can adaptively tune spectral properties to enhance signal processing performance. Experimental results on synthetic and real-world datasets demonstrate that the UEM-GFT outperforms existing GSO-based methods in anomaly detection tasks, achieving superior performance across varying network topologies.
CVJun 9, 2025
PhysiInter: Integrating Physical Mapping for High-Fidelity Human Interaction GenerationWei Yao, Yunlian Sun, Chang Liu et al.
Driven by advancements in motion capture and generative artificial intelligence, leveraging large-scale MoCap datasets to train generative models for synthesizing diverse, realistic human motions has become a promising research direction. However, existing motion-capture techniques and generative models often neglect physical constraints, leading to artifacts such as interpenetration, sliding, and floating. These issues are exacerbated in multi-person motion generation, where complex interactions are involved. To address these limitations, we introduce physical mapping, integrated throughout the human interaction generation pipeline. Specifically, motion imitation within a physics-based simulation environment is used to project target motions into a physically valid space. The resulting motions are adjusted to adhere to real-world physics constraints while retaining their original semantic meaning. This mapping not only improves MoCap data quality but also directly informs post-processing of generated motions. Given the unique interactivity of multi-person scenarios, we propose a tailored motion representation framework. Motion Consistency (MC) and Marker-based Interaction (MI) loss functions are introduced to improve model performance. Experiments show our method achieves impressive results in generated human motion quality, with a 3%-89% improvement in physical fidelity. Project page http://yw0208.github.io/physiinter
LGJun 3, 2025
On Weak-to-Strong Generalization and f-DivergenceWei Yao, Gengze Xu, Huayi Tang et al.
Weak-to-strong generalization (W2SG) has emerged as a promising paradigm for stimulating the capabilities of strong pre-trained models by leveraging supervision from weaker supervisors. To improve the performance of the strong model, existing methods often require additional weak models or complex procedures, leading to substantial computational and memory overhead. Motivated by the effectiveness of $f$-divergence loss in various machine learning domains, we introduce $f$-divergence as an information-theoretic loss function framework in W2SG. Our theoretical analysis reveals fundamental limitations and equivalence of different $f$-divergence losses in W2SG, supported by sample complexity bounds and information-theoretic insights. We empirically demonstrate that $f$-divergence loss, which generalizes widely-used metrics like KL divergence, effectively improves generalization and noise tolerance of the strong model in practice.
CVJun 2, 2025
HOSIG: Full-Body Human-Object-Scene Interaction Generation with Hierarchical Scene PerceptionWei Yao, Yunlian Sun, Hongwen Zhang et al.
Generating high-fidelity full-body human interactions with dynamic objects and static scenes remains a critical challenge in computer graphics and animation. Existing methods for human-object interaction often neglect scene context, leading to implausible penetrations, while human-scene interaction approaches struggle to coordinate fine-grained manipulations with long-range navigation. To address these limitations, we propose HOSIG, a novel framework for synthesizing full-body interactions through hierarchical scene perception. Our method decouples the task into three key components: 1) a scene-aware grasp pose generator that ensures collision-free whole-body postures with precise hand-object contact by integrating local geometry constraints, 2) a heuristic navigation algorithm that autonomously plans obstacle-avoiding paths in complex indoor environments via compressed 2D floor maps and dual-component spatial reasoning, and 3) a scene-guided motion diffusion model that generates trajectory-controlled, full-body motions with finger-level accuracy by incorporating spatial anchors and dual-space classifier-free guidance. Extensive experiments on the TRUMANS dataset demonstrate superior performance over state-of-the-art methods. Notably, our framework supports unlimited motion length through autoregressive generation and requires minimal manual intervention. This work bridges the critical gap between scene-aware navigation and dexterous object manipulation, advancing the frontier of embodied interaction synthesis. Codes will be available after publication. Project page: http://yw0208.github.io/hosig
OCMay 2, 2025
A Provably Convergent Plug-and-Play Framework for Stochastic Bilevel OptimizationTianshu Chu, Dachuan Xu, Wei Yao et al.
Bilevel optimization has recently attracted significant attention in machine learning due to its wide range of applications and advanced hierarchical optimization capabilities. In this paper, we propose a plug-and-play framework, named PnPBO, for developing and analyzing stochastic bilevel optimization methods. This framework integrates both modern unbiased and biased stochastic estimators into the single-loop bilevel optimization framework introduced in [9], with several improvements. In the implementation of PnPBO, all stochastic estimators for different variables can be independently incorporated, and an additional moving average technique is applied when using an unbiased estimator for the upper-level variable. In the theoretical analysis, we provide a unified convergence and complexity analysis for PnPBO, demonstrating that the adaptation of various stochastic estimators (including PAGE, ZeroSARAH, and mixed strategies) within the PnPBO framework achieves optimal sample complexity, comparable to that of single-level optimization. This resolves the open question of whether the optimal complexity bounds for solving bilevel optimization are identical to those for single-level optimization. Finally, we empirically validate our framework, demonstrating its effectiveness on several benchmark problems and confirming our theoretical findings.
CLJun 17, 2024
Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong GeneralizationWenkai Yang, Shiqi Shen, Guangyao Shen et al.
Superalignment, where humans act as weak supervisors for superhuman models, has become a crucial problem with the rapid development of Large Language Models (LLMs). Recent work has preliminarily studied this problem by using weak models to supervise strong models, and discovered that weakly supervised strong students can consistently outperform weak teachers towards the alignment target, leading to a weak-to-strong generalization phenomenon. However, we are concerned that behind such a promising phenomenon, whether there exists an issue of weak-to-strong deception, where strong models deceive weak models by exhibiting well-aligned in areas known to weak models but producing misaligned behaviors in cases weak models do not know. We take an initial step towards exploring this security issue in a specific but realistic multi-objective alignment case, where there may be some alignment targets conflicting with each other (e.g., helpfulness v.s. harmlessness). We aim to explore whether, in such cases, strong models might deliberately make mistakes in areas known to them but unknown to weak models within one alignment dimension, in exchange for a higher reward in another dimension. Through extensive experiments in both the reward modeling and preference optimization scenarios, we find: (1) The weak-to-strong deception phenomenon exists across all settings. (2) The deception intensifies as the capability gap between weak and strong models increases. (3) Bootstrapping with an intermediate model can mitigate the deception to some extent, though its effectiveness remains limited. Our work highlights the urgent need to pay more attention to the true reliability of superalignment.