Heng Yang

CV
h-index41
55papers
2,721citations
Novelty59%
AI Score61

55 Papers

CVJun 24, 2022Code
Optimal and Robust Category-level Perception: Object Pose and Shape Estimation from 2D and 3D Semantic Keypoints

Jingnan Shi, Heng Yang, Luca Carlone · mit

We consider a category-level perception problem, where one is given 2D or 3D sensor data picturing an object of a given category (e.g., a car), and has to reconstruct the 3D pose and shape of the object despite intra-class variability (i.e., different car models have different shapes). We consider an active shape model, where -- for an object category -- we are given a library of potential CAD models describing objects in that category, and we adopt a standard formulation where pose and shape are estimated from 2D or 3D keypoints via non-convex optimization. Our first contribution is to develop PACE3D* and PACE2D*, the first certifiably optimal solvers for pose and shape estimation using 3D and 2D keypoints, respectively. Both solvers rely on the design of tight (i.e., exact) semidefinite relaxations. Our second contribution is to develop outlier-robust versions of both solvers, named PACE3D# and PACE2D#. Towards this goal, we propose ROBIN, a general graph-theoretic framework to prune outliers, which uses compatibility hypergraphs to model measurements' compatibility. We show that in category-level perception problems these hypergraphs can be built from the winding orders of the keypoints (in 2D) or their convex hulls (in 3D), and many outliers can be filtered out via maximum hyperclique computation. The last contribution is an extensive experimental evaluation. Besides providing an ablation study on simulated datasets and on the PASCAL3D+ dataset, we combine our solver with a deep keypoint detector, and show that PACE3D# improves over the state of the art in vehicle pose estimation in the ApolloScape datasets, and its runtime is compatible with practical applications. We release our code at https://github.com/MIT-SPARK/PACE.

CLAug 2, 2022Code
PyABSA: A Modularized Framework for Reproducible Aspect-based Sentiment Analysis

Heng Yang, Chen Zhang, Ke Li

The advancement of aspect-based sentiment analysis (ABSA) has urged the lack of a user-friendly framework that can largely lower the difficulty of reproducing state-of-the-art ABSA performance, especially for beginners. To meet the demand, we present \our, a modularized framework built on PyTorch for reproducible ABSA. To facilitate ABSA research, PyABSA supports several ABSA subtasks, including aspect term extraction, aspect sentiment classification, and end-to-end aspect-based sentiment analysis. Concretely, PyABSA integrates 29 models and 26 datasets. With just a few lines of code, the result of a model on a specific dataset can be reproduced. With a modularized design, PyABSA can also be flexibly extended to considered models, datasets, and other related tasks. Besides, PyABSA highlights its data augmentation and annotation features, which significantly address data scarcity. All are welcome to have a try at \url{https://github.com/yangheng95/PyABSA}.

ROJun 3
X4Val: Learning Neural Surrogates for Variance-Reduced Policy Evaluation

Rachel Luo, Michael Watson, Apoorva Sharma et al.

Rigorous evaluation of learning-based robotic systems is an essential prerequisite for deployment. However, real-world test data is expensive to gather; moreover, in a typical iterative development context, data gathered from the latest policy is necessarily limited in scale. This motivates evaluation methodologies that make use of heterogeneous data sources, including simulation, historical policy logs, and data collected from related platforms or environments. While such auxiliary data are abundant and inexpensive, they are generally not directly representative of real-world outcomes -- for example, performance in simulation may differ substantially from performance in the real world -- making their principled use for high-confidence performance estimation challenging. In this paper, we introduce X4Val, a general framework for variance-reduced real-world metric estimation in the presence of non-paired, multi-domain data. X4Val embeds samples from real and auxiliary domains into a shared representation space and learns a transferable predictor of real-world metrics; this learned predictor is then incorporated into a control-variates estimator, enabling variance reduction even when paired samples are unavailable. We provide theoretical analysis and empirical evaluations on autonomous driving and real-world robot manipulation tasks, domains across which X4Val achieves up to 38.4% variance reduction and demonstrates consistent improvements over strong baselines. These results show that non-paired, heterogeneous data can be leveraged to substantially improve the sample efficiency of rigorous robotic system validation.

CVMar 22, 2023
Object Pose Estimation with Statistical Guarantees: Conformal Keypoint Detection and Geometric Uncertainty Propagation

Heng Yang, Marco Pavone

The two-stage object pose estimation paradigm first detects semantic keypoints on the image and then estimates the 6D pose by minimizing reprojection errors. Despite performing well on standard benchmarks, existing techniques offer no provable guarantees on the quality and uncertainty of the estimation. In this paper, we inject two fundamental changes, namely conformal keypoint detection and geometric uncertainty propagation, into the two-stage paradigm and propose the first pose estimator that endows an estimation with provable and computable worst-case error bounds. On one hand, conformal keypoint detection applies the statistical machinery of inductive conformal prediction to convert heuristic keypoint detections into circular or elliptical prediction sets that cover the groundtruth keypoints with a user-specified marginal probability (e.g., 90%). Geometric uncertainty propagation, on the other, propagates the geometric constraints on the keypoints to the 6D object pose, leading to a Pose UnceRtainty SEt (PURSE) that guarantees coverage of the groundtruth pose with the same probability. The PURSE, however, is a nonconvex set that does not directly lead to estimated poses and uncertainties. Therefore, we develop RANdom SAmple averaGing (RANSAG) to compute an average pose and apply semidefinite relaxation to upper bound the worst-case errors between the average pose and the groundtruth. On the LineMOD Occlusion dataset we demonstrate: (i) the PURSE covers the groundtruth with valid probabilities; (ii) the worst-case error bounds provide correct uncertainty quantification; and (iii) the average pose achieves better or similar accuracy as representative methods based on sparse keypoints.

LGSep 27, 2023
Improving Adaptive Online Learning Using Refined Discretization

Zhiyu Zhang, Heng Yang, Ashok Cutkosky et al. · harvard

We study unconstrained Online Linear Optimization with Lipschitz losses. Motivated by the pursuit of instance optimality, we propose a new algorithm that simultaneously achieves ($i$) the AdaGrad-style second order gradient adaptivity; and ($ii$) the comparator norm adaptivity also known as "parameter freeness" in the literature. In particular, - our algorithm does not employ the impractical doubling trick, and does not require an a priori estimate of the time-uniform Lipschitz constant; - the associated regret bound has the optimal $O(\sqrt{V_T})$ dependence on the gradient variance $V_T$, without the typical logarithmic multiplicative factor; - the leading constant in the regret bound is "almost" optimal. Central to these results is a continuous time approach to online learning. We first show that the aimed simultaneous adaptivity can be achieved fairly easily in a continuous time analogue of the problem, where the environment is modeled by an arbitrary continuous semimartingale. Then, our key innovation is a new discretization argument that preserves such adaptivity in the discrete time adversarial setting. This refines a non-gradient-adaptive discretization argument from (Harvey et al., 2023), both algorithmically and analytically, which could be of independent interest.

CVApr 24Code
Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset

Wenhui Huang, Songyan Zhang, Collister Chua et al.

Urban transportation systems face growing safety challenges that require scalable intelligence for emerging smart mobility infrastructures. While recent advances in foundation models and large-scale multimodal datasets have strengthened perception and reasoning in intelligent transportation systems (ITS), existing research remains largely centered on microscopic autonomous driving (AD), with limited attention to city-scale traffic analysis. In particular, open-ended safety-oriented visual question answering (VQA) and corresponding foundation models for reasoning over heterogeneous roadside camera observations remain underexplored. To address this gap, we introduce the Land Transportation Dataset (LTD), a large-scale open-source vision-language dataset for open-ended reasoning in urban traffic environments. LTD contains 11.6K high-quality VQA pairs collected from heterogeneous roadside cameras, spanning diverse road geometries, traffic participants, illumination conditions, and adverse weather. The dataset integrates three complementary tasks: fine-grained multi-object grounding, multi-image camera selection, and multi-image risk analysis, requiring joint reasoning over minimally correlated views to infer hazardous objects, contributing factors, and risky road directions. To ensure annotation fidelity, we combine multi-model vision-language generation with cross-validation and human-in-the-loop refinement. Building upon LTD, we further propose UniVLT, a transportation foundation model trained via curriculum-based knowledge transfer to unify microscopic AD reasoning and macroscopic traffic analysis within a single architecture. Extensive experiments on LTD and multiple AD benchmarks demonstrate that UniVLT achieves SOTA performance on open-ended reasoning tasks across diverse domains, while exposing limitations of existing foundation models in complex multi-view traffic scenarios.

CVJan 24, 2023
Few-shot Font Generation by Learning Style Difference and Similarity

Xiao He, Mingrui Zhu, Nannan Wang et al.

Few-shot font generation (FFG) aims to preserve the underlying global structure of the original character while generating target fonts by referring to a few samples. It has been applied to font library creation, a personalized signature, and other scenarios. Existing FFG methods explicitly disentangle content and style of reference glyphs universally or component-wisely. However, they ignore the difference between glyphs in different styles and the similarity of glyphs in the same style, which results in artifacts such as local distortions and style inconsistency. To address this issue, we propose a novel font generation approach by learning the Difference between different styles and the Similarity of the same style (DS-Font). We introduce contrastive learning to consider the positive and negative relationship between styles. Specifically, we propose a multi-layer style projector for style encoding and realize a distinctive style representation via our proposed Cluster-level Contrastive Style (CCS) loss. In addition, we design a multi-task patch discriminator, which comprehensively considers different areas of the image and ensures that each style can be distinguished independently. We conduct qualitative and quantitative evaluations comprehensively to demonstrate that our approach achieves significantly better results than state-of-the-art methods.

CVFeb 16Code
Advances in Global Solvers for 3D Vision

Zhenjun Zhao, Heng Yang, Bangyan Liao et al.

Global solvers have emerged as a powerful paradigm for 3D vision, offering certifiable solutions to nonconvex geometric optimization problems traditionally addressed by local or heuristic methods. This survey presents the first systematic review of global solvers in geometric vision, unifying the field through a comprehensive taxonomy of three core paradigms: Branch-and-Bound (BnB), Convex Relaxation (CR), and Graduated Non-Convexity (GNC). We present their theoretical foundations, algorithmic designs, and practical enhancements for robustness and scalability, examining how each addresses the fundamental nonconvexity of geometric estimation problems. Our analysis spans ten core vision tasks, from Wahba problem to bundle adjustment, revealing the optimality-robustness-scalability trade-offs that govern solver selection. We identify critical future directions: scaling algorithms while maintaining guarantees, integrating data-driven priors with certifiable optimization, establishing standardized benchmarks, and addressing societal implications for safety-critical deployment. By consolidating theoretical foundations, practical advances, and broader impacts, this survey provides a unified perspective and roadmap toward certifiable, trustworthy perception for real-world applications. A continuously-updated literature summary and companion code tutorials are available at https://github.com/ericzzj1989/Awesome-Global-Solvers-for-3D-Vision.

CLOct 6, 2022
BootAug: Boosting Text Augmentation via Hybrid Instance Filtering Framework

Heng Yang, Ke Li

Text augmentation is an effective technique for addressing the problem of insufficient data in natural language processing. However, existing text augmentation methods tend to focus on few-shot scenarios and usually perform poorly on large public datasets. Our research indicates that existing augmentation methods often generate instances with shifted feature spaces, which leads to a drop in performance on the augmented data (for example, EDA generally loses $\approx 2\%$ in aspect-based sentiment classification). To address this problem, we propose a hybrid instance-filtering framework (BootAug) based on pre-trained language models that can maintain a similar feature space with natural datasets. BootAug is transferable to existing text augmentation methods (such as synonym substitution and back translation) and significantly improves the augmentation performance by $\approx 2-3\%$ in classification accuracy. Our experimental results on three classification tasks and nine public datasets show that BootAug addresses the performance drop problem and outperforms state-of-the-art text augmentation methods. Additionally, we release the code to help improve existing augmentation methods on large datasets.

RODec 8, 2025Code
Sparse Variable Projection in Robotic Perception: Exploiting Separable Structure for Efficient Nonlinear Optimization

Alan Papalia, Nikolas Sanderson, Haoyu Han et al.

Robotic perception often requires solving large nonlinear least-squares (NLS) problems. While sparsity has been well-exploited to scale solvers, a complementary and underexploited structure is \emph{separability} -- where some variables (e.g., visual landmarks) appear linearly in the residuals and, for any estimate of the remaining variables (e.g., poses), have a closed-form solution. Variable projection (VarPro) methods are a family of techniques that exploit this structure by analytically eliminating the linear variables and presenting a reduced problem in the remaining variables that has favorable properties. However, VarPro has seen limited use in robotic perception; a major challenge arises from gauge symmetries (e.g., cost invariance to global shifts and rotations), which are common in perception and induce specific computational challenges in standard VarPro approaches. We present a VarPro scheme designed for problems with gauge symmetries that jointly exploits separability and sparsity. Our method can be applied as a one-time preprocessing step to construct a \emph{matrix-free Schur complement operator}. This operator allows efficient evaluation of costs, gradients, and Hessian-vector products of the reduced problem and readily integrates with standard iterative NLS solvers. We provide precise conditions under which our method applies, and describe extensions when these conditions are only partially met. Across synthetic and real benchmarks in SLAM, SNL, and SfM, our approach achieves up to \textbf{2$\times$--35$\times$ faster runtimes} than state-of-the-art methods while maintaining accuracy. We release an open-source C++ implementation and all datasets from our experiments.

GNJul 15, 2024
Bridging Sequence-Structure Alignment in RNA Foundation Models

Heng Yang, Renzhi Chen, Ke Li

The alignment between RNA sequences and structures in foundation models (FMs) has yet to be thoroughly investigated. Existing FMs have struggled to establish sequence-structure alignment, hindering the free flow of genomic information between RNA sequences and structures. In this study, we introduce OmniGenome, an RNA FM trained to align RNA sequences with respect to secondary structures based on structure-contextualised modelling. The alignment enables free and bidirectional mappings between sequences and structures by utilising the flexible RNA modelling paradigm that supports versatile input and output modalities, i.e., sequence and/or structure as input/output. We implement RNA design and zero-shot secondary structure prediction as case studies to evaluate the Seq2Str and Str2Seq mapping capacity of OmniGenome. Results on the EternaV2 benchmark show that OmniGenome solved 74% of puzzles, whereas existing FMs only solved up to 3% of the puzzles due to the oversight of sequence-structure alignment. We leverage four comprehensive in-silico genome modelling benchmarks to evaluate performance across a diverse set of genome downstream tasks, where the results show that OmniGenome achieves state-of-the-art performance on RNA and DNA benchmarks, even without any training on DNA genomes.

CLOct 26, 2023
InstOptima: Evolutionary Multi-objective Instruction Optimization via Large Language Model-based Instruction Operators

Heng Yang, Ke Li

Instruction-based language modeling has received significant attention in pretrained language models. However, the efficiency of instruction engineering remains low and hinders the development of instruction studies. Recent studies have focused on automating instruction generation, but they primarily aim to improve performance without considering other crucial objectives that impact instruction quality, such as instruction length and perplexity. Therefore, we propose a novel approach (i.e., InstOptima) that treats instruction generation as an evolutionary multi-objective optimization problem. In contrast to text edition-based methods, our approach utilizes a large language model (LLM) to simulate instruction operators, including mutation and crossover. Furthermore, we introduce an objective-guided mechanism for these operators, allowing the LLM to comprehend the objectives and enhance the quality of the generated instructions. Experimental results demonstrate improved fine-tuning performance and the generation of a diverse set of high-quality instructions.

LGNov 4, 2024Code
Learning Multiple Initial Solutions to Optimization Problems

Elad Sharony, Heng Yang, Tong Che et al.

Sequentially solving similar optimization problems under strict runtime constraints is essential for many applications, such as robot control, autonomous driving, and portfolio management. The performance of local optimization methods in these settings is sensitive to the initial solution: poor initialization can lead to slow convergence or suboptimal solutions. To address this challenge, we propose learning to predict \emph{multiple} diverse initial solutions given parameters that define the problem instance. We introduce two strategies for utilizing multiple initial solutions: (i) a single-optimizer approach, where the most promising initial solution is chosen using a selection function, and (ii) a multiple-optimizers approach, where several optimizers, potentially run in parallel, are each initialized with a different solution, with the best solution chosen afterward. Notably, by including a default initialization among predicted ones, the cost of the final output is guaranteed to be equal or lower than with the default initialization. We validate our method on three optimal control benchmark tasks: cart-pole, reacher, and autonomous driving, using different optimizers: DDP, MPPI, and iLQR. We find significant and consistent improvement with our method across all evaluation settings and demonstrate that it efficiently scales with the number of initial solutions required. The code is available at MISO (https://github.com/EladSharony/miso).

GNMay 20, 2025Code
OmniGenBench: A Modular Platform for Reproducible Genomic Foundation Models Benchmarking

Heng Yang, Jack Cole, Yuan Li et al.

The code of nature, embedded in DNA and RNA genomes since the origin of life, holds immense potential to impact both humans and ecosystems through genome modeling. Genomic Foundation Models (GFMs) have emerged as a transformative approach to decoding the genome. As GFMs scale up and reshape the landscape of AI-driven genomics, the field faces an urgent need for rigorous and reproducible evaluation. We present OmniGenBench, a modular benchmarking platform designed to unify the data, model, benchmarking, and interpretability layers across GFMs. OmniGenBench enables standardized, one-command evaluation of any GFM across five benchmark suites, with seamless integration of over 31 open-source models. Through automated pipelines and community-extensible features, the platform addresses critical reproducibility challenges, including data transparency, model interoperability, benchmark fragmentation, and black-box interpretability. OmniGenBench aims to serve as foundational infrastructure for reproducible genomic AI research, accelerating trustworthy discovery and collaborative innovation in the era of genome-scale modeling.

ROMay 9
BEACON: Cross-Domain Co-Training of Generative Robot Policies via Best-Effort Adaptation

Antong Zhang, Han Qi, Heng Yang

We introduce BEACON--Best-Effort Adaptation for Cross-Domain Co-Training--a theory-driven framework for training generative robot policies with abundant source demonstrations and limited target demonstrations. BEACON casts cross-domain co-training as a discrepancy-aware importance-reweighting problem, jointly learning a diffusion-based visuomotor policy and per-sample source weights that minimize an objective informed by target-domain generalization guarantees. To make best-effort adaptation practical for high-dimensional sequence policies, we develop scalable instance-level discrepancy estimators, stochastic alternating updates for policy and weights, and a multi-source extension that balances heterogeneous source domains. Across sim-to-sim, sim-to-real, and multi-source manipulation settings, BEACON improves robustness and data efficiency over target-only, fixed-ratio co-training, and feature-alignment baselines. Importantly, even without an explicit alignment objective, BEACON achieves feature alignment as an implicit result of discrepancy-aware cross-domain co-training.

CHEM-PHNov 14, 2025
Socrates-Mol: Self-Oriented Cognitive Reasoning through Autonomous Trial-and-Error with Empirical-Bayesian Screening for Molecules

Xiangru Wang, Zekun Jiang, Heng Yang et al.

Molecular property prediction is fundamental to chemical engineering applications such as solvent screening. We present Socrates-Mol, a framework that transforms language models into empirical Bayesian reasoners through context engineering, addressing cold start problems without model fine-tuning. The system implements a reflective-prediction cycle where initial outputs serve as priors, retrieved molecular cases provide evidence, and refined predictions form posteriors, extracting reusable chemical rules from sparse data. We introduce ranking tasks aligned with industrial screening priorities and employ cross-model self-consistency across five language models to reduce variance. Experiments on amine solvent LogP prediction reveal task-dependent patterns: regression achieves 72% MAE reduction and 112% R-squared improvement through self-consistency, while ranking tasks show limited gains due to systematic multi-model biases. The framework reduces deployment costs by over 70% compared to full fine-tuning, providing a scalable solution for molecular property prediction while elucidating the task-adaptive nature of self-consistency mechanisms.

CVNov 25, 2024
PreF3R: Pose-Free Feed-Forward 3D Gaussian Splatting from Variable-length Image Sequence

Zequn Chen, Jiezhi Yang, Heng Yang

We present PreF3R, Pose-Free Feed-forward 3D Reconstruction from an image sequence of variable length. Unlike previous approaches, PreF3R removes the need for camera calibration and reconstructs the 3D Gaussian field within a canonical coordinate frame directly from a sequence of unposed images, enabling efficient novel-view rendering. We leverage DUSt3R's ability for pair-wise 3D structure reconstruction, and extend it to sequential multi-view input via a spatial memory network, eliminating the need for optimization-based global alignment. Additionally, PreF3R incorporates a dense Gaussian parameter prediction head, which enables subsequent novel-view synthesis with differentiable rasterization. This allows supervising our model with the combination of photometric loss and pointmap regression loss, enhancing both photorealism and structural accuracy. Given a sequence of ordered images, PreF3R incrementally reconstructs the 3D Gaussian field at 20 FPS, therefore enabling real-time novel-view rendering. Empirical experiments demonstrate that PreF3R is an effective solution for the challenging task of pose-free feed-forward novel-view synthesis, while also exhibiting robust generalization to unseen scenes.

CVMar 12, 2024
Q-SLAM: Quadric Representations for Monocular SLAM

Chensheng Peng, Chenfeng Xu, Yue Wang et al. · berkeley

In this paper, we reimagine volumetric representations through the lens of quadrics. We posit that rigid scene components can be effectively decomposed into quadric surfaces. Leveraging this assumption, we reshape the volumetric representations with million of cubes by several quadric planes, which results in more accurate and efficient modeling of 3D scenes in SLAM contexts. First, we use the quadric assumption to rectify noisy depth estimations from RGB inputs. This step significantly improves depth estimation accuracy, and allows us to efficiently sample ray points around quadric planes instead of the entire volume space in previous NeRF-SLAM systems. Second, we introduce a novel quadric-decomposed transformer to aggregate information across quadrics. The quadric semantics are not only explicitly used for depth correction and scene decomposition, but also serve as an implicit supervision signal for the mapping network. Through rigorous experimental evaluation, our method exhibits superior performance over other approaches relying on estimated depth, and achieves comparable accuracy to methods utilizing ground truth depth on both synthetic and real-world datasets.

ROFeb 2, 2025
Strengthening Generative Robot Policies through Predictive World Modeling

Han Qi, Haocheng Yin, Aris Zhu et al.

We present generative predictive control (GPC), a learning control framework that (i) clones a generative diffusion-based policy from expert demonstrations, (ii) trains a predictive action-conditioned world model from both expert demonstrations and random explorations, and (iii) synthesizes an online planner that ranks and optimizes the action proposals from (i) by looking ahead into the future using the world model from (ii). Across a variety of robotic manipulation tasks, we demonstrate that GPC consistently outperforms behavior cloning in both state-based and vision-based settings, in simulation and in the real world.

LGDec 7, 2023
PAC-Bayes Generalization Certificates for Learned Inductive Conformal Prediction

Apoorva Sharma, Sushant Veer, Asher Hancock et al.

Inductive Conformal Prediction (ICP) provides a practical and effective approach for equipping deep learning models with uncertainty estimates in the form of set-valued predictions which are guaranteed to contain the ground truth with high probability. Despite the appeal of this coverage guarantee, these sets may not be efficient: the size and contents of the prediction sets are not directly controlled, and instead depend on the underlying model and choice of score function. To remedy this, recent work has proposed learning model and score function parameters using data to directly optimize the efficiency of the ICP prediction sets. While appealing, the generalization theory for such an approach is lacking: direct optimization of empirical efficiency may yield prediction sets that are either no longer efficient on test data, or no longer obtain the required coverage on test data. In this work, we use PAC-Bayes theory to obtain generalization bounds on both the coverage and the efficiency of set-valued predictors which can be directly optimized to maximize efficiency while satisfying a desired test coverage. In contrast to prior work, our framework allows us to utilize the entire calibration dataset to learn the parameters of the model and score function, instead of requiring a separate hold-out set for obtaining test-time coverage guarantees. We leverage these theoretical results to provide a practical algorithm for using calibration data to simultaneously fine-tune the parameters of a model and score function while guaranteeing test-time coverage and efficiency of the resulting prediction sets. We evaluate the approach on regression and classification tasks, and outperform baselines calibrated using a Hoeffding bound-based PAC guarantee on ICP, especially in the low-data regime.

LGFeb 5, 2024
Discounted Adaptive Online Learning: Towards Better Regularization

Zhiyu Zhang, David Bombara, Heng Yang · harvard

We study online learning in adversarial nonstationary environments. Since the future can be very different from the past, a critical challenge is to gracefully forget the history while new data comes in. To formalize this intuition, we revisit the discounted regret in online convex optimization, and propose an adaptive (i.e., instance optimal), FTRL-based algorithm that improves the widespread non-adaptive baseline -- gradient descent with a constant learning rate. From a practical perspective, this refines the classical idea of regularization in lifelong learning: we show that designing good regularizers can be guided by the principled theory of adaptive online optimization. Complementing this result, we also consider the (Gibbs and Candès, 2021)-style online conformal prediction problem, where the goal is to sequentially predict the uncertainty sets of a black-box machine learning model. We show that the FTRL nature of our algorithm can simplify the conventional gradient-descent-based analysis, leading to instance-dependent performance guarantees.

LGApr 23
Tempered Sequential Monte Carlo for Trajectory and Policy Optimization with Differentiable Dynamics

Heng Yang

We propose a sampling-based framework for finite-horizon trajectory and policy optimization under differentiable dynamics by casting controller design as inference. Specifically, we minimize a KL-regularized expected trajectory cost, which yields an optimal "Boltzmann-tilted" distribution over controller parameters that concentrates on low-cost solutions as temperature decreases. To sample efficiently from this sharp, potentially multimodal target, we introduce tempered sequential Monte Carlo (TSMC): an annealing scheme that adaptively reweights and resamples particles along a tempering path from a prior to the target distribution, while using Hamiltonian Monte Carlo rejuvenation to maintain diversity and exploit exact gradients obtained by differentiating through trajectory rollouts. For policy optimization, we extend TSMC via (i) a deterministic empirical approximation of the initial-state distribution and (ii) an extended-space construction that treats rollout randomness as auxiliary variables. Experiments across trajectory- and policy-optimization benchmarks show that TSMC is broadly applicable and compares favorably to state-of-the-art baselines.

CVDec 10, 2024
LoRA3D: Low-Rank Self-Calibration of 3D Geometric Foundation Models

Ziqi Lu, Heng Yang, Danfei Xu et al.

Emerging 3D geometric foundation models, such as DUSt3R, offer a promising approach for in-the-wild 3D vision tasks. However, due to the high-dimensional nature of the problem space and scarcity of high-quality 3D data, these pre-trained models still struggle to generalize to many challenging circumstances, such as limited view overlap or low lighting. To address this, we propose LoRA3D, an efficient self-calibration pipeline to $\textit{specialize}$ the pre-trained models to target scenes using their own multi-view predictions. Taking sparse RGB images as input, we leverage robust optimization techniques to refine multi-view predictions and align them into a global coordinate frame. In particular, we incorporate prediction confidence into the geometric optimization process, automatically re-weighting the confidence to better reflect point estimation accuracy. We use the calibrated confidence to generate high-quality pseudo labels for the calibrating views and use low-rank adaptation (LoRA) to fine-tune the models on the pseudo-labeled data. Our method does not require any external priors or manual labels. It completes the self-calibration process on a $\textbf{single standard GPU within just 5 minutes}$. Each low-rank adapter requires only $\textbf{18MB}$ of storage. We evaluated our method on $\textbf{more than 160 scenes}$ from the Replica, TUM and Waymo Open datasets, achieving up to $\textbf{88% performance improvement}$ on 3D reconstruction, multi-view pose estimation and novel-view rendering.

ROFeb 7, 2025
Building Rome with Convex Optimization

Haoyu Han, Heng Yang

Global bundle adjustment is made easy by depth prediction and convex optimization. We (i) propose a scaled bundle adjustment (SBA) formulation that lifts 2D keypoint measurements to 3D with learned depth, (ii) design an empirically tight convex semidfinite program (SDP) relaxation that solves SBA to certfiable global optimality, (iii) solve the SDP relaxations at extreme scale with Burer-Monteiro factorization and a CUDA-based trust-region Riemannian optimizer (dubbed XM), (iv) build a structure from motion (SfM) pipeline with XM as the optimization engine and show that XM-SfM compares favorably with existing pipelines in terms of reconstruction quality while being significantly faster, more scalable, and initialization-free.

CVOct 16, 2024
Cocoon: Robust Multi-Modal Perception with Uncertainty-Aware Sensor Fusion

Minkyoung Cho, Yulong Cao, Jiachen Sun et al.

An important paradigm in 3D object detection is the use of multiple modalities to enhance accuracy in both normal and challenging conditions, particularly for long-tail scenarios. To address this, recent studies have explored two directions of adaptive approaches: MoE-based adaptive fusion, which struggles with uncertainties arising from distinct object configurations, and late fusion for output-level adaptive fusion, which relies on separate detection pipelines and limits comprehensive understanding. In this work, we introduce Cocoon, an object- and feature-level uncertainty-aware fusion framework. The key innovation lies in uncertainty quantification for heterogeneous representations, enabling fair comparison across modalities through the introduction of a feature aligner and a learnable surrogate ground truth, termed feature impression. We also define a training objective to ensure that their relationship provides a valid metric for uncertainty quantification. Cocoon consistently outperforms existing static and adaptive methods in both normal and challenging conditions, including those with natural and artificial corruptions. Furthermore, we show the validity and efficacy of our uncertainty metric across diverse datasets.

OCFeb 1
Non-Uniform Noise-to-Signal Ratio in the REINFORCE Policy-Gradient Estimator

Haoyu Han, Heng Yang

Policy-gradient methods are widely used in reinforcement learning, yet training often becomes unstable or slows down as learning progresses. We study this phenomenon through the noise-to-signal ratio (NSR) of a policy-gradient estimator, defined as the estimator variance (noise) normalized by the squared norm of the true gradient (signal). Our main result is that, for (i) finite-horizon linear systems with Gaussian policies and linear state-feedback, and (ii) finite-horizon polynomial systems with Gaussian policies and polynomial feedback, the NSR of the REINFORCE estimator can be characterized exactly-either in closed form or via numerical moment-evaluation algorithms-without approximation. For general nonlinear dynamics and expressive policies (including neural policies), we further derive a general upper bound on the variance. These characterizations enable a direct examination of how NSR varies across policy parameters and how it evolves along optimization trajectories (e.g. SGD and Adam). Across a range of examples, we find that the NSR landscape is highly non-uniform and typically increases as the policy approaches an optimum; in some regimes it blows up, which can trigger training instability and policy collapse.

ROOct 5, 2025
Flexible Locomotion Learning with Diffusion Model Predictive Control

Runhan Huang, Haldun Balim, Heng Yang et al.

Legged locomotion demands controllers that are both robust and adaptable, while remaining compatible with task and safety considerations. However, model-free reinforcement learning (RL) methods often yield a fixed policy that can be difficult to adapt to new behaviors at test time. In contrast, Model Predictive Control (MPC) provides a natural approach to flexible behavior synthesis by incorporating different objectives and constraints directly into its optimization process. However, classical MPC relies on accurate dynamics models, which are often difficult to obtain in complex environments and typically require simplifying assumptions. We present Diffusion-MPC, which leverages a learned generative diffusion model as an approximate dynamics prior for planning, enabling flexible test-time adaptation through reward and constraint based optimization. Diffusion-MPC jointly predicts future states and actions; at each reverse step, we incorporate reward planning and impose constraint projection, yielding trajectories that satisfy task objectives while remaining within physical limits. To obtain a planning model that adapts beyond imitation pretraining, we introduce an interactive training algorithm for diffusion based planner: we execute our reward-and-constraint planner in environment, then filter and reweight the collected trajectories by their realized returns before updating the denoiser. Our design enables strong test-time adaptability, allowing the planner to adjust to new reward specifications without retraining. We validate Diffusion-MPC on real world, demonstrating strong locomotion and flexible adaptation.

ROSep 19, 2025
Compose by Focus: Scene Graph-based Atomic Skills

Han Qi, Changhe Chen, Heng Yang

A key requirement for generalist robots is compositional generalization - the ability to combine atomic skills to solve complex, long-horizon tasks. While prior work has primarily focused on synthesizing a planner that sequences pre-learned skills, robust execution of the individual skills themselves remains challenging, as visuomotor policies often fail under distribution shifts induced by scene composition. To address this, we introduce a scene graph-based representation that focuses on task-relevant objects and relations, thereby mitigating sensitivity to irrelevant variation. Building on this idea, we develop a scene-graph skill learning framework that integrates graph neural networks with diffusion-based imitation learning, and further combine "focused" scene-graph skills with a vision-language model (VLM) based task planner. Experiments in both simulation and real-world manipulation tasks demonstrate substantially higher success rates than state-of-the-art baselines, highlighting improved robustness and compositional generalization in long-horizon tasks.

LGMay 20, 2025
FedGraM: Defending Against Untargeted Attacks in Federated Learning via Embedding Gram Matrix

Di Wu, Qian Li, Heng Yang et al.

Federated Learning (FL) enables geographically distributed clients to collaboratively train machine learning models by sharing only their local models, ensuring data privacy. However, FL is vulnerable to untargeted attacks that aim to degrade the global model's performance on the underlying data distribution. Existing defense mechanisms attempt to improve FL's resilience against such attacks, but their effectiveness is limited in practical FL environments due to data heterogeneity. On the contrary, we aim to detect and remove the attacks to mitigate their impact. Generalization contribution plays a crucial role in distinguishing untargeted attacks. Our observations indicate that, with limited data, the divergence between embeddings representing different classes provides a better measure of generalization than direct accuracy. In light of this, we propose a novel robust aggregation method, FedGraM, designed to defend against untargeted attacks in FL. The server maintains an auxiliary dataset containing one sample per class to support aggregation. This dataset is fed to the local models to extract embeddings. Then, the server calculates the norm of the Gram Matrix of the embeddings for each local model. The norm serves as an indicator of each model's inter-class separation capability in the embedding space. FedGraM identifies and removes potentially malicious models by filtering out those with the largest norms, then averages the remaining local models to form the global model. We conduct extensive experiments to evaluate the performance of FedGraM. Our empirical results show that with limited data samples used to construct the auxiliary dataset, FedGraM achieves exceptional performance, outperforming state-of-the-art defense methods.

CVApr 17, 2025
SPIE: Semantic and Structural Post-Training of Image Editing Diffusion Models with AI feedback

Elior Benarous, Yilun Du, Heng Yang

This paper presents SPIE: a novel approach for semantic and structural post-training of instruction-based image editing diffusion models, addressing key challenges in alignment with user prompts and consistency with input images. We introduce an online reinforcement learning framework that aligns the diffusion model with human preferences without relying on extensive human annotations or curating a large dataset. Our method significantly improves the alignment with instructions and realism in two ways. First, SPIE captures fine nuances in the desired edit by leveraging a visual prompt, enabling detailed control over visual edits without lengthy textual prompts. Second, it achieves precise and structurally coherent modifications in complex scenes while maintaining high fidelity in instruction-irrelevant areas. This approach simplifies users' efforts to achieve highly specific edits, requiring only 5 reference images depicting a certain concept for training. Experimental results demonstrate that SPIE can perform intricate edits in complex scenes, after just 10 training steps. Finally, we showcase the versatility of our method by applying it to robotics, where targeted image edits enhance the visual realism of simulated environments, which improves their utility as proxy for real-world settings.

LGJun 3, 2024
Adapting Prediction Sets to Distribution Shifts Without Labels

Kevin Kasa, Zhiyu Zhang, Heng Yang et al.

Recently there has been a surge of interest to deploy confidence set predictions rather than point predictions in machine learning. Unfortunately, the effectiveness of such prediction sets is frequently impaired by distribution shifts in practice, and the challenge is often compounded by the lack of ground truth labels at test time. Focusing on a standard set-valued prediction framework called conformal prediction (CP), this paper studies how to improve its practical performance using only unlabeled data from the shifted test domain. This is achieved by two new methods called ECP and EACP, whose main idea is to adjust the score function in CP according to its base model's own uncertainty evaluation. Through extensive experiments on a number of large-scale datasets and neural network architectures, we show that our methods provide consistent improvement over existing baselines and nearly match the performance of fully supervised methods.

CLMay 6, 2023
The Best Defense is Attack: Repairing Semantics in Textual Adversarial Examples

Heng Yang, Ke Li

Recent studies have revealed the vulnerability of pre-trained language models to adversarial attacks. Existing adversarial defense techniques attempt to reconstruct adversarial examples within feature or text spaces. However, these methods struggle to effectively repair the semantics in adversarial examples, resulting in unsatisfactory performance and limiting their practical utility. To repair the semantics in adversarial examples, we introduce a novel approach named Reactive Perturbation Defocusing (Rapid). Rapid employs an adversarial detector to identify fake labels of adversarial examples and leverage adversarial attackers to repair the semantics in adversarial examples. Our extensive experimental results conducted on four public datasets, convincingly demonstrate the effectiveness of Rapid in various adversarial attack scenarios. To address the problem of defense performance validation in previous works, we provide a demonstration of adversarial detection and repair based on our work, which can be easily evaluated at https://tinyurl.com/22ercuf8.

CLOct 16, 2021
LSA: Modeling Aspect Sentiment Coherency via Local Sentiment Aggregation

Heng Yang, Ke Li

Aspect sentiment coherency is an intriguing yet underexplored topic in the field of aspect-based sentiment classification. This concept reflects the common pattern where adjacent aspects often share similar sentiments. Despite its prevalence, current studies have not fully recognized the potential of modeling aspect sentiment coherency, including its implications in adversarial defense. To model aspect sentiment coherency, we propose a novel local sentiment aggregation (LSA) paradigm based on constructing a differential-weighted sentiment aggregation window. We have rigorously evaluated our model through experiments, and the results affirm the proficiency of LSA in terms of aspect coherency prediction and aspect sentiment classification. For instance, it outperforms existing models and achieves state-of-the-art sentiment classification performance across five public datasets. Furthermore, we demonstrate the promising ability of LSA in ABSC adversarial defense, thanks to its sentiment coherency modeling. To encourage further exploration and application of this concept, we have made our code publicly accessible. This will provide researchers with a valuable tool to delve into sentiment coherency modeling in future research.

CVSep 7, 2021
Certifiably Optimal Outlier-Robust Geometric Perception: Semidefinite Relaxations and Scalable Global Optimization

Heng Yang, Luca Carlone

We propose the first general and scalable framework to design certifiable algorithms for robust geometric perception in the presence of outliers. Our first contribution is to show that estimation using common robust costs, such as truncated least squares (TLS), maximum consensus, Geman-McClure, Tukey's biweight, among others, can be reformulated as polynomial optimization problems (POPs). By focusing on the TLS cost, our second contribution is to exploit sparsity in the POP and propose a sparse semidefinite programming (SDP) relaxation that is much smaller than the standard Lasserre's hierarchy while preserving empirical exactness, i.e., the SDP recovers the optimizer of the nonconvex POP with an optimality certificate. Our third contribution is to solve the SDP relaxations at an unprecedented scale and accuracy by presenting STRIDE, a solver that blends global descent on the convex SDP with fast local search on the nonconvex POP. Our fourth contribution is an evaluation of the proposed framework on six geometric perception problems including single and multiple rotation averaging, point cloud and mesh registration, absolute pose estimation, and category-level object pose and shape estimation. Our experiments demonstrate that (i) our sparse SDP relaxation is empirically exact with up to 60%-90% outliers across applications; (ii) while still being far from real-time, STRIDE is up to 100 times faster than existing SDP solvers on medium-scale problems, and is the only solver that can solve large-scale SDPs with hundreds of thousands of constraints to high accuracy; (iii) STRIDE safeguards existing fast heuristics for robust estimation (e.g., RANSAC or Graduated Non-Convexity), i.e., it certifies global optimality if the heuristic estimates are optimal, or detects and allows escaping local optima when the heuristic estimates are suboptimal.

OCMay 28, 2021
An Inexact Projected Gradient Method with Rounding and Lifting by Nonlinear Programming for Solving Rank-One Semidefinite Relaxation of Polynomial Optimization

Heng Yang, Ling Liang, Luca Carlone et al.

We consider solving high-order semidefinite programming (SDP) relaxations of nonconvex polynomial optimization problems (POPs) that often admit degenerate rank-one optimal solutions. Instead of solving the SDP alone, we propose a new algorithmic framework that blends local search using the nonconvex POP into global descent using the convex SDP. In particular, we first design a globally convergent inexact projected gradient method (iPGM) for solving the SDP that serves as the backbone of our framework. We then accelerate iPGM by taking long, but safeguarded, rank-one steps generated by fast nonlinear programming algorithms. We prove that the new framework is still globally convergent for solving the SDP. To solve the iPGM subproblem of projecting a given point onto the feasible set of the SDP, we design a two-phase algorithm with phase one using a symmetric Gauss-Seidel based accelerated proximal gradient method (sGS-APG) to generate a good initial point, and phase two using a modified limited-memory BFGS (L-BFGS) method to obtain an accurate solution. We analyze the convergence for both phases and establish a novel global convergence result for the modified L-BFGS that does not require the objective function to be twice continuously differentiable. We conduct numerical experiments for solving second-order SDP relaxations arising from a diverse set of POPs. Our framework demonstrates state-of-the-art efficiency, scalability, and robustness in solving degenerate rank-one SDPs to high accuracy, even in the presence of millions of equality constraints.

CVApr 16, 2021
Optimal Pose and Shape Estimation for Category-level 3D Object Perception

Jingnan Shi, Heng Yang, Luca Carlone

We consider a category-level perception problem, where one is given 3D sensor data picturing an object of a given category (e.g. a car), and has to reconstruct the pose and shape of the object despite intra-class variability (i.e. different car models have different shapes). We consider an active shape model, where -- for an object category -- we are given a library of potential CAD models describing objects in that category, and we adopt a standard formulation where pose and shape estimation are formulated as a non-convex optimization. Our first contribution is to provide the first certifiably optimal solver for pose and shape estimation. In particular, we show that rotation estimation can be decoupled from the estimation of the object translation and shape, and we demonstrate that (i) the optimal object rotation can be computed via a tight (small-size) semidefinite relaxation, and (ii) the translation and shape parameters can be computed in closed-form given the rotation. Our second contribution is to add an outlier rejection layer to our solver, hence making it robust to a large number of misdetections. Towards this goal, we wrap our optimal solver in a robust estimation scheme based on graduated non-convexity. To further enhance robustness to outliers, we also develop the first graph-theoretic formulation to prune outliers in category-level perception, which removes outliers via convex hull and maximum clique computations; the resulting approach is robust to 70%-90% outliers. Our third contribution is an extensive experimental evaluation. Besides providing an ablation study on a simulated dataset and on the PASCAL3D+ dataset, we combine our solver with a deep-learned keypoint detector, and show that the resulting approach improves over the state of the art in vehicle pose estimation in the ApolloScape datasets.

CVMar 10, 2021
Dynamical Pose Estimation

Heng Yang, Chris Doran, Jean-Jacques Slotine

We study the problem of aligning two sets of 3D geometric primitives given known correspondences. Our first contribution is to show that this primitive alignment framework unifies five perception problems including point cloud registration, primitive (mesh) registration, category-level 3D registration, absolution pose estimation (APE), and category-level APE. Our second contribution is to propose DynAMical Pose estimation (DAMP), the first general and practical algorithm to solve primitive alignment problem by simulating rigid body dynamics arising from virtual springs and damping, where the springs span the shortest distances between corresponding primitives. We evaluate DAMP in simulated and real datasets across all five problems, and demonstrate (i) DAMP always converges to the globally optimal solution in the first three problems with 3D-3D correspondences; (ii) although DAMP sometimes converges to suboptimal solutions in the last two problems with 2D-3D correspondences, using a scheme for escaping local minima, DAMP always succeeds. Our third contribution is to demystify the surprising empirical performance of DAMP and formally prove a global convergence result in the case of point cloud registration by charactering local stability of the equilibrium points of the underlying dynamical system.

CVMar 4, 2021
Self-supervised Geometric Perception

Heng Yang, Wei Dong, Luca Carlone et al.

We present self-supervised geometric perception (SGP), the first general framework to learn a feature descriptor for correspondence matching without any ground-truth geometric model labels (e.g., camera poses, rigid transformations). Our first contribution is to formulate geometric perception as an optimization problem that jointly optimizes the feature descriptor and the geometric models given a large corpus of visual measurements (e.g., images, point clouds). Under this optimization formulation, we show that two important streams of research in vision, namely robust model fitting and deep feature learning, correspond to optimizing one block of the unknown variables while fixing the other block. This analysis naturally leads to our second contribution -- the SGP algorithm that performs alternating minimization to solve the joint optimization. SGP iteratively executes two meta-algorithms: a teacher that performs robust model fitting given learned features to generate geometric pseudo-labels, and a student that performs deep feature learning under noisy supervision of the pseudo-labels. As a third contribution, we apply SGP to two perception problems on large-scale real datasets, namely relative camera pose estimation on MegaDepth and point cloud registration on 3DMatch. We demonstrate that SGP achieves state-of-the-art performance that is on-par or superior to the supervised oracles trained using ground-truth labels.

CVNov 7, 2020
ROBIN: a Graph-Theoretic Approach to Reject Outliers in Robust Estimation using Invariants

Jingnan Shi, Heng Yang, Luca Carlone

Many estimation problems in robotics, computer vision, and learning require estimating unknown quantities in the face of outliers. Outliers are typically the result of incorrect data association or feature matching, and it is common to have problems where more than 90% of the measurements used for estimation are outliers. While current approaches for robust estimation are able to deal with moderate amounts of outliers, they fail to produce accurate estimates in the presence of many outliers. This paper develops an approach to prune outliers. First, we develop a theory of invariance that allows us to quickly check if a subset of measurements are mutually compatible without explicitly solving the estimation problem. Second, we develop a graph-theoretic framework, where measurements are modeled as vertices and mutual compatibility is captured by edges. We generalize existing results showing that the inliers form a clique in this graph and typically belong to the maximum clique. We also show that in practice the maximum k-core of the compatibility graph provides an approximation of the maximum clique, while being faster to compute in large problems. These two contributions leads to ROBIN, our approach to Reject Outliers Based on INvariants, which allows us to quickly prune outliers in generic estimation problems. We demonstrate ROBIN in four geometric perception problems and show it boosts robustness of existing solvers while running in milliseconds in large problems.

CLOct 2, 2020
Enhancing Fine-grained Sentiment Classification Exploiting Local Context Embedding

Heng Yang, Biqing Zeng

Target-oriented sentiment classification is a fine-grained task of natural language processing to analyze the sentiment polarity of the targets. To improve the performance of sentiment classification, many approaches proposed various attention mechanisms to capture the important context words of a target. However, previous approaches ignored the significant relatedness of a target's sentiment and its local context. This paper proposes a local context-aware network (LCA-Net), equipped with the local context embedding and local context prediction loss, to strengthen the model by emphasizing the sentiment information of the local context. The experimental results on three common datasets show that local context-aware network performs superior to existing approaches in extracting local context features. Besides, the local context-aware framework is easy to adapt to many models, with the potential to improve other target-level tasks.

CVJul 29, 2020
Outlier-Robust Estimation: Hardness, Minimally Tuned Algorithms, and Applications

Pasquale Antonante, Vasileios Tzoumas, Heng Yang et al.

Nonlinear estimation in robotics and vision is typically plagued with outliers due to wrong data association, or to incorrect detections from signal processing and machine learning methods. This paper introduces two unifying formulations for outlier-robust estimation, Generalized Maximum Consensus (G-MC) and Generalized Truncated Least Squares (G-TLS), and investigates fundamental limits, practical algorithms, and applications. Our first contribution is a proof that outlier-robust estimation is inapproximable: in the worst case, it is impossible to (even approximately) find the set of outliers, even with slower-than-polynomial-time algorithms (particularly, algorithms running in quasi-polynomial time). As a second contribution, we review and extend two general-purpose algorithms. The first, Adaptive Trimming (ADAPT), is combinatorial, and is suitable for G-MC; the second, Graduated Non-Convexity (GNC), is based on homotopy methods, and is suitable for G-TLS. We extend ADAPT and GNC to the case where the user does not have prior knowledge of the inlier-noise statistics (or the statistics may vary over time) and is unable to guess a reasonable threshold to separate inliers from outliers (as the one commonly used in RANSAC). We propose the first minimally tuned algorithms for outlier rejection, that dynamically decide how to separate inliers from outliers. Our third contribution is an evaluation of the proposed algorithms on robot perception problems: mesh registration, image-based object detection (shape alignment), and pose graph optimization. ADAPT and GNC execute in real-time, are deterministic, outperform RANSAC, and are robust up to 80-90% outliers. Their minimally tuned versions also compare favorably with the state of the art, even though they do not rely on a noise bound for the inliers.

OCJun 11, 2020
One Ring to Rule Them All: Certifiably Robust Geometric Perception with Outliers

Heng Yang, Luca Carlone

We propose the first general and practical framework to design certifiable algorithms for robust geometric perception in the presence of a large amount of outliers. We investigate the use of a truncated least squares (TLS) cost function, which is known to be robust to outliers, but leads to hard, nonconvex, and nonsmooth optimization problems. Our first contribution is to show that -for a broad class of geometric perception problems- TLS estimation can be reformulated as an optimization over the ring of polynomials and Lasserre's hierarchy of convex moment relaxations is empirically tight at the minimum relaxation order (i.e., certifiably obtains the global minimum of the nonconvex TLS problem). Our second contribution is to exploit the structural sparsity of the objective and constraint polynomials and leverage basis reduction to significantly reduce the size of the semidefinite program (SDP) resulting from the moment relaxation, without compromising its tightness. Our third contribution is to develop scalable dual optimality certifiers from the lens of sums-of-squares (SOS) relaxation, that can compute the suboptimality gap and possibly certify global optimality of any candidate solution (e.g., returned by fast heuristics such as RANSAC or graduated non-convexity). Our dual certifiers leverage Douglas-Rachford Splitting to solve a convex feasibility SDP. Numerical experiments across different perception problems, including single rotation averaging, shape alignment, 3D point cloud and mesh registration, and high-integrity satellite pose estimation, demonstrate the tightness of our relaxations, the correctness of the certification, and the scalability of the proposed dual certifiers to large problems, beyond the reach of current SDP solvers.

CVMay 7, 2020
A Dynamical Perspective on Point Cloud Registration

Heng Yang

We provide a dynamical perspective on the classical problem of 3D point cloud registration with correspondences. A point cloud is considered as a rigid body consisting of particles. The problem of registering two point clouds is formulated as a dynamical system, where the dynamic model point cloud translates and rotates in a viscous environment towards the static scene point cloud, under forces and torques induced by virtual springs placed between each pair of corresponding points. We first show that the potential energy of the system recovers the objective function of the maximum likelihood estimation. We then adopt Lyapunov analysis, particularly the invariant set theorem, to analyze the rigid body dynamics and show that the system globally asymptotically tends towards the set of equilibrium points, where the globally optimal registration solution lies in. We conjecture that, besides the globally optimal equilibrium point, the system has either three or infinite "spurious" equilibrium points, and these spurious equilibria are all locally unstable. The case of three spurious equilibria corresponds to generic shape of the point cloud, while the case of infinite spurious equilibria happens when the point cloud exhibits symmetry. Therefore, simulating the dynamics with random perturbations guarantees to obtain the globally optimal registration solution. Numerical experiments support our analysis and conjecture.

ROJan 21, 2020
TEASER: Fast and Certifiable Point Cloud Registration

Heng Yang, Jingnan Shi, Luca Carlone

We propose the first fast and certifiable algorithm for the registration of two sets of 3D points in the presence of large amounts of outlier correspondences. We first reformulate the registration problem using a Truncated Least Squares (TLS) cost that is insensitive to a large fraction of spurious correspondences. Then, we provide a general graph-theoretic framework to decouple scale, rotation, and translation estimation, which allows solving in cascade for the three transformations. Despite the fact that each subproblem is still non-convex and combinatorial in nature, we show that (i) TLS scale and (component-wise) translation estimation can be solved in polynomial time via adaptive voting, (ii) TLS rotation estimation can be relaxed to a semidefinite program (SDP) and the relaxation is tight, even in the presence of extreme outlier rates, and (iii) the graph-theoretic framework allows drastic pruning of outliers by finding the maximum clique. We name the resulting algorithm TEASER (Truncated least squares Estimation And SEmidefinite Relaxation). While solving large SDP relaxations is typically slow, we develop a second fast and certifiable algorithm, named TEASER++, that uses graduated non-convexity to solve the rotation subproblem and leverages Douglas-Rachford Splitting to efficiently certify global optimality. For both algorithms, we provide theoretical bounds on the estimation errors, which are the first of their kind for robust registration problems. Moreover, we test their performance on standard, object detection, and the 3DMatch benchmarks, and show that (i) both algorithms dominate the state of the art and are robust to more than 99% outliers, (ii) TEASER++ can run in milliseconds, and (iii) TEASER++ is so robust it can also solve problems without correspondences, where it largely outperforms ICP and it is more accurate than Go-ICP while being orders of magnitude faster.

CLDec 17, 2019
A Multi-task Learning Model for Chinese-oriented Aspect Polarity Classification and Aspect Term Extraction

Heng Yang, Biqing Zeng, JianHao Yang et al.

Aspect-based sentiment analysis (ABSA) task is a multi-grained task of natural language processing and consists of two subtasks: aspect term extraction (ATE) and aspect polarity classification (APC). Most of the existing work focuses on the subtask of aspect term polarity inferring and ignores the significance of aspect term extraction. Besides, the existing researches do not pay attention to the research of the Chinese-oriented ABSA task. Based on the local context focus (LCF) mechanism, this paper firstly proposes a multi-task learning model for Chinese-oriented aspect-based sentiment analysis, namely LCF-ATEPC. Compared with existing models, this model equips the capability of extracting aspect term and inferring aspect term polarity synchronously, moreover, this model is effective to analyze both Chinese and English comments simultaneously and the experiment on a multilingual mixed dataset proved its availability. By integrating the domain-adapted BERT model, the LCF-ATEPC model achieved the state-of-the-art performance of aspect term extraction and aspect polarity classification in four Chinese review datasets. Besides, the experimental results on the most commonly used SemEval-2014 task4 Restaurant and Laptop datasets outperform the state-of-the-art performance on the ATE and APC subtask.

CVNov 27, 2019
In Perfect Shape: Certifiably Optimal 3D Shape Reconstruction from 2D Landmarks

Heng Yang, Luca Carlone

We study the problem of 3D shape reconstruction from 2D landmarks extracted in a single image. We adopt the 3D deformable shape model and formulate the reconstruction as a joint optimization of the camera pose and the linear shape parameters. Our first contribution is to apply Lasserre's hierarchy of convex Sums-of-Squares (SOS) relaxations to solve the shape reconstruction problem and show that the SOS relaxation of minimum order 2 empirically solves the original non-convex problem exactly. Our second contribution is to exploit the structure of the polynomial in the objective function and find a reduced set of basis monomials for the SOS relaxation that significantly decreases the size of the resulting semidefinite program (SDP) without compromising its accuracy. These two contributions, to the best of our knowledge, lead to the first certifiably optimal solver for 3D shape reconstruction, that we name Shape*. Our third contribution is to add an outlier rejection layer to Shape* using a truncated least squares (TLS) robust cost function and leveraging graduated non-convexity to solve TLS without initialization. The result is a robust reconstruction algorithm, named Shape#, that tolerates a large amount of outlier measurements. We evaluate the performance of Shape* and Shape# in both simulated and real experiments, showing that Shape* outperforms local optimization and previous convex relaxation techniques, while Shape# achieves state-of-the-art performance and is robust against 70% outliers in the FG3DCar dataset.

CVSep 18, 2019
Graduated Non-Convexity for Robust Spatial Perception: From Non-Minimal Solvers to Global Outlier Rejection

Heng Yang, Pasquale Antonante, Vasileios Tzoumas et al.

Semidefinite Programming (SDP) and Sums-of-Squares (SOS) relaxations have led to certifiably optimal non-minimal solvers for several robotics and computer vision problems. However, most non-minimal solvers rely on least-squares formulations, and, as a result, are brittle against outliers. While a standard approach to regain robustness against outliers is to use robust cost functions, the latter typically introduce other non-convexities, preventing the use of existing non-minimal solvers. In this paper, we enable the simultaneous use of non-minimal solvers and robust estimation by providing a general-purpose approach for robust global estimation, which can be applied to any problem where a non-minimal solver is available for the outlier-free case. To this end, we leverage the Black-Rangarajan duality between robust estimation and outlier processes (which has been traditionally applied to early vision problems), and show that graduated non-convexity (GNC) can be used in conjunction with non-minimal solvers to compute robust solutions, without requiring an initial guess. Although GNC's global optimality cannot be guaranteed, we demonstrate the empirical robustness of the resulting robust non-minimal solvers in applications, including point cloud and mesh registration, pose graph optimization, and image-based object pose estimation (also called shape alignment). Our solvers are robust to 70-80% of outliers, outperform RANSAC, are more accurate than specialized local solvers, and faster than specialized global solvers. We also propose the first certifiably optimal non-minimal solver for shape alignment using SOS relaxation.

OCMay 29, 2019
A Quaternion-based Certifiably Optimal Solution to the Wahba Problem with Outliers

Heng Yang, Luca Carlone

The Wahba problem, also known as rotation search, seeks to find the best rotation to align two sets of vector observations given putative correspondences, and is a fundamental routine in many computer vision and robotics applications. This work proposes the first polynomial-time certifiably optimal approach for solving the Wahba problem when a large number of vector observations are outliers. Our first contribution is to formulate the Wahba problem using a Truncated Least Squares (TLS) cost that is insensitive to a large fraction of spurious correspondences. The second contribution is to rewrite the problem using unit quaternions and show that the TLS cost can be framed as a Quadratically-Constrained Quadratic Program (QCQP). Since the resulting optimization is still highly non-convex and hard to solve globally, our third contribution is to develop a convex Semidefinite Programming (SDP) relaxation. We show that while a naive relaxation performs poorly in general, our relaxation is tight even in the presence of large noise and outliers. We validate the proposed algorithm, named QUASAR (QUAternion-based Semidefinite relAxation for Robust alignment), in both synthetic and real datasets showing that the algorithm outperforms RANSAC, robust local optimization techniques, global outlier-removal procedures, and Branch-and-Bound methods. QUASAR is able to compute certifiably optimal solutions (i.e. the relaxation is exact) even in the case when 95% of the correspondences are outliers.

ROMar 20, 2019
A Polynomial-time Solution for Robust Registration with Extreme Outlier Rates

Heng Yang, Luca Carlone

We propose a robust approach for the registration of two sets of 3D points in the presence of a large amount of outliers. Our first contribution is to reformulate the registration problem using a Truncated Least Squares (TLS) cost that makes the estimation insensitive to a large fraction of spurious point-to-point correspondences. The second contribution is a general framework to decouple rotation, translation, and scale estimation, which allows solving in cascade for the three transformations. Since each subproblem (scale, rotation, and translation estimation) is still non-convex and combinatorial in nature, out third contribution is to show that (i) TLS scale and (component-wise) translation estimation can be solved exactly and in polynomial time via an adaptive voting scheme, (ii) TLS rotation estimation can be relaxed to a semidefinite program and the relaxation is tight in practice, even in the presence of an extreme amount of outliers. We validate the proposed algorithm, named TEASER (Truncated least squares Estimation And SEmidefinite Relaxation), in standard registration benchmarks showing that the algorithm outperforms RANSAC and robust local optimization techniques, and favorably compares with Branch-and-Bound methods, while being a polynomial-time algorithm. TEASER can tolerate up to 99% outliers and returns highly-accurate solutions.

CVJun 14, 2018
SCSP: Spectral Clustering Filter Pruning with Soft Self-adaption Manners

Huiyuan Zhuo, Xuelin Qian, Yanwei Fu et al.

Deep Convolutional Neural Networks (CNN) has achieved significant success in computer vision field. However, the high computational cost of the deep complex models prevents the deployment on edge devices with limited memory and computational resource. In this paper, we proposed a novel filter pruning for convolutional neural networks compression, namely spectral clustering filter pruning with soft self-adaption manners (SCSP). We first apply spectral clustering on filters layer by layer to explore their intrinsic connections and only count on efficient groups. By self-adaption manners, the pruning operations can be done in few epochs to let the network gradually choose meaningful groups. According to this strategy, we not only achieve model compression while keeping considerable performance, but also find a novel angle to interpret the model compression process.