Peihao Li

CV
h-index: 27
18 papers
299 citations
Novelty: 53%
AI Score: 56

18 Papers

RO · Dec 17, 2025
Large Video Planner Enables Generalizable Robot Control

Boyuan Chen, Tianyuan Zhang, Haoran Geng et al. · mit

General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. These efforts are motivated by the intuition that MLLMs' large-scale language and image pretraining can be effectively transferred to the action output modality. In this work, we explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models. Unlike static images and language, videos capture spatio-temporal sequences of states and actions in the physical world that are naturally aligned with robotic behavior. We curate an internet-scale video dataset of human activities and task demonstrations, and train, for the first time at a foundation-model scale, an open video model for generative robotics planning. The model produces zero-shot video plans for novel scenes and tasks, which we post-process to extract executable robot actions. We evaluate task-level generalization through third-party selected tasks in the wild and real-robot experiments, demonstrating successful physical execution. Together, these results show robust instruction following, strong generalization, and real-world feasibility. We release both the model and dataset to support open, reproducible video-based robot learning. Our website is available at https://www.boyuan.space/large-video-planner/.

CV · Apr 13, 2023
NeRFVS: Neural Radiance Fields for Free View Synthesis via Geometry Scaffolds

Chen Yang, Peihao Li, Zanwei Zhou et al.

We present NeRFVS, a novel neural radiance fields (NeRF) based method to enable free navigation in a room. NeRF achieves impressive performance in rendering images for novel views similar to the input views, but it struggles with novel views that differ significantly from the training views. To address this issue, we utilize holistic priors, including pseudo depth maps and view coverage information, from neural reconstruction to guide the learning of implicit neural representations of 3D indoor scenes. Concretely, an off-the-shelf neural reconstruction method is leveraged to generate a geometry scaffold. Then, two loss functions based on the holistic priors are proposed to improve the learning of NeRF: 1) A robust depth loss that can tolerate the error of the pseudo depth map to guide the geometry learning of NeRF; 2) A variance loss to regularize the variance of implicit neural representations to reduce the geometry and color ambiguity in the learning procedure. These two loss functions are modulated during NeRF optimization according to the view coverage information to reduce the negative influence brought by the view coverage imbalance. Extensive results demonstrate that our NeRFVS outperforms state-of-the-art view synthesis methods quantitatively and qualitatively on indoor scenes, achieving high-fidelity free navigation results.

CV · May 24
Fishbone: From One 3D Asset to a Million Controllable Edits

Yumeng He, Xiaoying Wang, Peihao Li et al.

Large-scale controllable 3D assets are critical for computer graphics, embodied AI, robotics, and interactive content creation, yet creating diverse 3D assets remains challenging due to the high cost of manual modeling and rigging. Shape deformation offers a natural way to generate variations from existing meshes, but existing data-driven methods often rely on sparse user inputs, while parametric editing frameworks require manually designed control structures and category-specific configurations. Inspired by natural creatures, where a central spine governs global shape and cross-sectional ribs control local variation, we introduce Fishbone, a unified rib-spine representation for general shapes that supports controllable parametric mesh deformation, reduced-space dynamics, and animation. Given an input mesh, Fishbone computes a geodesic scalar field with an adaptive heat method, extracts iso-contours as cross-sectional ribs, constructs a smooth geometry-aware spine through rib centers, and associates surface vertices with nearby rib and spine structures using Gaussian-weighted skinning. The resulting representation enables real-time and predictable deformation: ribs control local profiles such as thickness, orientation, and cross-sectional variation, while the spine controls global bending, twisting, and stretching. The same structure also supports reduced-space simulation and keyframe animation. We further construct Fishbone-136K by augmenting Hunyuan3D with rib-spine structures, and demonstrate applications in controllable 3D generation, deformation-based data augmentation for robot learning, interactive mesh editing, and agentic generation. Experiments demonstrate the effectiveness, efficiency, and versatility of the proposed framework.
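The Gaussian-weighted skinning step mentioned in the Fishbone abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `gaussian_skinning_weights`, the use of Euclidean distance to rib centers, and the bandwidth `sigma` are all assumptions made for illustration.

```python
import math

def gaussian_skinning_weights(vertex, rib_centers, sigma=0.5):
    """Normalized Gaussian weights tying one vertex to each rib center.

    Hypothetical helper: distance metric and sigma are illustrative choices,
    not values taken from the paper.
    """
    sq_dists = [
        sum((v - c) ** 2 for v, c in zip(vertex, center))
        for center in rib_centers
    ]
    raw = [math.exp(-d / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(raw)
    if total == 0.0:
        # All weights underflowed: fall back to the single nearest rib.
        nearest = sq_dists.index(min(sq_dists))
        return [1.0 if i == nearest else 0.0 for i in range(len(raw))]
    return [w / total for w in raw]

# A vertex close to the first rib is dominated by that rib.
weights = gaussian_skinning_weights((0.0, 0.0, 0.0),
                                    [(0.0, 0.0, 0.1), (0.0, 0.0, 1.0)])
```

Normalizing the weights to sum to one keeps a deformed vertex inside the span of the rib transforms that influence it, which is the usual reason for this choice in skinning schemes.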

CV · Feb 6
Condition Matters in Full-head 3D GANs

Heyuan Li, Huimin Zhang, Yuda Qiu et al.

Conditioning is crucial for stable training of full-head 3D GANs. Without any conditioning signal, the model suffers from severe mode collapse, making it impractical to train. However, a series of previous full-head 3D GANs conventionally choose the view angle as the conditioning input, which leads to a bias in the learned 3D full-head space along the conditional view direction. This is evident in the significant differences in generation quality and diversity between the conditional view and non-conditional views of the generated 3D heads, resulting in global incoherence across different head regions. In this work, we propose to use a view-invariant semantic feature as the conditioning input, thereby decoupling the generative capability of 3D heads from the viewing direction. To construct a view-invariant semantic condition for each training image, we create a novel synthesized head image dataset. We leverage FLUX.1 Kontext to extend existing high-quality frontal face datasets to a wide range of view angles. The CLIP image feature extracted from the frontal view is then used as a shared semantic condition across all views in the extended images, ensuring semantic alignment while eliminating directional bias. This also allows supervision from different views of the same subject to be consolidated under a shared semantic condition, which accelerates training and enhances the global coherence of the generated 3D heads. Moreover, as GANs often experience slower improvements in diversity once the generator learns a few modes that successfully fool the discriminator, our semantic conditioning encourages the generator to follow the true semantic distribution, thereby promoting continuous learning and diverse generation. Extensive experiments on full-head synthesis and single-view GAN inversion demonstrate that our method achieves significantly higher fidelity, diversity, and generalizability.

CV · Dec 8, 2025
ViSA: 3D-Aware Video Shading for Real-Time Upper-Body Avatar Creation

Fan Yang, Heyuan Li, Peihao Li et al.

Generating high-fidelity upper-body 3D avatars from a one-shot input image remains a significant challenge. Current 3D avatar generation methods, which rely on large reconstruction models, are fast and capable of producing stable body structures, but they often suffer from artifacts such as blurry textures and stiff, unnatural motion. In contrast, generative video models show promising performance by synthesizing photorealistic and dynamic results, but they frequently struggle with unstable behavior, including body structural errors and identity drift. To address these limitations, we propose a novel approach that combines the strengths of both paradigms. Our framework employs a 3D reconstruction model to provide robust structural and appearance priors, which in turn guide a real-time autoregressive video diffusion model for rendering. This process enables the model to synthesize high-frequency, photorealistic details and fluid dynamics in real time, effectively reducing texture blur and motion stiffness while preventing the structural inconsistencies common in video generation methods. By uniting the geometric stability of 3D reconstruction with the generative capabilities of video models, our method produces high-fidelity digital avatars with realistic appearance and dynamic, temporally coherent motion. Experiments demonstrate that our approach significantly reduces artifacts and achieves substantial improvements in visual quality over leading methods, providing a robust and efficient solution for real-time applications such as gaming and virtual reality. Project page: https://lhyfst.github.io/visa

CE · May 6
A Blockchain-as-a-Service Solution for TAFES-Compliant Verification of Fair Trade Certifications

Nadia Dahmani, Peihao Li, Ravi S. Sharma

Purpose: This study addresses the lack of trust in ethical product labels by designing a blockchain platform grounded in the TAFES principles (Transparency, Accountability, Fairness, Ethics, Safety). It aims to bridge the gap between blockchain's theoretical transparency and a responsible, real-world implementation for certification ecosystems. Design/Methodology/Approach: Using Action Design Research, we developed a proof-of-concept platform for label authentication. A hybrid architecture records critical events on an Ethereum Layer-2 network for security, while supporting evidence is stored off-chain via IPFS and linked via content identifiers. The solution was validated through a coffee supply chain scenario. Findings: The proof of concept demonstrates how a TAFES-aligned blockchain platform can support verification of label claims without requiring trust in a single intermediary by creating tamper-evident provenance records and auditable certification evidence across multiple stakeholders. The design supports low-cost, near-real-time anchoring of supply chain events while mitigating adoption barriers related to scalability, privacy, and operational viability. Originality/Value: This research contributes an integrated ethical and technical blueprint for trustworthy label authentication systems by translating TAFES into implementable design requirements and evaluation checks, and validating them through an ADR driven proof of concept. It advances prior work by moving from the question of whether blockchain can help to the question of how it should be implemented responsibly in multi stakeholder certification ecosystems.
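The hybrid pattern described in this abstract, where a compact record is anchored on-chain and the bulky evidence lives off-chain under a content identifier, can be sketched as follows. This is an illustration only: a plain list stands in for the Layer-2 ledger, and a SHA-256 hex digest stands in for an IPFS content identifier (real CIDs use a multihash encoding). The names `content_id`, `anchor_event`, and `verify_evidence` are hypothetical.

```python
import hashlib

def content_id(evidence: bytes) -> str:
    # Stand-in for an IPFS CID: a SHA-256 hex digest of the evidence bytes.
    return hashlib.sha256(evidence).hexdigest()

def anchor_event(ledger: list, event_type: str, evidence: bytes) -> dict:
    # Only the small record goes "on-chain"; the evidence itself stays off-chain.
    record = {"event": event_type, "evidence_id": content_id(evidence)}
    ledger.append(record)
    return record

def verify_evidence(record: dict, evidence: bytes) -> bool:
    # Evidence is accepted only if it hashes to the anchored identifier.
    return record["evidence_id"] == content_id(evidence)

ledger = []
rec = anchor_event(ledger, "certification_issued", b"audit report v1")
untampered = verify_evidence(rec, b"audit report v1")
tampered = verify_evidence(rec, b"audit report v2")
```

Because the identifier is derived from the content, any change to the off-chain evidence yields a different identifier, which is what makes the anchored record tamper-evident.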

CR · Jul 13, 2023
Proof of Training (PoT): Harnessing Crypto Mining Power for Distributed AI Training

Peihao Li

In the midst of the emerging trend of integrating artificial intelligence (AI) with crypto mining, we identify three major challenges that create a gap between these two fields. To bridge this gap, we introduce the proof-of-training (PoT) protocol, an approach that combines the strengths of both AI and blockchain technology. The PoT protocol utilizes the practical Byzantine fault tolerance (PBFT) consensus mechanism to synchronize global states. To evaluate the performance of the protocol design, we present an implementation of a decentralized training network (DTN) that adopts the PoT protocol. Our results indicate that the protocol exhibits considerable potential in terms of task throughput, system robustness, and network security.
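For readers unfamiliar with PBFT, the consensus mechanism named in this abstract, the standard fault-tolerance arithmetic is sketched below. It reflects the general PBFT bound (a network of n nodes tolerates f Byzantine nodes when n ≥ 3f + 1), not any detail specific to the PoT protocol; the function name `pbft_limits` is hypothetical.

```python
def pbft_limits(n_nodes: int) -> tuple:
    """Return (max_faulty, quorum) for a PBFT-style network of n_nodes."""
    if n_nodes < 1:
        raise ValueError("need at least one node")
    # Largest f satisfying n >= 3f + 1.
    max_faulty = (n_nodes - 1) // 3
    # Smallest quorum such that any two quorums overlap in more than f nodes,
    # so the overlap always contains at least one honest node.
    quorum = (n_nodes + max_faulty) // 2 + 1
    return max_faulty, quorum

four_nodes = pbft_limits(4)
seven_nodes = pbft_limits(7)
```

When n is exactly 3f + 1 the quorum reduces to the familiar 2f + 1.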

CV · Aug 18, 2025 · Code
Harnessing Group-Oriented Consistency Constraints for Semi-Supervised Semantic Segmentation in CdZnTe Semiconductors

Peihao Li, Yan Fang, Man Liu et al.

Labeling Cadmium Zinc Telluride (CdZnTe) semiconductor images is challenging due to the low-contrast defect boundaries, necessitating annotators to cross-reference multiple views. These views share a single ground truth (GT), forming a unique "many-to-one" relationship. This characteristic renders advanced semi-supervised semantic segmentation (SSS) methods suboptimal, as they are generally limited by a "one-to-one" relationship, where each image is independently associated with its GT. Such limitation may lead to error accumulation in low-contrast regions, further exacerbating confirmation bias. To address this issue, we revisit the SSS pipeline from a group-oriented perspective and propose a human-inspired solution: the Intra-group Consistency Augmentation Framework (ICAF). First, we experimentally validate the inherent consistency constraints within CdZnTe groups, establishing a group-oriented baseline using the Intra-group View Sampling (IVS). Building on this insight, we introduce the Pseudo-label Correction Network (PCN) to enhance consistency representation, which consists of two key modules. The View Augmentation Module (VAM) improves boundary details by dynamically synthesizing a boundary-aware view through the aggregation of multiple views. In the View Correction Module (VCM), this synthesized view is paired with other views for information interaction, effectively emphasizing salient regions while minimizing noise. Extensive experiments demonstrate the effectiveness of our solution for CdZnTe materials. Leveraging DeepLabV3+ with a ResNet-101 backbone as our segmentation model, we achieve a 70.6% mIoU on the CdZnTe dataset using only 2 group-annotated data (5‰). The code is available at https://github.com/pipixiapipi/ICAF.

CV · Mar 23, 2024
Gaussian in the Wild: 3D Gaussian Splatting for Unconstrained Image Collections

Dongbin Zhang, Chuming Wang, Weitao Wang et al.

Novel view synthesis from unconstrained in-the-wild images remains a meaningful but challenging task. The photometric variation and transient occluders in those unconstrained images make it difficult to reconstruct the original scene accurately. Previous approaches tackle the problem by introducing a global appearance feature in Neural Radiance Fields (NeRF). However, in the real world, the unique appearance of each tiny point in a scene is determined by its independent intrinsic material attributes and the varying environmental impacts it receives. Inspired by this fact, we propose Gaussian in the wild (GS-W), a method that uses 3D Gaussian points to reconstruct the scene and introduces separate intrinsic and dynamic appearance features for each point, capturing the unchanged scene appearance along with dynamic variation like illumination and weather. Additionally, an adaptive sampling strategy is presented to allow each Gaussian point to focus on the local and detailed information more effectively. We also reduce the impact of transient occluders using a 2D visibility map. Experiments demonstrate better reconstruction quality and details of GS-W compared to NeRF-based methods, with a faster rendering speed. Video results and code are available at https://eastbeanzhang.github.io/GS-W/.

CE · May 7
Arbitrage and the Stability of AMM Price Tracking

Peihao Li, Nadia Dahmani, Wenqi Cai

Automated market makers (AMMs) quote prices from pool state rather than from a limit order book. AMM pools often stay close to a reference price because arbitrageurs correct profitable mispricing. A large part of decentralized finance therefore relies on a simple economic premise: once the AMM price drifts away from the reference price, arbitrage incentives push it back. This paper studies when that premise is strong enough to guarantee block-scale stability. We model the gap between the reference price and the AMM price as a stochastic tracking error, treat arbitrage as the corrective input, and place blockchain execution inside the loop through fees, discrete blocks, transaction ordering, delays, and transaction failure. The detailed execution layer is reduced to the total successful correction confirmed in each block. Under a block-level correction condition, we prove geometric ergodicity of the tracking error and obtain explicit one-step bounds that connect tracking quality to liquidity and execution quality. We also show in a constant-product example how fees, fixed execution costs, and local liquidity map into the no-trade band and the optimal corrective trade. Finally, we build empirical proxies for the theorem quantities from realized block data and use them to organize reduced and mechanism-focused simulations whose comparative statics are consistent with the theory. The contribution is to turn a basic economic intuition behind decentralized finance into a quantitative stability statement together with a tractable calibration interface.
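The constant-product example mentioned in this abstract can be made concrete with a small sketch of the no-trade band and the profit-maximizing corrective trade. This is an illustrative derivation under stated assumptions, not the paper's model: the pool holds reserves `x` (asset) and `y` (numeraire) with marginal price `y / x`, a proportional fee is taken from the trade input, the arbitrageur can trade at `ref_price` elsewhere without friction, and `fixed_cost` is a flat per-trade cost in numeraire units. The function name `optimal_arbitrage` is hypothetical.

```python
import math

def optimal_arbitrage(x, y, ref_price, fee=0.003, fixed_cost=0.0):
    """Return (direction, amount_in, net_profit) for a constant-product pool."""
    g = 1.0 - fee          # fraction of the input that reaches the pool
    pool_price = y / x
    if ref_price > pool_price / g:
        # Asset is cheap in the pool: pay dy of numeraire, receive dx of asset.
        dy = (math.sqrt(ref_price * x * y * g) - y) / g
        dx = x - x * y / (y + g * dy)
        direction, amount_in, gross = "buy_asset", dy, ref_price * dx - dy
    elif ref_price < pool_price * g:
        # Asset is expensive in the pool: pay dx of asset, receive dy of numeraire.
        dx = (math.sqrt(x * y * g / ref_price) - x) / g
        dy = y - x * y / (x + g * dx)
        direction, amount_in, gross = "sell_asset", dx, dy - ref_price * dx
    else:
        # Reference price lies inside the fee-induced no-trade band.
        return ("none", 0.0, 0.0)
    if gross <= fixed_cost:
        # The fixed execution cost widens the effective no-trade band.
        return ("none", 0.0, 0.0)
    return (direction, amount_in, gross - fixed_cost)

inside_band = optimal_arbitrage(1000.0, 1000.0, 1.001)
above_band = optimal_arbitrage(1000.0, 1000.0, 1.10)
```

Under these assumptions the fee alone gives the band `[pool_price * (1 - fee), pool_price / (1 - fee)]`, and deeper reserves make the optimal trade larger for the same price gap, which is one way liquidity enters the correction size.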

AI · Oct 29, 2023
Web3 Meets AI Marketplace: Exploring Opportunities, Analyzing Challenges, and Suggesting Solutions

Peihao Li

Web3 and AI have been among the most discussed fields in recent years, with substantial hype surrounding each field's potential to transform the world as we know it. However, as the hype settles, it's evident that neither AI nor Web3 can address all challenges independently. Consequently, the intersection of AI and Web3 is gaining increased attention, emerging as a new field with the potential to address the limitations of each. In this article, we focus on the integration of Web3 and the AI marketplace, where AI services and products can be provided in a decentralized manner (DeAI). A comprehensive review is provided by summarizing the opportunities and challenges on this topic. Additionally, we offer analyses and solutions to address these challenges. We've developed a framework that lets users pay with any kind of cryptocurrency to get AI services. They can also enjoy AI services for free on our platform by simply locking up their assets temporarily in the protocol. This unique approach is a first in the industry. Before this, offering free AI services in the Web3 community wasn't possible. Our solution opens up exciting opportunities for the AI marketplace in the Web3 space to grow and be widely adopted.

CV · Mar 13, 2025
LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds

Lingteng Qiu, Xiaodong Gu, Peihao Li et al.

Animatable 3D human reconstruction from a single image is a challenging problem due to the ambiguity in decoupling geometry, appearance, and deformation. Recent advances in 3D human reconstruction mainly focus on static human modeling, and their reliance on synthetic 3D scans for training limits their generalization ability. Conversely, optimization-based video methods achieve higher fidelity but demand controlled capture conditions and computationally intensive refinement processes. Motivated by the emergence of large reconstruction models for efficient static reconstruction, we propose LHM (Large Animatable Human Reconstruction Model) to infer high-fidelity avatars represented as 3D Gaussian splatting in a feed-forward pass. Our model leverages a multimodal transformer architecture to effectively encode the human body positional features and image features with an attention mechanism, enabling detailed preservation of clothing geometry and texture. To further boost the face identity preservation and fine detail recovery, we propose a head feature pyramid encoding scheme to aggregate multi-scale features of the head regions. Extensive experiments demonstrate that our LHM generates plausible animatable humans in seconds without post-processing for face and hands, outperforming existing methods in both reconstruction accuracy and generalization ability.

CL · May 30, 2025
AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time

Junyu Zhang, Runpei Dong, Han Wang et al.

This paper presents AlphaOne (α1), a universal framework for modulating reasoning progress in large reasoning models (LRMs) at test time. α1 first introduces α moment, which represents the scaled thinking phase with a universal parameter α. Within this scaled pre-α moment phase, it dynamically schedules slow thinking transitions by modeling the insertion of reasoning transition tokens as a Bernoulli stochastic process. After the α moment, α1 deterministically terminates slow thinking with the end-of-thinking token, thereby fostering fast reasoning and efficient answer generation. This approach unifies and generalizes existing monotonic scaling methods by enabling flexible and dense slow-to-fast reasoning modulation. Extensive empirical studies on various challenging benchmarks across mathematical, coding, and scientific domains demonstrate α1's superior reasoning capability and efficiency. Project page: https://alphaone-project.github.io/
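The two-phase schedule described in the AlphaOne abstract, stochastic insertion of transition tokens before a cutoff and deterministic termination after it, can be sketched as a toy decision function. Everything concrete here is an assumption for illustration: the linear interpolation of the insertion probability, the parameters `p_start` and `p_end`, the action labels, and the function name `schedule_action` are hypothetical and are not taken from the paper.

```python
import random

def schedule_action(step, alpha_moment, rng, p_start=1.0, p_end=0.0):
    """Decide what to do at a candidate insertion point during thinking.

    step: number of thinking tokens generated so far.
    alpha_moment: token index at which slow thinking is cut off.
    """
    if step >= alpha_moment:
        # Past the cutoff: always emit the end-of-thinking token.
        return "end_thinking"
    # Before the cutoff: Bernoulli draw with a linearly interpolated probability.
    frac = step / alpha_moment
    p_insert = p_start + (p_end - p_start) * frac
    return "insert_transition" if rng.random() < p_insert else "continue"

rng = random.Random(0)
actions = [schedule_action(t, alpha_moment=100, rng=rng) for t in range(0, 120, 10)]
```

With the default decreasing schedule, transition tokens are inserted often early on and rarely near the cutoff; other schedules can be tried by changing `p_start` and `p_end`.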

CV · Dec 11, 2024
SLGaussian: Fast Language Gaussian Splatting in Sparse Views

Kangjie Chen, BingQuan Dai, Minghan Qin et al.

3D semantic field learning is crucial for applications like autonomous navigation, AR/VR, and robotics, where accurate comprehension of 3D scenes from limited viewpoints is essential. Existing methods struggle under sparse view conditions, relying on inefficient per-scene multi-view optimizations, which are impractical for many real-world tasks. To address this, we propose SLGaussian, a feed-forward method for constructing 3D semantic fields from sparse viewpoints, allowing direct inference of 3DGS-based scenes. By ensuring consistent SAM segmentations through video tracking and using low-dimensional indexing for high-dimensional CLIP features, SLGaussian efficiently embeds language information in 3D space, offering a robust solution for accurate 3D scene understanding under sparse view conditions. In experiments on two-view sparse 3D object querying and segmentation in the LERF and 3D-OVS datasets, SLGaussian outperforms existing methods in chosen IoU, Localization Accuracy, and mIoU. Moreover, our model achieves scene inference in under 30 seconds and open-vocabulary querying in just 0.011 seconds per query.

CV · Jun 16, 2025
PF-LHM: 3D Animatable Avatar Reconstruction from Pose-free Articulated Human Images

Lingteng Qiu, Peihao Li, Qi Zuo et al.

Reconstructing an animatable 3D human from casually captured images of an articulated subject without camera or human pose information is a practical yet challenging task due to view misalignment, occlusions, and the absence of structural priors. While optimization-based methods can produce high-fidelity results from monocular or multi-view videos, they require accurate pose estimation and slow iterative optimization, limiting scalability in unconstrained scenarios. Recent feed-forward approaches enable efficient single-image reconstruction but struggle to effectively leverage multiple input images to reduce ambiguity and improve reconstruction accuracy. To address these challenges, we propose PF-LHM, a large human reconstruction model that generates high-quality 3D avatars in seconds from one or multiple casually captured pose-free images. Our approach introduces an efficient Encoder-Decoder Point-Image Transformer architecture, which fuses hierarchical geometric point features and multi-view image features through multimodal attention. The fused features are decoded to recover detailed geometry and appearance, represented using 3D Gaussian splats. Extensive experiments on both real and synthetic datasets demonstrate that our method unifies single- and multi-image 3D human reconstruction, achieving high-fidelity and animatable 3D human avatars without requiring camera and human pose annotations. Code and models will be released to the public.

CL · Oct 15, 2025
GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians

Xiuyuan Chen, Tao Sun, Dexin Su et al.

Current benchmarks for AI clinician systems, often based on multiple-choice exams or manual rubrics, fail to capture the depth, robustness, and safety required for real-world clinical practice. To address this, we introduce the GAPS framework, a multidimensional paradigm for evaluating Grounding (cognitive depth), Adequacy (answer completeness), Perturbation (robustness), and Safety. Critically, we developed a fully automated, guideline-anchored pipeline to construct a GAPS-aligned benchmark end-to-end, overcoming the scalability and subjectivity limitations of prior work. Our pipeline assembles an evidence neighborhood, creates dual graph and tree representations, and automatically generates questions across G-levels. Rubrics are synthesized by a DeepResearch agent that mimics GRADE-consistent, PICO-driven evidence review in a ReAct loop. Scoring is performed by an ensemble of large language model (LLM) judges. Validation confirmed our automated questions are high-quality and align with clinician judgment. Evaluating state-of-the-art models on the benchmark revealed key failure modes: performance degrades sharply with increased reasoning depth (G-axis), models struggle with answer completeness (A-axis), and they are highly vulnerable to adversarial perturbations (P-axis) as well as certain safety issues (S-axis). This automated, clinically-grounded approach provides a reproducible and scalable method for rigorously evaluating AI clinician systems and guiding their development toward safer, more reliable clinical practice.

CV · Jun 9, 2025
NOVA3D: Normal Aligned Video Diffusion Model for Single Image to 3D Generation

Yuxiao Yang, Peihao Li, Yuhong Zhang et al.

3D AI-generated content (AIGC) has made it increasingly accessible for anyone to become a 3D content creator. While recent methods leverage Score Distillation Sampling to distill 3D objects from pretrained image diffusion models, they often suffer from inadequate 3D priors, leading to insufficient multi-view consistency. In this work, we introduce NOVA3D, an innovative single-image-to-3D generation framework. Our key insight lies in leveraging strong 3D priors from a pretrained video diffusion model and integrating geometric information during multi-view video fine-tuning. To facilitate information exchange between color and geometric domains, we propose the Geometry-Temporal Alignment (GTA) attention mechanism, thereby improving generalization and multi-view consistency. Moreover, we introduce the de-conflict geometry fusion algorithm, which improves texture fidelity by addressing multi-view inaccuracies and resolving discrepancies in pose alignment. Extensive experiments validate the superiority of NOVA3D over existing baselines.

CR · Nov 6, 2024
A Novel Access Control and Privacy-Enhancing Approach for Models in Edge Computing

Peihao Li

With the widespread adoption of edge computing technologies and the increasing prevalence of deep learning models in these environments, the security risks and privacy threats to models and data have grown more acute. Attackers can exploit various techniques to illegally obtain models or misuse data, leading to serious issues such as intellectual property infringement and privacy breaches. Existing model access control technologies primarily rely on traditional encryption and authentication methods; however, these approaches exhibit significant limitations in terms of flexibility and adaptability in dynamic environments. Although there have been advancements in model watermarking techniques for marking model ownership, they remain limited in their ability to proactively protect intellectual property and prevent unauthorized access. To address these challenges, we propose a novel model access control method tailored for edge computing environments. This method leverages image style as a licensing mechanism, embedding style recognition into the model's operational framework to enable intrinsic access control. Consequently, models deployed on edge platforms are designed to infer correctly only on licensed data with a specific style, rendering them ineffective on any other data. By restricting the input data to the edge model, this approach not only prevents attackers from gaining unauthorized access to the model but also enhances the privacy of data on terminal devices. We conducted extensive experiments on benchmark datasets, including MNIST, CIFAR-10, and FACESCRUB, and the results demonstrate that our method effectively prevents unauthorized access to the model while maintaining accuracy. Additionally, the model shows strong resistance against attacks such as forged licenses and fine-tuning. These results underscore the method's usability, security, and robustness.