CVNov 6, 2023
InstructPix2NeRF: Instructed 3D Portrait Editing from a Single ImageJianhui Li, Shilong Liu, Zidong Liu et al.
With the success of Neural Radiance Field (NeRF) in 3D-aware portrait editing, a variety of works have achieved promising results regarding both quality and 3D consistency. However, these methods heavily rely on per-prompt optimization when handling natural language as editing instructions. Due to the lack of labeled human face 3D datasets and effective architectures, the area of human-instructed 3D-aware editing for open-world portraits in an end-to-end manner remains under-explored. To solve this problem, we propose an end-to-end diffusion-based framework termed InstructPix2NeRF, which enables instructed 3D-aware portrait editing from a single open-world image with human instructions. At its core lies a conditional latent 3D diffusion process that lifts 2D editing to 3D space by learning the correlation between the paired images' difference and the instructions via triplet data. With the help of our proposed token position randomization strategy, we could even achieve multi-semantic editing through one single pass with the portrait identity well-preserved. Besides, we further propose an identity consistency module that directly modulates the extracted identity signals into our diffusion process, which increases the multi-view 3D identity consistency. Extensive experiments verify the effectiveness of our method and show its superiority against strong baselines quantitatively and qualitatively. Source code and pre-trained models can be found on our project page: \url{https://mybabyyh.github.io/InstructPix2NeRF}.
LGMay 24, 2024Code
Achieving Dimension-Free Communication in Federated Learning via Zeroth-Order OptimizationZhe Li, Bicheng Ying, Zidong Liu et al.
Federated Learning (FL) offers a promising framework for collaborative and privacy-preserving machine learning across distributed data sources. However, the substantial communication costs associated with FL significantly challenge its efficiency. Specifically, in each communication round, the communication costs scale linearly with the model's dimension, which presents a formidable obstacle, especially in large model scenarios. Despite various communication-efficient strategies, the intrinsic dimension-dependent communication cost remains a major bottleneck for current FL implementations. This paper proposes a novel dimension-free communication algorithm - DeComFL, which leverages the zeroth-order optimization techniques and reduces the communication cost from $\mathscr{O}(d)$ to $\mathscr{O}(1)$ by transmitting only a constant number of scalar values between clients and the server in each round, regardless of the dimension $d$ of the model parameters. Theoretically, in non-convex functions, we prove that our algorithm achieves state-of-the-art rates, which show a linear speedup of the number of clients and local steps under standard assumptions. With additional low effective rank assumption, we can further show the convergence rate is independent of the model dimension $d$ as well. Empirical evaluations, encompassing both classic deep learning training and large language model fine-tuning, demonstrate significant reductions in communication overhead. Notably, DeComFL achieves this by transmitting only around 1MB of data in total between the server and a client to fine-tune a model with billions of parameters. Our code is available at https://github.com/ZidongLiu/DeComFL.
66.5AIMay 14
TeachAnything: A Multimodal Crowdsourcing Platform for Training Embodied AI Agents in Symmetrical RealityZidong Liu, Rongkai Liu, Yue Li et al.
Symmetrical Reality (SR) is emerging as a future trend for human-agent coexistence, placing higher demands on agents to acquire human-like intelligence. It calls for richer and more diverse human guidance. We introduce a three-stage demonstration paradigm integrating multimodal demonstration signals. Building on this paradigm, we developed TeachAnything, a cloud-based, crowdsourcing-oriented demonstration platform with physics simulation capable of collecting diverse demonstration data across varied scenes, tasks, and embodiments. By unifying virtual and physical interactions through both methodological design and physics simulation, the system serves as a practical foundation for developing embodied agents aligned with Symmetrical Reality.
LGOct 27, 2025Code
Can Language Models Compose Skills In-Context?Zidong Liu, Zhuoyan Xu, Zhenmei Shi et al.
Composing basic skills from simple tasks to accomplish composite tasks is crucial for modern intelligent systems. We investigate the in-context composition ability of language models to perform composite tasks that combine basic skills demonstrated in in-context examples. This is more challenging than the standard setting, where skills and their composition can be learned in training. We conduct systematic experiments on various representative open-source language models, utilizing linguistic and logical tasks designed to probe composition abilities. The results reveal that simple task examples can have a surprising negative impact on the performance, because the models generally struggle to recognize and assemble the skills correctly, even with Chain-of-Thought examples. Theoretical analysis further shows that it is crucial to align examples with the corresponding steps in the composition. This inspires a method for the probing tasks, whose improved performance provides positive support for our insights.
55.8LGMay 5
Learning Dynamics of Zeroth-Order Optimization: A Kernel PerspectiveZhe Li, Bicheng Ying, Zidong Liu et al.
Classical optimization theory establishes that zeroth-order (ZO) algorithms suffer from a dimension-dependent slowdown, with convergence rates typically scaling with the model dimension compared to first-order methods. However, in contrast to these theoretical expectations, a growing body of recent work demonstrates the successful application of ZO methods to fine-tuning Large Language Models (LLMs) with billions of parameters. To explain this paradox, we derive the one-step learning dynamics of ZO SGD, where the empirical Neural Tangent Kernel (eNTK) naturally emerges as the key term governing the learning behavior. Inspection of the eNTK produced by ZO SGD reveals that each element corresponds to the inner product of neural tangent vectors projected onto a random low-dimensional subspace. Thus, by invoking the Johnson-Lindenstrauss Lemma, our analysis shows that the fidelity of the ZO eNTK is governed primarily by the number of perturbations. Crucially, the approximation error depends on the model output size rather than the massive parameter dimension. This dimension-free property provides a theoretical justification for the scalability of ZO methods to LLMs finetuning tasks. We believe that this kernel-based framework offers a novel perspective for understanding ZO methods within the context of learning dynamics.
CVJan 15, 2024
VeCAF: Vision-language Collaborative Active Finetuning with Training Objective AwarenessRongyu Zhang, Zefan Cai, Huanrui Yang et al.
Finetuning a pretrained vision model (PVM) is a common technique for learning downstream vision tasks. However, the conventional finetuning process with randomly sampled data points results in diminished training efficiency. To address this drawback, we propose a novel approach, Vision-language Collaborative Active Finetuning (VeCAF). With the emerging availability of labels and natural language annotations of images through web-scale crawling or controlled generation, VeCAF makes use of these information to perform parametric data selection for PVM finetuning. VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence to meet the performance goal. This process is assisted by the inherent semantic richness of the text embedding space which we use to augment image features. Furthermore, the flexibility of text-domain augmentation allows VeCAF to handle out-of-distribution scenarios without external data. Extensive experiments show the leading performance and high computational efficiency of VeCAF that is superior to baselines in both in-distribution and out-of-distribution image classification tasks. On ImageNet, VeCAF uses up to 3.3x less training batches to reach the target performance compared to full finetuning, and achieves an accuracy improvement of 2.7% over the state-of-the-art active finetuning method with the same number of batches.
LGJun 3, 2025
Reconciling Hessian-Informed Acceleration and Scalar-Only Communication for Efficient Federated Zeroth-Order Fine-TuningZhe Li, Bicheng Ying, Zidong Liu et al.
Recent dimension-free communication frameworks in Federated Learning (FL), such as DeComFL, significantly reduce per-round communication by transmitting only scalars via zeroth-order stochastic gradient descent (ZO-SGD). This method is particularly advantageous for federated fine-tuning of Large Language Models (LLMs). Yet, the high variance in ZO gradient estimation typically leads to slow convergence. Although leveraging Hessian information is known to enhance optimization speed, integrating this into FL presents significant challenges. These include clients' restrictions on local data and the critical need to maintain the dimension-free communication property. To overcome this limitation, we first introduce a generalized scalar-only communication FL framework that decouples dimension-free communication from standard ZO-SGD, enabling the integration of more advanced optimization strategies. Building on this framework, we propose HiSo, a fast federated fine-tuning method via Hessian-informed zeroth-order optimization and Scalar-only communication. Specifically, it leverages global curvature information to accelerate convergence while preserving the same minimal communication cost per round. Theoretically, we establish convergence guarantees that are independent of the global Lipschitz constant, and further show that HiSo achieves faster rates when the global Hessian exhibits a low effective rank -- a common phenomenon in LLMs. Extensive experiments on benchmark datasets and LLM fine-tuning tasks confirm that HiSo significantly outperforms existing ZO-based FL methods in both convergence speed and communication efficiency.