Zhiwei Zhao

CV
h-index: 37
15 papers
15,783 citations
Novelty: 51%
AI Score: 59

15 Papers

AI · Jul 31, 2024
The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri et al. · allen-ai, berkeley

Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical evaluation of Llama 3. We find that Llama 3 delivers comparable quality to leading language models such as GPT-4 on a plethora of tasks. We publicly release Llama 3, including pre-trained and post-trained versions of the 405B parameter language model and our Llama Guard 3 model for input and output safety. The paper also presents the results of experiments in which we integrate image, video, and speech capabilities into Llama 3 via a compositional approach. We observe this approach performs competitively with the state-of-the-art on image, video, and speech recognition tasks. The resulting models are not yet being broadly released as they are still under development.

HC · Mar 26, 2022 · Code
Implementation of an Automated Learning System for Non-experts

Phoenix X. Huang, Zhiwei Zhao, Chao Liu et al.

Automated machine learning systems for non-experts could be critical for industries to adopt artificial intelligence in their own applications. This paper detailed the engineering system implementation of an automated machine learning system called YMIR, which relies entirely on a graphical interface to interact with users. After importing training/validation data into the system, a user without AI knowledge can label the data, train models, perform data mining and evaluation by simply clicking buttons. The paper described: 1) Open implementation of model training and inference through docker containers. 2) Implementation of task and resource management. 3) Integration of labeling software. 4) Implementation of HCI (Human Computer Interaction) with a rebuilt collaborative development paradigm. We also provide a subsequent case study on training models with the system. We hope this paper can facilitate the prosperity of our automated machine learning community from an industry application perspective. The code of the system has already been released to GitHub (https://github.com/industryessentials/ymir).

CV · Sep 22, 2023
Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding

Jiazhen Wang, Bin Liu, Changtao Miao et al.

AI-synthesized text and images have gained significant attention, particularly due to the widespread dissemination of multi-modal manipulations on the internet, which has resulted in numerous negative impacts on society. Existing methods for multi-modal manipulation detection and grounding primarily focus on fusing vision-language features to make predictions, while overlooking the importance of modality-specific features, leading to sub-optimal results. In this paper, we construct a simple and novel transformer-based framework for multi-modal manipulation detection and grounding tasks. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. To achieve this, we introduce visual/language pre-trained encoders and dual-branch cross-attention (DCA) to extract and fuse modality-unique features. Furthermore, we design decoupled fine-grained classifiers (DFC) to enhance modality-specific feature mining and mitigate modality competition. Moreover, we propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality using learnable queries, thereby improving the discovery of forged details. Extensive experiments on the $\rm DGM^4$ dataset demonstrate the superior performance of our proposed model compared to state-of-the-art approaches.

CV · Jun 16, 2023
EVOPOSE: A Recursive Transformer For 3D Human Pose Estimation With Kinematic Structure Priors

Yaqi Zhang, Yan Lu, Bin Liu et al.

Transformers are popular in recent 3D human pose estimation, where they utilize long-term modeling to lift 2D keypoints into the 3D space. However, current transformer-based methods do not fully exploit the prior knowledge of the human skeleton provided by the kinematic structure. In this paper, we propose a novel transformer-based model, EvoPose, to introduce human body prior knowledge into 3D human pose estimation effectively. Specifically, a Structural Priors Representation (SPR) module represents human priors as structural features carrying rich body patterns, e.g. joint relationships. The structural features interact with 2D pose sequences and help the model obtain more informative spatiotemporal features. Moreover, a Recursive Refinement (RR) module is applied to refine the 3D pose outputs by utilizing estimated results while further injecting human priors. Extensive experiments demonstrate the effectiveness of EvoPose, which achieves a new state of the art on the two most popular benchmarks, Human3.6M and MPI-INF-3DHP.

CV · May 6
Advancing Aesthetic Image Generation via Composition Transfer

Kai Zou, Zhiwei Zhao, Bin Liu et al.

Composition is a cornerstone of visual aesthetics, influencing the appeal of an image. While its principles operate independently of specific content, in practice, composition is often coupled with semantics. As a result, existing methods often enhance composition either through implicit learning or by semantics-based layout control, rather than explicitly modeling composition itself. To address this gap, we introduce Composer, a framework rooted in aesthetic theory, designed to model composition in a semantic-agnostic manner. First, it supports composition transfer by extracting key composition-aware representations from a reference image and leveraging a tailored conditional guidance module to control composition based on pre-trained diffusion models. Second, when users specify only text themes without a composition reference, Composer supports theme-driven composition retrieval by leveraging the in-context learning capabilities of Large Vision-Language Models (LVLMs), achieving explicit composition planning. To enhance composition in a reference-free mode, we conduct text-to-composition fine-tuning on the trained control module to enable implicit composition planning. Furthermore, we curated a high-quality dataset comprising 2 million image-text pairs using state-of-the-art generative models to support model training. Experimental results demonstrate that Composer significantly enhances aesthetic quality in text-to-image tasks and facilitates personalized composition control and transfer, offering users precision and flexibility in the creative process.

CE · Mar 17
A scalable neural bundle map for multiphysics prediction in lithium-ion battery across varying configurations

Zhiwei Zhao, Changqing Liu, Jie Lin et al.

Efficient and accurate prediction of multiphysics evolution across diverse cell geometries is fundamental to the design, management and safety of lithium-ion batteries. However, existing computational frameworks struggle to capture the coupled electrochemical, thermal, and mechanical dynamics across diverse cell geometries and varying operating conditions. Here, we present a Neural Bundle Map (NBM), a mathematically rigorous framework that reformulates multiphysics evolution as a bundle map over a geometric base manifold. This approach enables the complete decoupling of geometric complexity from underlying physical laws, ensuring strong operator continuity across varying domains. Our framework achieves high-fidelity spatiotemporal predictions with a normalized mean absolute error of less than 1% across varying configurations, while maintaining stability during long-horizon forecasting far beyond the training window and reducing computational costs by two orders of magnitude compared with conventional solvers. Leveraging this capability, we rapidly explored a vast configurational space to identify an optimal battery design that yields a 38% increase in energy density while adhering to thermal safety constraints. Furthermore, the NBM demonstrates remarkable scalability to multi-cell systems through few-shot transfer learning, providing a foundational paradigm for the intelligent design and real-time monitoring of complex energy storage infrastructures.

CV · Mar 18
EchoGen: Cycle-Consistent Learning for Unified Layout-Image Generation and Understanding

Kai Zou, Hongbo Liu, Dian Zheng et al.

In this work, we present EchoGen, a unified framework for layout-to-image generation and image grounding, capable of generating images with accurate layouts and high fidelity to text descriptions (e.g., spatial relationships), while grounding the image robustly at the same time. We believe that image grounding possesses strong text and layout understanding abilities, which can compensate for the corresponding limitations in layout-to-image generation. At the same time, images generated from layouts exhibit high diversity in content, thereby enhancing the robustness of image grounding. Jointly training both tasks within a unified model can promote performance improvements for each. However, we identify that this joint training paradigm encounters several optimization challenges and results in restricted performance. To address these issues, we propose progressive training strategies. First, the Parallel Multi-Task Pre-training (PMTP) stage equips the model with basic abilities for both tasks, leveraging shared tokens to accelerate training. Next, the Dual Joint Optimization (DJO) stage exploits task duality to sequentially integrate the two tasks, enabling unified optimization. Finally, the Cycle RL stage eliminates reliance on visual supervision by using consistency constraints as rewards, significantly enhancing the model's unified capabilities via the GRPO strategy. Extensive experiments demonstrate state-of-the-art results on both layout-to-image generation and image grounding benchmarks, and reveal clear synergistic gains from optimizing the two tasks together.

NA · Feb 19, 2024
Diffeomorphism Neural Operator for various domains and parameters of partial differential equations

Zhiwei Zhao, Changqing Liu, Yingguang Li et al.

In scientific and engineering applications, solving partial differential equations (PDEs) across various parameters and domains normally relies on resource-intensive numerical methods. Neural operators based on deep learning offer a promising alternative for solving PDEs by directly learning physical laws from data. However, current neural operator methods are limited to solving PDEs on fixed domains. Expanding neural operators to solve PDEs on various domains holds significant promise in medical imaging, engineering design and manufacturing applications, where geometric and parameter changes are essential. This paper presents a novel neural operator learning framework for solving PDEs with various domains and parameters defined for physical systems, named diffeomorphism neural operator (DNO). The main idea is that a neural operator learns in a generic domain which is diffeomorphically mapped from various physics domains expressed by the same PDE. In this way, the challenge of operator learning on various domains is transformed into operator learning on the generic domain. The generalization performance of DNO on different domains can be assessed by a proposed method which evaluates the geometric similarity between a new domain and the domains of the training dataset after diffeomorphism. Experiments on Darcy flow, pipe flow, airfoil flow and mechanics were carried out, where harmonic and volume parameterization were used as the diffeomorphism for 2D and 3D domains. The DNO framework demonstrated robust learning capabilities and strong generalization performance across various domains and parameters.
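The "learn on a generic domain" idea can be illustrated with a minimal sketch. It assumes a deliberately simple diffeomorphism (axis-wise scaling from an ellipse with semi-axes `a`, `b` to the unit disk) rather than the harmonic or volume parameterization used in the paper, and `generic_solver` is a hypothetical stand-in for a trained neural operator evaluated at a point. All function names are illustrative and not taken from the paper.

```python
from typing import Callable, Tuple

Point = Tuple[float, float]


def to_generic(p: Point, a: float, b: float) -> Point:
    """Map a point of the ellipse x^2/a^2 + y^2/b^2 <= 1 to the unit disk."""
    return (p[0] / a, p[1] / b)


def to_physical(q: Point, a: float, b: float) -> Point:
    """Inverse map: unit disk back to the ellipse (the map is a diffeomorphism)."""
    return (q[0] * a, q[1] * b)


def evaluate_on_physical(generic_solver: Callable[[Point], float],
                         p: Point, a: float, b: float) -> float:
    """Evaluate a field on a physical domain by querying the generic-domain model."""
    return generic_solver(to_generic(p, a, b))


def toy_generic_solver(q: Point) -> float:
    """Hypothetical placeholder for a learned operator output on the unit disk."""
    return 1.0 - (q[0] ** 2 + q[1] ** 2)


# The same generic-domain model serves two differently shaped ellipses.
value_wide = evaluate_on_physical(toy_generic_solver, (2.0, 0.0), a=2.0, b=1.0)
value_tall = evaluate_on_physical(toy_generic_solver, (0.0, 3.0), a=1.0, b=3.0)
```

Only the geometric mapping changes from one domain to the next; the model that operates on the generic domain is shared, which is the decoupling the abstract describes.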

CL · Aug 11, 2025
Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions

Bangsheng Tang, Carl Chengyan Fu, Fei Kou et al.

Speculative decoding is a standard method for accelerating the inference speed of large language models. However, scaling it for production environments poses several engineering challenges, including efficiently implementing different operations (e.g., tree attention and multi-round speculative decoding) on GPU. In this paper, we detail the training and inference optimization techniques that we have implemented to enable EAGLE-based speculative decoding at a production scale for Llama models. With these changes, we achieve a new state-of-the-art inference latency for Llama models. For example, Llama4 Maverick decodes at a speed of about 4 ms per token (with a batch size of one) on 8 NVIDIA H100 GPUs, which is 10% faster than the previously best known method. Furthermore, for EAGLE-based speculative decoding, our optimizations enable us to achieve a speed-up for large batch sizes between 1.4x and 2.0x at production scale.
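For readers unfamiliar with the technique, the following is a minimal sketch of the basic draft-and-verify loop behind speculative decoding, in its greedy form. It is not the EAGLE-based method or the tree attention described in the paper. The two "models" are hypothetical toy functions that map a token list to the next token, and the verification loop stands in for what would be a single batched forward pass of the target model.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def greedy_decode(model: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Reference decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    """Greedy draft-and-verify decoding with up to k drafted tokens per round."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1) The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The target model checks every proposed position.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                # 3) First mismatch: keep the agreed prefix plus the target's token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            # Every drafted token matched the target's greedy choice.
            out.extend(proposal)
    return out


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical target model over a 7-token vocabulary."""
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical draft model: agrees with the target except after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] * 3 + 1) % 7


spec_tokens = speculative_decode(toy_target, toy_draft, [1], max_new=10)
ref_tokens = greedy_decode(toy_target, [1], max_new=10)
```

In this greedy variant the output is identical to decoding with the target model alone; the speed-up comes from accepting several drafted tokens per target verification when the draft model agrees.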

CV · May 5, 2025
VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection

Hao Cheng, Zhiwei Zhao, Yichao He et al.

Audiovisual emotion recognition (AVER) aims to infer human emotions from nonverbal visual-audio (VA) cues, offering modality-complementary and language-agnostic advantages. However, AVER remains challenging due to the inherent ambiguity of emotional expressions, cross-modal expressive disparities, and the scarcity of reliably annotated data. Recent self-supervised AVER approaches have introduced strong multimodal representations, yet they predominantly rely on modality-specific encoders and coarse content-level alignment, limiting fine-grained emotional semantic modeling. To address these issues, we propose VAEmo, an efficient two-stage framework for emotion-centric joint VA representation learning with external knowledge injection. In Stage 1, a unified and lightweight representation network is pre-trained on large-scale speaker-centric VA corpora via masked reconstruction and contrastive objectives, mitigating the modality gap and learning expressive, complementary representations without emotion labels. In Stage 2, multimodal large language models automatically generate detailed affective descriptions according to our well-designed chain-of-thought prompting for only a small subset of VA samples; these rich textual semantics are then injected by aligning their corresponding embeddings with VA representations through dual-path contrastive learning, further bridging the emotion gap. Extensive experiments on multiple downstream AVER benchmarks show that VAEmo achieves state-of-the-art performance with a compact design, highlighting the benefit of unified cross-modal encoding and emotion-aware semantic guidance for efficient, generalizable VA emotion representations.

LG · Sep 9, 2025
Neural Diffeomorphic-Neural Operator for Residual Stress-Induced Deformation Prediction

Changqing Liu, Kaining Dai, Zhiwei Zhao et al.

Accurate prediction of machining deformation in structural components is essential for ensuring dimensional precision and reliability. Such deformation often originates from residual stress fields, whose distribution and influence vary significantly with geometric complexity. Conventional numerical methods for modeling the coupling between residual stresses and deformation are computationally expensive, particularly when diverse geometries are considered. Neural operators have recently emerged as a powerful paradigm for efficiently solving partial differential equations, offering notable advantages in accelerating residual stress-deformation analysis. However, their direct application across changing geometric domains faces theoretical and practical limitations. To address this challenge, a novel framework based on diffeomorphic embedding neural operators named neural diffeomorphic-neural operator (NDNO) is introduced. Complex three-dimensional geometries are explicitly mapped to a common reference domain through a diffeomorphic neural network constrained by smoothness and invertibility. The neural operator is then trained on this reference domain, enabling efficient learning of deformation fields induced by residual stresses. Once trained, both the diffeomorphic neural network and the neural operator demonstrate efficient prediction capabilities, allowing rapid adaptation to varying geometries. The proposed method thus provides an effective and computationally efficient solution for deformation prediction in structural components subject to varying geometries. The proposed method is validated to predict both main-direction and multi-direction deformation fields, achieving high accuracy and efficiency across parts with diverse geometries including component types, dimensions and features.

GR · Mar 19, 2024
A Physics-embedded Deep Learning Framework for Cloth Simulation

Zhiwei Zhao

Delicate cloth simulations have long been desired in computer graphics. Various methods were proposed to improve engaged force interactions, collision handling, and numerical integrations. Deep learning has the potential to achieve fast and real-time simulation, but common neural network structures often demand many parameters to capture cloth dynamics. This paper proposes a physics-embedded learning framework that directly encodes physical features of cloth simulation. The convolutional neural network is used to represent spatial correlations of the mass-spring system, after which three branches are designed to learn linear, nonlinear, and time-derivative features of cloth physics. The framework can also integrate with other external forces and collision handling through either traditional simulators or sub neural networks. The model is tested across different cloth animation cases, without training with new data. Agreement with baselines and predictive realism successfully validate its generalization ability. The inference efficiency of the proposed model also surpasses that of traditional physics simulation. This framework is also designed to easily integrate with other visual refinement techniques like wrinkle carving, which leaves significant room to incorporate prevailing machine learning techniques in 3D cloth animation.
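As background for the mass-spring system mentioned in the abstract, here is a minimal sketch of the classical (non-learned) baseline: Hooke's-law forces on a 1D chain of equal masses and one semi-implicit Euler time step. It is not the paper's network. The 1D chain, the function names, and the parameter values are assumptions chosen to keep the example short.

```python
from typing import List, Tuple


def spring_forces(x: List[float], rest: float, k: float) -> List[float]:
    """Hooke's-law forces on a 1D chain where neighbours are joined by springs."""
    forces = [0.0] * len(x)
    for i in range(len(x) - 1):
        stretch = (x[i + 1] - x[i]) - rest  # positive when the spring is extended
        forces[i] += k * stretch            # pulls node i towards node i + 1
        forces[i + 1] -= k * stretch        # equal and opposite reaction
    return forces


def step(x: List[float], v: List[float], rest: float, k: float,
         mass: float, dt: float) -> Tuple[List[float], List[float]]:
    """One semi-implicit Euler step: update velocities first, then positions."""
    forces = spring_forces(x, rest, k)
    v_new = [vi + dt * fi / mass for vi, fi in zip(v, forces)]
    x_new = [xi + dt * vi for xi, vi in zip(x, v_new)]
    return x_new, v_new


# Three nodes; the second spring is stretched by one unit beyond its rest length.
positions = [0.0, 1.0, 3.0]
velocities = [0.0, 0.0, 0.0]
forces_now = spring_forces(positions, rest=1.0, k=2.0)
positions_next, velocities_next = step(positions, velocities,
                                       rest=1.0, k=2.0, mass=1.0, dt=0.5)
```

Each force depends only on a node's immediate neighbours, which is the kind of local spatial correlation a convolution can represent.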

LG · Aug 20, 2021
Mobility-Aware Cluster Federated Learning in Hierarchical Wireless Networks

Chenyuan Feng, Howard H. Yang, Deshun Hu et al.

Implementing federated learning (FL) algorithms in wireless networks has garnered a wide range of attention. However, few works have considered the impact of user mobility on the learning performance. To fill this research gap, firstly, we develop a theoretical model to characterize the hierarchical federated learning (HFL) algorithm in wireless networks where the mobile users may roam across multiple edge access points, leading to incomplete or inconsistent FL training. Secondly, we provide the convergence analysis of HFL with user mobility. Our analysis proves that the learning performance of HFL deteriorates drastically with highly-mobile users. This decline in the learning performance is exacerbated by a small number of participants and large data distribution divergences among the local data of users. To circumvent these issues, we propose a mobility-aware cluster federated learning (MACFL) algorithm by redesigning the access mechanism, local update rule and model aggregation scheme. Finally, we provide experiments to evaluate the learning performance of HFL and our MACFL. The results show that our MACFL can enhance the learning performance, especially for three different cases, namely, the case of users with non-independent and identically distributed data, the case of users with high mobility, and the case with a small number of users.
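The hierarchical aggregation that HFL builds on can be sketched as a two-level weighted average: each edge access point averages the models of its connected users, and the cloud then averages the edge models. This is a plain sample-weighted averaging sketch, not the MACFL algorithm. The data layout and function names are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Update = Tuple[List[float], int]  # (model weights, number of local samples)


def weighted_average(updates: List[Update]) -> Update:
    """Average weight vectors, weighting each one by its sample count."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    avg = [sum(w[i] * n for w, n in updates) / total for i in range(dim)]
    return avg, total


def hierarchical_average(edges: Dict[str, List[Update]]) -> List[float]:
    """Edge-level aggregation followed by cloud-level aggregation."""
    # Each access point aggregates only the users that completed the round there.
    edge_models = [weighted_average(users) for users in edges.values() if users]
    # The cloud aggregates edge models, weighted by the samples behind each one.
    return weighted_average(edge_models)[0]


# Hypothetical round: two users at access point "ap1", one user at "ap2".
round_updates = {
    "ap1": [([1.0, 0.0], 1), ([3.0, 0.0], 3)],
    "ap2": [([0.0, 4.0], 4)],
}
global_model = hierarchical_average(round_updates)
```

In this sketch a user who roams away before finishing a round is simply absent from the access point's list, so its data no longer influences that edge model. That is one simplified way to picture the mobility effect the paper analyses.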

AI · Mar 3, 2020
Hierarchical Context Enhanced Multi-Domain Dialogue System for Multi-domain Task Completion

Jingyuan Yang, Guang Liu, Yuzhao Mao et al.

Task 1 of the DSTC8-track1 challenge aims to develop an end-to-end multi-domain dialogue system to accomplish users' complex goals under tourist information desk settings. This paper describes our submitted solution, Hierarchical Context Enhanced Dialogue System (HCEDS), for this task. The main motivation of our system is to comprehensively explore the potential of hierarchical context for sufficiently understanding complex dialogues. More specifically, we apply BERT to capture token-level information and employ the attention mechanism to capture sentence-level information. The results listed in the leaderboard show that our system achieves first place in automatic evaluation and second place in human evaluation.

NI · Oct 10, 2017
Link Quality Aware Channel Allocation for Multichannel Body Sensor Networks

Weifeng Gao, Zhiwei Zhao, Geyong Min et al.

Body Sensor Network (BSN) is a typical Internet-of-Things (IoT) application for personalized health care. It consists of economically powered, wireless and implanted medical monitoring sensor nodes, which are designed to continually collect the medical information of the target patients. Multichannel is often used in BSNs to reduce the spectrum competition of the tremendous sensor nodes and the problem of channel assignment has attracted much research attention. The health sensing data in BSNs is often required to be delivered to a sink node (or server) before a certain deadline for real time monitoring or health emergency alarm. Therefore, deadline is of significant importance for multichannel allocation and scheduling. The existing works, though designed to meet the deadline, often overlook the impact of the unreliable wireless links. As a result, the health sensing data can still be overdue because of the scheduled lossy links. Besides, potential collisions in the schedules also incur considerable delay in delivering the sensing data. In this paper, we propose a novel deadline-driven Link quality Aware Channel Assignment scheme (LACA), where link quality, deadlines and collisions are jointly considered. LACA prioritizes links with urgent deadlines and heavy collisions. Besides, LACA allows the exploitation of the spare slots for retransmissions on lossy links, which can further reduce the retransmission delay. Extensive simulation experiments show that compared to the existing approaches, LACA can better utilize the wireless spectrum and achieve a higher packet delivery ratio before the deadline.