Ziyi Zhong

LG
h-index: 27
3 papers
97 citations
Novelty: 45%
AI Score: 34

3 Papers

LG · Jul 17, 2025
Apple Intelligence Foundation Language Models: Tech Report 2025

Ethan Li, Anders Boesen Lindbo Larsen, Chen Zhang et al. · apple-ml, cmu

We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: (i) a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and (ii) a scalable server model built on a novel Parallel-Track Mixture-of-Experts (PT-MoE) transformer that combines track parallelism, mixture-of-experts sparse computation, and interleaved global-local attention to deliver high quality with competitive cost on Apple's Private Cloud Compute platform. Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling, licensed corpora, and high-quality synthetic data, then further refined with supervised fine-tuning and reinforcement learning on a new asynchronous platform. The resulting models support several additional languages while understanding images and executing tool calls. In public benchmarks and human evaluations, both the server model and the on-device model match or surpass comparably sized open baselines. A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning, allowing developers to integrate these capabilities with a few lines of code. The latest advancements in Apple Intelligence models are grounded in our Responsible AI approach with safeguards like content filtering and locale-specific evaluation, as well as our commitment to protecting our users' privacy with innovations like Private Cloud Compute.
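As a rough illustration of what 2-bit quantization means in the abstract above, the following sketch maps a list of weights onto four uniformly spaced levels and back. It is an assumption-laden toy, not the report's method: the uniform min/max scheme and the function names `quantize_2bit` and `dequantize` are hypothetical, and quantization-aware training would additionally apply such rounding inside the training forward pass.

```python
from typing import List, Tuple


def quantize_2bit(weights: List[float]) -> Tuple[List[int], float, float]:
    """Map each weight to one of 4 integer codes (0..3) on a uniform grid.

    Hypothetical helper: a plain min/max uniform quantizer, chosen only
    for illustration.
    """
    if not weights:
        raise ValueError("weights must be non-empty")
    lo, hi = min(weights), max(weights)
    levels = 4  # 2 bits -> 2**2 representable values
    # Fall back to a unit step when all weights are identical.
    scale = (hi - lo) / (levels - 1) or 1.0
    codes = [min(levels - 1, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize(codes: List[int], scale: float, lo: float) -> List[float]:
    """Recover approximate weights from their 2-bit codes."""
    return [lo + c * scale for c in codes]


if __name__ == "__main__":
    codes, scale, lo = quantize_2bit([0.0, 0.2, 0.7, 1.0])
    print(codes)
    print(dequantize(codes, scale, lo))
```

Each weight is stored as a 2-bit code plus a shared scale and offset, which is where the memory saving comes from; the reconstruction error is the price paid for it.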

LG · Jul 2, 2019
E-Sports Talent Scouting Based on Multimodal Twitch Stream Data

Anna Belova, Wen He, Ziyi Zhong

We propose and investigate the feasibility of a novel task: finding e-sports talent using multimodal Twitch chat and video stream data. Specifically, we focus on predicting the ranks of Counter-Strike: Global Offensive (CS:GO) gamers who broadcast their games on Twitch. From January to April 2019, we built two Twitch stream collections: one for 425 publicly ranked CS:GO gamers and one for 9,928 unranked CS:GO gamers. We extract neural features from video, audio, and text chat data and estimate modality-specific probabilities for a gamer to be top-ranked during the data collection time frame. A hierarchical Bayesian model is then used to pool the evidence across modalities and generate estimates of intrinsic skill for each gamer. We validate our modeling by correlating the intrinsic skill predictions with the May 2019 ranks of the publicly profiled gamers.

CL · May 7, 2018
Multimodal Machine Translation with Reinforcement Learning

Xin Qian, Ziyi Zhong, Jieli Zhou

Multimodal machine translation is one of the applications that integrate computer vision and language processing. It is a unique task given that, in the field of machine translation, many state-of-the-art algorithms still employ only textual information. In this work, we explore the effectiveness of reinforcement learning in multimodal machine translation. We present a novel algorithm based on the Advantage Actor-Critic (A2C) algorithm that specifically caters to the multimodal machine translation task of the EMNLP 2018 Third Conference on Machine Translation (WMT18). We evaluate our proposed algorithm on the Multi30K multilingual English-German image description dataset and the Flickr30K image entity dataset. Our model takes two channels of input, image and text, uses translation evaluation metrics as training rewards, and achieves better results than supervised-learning MLE baseline models. Furthermore, we discuss the prospects and limitations of using reinforcement learning for machine translation. Our experimental results suggest a promising reinforcement learning solution to the general task of multimodal sequence-to-sequence learning.
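The idea of using a translation metric as the training reward in an actor-critic setup can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a single sentence-level reward delivered at the end of the sampled translation, no discounting, and a hypothetical helper name `a2c_losses`; the per-token log-probabilities and critic values would come from the model.

```python
from typing import List, Tuple


def a2c_losses(
    log_probs: List[float], values: List[float], reward: float
) -> Tuple[float, float]:
    """Return (actor_loss, critic_loss) averaged over decoding steps.

    log_probs: log-probability of each sampled target token (actor output).
    values:    critic's estimate of the final reward at each step.
    reward:    sentence-level metric score of the sampled translation
               (assumed here to be a single terminal reward).
    """
    if not log_probs or len(log_probs) != len(values):
        raise ValueError("log_probs and values must be non-empty and equal length")
    actor_loss = 0.0
    critic_loss = 0.0
    for logp, value in zip(log_probs, values):
        advantage = reward - value        # how much better than the critic expected
        actor_loss += -logp * advantage   # policy-gradient term
        critic_loss += advantage ** 2     # regress the critic toward the reward
    n = len(log_probs)
    return actor_loss / n, critic_loss / n


if __name__ == "__main__":
    print(a2c_losses([-1.0, -2.0], [0.5, 0.5], 1.0))
```

In a real training loop the advantage would be treated as a constant when differentiating the actor term, so that only the log-probabilities receive the policy gradient.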