Qiao Zhang

CL
h-index117
16papers
9,871citations
Novelty60%
AI Score62

16 Papers

CLFeb 2Code
Kimi K2.5: Visual Agentic Intelligence

Kimi Team, Tongtong Bai, Yifan Bai et al.

We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5 introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently. Extensive evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains including coding, vision, reasoning, and agentic tasks. Agent Swarm also reduces latency by up to $4.5\times$ over single-agent baselines. We release the post-trained Kimi K2.5 model checkpoint to facilitate future research and real-world applications of agentic intelligence.

CVDec 6, 2022
MobilePTX: Sparse Coding for Pneumothorax Detection Given Limited Training Examples

Darryl Hannan, Steven C. Nesbit, Ximing Wen et al.

Point-of-Care Ultrasound (POCUS) refers to clinician-performed and interpreted ultrasonography at the patient's bedside. Interpreting these images requires a high level of expertise, which may not be available during emergencies. In this paper, we support POCUS by developing classifiers that can aid medical professionals by diagnosing whether or not a patient has pneumothorax. We decomposed the task into multiple steps, using YOLOv4 to extract relevant regions of the video and a 3D sparse coding model to represent video features. Given the difficulty in acquiring positive training videos, we trained a small-data classifier with a maximum of 15 positive and 32 negative examples. To counteract this limitation, we leveraged subject matter expert (SME) knowledge to limit the hypothesis space, thus reducing the cost of data collection. We present results using two lung ultrasound datasets and demonstrate that our model is capable of achieving performance on par with SMEs in pneumothorax identification. We then developed an iOS application that runs our full system in less than 4 seconds on an iPad Pro, and less than 8 seconds on an iPhone 13 Pro, labeling key regions in the lung sonogram to provide interpretable diagnoses.

CVDec 29, 2025Code
HY-Motion 1.0: Scaling Flow Matching Models for Text-To-Motion Generation

Yuxin Wen, Qing Shuai, Di Kang et al.

We present HY-Motion 1.0, a series of state-of-the-art, large-scale, motion generation models capable of generating 3D human motions from textual descriptions. HY-Motion 1.0 represents the first successful attempt to scale up Diffusion Transformer (DiT)-based flow matching models to the billion-parameter scale within the motion generation domain, delivering instruction-following capabilities that significantly outperform current open-source benchmarks. Uniquely, we introduce a comprehensive, full-stage training paradigm -- including large-scale pretraining on over 3,000 hours of motion data, high-quality fine-tuning on 400 hours of curated data, and reinforcement learning from both human feedback and reward models -- to ensure precise alignment with the text instruction and high motion quality. This framework is supported by our meticulous data processing pipeline, which performs rigorous motion cleaning and captioning. Consequently, our model achieves the most extensive coverage, spanning over 200 motion categories across 6 major classes. We release HY-Motion 1.0 to the open-source community to foster future research and accelerate the transition of 3D human motion generation models towards commercial maturity.

AIMay 24, 2022
Do it Like the Doctor: How We Can Design a Model That Uses Domain Knowledge to Diagnose Pneumothorax

Glen Smith, Qiao Zhang, Christopher MacLellan · gatech

Computer-aided diagnosis for medical imaging is a well-studied field that aims to provide real-time decision support systems for physicians. These systems attempt to detect and diagnose a plethora of medical conditions across a variety of image diagnostic technologies including ultrasound, x-ray, MRI, and CT. When designing AI models for these systems, we are often limited by little training data, and for rare medical conditions, positive examples are difficult to obtain. These issues often cause models to perform poorly, so we needed a way to design an AI model in light of these limitations. Thus, our approach was to incorporate expert domain knowledge into the design of an AI model. We conducted two qualitative think-aloud studies with doctors trained in the interpretation of lung ultrasound diagnosis to extract relevant domain knowledge for the condition Pneumothorax. We extracted knowledge of key features and procedures used to make a diagnosis. With this knowledge, we employed knowledge engineering concepts to make recommendations for an AI model design to automatically diagnose Pneumothorax.

30.1CRMar 14
SecDTD: Dynamic Token Drop for Secure Transformers Inference

Yifei Cai, Zhuoran Li, Yizhou Feng et al.

The rapid adoption of Transformer-based AI has been driven by accessible models such as ChatGPT, which provide API-based services for developers and businesses. However, as these online inference services increasingly handle sensitive inputs, privacy concerns have emerged as a significant challenge. To address this, secure inference frameworks have been proposed, but their high computational and communication overhead often limit practical deployment. In plaintext settings, token drop is an effective technique for reducing inference cost; however, our analysis reveals that directly applying such methods to ciphertext scenarios is suboptimal due to distinct cost distributions in secure computation. We propose SecDTD, a dynamic token drop scheme tailored for secure Transformer inference. SecDTD advances token drop by shifting the dropping to earlier inference stages, effectively reducing the cost of key components such as Softmax. To support this, we introduce two core techniques. Max-Centric Normalization (MCN): A novel, Softmax-independent scoring method that enables early token drop with minimal overhead and improved normalization, supporting more aggressive dropping without accuracy loss. OMSel: A faster, oblivious median selection protocol that securely identifies the median of importance scores to support token drop. Compared to existing sorting-based methods, OMSel achieves a 16.9$\times$ speedup while maintaining security, obliviousness and randomness. We evaluate SecDTD through 48 experiments across eight GLUE datasets under various network settings using the BOLT and BumbleBee frameworks. SecDTD achieves 4.47 times end-to-end inference acceleration without degradation in accuracy.

CLMar 8, 2024
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei et al. · deepmind, mila

In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.

CLJul 7, 2025
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann et al. · amazon-science, baidu

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

88.0CRApr 29
PRAG End-to-End Privacy-Preserving Retrieval-Augmented Generation

Zhijun Li, Minghui Xu, Huayi Qi et al.

Retrieval-Augmented Generation (RAG) is essential for enhancing Large Language Models (LLMs) with external knowledge, but its reliance on cloud environments exposes sensitive data to privacy risks. Existing privacy-preserving solutions often sacrifice retrieval quality due to noise injection or only provide partial encryption. We propose PRAG, an end-to-end privacy-preserving RAG system that achieves end-to-end confidentiality for both documents and queries without sacrificing the scalability of cloud-hosted RAG. PRAG features a dual-mode architecture: a non-interactive PRAG-I utilizes homomorphic-friendly approximations for low-latency retrieval, while an interactive PRAG-II leverages client assistance to match the accuracy of non-private RAG. To ensure robust semantic ordering, we introduce Operation-Error Estimation (OEE), a mechanism that stabilizes ranking against homomorphic noise. Experiments on large-scale datasets demonstrate that PRAG achieves competitive recall (72.45%-74.45%), practical retrieval latency, and strong resilience against graph reconstruction attacks while maintaining end-to-end confidentiality. This work confirms the feasibility of secure, high-performance RAG at scale.

LGMay 24, 2024
Comet: A Communication-efficient and Performant Approximation for Private Transformer Inference

Xiangrui Xu, Qiao Zhang, Rui Ning et al.

The prevalent use of Transformer-like models, exemplified by ChatGPT in modern language processing applications, underscores the critical need for enabling private inference essential for many cloud-based services reliant on such models. However, current privacy-preserving frameworks impose significant communication burden, especially for non-linear computation in Transformer model. In this paper, we introduce a novel plug-in method Comet to effectively reduce the communication cost without compromising the inference performance. We second introduce an efficient approximation method to eliminate the heavy communication in finding good initial approximation. We evaluate our Comet on Bert and RoBERTa models with GLUE benchmark datasets, showing up to 3.9$\times$ less communication and 3.5$\times$ speedups while keep competitive model performance compared to the prior art.

LGFeb 6, 2025
On the Expressive Power of Subgraph Graph Neural Networks for Graphs with Bounded Cycles

Ziang Chen, Qiao Zhang, Runzhong Wang

Graph neural networks (GNNs) have been widely used in graph-related contexts. It is known that the separation power of GNNs is equivalent to that of the Weisfeiler-Lehman (WL) test; hence, GNNs are imperfect at identifying all non-isomorphic graphs, which severely limits their expressive power. This work investigates $k$-hop subgraph GNNs that aggregate information from neighbors with distances up to $k$ and incorporate the subgraph structure. We prove that under appropriate assumptions, the $k$-hop subgraph GNNs can approximate any permutation-invariant/equivariant continuous function over graphs without cycles of length greater than $2k+1$ within any error tolerance. We also provide an extension to $k$-hop GNNs without incorporating the subgraph structure. Our numerical experiments on established benchmarks and novel architectures validate our theory on the relationship between the information aggregation distance and the cycle size.

IVMar 4, 2024
Interpretable Models for Detecting and Monitoring Elevated Intracranial Pressure

Darryl Hannan, Steven C. Nesbit, Ximing Wen et al.

Detecting elevated intracranial pressure (ICP) is crucial in diagnosing and managing various neurological conditions. These fluctuations in pressure are transmitted to the optic nerve sheath (ONS), resulting in changes to its diameter, which can then be detected using ultrasound imaging devices. However, interpreting sonographic images of the ONS can be challenging. In this work, we propose two systems that actively monitor the ONS diameter throughout an ultrasound video and make a final prediction as to whether ICP is elevated. To construct our systems, we leverage subject matter expert (SME) guidance, structuring our processing pipeline according to their collection procedure, while also prioritizing interpretability and computational efficiency. We conduct a number of experiments, demonstrating that our proposed systems are able to outperform various baselines. One of our SMEs then manually validates our top system's performance, lending further credibility to our approach while demonstrating its potential utility in a clinical setting.

62.9CRMar 13
Almost-Free Queue Jumping for Prior Inputs in Private Neural Inference

Qiao Zhang, Minghui Xu, Tingchuang Zhang et al.

Privacy-Preserving Machine Learning as a Service (PP-MLaaS) enables secure neural network inference by integrating cryptographic primitives such as homomorphic encryption (HE) and multi-party computation (MPC), protecting both client data and server models. Recent mixed-primitive frameworks have significantly improved inference efficiency, yet they process batched inputs sequentially, offering little flexibility for prioritizing urgent requests. Naïve queue jumping introduces considerable computational and communication overhead, increasing non-negligible latency for in-queue inputs. We initiate the study of privacy-preserving queue jumping in batched inference and propose PrivQJ, a novel framework that enables efficient priority handling without degrading overall system performance. PrivQJ exploits shared computation across inputs via in-processing slot recycling, allowing prior inputs to be piggybacked onto ongoing batch computation with almost no additional cryptographic cost. Both theoretical analysis and experimental results demonstrate over an order-of-magnitude reduction in overhead compared to state-of-the-art PP-MLaaS systems.

CLDec 19, 2023
Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud et al.

This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.

CLMay 17, 2023
PaLM 2 Technical Report

Rohan Anil, Andrew M. Dai, Orhan Firat et al.

We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction. PaLM 2 demonstrates robust reasoning capabilities exemplified by large improvements over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable performance on a suite of responsible AI evaluations, and enables inference-time control over toxicity without additional overhead or impact on other capabilities. Overall, PaLM 2 achieves state-of-the-art performance across a diverse set of tasks and capabilities. When discussing the PaLM 2 family, it is important to distinguish between pre-trained models (of various sizes), fine-tuned variants of these models, and the user-facing products that use these models. In particular, user-facing products typically include additional pre- and post-processing steps. Additionally, the underlying models may evolve over time. Therefore, one should not expect the performance of user-facing products to exactly match the results reported in this report.

CRMay 5, 2021
GALA: Greedy ComputAtion for Linear Algebra in Privacy-Preserved Neural Networks

Qiao Zhang, Chunsheng Xin, Hongyi Wu

Machine Learning as a Service (MLaaS) is enabling a wide range of smart applications on end devices. However, privacy-preserved computation is still expensive. Our investigation has found that the most time-consuming component of the HE-based linear computation is a series of Permutation (Perm) operations that are imperative for dot product and convolution in privacy-preserved MLaaS. To this end, we propose GALA: Greedy computAtion for Linear Algebra in privacy-preserved neural networks, which views the HE-based linear computation as a series of Homomorphic Add, Mult and Perm operations and chooses the least expensive operation in each linear computation step to reduce the overall cost. GALA makes the following contributions: (1) It introduces a row-wise weight matrix encoding and combines the share generation that is needed for the GC-based nonlinear computation, to reduce the Perm operations for the dot product; (2) It designs a first-Add-second-Perm approach (named kernel grouping) to reduce Perm operations for convolution. As such, GALA efficiently reduces the cost for the HE-based linear computation, which is a critical building block in almost all of the recent frameworks for privacy-preserved neural networks, including GAZELLE (Usenix Security'18), DELPHI (Usenix Security'20), and CrypTFlow2 (CCS'20). With its deep optimization of the HE-based linear computation, GALA can be a plug-and-play module integrated into these systems to further boost their efficiency. Our experiments show that it achieves a significant speedup up to 700x for the dot product and 14x for the convolution computation under different data dimensions. Meanwhile, GALA demonstrates an encouraging runtime boost by 2.5x, 2.7x, 3.2x, 8.3x, 7.7x, and 7.5x over GAZELLE and 6.5x, 6x, 5.7x, 4.5x, 4.2x, and 4.1x over CrypTFlow2, on AlexNet, VGG, ResNet-18, ResNet-50, ResNet-101, and ResNet-152, respectively.

LGNov 12, 2019
CHEETAH: An Ultra-Fast, Approximation-Free, and Privacy-Preserved Neural Network Framework based on Joint Obscure Linear and Nonlinear Computations

Qiao Zhang, Cong Wang, Chunsheng Xin et al.

Machine Learning as a Service (MLaaS) is enabling a wide range of smart applications on end devices. However, such convenience comes with a cost of privacy because users have to upload their private data to the cloud. This research aims to provide effective and efficient MLaaS such that the cloud server learns nothing about user data and the users cannot infer the proprietary model parameters owned by the server. This work makes the following contributions. First, it unveils the fundamental performance bottleneck of existing schemes due to the heavy permutations in computing linear transformation and the use of communication intensive Garbled Circuits for nonlinear transformation. Second, it introduces an ultra-fast secure MLaaS framework, CHEETAH, which features a carefully crafted secret sharing scheme that runs significantly faster than existing schemes without accuracy loss. Third, CHEETAH is evaluated on the benchmark of well-known, practical deep networks such as AlexNet and VGG-16 on the MNIST and ImageNet datasets. The results demonstrate more than 100x speedup over the fastest GAZELLE (Usenix Security'18), 2000x speedup over MiniONN (ACM CCS'17) and five orders of magnitude speedup over CryptoNets (ICML'16). This significant speedup enables a wide range of practical applications based on privacy-preserved deep neural networks.