Zhe Zhou

h-index27

18papers

860citations

Novelty59%

AI Score56

Ranked #21,610 of 201,326 authors (top 11%)#312 in CR (top 4%)

18 Papers

98.9ARMay 7

MoE-Hub: Taming Software Complexity for Seamless MoE Overlap with Hardware-Accelerated Communication on Multi-GPU Systems

Zhuoshan Zhou, Chen Zhang, Shuyi Zhang et al.

The Mixture-of-Experts (MoE) architecture is crucial for scaling large language models, but its scalability is severely limited by inter-GPU communication bottlenecks in multi-GPU systems. Although overlapping communication with computation is a widely recognized optimization, its effective deployment still remains challenging, both in terms of performance and programmability. In this work, we identify the root cause as a fundamental abstraction mismatch between MoE's dynamic, irregular token-to-expert mapping and the static, address-centric communication model of modern GPUs, which necessitates a complex software mediation phase to resolve addresses before data transfers, limiting performance and software flexibility. To resolve this, we propose MoE-Hub, a hardware-software co-design that introduces a destination-agnostic communication paradigm. MoE-Hub decouples data transmission from address management, allowing producers to send data immediately after routing using only a logical destination, while address allocation and data-flow orchestration are handled transparently by lightweight hardware in the GPU hub. By hardware-accelerating the entire communication control plane, MoE-Hub enables seamless and transparent overlap. Our evaluation shows that MoE-Hub achieves 1.40x-3.08x per-layer and 1.21x-1.98x end-to-end speedup over state-of-the-art systems.

CLFeb 26, 2024Code

LLM Inference Unveiled: Survey and Roofline Model Insights

Zhihang Yuan, Yuzhang Shang, Yang Zhou et al.

The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges. Although the field has expanded and is vibrant, there hasn't been a concise framework that analyzes the various methods of LLM Inference to provide a clear understanding of this domain. Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on roofline model for systematic analysis of LLM inference techniques. This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems, such as why LLMs are memory-bound, how much memory and computation they need, and how to choose the right hardware. We systematically collate the latest advancements in efficient LLM inference, covering crucial areas such as model compression (e.g., Knowledge Distillation and Quantization), algorithm improvements (e.g., Early Exit and Mixture-of-Expert), and both hardware and system-level enhancements. Our survey stands out by analyzing these methods with roofline model, helping us understand their impact on memory access and computation. This distinctive approach not only showcases the current research landscape but also delivers valuable insights for practical implementation, positioning our work as an indispensable resource for researchers new to the field as well as for those seeking to deepen their understanding of efficient LLM deployment. The analyze tool, LLM-Viewer, is open-sourced.

94.6ARMay 7

Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems

Chen Zhang, Qijun Zhang, Zhuoshan Zhou et al.

Tensor parallelism (TP) in large-scale LLM inference and training introduces frequent collective operations that dominate inter-GPU communication. While in-switch computing, exemplified by NVLink SHARP (NVLS), accelerates collective operations by reducing redundant data transfer, its communication-centric design philosophy introduces the mismatch between its communication mode and the memory semantic requirement of LLM's computation kernel. Such a mismatch isolates the compute and communication phases, resulting in underutilized resources and limited overlap in multi-GPU systems. To address the limitation, we propose CAIS, the first Compute-Aware In-Switch computing framework that aligns communication modes with computation's memory semantics requirement. CAIS consists of three integral techniques: (1) compute-aware ISA and microarchitecture extension to enable compute-aware in-switch computing. (2) merging-aware TB (Thread Block) coordination to improve the temporal alignment for efficient request merging. (3) graph-level dataflow optimizer to achieve a tight cross-kernel overlap. Evaluations on LLM workloads show that CAIS achieves 1.38$\times$ average end-to-end training speedup over the SOTA NVLS-enabled solution, and 1.61$\times$ over T3, the SOTA compute-communicate overlap solutions but do not leverage NVLS, demonstrating its effectiveness in accelerating TP on multi-GPU systems.

87.1ARMay 7

Accelerating MoE with Dynamic In-Switch Computing on Multi-GPUs

Qijun Zhang, Chen Zhang, Zhuoshan Zhou et al.

Mixture-of-Experts (MoE) has been adopted by many leading large models to reduce computational requirements. However, frequent inter-GPU communication in MoE expert parallelism (EP) becomes a performance challenge. We observe substantial redundant inter-GPU data transfers in MoE that can be potentially addressed by in-switch computing. Unfortunately, the existing solution, NVLink SHARP (NVLS), can only support static collectives with regular patterns, incapable of dynamic communication with irregular patterns in MoE. To bridge the functionality gap, we propose DySHARP, an integral dynamic in-switch computing solution to accelerate MoE, encompassing both communication primitives and communication-aware scheduling: 1) Dynamic multimem addressing co-designs ISA, architecture, and runtime, as a dynamic extension to NVLS, reducing redundant traffic. However, the resulting traffic reduction is inherently asymmetric between two directions, preventing it from directly translating into speedup. 2) Token-centric kernel fusion deeply fuses the dispatch-computation-combine pipeline, resolving this asymmetry to translate traffic reduction into actual speedup. Compared with the state-of-the-art solution, DySHARP achieves up to 1.79$\times$ speedup.

CLDec 18, 2023

Training With "Paraphrasing the Original Text" Teaches LLM to Better Retrieve in Long-context Tasks

Yijiong Yu, Yongfeng Huang, Zhixiao Qi et al.

As Large Language Models (LLMs) continue to evolve, more are being designed to handle long-context inputs. Despite this advancement, most of them still face challenges in accurately handling long-context tasks, often showing the "lost in the middle" issue. We identify that insufficient retrieval capability is one of the important reasons for this issue. To tackle this challenge, we propose a novel approach to design training data for long-context tasks, aiming at augmenting LLMs' proficiency in extracting key information from long context. Specially, we incorporate an additional part named "paraphrasing the original text" when constructing the answer of training samples and then fine-tuning the model. Experimenting on LongBench and NaturalQuestions Multi-document-QA dataset with models of Llama and Qwen series, our method achieves an improvement of up to 8.48% and 4.48% in average scores, respectively, showing effectiveness in improving the model's performance on long-context tasks.

33.7PLApr 6

Trace-Guided Synthesis of Effectful Test Generators

Zhe Zhou, Ankush Desai, Benjamin Delaware et al.

Several recently proposed program logics have incorporated notions of underapproximation into their design, enabling them to reason about reachability rather than safety. In this paper, we explore how similar ideas can be integrated into an expressive type and effect system. We use the resulting underapproximate type specifications to guide the synthesis of test generators that probe the behavior of effectful black-box systems. A key novelty of our type language is its ability to capture underapproximate behaviors of effectful operations using symbolic traces that expose latent data and control dependencies, constraints that must be preserved by the test sequences the generator outputs. We implement this approach in a tool called Clouseau, and evaluate it on a diverse range of applications by integrating Clouseau's synthesized generators into property-based testing frameworks like QCheck and model-checking tools like P. In both settings, the generators synthesized by Clouseau are significantly more effective than the default testing strategy, and are competitive with state-of-the-art, handwritten solutions.

CVDec 11, 2021

On Adversarial Robustness of Point Cloud Semantic Segmentation

Jiacen Xu, Zhe Zhou, Boyuan Feng et al.

Recent research efforts on 3D point cloud semantic segmentation (PCSS) have achieved outstanding performance by adopting neural networks. However, the robustness of these complex models have not been systematically analyzed. Given that PCSS has been applied in many safety-critical applications like autonomous driving, it is important to fill this knowledge gap, especially, how these models are affected under adversarial samples. As such, we present a comparative study of PCSS robustness. First, we formally define the attacker's objective under performance degradation and object hiding. Then, we develop new attack by whether to bound the norm. We evaluate different attack options on two datasets and three PCSS models. We found all the models are vulnerable and attacking point color is more effective. With this study, we call the attention of the research community to develop new approaches to harden PCSS models.

LGNov 1, 2021

GNNear: Accelerating Full-Batch Training of Graph Neural Networks with Near-Memory Processing

Zhe Zhou, Cong Li, Xuechao Wei et al.

Recently, Graph Neural Networks (GNNs) have become state-of-the-art algorithms for analyzing non-euclidean graph data. However, to realize efficient GNN training is challenging, especially on large graphs. The reasons are many-folded: 1) GNN training incurs a substantial memory footprint. Full-batch training on large graphs even requires hundreds to thousands of gigabytes of memory. 2) GNN training involves both memory-intensive and computation-intensive operations, challenging current CPU/GPU platforms. 3) The irregularity of graphs can result in severe resource under-utilization and load-imbalance problems. This paper presents a GNNear accelerator to tackle these challenges. GNNear adopts a DIMM-based memory system to provide sufficient memory capacity. To match the heterogeneous nature of GNN training, we offload the memory-intensive Reduce operations to in-DIMM Near-Memory-Engines (NMEs), making full use of the high aggregated local bandwidth. We adopt a Centralized-Acceleration-Engine (CAE) to process the computation-intensive Update operations. We further propose several optimization strategies to deal with the irregularity of input graphs and improve GNNear's performance. Comprehensive evaluations on 16 GNN training tasks demonstrate that GNNear achieves 30.8$\times$/2.5$\times$ geomean speedup and 79.6$\times$/7.3$\times$(geomean) higher energy efficiency compared to Xeon E5-2698-v4 CPU and NVIDIA V100 GPU.

AROct 18, 2021

Energon: Towards Efficient Acceleration of Transformers Using Dynamic Sparse Attention

Zhe Zhou, Junlin Liu, Zhenyu Gu et al.

In recent years, transformer models have revolutionized Natural Language Processing (NLP) and shown promising performance on Computer Vision (CV) tasks. Despite their effectiveness, transformers' attention operations are hard to accelerate due to the complicated data movement and quadratic computational complexity, prohibiting the real-time inference on resource-constrained edge-computing platforms. To tackle this challenge, we propose Energon, an algorithm-architecture co-design approach that accelerates various transformers using dynamic sparse attention. With the observation that attention results only depend on a few important query-key pairs, we propose a Mix-Precision Multi-Round Filtering (MP-MRF) algorithm to dynamically identify such pairs at runtime. We adopt low bitwidth in each filtering round and only use high-precision tensors in the attention stage to reduce overall complexity. By this means, we significantly mitigate the computational cost with negligible accuracy loss. To enable such an algorithm with lower latency and better energy efficiency, we also propose an Energon co-processor architecture. Elaborated pipelines and specialized optimizations jointly boost the performance and reduce power consumption. Extensive experiments on both NLP and CV benchmarks demonstrate that Energon achieves $168\times$ and $8.7\times$ geo-mean speedup and up to $10^4\times$ and $10^3\times$ energy reduction compared with Intel Xeon 5220 CPU and NVIDIA V100 GPU. Compared to state-of-the-art attention accelerators SpAtten and $A^3$, Energon also achieves $1.7\times, 1.25\times$ speedup and $1.6 \times, 1.5\times $ higher energy efficiency.

AIApr 13, 2021

BlockGNN: Towards Efficient GNN Acceleration Using Block-Circulant Weight Matrices

Zhe Zhou, Bizhao Shi, Zhe Zhang et al.

In recent years, Graph Neural Networks (GNNs) appear to be state-of-the-art algorithms for analyzing non-euclidean graph data. By applying deep-learning to extract high-level representations from graph structures, GNNs achieve extraordinary accuracy and great generalization ability in various tasks. However, with the ever-increasing graph sizes, more and more complicated GNN layers, and higher feature dimensions, the computational complexity of GNNs grows exponentially. How to inference GNNs in real time has become a challenging problem, especially for some resource-limited edge-computing platforms. To tackle this challenge, we propose BlockGNN, a software-hardware co-design approach to realize efficient GNN acceleration. At the algorithm level, we propose to leverage block-circulant weight matrices to greatly reduce the complexity of various GNN models. At the hardware design level, we propose a pipelined CirCore architecture, which supports efficient block-circulant matrices computation. Basing on CirCore, we present a novel BlockGNN accelerator to compute various GNNs with low latency. Moreover, to determine the optimal configurations for diverse deployed tasks, we also introduce a performance and resource model that helps choose the optimal hardware parameters automatically. Comprehensive experiments on the ZC706 FPGA platform demonstrate that on various GNN tasks, BlockGNN achieves up to $8.3\times$ speedup compared to the baseline HyGCN architecture and $111.9\times$ energy reduction compared to the Intel Xeon CPU platform.

CRMar 8, 2021

Volcano: Stateless Cache Side-channel Attack by Exploiting Mesh Interconnect

Junpeng Wan, Yanxiang Bi, Zhe Zhou et al.

Cache side-channel attacks lead to severe security threats to the settings that a CPU is shared across users, e.g., in the cloud. The existing attacks rely on sensing the micro-architectural state changes made by victims, and this assumption can be invalidated by combining spatial (\eg, Intel CAT) and temporal isolation (\eg, time protection). In this work, we advance the state of cache side-channel attacks by showing stateless cache side-channel attacks that cannot be defeated by both spatial and temporal isolation. This side-channel exploits the timing difference resulted from interconnect congestion. Specifically, to complete cache transactions, for Intel CPUs, cache lines would travel across cores via the CPU mesh interconnect. Nonetheless, the mesh links are shared by all cores, and cache isolation does not segregate the traffic. An attacker can generate interconnect traffic to contend with the victim's on a mesh link, hoping that extra delay will be measured. With the variant delays, the attacker can deduce the memory access pattern of a victim program, and infer its sensitive data. Based on this idea, we implement Volcano and test it against the existing RSA implementations of JDK. We found the RSA private key used by a victim process can be partially recovered. In the end, we propose a few directions for defense and call for the attention of the security community.

CRJan 28, 2019

Do Not Return Similarity: Face Recovery with Distance

Mingtian Tan, Zhe Zhou

Machine Learning (ML) already has been integrated into all kinds of systems, helping developers to solve problems with even higher accuracy than human beings. However, when integrating ML models into a system, developers may accidentally take not enough care of the outputs of ML models, mainly because of their unfamiliarity with ML and AI, resulting in severe consequences like hurting data owners' privacy. In this work, we focus on understanding the risks of abusing embeddings of ML models, an important and popular way of using ML. To show the consequence, we reveal several kinds of channels in which embeddings are accidentally leaked. As our study shows, a face verification system deployed by a government organization leaking only distance to authentic users allows an attacker to exactly recover the embedding of the verifier's pre-installed photo. Further, as we discovered, with the leaked embedding, attackers can easily recover the input photo with negligible quality losses, indicating devastating consequences to users' privacy. This is achieved with our devised GAN-like structure model, which showed 93.65% success rate on popular face embedding model under black box assumption.

CRMar 13, 2018

Invisible Mask: Practical Attacks on Face Recognition with Infrared

Zhe Zhou, Di Tang, Xiaofeng Wang et al.

Accurate face recognition techniques make a series of critical applications possible: policemen could employ it to retrieve criminals' faces from surveillance video streams; cross boarder travelers could pass a face authentication inspection line without the involvement of officers. Nonetheless, when public security heavily relies on such intelligent systems, the designers should deliberately consider the emerging attacks aiming at misleading those systems employing face recognition. We propose a kind of brand new attack against face recognition systems, which is realized by illuminating the subject using infrared according to the adversarial examples worked out by our algorithm, thus face recognition systems can be bypassed or misled while simultaneously the infrared perturbations cannot be observed by raw eyes. Through launching this kind of attack, an attacker not only can dodge surveillance cameras. More importantly, he can impersonate his target victim and pass the face authentication system, if only the victim's photo is acquired by the attacker. Again, the attack is totally unobservable by nearby people, because not only the light is invisible, but also the device we made to launch the attack is small enough. According to our study on a large dataset, attackers have a very high success rate with a over 70\% success rate for finding such an adversarial example that can be implemented by infrared. To the best of our knowledge, our work is the first one to shed light on the severity of threat resulted from infrared adversarial examples against face recognition.

CVJan 6, 2018

Face Flashing: a Secure Liveness Detection Protocol based on Light Reflections

Di Tang, Zhe Zhou, Yinqian Zhang et al.

Face authentication systems are becoming increasingly prevalent, especially with the rapid development of Deep Learning technologies. However, human facial information is easy to be captured and reproduced, which makes face authentication systems vulnerable to various attacks. Liveness detection is an important defense technique to prevent such attacks, but existing solutions did not provide clear and strong security guarantees, especially in terms of time. To overcome these limitations, we propose a new liveness detection protocol called Face Flashing that significantly increases the bar for launching successful attacks on face authentication systems. By randomly flashing well-designed pictures on a screen and analyzing the reflected light, our protocol has leveraged physical characteristics of human faces: reflection processing at the speed of light, unique textual features, and uneven 3D shapes. Cooperating with working mechanism of the screen and digital cameras, our protocol is able to detect subtle traces left by an attacking process. To demonstrate the effectiveness of Face Flashing, we implemented a prototype and performed thorough evaluations with large data set collected from real-world scenarios. The results show that our Timing Verification can effectively detect the time gap between legitimate authentications and malicious cases. Our Face Verification can also differentiate 2D plane from 3D objects accurately. The overall accuracy of our liveness detection system is 98.8\%, and its robustness was evaluated in different scenarios. In the worst case, our system's accuracy decreased to a still-high 97.3\%.

CRMay 21, 2016

Vulnerable GPU Memory Management: Towards Recovering Raw Data from GPU

Zhe Zhou, Wenrui Diao, Xiangyu Liu et al.

In this paper, we present that security threats coming with existing GPU memory management strategy are overlooked, which opens a back door for adversaries to freely break the memory isolation: they enable adversaries without any privilege in a computer to recover the raw memory data left by previous processes directly. More importantly, such attacks can work on not only normal multi-user operating systems, but also cloud computing platforms. To demonstrate the seriousness of such attacks, we recovered original data directly from GPU memory residues left by exited commodity applications, including Google Chrome, Adobe Reader, GIMP, Matlab. The results show that, because of the vulnerable memory management strategy, commodity applications in our experiments are all affected.

CRJul 21, 2014

An Empirical Study on Android for Saving Non-shared Data on Public Storage

Xiangyu Liu, Zhe Zhou, Wenrui Diao et al.

With millions of apps that can be downloaded from official or third-party market, Android has become one of the most popular mobile platforms today. These apps help people in all kinds of ways and thus have access to lots of user's data that in general fall into three categories: sensitive data, data to be shared with other apps, and non-sensitive data not to be shared with others. For the first and second type of data, Android has provided very good storage models: an app's private sensitive data are saved to its private folder that can only be access by the app itself, and the data to be shared are saved to public storage (either the external SD card or the emulated SD card area on internal FLASH memory). But for the last type, i.e., an app's non-sensitive and non-shared data, there is a big problem in Android's current storage model which essentially encourages an app to save its non-sensitive data to shared public storage that can be accessed by other apps. At first glance, it seems no problem to do so, as those data are non-sensitive after all, but it implicitly assumes that app developers could correctly identify all sensitive data and prevent all possible information leakage from private-but-non-sensitive data. In this paper, we will demonstrate that this is an invalid assumption with a thorough survey on information leaks of those apps that had followed Android's recommended storage model for non-sensitive data. Our studies showed that highly sensitive information from billions of users can be easily hacked by exploiting the mentioned problematic storage model. Although our empirical studies are based on a limited set of apps, the identified problems are never isolated or accidental bugs of those apps being investigated. On the contrary, the problem is rooted from the vulnerable storage model recommended by Android. To mitigate the threat, we also propose a defense framework.

CRJul 18, 2014

Your Voice Assistant is Mine: How to Abuse Speakers to Steal Information and Control Your Phone

Wenrui Diao, Xiangyu Liu, Zhe Zhou et al.

Previous research about sensor based attacks on Android platform focused mainly on accessing or controlling over sensitive device components, such as camera, microphone and GPS. These approaches get data from sensors directly and need corresponding sensor invoking permissions. This paper presents a novel approach (GVS-Attack) to launch permission bypassing attacks from a zero permission Android application (VoicEmployer) through the speaker. The idea of GVS-Attack utilizes an Android system built-in voice assistant module -- Google Voice Search. Through Android Intent mechanism, VoicEmployer triggers Google Voice Search to the foreground, and then plays prepared audio files (like "call number 1234 5678") in the background. Google Voice Search can recognize this voice command and execute corresponding operations. With ingenious designs, our GVS-Attack can forge SMS/Email, access privacy information, transmit sensitive data and achieve remote control without any permission. Also we found a vulnerability of status checking in Google Search app, which can be utilized by GVS-Attack to dial arbitrary numbers even when the phone is securely locked with password. A prototype of VoicEmployer has been implemented to demonstrate the feasibility of GVS-Attack in real world. In theory, nearly all Android devices equipped with Google Services Framework can be affected by GVS-Attack. This study may inspire application developers and researchers rethink that zero permission doesn't mean safety and the speaker can be treated as a new attack surface.

CRJul 3, 2014

Acoustic Fingerprinting Revisited: Generate Stable Device ID Stealthy with Inaudible Sound

Zhe Zhou, Wenrui Diao, Xiangyu Liu et al.

The popularity of mobile device has made people's lives more convenient, but threatened people's privacy at the same time. As end users are becoming more and more concerned on the protection of their private information, it is even harder to track a specific user using conventional technologies. For example, cookies might be cleared by users regularly. Apple has stopped apps accessing UDIDs, and Android phones use some special permission to protect IMEI code. To address this challenge, some recent studies have worked on tracing smart phones using the hardware features resulted from the imperfect manufacturing process. These works have demonstrated that different devices can be differentiated to each other. However, it still has a long way to go in order to replace cookie and be deployed in real world scenarios, especially in terms of properties like uniqueness, robustness, etc. In this paper, we presented a novel method to generate stable and unique device ID stealthy for smartphones by exploiting the frequency response of the speaker. With carefully selected audio frequencies and special sound wave patterns, we can reduce the impacts of non-linear effects and noises, and keep our feature extraction process un-noticeable to users. The extracted feature is not only very stable for a given smart phone speaker, but also unique to that phone. The feature contains rich information that is equivalent to around 40 bits of entropy, which is enough to identify billions of different smart phones of the same model. We have built a prototype to evaluate our method, and the results show that the generated device ID can be used as a replacement of cookie.