NISep 8, 2023
LLMCad: Fast and Scalable On-device Large Language Model InferenceDaliang Xu, Wangsong Yin, Xin Jin et al.
Generative tasks, such as text generation and question answering, hold a crucial position in the realm of mobile applications. Due to their sensitivity to privacy concerns, there is a growing demand for their execution directly on mobile devices. Currently, the execution of these generative tasks heavily depends on Large Language Models (LLMs). Nevertheless, the limited memory capacity of these devices presents a formidable challenge to the scalability of such models. In our research, we introduce LLMCad, an innovative on-device inference engine specifically designed for efficient generative Natural Language Processing (NLP) tasks. The core idea behind LLMCad revolves around model collaboration: a compact LLM, residing in memory, takes charge of generating the most straightforward tokens, while a high-precision LLM steps in to validate these tokens and rectify any identified errors. LLMCad incorporates three novel techniques: (1) Instead of generating candidate tokens in a sequential manner, LLMCad employs the smaller LLM to construct a token tree, encompassing a wider range of plausible token pathways. Subsequently, the larger LLM can efficiently validate all of these pathways simultaneously. (2) It employs a self-adjusting fallback strategy, swiftly initiating the verification process whenever the smaller LLM generates an erroneous token. (3) To ensure a continuous flow of token generation, LLMCad speculatively generates tokens during the verification process by implementing a compute-IO pipeline. Through an extensive series of experiments, LLMCad showcases an impressive token generation speed, achieving rates up to 9.3x faster than existing inference engines.
AIJul 8, 2024
Fast On-device LLM Inference with NPUsDaliang Xu, Hao Zhang, Liming Yang et al.
On-device inference for Large Language Models (LLMs), driven by increasing privacy concerns and advancements of mobile-sized models, has gained significant interest. However, even mobile-sized LLMs (e.g., Gemma-2B) encounter unacceptably high inference latency, often bottlenecked by the prefill stage in tasks like screen UI understanding. We present llm.npu, the first LLM inference system utilizing on-device Neural Processing Unit (NPU) offloading to reduce prefill latency. llm.npu enhances NPU offloading efficiency by re-constructing the prompt and model in three levels: (1) At prompt level, it divides variable-length prompts into multiple fixed-sized chunks while maintaining data dependencies; (2) At tensor level, it identifies and extracts significant outliers to run on the CPU/GPU in parallel with minimal overhead; (3) At block level, it schedules Transformer blocks in an out-of-order manner to the CPU/GPU and NPU based on their hardware affinity and sensitivity to accuracy. Compared to competitive baselines, llm.npu achieves 22.4x faster prefill speed and 30.7$\times$ energy savings on average, and up to 32.8x speedup in an end-to-end real-world application. For the first time, llm.npu achieves more than 1,000 tokens/sec prefilling for a billion-sized model.
DCSep 8, 2024
Elastic On-Device LLM ServiceWangsong Yin, Rongjie Yi, Daliang Xu et al.
On-device Large Language Models (LLMs) are transforming mobile AI, catalyzing applications like UI automation without privacy concerns. Nowadays the common practice is to deploy a single yet powerful LLM as a general task solver for multiple requests. We identify a key system challenge in this paradigm: current LLMs lack the elasticity to serve requests that have diversified Service-Level Objectives (SLOs) on inference latency. To tackle this, we present \sys, an on-device LLM service that elasticizes both the model and the prompt dimension of a full LLM. It incorporates (1) a one-shot neuron-reordering method, which leverages the intrinsic permutation consistency in transformer models to generate high-quality elasticized sub-models with minimal runtime switching overhead; (2) a dual-head tiny language model, which efficiently and effectively refines the prompt and orchestrates the elastification between model and prompt. We implement such an elastic on-device LLM service on multiple COTS smartphones, and evaluate \sys on both standalone NLP/mobile-agent datasets and end-to-end synthesized traces. On diverse SLOs, \sys outperforms 7 strong baselines in (absolute) accuracy by up to 14.83\% and 10.45\% on average, with <1\% TTFT switching overhead, on-par memory consumption and <100 offline GPU hours.
CYAug 5, 2023
Bias Behind the Wheel: Fairness Testing of Autonomous Driving SystemsXinyue Li, Zhenpeng Chen, Jie M. Zhang et al.
This paper conducts fairness testing of automated pedestrian detection, a crucial but under-explored issue in autonomous driving systems. We evaluate eight state-of-the-art deep learning-based pedestrian detectors across demographic groups on large-scale real-world datasets. To enable thorough fairness testing, we provide extensive annotations for the datasets, resulting in 8,311 images with 16,070 gender labels, 20,115 age labels, and 3,513 skin tone labels. Our findings reveal significant fairness issues, particularly related to age. The proportion of undetected children is 20.14% higher compared to adults. Furthermore, we explore how various driving scenarios affect the fairness of pedestrian detectors. We find that pedestrian detectors demonstrate significant gender biases during night time, potentially exacerbating the prevalent societal issue of female safety concerns during nighttime out. Moreover, we observe that pedestrian detectors can demonstrate both enhanced fairness and superior performance under specific driving conditions, which challenges the fairness-performance trade-off theory widely acknowledged in the fairness literature. We publicly release the code, data, and results to support future research on fairness in autonomous driving.
96.8ROMar 18
HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic AwarenessZihao Zheng, Zhihao Mao, Sicheng Tian et al.
Vision-Language-Action (VLA) Models have become the mainstream solution for robot control, but suffer from slow inference speeds. Speculative Decoding (SD) is a promising acceleration method which can be divided into two categories: drafter-based SD and retrieval-based SD. Existing methods fail to analyze the advantages and disadvantages of these two types of SD in VLA models, leading to their sole application or optimization. In this paper, we analyze the trajectory patterns of robots controlled by the VLA model and derive a key insight: the two types of SD should be used in a hybrid manner. However, achieving hybrid SD in VLA models poses several challenges: (1) draft rejection and persistent errors in retrieval-based SD; (2) difficulty in determining the hybrid boundary. To address these, we propose the HeiSD framework. We propose a retrieval-based SD optimization method in HeiSD,which contains a verify-skip mechanism and a sequence-wise relaxed acceptance strategy. Moreover, we proposed a kinematic-based fused metric in HeiSD to automatically determine the hybrid boundary. Experimental results demonstrate that HeiSD attains a speedup of up to 2.45x in simulation benchmarks and 2.06x~2.41x in real-world scenarios, while sustaining a high task success rate.
71.4SEApr 14
LLMs Are Not a Silver Bullet: A Case Study on Software FairnessXinyue Li, Sixuan Li, Ying Xiao et al.
Fairness is a critical requirement for human-related, high-stakes software systems, motivating extensive research on bias mitigation. Prior work has largely focused on tabular data settings using traditional Machine Learning (ML) methods. With the rapid rise of Large Language Models (LLMs), recent studies have begun to explore their use for bias mitigation in the same setting. However, it remains unclear whether LLM-based methods offer advantages over traditional ML methods, leaving software engineers without clear guidance for practical adoption. To address this gap, we present a large-scale study comparing state-of-the-art ML- and LLM-based bias mitigation methods. We find that ML-based methods consistently outperform LLM-based methods in both fairness and predictive performance, with even strong LLMs failing to surpass established ML baselines. To understand why prior LLM-based studies report favorable results, we analyze their evaluation settings and show that these gains are largely driven by artificially balanced test data rather than realistic imbalanced distributions. We further observe that existing LLM-based methods primarily rely on in-context learning and thus fail to leverage all available training data. Motivated by this, we explore supervised fine-tuning on the full training set and find that, while it achieves competitive results, its advantages over traditional ML methods remain limited. These findings suggest that LLMs are not a silver bullet for software fairness.
CLMay 30, 2025Code
FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning EvaluationJunyu Luo, Zhizhuo Kou, Liming Yang et al. · pku
Multimodal Large Language Models (MLLMs) have experienced rapid development in recent years. However, in the financial domain, there is a notable lack of effective and specialized multimodal evaluation datasets. To advance the development of MLLMs in the finance domain, we introduce FinMME, encompassing more than 11,000 high-quality financial research samples across 18 financial domains and 6 asset classes, featuring 10 major chart types and 21 subtypes. We ensure data quality through 20 annotators and carefully designed validation mechanisms. Additionally, we develop FinScore, an evaluation system incorporating hallucination penalties and multi-dimensional capability assessment to provide an unbiased evaluation. Extensive experimental results demonstrate that even state-of-the-art models like GPT-4o exhibit unsatisfactory performance on FinMME, highlighting its challenging nature. The benchmark exhibits high robustness with prediction variations under different prompts remaining below 1%, demonstrating superior reliability compared to existing datasets. Our dataset and evaluation protocol are available at https://huggingface.co/datasets/luojunyu/FinMME and https://github.com/luo-junyu/FinMME.
DCDec 10, 2025
WarmServe: Enabling One-for-Many GPU Prewarming for Multi-LLM ServingChiheng Lou, Sheng Qi, Rui Kang et al.
Deploying multiple models within shared GPU clusters is promising for improving resource efficiency in large language model (LLM) serving. Existing multi-LLM serving systems optimize GPU utilization at the cost of worse inference performance, especially time-to-first-token (TTFT). We identify the root cause of such compromise as their unawareness of future workload characteristics. In contrast, recent analysis on real-world traces has shown the high periodicity and long-term predictability of LLM serving workloads. We propose universal GPU workers to enable one-for-many GPU prewarming that loads models with knowledge of future workloads. Based on universal GPU workers, we design and build WarmServe, a multi-LLM serving system that (1) mitigates cluster-wide prewarming interference by adopting an evict-aware model placement strategy, (2) prepares universal GPU workers in advance by proactive prewarming, and (3) manages GPU memory with a zero-overhead memory switching mechanism. Evaluation under real-world datasets shows that WarmServe improves TTFT by up to 50.8$\times$ compared to the state-of-the-art autoscaling-based system, while being capable of serving up to 2.5$\times$ more requests compared to the GPU-sharing system.
MANov 26, 2025
BAMAS: Structuring Budget-Aware Multi-Agent SystemsLiming Yang, Junyu Luo, Xuanzhe Liu et al.
Large language model (LLM)-based multi-agent systems have emerged as a powerful paradigm for enabling autonomous agents to solve complex tasks. As these systems scale in complexity, cost becomes an important consideration for practical deployment. However, existing work rarely addresses how to structure multi-agent systems under explicit budget constraints. In this paper, we propose BAMAS, a novel approach for building multi-agent systems with budget awareness. BAMAS first selects an optimal set of LLMs by formulating and solving an Integer Linear Programming problem that balances performance and cost. It then determines how these LLMs should collaborate by leveraging a reinforcement learning-based method to select the interaction topology. Finally, the system is instantiated and executed based on the selected agents and their collaboration topology. We evaluate BAMAS on three representative tasks and compare it with state-of-the-art agent construction methods. Results show that BAMAS achieves comparable performance while reducing cost by up to 86%.
DCApr 15, 2024
LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence ParallelismBingyang Wu, Shengyu Liu, Yinmin Zhong et al.
The context window of large language models (LLMs) is rapidly increasing, leading to a huge variance in resource usage between different requests as well as between different phases of the same request. Restricted by static parallelism strategies, existing LLM serving systems cannot efficiently utilize the underlying resources to serve variable-length requests in different phases. To address this problem, we propose a new parallelism paradigm, elastic sequence parallelism (ESP), to elastically adapt to the variance between different requests and phases. Based on ESP, we design and build LoongServe, an LLM serving system that (1) improves computation efficiency by elastically adjusting the degree of parallelism in real-time, (2) improves communication efficiency by reducing key-value cache migration overhead and overlapping partial decoding communication with computation, and (3) improves GPU memory efficiency by reducing key-value cache fragmentation across instances. Our evaluation under diverse real-world datasets shows that LoongServe improves the maximum throughput by up to 3.85$\times$ compared to the chunked prefill and 5.81$\times$ compared to the prefill-decoding disaggregation.
DCApr 18, 2024
RAGCache: Efficient Knowledge Caching for Retrieval-Augmented GenerationChao Jin, Zili Zhang, Xuanlin Jiang et al.
Retrieval-Augmented Generation (RAG) has shown significant improvements in various natural language processing tasks by integrating the strengths of large language models (LLMs) and external knowledge databases. However, RAG introduces long sequence generation and leads to high computation and memory costs. We propose RAGCache, a novel multilevel dynamic caching system tailored for RAG. Our analysis benchmarks current RAG systems, pinpointing the performance bottleneck (i.e., long sequence due to knowledge injection) and optimization opportunities (i.e., caching knowledge's intermediate states). Based on these insights, we design RAGCache, which organizes the intermediate states of retrieved knowledge in a knowledge tree and caches them in the GPU and host memory hierarchy. RAGCache proposes a replacement policy that is aware of LLM inference characteristics and RAG retrieval patterns. It also dynamically overlaps the retrieval and inference steps to minimize the end-to-end latency. We implement RAGCache and evaluate it on vLLM, a state-of-the-art LLM inference system and Faiss, a state-of-the-art vector database. The experimental results show that RAGCache reduces the time to first token (TTFT) by up to 4x and improves the throughput by up to 2.1x compared to vLLM integrated with Faiss.
CYNov 1, 2024
Benchmarking Bias in Large Language Models during Role-PlayingXinyue Li, Zhenpeng Chen, Jie M. Zhang et al.
Large Language Models (LLMs) have become foundational in modern language-driven applications, profoundly influencing daily life. A critical technique in leveraging their potential is role-playing, where LLMs simulate diverse roles to enhance their real-world utility. However, while research has highlighted the presence of social biases in LLM outputs, it remains unclear whether and to what extent these biases emerge during role-playing scenarios. In this paper, we introduce BiasLens, a fairness testing framework designed to systematically expose biases in LLMs during role-playing. Our approach uses LLMs to generate 550 social roles across a comprehensive set of 11 demographic attributes, producing 33,000 role-specific questions targeting various forms of bias. These questions, spanning Yes/No, multiple-choice, and open-ended formats, are designed to prompt LLMs to adopt specific roles and respond accordingly. We employ a combination of rule-based and LLM-based strategies to identify biased responses, rigorously validated through human evaluation. Using the generated questions as the benchmark, we conduct extensive evaluations of six advanced LLMs released by OpenAI, Mistral AI, Meta, Alibaba, and DeepSeek. Our benchmark reveals 72,716 biased responses across the studied LLMs, with individual models yielding between 7,754 and 16,963 biased responses, underscoring the prevalence of bias in role-playing contexts. To support future research, we have publicly released the benchmark, along with all scripts and experimental results.
DCApr 3, 2025
MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert ParallelismRuidong Zhu, Ziheng Jiang, Chao Jin et al.
Mixture-of-Experts (MoE) showcases tremendous potential to scale large language models (LLMs) with enhanced performance and reduced computational complexity. However, its sparsely activated architecture shifts feed-forward networks (FFNs) from being compute-intensive to memory-intensive during inference, leading to substantially lower GPU utilization and increased operational costs. We present MegaScale-Infer, an efficient and cost-effective system for serving large-scale MoE models. MegaScale-Infer disaggregates attention and FFN modules within each model layer, enabling independent scaling, tailored parallelism strategies, and heterogeneous deployment for both modules. To fully exploit disaggregation in the presence of MoE's sparsity, MegaScale-Infer introduces ping-pong pipeline parallelism, which partitions a request batch into micro-batches and shuttles them between attention and FFNs for inference. Combined with distinct model parallelism for each module, MegaScale-Infer effectively hides communication overhead and maximizes GPU utilization. To adapt to disaggregated attention and FFN modules and minimize data transmission overhead (e.g., token dispatch), MegaScale-Infer provides a high-performance M2N communication library that eliminates unnecessary GPU-to-CPU data copies, group initialization overhead, and GPU synchronization. Experimental results indicate that MegaScale-Infer achieves up to 1.90x higher per-GPU throughput than state-of-the-art solutions.
94.2MAApr 28
Pythia: Toward Predictability-Driven Agent-Native LLM ServingShan Yu, Junyi Shu, Yuanjiang Ni et al.
As LLM applications grow more complex, developers are increasingly adopting multi-agent architectures to decompose workflows into specialized, collaborative components, introducing structure that constrains agent behavior and exposes useful semantic predictability. Unlike traditional LLM serving, which operates under highly dynamic and uncertain conditions, this structured topology enables opportunities to reduce runtime uncertainty -- yet existing systems fail to exploit it, treating agentic workloads as generic traffic and incurring significant inefficiencies. Our analysis of production traces from an agent-serving platform and an internal coding assistant reveals key bottlenecks, including low prefix cache hit rates, severe resource contention from long-context requests, and substantial queuing delays due to suboptimal scaling. To address these challenges, we propose Pythia, a multi-agent serving system that captures workflow semantics through a simple interface at the serving layer, unlocking new optimization opportunities and substantially improving throughput and job completion time over state-of-the-art baselines.
LGMar 9
DyQ-VLA: Temporal-Dynamic-Aware Quantization for Embodied Vision-Language-Action ModelsZihao Zheng, Hangyu Cao, Sicheng Tian et al.
Vision-Language-Action (VLA) models are dominant in embodied intelligence but are constrained by inference overheads. While model quantization alleviates these bottlenecks for edge deployment, static quantization approaches remain suboptimal for VLAs due to two critical challenges: (1) Temporal-dynamic sensitivity, where fixed precision wastes resources by ignoring stage-varying error tolerances; and (2) Real-time allocation, where identifying real-time sensitivity to guide bit allocation remains unsolved. To address these challenges, we propose DyQ-VLA, a dynamic quantization framework for VLAs. Specifically, a sensitivity-aware switching strategy leverages real-time kinematic proxies to trigger the bit-width switch, while a kinematic-guided module dynamically allocates the optimal bit-width. Experiments show that DyQ-VLA requires only 30.9% of the original memory footprint while maintaining 99.5% of its original performance, achieving 1.49x simulation and up to 1.43x real-world speedups.
PFAug 22, 2025
Dynamic Sparse Attention on Mobile SoCsWangsong Yin, Daliang Xu, Mengwei Xu et al.
On-device running Large Language Models (LLMs) is nowadays a critical enabler towards preserving user privacy. We observe that the attention operator falls back from the special-purpose NPU to the general-purpose CPU/GPU because of quantization sensitivity in state-of-the-art frameworks. This fallback results in a degraded user experience and increased complexity in system scheduling. To this end, this paper presents shadowAttn, a system-algorithm codesigned sparse attention module with minimal reliance on CPU/GPU by only sparsely calculating the attention on a tiny portion of tokens. The key idea is to hide the overhead of estimating the important tokens with a NPU-based pilot compute. Further, shadowAttn proposes insightful techniques such as NPU compute graph bucketing, head-wise NPU-CPU/GPU pipeline and per-head fine-grained sparsity ratio to achieve high accuracy and efficiency. shadowAttn delivers the best performance with highly limited CPU/GPU resource; it requires much less CPU/GPU resource to deliver on-par performance of SoTA frameworks.
LGFeb 8, 2024
Anatomizing Deep Learning Inference in Web BrowsersQipeng Wang, Shiqi Jiang, Zhenpeng Chen et al.
Web applications have increasingly adopted Deep Learning (DL) through in-browser inference, wherein DL inference performs directly within Web browsers. The actual performance of in-browser inference and its impacts on the quality of experience (QoE) remain unexplored, and urgently require new QoE measurements beyond traditional ones, e.g., mainly focusing on page load time. To bridge this gap, we make the first comprehensive performance measurement of in-browser inference to date. Our approach proposes new metrics to measure in-browser inference: responsiveness, smoothness, and inference accuracy. Our extensive analysis involves 9 representative DL models across Web browsers of 50 popular PC devices and 20 mobile devices. The results reveal that in-browser inference exhibits a substantial latency gap, averaging 16.9 times slower on CPU and 4.9 times slower on GPU compared to native inference on PC devices. The gap on mobile CPU and mobile GPU is 15.8 times and 7.8 times, respectively. Furthermore, we identify contributing factors to such latency gap, including underutilized hardware instruction sets, inherent overhead in the runtime environment, resource contention within the browser, and inefficiencies in software libraries and GPU abstractions. Additionally, in-browser inference imposes significant memory demands, at times exceeding 334.6 times the size of the DL models themselves, partly attributable to suboptimal memory management. We also observe that in-browser inference leads to a significant 67.2% increase in the time it takes for GUI components to render within Web browsers, significantly affecting the overall user QoE of Web applications reliant on this technology
ROMar 7
VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics AwarenessZihao Zheng, Zhihao Mao, Xingyue Zhou et al.
Vision-and-Language Navigation (VLN) increasingly relies on large vision-language models, but their inference cost conflicts with real-time deployment. Token caching is a promising training-free strategy that avoids redundant computation by reusing stable visual tokens across frames. However, existing methods assume a static camera and fixed semantic focus, assumptions that VLN fundamentally violates. We identify two failure modes: (1) visual dynamics, where viewpoint shift displaces token positions across frames, causing position-wise matching to pair misaligned content; (2) semantic dynamics, where token relevance shifts across task stages as navigation progresses, making cached states stale. We propose VLN-Cache, a visual-dynamic-aware and semantic-dynamic-aware caching framework that introduces view-aligned remapping to recover geometric correspondences and a task-relevance saliency filter to veto reuse at semantic transitions. A layer-adaptive entropy policy further balances the per-layer reuse budget. Experiments on the R2R-CE simulation benchmark show up to 1.52x speedup while maintaining competitive navigation success rates.
89.9DCMar 13
ARL-Tangram: Unleash the Resource Efficiency in Agentic Reinforcement LearningBangjun Xiao, Yihao Zhao, Xiangwei Deng et al.
Agentic reinforcement learning (RL) has emerged as a transformative workload in cloud clusters, enabling large language models (LLMs) to solve complex problems through interactions with real world. However, unlike traditional RL, agentic RL demands substantial external cloud resources, e.g., CPUs for code execution and GPUs for reward models, that exist outside the primary training cluster. Existing agentic RL framework typically rely on static over-provisioning, i.e., resources are often tied to long-lived trajectories or isolated by tasks, which leads to severe resource inefficiency. We propose the action-level orchestration, and incorporate it into ARL-Tangram, a unified resource management system that enables fine-grained external resource sharing and elasticity. ARL-Tangram utilizes a unified action-level formulation and an elastic scheduling algorithm to minimize action completion time (ACT) while satisfying heterogeneous resource constraints. Further, heterogeneous resource managers are tailored to efficiently support the action-level execution on resources with heterogeneous characteristics and topologies. Evaluation on real-world agentic RL tasks demonstrates that ARL-Tangram improves average ACT by up to 4.3$\times$, speeds up the step duration of RL training by up to 1.5$\times$, and saves the external resources by up to 71.2$\%$. This system has been deployed to support the training of the MiMo series models.
DCAug 24, 2025
TokenLake: A Unified Segment-level Prefix Cache Pool for Fine-grained Elastic Long-Context LLM ServingBingyang Wu, Zili Zhang, Yinmin Zhong et al.
Prefix caching is crucial to accelerate multi-turn interactions and requests with shared prefixes. At the cluster level, existing prefix caching systems are tightly coupled with request scheduling to optimize cache efficiency and computation performance together, leading to load imbalance, data redundancy, and memory fragmentation of caching systems across instances. To address these issues, memory pooling is promising to shield the scheduler from the underlying cache management so that it can focus on the computation optimization. However, because existing prefix caching systems only transfer increasingly longer prefix caches between instances, they cannot achieve low-latency memory pooling. To address these problems, we propose a unified segment-level prefix cache pool, TokenLake. It uses a declarative cache interface to expose requests' query tensors, prefix caches, and cache-aware operations to TokenLake for efficient pooling. Powered by this abstraction, TokenLake can manage prefix cache at the segment level with a heavy-hitter-aware load balancing algorithm to achieve better cache load balance, deduplication, and defragmentation. TokenLake also transparently minimizes the communication volume of query tensors and new caches. Based on TokenLake, the scheduler can schedule requests elastically by using existing techniques without considering prefix cache management. Evaluations on real-world workloads show that TokenLake can improve throughput by up to 2.6$\times$ and 2.0$\times$ and boost hit rate by 2.0$\times$ and 2.1$\times$, compared to state-of-the-art cache-aware routing and cache-centric PD-disaggregation solutions, respectively.
LGJan 16, 2024
A Survey of Resource-efficient LLM and Multimodal Foundation ModelsMengwei Xu, Wangsong Yin, Dongqi Cai et al.
Large foundation models, including large language models (LLMs), vision transformers (ViTs), diffusion, and LLM-based multimodal models, are revolutionizing the entire machine learning lifecycle, from training to deployment. However, the substantial advancements in versatility and performance these models offer come at a significant cost in terms of hardware resources. To support the growth of these large models in a scalable and environmentally sustainable way, there has been a considerable focus on developing resource-efficient strategies. This survey delves into the critical importance of such research, examining both algorithmic and systemic aspects. It offers a comprehensive analysis and valuable insights gleaned from existing literature, encompassing a broad array of topics from cutting-edge model architectures and training/serving algorithms to practical system designs and implementations. The goal of this survey is to provide an overarching understanding of how current approaches are tackling the resource challenges posed by large foundation models and to potentially inspire future breakthroughs in this field.
LGMay 10, 2023
Fast Distributed Inference Serving for Large Language ModelsBingyang Wu, Yinmin Zhong, Zili Zhang et al.
Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT. The interactive nature of these applications demands low latency for LLM inference. Existing LLM serving systems use run-to-completion processing for inference jobs, which suffers from head-of-line blocking and long latency. We present FastServe, a distributed inference serving system for LLMs. FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token. FastServe uses preemptive scheduling to minimize latency with a novel skip-join Multi-Level Feedback Queue scheduler. Based on the new semi-information-agnostic setting of LLM inference, the scheduler leverages the input length information to assign an appropriate initial queue for each arrival job to join. The higher priority queues than the joined queue are skipped to reduce demotions. We design an efficient GPU memory management mechanism that proactively offloads and uploads intermediate state between GPU memory and host memory for LLM inference. We build a system prototype of FastServe and experimental results show that compared to the state-of-the-art solution vLLM, FastServe improves the throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.
LGFeb 14, 2022
Benchmarking of DL Libraries and Models on Mobile DevicesQiyang Zhang, Xiang Li, Xiangying Che et al.
Deploying deep learning (DL) on mobile devices has been a notable trend in recent years. To support fast inference of on-device DL, DL libraries play a critical role as algorithms and hardware do. Unfortunately, no prior work ever dives deep into the ecosystem of modern DL libs and provides quantitative results on their performance. In this paper, we first build a comprehensive benchmark that includes 6 representative DL libs and 15 diversified DL models. We then perform extensive experiments on 10 mobile devices, which help reveal a complete landscape of the current mobile DL libs ecosystem. For example, we find that the best-performing DL lib is severely fragmented across different models and hardware, and the gap between those DL libs can be rather huge. In fact, the impacts of DL libs can overwhelm the optimizations from algorithms or hardware, e.g., model quantization and GPU/DSP-based heterogeneous computing. Finally, atop the observations, we summarize practical implications to different roles in the DL lib ecosystem.
SEDec 12, 2021
Rise of Distributed Deep Learning Training in the Big Model Era: From a Software Engineering PerspectiveXuanzhe Liu, Diandian Gu, Zhenpeng Chen et al.
Deep learning (DL) has become a key component of modern software. In the "big model" era, the rich features of DL-based software substantially rely on powerful DL models, e.g., BERT, GPT-3, and the recently emerging GPT-4, which are trained on the powerful cloud with large datasets. Hence, training effective DL models has become a vital stage in the whole software lifecycle. When training deep learning models, especially those big models, developers need to parallelize and distribute the computation and memory resources amongst multiple devices in the training process, which is known as distributed deep learning training, or distributed training for short. However, the unique challenges that developers encounter in distributed training process have not been studied in the software engineering community. Given the increasingly heavy dependence of current DL-based software on distributed training, this paper aims to fill in the knowledge gap and presents the first comprehensive study on developers' issues in distributed training. To this end, we analyze 1,131 real-world developers' issues about using these frameworks reported on Stack Overflow and GitHub. We construct a fine-grained taxonomy consisting of 30 categories regarding the fault symptoms and summarize common fix patterns for different symptoms. Based on the results, we suggest actionable implications on research avenues that can potentially facilitate the distributed training to develop DL-based software, such as focusing on the frequent and common fix patterns when designing testing or debugging tools, developing efficient testing and debugging techniques for communication configuration along with the synthesis of network configuration analysis, designing new multi-device checkpoint-and-replay techniques to help reproduction, and designing serverless APIs for cloud platforms.
SEJan 13, 2021
An Empirical Study on Deployment Faults of Deep Learning Based Mobile ApplicationsZhenpeng Chen, Huihan Yao, Yiling Lou et al.
Deep Learning (DL) is finding its way into a growing number of mobile software applications. These software applications, named as DL based mobile applications (abbreviated as mobile DL apps) integrate DL models trained using large-scale data with DL programs. A DL program encodes the structure of a desirable DL model and the process by which the model is trained using training data. Due to the increasing dependency of current mobile apps on DL, software engineering (SE) for mobile DL apps has become important. However, existing efforts in SE research community mainly focus on the development of DL models and extensively analyze faults in DL programs. In contrast, faults related to the deployment of DL models on mobile devices (named as deployment faults of mobile DL apps) have not been well studied. Since mobile DL apps have been used by billions of end users daily for various purposes including for safety-critical scenarios, characterizing their deployment faults is of enormous importance. To fill the knowledge gap, this paper presents the first comprehensive study on the deployment faults of mobile DL apps. We identify 304 real deployment faults from Stack Overflow and GitHub, two commonly used data sources for studying software faults. Based on the identified faults, we construct a fine-granularity taxonomy consisting of 23 categories regarding to fault symptoms and distill common fix strategies for different fault types. Furthermore, we suggest actionable implications and research avenues that could further facilitate the deployment of DL models on mobile devices.
CRDec 2, 2020
VM Matters: A Comparison of WASM VMs and EVMs in the Performance of Blockchain Smart ContractsShuyu Zheng, Haoyu Wang, Lei Wu et al.
WebAssemly is an emerging runtime for Web applications and has been supported in almost all browsers. Recently, WebAssembly is further regarded to be a the next-generation environment for blockchain applications, and has been adopted by Ethereum, namely eWASM, to replace the state-of-the-art EVM. However, whether and how well current eWASM outperforms EVM on blockchain clients is still unknown. This paper conducts the first measurement study, to measure the performance on WASM VM and EVM for executing smart contracts on blockchain. To our surprise, the current WASM VM does not perform in expected performance. The overhead introduced by WASM is really non-trivial. Our results highlight the challenges when deploying WASM in practice, and provide insightful implications for improvement space.
LGOct 22, 2020
Hierarchical Federated Learning through LAN-WAN OrchestrationJinliang Yuan, Mengwei Xu, Xiao Ma et al.
Federated learning (FL) was designed to enable mobile phones to collaboratively learn a global model without uploading their private data to a cloud server. However, exiting FL protocols has a critical communication bottleneck in a federated network coupled with privacy concerns, usually powered by a wide-area network (WAN). Such a WAN-driven FL design leads to significantly high cost and much slower model convergence. In this work, we propose an efficient FL protocol, which involves a hierarchical aggregation mechanism in the local-area network (LAN) due to its abundant bandwidth and almost negligible monetary cost than WAN. Our proposed FL can accelerate the learning process and reduce the monetary cost with frequent local aggregation in the same LAN and infrequent global aggregation on a cloud across WAN. We further design a concrete FL platform, namely LanFL, that incorporates several key techniques to handle those challenges introduced by LAN: cloud-device aggregation architecture, intra-LAN peer-to-peer (p2p) topology generation, inter-LAN bandwidth capacity heterogeneity. We evaluate LanFL on 2 typical Non-IID datasets, which reveals that LanFL can significantly accelerate FL training (1.5x-6.0x), save WAN traffic (18.3x-75.6x), and reduce monetary cost (3.8x-27.2x) while preserving the model accuracy.
LGSep 20, 2020
Exploring the Generalizability of Spatio-Temporal Traffic Prediction: Meta-Modeling and an Analytic FrameworkLeye Wang, Di Chai, Xuanzhe Liu et al.
The Spatio-Temporal Traffic Prediction (STTP) problem is a classical problem with plenty of prior research efforts that benefit from traditional statistical learning and recent deep learning approaches. While STTP can refer to many real-world problems, most existing studies focus on quite specific applications, such as the prediction of taxi demand, ridesharing order, traffic speed, and so on. This hinders the STTP research as the approaches designed for different applications are hardly comparable, and thus how an application-driven approach can be generalized to other scenarios is unclear. To fill in this gap, this paper makes three efforts: (i) we propose an analytic framework, called STAnalytic, to qualitatively investigate STTP approaches regarding their design considerations on various spatial and temporal factors, aiming to make different application-driven approaches comparable; (ii) we design a spatio-temporal meta-model, called STMeta, which can flexibly integrate generalizable temporal and spatial knowledge identified by STAnalytic, (iii) we build an STTP benchmark platform including ten real-life datasets with five scenarios to quantitatively measure the generalizability of STTP approaches. In particular, we implement STMeta with different deep learning techniques, and STMeta demonstrates better generalizability than state-of-the-art approaches by achieving lower prediction error on average across all the datasets.
CRJul 27, 2020
Don't Fish in Troubled Waters! Characterizing Coronavirus-themed Cryptocurrency ScamsPengcheng Xia, Haoyu Wang, Xiapu Luo et al.
As COVID-19 has been spreading across the world since early 2020, a growing number of malicious campaigns are capitalizing the topic of COVID-19. COVID-19 themed cryptocurrency scams are increasingly popular during the pandemic. However, these newly emerging scams are poorly understood by our community. In this paper, we present the first measurement study of COVID-19 themed cryptocurrency scams. We first create a comprehensive taxonomy of COVID-19 scams by manually analyzing the existing scams reported by users from online resources. Then, we propose a hybrid approach to perform the investigation by: 1) collecting reported scams in the wild; and 2) detecting undisclosed ones based on information collected from suspicious entities (e.g., domains, tweets, etc). We have collected 195 confirmed COVID-19 cryptocurrency scams in total, including 91 token scams, 19 giveaway scams, 9 blackmail scams, 14 crypto malware scams, 9 Ponzi scheme scams, and 53 donation scams. We then identified over 200 blockchain addresses associated with these scams, which lead to at least 330K US dollars in losses from 6,329 victims. For each type of scams, we further investigated the tricks and social engineering techniques they used. To facilitate future research, we have released all the well-labelled scams to the research community.
LGJun 12, 2020
Characterizing Impacts of Heterogeneity in Federated Learning upon Large-Scale Smartphone DataChengxu Yang, Qipeng Wang, Mengwei Xu et al.
Federated learning (FL) is an emerging, privacy-preserving machine learning paradigm, drawing tremendous attention in both academia and industry. A unique characteristic of FL is heterogeneity, which resides in the various hardware specifications and dynamic states across the participating devices. Theoretically, heterogeneity can exert a huge influence on the FL training process, e.g., causing a device unavailable for training or unable to upload its model updates. Unfortunately, these impacts have never been systematically studied and quantified in existing FL literature. In this paper, we carry out the first empirical study to characterize the impacts of heterogeneity in FL. We collect large-scale data from 136k smartphones that can faithfully reflect heterogeneity in real-world settings. We also build a heterogeneity-aware FL platform that complies with the standard FL protocol but with heterogeneity in consideration. Based on the data and the platform, we conduct extensive experiments to compare the performance of state-of-the-art FL algorithms under heterogeneity-aware and heterogeneity-unaware settings. Results show that heterogeneity causes non-trivial performance degradation in FL, including up to 9.2% accuracy drop, 2.32x lengthened training time, and undermined fairness. Furthermore, we analyze potential impact factors and find that device failure and participant bias are two potential factors for performance degradation. Our study provides insightful implications for FL practitioners. On the one hand, our findings suggest that FL algorithm designers consider necessary heterogeneity during the evaluation. On the other hand, our findings urge system providers to design specific mechanisms to mitigate the impacts of heterogeneity.
SEMay 2, 2020
A Comprehensive Study on Challenges in Deploying Deep Learning Based SoftwareZhenpeng Chen, Yanbin Cao, Yuanqiang Liu et al.
Deep learning (DL) becomes increasingly pervasive, being used in a wide range of software applications. These software applications, named as DL based software (in short as DL software), integrate DL models trained using a large data corpus with DL programs written based on DL frameworks such as TensorFlow and Keras. A DL program encodes the network structure of a desirable DL model and the process by which the model is trained using the training data. To help developers of DL software meet the new challenges posed by DL, enormous research efforts in software engineering have been devoted. Existing studies focus on the development of DL software and extensively analyze faults in DL programs. However, the deployment of DL software has not been comprehensively studied. To fill this knowledge gap, this paper presents a comprehensive study on understanding challenges in deploying DL software. We mine and analyze 3,023 relevant posts from Stack Overflow, a popular Q&A website for developers, and show the increasing popularity and high difficulty of DL software deployment among developers. We build a taxonomy of specific challenges encountered by developers in the process of DL software deployment through manual inspection of 769 sampled posts and report a series of actionable implications for researchers, developers, and DL framework vendors.
LGFeb 15, 2020
Federated Neural Architecture SearchJinliang Yuan, Mengwei Xu, Yuxin Zhao et al.
To preserve user privacy while enabling mobile intelligence, techniques have been proposed to train deep neural networks on decentralized data. However, training over decentralized data makes the design of neural architecture quite difficult as it already was. Such difficulty is further amplified when designing and deploying different neural architectures for heterogeneous mobile platforms. In this work, we propose an automatic neural architecture search into the decentralized training, as a new DNN training paradigm called Federated Neural Architecture Search, namely federated NAS. To deal with the primary challenge of limited on-client computational and communication resources, we present FedNAS, a highly optimized framework for efficient federated NAS. FedNAS fully exploits the key opportunity of insufficient model candidate re-training during the architecture search process, and incorporates three key optimizations: parallel candidates training on partial clients, early dropping candidates with inferior performance, and dynamic round numbers. Tested on large-scale datasets and typical CNN architectures, FedNAS achieves comparable model accuracy as state-of-the-art NAS algorithm that trains models with centralized data, and also reduces the client cost by up to two orders of magnitude compared to a straightforward design of federated NAS.
CRFeb 13, 2020
Characterizing EOSIO BlockchainYuheng Huang, Haoyu Wang, Lei Wu et al.
EOSIO has become one of the most popular blockchain platforms since its mainnet launch in June 2018. In contrast to the traditional PoW-based systems (e.g., Bitcoin and Ethereum), which are limited by low throughput, EOSIO is the first high throughput Delegated Proof of Stake system that has been widely adopted by many applications. Although EOSIO has millions of accounts and billions of transactions, little is known about its ecosystem, especially related to security and fraud. In this paper, we perform a large-scale measurement study of the EOSIO blockchain and its associated DApps. We gather a large-scale dataset of EOSIO and characterize activities including money transfers, account creation and contract invocation. Using our insights, we then develop techniques to automatically detect bots and fraudulent activity. We discover thousands of bot accounts (over 30\% of the accounts in the platform) and a number of real-world attacks (301 attack accounts). By the time of our study, 80 attack accounts we identified have been confirmed by DApp teams, causing 828,824 EOS tokens losses (roughly 2.6 million US\$) in total.
SESep 3, 2019
A First Look at Blockchain-based Decentralized ApplicationsKaidong Wu, Yun Ma, Gang Huang et al.
With the increasing popularity of blockchain technologies in recent years, blockchain-based decentralized applications (DApps for short in this paper) have been rapidly developed and widely adopted in many areas, being a hot topic in both academia and industry. Despite of the importance of DApps, we still have quite little understanding of DApps along with its ecosystem. To bridge the knowledge gap, this paper presents the first comprehensive empirical study of blockchain-based DApps to date, based on an extensive dataset of 995 Ethereum DApps and 29,846,075 transaction logs over them. We make a descriptive analysis of the popularity of DApps, summarize the patterns of how DApps use smart contracts to access the underlying blockchain, and explore the worth-addressing issues of deploying and operating DApps. Based on the findings, we propose some implications for DApp users to select proper DApps, for DApp developers to improve the efficiency of DApps, and for blockchain vendors to enhance the support of DApps.
SEJul 4, 2019
SEntiMoji: An Emoji-Powered Learning Approach for Sentiment Analysis in Software EngineeringZhenpeng Chen, Yanbin Cao, Xuan Lu et al.
Sentiment analysis has various application scenarios in software engineering (SE), such as detecting developers' emotions in commit messages and identifying their opinions on Q&A forums. However, commonly used out-of-the-box sentiment analysis tools cannot obtain reliable results on SE tasks and the misunderstanding of technical jargon is demonstrated to be the main reason. Then, researchers have to utilize labeled SE-related texts to customize sentiment analysis for SE tasks via a variety of algorithms. However, the scarce labeled data can cover only very limited expressions and thus cannot guarantee the analysis quality. To address such a problem, we turn to the easily available emoji usage data for help. More specifically, we employ emotional emojis as noisy labels of sentiments and propose a representation learning approach that uses both Tweets and GitHub posts containing emojis to learn sentiment-aware representations for SE-related texts. These emoji-labeled posts can not only supply the technical jargon, but also incorporate more general sentiment patterns shared across domains. They as well as labeled data are used to learn the final sentiment classifier. Compared to the existing sentiment analysis methods used in SE, the proposed approach can achieve significant improvement on representative benchmark datasets. By further contrast experiments, we find that the Tweets make a key contribution to the power of our approach. This finding informs future research not to unilaterally pursue the domain-specific resource, but try to transform knowledge from the open domain through ubiquitous signals such as emojis.
SEJan 27, 2019
Moving Deep Learning into Web Browser: How Far Can We Go?Yun Ma, Dongwei Xiang, Shuyu Zheng et al.
Recently, several JavaScript-based deep learning frameworks have emerged, making it possible to perform deep learning tasks directly in browsers. However, little is known on what and how well we can do with these frameworks for deep learning in browsers. To bridge the knowledge gap, in this paper, we conduct the first empirical study of deep learning in browsers. We survey 7 most popular JavaScript-based deep learning frameworks, investigating to what extent deep learning tasks have been supported in browsers so far. Then we measure the performance of different frameworks when running different deep learning tasks. Finally, we dig out the performance gap between deep learning in browsers and on native platforms by comparing the performance of TensorFlow.js and TensorFlow in Python. Our findings could help application developers, deep-learning framework vendors and browser vendors to improve the efficiency of deep learning in browsers.
CYDec 12, 2018
A First Look at Emoji Usage on GitHub: An Empirical StudyXuan Lu, Yanbin Cao, Zhenpeng Chen et al.
Emoji is becoming a ubiquitous language and gaining worldwide popularity in recent years including the field of software engineering (SE). As nonverbal cues, emojis are widely used in user understanding tasks such as sentiment analysis, but few work has been done to study emojis in SE scenarios. This paper presents a large scale empirical study on how GitHub users use emojis in development-related communications. We find that emojis are used by a considerable proportion of GitHub users. In comparison to Internet users, developers show interesting usage characteristics and have their own interpretation of the meanings of emojis. In addition, the usage of emojis reflects a positive and supportive culture of this community. Through a manual annotation task, we find that sentimental usage is a main intention of using emojis in issues, pull requests, and comments, while emojis are mainly used to emphasize important contents in README. These findings not only deepen our understanding about the culture of SE communities, but also provide implications on how to facilitate SE tasks with emojis such as sentiment analysis.
LGNov 8, 2018
A First Look at Deep Learning Apps on SmartphonesMengwei Xu, Jiawei Liu, Yuanqiang Liu et al.
We are in the dawn of deep learning explosion for smartphones. To bridge the gap between research and practice, we present the first empirical study on 16,500 the most popular Android apps, demystifying how smartphone apps exploit deep learning in the wild. To this end, we build a new static tool that dissects apps and analyzes their deep learning functions. Our study answers threefold questions: what are the early adopter apps of deep learning, what do they use deep learning for, and how do their deep learning models look like. Our study has strong implications for app developers, smartphone vendors, and deep learning R\&D. On one hand, our findings paint a promising picture of deep learning for smartphones, showing the prosperity of mobile deep learning frameworks as well as the prosperity of apps building their cores atop deep learning. On the other hand, our findings urge optimizations on deep learning models deployed on smartphones, the protection of these models, and validation of research ideas on these models.
IRJun 7, 2018
Emoji-Powered Representation Learning for Cross-Lingual Sentiment ClassificationZhenpeng Chen, Sheng Shen, Ziniu Hu et al.
Sentiment classification typically relies on a large amount of labeled data. In practice, the availability of labels is highly imbalanced among different languages, e.g., more English texts are labeled than texts in any other languages, which creates a considerable inequality in the quality of related information services received by users speaking different languages. To tackle this problem, cross-lingual sentiment classification approaches aim to transfer knowledge learned from one language that has abundant labeled examples (i.e., the source language, usually English) to another language with fewer labels (i.e., the target language). The source and the target languages are usually bridged through off-the-shelf machine translation tools. Through such a channel, cross-language sentiment patterns can be successfully learned from English and transferred into the target languages. This approach, however, often fails to capture sentiment knowledge specific to the target language, and thus compromises the accuracy of the downstream classification task. In this paper, we employ emojis, which are widely available in many languages, as a new channel to learn both the cross-language and the language-specific sentiment patterns. We propose a novel representation learning method that uses emoji prediction as an instrument to learn respective sentiment-aware representations for each language. The learned representations are then integrated to facilitate cross-lingual sentiment classification. The proposed method demonstrates state-of-the-art performance on benchmark datasets, which is sustained even when sentiment labels are scarce.
HCJan 12, 2018
Predicting Smartphone Battery Life based on Comprehensive and Real-time Usage DataHuoran Li, Xuanzhe Liu, Qiaozhu Mei
Smartphones and smartphone apps have undergone an explosive growth in the past decade. However, smartphone battery technology hasn't been able to keep pace with the rapid growth of the capacity and the functionality of smartphones and apps. As a result, battery has always been a bottleneck of a user's daily experience of smartphones. An accurate estimation of the remaining battery life could tremendously help the user to schedule their activities and use their smartphones more efficiently. Existing studies on battery life prediction have been primitive due to the lack of real-world smartphone usage data at scale. This paper presents a novel method that uses the state-of-the-art machine learning models for battery life prediction, based on comprehensive and real-time usage traces collected from smartphones. The proposed method is the first that identifies and addresses the severe data missing problem in this context, using a principled statistical metric called the concordance index. The method is evaluated using a dataset collected from 51 users for 21 months, which covers comprehensive and fine-grained smartphone usage traces including system status, sensor indicators, system events, and app status. We find that the remaining battery life of a smartphone can be accurately predicted based on how the user uses the device at the real-time, in the current session, and in history. The machine learning models successfully identify predictive features for battery life and their applicable scenarios.
SIDec 6, 2017
Listening to Chaotic Whispers: A Deep Learning Framework for News-oriented Stock Trend PredictionZiniu Hu, Weiqing Liu, Jiang Bian et al.
Stock trend prediction plays a critical role in seeking maximized profit from stock investment. However, precise trend prediction is very difficult since the highly volatile and non-stationary nature of stock market. Exploding information on Internet together with advancing development of natural language processing and text mining techniques have enable investors to unveil market trends and volatility from online content. Unfortunately, the quality, trustworthiness and comprehensiveness of online content related to stock market varies drastically, and a large portion consists of the low-quality news, comments, or even rumors. To address this challenge, we imitate the learning process of human beings facing such chaotic online news, driven by three principles: sequential content dependency, diverse influence, and effective and efficient learning. In this paper, to capture the first two principles, we designed a Hybrid Attention Networks to predict the stock trend based on the sequence of recent related news. Moreover, we apply the self-paced learning mechanism to imitate the third principle. Extensive experiments on real-world stock market data demonstrate the effectiveness of our approach.
CYDec 1, 2017
DeepWear: Adaptive Local Offloading for On-Wearable Deep LearningMengwei Xu, Feng Qian, Mengze Zhu et al.
Due to their on-body and ubiquitous nature, wearables can generate a wide range of unique sensor data creating countless opportunities for deep learning tasks. We propose DeepWear, a deep learning (DL) framework for wearable devices to improve the performance and reduce the energy footprint. DeepWear strategically offloads DL tasks from a wearable device to its paired handheld device through local network. Compared to the remote-cloud-based offloading, DeepWear requires no Internet connectivity, consumes less energy, and is robust to privacy breach. DeepWear provides various novel techniques such as context-aware offloading, strategic model partition, and pipelining support to efficiently utilize the processing capacity from nearby paired handhelds. Deployed as a user-space library, DeepWear offers developer-friendly APIs that are as simple as those in traditional DL libraries such as TensorFlow. We have implemented DeepWear on the Android OS and evaluated it on COTS smartphones and smartwatches with real DL models. DeepWear brings up to 5.08X and 23.0X execution speedup, as well as 53.5% and 85.5% energy saving compared to wearable-only and handheld-only strategies, respectively.
CVDec 1, 2017
DeepCache: Principled Cache for Mobile Deep VisionMengwei Xu, Mengze Zhu, Yunxin Liu et al.
We present DeepCache, a principled cache design for deep learning inference in continuous mobile vision. DeepCache benefits model execution efficiency by exploiting temporal locality in input video streams. It addresses a key challenge raised by mobile vision: the cache must operate under video scene variation, while trading off among cacheability, overhead, and loss in model accuracy. At the input of a model, DeepCache discovers video temporal locality by exploiting the video's internal structure, for which it borrows proven heuristics from video compression; into the model, DeepCache propagates regions of reusable results by exploiting the model's internal structure. Notably, DeepCache eschews applying video heuristics to model internals which are not pixels but high-dimensional, difficult-to-interpret data. Our implementation of DeepCache works with unmodified deep learning models, requires zero developer's manual effort, and is therefore immediately deployable on off-the-shelf mobile devices. Our experiments show that DeepCache saves inference execution time by 18% on average and up to 47%. DeepCache reduces system energy consumption by 20% on average.
SENov 30, 2017
Automating Release of Deep Link APIs for Android ApplicationsYun Ma, Ziniu Hu, Dian Yang et al.
Unlike the Web where each web page has a global URL to reach, a specific "content page" inside a mobile app cannot be opened unless the user explores the app with several operations from the landing page. Recently, deep links have been advocated by major companies to enable targeting and opening a specific page of an app externally with an accessible uniform resource identifier (URI). To empirically investigate the state of the practice on adopting deep links, in this article, we present the largest empirical study of deep links over 20,000 Android apps, and find that deep links do not get wide adoption among current Android apps, and non-trivial manual efforts are required for app developers to support deep links. To address such an issue, we propose the Aladdin approach and supporting tool to release deep links to access arbitrary location of existing apps. Aladdin instantiates our novel cooperative framework to synergically combine static analysis and dynamic analysis while minimally engaging developers to provide inputs to the framework for automation, without requiring any coding efforts or additional deployment efforts. We evaluate Aladdin with popular apps and demonstrate its effectiveness and performance.
HCMay 16, 2017
Through a Gender Lens: Learning Usage Patterns of Emojis from Large-Scale Android UsersZhenpeng Chen, Xuan Lu, Wei Ai et al.
Based on a large data set of emoji using behavior collected from smartphone users over the world, this paper investigates gender-specific usage of emojis. We present various interesting findings that evidence a considerable difference in emoji usage by female and male users. Such a difference is significant not just in a statistical sense; it is sufficient for a machine learning algorithm to accurately infer the gender of a user purely based on the emojis used in their messages. In real world scenarios where gender inference is a necessity, models based on emojis have unique advantages over existing models that are based on textual or contextual information. Emojis not only provide language-independent indicators, but also alleviate the risk of leaking private user information through the analysis of text and metadata.
CYFeb 14, 2017
Mining Behavioral Patterns from Millions of Android UsersXuanzhe Liu, Huoran Li, Xuan Lu et al.
The prevalence of smart mobile devices has promoted the popularity of mobile applications (a.k.a. apps). Supporting mobility has become a promising trend in software engineering research. This article presents an empirical study of behavioral service profiles collected from millions of users whose devices are deployed with Wandoujia, a leading Android app store service in China. The dataset of Wandoujia service profiles consists of two kinds of user behavioral data from using 0.28 million free Android apps, including (1) app management activities (i.e., downloading, updating, and uninstalling apps) from over 17 million unique users and (2) app network usage from over 6 million unique users. We explore multiple aspects of such behavioral data and present patterns of app usage. Based on the findings as well as derived knowledge, we also suggest some new open opportunities and challenges that can be explored by the research community, including app development, deployment, delivery, revenue, etc.
SEMay 23, 2016
DroidLink: Automated Generation of Deep Links for Android AppsYun Ma, Xuanzhe Liu, Ruogu Du et al.
The mobile application (app) has become the main entrance to access the Internet on handheld devices. Unlike the Web where each webpage has a global URL to reach directly, a specific "content page" of an app can be opened only by exploring the app with several operations from the landing page. The interoperability between apps is quite fixed and thus limits the value-added "linked data" between apps. Recently, deep link has been proposed to enable targeting and opening a specific page of an app externally with an accessible uniform resource identifier (URI). However, implementing deep link for mobile apps requires a lot of manual efforts by app developers, which can be very error-prone and time-consuming. In this paper, we propose DroidLink to automatically generating deep links for existing Android apps. We design a deep link model suitable for automatic generation. Then we explore the transition of pages and build a navigation graph based on static and dynamic analysis of Android apps. Next, we realize an updating mechanism that keeps on revisiting the target app and discover new pages, and thus generates deep links for every single page of the app. Finally, we repackage the app with deep link supports, but requires no additional deployment requirements. We generate deep links for some popular apps and demonstrate the feasibility of DroidLink.
SEMay 21, 2016
Mitigating Redundant Data Transfers for Mobile Web Applications via App-Specific Cache SpaceXuanzhe Liu, Yun Ma, Shuailiang Dong et al.
Redundant transfer of resources is a critical issue for compromising the performance of mobile Web applications (a.k.a., apps) in terms of data traffic, load time, and even energy consumption. Evidence shows that the current cache mechanisms are far from satisfactory. With lessons learned from how native apps manage their resources, in this paper, we propose the ReWAP approach to fundamentally reducing redundant transfers by restructuring the resource loading of mobile Web apps. ReWAP is based on an efficient mechanism of resource packaging where stable resources are encapsulated and maintained into a package, and such a package shall be loaded always from the local storage and updated by explicitly refreshing. By retrieving and analyzing the update of resources, ReWAP maintains resource packages that can accurately identify which resources can be loaded from the local storage for a considerably long period. ReWAP also provides a wrapper for mobile Web apps to enable loading and updating resource packages in the local storage as well as loading resources from resource packages. ReWAP can be easily and seamlessly deployed into existing mobile Web architectures with minimal modifications, and is transparent to end-users. We evaluate ReWAP based on continuous 15-day access traces of 50 mobile Web apps that suffer heavily from the problem of redundant transfers. Compared to the original mobile Web apps with cache enabled, ReWAP can significantly reduce the data traffic, with the median saving up to 51%. In addition, ReWAP can incur only very minor runtime overhead of the client-side browsers.
SEFeb 19, 2016
MUIT: A Middleware for Adaptive Mobile Web-based User Interfaces in WS-BPELXuanzhe Liu, Mengwei Xu, Gang Huang et al.
In enterprise organizations, the Bring-Your-Own-Device (BYOD) requirement has become prevalent as employees use their own mobile devices to process the workflow-oriented tasks. Consequently, it calls for approaches that can quickly develop and integrate mobile user interactions into existing business processes, and adapt to various contexts. However, designing, developing and deploying adaptive and mobile-oriented user interfaces for existing process engines are non-trivial, and require significant systematic efforts. To address this issue, we present a novel middleware-based approach, called MUIT, to developing and deploying the Mobility, User Interactions and Tasks into WS-BPEL engines. MUIT can be seamlessly into WS-BPEL without intrusions of existing process instances. MUIT provides a Domain-Specific Language (DSL) that provides some intuitive APIs to support the declarative development of adaptive, mobile-oriented, and Web-based user interfaces in WS-BPEL. The DSL can significantly improve the development of user interactions by preventing arbitrarily mixed codes, and its runtime supports satisfactory user experiences. We implement a proof- of-concept prototype by integrating MUIT into the commodity WS-BPEL-based Apusic Platform, and evaluate the performance and usability of MUIT platform.