NI · Oct 11, 2023 · Code
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
Yuhan Liu, Hanchen Li, Yihua Cheng et al. · stanford
As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge. Yet using long contexts is challenging, as nothing can be generated until the whole context is processed by the LLM. While the context-processing delay can be reduced by reusing the KV cache of a context across different inputs, fetching the KV cache, which contains large tensors, over the network can cause high extra network delays. CacheGen is a fast context-loading module for LLM systems. First, CacheGen uses a custom tensor encoder, leveraging the KV cache's distributional properties to encode a KV cache into more compact bitstream representations with negligible decoding overhead, to save bandwidth. Second, CacheGen adapts the compression level of different parts of a KV cache to cope with changes in available bandwidth, in order to maintain low context-loading delay and high generation quality. When available bandwidth drops, CacheGen may raise the compression level for a part of the context or recompute its KV cache on the fly. We test CacheGen on popular LLMs and datasets. Compared to recent systems that reuse the KV cache, CacheGen reduces the KV cache size by 3.5-4.3x and the total delay in fetching and processing contexts by 3.2-3.7x with negligible impact on the LLM response quality. Our code is at: https://github.com/UChi-JCL/CacheGen.
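The bandwidth adaptation described in this abstract can be illustrated with a minimal sketch. It is not CacheGen's actual API: the function name `pick_level`, the per-level chunk sizes, and the per-chunk deadline are all hypothetical, and the real system's policy may differ. The sketch only shows the idea of choosing, per chunk, the least-compressed encoding that still arrives in time, and signalling a fallback (such as recomputation) when none does.

```python
def pick_level(sizes_bytes, bandwidth_bps, deadline_s):
    """Pick a compression level for one KV-cache chunk (illustrative only).

    sizes_bytes: encoded size of the chunk at each level, ordered from the
                 highest-quality (largest) to the lowest-quality (smallest).
    Returns the index of the first level whose transfer time fits the
    deadline, or None if no level fits (caller may recompute instead).
    """
    for level, size in enumerate(sizes_bytes):
        transfer_s = size * 8 / bandwidth_bps  # bytes -> bits -> seconds
        if transfer_s <= deadline_s:
            return level
    return None


# Hypothetical chunk encoded at three levels, with a 0.25 s budget per chunk.
sizes = [400_000, 200_000, 100_000]
fast = pick_level(sizes, bandwidth_bps=80_000_000, deadline_s=0.25)
medium = pick_level(sizes, bandwidth_bps=8_000_000, deadline_s=0.25)
slow = pick_level(sizes, bandwidth_bps=1_000_000, deadline_s=0.25)
```

With these made-up numbers, the fast link keeps the highest-quality level, the medium link drops one level, and the slow link returns `None`, which a caller could treat as the cue to recompute the chunk's KV cache.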
SE · Sep 19, 2024
AutoVerus: Automated Proof Generation for Rust Code
Chenyuan Yang, Xuheng Li, Md Rakib Hossain Misu et al. · microsoft-research
Generative AI has shown its value for many software engineering tasks. Still in its infancy, large language model (LLM)-based proof generation lags behind LLM-based code generation. In this paper, we present AutoVerus. AutoVerus uses LLMs to automatically generate correctness proofs for Rust code. AutoVerus is designed to match the unique features of Verus, a verification tool that can prove the correctness of Rust code using proofs and specifications also written in Rust. AutoVerus consists of a network of LLM agents that are crafted and orchestrated to mimic human experts' three phases of proof construction: preliminary proof generation, proof refinement guided by generic tips, and proof debugging guided by verification errors. To thoroughly evaluate AutoVerus and help foster future research in this direction, we have built a benchmark suite of 150 non-trivial proof tasks, based on existing code-generation benchmarks and verification benchmarks. Our evaluation shows that AutoVerus can automatically generate correct proofs for more than 90% of them, with more than half of them tackled in less than 30 seconds or 3 LLM calls.
96.2 · OS · Apr 15 · Code
VeruSAGE: A Study of Agent-Based Verification for Rust Systems
Chenyuan Yang, Natalie Neamtu, Chris Hawblitzel et al.
Large language models (LLMs) have shown impressive capability to understand and develop code. However, their capability to rigorously reason about and prove code correctness remains in question. This paper offers a comprehensive study of LLMs' capability to develop correctness proofs for system software written in Rust. We curate a new system-verification benchmark suite, VeruSAGE-Bench, which consists of 849 proof tasks extracted from eight open-source Verus-verified Rust systems. Furthermore, we design different agent systems to match the strengths and weaknesses of different LLMs (o4-mini, GPT-5, Sonnet 4, and Sonnet 4.5). Our study shows that different tools and agent settings are needed to stimulate the system-verification capability of different types of LLMs. The best LLM-agent combination in our study completes over 80% of system-verification tasks in VeruSAGE-Bench. It also completes over 90% of a set of system proof tasks not part of VeruSAGE-Bench because they had not yet been finished by human experts. This result shows the great potential for LLM-assisted development of verified system software.
CL · Feb 4
ERNIE 5.0 Technical Report
Haifeng Wang, Hua Wu, Tian Wu et al.
In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, based on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. Within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, thereby guaranteeing efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 represents the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside comprehensive empirical analysis of elastic training, aiming to offer profound insights to the community.
91.1 · SE · May 9
Reducing the Costs of Proof Synthesis on Rust Systems by Scaling Up a Seed Training Set
Nongyu Di, Tianyu Chen, Shan Lu et al.
Large Language Models (LLMs) are widely used for code generation. However, the correctness of code generated by LLMs remains a concern. A potential remedy to this concern is to have LLMs generate formal correctness proofs along with such code. However, compared with code generation, code-proof generation requires much higher reasoning capability and has much less existing data to learn from. In this paper, we present VeruSyn, a data synthesis pipeline for Verus, a state-of-the-art verification tool for system software written in Rust. Through self-synthesis and tutorial-based synthesis, VeruSyn achieves much larger scale and Verus-feature coverage than previous data-synthesis techniques designed for Verus; VeruSyn also supplements its dataset with long-chain-of-thought (CoT) data through agent trajectory synthesis. With VeruSyn, we synthesize the largest set of Verus verified programs: 6.9 million Rust programs, each with a formal specification and a proof that it meets that specification. This dataset lets us create a fine-tuned Qwen2.5-Coder-32B-Instruct model with appealing cost-proof tradeoff compared with state-of-the-art commercial models like Claude Sonnet 4.5. It also significantly outperforms models like o4-mini and previously proposed research models.
34.5 · RO · May 7
Real-world Latency Analysis of Vehicular Visible Light Communication with Multiple LED Transmitters and an Event-Based Camera
Ryota Soga, Tsukasa Shimizu, Shintaro Shiba et al.
Event cameras offer high temporal resolution, low latency, and wide dynamic range, making them promising receivers for visible light communication (VLC) in vehicle-to-everything (V2X) applications. This work presents an event-camera-based VLC system addressing three key challenges: bandwidth saturation, multi-transmitter reception, and latency characterization. We adopt a positive-event-only mode and design a protocol that suppresses event generation while maintaining communication distance and a wide field of view. We also propose a method to identify multiple transmitters and demonstrate simultaneous reception from up to three LEDs. Finally, we evaluate end-to-end latency in real vehicular scenarios and show that the system meets cooperative perception requirements. These results demonstrate that event-camera-based VLC is a feasible complement to existing V2X technologies (e.g., RF).
4.2 · SP · May 22
Deep-Learning-Aided Successive Cancellation List Flip Decoding for Polar Codes
Fu-Siang Liang, Shan Lu, Yeong-Luh Ueng
Polar codes are the first error-correcting codes proven to achieve channel capacity as the code length tends to infinity. The Successive Cancellation List Flip (SCLF) decoding algorithm was proposed to flip an erroneous bit during the next decoding attempt. To identify the erroneous bits, the Log-Likelihood Ratio (LLR) is used to indicate the reliability of each decision bit. To improve the accuracy of the erroneous bit prediction, we propose deep-learning-aided (DL-aided) SCLF decoding algorithms. We first offer a stacked LSTM network that contains new features to train our models, which are able to improve the accuracy of the prediction of positions of erroneous bits. Then we separately train the stacked LSTM models to predict the position of both the first and second erroneous bits and whether to continue flipping. As a result, the DL-aided SCLF decoding algorithms based on the proposed stacked LSTM flip-1 model, stacked LSTM flip-2 model, and the stacked LSTM continue-flipping check (CFC) model are able to provide better performance at a lower average number of decoding attempts when compared to other state-of-the-art decoding algorithms.
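The LLR-based baseline that this abstract starts from can be sketched in a few lines. This is only the conventional heuristic (treat the decisions with the smallest LLR magnitude as the least reliable and try flipping those first), not the paper's LSTM predictor; the function name and the example values are hypothetical.

```python
def flip_candidates(llrs, info_positions, k=1):
    """Rank information-bit positions by reliability, least reliable first.

    llrs: per-bit log-likelihood ratios from a failed decoding attempt.
    info_positions: indices of non-frozen (information) bits.
    Returns the k positions with the smallest |LLR| as flip candidates.
    """
    return sorted(info_positions, key=lambda i: abs(llrs[i]))[:k]


# Hypothetical LLRs: bit 1 is the least reliable information bit, then bit 4.
candidates = flip_candidates([4.2, -0.3, 2.5, -6.0, 0.9], [1, 2, 4], k=2)
```

A learned model, as proposed in the abstract, would replace the `abs(llrs[i])` ranking with its own prediction of the erroneous position.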
66.1 · IT · May 22
Layered construction of Message-Wise Unequal Error Protection Codes
Qiming Lu, Shan Lu, Takaya Yamazato
Conventional communication systems are mainly designed to reduce error rates and increase transmission rates, and therefore usually provide uniform protection to all transmitted messages. However, in intent-oriented applications, different messages may have different semantic meanings and importance levels, requiring different levels of reliability. This paper proposes a layered construction of message-level unequal error protection (UEP) codes for short-blocklength communication. Instead of appending an explicit protection tag to each codeword, the proposed method embeds the protection structure directly into the Hamming-distance structure of the codebook. By assigning larger minimum intra-level distances to higher-importance message groups and imposing suitable inter-level distance constraints, the proposed codebook provides differentiated error-correction capabilities while enabling reliable importance-level classification at the receiver. Theoretical conditions for correct group classification are derived, and simulations over AWGN and VLC-ISI channels show that the proposed scheme improves BER performance and group classification accuracy compared with a tag-based ECC baseline.
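The distance structure mentioned in this abstract can be checked on a toy example. The codebook below is made up for illustration and is not the paper's construction; the sketch only shows how minimum intra-level and inter-level Hamming distances are computed, with the high-importance group given the larger intra-level distance.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of positions where two equal-length codewords differ."""
    return sum(x != y for x, y in zip(a, b))


def min_intra(group):
    """Minimum Hamming distance between distinct codewords of one group."""
    return min(hamming(a, b) for a, b in combinations(group, 2))


def min_inter(group_a, group_b):
    """Minimum Hamming distance between codewords of two different groups."""
    return min(hamming(a, b) for a, b in product(group_a, group_b))


# Toy codebook (hypothetical): two importance levels, blocklength 8.
high = ["00000000", "11111111"]
low = ["00001111", "00110011", "01010101"]

d_high = min_intra(high)        # 8
d_low = min_intra(low)          # 4
d_cross = min_inter(high, low)  # 4
```

Under minimum-distance decoding, a group with minimum distance d can correct up to ⌊(d − 1)/2⌋ errors among its own codewords, so in this toy codebook the high-importance group tolerates more errors than the low-importance one.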
SE · Oct 21, 2024 · Code
Automated Proof Generation for Rust Code via Self-Evolution
Tianyu Chen, Shuai Lu, Shan Lu et al.
Ensuring correctness is crucial for code generation. Formal verification offers a definitive assurance of correctness, but demands substantial human effort in proof construction and hence raises a pressing need for automation. The primary obstacle lies in the severe lack of data: there are far fewer proofs than code snippets for Large Language Models (LLMs) to train upon. In this paper, we introduce SAFE, a framework that overcomes the lack of human-written proofs to enable automated proof generation of Rust code. SAFE establishes a self-evolving cycle where data synthesis and fine-tuning collaborate to enhance the model capability, leveraging the definitive power of a symbolic verifier in telling correct proofs from incorrect ones. SAFE also re-purposes the large number of synthesized incorrect proofs to train the self-debugging capability of the fine-tuned models, empowering them to fix incorrect proofs based on the verifier's feedback. SAFE demonstrates superior efficiency and precision compared to GPT-4o. Through tens of thousands of synthesized proofs and the self-debugging mechanism, we improve the capability of open-source models, initially unacquainted with formal verification, to automatically write proofs for Rust code. This advancement leads to a significant improvement in performance, achieving a 52.52% accuracy rate in a benchmark crafted by human experts, a significant leap over GPT-4o's performance of 14.39%.
LG · Oct 26, 2023
MIM-GAN-based Anomaly Detection for Multivariate Time Series Data
Shan Lu, Zhicheng Dong, Donghong Cai et al.
The loss function of a generative adversarial network (GAN) is an important factor that affects the quality and diversity of the generated samples for anomaly detection. In this paper, we propose an unsupervised multiple time series anomaly detection algorithm based on the GAN with message importance measure (MIM-GAN). In particular, the time series data is divided into subsequences using a sliding window. Then a generator and a discriminator designed based on the Long Short-Term Memory (LSTM) are employed to capture the temporal correlations of the time series data. To avoid the local optimal solution of the loss function and model collapse, we introduce an exponential information measure into the loss function of the GAN. Additionally, a discriminant reconstruction score consisting of discrimination and reconstruction loss is taken into account. The global optimal solution for the loss function is derived and model collapse is proved to be avoided in our proposed MIM-GAN-based anomaly detection algorithm. Experimental results show that the proposed MIM-GAN-based anomaly detection algorithm has superior performance in terms of precision, recall, and F1 score.
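The sliding-window step mentioned in this abstract can be sketched as follows. The window length and stride are placeholders (the abstract does not state them), and the function name is hypothetical; each row of `series` is one timestep of a multivariate signal.

```python
def sliding_windows(series, window, stride=1):
    """Split a time series (list of per-timestep rows) into subsequences."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    count = (len(series) - window) // stride + 1
    if count <= 0:
        raise ValueError("series is shorter than one window")
    # Each subsequence is `window` consecutive rows, starting every `stride`.
    return [series[i * stride : i * stride + window] for i in range(count)]


# Ten timesteps of a two-channel series, cut into windows of 4 with stride 2.
series = [[t, 10 * t] for t in range(10)]
windows = sliding_windows(series, window=4, stride=2)
```

Each resulting subsequence would then be fed to the LSTM-based generator and discriminator as one training sample.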
SE · Oct 7, 2023
Automatic and Efficient Customization of Neural Networks for ML Applications
Yuhan Liu, Chengcheng Wan, Kuntai Du et al.
ML APIs have greatly relieved application developers of the burden to design and train their own neural network models -- classifying objects in an image can now be as simple as one line of Python code to call an API. However, these APIs offer the same pre-trained models regardless of how their output is used by different applications. This can be suboptimal as not all ML inference errors can cause application failures, and the distinction between inference errors that can or cannot cause failures varies greatly across applications. To tackle this problem, we first study 77 real-world applications, which collectively use six ML APIs from two providers, to reveal common patterns of how ML API output affects applications' decision processes. Inspired by the findings, we propose ChameleonAPI, an optimization framework for ML APIs, which takes effect without changing the application source code. ChameleonAPI provides application developers with a parser that automatically analyzes the application to produce an abstract of its decision process, which is then used to devise an application-specific loss function that only penalizes API output errors critical to the application. ChameleonAPI uses the loss function to efficiently train a neural network model customized for each application and deploys it to serve API invocations from the respective application via existing interface. Compared to a baseline that selects the best-of-all commercial ML API, we show that ChameleonAPI reduces incorrect application decisions by 43%.
34.0 · OS · Apr 13
Nanvix: A Multikernel OS Design for High-Density Serverless Deployments
Carlos Segarra, Pedro Henrique Penna, Enrique Saurez et al.
Serverless providers strive for high resource utilization by optimizing deployment density: how many applications can be deployed per host server. However, achieving high deployment density without compromising application performance or isolation remains an open challenge. High density can be achieved by sharing components across applications, yet applications from different tenants must be strongly isolated from each other due to the risk of side-channel attacks. Sharing components across applications from the same tenant, if done naively, can introduce contention on host resources thus negatively affecting application performance. We describe Nanvix, a new multikernel OS that disaggregates ephemeral execution state, unique per application invocation, from long-lived persistent state, shared among invocations from the same tenant. Applications in Nanvix execute inside a lightweight user VM running a micro-kernel that implements threads and memory, and forwards all I/O requests to a system VM. The system VM runs a macro-kernel with a rich set of device drivers and is shared among all invocations from the same tenant. Nanvix' split design achieves strong hypervisor isolation across tenants without sacrificing application performance, and reduces same-tenant contention by multiplexing all I/O requests to the system VM. Thanks to a system-wide co-design, Nanvix achieves order-of-magnitude lower application start up times with moderate I/O overheads. When replaying a production trace, Nanvix needs 20-100x fewer host servers compared to state-of-the-art systems, improving deployment density
53.6 · IT · May 17
ISI Modeling and BER Performance for Rotating Light-Trail Image Sensor Communication
Shin Asaoka, Shan Lu, Zhengqiang Tang et al.
Image sensor communication (ISC) employing a propeller-LED transmitter encodes data along rotating light trails. We present an analytical framework that (i) constructs a single-LED, single-blink light trail model that maps optical power to pixel values, and (ii) integrates a probabilistic noise model to derive a closed-form bit-error rate (BER) using the $Q$-function. Trimodal pixel-value histograms motivate an adjacent-only inter-symbol interference (ISI) model in which the decision at segment $j$ depends on adjacent segments. Applying a hardest-pair midpoint threshold yields per-segment BER and a general BER after marginalization. We further provide practical sufficiency conditions under which adjacent-only ISI is adequate, and validate its tightness against Monte Carlo simulations and experimental results. Using the analytical BER, we select the control angle that maximizes throughput while satisfying a target BER reliability constraint.
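For readers unfamiliar with $Q$-function BER expressions, the textbook two-level case gives a feel for the closed form. This is not the paper's ISI model: it assumes two pixel-value levels with means `mu0` and `mu1`, equal Gaussian noise standard deviation `sigma`, equiprobable symbols, and a midpoint threshold, all of which are illustrative assumptions.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x) = P(Z > x) for standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2))


def midpoint_ber(mu0, mu1, sigma):
    """BER for two equiprobable levels with a midpoint threshold.

    Each level sits |mu1 - mu0| / 2 away from the threshold, so an error
    occurs when the noise exceeds that margin: BER = Q(|mu1 - mu0| / (2*sigma)).
    """
    return q_function(abs(mu1 - mu0) / (2 * sigma))


ber = midpoint_ber(mu0=0.0, mu1=2.0, sigma=1.0)  # equals Q(1)
```

An adjacent-only ISI model as described in the abstract would evaluate expressions of this kind per neighbor pattern and then marginalize over the patterns.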
CV · Apr 20, 2023
LA3: Efficient Label-Aware AutoAugment
Mingjun Zhao, Shan Lu, Zixuan Wang et al.
Automated augmentation is an emerging and effective technique to search for data augmentation policies to improve generalizability of deep neural network training. Most existing work focuses on constructing a unified policy applicable to all data samples in a given dataset, without considering sample or class variations. In this paper, we propose a novel two-stage data augmentation algorithm, named Label-Aware AutoAugment (LA3), which takes advantage of the label information, and learns augmentation policies separately for samples of different labels. LA3 consists of two learning stages, where in the first stage, individual augmentation methods are evaluated and ranked for each label via Bayesian Optimization aided by a neural predictor, which allows us to identify effective augmentation techniques for each label under a low search cost. And in the second stage, a composite augmentation policy is constructed out of a selection of effective as well as complementary augmentations, which produces significant performance boost and can be easily deployed in typical model training. Extensive experiments demonstrate that LA3 achieves excellent performance matching or surpassing existing methods on CIFAR-10 and CIFAR-100, and achieves a new state-of-the-art ImageNet accuracy of 79.97% on ResNet-50 among auto-augmentation methods, while maintaining a low computational cost.
98.7 · AR · May 17
VeriCache: Turning Lossy KV Cache into Lossless LLM Inference
Jiayi Yao, Samuel Shen, Kuntai Du et al.
The large size of the KV cache has become a major bottleneck for serving LLMs with increasing context lengths. In response, many KV cache compression methods, such as token dropping and quantization, have been proposed. However, almost all of these methods are inherently lossy: despite minimal accuracy degradation for short outputs, their outputs increasingly diverge from full-KV-cache outputs as more tokens are decoded, which leads to catastrophic failures in code generation and tool calling. We present VeriCache, the first inference framework that ensures the same output as full-KV-cache decoding but largely preserves the high decoding throughput of a range of KV cache compression algorithms. VeriCache uses the compressed KV cache to draft tokens, then verifies them against the full KV cache. While it may seem like just speculative decoding, VeriCache must address a key system challenge to work: keeping the full KV cache out of GPU memory and minimizing the overhead of swapping it in for verification. The insight is two-fold: (1) compressed-KV decoding can be parallelized with full-KV swap, because one is HBM-bandwidth-bound and the other is PCIe/network-bound, and (2) the compressed KV cache often produces output similar to the full KV cache, allowing a long drafting horizon to amortize each full-KV swap. VeriCache applies to both long-context decoding and remote prefix caching, supports a broad family of token-dropping and quantization methods through a uniform compressor interface, and composes with traditional speculative decoding. Experimental results show that VeriCache achieves up to 4x higher throughput than full-KV inference while producing identical outputs.
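The draft-then-verify idea in this abstract can be illustrated with a toy greedy decoder. Everything here is a stand-in: `draft_next` plays the role of decoding with a compressed KV cache, `full_next` plays the role of decoding with the full KV cache, and verification is done token by token rather than in one batched pass as a real system would. The sketch shows only why the output matches full decoding: every emitted token is the one the full decoder would produce.

```python
def draft_then_verify(prefix, draft_next, full_next, horizon, max_new):
    """Greedy decoding that drafts cheaply and verifies against a reference."""
    out = list(prefix)
    target = len(prefix) + max_new
    while len(out) < target:
        # Draft up to `horizon` tokens with the cheap (lossy) decoder.
        draft = []
        for _ in range(min(horizon, target - len(out))):
            draft.append(draft_next(out + draft))
        # Verify: keep drafted tokens while they match the reference decoder;
        # at the first mismatch, emit the reference token and drop the rest.
        for tok in draft:
            correct = full_next(out)
            out.append(correct)
            if correct != tok:
                break
    return out


# Toy "models" over integer tokens; the draft disagrees at some positions.
def full_next(tokens):
    return (tokens[-1] * 3 + 1) % 7


def draft_next(tokens):
    guess = full_next(tokens)
    return (guess + 1) % 7 if len(tokens) % 4 == 0 else guess


result = draft_then_verify([2], draft_next, full_next, horizon=3, max_new=10)
```

The longer the draft agrees with the reference, the more tokens each verification round accepts, which is the intuition behind using a long drafting horizon to amortize each full-KV swap.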
75.3 · IT · May 17
Channel Modeling and LED Spot Detection for Dense Image-Sensor Visible Light Communication
Tianhao Shi, Shan Lu, Takaya Yamazato
High-density LED arrays enable high-speed transmission in image-sensor-based visible-light communication (VLC) systems. However, when optical spots become blurred and spatially overlapped due to focal shift, resolution limitations, or interference, severe inter-symbol interference (ISI) occurs, significantly degrading decoding performance. Furthermore, radial distortion introduces geometric deformation of the LED grid, while vignetting leads to incomplete and asymmetric spot shapes at the periphery, both of which further hinder reliable signal detection. Existing methods mitigate ISI by reducing LED transmission signaling density. This paper proposes a robust decoding framework that maintains full LED signaling density. We introduce a pilot-aided geometric recognition method that uses a PSF-constrained Hough transform and circle-center alignment refinement. In addition, radial distortion correction and vignetting-aware compensation are incorporated to restore geometric consistency and suppress edge-related detection errors. By leveraging prior structural knowledge from pilot frames, the system effectively separates overlapping LED signals under severe optical distortion. Experimental results on a real-world VLC testbed confirm that the proposed method achieves superior decoding accuracy and throughput compared to conventional Hough-based and low-density baseline methods. The results highlight its potential for high-efficiency VLC applications in interference-prone environments.
LG · Nov 28, 2023
MultiModal-Learning for Predicting Molecular Properties: A Framework Based on Image and Graph Structures
Zhuoyuan Wang, Jiacong Mi, Shan Lu et al.
The quest for accurate prediction of drug molecule properties poses a fundamental challenge in the realm of Artificial Intelligence Drug Discovery (AIDD). An effective representation of drug molecules emerges as a pivotal component in this pursuit. Contemporary leading-edge research predominantly resorts to self-supervised learning (SSL) techniques to extract meaningful structural representations from large-scale, unlabeled molecular data, subsequently fine-tuning these representations for an array of downstream tasks. However, an inherent shortcoming of these studies lies in their singular reliance on one modality of molecular information, such as molecule image or SMILES representations, thus neglecting the potential complementarity of various molecular modalities. In response to this limitation, we propose MolIG, a novel MultiModaL molecular pre-training framework for predicting molecular properties based on Image and Graph structures. MolIG model innovatively leverages the coherence and correlation between molecule graph and molecule image to execute self-supervised tasks, effectively amalgamating the strengths of both molecular representation forms. This holistic approach allows for the capture of pivotal molecular structural characteristics and high-level semantic information. Upon completion of pre-training, Graph Neural Network (GNN) Encoder is used for the prediction of downstream tasks. In comparison to advanced baseline models, MolIG exhibits enhanced performance in downstream tasks pertaining to molecular property prediction within benchmark groups such as MoleculeNet Benchmark Group and ADMET Benchmark Group.
99.2 · PL · Mar 30
ExVerus: Verus Proof Repair via Counterexample Reasoning
Jun Yang, Yuechun Sun, Yi Wu et al.
Large Language Models (LLMs) have shown promising results in automating formal verification. However, existing approaches treat proof generation as a static, end-to-end prediction over source code, relying on limited verifier feedback and lacking access to concrete program behaviors. We present EXVERUS, a counterexample-guided framework that enables LLMs to reason about proofs using behavioral feedback via counterexamples. When a proof fails, EXVERUS automatically generates and validates counterexamples, and then guides the LLM to generalize them into inductive invariants to block these failures. Our evaluation shows that EXVERUS significantly improves proof accuracy, robustness, and token efficiency over the state-of-the-art prompting-based Verus proof generator.
DC · Nov 20, 2024 · Code
MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices
Mohammadali Shakerdargah, Shan Lu, Chao Gao et al.
The advent of foundation models has revolutionized various fields, enabling unprecedented task accuracy and flexibility in computational linguistics, computer vision and other domains. The attention mechanism has become an essential component of foundation models, due to its superb capability of capturing correlations in a sequence. However, attention results in quadratic complexity in memory and compute as the context length grows. Although many fusion-based exact attention acceleration algorithms have been developed for datacenter-grade GPUs and accelerators leveraging multi-core parallelism and data locality, it remains a significant challenge to accelerate attention on resource-constrained edge neural accelerators with limited compute units and stringent on-chip caches. In this paper, we propose a scheme for exact attention inference acceleration on memory-constrained edge accelerators, by parallelizing the utilization of heterogeneous compute units, i.e., vector processing units and matrix processing units. Our method involves scheduling workloads onto these different compute units in a multi-tiered tiling scheme to process tiled vector workloads and matrix workloads in attention as two streams, respecting the workload dependencies. We search for tiling factors to maximize the parallelization of both compute units while considering I/O overhead, and propose a proactive cache overwrite strategy to avoid undesirable cache spills in practice. Extensive results based on open-sourced simulation frameworks show up to 2.75x speedup and 54% reduction in energy consumption as compared to the state-of-the-art attention fusion method (FLAT) in the edge computing scenario. Further experiments on a real-world edge neural processing unit demonstrate speedup of up to 1.76x for attention as compared to FLAT, without affecting model output accuracy.
CL · Nov 29, 2023
RoKEPG: RoBERTa and Knowledge Enhancement for Prescription Generation of Traditional Chinese Medicine
Hua Pu, Jiacong Mi, Shan Lu et al.
Traditional Chinese medicine (TCM) prescription is the most critical form of TCM treatment, and uncovering the complex nonlinear relationship between symptoms and TCM is of great significance for clinical practice and assisting physicians in diagnosis and treatment. Although there have been some studies on TCM prescription generation, these studies consider a single factor and directly model the symptom-prescription generation problem mainly based on symptom descriptions, lacking guidance from TCM knowledge. To this end, we propose a RoBERTa and Knowledge Enhancement model for Prescription Generation of Traditional Chinese Medicine (RoKEPG). RoKEPG is first pre-trained on our constructed TCM corpus and then fine-tuned, and the model is guided to generate TCM prescriptions by introducing four classes of TCM knowledge through the attention mask matrix. Experimental results on the publicly available TCM prescription dataset show that RoKEPG improves the F1 metric by about 2% over the best-performing baseline model.
DB · Jul 9, 2016 · Code
Database-Backed Web Applications in the Wild: How Well Do They Work?
Cong Yan, Alvin Cheung, Shan Lu
Most modern database-backed web applications are built upon Object Relational Mapping (ORM) frameworks. While ORM frameworks ease application development by abstracting persistent data as objects, such convenience often comes with a performance cost. In this paper, we present CADO, a tool that analyzes the application logic and its interaction with databases using the Ruby on Rails ORM framework. CADO includes a static program analyzer, a profiler and a synthetic data generator to extract and understand an application's performance characteristics. We used CADO to analyze the performance problems of 27 real-world open-source Rails applications, covering domains such as online forums, e-commerce, project management, blogs, etc. Based on the results, we uncovered a number of issues that lead to sub-optimal application performance, ranging from how queries are issued and how result sets are used to physical design. We suggest possible remedies for each issue, and highlight new research opportunities that arise from them.
MA · Nov 5, 2024
DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving
Yuhan Liu, Yuyang Huang, Jiayi Yao et al.
Compound AI systems, such as agentic systems, are an emerging trend in large-scale enterprise settings, with multiple LLMs specialized for different users, tasks, and/or roles working together. In these scenarios, different models often process inputs that share the same context prefix. Although much work was done in the past to enable the reuse of prefix KV caches across inputs for a single model, how to enable one model to reuse the prefix KV caches of a different model remains an open question. We introduce DroidSpeak, the first distributed LLM inference system that enables KV cache reuse across distributed nodes running inference of different LLMs, so long as the LLMs have the same architecture. We present the first study that aims at understanding the impact of sharing KV caches across different LLMs, and if/when such sharing affects quality. Inspired by the findings, we present DroidSpeak, which selectively recomputes a few layers of the KV cache produced by another LLM and reuses the remaining layers, with negligible quality loss. Moreover, carefully pipelining the layer-wise re-computation and the loading of reused KV cache further improves the inference performance. Experiments on diverse datasets and model pairs demonstrate that DroidSpeak achieves up to 4x throughput improvement and about 3.1x faster prefill (time to first token), with negligible loss of quality in F1 scores, Rouge-L or code similarity score, compared to the baseline which does not allow any sharing across models.
LG · Apr 16, 2024
AGHINT: Attribute-Guided Representation Learning on Heterogeneous Information Networks with Transformer
Jinhui Yuan, Shan Lu, Peibo Duan et al.
Recently, heterogeneous graph neural networks (HGNNs) have achieved impressive success in representation learning by capturing long-range dependencies and heterogeneity at the node level. However, few existing studies have delved into the utilization of node attributes in heterogeneous information networks (HINs). In this paper, we investigate the impact of inter-node attribute disparities on HGNN performance on the benchmark task of node classification, and empirically find that typical models exhibit a significant performance decline when classifying nodes whose attributes markedly differ from those of their neighbors. To alleviate this issue, we propose a novel Attribute-Guided heterogeneous Information Networks representation learning model with Transformer (AGHINT), which allows a more effective aggregation of neighbor node information under the guidance of attributes. Specifically, AGHINT transcends the constraints of the original graph structure by directly integrating higher-order similar neighbor features into the learning process and modifies the message-passing mechanism between nodes based on their attribute disparities. Extensive experimental results on three real-world heterogeneous graph benchmarks with target node attributes demonstrate that AGHINT outperforms the state-of-the-art.
IV · May 23, 2025
Distance Estimation in Outdoor Driving Environments Using Phase-only Correlation Method with Event Cameras
Masataka Kobayashi, Shintaro Shiba, Quan Kong et al.
With the growing adoption of autonomous driving, the advancement of sensor technology is crucial for ensuring safety and reliable operation. Sensor fusion techniques that combine multiple sensors such as LiDAR, radar, and cameras have proven effective, but the integration of multiple devices increases both hardware complexity and cost. Therefore, developing a single sensor capable of performing multiple roles is highly desirable for cost-efficient and scalable autonomous driving systems. Event cameras have emerged as a promising solution due to their unique characteristics, including high dynamic range, low latency, and high temporal resolution. These features enable them to perform well in challenging lighting conditions, such as low-light or backlit environments. Moreover, their ability to detect fine-grained motion events makes them suitable for applications like pedestrian detection and vehicle-to-infrastructure communication via visible light. In this study, we present a method for distance estimation using a monocular event camera and a roadside LED bar. By applying a phase-only correlation technique to the event data, we achieve sub-pixel precision in detecting the spatial shift between two light sources. This enables accurate triangulation-based distance estimation without requiring stereo vision. Field experiments conducted in outdoor driving scenarios demonstrated that the proposed approach achieves a success rate of over 90% with less than 0.5-meter error for distances ranging from 20 to 60 meters. Future work includes extending this method to full position estimation by leveraging infrastructure such as smart poles equipped with LEDs, enabling event-camera-based vehicles to determine their own position in real time. This advancement could significantly enhance navigation accuracy, route optimization, and integration into intelligent transportation systems.
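As a rough illustration of the two steps this abstract names, the sketch below applies a 1-D phase-only correlation to recover an integer shift between two signals and then converts a pixel shift into a distance. It is only a minimal sketch under stated assumptions: the abstract does not give the formulas, so the pinhole relation `distance = focal_px * baseline_m / shift_px`, the function names, and all parameters are hypothetical, and the sub-pixel peak refinement reported by the authors is not reproduced here.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (forward or inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def phase_only_correlation(f, g):
    """Correlate f and g using only the phase of their cross-spectrum."""
    F, G = dft(f), dft(g)
    cross = []
    for a, b in zip(F, G):
        c = a.conjugate() * b
        mag = abs(c)
        # Discard the magnitude so that only the phase difference remains.
        cross.append(c / mag if mag > 1e-12 else 0j)
    return [v.real for v in dft(cross, inverse=True)]


def estimate_shift(f, g):
    """Integer circular shift s such that g[t] is approximately f[t - s]."""
    r = phase_only_correlation(f, g)
    return max(range(len(r)), key=lambda t: r[t])


def distance_from_shift(focal_px, baseline_m, shift_px):
    """Assumed pinhole model: two sources baseline_m apart appear shift_px apart."""
    if shift_px <= 0:
        raise ValueError("shift_px must be positive")
    return focal_px * baseline_m / shift_px
```

For a signal and a circularly shifted copy of it, the correlation is close to a single sharp peak located at the shift, which is the property that makes the peak position easy to localize.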
SE · Oct 28, 2025
VeriStruct: AI-assisted Automated Verification of Data-Structure Modules in Verus
Chuyue Sun, Yican Sun, Daneshvar Amrollahi et al.
We introduce VeriStruct, a novel framework that extends AI-assisted automated verification from single functions to more complex data structure modules in Verus. VeriStruct employs a planner module to orchestrate the systematic generation of abstractions, type invariants, specifications, and proof code. To address the challenge that LLMs often misunderstand Verus' annotation syntax and verification-specific semantics, VeriStruct embeds syntax guidance within prompts and includes a repair stage to automatically correct annotation errors. In an evaluation on eleven Rust data structure modules, VeriStruct succeeds on ten of the eleven, successfully verifying 128 out of 129 functions (99.2%) in total. These results represent an important step toward the goal of automatic AI-assisted formal verification.
AI · Jan 29, 2024
Type-based Neural Link Prediction Adapter for Complex Query Answering
Lingning Song, Yi Zu, Shan Lu et al.
Answering complex logical queries on incomplete knowledge graphs (KGs) is a fundamental and challenging task in multi-hop reasoning. Recent work defines this task as an end-to-end optimization problem, which significantly reduces the training cost and enhances the generalization of the model by using pretrained link predictors for query answering. However, most existing proposals ignore the critical semantic knowledge inherently available in KGs, such as type information, which could help answer complex logical queries. To this end, we propose TypE-based Neural Link Prediction Adapter (TENLPA), a novel model that constructs type-based entity-relation graphs to discover the latent relationships between entities and relations by leveraging type information in KGs. Meanwhile, in order to effectively combine type information with complex logical queries, an adaptive learning mechanism is introduced, which is trained by back-propagation during the complex query answering process to achieve adaptive adjustment of neural link predictors. Experiments on 3 standard datasets show that the TENLPA model achieves state-of-the-art performance on complex query answering with good generalization and robustness.
LG · Aug 15, 2020
Orthogonalized SGD and Nested Architectures for Anytime Neural Networks
Chengcheng Wan, Henry Hoffmann, Shan Lu et al.
We propose a novel variant of SGD customized for training network architectures that support anytime behavior: such networks produce a series of increasingly accurate outputs over time. Efficient architectural designs for these networks focus on re-using internal state; subnetworks must produce representations relevant for both immediate prediction and refinement by subsequent network stages. We consider traditional branched networks as well as a new class of recursively nested networks. Our new optimizer, Orthogonalized SGD, dynamically re-balances task-specific gradients when training a multitask network. In the context of anytime architectures, this optimizer projects gradients from later outputs onto a parameter subspace that does not interfere with those from earlier outputs. Experiments demonstrate that training with Orthogonalized SGD significantly improves generalization accuracy of anytime networks.
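The projection idea in this abstract can be illustrated with a Gram-Schmidt-style sketch: each later output's gradient has its components along the earlier (already orthogonalized) gradients removed, so that a step along it does not change the earlier directions to first order. This is an assumption-laden illustration only; the abstract does not state the exact update rule, and `orthogonalize` and its helpers are hypothetical names.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(later, earlier):
    """Remove from `later` its component along `earlier`."""
    denom = dot(earlier, earlier)
    if denom == 0.0:
        return list(later)  # nothing to remove along a zero gradient
    scale = dot(later, earlier) / denom
    return [l - scale * e for l, e in zip(later, earlier)]


def orthogonalize(grads):
    """Orthogonalize per-output gradients, giving earlier outputs priority.

    `grads` is ordered from the earliest output to the latest one.
    """
    result = []
    for g in grads:
        g = list(g)
        for prev in result:
            g = project_out(g, prev)
        result.append(g)
    return result
```

For example, `orthogonalize([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])` leaves the first gradient unchanged and reduces the second to `[0.0, 1.0, 0.0]`, which has zero inner product with the first.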
PF · Oct 31, 2019
ALERT: Accurate Learning for Energy and Timeliness
Chengcheng Wan, Muhammad Santriaji, Eri Rogers et al.
An increasing number of software applications incorporate runtime Deep Neural Networks (DNNs) to process sensor data and return inference results to humans. Effective deployment of DNNs in these interactive scenarios requires meeting latency and accuracy constraints while minimizing energy, a problem exacerbated by common system dynamics. Prior approaches handle dynamics through either (1) system-oblivious DNN adaptation, which adjusts DNN latency/accuracy tradeoffs, or (2) application-oblivious system adaptation, which adjusts resources to change latency/energy tradeoffs. In contrast, this paper improves on the state-of-the-art by coordinating application- and system-level adaptation. ALERT, our runtime scheduler, uses a probabilistic model to detect environmental volatility and then simultaneously select both a DNN and a system resource configuration to meet latency, accuracy, and energy constraints. We evaluate ALERT on CPU and GPU platforms for image and speech tasks in dynamic environments. ALERT's holistic approach achieves more than 13% energy reduction, and 27% error reduction over prior approaches that adapt solely at the application or system level. Furthermore, ALERT incurs only 3% more energy consumption and 2% higher DNN-inference error than an oracle scheme with perfect application and system knowledge.
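The joint selection described in this abstract can be pictured as a search over (DNN, resource setting) pairs. The sketch below is a simplified illustration, not ALERT's scheduler: it replaces the probabilistic volatility model with a single `slowdown` factor, and every profile value, name, and the energy model (power times latency) is a hypothetical assumption made for the example.

```python
from itertools import product

# Hypothetical DNN profiles: (name, accuracy, latency in seconds at full speed).
DNNS = [("small", 0.70, 0.010), ("medium", 0.80, 0.020), ("large", 0.90, 0.040)]
# Hypothetical resource settings: (power in watts, relative speed).
POWER = [(10.0, 0.5), (25.0, 1.0)]


def select(deadline_s, min_accuracy, slowdown=1.0):
    """Pick the lowest-energy (DNN, power) pair that meets both constraints.

    `slowdown` stands in for an estimate of the current environment
    (values above 1.0 mean inference is running slower than profiled).
    Returns (dnn_name, watts, energy_joules), or None if nothing is feasible.
    """
    best = None
    for (name, acc, lat), (watts, speed) in product(DNNS, POWER):
        latency = lat * slowdown / speed
        if acc < min_accuracy or latency > deadline_s:
            continue  # violates the accuracy or latency constraint
        energy = watts * latency  # assumed energy model
        if best is None or energy < best[2]:
            best = (name, watts, energy)
    return best
```

With these made-up profiles, a 50 ms deadline and an accuracy floor of 0.8 select the medium model at the low-power setting; doubling `slowdown` makes that setting miss the deadline, so the same model is selected at the higher-power setting instead.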