Lingtao Ouyang

91.5 | LG | Apr 24 | Code

SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference

Yuqi Pan, Jinghao Zhuang, Yupeng Feng et al.

Scaling context length is reshaping large-model development, yet full-attention Transformers suffer from prohibitive computation and inference bottlenecks at long sequences. A key challenge is to design foundation models that maintain performance and long-context efficiency with minimal training overhead. We introduce SpikingBrain2.0 (SpB2.0), a 5B model that advances both the architecture and the training efficiency of its predecessor. Our contributions are two-fold. (1) Architectural Innovation: We propose Dual-Space Sparse Attention (DSSA), an inter-layer hybrid of Sparse Softmax Attention (MoBA) and Sparse Linear Attention (SSE), achieving an improved performance-efficiency trade-off for long-context modeling. SpB2.0 further supports dual quantization paths: INT8-Spiking coding enables sparse event-driven computation, while FP8 coding accelerates inference on modern GPUs. (2) Enhanced Training Strategy: We develop an optimized Transformer-to-Hybrid (T2H) pipeline with dual conversion paths for LLMs and VLMs using curated open-source data. Empirically, SpB2.0-5B and SpB2.0-VL-5B recover most of the base Transformer (Qwen3-4B) capability with under 7k A100 GPU hours. SpB2.0 achieves a 10.13x TTFT speedup at 4M context and supports over 10M tokens on 8 A100 GPUs under vLLM, where full-attention models exceed memory limits. It also demonstrates strong cross-platform compatibility, enabling FP8 GPU inference (2.52x speedup at 250k) and efficient neuromorphic execution (64.31% sparsity, with area and power reductions of 70.6% and 46.5%, respectively, at 500 MHz). Overall, SpikingBrain2.0 provides a practical pathway for lightweight, multimodal, spiking foundation models, highlighting the potential of combining brain-inspired mechanisms with efficient architectures for resource-constrained and edge scenarios.

91.8 | NI | Apr 2 | Code

Cooperative Edge Caching with Large Language Model in Wireless Networks

Ning Yang, Wentao Wang, Lingtao Ouyang et al.

Cooperative edge caching in overlapping zones couples Base Station (BS) decisions, making content replacement sensitive to spatial topology and temporal reuse. Conventional heuristics suffer from myopia, while Deep Reinforcement Learning relies on brittle numerical representations and needs prohibitive retraining under topological or traffic dynamics. This paper studies a centralized, cooperative multi-BS cache-replacement controller driven by a Large Language Model (LLM) within a deterministic text-to-action loop. At each time slot, the global cache state is rendered into a prompt encapsulating each BS's inventory, deduplicated requests, and multi-scale frequency summaries. The LLM generates one decision line per BS. A strict parser and feasibility checker then either accept the joint action or fall back to an all-BS NoOp action. We align the LLM via two-stage training: Supervised Fine-Tuning on look-ahead expert trajectories to acquire action syntax and robust initialization, followed by Group Relative Policy Optimization (GRPO). The GRPO stage uses an 'opportunity-aware' reward: multi-step cooperative hit-rate gains relative to a NoOp baseline serve as the primary signal, with added penalties for invalid outputs. We focus on reactive replacement of equal-sized files, with at most one replacement per BS per slot and insertions restricted to current requests. Evaluated on identical request traces and association graphs, our orchestrator approaches a single-step exhaustive-search reference (0.610 vs. 0.617 in a 5-BS scenario), surpasses classical baselines (+4.1% over least-frequently-used), and exhibits robust zero-shot transfer across cache capacity, library size, popularity skewness, and user density. Code is available at https://github.com/gracefulning/CoopLLM-Cache.
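
The accept-or-fall-back step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the line grammar (`BS<i>: NOOP` or `BS<i>: REPLACE <old> WITH <new>`), the function name `parse_joint_action`, and the reading of "current requests" as per-BS request sets are all hypothetical and are not taken from the paper or its repository.

```python
import re
from typing import List, Optional, Set, Tuple

# Hypothetical decision-line grammar (an assumption, not the paper's format):
#   "BS<i>: NOOP"  or  "BS<i>: REPLACE <old_file> WITH <new_file>"
LINE_RE = re.compile(r"BS(\d+): (?:NOOP|REPLACE (\d+) WITH (\d+))")

# None means NoOp; otherwise (evicted_file, inserted_file).
Action = Optional[Tuple[int, int]]


def parse_joint_action(
    text: str,
    caches: List[Set[int]],
    requests: List[Set[int]],
) -> List[Action]:
    """Return the joint action, or an all-BS NoOp if any line is invalid."""
    num_bs = len(caches)
    all_noop: List[Action] = [None] * num_bs

    lines = text.strip().splitlines()
    if len(lines) != num_bs:  # exactly one decision line per BS
        return all_noop

    actions: List[Action] = []
    for bs, line in enumerate(lines):
        match = LINE_RE.fullmatch(line.strip())
        if match is None or int(match.group(1)) != bs:
            return all_noop  # syntax error or lines out of order
        if match.group(2) is None:
            actions.append(None)  # explicit NoOp for this BS
            continue
        old, new = int(match.group(2)), int(match.group(3))
        # Feasibility: evict a cached file, insert an uncached file that is
        # currently requested at this BS (assumed per-BS interpretation).
        if old not in caches[bs] or new in caches[bs] or new not in requests[bs]:
            return all_noop
        actions.append((old, new))
    return actions
```

Rejecting the whole joint action on a single bad line, rather than patching individual lines, keeps the loop deterministic: the environment only ever sees a fully valid joint action or the all-BS NoOp.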

2 Papers