SD Jan 11, 2023
Rethinking complex-valued deep neural networks for monaural speech enhancement
Haibin Wu, Ke Tan, Buye Xu et al.
Despite multiple efforts made towards adopting complex-valued deep neural networks (DNNs), it remains an open question whether complex-valued DNNs are generally more effective than real-valued DNNs for monaural speech enhancement. This work is devoted to presenting a critical assessment by systematically examining complex-valued DNNs against their real-valued counterparts. Specifically, we investigate complex-valued DNN atomic units, including linear layers, convolutional layers, long short-term memory (LSTM), and gated linear units. By comparing complex- and real-valued versions of fundamental building blocks in the recently developed gated convolutional recurrent network (GCRN), we show how different mechanisms for basic blocks affect the performance. We also find that the use of complex-valued operations hinders the model capacity when the model size is small. In addition, we examine two recent complex-valued DNNs, i.e. deep complex convolutional recurrent network (DCCRN) and deep complex U-Net (DCUNET). Evaluation results show that both DNNs produce identical performance to their real-valued counterparts while requiring much more computation. Based on these comprehensive comparisons, we conclude that complex-valued DNNs do not provide a performance gain over their real-valued counterparts for monaural speech enhancement, and thus are less desirable due to their higher computational costs.
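A complex-valued linear layer can be rewritten with real arithmetic only, which gives one way to read the capacity remark above. The sketch below is illustrative and not taken from the paper; the function names are hypothetical. It computes the same output twice: once with Python's built-in complex numbers, and once using only real multiplies on the real and imaginary parts.

```python
def complex_linear(W, x):
    """y = W x, where W is a list of rows of complex weights and x is a complex vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def complex_linear_with_real_ops(W, x):
    """Same map using only real arithmetic; returns (real, imag) pairs.

    Each complex weight contributes 4 real multiplies but only 2 free
    real parameters (its real and imaginary parts).
    """
    out = []
    for row in W:
        re = sum(w.real * xi.real - w.imag * xi.imag for w, xi in zip(row, x))
        im = sum(w.real * xi.imag + w.imag * xi.real for w, xi in zip(row, x))
        out.append((re, im))
    return out


def parameter_counts(n_out, n_in):
    """Free real parameters: complex layer vs. an unconstrained real layer
    acting on stacked [real; imag] vectors of the same sizes."""
    complex_params = 2 * n_out * n_in
    real_params = (2 * n_out) * (2 * n_in)
    return complex_params, real_params


W = [[1 + 2j, 3 - 1j]]
x = [2 + 0j, 1j]
y_complex = complex_linear(W, x)
y_real = complex_linear_with_real_ops(W, x)
```

Under this view, a complex layer is a real layer on stacked real and imaginary parts whose weight matrix is tied to a specific block structure, so for the same multiply count it has half as many free parameters.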
DC Mar 26
eBeeMetrics: An eBPF-based Library Framework for Feedback-free Observability of QoS Metrics
Muntaka Ibnath, Mohammadreza Rezvani, Daniel Wong
Many system management runtimes (SMRs), such as resource management and power management techniques, rely on quality-of-service (QoS) metrics, such as tail latency or throughput, as feedback. These QoS metrics are generally neither observable with hardware performance counters nor directly observable within the OS kernel. This introduces complexity and overhead in instrumenting the application and integrating QoS performance metric feedback with many management runtimes. To bridge this gap, we introduce eBeeMetrics, an eBPF-based library framework to accurately observe application-level metrics derived from only eBPF-observable events, such as system calls. eBeeMetrics can be used as a drop-in replacement to decouple system management runtimes from QoS metric feedback reporting, or can supplement existing QoS metrics to better identify server-side dynamics. eBeeMetrics achieves a strong correlation with real-world measured throughput and latency metrics across various latency-sensitive workloads. The eBeeMetrics tool is open-source; the source code is available at: https://github.com/Ibnathism/eBeeMetrics.
DC Apr 24
Coordinating GPU Data Centers and Power Grid Regulation Service for Exogenous Carbon Benefits
Ali Jahanshahi, Sara Rashidi Golrouye, Osten Anderson et al.
The rapid growth of AI/ML data centers has led to higher energy consumption and carbon emissions. The shift to renewable energy and growing data center energy demands can destabilize the power grid. Power grids rely on frequency regulation reserves, typically fossil-fueled power plants, to stabilize and balance the supply and demand of electricity. This paper sheds light on the hidden carbon emissions of frequency regulation service. Our work explores how modern GPU data centers can coordinate with power grids to reduce the need for fossil-fueled frequency regulation reserves. We first introduce a novel metric, Exogenous Carbon, to quantify grid-side carbon emission reductions resulting from data center participation in regulation service. We additionally introduce EcoCenter, a framework to maximize the amount of frequency regulation provision that GPU data centers can provide and thus reduce the amount of frequency regulation reserves necessary. We demonstrate that data center participation in frequency regulation can result in Exogenous Carbon savings that can outweigh operational carbon emissions.
AS Feb 3
Conditional Flow Matching for Visually-Guided Acoustic Highlighting
Hugo Malard, Gael Le Lan, Daniel Wong et al.
Visually-guided acoustic highlighting seeks to rebalance audio in alignment with the accompanying video, creating a coherent audio-visual experience. While visual saliency and enhancement have been widely studied, acoustic highlighting remains underexplored, often leading to misalignment between visual and auditory focus. Existing approaches use discriminative models, which struggle with the inherent ambiguity in audio remixing, where no natural one-to-one mapping exists between poorly-balanced and well-balanced audio mixes. To address this limitation, we reframe this task as a generative problem and introduce a Conditional Flow Matching (CFM) framework. A key challenge in iterative flow-based generation is that early prediction errors -- in selecting the correct source to enhance -- compound over steps and push trajectories off-manifold. To address this, we introduce a rollout loss that penalizes drift at the final step, encouraging self-correcting trajectories and stabilizing long-range flow integration. We further propose a conditioning module that fuses audio and visual cues before vector field regression, enabling explicit cross-modal source selection. Extensive quantitative and qualitative evaluations show that our method consistently surpasses the previous state-of-the-art discriminative approach, establishing that visually-guided audio remixing is best addressed through generative modeling.
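For readers unfamiliar with flow matching, the sketch below shows the generic training pair and Euler sampling loop for a linear interpolation path, a common choice in conditional flow matching. It is an assumption-laden illustration, not the paper's model: the audio-visual conditioning module and the rollout loss are not shown, and all function names are hypothetical.

```python
def cfm_training_pair(x0, x1, t):
    """Linear-path flow matching: point on the path and its target velocity.

    x0 is the source sample, x1 the target sample, t in [0, 1].
    A network v_theta(x_t, t, condition) would be regressed onto `velocity`.
    """
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def euler_integrate(x0, velocity_fn, steps):
    """Generate a sample by integrating dx/dt = velocity_fn(x, t) from t=0 to t=1."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check: with the exact (constant) velocity field, integration reaches x1.
source, target = [0.0, 0.0], [1.0, 2.0]
x_quarter, v_target = cfm_training_pair(source, target, 0.25)
generated = euler_integrate(source, lambda x, t: v_target, steps=4)
```

With a learned, imperfect velocity field, errors made in early steps are carried forward by the same loop, which is the compounding effect the abstract describes.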
LG Nov 6, 2025
DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing
Lei Gao, Chaoyi Jiang, Hossein Entezari Zarch et al.
Modern LLM serving systems must sustain high throughput while meeting strict latency service level objectives (SLOs) across two distinct inference phases: compute-intensive prefill and memory-bound decode. Existing approaches either (1) aggregate both phases on shared GPUs, leading to interference between prefill and decode phases, which degrades time-between-tokens (TBT); or (2) disaggregate the two phases across GPUs, improving latency but wasting resources through duplicated models and KV cache transfers. We present DuetServe, a unified LLM serving framework that achieves disaggregation-level isolation within a single GPU. DuetServe operates in aggregated mode by default and dynamically activates SM-level GPU spatial multiplexing when TBT degradation is predicted. Its key idea is to decouple prefill and decode execution only when needed through fine-grained, adaptive SM partitioning that provides phase isolation only when contention threatens latency SLOs. DuetServe integrates (1) an attention-aware roofline model to forecast iteration latency, (2) a partitioning optimizer that selects the optimal SM split to maximize throughput under TBT constraints, and (3) an interruption-free execution engine that eliminates CPU-GPU synchronization overhead. Evaluations show that DuetServe improves total throughput by up to 1.3x while maintaining low generation latency compared to state-of-the-art frameworks.
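The distinction between compute-intensive prefill and memory-bound decode can be illustrated with a plain roofline estimate. The sketch below is the textbook roofline model, not DuetServe's attention-aware variant, and every number in it is a made-up placeholder rather than a measurement.

```python
def roofline_latency(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Textbook roofline estimate: runtime is limited by the slower of
    compute (flops / peak_flops) and memory traffic (bytes / bandwidth).

    Returns (estimated_seconds, limiting_resource).
    """
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bandwidth
    bound = "compute" if compute_time >= memory_time else "memory"
    return max(compute_time, memory_time), bound


# Hypothetical device: 1e14 FLOP/s peak compute, 1e12 B/s memory bandwidth.
PEAK_FLOPS = 1e14
PEAK_BW = 1e12

# Hypothetical prefill step: many FLOPs per byte moved.
prefill_latency, prefill_bound = roofline_latency(2e12, 1e9, PEAK_FLOPS, PEAK_BW)

# Hypothetical decode step: few FLOPs, dominated by weight and KV cache reads.
decode_latency, decode_bound = roofline_latency(1e10, 2e10, PEAK_FLOPS, PEAK_BW)
```

In this toy setting the two steps take similar time but stress different resources, which is the situation in which splitting SMs between phases could plausibly help.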
SD Dec 8, 2021
NICE-Beam: Neural Integrated Covariance Estimators for Time-Varying Beamformers
Jonah Casebeer, Jacob Donley, Daniel Wong et al.
Estimating a time-varying spatial covariance matrix for a beamforming algorithm is a challenging task, especially for wearable devices, as the algorithm must compensate for time-varying signal statistics due to rapid pose-changes. In this paper, we propose Neural Integrated Covariance Estimators for Beamformers, NICE-Beam. NICE-Beam is a general technique for learning how to estimate time-varying spatial covariance matrices, which we apply to joint speech enhancement and dereverberation. It is based on training a neural network module to non-linearly track and leverage scene information across time. We integrate our solution into a beamforming pipeline, which enables simple training, faster than real-time inference, and a variety of test-time adaptation options. We evaluate the proposed model against a suite of baselines in scenes with both stationary and moving microphones. Our results show that the proposed method can outperform a hand-tuned estimator, despite the hand-tuned estimator using oracle source separation knowledge.
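For context, a classical way to track a time-varying spatial covariance matrix is exponential smoothing of outer products, with a hand-chosen forgetting factor. The sketch below shows that recursion in plain Python. It is a generic illustration, not the paper's neural estimator and not necessarily its hand-tuned baseline; `alpha` and the function name are assumptions.

```python
def update_covariance(R, x, alpha):
    """One recursive update: R_new = alpha * R + (1 - alpha) * x x^H.

    R is an n-by-n list of lists (complex), x is the length-n multichannel
    observation for one time-frequency bin, alpha in [0, 1) is the
    forgetting factor (closer to 1 means slower adaptation).
    """
    n = len(x)
    return [
        [alpha * R[i][j] + (1 - alpha) * x[i] * x[j].conjugate() for j in range(n)]
        for i in range(n)
    ]


# Two-microphone toy example, starting from a zero matrix.
R0 = [[0j, 0j], [0j, 0j]]
frame = [1 + 0j, 1j]
R1 = update_covariance(R0, frame, alpha=0.5)
```

A fixed `alpha` trades tracking speed against estimation noise, which is the kind of hand tuning that a learned estimator aims to replace.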
LG Oct 21, 2020
Transferable Graph Optimizers for ML Compilers
Yanqi Zhou, Sudip Roy, Amirali Abdolrashidi et al.
Most compilers for machine learning (ML) frameworks need to solve many correlated optimization problems to generate efficient machine code. Current ML compilers rely on heuristics-based algorithms to solve these optimization problems one at a time. However, this approach is not only hard to maintain but often leads to sub-optimal solutions, especially for newer model architectures. Existing learning-based approaches in the literature are sample inefficient, tackle a single optimization problem, and do not generalize to unseen graphs, making them infeasible to deploy in practice. To address these limitations, we propose an end-to-end, transferable deep reinforcement learning method for computational graph optimization (GO), based on a scalable sequential attention mechanism over an inductive graph neural network. GO generates decisions on the entire graph rather than on each individual node autoregressively, drastically speeding up the search compared to prior methods. Moreover, we propose recurrent attention layers to jointly optimize dependent graph optimization tasks and demonstrate 33%-60% speedup on three graph optimization tasks compared to TensorFlow default optimization. On a diverse set of representative graphs consisting of up to 80,000 nodes, including Inception-v3, Transformer-XL, and WaveNet, GO achieves on average 21% improvement over human experts and 18% improvement over the prior state of the art with 15x faster convergence, on a device placement task evaluated in real systems.
LG Sep 28, 2019
GDP: Generalized Device Placement for Dataflow Graphs
Yanqi Zhou, Sudip Roy, Amirali Abdolrashidi et al.
Runtime and scalability of large neural networks can be significantly affected by the placement of operations in their dataflow graphs on suitable devices. With increasingly complex neural network architectures and heterogeneous device characteristics, finding a reasonable placement is extremely challenging even for domain experts. Most existing automated device placement approaches are impractical due to the significant amount of compute required and their inability to generalize to new, previously held-out graphs. To address both limitations, we propose an efficient end-to-end method based on a scalable sequential attention mechanism over a graph neural network that is transferable to new graphs. On a diverse set of representative deep learning models, including Inception-v3, AmoebaNet, Transformer-XL, and WaveNet, our method on average achieves 16% improvement over human experts and 9.2% improvement over the prior art with 15 times faster convergence. To further reduce the computation cost, we pre-train the policy network on a set of dataflow graphs and use a superposition network to fine-tune it on each individual graph, achieving state-of-the-art performance on large hold-out graphs with over 50k nodes, such as an 8-layer GNMT.