CL Feb 12, 2025
Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding
Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli
We present Top-Theta (Top-$θ$) Attention, a training-free method for sparsifying transformer attention during inference. Our key insight is that static, per-head thresholds can be calibrated to retain a desired, constant number of significant elements per attention row. This approach enables content-based sparsity without retraining, and it remains robust across data domains. We further introduce compensation techniques to preserve accuracy under aggressive sparsification, establishing attention thresholding as a practical and principled alternative to top-k attention. We provide an extensive evaluation on natural language processing tasks, showing that Top-$θ$ achieves a 3-10x reduction in V-cache usage and up to 10x fewer attention elements during inference while degrading accuracy by no more than 1%.
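The thresholding idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the calibration rule (averaging the k-th largest score over calibration rows), the fallback for empty rows, and the function names `calibrate_threshold` and `thresholded_softmax` are all assumptions made for illustration, and the compensation techniques mentioned in the abstract are not modeled.

```python
import math

def calibrate_threshold(rows, k):
    """Assumed calibration rule: average the k-th largest score over the rows."""
    kth_largest = [sorted(row, reverse=True)[k - 1] for row in rows]
    return sum(kth_largest) / len(kth_largest)

def thresholded_softmax(row, theta):
    """Softmax over entries with score >= theta; all other entries get weight 0."""
    kept = [i for i, score in enumerate(row) if score >= theta]
    if not kept:
        # Assumed fallback: keep the single largest score so the row is not empty.
        kept = [max(range(len(row)), key=row.__getitem__)]
    peak = max(row[i] for i in kept)
    exps = {i: math.exp(row[i] - peak) for i in kept}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(row))]

# Hypothetical pre-softmax attention scores for one head.
calib_rows = [[2.0, 0.1, 1.5, -1.0], [0.3, 1.9, 1.1, -0.5]]
theta = calibrate_threshold(calib_rows, k=2)

# At inference the threshold is static: no per-row sort or top-k search is needed.
weights = thresholded_softmax([1.8, 0.2, 1.4, -2.0], theta)
```

Unlike top-k, the number of retained elements in a given row is not fixed; the static threshold only targets roughly k elements per row on data resembling the calibration set.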
AR Sep 29, 2025
Intent-Driven Storage Systems: From Low-Level Tuning to High-Level Understanding
Shai Bergman, Won Wook Song, Lukas Cavigelli et al.
Existing storage systems lack visibility into workload intent, limiting their ability to adapt to the semantics of modern, large-scale data-intensive applications. This disconnect leads to brittle heuristics and fragmented, siloed optimizations. To address these limitations, we propose Intent-Driven Storage Systems (IDSS), a vision for a new paradigm where large language models (LLMs) infer workload and system intent from unstructured signals to guide adaptive and cross-layer parameter reconfiguration. IDSS provides holistic reasoning for competing demands, synthesizing safe and efficient decisions within policy guardrails. We present four design principles for integrating LLMs into storage control loops and propose a corresponding system architecture. Initial results on FileBench workloads show that IDSS can improve IOPS by up to 2.45X by interpreting intent and generating actionable configurations for storage components such as caching and prefetching. These findings suggest that, when constrained by guardrails and embedded within structured workflows, LLMs can function as high-level semantic optimizers, bridging the gap between application goals and low-level system control. IDSS points toward a future in which storage systems are increasingly adaptive, autonomous, and aligned with dynamic workload demands.
SY Oct 14, 2019
Physics-Informed Deep Neural Network Method for Limited Observability State Estimation
Jonatan Ostrometzky, Konstantin Berestizshevsky, Andrey Bernstein et al.
Precise knowledge of the state of the power grid is essential for optimal and reliable grid operation. In particular, knowing the state of the distribution grid becomes increasingly important as more renewable energy sources are connected directly to the distribution network, increasing the fluctuations of the injected power. In this paper, we consider the case in which the distribution grid is only partially observable and the state estimation problem is under-determined. We present a new methodology that leverages a deep neural network (DNN) to estimate the grid state. The standard DNN training method is modified to explicitly incorporate the physical information of the grid topology and line/shunt admittance. We show that our method yields higher estimation accuracy than training without physical information. Finally, we compare our method to the standard state estimation approach, which is based on weighted least squares with pseudo-measurements, and show that our method achieves significantly better estimation accuracy.
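For context, the weighted least squares baseline named in this abstract is commonly written as below. The symbols are assumptions chosen for illustration and do not come from the abstract: $z \in \mathbb{R}^m$ is the measurement vector, $x \in \mathbb{R}^n$ the grid state, $h$ the measurement function, and $W$ a weight matrix, typically the inverse of the measurement noise covariance.

```latex
\hat{x} = \arg\min_{x} \; \bigl(z - h(x)\bigr)^{\top} W \,\bigl(z - h(x)\bigr)
```

When $m < n$ the minimizer is not unique, which is the under-determined case the abstract describes; pseudo-measurements are extra entries appended to $z$ to restore a unique solution.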
LG May 28, 2018
Dynamically Sacrificing Accuracy for Reduced Computation: Cascaded Inference Based on Softmax Confidence
Konstantin Berestizshevsky, Guy Even
We study the tradeoff between computational effort and classification accuracy in a cascade of deep neural networks. During inference, the user sets the acceptable accuracy degradation, which then automatically determines confidence thresholds for the intermediate classifiers. As soon as a confidence threshold is met, inference terminates immediately without computing the output of the complete network. Confidence levels are derived directly from the softmax outputs of the intermediate classifiers, as we do not train special decision functions. We show that using the softmax output as a confidence measure in a cascade of deep neural networks reduces the number of MAC operations by 15%-50% while degrading classification accuracy by roughly 1%. Our method can be easily incorporated into pre-trained non-cascaded architectures, as we exemplify on ResNet. Our main contribution is a method that dynamically adjusts the tradeoff between accuracy and computation without retraining the model.
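The early-exit rule described in this abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function `cascaded_predict`, the helper `make_stage`, and the logits are hypothetical, and the thresholds are given directly rather than derived from a user-specified accuracy degradation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cascaded_predict(x, stages, thresholds):
    """Run classifiers in order; stop at the first one that is confident enough.

    stages: callables mapping an input to logits (hypothetical stand-ins).
    thresholds: one confidence threshold per intermediate stage
                (len(stages) - 1 entries); the final stage always answers.
    Returns (predicted_label, index_of_stage_used).
    """
    for idx, stage in enumerate(stages):
        probs = softmax(stage(x))
        confidence = max(probs)  # the top softmax probability is the confidence
        label = probs.index(confidence)
        is_last = idx == len(stages) - 1
        if is_last or confidence >= thresholds[idx]:
            return label, idx

calls = []  # records which stages were actually evaluated

def make_stage(idx, logits):
    def stage(x):
        calls.append(idx)
        return logits
    return stage

stages = [
    make_stage(0, [0.2, 0.1, 0.0]),  # nearly uniform: low confidence
    make_stage(1, [0.1, 3.0, 0.0]),  # peaked on class 1: high confidence
    make_stage(2, [5.0, 0.0, 0.0]),  # not reached in this example
]
result = cascaded_predict(None, stages, thresholds=[0.8, 0.8])
```

In this example the second stage exceeds its threshold, so the third stage is never evaluated. In a real cascade the later stages would typically reuse features computed by earlier ones; the independent callables here are a simplification.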