Or Sharir

h-index12

20papers

2,728citations

Novelty60%

AI Score43

Ranked #53,735 of 194,257 authors (top 28%)#12,252 in LG (top 30%)

20 Papers

3.3QUANT-PHDec 21, 2022

Towards Neural Variational Monte Carlo That Scales Linearly with System Size

Or Sharir, Garnet Kin-Lic Chan, Anima Anandkumar

Quantum many-body problems are some of the most challenging problems in science and are central to demystifying some exotic quantum phenomena, e.g., high-temperature superconductors. The combination of neural networks (NN) for representing quantum states, coupled with the Variational Monte Carlo (VMC) algorithm, has been shown to be a promising method for solving such problems. However, the run-time of this approach scales quadratically with the number of simulated particles, constraining the practically usable NN to - in machine learning terms - minuscule sizes (<10M parameters). Considering the many breakthroughs brought by extreme NN in the +1B parameters scale to other domains, lifting this constraint could significantly expand the set of quantum systems we can accurately simulate on classical computers, both in size and complexity. We propose a NN architecture called Vector-Quantized Neural Quantum States (VQ-NQS) that utilizes vector-quantization techniques to leverage redundancies in the local-energy calculations of the VMC algorithm - the source of the quadratic scaling. In our preliminary experiments, we demonstrate VQ-NQS ability to reproduce the ground state of the 2D Heisenberg model across various system sizes, while reporting a significant reduction of about ${\times}10$ in the number of FLOPs in the local-energy calculation.

2.0LGJul 27, 2023

Incrementally-Computable Neural Networks: Efficient Inference for Dynamic Inputs

Or Sharir, Anima Anandkumar

Deep learning often faces the challenge of efficiently processing dynamic inputs, such as sensor data or user inputs. For example, an AI writing assistant is required to update its suggestions in real time as a document is edited. Re-running the model each time is expensive, even with compression techniques like knowledge distillation, pruning, or quantization. Instead, we take an incremental computing approach, looking to reuse calculations as the inputs change. However, the dense connectivity of conventional architectures poses a major obstacle to incremental computation, as even minor input changes cascade through the network and restrict information reuse. To address this, we use vector quantization to discretize intermediate values in the network, which filters out noisy and unnecessary modifications to hidden neurons, facilitating the reuse of their values. We apply this approach to the transformers architecture, creating an efficient incremental inference algorithm with complexity proportional to the fraction of the modified inputs. Our experiments with adapting the OPT-125M pre-trained language model demonstrate comparable accuracy on document classification while requiring 12.1X (median) fewer operations for processing sequences of atomic edits.

27.3CLFeb 19, 2024Code

ChatGPT Based Data Augmentation for Improved Parameter-Efficient Debiasing of LLMs

Pengrui Han, Rafal Kocielnik, Adhithya Saravanan et al. · uw

Large Language models (LLMs), while powerful, exhibit harmful social biases. Debiasing is often challenging due to computational costs, data constraints, and potential degradation of multi-task language capabilities. This work introduces a novel approach utilizing ChatGPT to generate synthetic training data, aiming to enhance the debiasing of LLMs. We propose two strategies: Targeted Prompting, which provides effective debiasing for known biases but necessitates prior specification of bias in question; and General Prompting, which, while slightly less effective, offers debiasing across various categories. We leverage resource-efficient LLM debiasing using adapter tuning and compare the effectiveness of our synthetic data to existing debiasing datasets. Our results reveal that: (1) ChatGPT can efficiently produce high-quality training data for debiasing other LLMs; (2) data produced via our approach surpasses existing datasets in debiasing performance while also preserving internal knowledge of a pre-trained LLM; and (3) synthetic data exhibits generalizability across categories, effectively mitigating various biases, including intersectional ones. These findings underscore the potential of synthetic data in advancing the fairness of LLMs with minimal retraining cost.

6.6CHEM-PHFeb 19, 2024Code

Image Super-resolution Inspired Electron Density Prediction

Chenghan Li, Or Sharir, Shunyue Yuan et al.

Drawing inspiration from the domain of image super-resolution, we view the electron density as a 3D grayscale image and use a convolutional residual network to transform a crude and trivially generated guess of the molecular density into an accurate ground-state quantum mechanical density. We find that this model outperforms all prior density prediction approaches. Because the input is itself a real-space density, the predictions are equivariant to molecular symmetry transformations even though the model is not constructed to be. Due to its simplicity, the model is directly applicable to unseen molecular conformations and chemical elements. We show that fine-tuning on limited new data provides high accuracy even in challenging cases of exotic elements and charge states. Our work suggests new routes to learning real-space physical quantities drawing from the established ideas of image processing.

7.3AIDec 3, 2024

Artificial Expert Intelligence through PAC-reasoning

Shai Shalev-Shwartz, Amnon Shashua, Gal Beniamini et al.

Artificial Expert Intelligence (AEI) seeks to transcend the limitations of both Artificial General Intelligence (AGI) and narrow AI by integrating domain-specific expertise with critical, precise reasoning capabilities akin to those of top human experts. Existing AI systems often excel at predefined tasks but struggle with adaptability and precision in novel problem-solving. To overcome this, AEI introduces a framework for ``Probably Approximately Correct (PAC) Reasoning". This paradigm provides robust theoretical guarantees for reliably decomposing complex problems, with a practical mechanism for controlling reasoning precision. In reference to the division of human thought into System 1 for intuitive thinking and System 2 for reflective reasoning~\citep{tversky1974judgment}, we refer to this new type of reasoning as System 3 for precise reasoning, inspired by the rigor of the scientific method. AEI thus establishes a foundation for error-bounded, inference-time learning.

7.8AIJul 17, 2025

FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming

Gal Beniamini, Yuval Dor, Alon Vinnikov et al.

Frontier AI models demonstrate formidable breadth of knowledge. But how close are they to true human -- or superhuman -- expertise? Genuine experts can tackle the hardest problems and push the boundaries of scientific understanding. To illuminate the limits of frontier model capabilities, we turn away from contrived competitive programming puzzles, and instead focus on real-life research problems. We construct FormulaOne, a benchmark that lies at the intersection of graph theory, logic, and algorithms, all well within the training distribution of frontier models. Our problems are incredibly demanding, requiring an array of reasoning steps. The dataset has three key properties. First, it is of commercial interest and relates to practical large-scale optimisation problems, such as those arising in routing, scheduling, and network design. Second, it is generated from the highly expressive framework of Monadic Second-Order (MSO) logic on graphs, paving the way toward automatic problem generation at scale; ideal for building RL environments. Third, many of our problems are intimately related to the frontier of theoretical computer science, and to central conjectures therein, such as the Strong Exponential Time Hypothesis (SETH). As such, any significant algorithmic progress on our dataset, beyond known results, could carry profound theoretical implications. Remarkably, state-of-the-art models like OpenAI's o3 fail entirely on FormulaOne, solving less than 1% of the questions, even when given 10 attempts and explanatory fewshot examples -- highlighting how far they remain from expert-level understanding in some domains. To support further research, we additionally curate FormulaOne-Warmup, offering a set of simpler tasks, from the same distribution. We release the full corpus along with a comprehensive evaluation framework.

13.0QUANT-PHMar 18, 2021

Neural tensor contractions and the expressive power of deep neural quantum states

Or Sharir, Amnon Shashua, Giuseppe Carleo

We establish a direct connection between general tensor networks and deep feed-forward artificial neural networks. The core of our results is the construction of neural-network layers that efficiently perform tensor contractions, and that use commonly adopted non-linear activation functions. The resulting deep networks feature a number of edges that closely matches the contraction complexity of the tensor networks to be approximated. In the context of many-body quantum states, this result establishes that neural-network states have strictly the same or higher expressive power than practically usable variational tensor networks. As an example, we show that all matrix product states can be efficiently written as neural-network states with a number of edges polynomial in the bond dimension and depth logarithmic in the system size. The opposite instead does not hold true, and our results imply that there exist quantum states that are not efficiently expressible in terms of matrix product states or PEPS, but that are instead efficiently expressible with neural network states.

2.3CLJun 30, 2020

Technical Report: Auxiliary Tuning and its Application to Conditional Text Generation

Yoel Zeldes, Dan Padnos, Or Sharir et al.

We introduce a simple and efficient method, called Auxiliary Tuning, for adapting a pre-trained Language Model to a novel task; we demonstrate this approach on the task of conditional text generation. Our approach supplements the original pre-trained model with an auxiliary model that shifts the output distribution according to the target task. The auxiliary model is trained by adding its logits to the pre-trained model logits and maximizing the likelihood of the target task output. Our method imposes no constraints on the auxiliary architecture. In particular, the auxiliary model can ingest additional input relevant to the target task, independently from the pre-trained model's input. Furthermore, mixing the models at the logits level provides a natural probabilistic interpretation of the method. Our method achieved similar results to training from scratch for several different tasks, while using significantly fewer resources for training; we share a specific example of text generation conditioned on keywords.

20.0LGJun 22, 2020

The Depth-to-Width Interplay in Self-Attention

Yoav Levine, Noam Wies, Or Sharir et al.

Self-attention architectures, which are rapidly pushing the frontier in natural language processing, demonstrate a surprising depth-inefficient behavior: previous works indicate that increasing the internal representation (network width) is just as useful as increasing the number of self-attention layers (network depth). We theoretically predict a width-dependent transition between depth-efficiency and depth-inefficiency in self-attention. We conduct systematic empirical ablations on networks of depths 6 to 48 that clearly reveal the theoretically predicted behaviors, and provide explicit quantitative suggestions regarding the optimal depth-to-width allocation for a given self-attention network size. The race towards beyond 1-Trillion parameter language models renders informed guidelines for increasing self-attention depth and width in tandem an essential ingredient. Our guidelines elucidate the depth-to-width trade-off in self-attention networks of sizes up to the scale of GPT3 (which we project to be too deep for its size), and beyond, marking an unprecedented width of 30K as optimal for a 1-Trillion parameter network.

10.9CLApr 19, 2020

The Cost of Training NLP Models: A Concise Overview

Or Sharir, Barak Peleg, Yoav Shoham

We review the cost of training large-scale language models, and the drivers of these costs. The intended audience includes engineers and scientists budgeting their model-training experiments, as well as non-practitioners trying to make sense of the economics of modern-day Natural Language Processing (NLP).

31.1CLAug 15, 2019

SenseBERT: Driving Some Sense into BERT

Yoav Levine, Barak Lenz, Or Dagan et al.

The ability to learn from large unlabeled corpora has allowed neural language models to advance the frontier in natural language understanding. However, existing self-supervision techniques operate at the word form level, which serves as a surrogate for the underlying semantic content. This paper proposes a method to employ weak-supervision directly at the word sense level. Our model, named SenseBERT, is pre-trained to predict not only the masked words but also their WordNet supersenses. Accordingly, we attain a lexical-semantic level language model, without the use of human annotation. SenseBERT achieves significantly improved lexical understanding, as we demonstrate by experimenting on SemEval Word Sense Disambiguation, and by attaining a state of the art result on the Word in Context task.

23.3DIS-NNFeb 11, 2019Code

Deep autoregressive models for the efficient variational simulation of many-body quantum systems

Or Sharir, Yoav Levine, Noam Wies et al.

Artificial Neural Networks were recently shown to be an efficient representation of highly-entangled many-body quantum states. In practical applications, neural-network states inherit numerical schemes used in Variational Monte Carlo, most notably the use of Markov-Chain Monte-Carlo (MCMC) sampling to estimate quantum expectations. The local stochastic sampling in MCMC caps the potential advantages of neural networks in two ways: (i) Its intrinsic computational cost sets stringent practical limits on the width and depth of the networks, and therefore limits their expressive capacity; (ii) Its difficulty in generating precise and uncorrelated samples can result in estimations of observables that are very far from their true value. Inspired by the state-of-the-art generative models used in machine learning, we propose a specialized Neural Network architecture that supports efficient and exact sampling, completely circumventing the need for Markov Chain sampling. We demonstrate our approach for two-dimensional interacting spin models, showcasing the ability to obtain accurate results on larger system sizes than those currently accessible to neural-network quantum states.

24.9QUANT-PHMar 26, 2018

Quantum Entanglement in Deep Learning Architectures

Yoav Levine, Or Sharir, Nadav Cohen et al.

Modern deep learning has enabled unprecedented achievements in various domains. Nonetheless, employment of machine learning for wave function representations is focused on more traditional architectures such as restricted Boltzmann machines (RBMs) and fully-connected neural networks. In this letter, we establish that contemporary deep learning architectures, in the form of deep convolutional and recurrent networks, can efficiently represent highly entangled quantum systems. By constructing Tensor Network equivalents of these architectures, we identify an inherent reuse of information in the network operation as a key trait which distinguishes them from standard Tensor Network based representations, and which enhances their entanglement capacity. Our results show that such architectures can support volume-law entanglement scaling, polynomially more efficiently than presently employed RBMs. Thus, beyond a quantification of the entanglement capacity of leading deep learning architectures, our analysis formally motivates a shift of trending neural-network based wave function representations closer to the state-of-the-art in machine learning.

6.5LGOct 25, 2017

On the Long-Term Memory of Deep Recurrent Networks

Yoav Levine, Or Sharir, Alon Ziv et al.

A key attribute that drives the unprecedented success of modern Recurrent Neural Networks (RNNs) on learning tasks which involve sequential data, is their ability to model intricate long-term temporal dependencies. However, a well established measure of RNNs long-term memory capacity is lacking, and thus formal understanding of the effect of depth on their ability to correlate data throughout time is limited. Specifically, existing depth efficiency results on convolutional networks do not suffice in order to account for the success of deep RNNs on data of varying lengths. In order to address this, we introduce a measure of the network's ability to support information flow across time, referred to as the Start-End separation rank, which reflects the distance of the function realized by the recurrent network from modeling no dependency between the beginning and end of the input sequence. We prove that deep recurrent networks support Start-End separation ranks which are combinatorially higher than those supported by their shallow counterparts. Thus, we establish that depth brings forth an overwhelming advantage in the ability of recurrent networks to model long-term dependencies, and provide an exemplar of quantifying this key attribute which may be readily extended to other RNN architectures of interest, e.g. variants of LSTM networks. We obtain our results by considering a class of recurrent networks referred to as Recurrent Arithmetic Circuits, which merge the hidden state with the input via the Multiplicative Integration operation, and empirically demonstrate the discussed phenomena on common RNNs. Finally, we employ the tool of quantum Tensor Networks to gain additional graphic insight regarding the complexity brought forth by depth in recurrent networks.

5.7LGOct 12, 2017

Sum-Product-Quotient Networks

Or Sharir, Amnon Shashua

We present a novel tractable generative model that extends Sum-Product Networks (SPNs) and significantly boosts their power. We call it Sum-Product-Quotient Networks (SPQNs), whose core concept is to incorporate conditional distributions into the model by direct computation using quotient nodes, e.g. $P(A|B) = \frac{P(A,B)}{P(B)}$. We provide sufficient conditions for the tractability of SPQNs that generalize and relax the decomposable and complete tractability conditions of SPNs. These relaxed conditions give rise to an exponential boost to the expressive efficiency of our model, i.e. we prove that there are distributions which SPQNs can compute efficiently but require SPNs to be of exponential size. Thus, we narrow the gap in expressivity between tractable graphical models and other Neural Network-based generative models.

12.0LGMay 5, 2017

Analysis and Design of Convolutional Networks via Hierarchical Tensor Decompositions

Nadav Cohen, Or Sharir, Yoav Levine et al.

The driving force behind convolutional networks - the most successful deep learning architecture to date, is their expressive power. Despite its wide acceptance and vast empirical evidence, formal analyses supporting this belief are scarce. The primary notions for formally reasoning about expressiveness are efficiency and inductive bias. Expressive efficiency refers to the ability of a network architecture to realize functions that require an alternative architecture to be much larger. Inductive bias refers to the prioritization of some functions over others given prior knowledge regarding a task at hand. In this paper we overview a series of works written by the authors, that through an equivalence to hierarchical tensor decompositions, analyze the expressive efficiency and inductive bias of various convolutional network architectural features (depth, width, strides and more). The results presented shed light on the demonstrated effectiveness of convolutional networks, and in addition, provide new tools for network design.

10.3LGMar 6, 2017Code

On the Expressive Power of Overlapping Architectures of Deep Learning

Or Sharir, Amnon Shashua

Expressive efficiency refers to the relation between two architectures A and B, whereby any function realized by B could be replicated by A, but there exists functions realized by A, which cannot be replicated by B unless its size grows significantly larger. For example, it is known that deep networks are exponentially efficient with respect to shallow networks, in the sense that a shallow network must grow exponentially large in order to approximate the functions represented by a deep network of polynomial size. In this work, we extend the study of expressive efficiency to the attribute of network connectivity and in particular to the effect of "overlaps" in the convolutional process, i.e., when the stride of the convolution is smaller than its filter size (receptive field). To theoretically analyze this aspect of network's design, we focus on a well-established surrogate for ConvNets called Convolutional Arithmetic Circuits (ConvACs), and then demonstrate empirically that our results hold for standard ConvNets as well. Specifically, our analysis shows that having overlapping local receptive fields, and more broadly denser connectivity, results in an exponential increase in the expressive capacity of neural networks. Moreover, while denser connectivity can increase the expressive capacity, we show that the most common types of modern architectures already exhibit exponential increase in expressivity, without relying on fully-connected layers.

12.4LGOct 13, 2016Code

Tensorial Mixture Models

Or Sharir, Ronen Tamari, Nadav Cohen et al.

Casting neural networks in generative frameworks is a highly sought-after endeavor these days. Contemporary methods, such as Generative Adversarial Networks, capture some of the generative capabilities, but not all. In particular, they lack the ability of tractable marginalization, and thus are not suitable for many tasks. Other methods, based on arithmetic circuits and sum-product networks, do allow tractable marginalization, but their performance is challenged by the need to learn the structure of a circuit. Building on the tractability of arithmetic circuits, we leverage concepts from tensor analysis, and derive a family of generative models we call Tensorial Mixture Models (TMMs). TMMs assume a simple convolutional network structure, and in addition, lend themselves to theoretical analyses that allow comprehensive understanding of the relation between their structure and their expressive properties. We thus obtain a generative model that is tractable on one hand, and on the other hand, allows effective representation of rich distributions in an easily controlled manner. These two capabilities are brought together in the task of classification under missing data, where TMMs deliver state of the art accuracies with seamless implementation and design.

43.5NESep 16, 2015

On the Expressive Power of Deep Learning: A Tensor Analysis

Nadav Cohen, Or Sharir, Amnon Shashua

It has long been conjectured that hypotheses spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical networks than with shallow ones. Despite the vast empirical evidence supporting this belief, theoretical justifications to date are limited. In particular, they do not account for the locality, sharing and pooling constructs of convolutional networks, the most successful deep learning architecture to date. In this work we derive a deep network architecture based on arithmetic circuits that inherently employs locality, sharing and pooling. An equivalence between the networks and hierarchical tensor factorizations is established. We show that a shallow network corresponds to CP (rank-1) decomposition, whereas a deep network corresponds to Hierarchical Tucker decomposition. Using tools from measure theory and matrix algebra, we prove that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require exponential size in order to be realized (or even approximated) by a shallow network. Since log-space computation transforms our networks into SimNets, the result applies directly to a deep learning architecture demonstrating promising empirical performance. The construction and theory developed in this paper shed new light on various practices and ideas employed by the deep learning community.

14.5NEJun 9, 2015

Deep SimNets

Nadav Cohen, Or Sharir, Amnon Shashua

We present a deep layered architecture that generalizes convolutional neural networks (ConvNets). The architecture, called SimNets, is driven by two operators: (i) a similarity function that generalizes inner-product, and (ii) a log-mean-exp function called MEX that generalizes maximum and average. The two operators applied in succession give rise to a standard neuron but in "feature space". The feature spaces realized by SimNets depend on the choice of the similarity operator. The simplest setting, which corresponds to a convolution, realizes the feature space of the Exponential kernel, while other settings realize feature spaces of more powerful kernels (Generalized Gaussian, which includes as special cases RBF and Laplacian), or even dynamically learned feature spaces (Generalized Multiple Kernel Learning). As a result, the SimNet contains a higher abstraction level compared to a traditional ConvNet. We argue that enhanced expressiveness is important when the networks are small due to run-time constraints (such as those imposed by mobile applications). Empirical evaluation validates the superior expressiveness of SimNets, showing a significant gain in accuracy over ConvNets when computational resources at run-time are limited. We also show that in large-scale settings, where computational complexity is less of a concern, the additional capacity of SimNets can be controlled with proper regularization, yielding accuracies comparable to state of the art ConvNets.