Barak Sober

CL
h-index46
13papers
384citations
Novelty43%
AI Score39

13 Papers

SDMar 27
Probabilistic Multilabel Graphical Modelling of Motif Transformations in Symbolic Music

Ron Taieb, Yoel Greenberg, Barak Sober

Motifs often recur in musical works in altered forms, preserving aspects of their identity while undergoing local variation. This paper investigates how such motivic transformations occur within their musical context in symbolic music. To support this analysis, we develop a probabilistic framework for modeling motivic transformations and apply it to Beethoven's piano sonatas by integrating multiple datasets that provide melodic, rhythmic, harmonic, and motivic information within a unified analytical representation. Motif transformations are represented as multilabel variables by comparing each motif instance to a designated reference occurrence within its local context, ensuring consistent labeling across transformation families. We introduce a multilabel Conditional Random Field to model how motif-level musical features influence the occurrence of transformations and how different transformation families tend to co-occur. Our goal is to provide an interpretable, distributional analysis of motivic transformation patterns, enabling the study of their structural relationships and stylistic variation. By linking computational modeling with music-theoretical interpretation, the proposed framework supports quantitative investigation of musical structure and complexity in symbolic corpora and may facilitate the analysis of broader compositional patterns and writing practices.

CLMar 10, 2025
An Information-Theoretic Approach to Identifying Formulaic Clusters in Textual Data

Gideon Yoffe, Yair Segev, Barak Sober

Texts, whether literary or historical, exhibit structural and stylistic patterns shaped by their purpose, authorship, and cultural context. Formulaic texts, characterized by repetition and constrained expression, tend to have lower variability in self-information compared to more dynamic compositions. Identifying such patterns in historical documents, particularly multi-author texts like the Hebrew Bible provides insights into their origins, purpose, and transmission. This study aims to identify formulaic clusters -- sections exhibiting systematic repetition and structural constraints -- by analyzing recurring phrases, syntactic structures, and stylistic markers. However, distinguishing formulaic from non-formulaic elements in an unsupervised manner presents a computational challenge, especially in high-dimensional textual spaces where patterns must be inferred without predefined labels. To address this, we develop an information-theoretic algorithm leveraging weighted self-information distributions to detect structured patterns in text, unlike covariance-based methods, which become unstable in small-sample, high-dimensional settings, our approach directly models variations in self-information to identify formulaicity. By extending classical discrete self-information measures with a continuous formulation based on differential self-information, our method remains applicable across different types of textual representations, including neural embeddings under Gaussian priors. Applied to hypothesized authorial divisions in the Hebrew Bible, our approach successfully isolates stylistic layers, providing a quantitative framework for textual stratification. This method enhances our ability to analyze compositional patterns, offering deeper insights into the literary and cultural evolution of texts shaped by complex authorship and editorial processes.

CLNov 7, 2024
Estimating the Influence of Sequentially Correlated Literary Properties in Textual Classification: A Data-Centric Hypothesis-Testing Approach

Gideon Yoffe, Nachum Dershowitz, Ariel Vishne et al.

We introduce a data-centric hypothesis-testing framework to quantify the influence of sequentially correlated literary properties--such as thematic continuity--on textual classification tasks. Our method models label sequences as stochastic processes and uses an empirical autocovariance matrix to generate surrogate labelings that preserve sequential dependencies. This enables statistical testing to determine whether classification outcomes are primarily driven by thematic structure or by non-sequential features like authorial style. Applying this framework across a diverse corpus of English prose, we compare traditional (word n-grams and character k-mers) and neural (contrastively trained) embeddings in both supervised and unsupervised classification settings. Crucially, our method identifies when classifications are confounded by sequentially correlated similarity, revealing that supervised and neural models are more prone to false positives--mistaking shared themes and cross-genre differences for stylistic signals. In contrast, unsupervised models using traditional features often yield high true positive rates with minimal false positives, especially in genre-consistent settings. By disentangling sequential from non-sequential influences, our approach provides a principled way to assess and interpret classification reliability. This is particularly impactful for authorship attribution, forensic linguistics, and the analysis of redacted or composite texts, where conventional methods may conflate theme with style. Our results demonstrate that controlling for sequential correlation is essential for reducing false positives and ensuring that classification outcomes reflect genuine stylistic distinctions.

CLMay 3, 2023
A Statistical Exploration of Text Partition Into Constituents: The Case of the Priestly Source in the Books of Genesis and Exodus

Gideon Yoffe, Axel Bühler, Nachum Dershowitz et al.

We present a pipeline for a statistical textual exploration, offering a stylometry-based explanation and statistical validation of a hypothesized partition of a text. Given a parameterization of the text, our pipeline: (1) detects literary features yielding the optimal overlap between the hypothesized and unsupervised partitions, (2) performs a hypothesis-testing analysis to quantify the statistical significance of the optimal overlap, while conserving implicit correlations between units of text that are more likely to be grouped, and (3) extracts and quantifies the importance of features most responsible for the classification, estimates their statistical stability and cluster-wise abundance. We apply our pipeline to the first two books in the Bible, where one stylistic component stands out in the eyes of biblical scholars, namely, the Priestly component. We identify and explore statistically significant stylistic differences between the Priestly and non-Priestly components.

CVJan 23, 2022
Mixed X-Ray Image Separation for Artworks with Concealed Designs

Wei Pu, Jun-Jie Huang, Barak Sober et al.

In this paper, we focus on X-ray images of paintings with concealed sub-surface designs (e.g., deriving from reuse of the painting support or revision of a composition by the artist), which include contributions from both the surface painting and the concealed features. In particular, we propose a self-supervised deep learning-based image separation approach that can be applied to the X-ray images from such paintings to separate them into two hypothetical X-ray images. One of these reconstructed images is related to the X-ray image of the concealed painting, while the second one contains only information related to the X-ray of the visible painting. The proposed separation network consists of two components: the analysis and the synthesis sub-networks. The analysis sub-network is based on learned coupled iterative shrinkage thresholding algorithms (LCISTA) designed using algorithm unrolling techniques, and the synthesis sub-network consists of several linear mappings. The learning algorithm operates in a totally self-supervised fashion without requiring a sample set that contains both the mixed X-ray images and the separated ones. The proposed method is demonstrated on a real painting with concealed content, Doña Isabel de Porcel by Francisco de Goya, to show its effectiveness.

LGJul 28, 2021
Neural Network Approximation of Refinable Functions

Ingrid Daubechies, Ronald DeVore, Nadav Dym et al.

In the desire to quantify the success of neural networks in deep learning and other applications, there is a great interest in understanding which functions are efficiently approximated by the outputs of neural networks. By now, there exists a variety of results which show that a wide range of functions can be approximated with sometimes surprising accuracy by these outputs. For example, it is known that the set of functions that can be approximated with exponential accuracy (in terms of the number of parameters used) includes, on one hand, very smooth functions such as polynomials and analytic functions (see e.g. \cite{E,S,Y}) and, on the other hand, very rough functions such as the Weierstrass function (see e.g. \cite{EPGB,DDFHP}), which is nowhere differentiable. In this paper, we add to the latter class of rough functions by showing that it also includes refinable functions. Namely, we show that refinable functions are approximated by the outputs of deep ReLU networks with a fixed width and increasing depth with accuracy exponential in terms of their number of parameters. Our results apply to functions used in the standard construction of wavelets as well as to functions constructed via subdivision algorithms in Computer Aided Geometric Design.

STMay 11, 2021
Non-Parametric Estimation of Manifolds from Noisy Data

Yariv Aizenbud, Barak Sober

A common observation in data-driven applications is that high dimensional data has a low intrinsic dimension, at least locally. In this work, we consider the problem of estimating a $d$ dimensional sub-manifold of $\mathbb{R}^D$ from a finite set of noisy samples. Assuming that the data was sampled uniformly from a tubular neighborhood of $\mathcal{M}\in \mathcal{C}^k$, a compact manifold without boundary, we present an algorithm that takes a point $r$ from the tubular neighborhood and outputs $\hat p_n\in \mathbb{R}^D$, and $\widehat{T_{\hat p_n}\mathcal{M}}$ an element in the Grassmanian $Gr(d, D)$. We prove that as the number of samples $n\to\infty$ the point $\hat p_n$ converges to $p\in \mathcal{M}$ and $\widehat{T_{\hat p_n}\mathcal{M}}$ converges to $T_p\mathcal{M}$ (the tangent space at that point) with high probability. Furthermore, we show that the estimation yields asymptotic rates of convergence of $n^{-\frac{k}{2k + d}}$ for the point estimation and $n^{-\frac{k-1}{2k + d}}$ for the estimation of the tangent space. These rates are known to be optimal for the case of function estimation.

IVSep 16, 2020
Image Separation with Side Information: A Connected Auto-Encoders Based Approach

Wei Pu, Barak Sober, Nathan Daly et al.

X-radiography (X-ray imaging) is a widely used imaging technique in art investigation. It can provide information about the condition of a painting as well as insights into an artist's techniques and working methods, often revealing hidden information invisible to the naked eye. In this paper, we deal with the problem of separating mixed X-ray images originating from the radiography of double-sided paintings. Using the visible color images (RGB images) from each side of the painting, we propose a new Neural Network architecture, based upon 'connected' auto-encoders, designed to separate the mixed X-ray image into two simulated X-ray images corresponding to each side. In this proposed architecture, the convolutional auto encoders extract features from the RGB images. These features are then used to (1) reproduce both of the original RGB images, (2) reconstruct the hypothetical separated X-ray images, and (3) regenerate the mixed X-ray image. The algorithm operates in a totally self-supervised fashion without requiring a sample set that contains both the mixed X-ray images and the separated ones. The methodology was tested on images from the double-sided wing panels of the \textsl{Ghent Altarpiece}, painted in 1432 by the brothers Hubert and Jan van Eyck. These tests show that the proposed approach outperforms other state-of-the-art X-ray image separation methods for art investigation applications.

NAJul 20, 2020
Approximating the Riemannian Metric from Point Clouds via Manifold Moving Least Squares

Barak Sober, Robert Ravier, Ingrid Daubechies

The approximation of both geodesic distances and shortest paths on point cloud sampled from an embedded submanifold $\mathcal{M}$ of Euclidean space has been a long-standing challenge in computational geometry. Given a sampling resolution parameter $ h $, state-of-the-art discrete methods yield $ O(h) $ provable approximations. In this paper, we investigate the convergence of such approximations made by Manifold Moving Least-Squares (Manifold-MLS), a method that constructs an approximating manifold $\mathcal{M}^h$ using information from a given point cloud that was developed by Sober \& Levin in 2019. In this paper, we show that provided that $\mathcal{M}\in C^{k}$ and closed (i.e. $\mathcal{M}$ is a compact manifold without boundary) the Riemannian metric of $ \mathcal{M}^h $ approximates the Riemannian metric of $ \mathcal{M}, $. Explicitly, given points $ p_1, p_2 \in \mathcal{M}$ with geodesic distance $ ρ_{\mathcal{M}}(p_1, p_2) $, we show that their corresponding points $ p_1^h, p_2^h \in \mathcal{M}^h$ have a geodesic distance of $ ρ_{\mathcal{M}^h}(p_1^h,p_2^h) = ρ_{\mathcal{M}}(p_1, p_2)(1 + O(h^{k-1})) $ (i.e., the Manifold-MLS is nearly an isometry). We then use this result, as well as the fact that $ \mathcal{M}^h $ can be sampled with any desired resolution, to devise a naive algorithm that yields approximate geodesic distances with a rate of convergence $ O(h^{k-1}) $. We show the potential and the robustness to noise of the proposed method on some numerical simulations.

LGMay 27, 2019
Expression of Fractals Through Neural Network Functions

Nadav Dym, Barak Sober, Ingrid Daubechies

To help understand the underlying mechanisms of neural networks (NNs), several groups have, in recent years, studied the number of linear regions $\ell$ of piecewise linear functions generated by deep neural networks (DNN). In particular, they showed that $\ell$ can grow exponentially with the number of network parameters $p$, a property often used to explain the advantages of DNNs over shallow NNs in approximating complicated functions. Nonetheless, a simple dimension argument shows that DNNs cannot generate all piecewise linear functions with $\ell$ linear regions as soon as $\ell > p$. It is thus natural to seek to characterize specific families of functions with $\ell$ linear regions that can be constructed by DNNs. Iterated Function Systems (IFS) generate sequences of piecewise linear functions $F_k$ with a number of linear regions exponential in $k$. We show that, under mild assumptions, $F_k$ can be generated by a NN using only $\mathcal{O}(k)$ parameters. IFS are used extensively to generate, at low computational cost, natural-looking landscape textures in artificial images. They have also been proposed for compression of natural images, albeit with less commercial success. The surprisingly good performance of this fractal-based compression suggests that our visual system may lock in, to some extent, on self-similarities in images. The combination of this phenomenon with the capacity, demonstrated here, of DNNs to efficiently approximate IFS may contribute to the success of DNNs, particularly striking for image processing tasks, as well as suggest new algorithms for representing self similarities in images based on the DNN mechanism.

MLNov 2, 2017
Approximation of Functions over Manifolds: A Moving Least-Squares Approach

Barak Sober, Yariv Aizenbud, David Levin

We present an algorithm for approximating a function defined over a $d$-dimensional manifold utilizing only noisy function values at locations sampled from the manifold with noise. To produce the approximation we do not require any knowledge regarding the manifold other than its dimension $d$. We use the Manifold Moving Least-Squares approach of (Sober and Levin 2016) to reconstruct the atlas of charts and the approximation is built on-top of those charts. The resulting approximant is shown to be a function defined over a neighborhood of a manifold, approximating the originally sampled manifold. In other words, given a new point, located near the manifold, the approximation can be evaluated directly on that point. We prove that our construction yields a smooth function, and in case of noiseless samples the approximation order is $\mathcal{O}(h^{m+1})$, where $h$ is a local density of sample parameter (i.e., the fill distance) and $m$ is the degree of a local polynomial approximation, used in our algorithm. In addition, the proposed algorithm has linear time complexity with respect to the ambient-space's dimension. Thus, we are able to avoid the computational complexity, commonly encountered in high dimensional approximations, without having to perform non-linear dimension reduction, which inevitably introduces distortions to the geometry of the data. Additionaly, we show numerical experiments that the proposed approach compares favorably to statistical approaches for regression over manifolds and show its potential.

GRJun 22, 2016
Manifold Approximation by Moving Least-Squares Projection (MMLS)

Barak Sober, David Levin

In order to avoid the curse of dimensionality, frequently encountered in Big Data analysis, there was a vast development in the field of linear and nonlinear dimension reduction techniques in recent years. These techniques (sometimes referred to as manifold learning) assume that the scattered input data is lying on a lower dimensional manifold, thus the high dimensionality problem can be overcome by learning the lower dimensionality behavior. However, in real life applications, data is often very noisy. In this work, we propose a method to approximate $\mathcal{M}$ a $d$-dimensional $C^{m+1}$ smooth submanifold of $\mathbb{R}^n$ ($d \ll n$) based upon noisy scattered data points (i.e., a data cloud). We assume that the data points are located "near" the lower dimensional manifold and suggest a non-linear moving least-squares projection on an approximating $d$-dimensional manifold. Under some mild assumptions, the resulting approximant is shown to be infinitely smooth and of high approximation order (i.e., $O(h^{m+1})$, where $h$ is the fill distance and $m$ is the degree of the local polynomial approximation). The method presented here assumes no analytic knowledge of the approximated manifold and the approximation algorithm is linear in the large dimension $n$. Furthermore, the approximating manifold can serve as a framework to perform operations directly on the high dimensional data in a computationally efficient manner. This way, the preparatory step of dimension reduction, which induces distortions to the data, can be avoided altogether.

GRFeb 23, 2016
Computer Aided Restoration of Handwritten Character Strokes

Barak Sober, David Levin

This work suggests a new variational approach to the task of computer aided restoration of incomplete characters, residing in a highly noisy document. We model character strokes as the movement of a pen with a varying radius. Following this model, a cubic spline representation is being utilized to perform gradient descent steps, while maintaining interpolation at some initial (manually sampled) points. The proposed algorithm was utilized in the process of restoring approximately 1000 ancient Hebrew characters (dating to ca. 8th-7th century BCE), some of which are presented herein and show that the algorithm yields plausible results when applied on deteriorated documents.