Rustem Takhanov

LG
h-index8
24papers
2,269citations
Novelty43%
AI Score52

24 Papers

82.4ITJun 2
Classification of independent sets in signed Johnson graphs and applications to kissing arrangements

Rustem Takhanov, Stanislav Yun

Johnson graph are a family of graphs that play an important role in the theory of constant-weight codes, extremal combinatorics, and combinatorial geometry. We study signed analogues of classical Johnson graphs, denoted by $J_\pm(n,k)$, whose vertices are vectors of the form $\pm e_{i_1}\pm\cdots\pm e_{i_k}$, where two vertices are adjacent whenever their dot product equals $k-1$. We are particularly interested in maximum independent sets in the case $k=4$. An example of such an independent set in $J_\pm(n,4)$, which we call \emph{classical}, is obtained by lifting an arbitrary optimal $(n,4,4)$-code. Such independent sets naturally define kissing arrangements in ${\mathbb R}^n$. We develop an algorithm that is practical for computing all maximum independent sets in $J_\pm(n,4)$ up to signed permutations for $n\le 12$, $n\ne 11$. In addition to obtaining complete lists, we provide structural characterizations of all types of maximum independent sets in these dimensions, excluding $n=5$ and $n=11$. Our most striking results concern the case $n=12$. We identify $1579$ non-isomorphic maximum independent sets in $J_\pm(12,4)$, all corresponding to non-isometric kissing arrangements of size $840$ in ${\mathbb R}^{12}$. Structurally, $1575$ of these independent sets arise from three different constructions, the rest are liftings of one of four $(12,4,4)$-codes. To our knowledge, this is the first dimension in which such a large diversity of potentially optimal kissing arrangements has been observed. Beyond this finite range, we prove that for $n\equiv 2$ or $4 \pmod 6$, every maximum independent set arises from a Steiner quadruple system. We also obtain a characterization of the so-called \emph{nontrivially self-compatible} codes, namely optimal $(n,4,4)$-codes from which non-classical maximum independent sets can be constructed.

28.0LGMay 25
Conditional KRR: Injecting Unpenalized Features into Kernel Methods with Applications to Kernel Thresholding

Rustem Takhanov, Zhenisbek Assylbekov

Conditionally positive definite (CPD) kernels are defined with respect to a function class $\mathcal{F}$. It is well known that such a kernel $K$ is associated with its native space (defined analogously to an RKHS), which in turn gives rise to a learning method -- called conditional kernel ridge regression (conditional KRR) due to its analogy with KRR -- where the estimated regression function is penalized by the square of its native space norm. This method is of interest because it can be viewed as classical linear regression, with features specified by $\mathcal{F}$, followed by the application of standard KRR to the residual (unexplained) component of the target variable. Methods of this type have recently attracted increasing attention. We study the statistical properties of this method by reducing its behavior to that of KRR with another fixed kernel, called the residual kernel. Our main theoretical result shows that such a reduction is indeed possible, at the cost of an additional term in the expected test risk, bounded by $\mathcal{O}(1/\sqrt{N})$, where $N$ is the sample size and the hidden constant depends on the class $\mathcal{F}$ and the input distribution. This reduction enables us to analyze conditional KRR in the case where $K$ is positive definite and $\mathcal{F}$ is given by the first $k$ principal eigenfunctions in the Mercer decomposition of $K$. We also consider the setting where $\mathcal{F}$ consists of $k$ random features from a random feature representation of $K$. It turns out that these two settings are closely related. Both our theoretical analysis and experiments confirm that conditional KRR outperforms standard KRR in these cases whenever the $\mathcal{F}$-component of the regression function is more pronounced than the residual part.

LGMay 1, 2022
On the speed of uniform convergence in Mercer's theorem

Rustem Takhanov

The classical Mercer's theorem claims that a continuous positive definite kernel $K({\mathbf x}, {\mathbf y})$ on a compact set can be represented as $\sum_{i=1}^\infty λ_iφ_i({\mathbf x})φ_i({\mathbf y})$ where $\{(λ_i,φ_i)\}$ are eigenvalue-eigenvector pairs of the corresponding integral operator. This infinite representation is known to converge uniformly to the kernel $K$. We estimate the speed of this convergence in terms of the decay rate of eigenvalues and demonstrate that for $2m$ times differentiable kernels the first $N$ terms of the series approximate $K$ as $\mathcal{O}\big((\sum_{i=N+1}^\inftyλ_i)^{\frac{m}{m+n}}\big)$ or $\mathcal{O}\big((\sum_{i=N+1}^\inftyλ^2_i)^{\frac{m}{2m+n}}\big)$. Finally, we demonstrate some applications of our results to a spectral charaterization of integral operators with continuous roots and other powers.

LGOct 19, 2023
Gradient Descent Fails to Learn High-frequency Functions and Modular Arithmetic

Rustem Takhanov, Maxat Tezekbayev, Artur Pak et al.

Classes of target functions containing a large number of approximately orthogonal elements are known to be hard to learn by the Statistical Query algorithms. Recently this classical fact re-emerged in a theory of gradient-based optimization of neural networks. In the novel framework, the hardness of a class is usually quantified by the variance of the gradient with respect to a random choice of a target function. A set of functions of the form $x\to ax \bmod p$, where $a$ is taken from ${\mathbb Z}_p$, has attracted some attention from deep learning theorists and cryptographers recently. This class can be understood as a subset of $p$-periodic functions on ${\mathbb Z}$ and is tightly connected with a class of high-frequency periodic functions on the real line. We present a mathematical analysis of limitations and challenges associated with using gradient-based learning techniques to train a high-frequency periodic function or modular multiplication from examples. We highlight that the variance of the gradient is negligibly small in both cases when either a frequency or the prime base $p$ is large. This in turn prevents such a learning algorithm from being successful.

LGJun 25, 2023
Autoencoders for a manifold learning problem with a Jacobian rank constraint

Rustem Takhanov, Y. Sultan Abylkairov, Maxat Tezekbayev

We formulate the manifold learning problem as the problem of finding an operator that maps any point to a close neighbor that lies on a ``hidden'' $k$-dimensional manifold. We call this operator the correcting function. Under this formulation, autoencoders can be viewed as a tool to approximate the correcting function. Given an autoencoder whose Jacobian has rank $k$, we deduce from the classical Constant Rank Theorem that its range has a structure of a $k$-dimensional manifold. A $k$-dimensionality of the range can be forced by the architecture of an autoencoder (by fixing the dimension of the code space), or alternatively, by an additional constraint that the rank of the autoencoder mapping is not greater than $k$. This constraint is included in the objective function as a new term, namely a squared Ky-Fan $k$-antinorm of the Jacobian function. We claim that this constraint is a factor that effectively reduces the dimension of the range of an autoencoder, additionally to the reduction defined by the architecture. We also add a new curvature term into the objective. To conclude, we experimentally compare our approach with the CAE+H method on synthetic and real-world datasets.

LGOct 2, 2023
Intractability of Learning the Discrete Logarithm with Gradient-Based Methods

Rustem Takhanov, Maxat Tezekbayev, Artur Pak et al.

The discrete logarithm problem is a fundamental challenge in number theory with significant implications for cryptographic protocols. In this paper, we investigate the limitations of gradient-based methods for learning the parity bit of the discrete logarithm in finite cyclic groups of prime order. Our main result, supported by theoretical analysis and empirical verification, reveals the concentration of the gradient of the loss function around a fixed point, independent of the logarithm's base used. This concentration property leads to a restricted ability to learn the parity bit efficiently using gradient-based methods, irrespective of the complexity of the network architecture being trained. Our proof relies on Boas-Bellman inequality in inner product spaces and it involves establishing approximate orthogonality of discrete logarithm's parity bit functions through the spectral norm of certain matrices. Empirical experiments using a neural network-based approach further verify the limitations of gradient-based learning, demonstrating the decreasing success rate in predicting the parity bit as the group order increases.

LGJan 22
On the Intrinsic Dimensions of Data in Kernel Learning

Rustem Takhanov

The manifold hypothesis suggests that the generalization performance of machine learning methods improves significantly when the intrinsic dimension of the input distribution's support is low. In the context of KRR, we investigate two alternative notions of intrinsic dimension. The first, denoted $d_ρ$, is the upper Minkowski dimension defined with respect to the canonical metric induced by a kernel function $K$ on a domain $Ω$. The second, denoted $d_K$, is the effective dimension, derived from the decay rate of Kolmogorov $n$-widths associated with $K$ on $Ω$. Given a probability measure $μ$ on $Ω$, we analyze the relationship between these $n$-widths and eigenvalues of the integral operator $φ\to \int_ΩK(\cdot,x)φ(x)dμ(x)$. We show that, for a fixed domain $Ω$, the Kolmogorov $n$-widths characterize the worst-case eigenvalue decay across all probability measures $μ$ supported on $Ω$. These eigenvalues are central to understanding the generalization behavior of constrained KRR, enabling us to derive an excess error bound of order $O(n^{-\frac{2+d_K}{2+2d_K} + ε})$ for any $ε> 0$, when the training set size $n$ is large. We also propose an algorithm that estimates upper bounds on the $n$-widths using only a finite sample from $μ$. For distributions close to uniform, we prove that $ε$-accurate upper bounds on all $n$-widths can be computed with high probability using at most $O\left(ε^{-d_ρ}\log\frac{1}ε\right)$ samples, with fewer required for small $n$. Finally, we compute the effective dimension $d_K$ for various fractal sets and present additional numerical experiments. Our results show that, for kernels such as the Laplace kernel, the effective dimension $d_K$ can be significantly smaller than the Minkowski dimension $d_ρ$, even though $d_K = d_ρ$ provably holds on regular domains.

MLApr 9, 2022
Non-asymptotic spectral bounds on the $\varepsilon$-entropy of kernel classes

Rustem Takhanov

Let $K: \boldsymbolΩ\times \boldsymbolΩ$ be a continuous Mercer kernel defined on a compact subset of ${\mathbb R}^n$ and $\mathcal{H}_K$ be the reproducing kernel Hilbert space (RKHS) associated with $K$. Given a finite measure $ν$ on $\boldsymbolΩ$, we investigate upper and lower bounds on the $\varepsilon$-entropy of the unit ball of $\mathcal{H}_K$ in the space $L_p(ν)$. This topic is an important direction in the modern statistical theory of kernel-based methods. We prove sharp upper and lower bounds for $p\in [1,+\infty]$. For $p\in [1,2]$, the upper bounds are determined solely by the eigenvalue behaviour of the corresponding integral operator $φ\to \int_{\boldsymbolΩ} K(\cdot,{\mathbf y})φ({\mathbf y})dν({\mathbf y})$. In constrast, for $p>2$, the bounds additionally depend on the convergence rate of the truncated Mercer series to the kernel $K$ in the $L_p(ν)$-norm. We discuss a number of consequences of our bounds and show that they are substantially tighter than previous bounds for general kernels. Furthermore, for specific cases, such as zonal kernels and the Gaussian kernel on a box, our bounds are asymptotically tight as $\varepsilon\to +0$.

LGApr 26, 2024
Multi-layer random features and the approximation power of neural networks

Rustem Takhanov

A neural architecture with randomly initialized weights, in the infinite width limit, is equivalent to a Gaussian Random Field whose covariance function is the so-called Neural Network Gaussian Process kernel (NNGP). We prove that a reproducing kernel Hilbert space (RKHS) defined by the NNGP contains only functions that can be approximated by the architecture. To achieve a certain approximation error the required number of neurons in each layer is defined by the RKHS norm of the target function. Moreover, the approximation can be constructed from a supervised dataset by a random multi-layer representation of an input vector, together with training of the last layer's weights. For a 2-layer NN and a domain equal to an $n-1$-dimensional sphere in ${\mathbb R}^n$, we compare the number of neurons required by Barron's theorem and by the multi-layer features construction. We show that if eigenvalues of the integral operator of the NNGP decay slower than $k^{-n-\frac{2}{3}}$ where $k$ is an order of an eigenvalue, then our theorem guarantees a more succinct neural network approximation than Barron's theorem. We also make some computational experiments to verify our theoretical findings. Our experiments show that realistic neural networks easily learn target functions even when both theorems do not give any guarantees.

MLJan 4
Deep Linear Discriminant Analysis Revisited

Maxat Tezekbayev, Rustem Takhanov, Arman Bolatov et al.

We show that for unconstrained Deep Linear Discriminant Analysis (LDA) classifiers, maximum-likelihood training admits pathological solutions in which class means drift together, covariances collapse, and the learned representation becomes almost non-discriminative. Conversely, cross-entropy training yields excellent accuracy but decouples the head from the underlying generative model, leading to highly inconsistent parameter estimates. To reconcile generative structure with discriminative performance, we introduce the \emph{Discriminative Negative Log-Likelihood} (DNLL) loss, which augments the LDA log-likelihood with a simple penalty on the mixture density. DNLL can be interpreted as standard LDA NLL plus a term that explicitly discourages regions where several classes are simultaneously likely. Deep LDA trained with DNLL produces clean, well-separated latent spaces, matches the test accuracy of softmax classifiers on synthetic data and standard image benchmarks, and yields substantially better calibrated predictive probabilities, restoring a coherent probabilistic interpretation to deep discriminant models.

LGMay 28, 2025
The informativeness of the gradient revisited

Rustem Takhanov

In the past decade gradient-based deep learning has revolutionized several applications. However, this rapid advancement has highlighted the need for a deeper theoretical understanding of its limitations. Research has shown that, in many practical learning tasks, the information contained in the gradient is so minimal that gradient-based methods require an exceedingly large number of iterations to achieve success. The informativeness of the gradient is typically measured by its variance with respect to the random selection of a target function from a hypothesis class. We use this framework and give a general bound on the variance in terms of a parameter related to the pairwise independence of the target function class and the collision entropy of the input distribution. Our bound scales as $ \tilde{\mathcal{O}}(\varepsilon+e^{-\frac{1}{2}\mathcal{E}_c}) $, where $ \tilde{\mathcal{O}} $ hides factors related to the regularity of the learning model and the loss function, $ \varepsilon $ measures the pairwise independence of the target function class and $\mathcal{E}_c$ is the collision entropy of the input distribution. To demonstrate the practical utility of our bound, we apply it to the class of Learning with Errors (LWE) mappings and high-frequency functions. In addition to the theoretical analysis, we present experiments to understand better the nature of recent deep learning-based attacks on LWE.

AIMay 27, 2023
Computing a partition function of a generalized pattern-based energy over a semiring

Rustem Takhanov

Valued constraint satisfaction problems with ordered variables (VCSPO) are a special case of Valued CSPs in which variables are totally ordered and soft constraints are imposed on tuples of variables that do not violate the order. We study a restriction of VCSPO, in which soft constraints are imposed on a segment of adjacent variables and a constraint language $Γ$ consists of $\{0,1\}$-valued characteristic functions of predicates. This kind of potentials generalizes the so-called pattern-based potentials, which were applied in many tasks of structured prediction. For a constraint language $Γ$ we introduce a closure operator, $ \overline{Γ^{\cap}}\supseteq Γ$, and give examples of constraint languages for which $|\overline{Γ^{\cap}}|$ is small. If all predicates in $Γ$ are cartesian products, we show that the minimization of a generalized pattern-based potential (or, the computation of its partition function) can be made in ${\mathcal O}(|V|\cdot |D|^2 \cdot |\overline{Γ^{\cap}}|^2 )$ time, where $V$ is a set of variables, $D$ is a domain set. If, additionally, only non-positive weights of constraints are allowed, the complexity of the minimization task drops to ${\mathcal O}(|V|\cdot |\overline{Γ^{\cap}}| \cdot |D| \cdot \max_{ρ\in Γ}\|ρ\|^2 )$ where $\|ρ\|$ is the arity of $ρ\in Γ$. For a general language $Γ$ and non-positive weights, the minimization task can be carried out in ${\mathcal O}(|V|\cdot |\overline{Γ^{\cap}}|^2)$ time. We argue that in many natural cases $\overline{Γ^{\cap}}$ is of moderate size, though in the worst case $|\overline{Γ^{\cap}}|$ can blow up and depend exponentially on $\max_{ρ\in Γ}\|ρ\|$.

LGJun 27, 2021
How many moments does MMD compare?

Rustem Takhanov

We present a new way of study of Mercer kernels, by corresponding to a special kernel $K$ a pseudo-differential operator $p({\mathbf x}, D)$ such that $\mathcal{F} p({\mathbf x}, D)^†p({\mathbf x}, D) \mathcal{F}^{-1}$ acts on smooth functions in the same way as an integral operator associated with $K$ (where $\mathcal{F}$ is the Fourier transform). We show that kernels defined by pseudo-differential operators are able to approximate uniformly any continuous Mercer kernel on a compact set. The symbol $p({\mathbf x}, {\mathbf y})$ encapsulates a lot of useful information about the structure of the Maximum Mean Discrepancy distance defined by the kernel $K$. We approximate $p({\mathbf x}, {\mathbf y})$ with the sum of the first $r$ terms of the Singular Value Decomposition of $p$, denoted by $p_r({\mathbf x}, {\mathbf y})$. If ordered singular values of the integral operator associated with $p({\mathbf x}, {\mathbf y})$ die down rapidly, the MMD distance defined by the new symbol $p_r$ differs from the initial one only slightly. Moreover, the new MMD distance can be interpreted as an aggregated result of comparing $r$ local moments of two probability distributions. The latter results holds under the condition that right singular vectors of the integral operator associated with $p$ are uniformly bounded. But even if this is not satisfied we can still hold that the Hilbert-Schmidt distance between $p$ and $p_r$ vanishes. Thus, we report an interesting phenomenon: the MMD distance measures the difference of two probability distributions with respect to a certain number of local moments, $r^\ast$, and this number $r^\ast$ depends on the speed with which singular values of $p$ die down.

STMar 12, 2019
Reducing the dimensionality of data using tempered distributions

Rustem Takhanov

We reformulate unsupervised dimension reduction problem (UDR) in the language of tempered distributions, i.e. as a problem of approximating an empirical probability density function by another tempered distribution, supported in a $k$-dimensional subspace. We show that this task is connected with another classical problem of data science -- the sufficient dimension reduction problem (SDR). In fact, an algorithm for the first problem induces an algorithm for the second and vice versa. In order to reduce an optimization problem over distributions to an optimization problem over ordinary functions we introduce a nonnegative penalty function that ``forces'' the support of the model distribution to be $k$-dimensional. Then we present an algorithm for the minimization of the penalized objective, based on the infinite-dimensional low-rank optimization, which we call the alternating scheme. Also, we design an efficient approximate algorithm for a special case of the problem, where the distance between the empirical distribution and the model distribution is measured by Maximum Mean Discrepancy defined by a Mercer kernel of a certain type. We test our methods on four examples (three UDR and one SDR) using synthetic data and standard datasets.

MLFeb 26, 2019
Context Vectors are Reflections of Word Vectors in Half the Dimensions

Zhenisbek Assylbekov, Rustem Takhanov

This paper takes a step towards theoretical analysis of the relationship between word embeddings and context embeddings in models such as word2vec. We start from basic probabilistic assumptions on the nature of word vectors, context vectors, and text generation. These assumptions are well supported either empirically or theoretically by the existing literature. Next, we show that under these assumptions the widely-used word-word PMI matrix is approximately a random symmetric Gaussian ensemble. This, in turn, implies that context vectors are reflections of word vectors in approximately half the dimensions. As a direct application of our result, we suggest a theoretically grounded way of tying weights in the SGNS model.

NEFeb 8, 2019
Fourier Neural Networks: A Comparative Study

Abylay Zhumekenov, Malika Uteuliyeva, Olzhas Kabdolov et al.

We review neural network architectures which were motivated by Fourier series and integrals and which are referred to as Fourier neural networks. These networks are empirically evaluated in synthetic and real-world tasks. Neither of them outperforms the standard neural network with sigmoid activation function in the real-world tasks. All neural networks, both Fourier and the standard one, empirically demonstrate lower approximation error than the truncated Fourier series when it comes to an approximation of a known function of multiple variables.

LGAug 19, 2018
Fourier analysis perspective for sufficient dimension reduction problem

Rustem Takhanov

A theory of sufficient dimension reduction (SDR) is developed from an optimizational perspective. In our formulation of the problem, instead of dealing with raw data, we assume that our ground truth includes a mapping ${\mathbf f}: {\mathbb R}^n\rightarrow {\mathbb R}^m$ and a probability distribution function $p$ over ${\mathbb R}^n$, both given analytically. We formulate SDR as a problem of finding a function ${\mathbf g}: {\mathbb R}^k\rightarrow {\mathbb R}^m$ and a matrix $P\in {\mathbb R}^{k\times n}$ such that ${\mathbb E}_{{\mathbf x}\sim p({\mathbf x})} \left|{\mathbf f}({\mathbf x}) - {\mathbf g}(P{\mathbf x})\right|^2$ is minimal. It turns out that the latter problem allows a reformulation in the dual space, i.e. instead of searching for ${\mathbf g}(P{\mathbf x})$ we suggest searching for its Fourier transform. First, we characterize all tempered distributions that can serve as the Fourier transform of such functions. The reformulation in the dual space can be interpreted as a problem of finding a $k$-dimensional linear subspace $S$ and a tempered distribution ${\mathbf t}$ supported in $S$ such that ${\mathbf t}$ is "close" in a certain sense to the Fourier transform of ${\mathbf f}$. Instead of optimizing over generalized functions with a $k$-dimensional support, we suggest minimizing over ordinary functions but with an additional term $R$ that penalizes a strong distortion of the support from any $k$-dimensional linear subspace. For a specific case of $R$, we develop an algorithm that can be formulated for functions given in the initial form as well as for their Fourier transforms. Eventually, we report results of numerical experiments with a discretized version of the latter algorithm.

CLFeb 23, 2018
Reusing Weights in Subword-aware Neural Language Models

Zhenisbek Assylbekov, Rustem Takhanov

We propose several ways of reusing subword embeddings and other weights in subword-aware neural language models. The proposed techniques do not benefit a competitive character-aware model, but some of them improve the performance of syllable- and morpheme-aware models while showing significant reductions in model sizes. We discover a simple hands-on principle: in a multi-layer input embedding model, layers should be tied consecutively bottom-up if reused at output. Our best morpheme-aware model with properly reused weights beats the competitive word-level model by a large margin across multiple languages and has 20%-87% fewer parameters.

CLSep 2, 2017
Patterns versus Characters in Subword-aware Neural Language Modeling

Rustem Takhanov, Zhenisbek Assylbekov

Words in some natural languages can have a composite structure. Elements of this structure include the root (that could also be composite), prefixes and suffixes with which various nuances and relations to other words can be expressed. Thus, in order to build a proper word representation one must take into account its internal structure. From a corpus of texts we extract a set of frequent subwords and from the latter set we select patterns, i.e. subwords which encapsulate information on character $n$-gram regularities. The selection is made using the pattern-based Conditional Random Field model with $l_1$ regularization. Further, for every word we construct a new sequence over an alphabet of patterns. The new alphabet's symbols confine a local statistical context stronger than the characters, therefore they allow better representations in ${\mathbb{R}}^n$ and are better building blocks for word representation. In the task of subword-aware language modeling, pattern-based models outperform character-based analogues by 2-20 perplexity points. Also, a recurrent neural network in which a word is represented as a sum of embeddings of its patterns is on par with a competitive and significantly more sophisticated character-based convolutional architecture.

CLJul 20, 2017
Syllable-aware Neural Language Models: A Failure to Beat Character-aware Ones

Zhenisbek Assylbekov, Rustem Takhanov, Bagdat Myrzakhmetov et al.

Syllabification does not seem to improve word-level RNN language modeling quality when compared to character-based segmentation. However, our best syllable-aware language model, achieving performance comparable to the competitive character-aware model, has 18%-33% fewer parameters and is trained 1.2-2.2 times faster.

FLApr 22, 2014
Combining pattern-based CRFs and weighted context-free grammars

Rustem Takhanov, Vladimir Kolmogorov

We consider two models for the sequence labeling (tagging) problem. The first one is a {\em Pattern-Based Conditional Random Field }(\PB), in which the energy of a string (chain labeling) $x=x_1\ldots x_n\in D^n$ is a sum of terms over intervals $[i,j]$ where each term is non-zero only if the substring $x_i\ldots x_j$ equals a prespecified word $w\in Λ$. The second model is a {\em Weighted Context-Free Grammar }(\WCFG) frequently used for natural language processing. \PB and \WCFG encode local and non-local interactions respectively, and thus can be viewed as complementary. We propose a {\em Grammatical Pattern-Based CRF model }(\GPB) that combines the two in a natural way. We argue that it has certain advantages over existing approaches such as the {\em Hybrid model} of Bened{í} and Sanchez that combines {\em $\mbox{$N$-grams}$} and \WCFGs. The focus of this paper is to analyze the complexity of inference tasks in a \GPB such as computing MAP. We present a polynomial-time algorithm for general \GPBs and a faster version for a special case that we call {\em Interaction Grammars}.

CVApr 14, 2014
Proceedings of The 38th Annual Workshop of the Austrian Association for Pattern Recognition (ÖAGM), 2014

Vladimir Kolmogorov, Christoph Lampert, Emilie Morvant et al.

The 38th Annual Workshop of the Austrian Association for Pattern Recognition (ÖAGM) will be held at IST Austria, on May 22-23, 2014. The workshop provides a platform for researchers and industry to discuss traditional and new areas of computer vision. This year the main topic is: Pattern Recognition: interdisciplinary challenges and opportunities.

LGOct 1, 2012
Inference algorithms for pattern-based CRFs on sequence data

Rustem Takhanov, Vladimir Kolmogorov

We consider Conditional Random Fields (CRFs) with pattern-based potentials defined on a chain. In this model the energy of a string (labeling) $x_1...x_n$ is the sum of terms over intervals $[i,j]$ where each term is non-zero only if the substring $x_i...x_j$ equals a prespecified pattern $α$. Such CRFs can be naturally applied to many sequence tagging problems. We present efficient algorithms for the three standard inference tasks in a CRF, namely computing (i) the partition function, (ii) marginals, and (iii) computing the MAP. Their complexities are respectively $O(n L)$, $O(n L \ell_{max})$ and $O(n L \min\{|D|,\log (\ell_{max}+1)\})$ where $L$ is the combined length of input patterns, $\ell_{max}$ is the maximum length of a pattern, and $D$ is the input alphabet. This improves on the previous algorithms of (Ye et al., 2009) whose complexities are respectively $O(n L |D|)$, $O(n |Γ| L^2 \ell_{max}^2)$ and $O(n L |D|)$, where $|Γ|$ is the number of input patterns. In addition, we give an efficient algorithm for sampling. Finally, we consider the case of non-positive weights. (Komodakis & Paragios, 2009) gave an $O(n L)$ algorithm for computing the MAP. We present a modification that has the same worst-case complexity but can beat it in the best case.