Jarek Duda

LG
h-index2
24papers
136citations
Novelty47%
AI Score43

24 Papers

LGJun 13, 2022
Predicting conditional probability distributions of redshifts of Active Galactic Nuclei using Hierarchical Correlation Reconstruction

Jarek Duda

While there is a general focus on prediction of values, real data often only allows to predict conditional probability distributions, with capabilities bounded by conditional entropy $H(Y|X)$. If additionally estimating uncertainty, we can treat a predicted value as the center of Gaussian of Laplace distribution - idealization which can be far from complex conditional distributions of real data. This article applies Hierarchical Correlation Reconstruction (HCR) approach to inexpensively predict quite complex conditional probability distributions (e.g. multimodal): by independent MSE estimation of multiple moment-like parameters, which allow to reconstruct the conditional distribution. Using linear regression for this purpose, we get interpretable models: with coefficients describing contributions of features to conditional moments. This article extends on the original approach especially by using Canonical Correlation Analysis (CCA) for feature optimization and l1 "lasso" regularization, focusing on practical problem of prediction of redshift of Active Galactic Nuclei (AGN) based on Fourth Fermi-LAT Data Release 2 (4LAC) dataset.

SPApr 24, 2023
Time delay multi-feature correlation analysis to extract subtle dependencies from EEG signals

Jarek Duda

Electroencephalography (EEG) signals are resultants of extremely complex brain activity. Some details of this hidden dynamics might be accessible through e.g. joint distributions $ρ_{Δt}$ of signals of pairs of electrodes shifted by various time delays (lag $Δt$). A standard approach is monitoring a single evaluation of such joint distributions, like Pearson correlation (or mutual information), which turns out relatively uninteresting - as expected, there is usually a small peak for zero delay and nearly symmetric drop with delay. In contrast, such a complex signal might be composed of multiple types of statistical dependencies - this article proposes approach to automatically decompose and extract them. Specifically, we model such joint distributions as polynomials, estimated separately for all considered lag dependencies, then with PCA dimensionality reduction we find the dominant joint density distortion directions $f_v$. This way we get a few lag dependent features $a_i(Δt)$ describing separate dominating statistical dependencies of known contributions: $ρ_{Δt}(y,z)\approx \sum_{i=1}^r a_i(Δt)\, f_{v_i}(y,z)$. Such features complement Pearson correlation, extracting hidden more complex behavior, e.g. with asymmetry which might be related with direction of information transfer, extrema suggesting characteristic delays, or oscillatory behavior suggesting some periodicity. There is also discussed extension of Granger causality to such multi-feature joint density analysis, suggesting e.g. two separate causality waves. While this early article is initial fundamental research, in future it might help e.g. with understanding of cortex hidden dynamics, diagnosis of pathologies like epilepsy, determination of precise electrode position, or building brain-computer interface.

LGApr 18, 2022
Fast optimization of common basis for matrix set through Common Singular Value Decomposition

Jarek Duda

SVD (singular value decomposition) is one of the basic tools of machine learning, allowing to optimize basis for a given matrix. However, sometimes we have a set of matrices $\{A_k\}_k$ instead, and would like to optimize a single common basis for them: find orthogonal matrices $U$, $V$, such that $\{U^T A_k V\}$ set of matrices is somehow simpler. For example DCT-II is orthonormal basis of functions commonly used in image/video compression - as discussed here, this kind of basis can be quickly automatically optimized for a given dataset. While also discussed gradient descent optimization might be computationally costly, there is proposed CSVD (common SVD): fast general approach based on SVD. Specifically, we choose $U$ as built of eigenvectors of $\sum_i (w_k)^q (A_k A_k^T)^p$ and $V$ of $\sum_k (w_k)^q (A_k^T A_k)^p$, where $w_k$ are their weights, $p,q>0$ are some chosen powers e.g. 1/2, optionally with normalization e.g. $A \to A - rc^T$ where $r_i=\sum_j A_{ij}, c_j =\sum_i A_{ij}$.

MEApr 6, 2023
Adaptive Student's t-distribution with method of moments moving estimator for nonstationary time series

Jarek Duda

The real life time series are usually nonstationary, bringing a difficult question of model adaptation. Classical approaches like ARMA-ARCH assume arbitrary type of dependence. To avoid their bias, we will focus on recently proposed agnostic philosophy of moving estimator: in time $t$ finding parameters optimizing e.g. $F_t=\sum_{τ<t} (1-η)^{t-τ} \ln(ρ_θ(x_τ))$ moving log-likelihood, evolving in time. It allows for example to estimate parameters using inexpensive exponential moving averages (EMA), like absolute central moments $m_p=E[|x-μ|^p]$ evolving for one or multiple powers $p\in\mathbb{R}^+$ using $m_{p,t+1} = m_{p,t} + η(|x_t-μ_t|^p-m_{p,t})$. Application of such general adaptive methods of moments will be presented on Student's t-distribution, popular especially in economical applications, here applied to log-returns of DJIA companies. While standard ARMA-ARCH approaches provide evolution of $μ$ and $σ$, here we also get evolution of $ν$ describing $ρ(x)\sim |x|^{-ν-1}$ tail shape, probability of extreme events - which might turn out catastrophic, destabilizing the market.

BMJul 21, 2022
Low cost prediction of probability distributions of molecular properties for early virtual screening

Jarek Duda, Sabina Podlewska

While there is a general focus on predictions of values, mathematically more appropriate is prediction of probability distributions: with additional possibilities like prediction of uncertainty, higher moments and quantiles. For the purpose of the computer-aided drug design field, this article applies Hierarchical Correlation Reconstruction approach, previously applied in the analysis of demographic, financial and astronomical data. Instead of a single linear regression to predict values, it uses multiple linear regressions to independently predict multiple moments, finally combining them into predicted probability distribution, here of several ADMET properties based on substructural fingerprint developed by Klekota\&Roth. Discussed application example is inexpensive selection of a percentage of molecules with properties nearly certain to be in a predicted or chosen range during virtual screening. Such an approach can facilitate the interpretation of the results as the predictions characterized by high rate of uncertainty are automatically detected. In addition, for each of the investigated predictive problems, we detected crucial structural features, which should be carefully considered when optimizing compounds towards particular property. The whole methodology developed in the study constitutes therefore a great support for medicinal chemists, as it enable fast rejection of compounds with the lowest potential of desired physicochemical/ADMET characteristic and guides the compound optimization process.

QMSep 13, 2022
Predicting probability distributions for cancer therapy drug selection optimization

Jarek Duda

Large variability between cell lines brings a difficult optimization problem of drug selection for cancer therapy. Standard approaches use prediction of value for this purpose, corresponding e.g. to expected value of their distribution. This article shows superiority of working on, predicting the entire probability distributions - proposing basic tools for this purpose. We are mostly interested in the best drug in their batch to be tested - proper optimization of their selection for extreme statistics requires knowledge of the entire probability distributions, which for distributions of drug properties among cell lines often turn out binomial, e.g. depending on corresponding gene. Hence for basic prediction mechanism there is proposed mixture of two Gaussians, trying to predict its weight based on additional information.

LGDec 7, 2025
Comparing BFGS and OGR for Second-Order Optimization

Adrian Przybysz, Mikołaj Kołek, Franciszek Sobota et al.

Estimating the Hessian matrix, especially for neural network training, is a challenging problem due to high dimensionality and cost. In this work, we compare the classical Sherman-Morrison update used in the popular BFGS method (Broy-den-Fletcher-Goldfarb-Shanno), which maintains a positive definite Hessian approximation under a convexity assumption, with a novel approach called Online Gradient Regression (OGR). OGR performs regression of gradients against positions using an exponential moving average to estimate second derivatives online, without requiring Hessian inversion. Unlike BFGS, OGR allows estimation of a general (not necessarily positive definite) Hessian and can thus handle non-convex structures. We evaluate both methods across standard test functions and demonstrate that OGR achieves faster convergence and improved loss, particularly in non-convex settings.

LGMay 8, 2024
Biology-inspired joint distribution neurons based on Hierarchical Correlation Reconstruction allowing for multidirectional neural networks

Jarek Duda

Biological neural networks seem qualitatively superior (e.g. in learning, flexibility, robustness) to current artificial like Multi-Layer Perceptron (MLP) or Kolmogorov-Arnold Network (KAN). Simultaneously, in contrast to them: biological have fundamentally multidirectional signal propagation \cite{axon}, also of probability distributions e.g. for uncertainty estimation, and are believed not being able to use standard backpropagation training \cite{backprop}. There are proposed novel artificial neurons based on HCR (Hierarchical Correlation Reconstruction) allowing to remove the above low level differences: with neurons containing local joint distribution model (of its connections), representing joint density on normalized variables as just linear combination of $(f_\mathbf{j})$ orthonormal polynomials: $ρ(\mathbf{x})=\sum_{\mathbf{j}\in B} a_\mathbf{j} f_\mathbf{j}(\mathbf{x})$ for $\mathbf{x} \in [0,1]^d$ and $B\subset \mathbb{N}^d$ some chosen basis. By various index summations of such $(a_\mathbf{j})_{\mathbf{j}\in B}$ tensor as neuron parameters, we get simple formulas for e.g. conditional expected values for propagation in any direction, like $E[x|y,z]$, $E[y|x]$, which degenerate to KAN-like parametrization if restricting to pairwise dependencies. Such HCR network can also propagate probability distributions (also joint) like $ρ(y,z|x)$. It also allows for additional training approaches, like direct $(a_\mathbf{j})$ estimation, through tensor decomposition, or more biologically plausible information bottleneck training: layers directly influencing only neighbors, optimizing content to maximize information about the next layer, and minimizing about the previous to remove noise, extract crucial information.

MLNov 22, 2023
Extracting individual variable information for their decoupling, direct mutual information and multi-feature Granger causality

Jarek Duda

Working with multiple variables they usually contain difficult to control complex dependencies. This article proposes extraction of their individual information, e.g. $\overline{X|Y}$ as random variable containing information from $X$, but with removed information about $Y$, by using $(x,y) \leftrightarrow (\bar{x}=\textrm{CDF}_{X|Y=y}(x),y)$ reversible normalization. One application can be decoupling of individual information of variables: reversibly transform $(X_1,\ldots,X_n)\leftrightarrow(\tilde{X}_1,\ldots \tilde{X}_n)$ together containing the same information, but being independent: $\forall_{i\neq j} \tilde{X}_i\perp \tilde{X}_j, \tilde{X}_i\perp X_j$. It requires detailed models of complex conditional probability distributions - it is generally a difficult task, but here can be done through multiple dependency reducing iterations, using imperfect methods (here HCR: Hierarchical Correlation Reconstruction). It could be also used for direct mutual information - evaluating direct information transfer: without use of intermediate variables. For causality direction there is discussed multi-feature Granger causality, e.g. to trace various types of individual information transfers between such decoupled variables, including propagation time (delay).

LGAug 25, 2025
Linear cost mutual information estimation and independence test of similar performance as HSIC

Jarek Duda, Jagoda Bracha, Adrian Przybysz

Evaluation of statistical dependencies between two data samples is a basic problem of data science/machine learning, and HSIC (Hilbert-Schmidt Information Criterion)~\cite{HSIC} is considered the state-of-art method. However, for size $n$ data sample it requires multiplication of $n\times n$ matrices, what currently needs $\sim O(n^{2.37})$ computational complexity~\cite{mult}, making it impractical for large data samples. We discuss HCR (Hierarchical Correlation Reconstruction) as its linear cost practical alternative, in tests of even higher sensitivity to dependencies, and additionally providing actual joint distribution model for chosen significance level, by description of dependencies through features being mixed moments, starting with correlation and homoscedasticity. Also allowing to approximate mutual information as just sum of squares of such nontrivial mixed moments between two data samples. Such single dependence describing feature is calculated in $O(n)$ linear time. Their number to test varies with dimension $d$ - requiring $O(d^2)$ for pairwise dependencies, $O(d^3)$ if wanting to also consider more subtle triplewise, and so on.

LGJul 16, 2025
Improving KAN with CDF normalization to quantiles

Jakub Strawa, Jarek Duda

Data normalization is crucial in machine learning, usually performed by subtracting the mean and dividing by standard deviation, or by rescaling to a fixed range. In copula theory, popular in finance, there is used normalization to approximately quantiles by transforming x to CDF(x) with estimated CDF (cumulative distribution function) to nearly uniform distribution in [0,1], allowing for simpler representations which are less likely to overfit. It seems nearly unknown in machine learning, therefore, we would like to present some its advantages on example of recently popular Kolmogorov-Arnold Networks (KANs), improving predictions from Legendre-KAN by just switching rescaling to CDF normalization. Additionally, in HCR interpretation, weights of such neurons are mixed moments providing local joint distribution models, allow to propagate also probability distributions, and change propagation direction.

MEMay 20, 2025
Adaptive stable distribution and Hurst exponent by method of moments moving estimator for nonstationary time series

Jarek Duda

Nonstationarity of real-life time series requires model adaptation. In classical approaches like ARMA-ARCH there is assumed some arbitrarily chosen dependence type. To avoid their bias, we will focus on novel more agnostic approach: moving estimator, which estimates parameters separately for every time $t$: optimizing $F_t=\sum_{τ<t} (1-η)^{t-τ} \ln(ρ_θ(x_τ))$ local log-likelihood with exponentially weakening weights of the old values. In practice such moving estimates can be found by EMA (exponential moving average) of some parameters, like $m_p=E[|x-μ|^p]$ absolute central moments, updated by $m_{p,t+1} = m_{p,t} + η(|x_t-μ_t|^p-m_{p,t})$. We will focus here on its applications for alpha-Stable distribution, which also influences Hurst exponent, hence can be used for its adaptive estimation. Its application will be shown on financial data as DJIA time series - beside standard estimation of evolution of center $μ$ and scale parameter $σ$, there is also estimated evolution of $α$ parameter allowing to continuously evaluate market stability - tails having $ρ(x) \sim 1/|x|^{α+1}$ behavior, controlling probability of potentially dangerous extreme events.

IVApr 6, 2020
Exploiting context dependence for image compression with upsampling

Jarek Duda

Image compression with upsampling encodes information to succeedingly increase image resolution, for example by encoding differences in FUIF and JPEG XL. It is useful for progressive decoding, also often can improve compression ratio - both for lossless compression and e.g. DC coefficients of lossy. However, the currently used solutions rather do not exploit context dependence for encoding of such upscaling information. This article discusses simple inexpensive general techniques for this purpose, which allowed to save on average $0.645$ bits/difference (between $0.138$ and $1.489$) for the last upscaling for 48 standard $512\times 512$ grayscale 8 bit images - compared to assumption of fixed Laplace distribution. Using least squares linear regression of context to predict center of Laplace distribution gave on average $0.393$ bits/difference savings. The remaining savings were obtained by additionally predicting width of this Laplace distribution, also using just the least squares linear regression. For RGB images, optimization of color transform alone gave mean $\approx 4.6\%$ size reduction comparing to standard YCrCb if using fixed transform, $\approx 6.3\%$ if optimizing transform individually for each image. Then further mean $\approx 10\%$ reduction was obtained if predicting Laplace parameters based on context. The presented simple inexpensive general methodology can be also used for different types of data like DCT coefficients in lossy image compression.

MLMar 4, 2020
Adaptive exponential power distribution with moving estimator for nonstationary time series

Jarek Duda

While standard estimation assumes that all datapoints are from probability distribution of the same fixed parameters $θ$, we will focus on maximum likelihood (ML) adaptive estimation for nonstationary time series: separately estimating parameters $θ_T$ for each time $T$ based on the earlier values $(x_t)_{t<T}$ using (exponential) moving ML estimator $θ_T=\arg\max_θl_T$ for $l_T=\sum_{t<T} η^{T-t} \ln(ρ_θ(x_t))$ and some $η\in(0,1]$. Computational cost of such moving estimator is generally much higher as we need to optimize log-likelihood multiple times, however, in many cases it can be made inexpensive thanks to dependencies. We focus on such example: $ρ(x)\propto \exp(-|(x-μ)/σ|^κ/κ)$ exponential power distribution (EPD) family, which covers wide range of tail behavior like Gaussian ($κ=2$) or Laplace ($κ=1$) distribution. It is also convenient for such adaptive estimation of scale parameter $σ$ as its standard ML estimation is $σ^κ$ being average $\|x-μ\|^κ$. By just replacing average with exponential moving average: $(σ_{T+1})^κ=η(σ_T)^κ+(1-η)|x_T-μ|^κ$ we can inexpensively make it adaptive. It is tested on daily log-return series for DJIA companies, leading to essentially better log-likelihoods than standard (static) estimation, with optimal $κ$ tails types varying between companies. Presented general alternative estimation philosophy provides tools which might be useful for building better models for analysis of nonstationary time-series.

LGJul 16, 2019
SGD momentum optimizer with step estimation by online parabola model

Jarek Duda

In stochastic gradient descent, especially for neural network training, there are currently dominating first order methods: not modeling local distance to minimum. This information required for optimal step size is provided by second order methods, however, they have many difficulties, starting with full Hessian having square of dimension number of coefficients. This article proposes a minimal step from successful first order momentum method toward second order: online parabola modelling in just a single direction: normalized $\hat{v}$ from momentum method. It is done by estimating linear trend of gradients $\vec{g}=\nabla F(\vecθ)$ in $\hat{v}$ direction: such that $g(\vecθ_\bot+θ\hat{v})\approx λ(θ-p)$ for $θ= \vecθ\cdot \hat{v}$, $g= \vec{g}\cdot \hat{v}$, $\vecθ_\bot=\vecθ-θ\hat{v}$. Using linear regression, $λ$, $p$ are MSE estimated by just updating four averages (of $g$, $θ$, $gθ$, $θ^2$) in the considered direction. Exponential moving averages allow here for inexpensive online estimation, weakening contribution of the old gradients. Controlling sign of curvature $λ$, we can repel from saddles in contrast to attraction in standard Newton method. In the remaining directions: not considered in second order model, we can simultaneously perform e.g. gradient descent. There is also discussed its learning rate approximation as $μ=σ_θ/ σ_g$, allowing e.g. for adaptive SGD - with learning rate separately optimized (2nd order) for each parameter.

IVMay 28, 2019
Parametric context adaptive Laplace distribution for multimedia compression

Jarek Duda

Data compression often subtracts prediction and encodes the difference (residue) e.g. assuming Laplace distribution, for example for images, videos, audio, or numerical data. Its performance is strongly dependent on the proper choice of width (scale parameter) of this parametric distribution, can be improved if optimizing it based on local situation like context. For example in popular LOCO-I \cite{loco} (JPEG-LS) lossless image compressor there is used 3 dimensional context quantized into 365 discrete possibilities treated independently. This article discusses inexpensive approaches for exploiting their dependencies with autoregressive ARCH-like context dependent models for parameters of parametric distribution for residue, also evolving in time for adaptive case. For example tested such 4 or 11 parameter models turned out to provide similar performance as 365 parameter LOCO-I model for 48 tested images. Beside smaller headers, such reduction of number of parameters can lead to better generalization. In contrast to context quantization approaches, parameterized models also allow to directly use higher dimensional contexts, for example using information from all 3 color channels, further pixels, some additional region classifiers, or from interleaving multi-scale scanning - for which there is proposed Haar upscale scan combining advantages of Haar wavelets with possibility of scanning exploiting local contexts.

LGJan 31, 2019
Improving SGD convergence by online linear regression of gradients in multiple statistically relevant directions

Jarek Duda

Deep neural networks are usually trained with stochastic gradient descent (SGD), which minimizes objective function using very rough approximations of gradient, only averaging to the real gradient. Standard approaches like momentum or ADAM only consider a single direction, and do not try to model distance from extremum - neglecting valuable information from calculated sequence of gradients, often stagnating in some suboptimal plateau. Second order methods could exploit these missed opportunities, however, beside suffering from very large cost and numerical instabilities, many of them attract to suboptimal points like saddles due to negligence of signs of curvatures (as eigenvalues of Hessian). Saddle-free Newton method is a rare example of addressing this issue - changes saddle attraction into repulsion, and was shown to provide essential improvement for final value this way. However, it neglects noise while modelling second order behavior, focuses on Krylov subspace for numerical reasons, and requires costly eigendecomposion. Maintaining SFN advantages, there are proposed inexpensive ways for exploiting these opportunities. Second order behavior is linear dependence of first derivative - we can optimally estimate it from sequence of noisy gradients with least square linear regression, in online setting here: with weakening weights of old gradients. Statistically relevant subspace is suggested by PCA of recent noisy gradients - in online setting it can be made by slowly rotating considered directions toward new gradients, gradually replacing old directions with recent statistically relevant. Eigendecomposition can be also performed online: with regularly performed step of QR method to maintain diagonal Hessian. Outside the second order modeled subspace we can simultaneously perform gradient descent.

LGDec 19, 2018
Credibility evaluation of income data with hierarchical correlation reconstruction

Jarek Duda, Adam Szulc

In situations like tax declarations or analyzes of household budgets we would like to automatically evaluate credibility of exogenous variable (declared income) based on some available (endogenous) variables - we want to build a model and train it on provided data sample to predict (conditional) probability distribution of exogenous variable based on values of endogenous variables. Using Polish household budget survey data there will be discussed simple and systematic adaptation of hierarchical correlation reconstruction (HCR) technique for this purpose, which allows to combine interpretability of statistics with modelling of complex densities like in machine learning. For credibility evaluation we normalize marginal distribution of predicted variable to $ρ\approx 1$ uniform distribution on $[0,1]$ using empirical distribution function $(x=EDF(y)\in[0,1])$, then model density of its conditional distribution $(\textrm{Pr}(x_0|x_1 x_2\ldots))$ as a linear combination of orthonormal polynomials using coefficients modelled as linear combinations of features of the remaining variables. These coefficients can be calculated independently, have similar interpretation as cumulants, additionally allowing to directly reconstruct probability distribution. Values corresponding to high predicted density can be considered as credible, while low density suggests disagreement with statistics of data sample, for example to mark for manual verification a chosen percentage of data points evaluated as the least credible.

LGNov 12, 2018
Gaussian AutoEncoder

Jarek Duda

Generative AutoEncoders require a chosen probability distribution in latent space, usually multivariate Gaussian. The original Variational AutoEncoder (VAE) uses randomness in encoder - causing problematic distortion, and overlaps in latent space for distinct inputs. It turned out unnecessary: we can instead use deterministic encoder with additional regularizer to ensure that sample distribution in latent space is close to the required. The original approach (WAE) uses Wasserstein metric, what required comparing with random sample and using an arbitrarily chosen kernel. Later CWAE finally derived a non-random analytic formula by averaging $L_2$ distance of Gaussian-smoothened sample over all 1D projections. However, these arbitrarily chosen regularizers do not lead to Gaussian distribution. This article proposes approach for regularizers directly optimizing agreement between empirical distribution function and its desired CDF for chosen properties, for example radii and distances for Gaussian distribution, or coordinate-wise, to directly attract this distribution in latent space of AutoEncoder. We can also attract different distributions with this general approach, for example latent space uniform distribution on $[0,1]^D$ hypercube or torus would allow for data compression without entropy coding, increased density near codewords would optimize for the required quantization.

LGJul 11, 2018
Exploiting statistical dependencies of time series with hierarchical correlation reconstruction

Jarek Duda

While we are usually focused on forecasting future values of time series, it is often valuable to additionally predict their entire probability distributions, e.g. to evaluate risk, Monte Carlo simulations. On example of time series of $\approx$ 30000 Dow Jones Industrial Averages, there will be presented application of hierarchical correlation reconstruction for this purpose: MSE estimating polynomial as joint density for (current value, context), where context is for example a few previous values. Then substituting the currently observed context and normalizing density to 1, we get predicted probability distribution for the current value. In contrast to standard machine learning approaches like neural networks, optimal polynomial coefficients here have inexpensive direct formula, have controllable accuracy, are unique and independently calculated, each has a specific cumulant-like interpretation, and such approximation can asymptotically approach complete description of any real joint distribution - providing universal tool to quantitatively describe and exploit statistical dependencies in time series, systematically enhancing ARMA/ARCH-like approaches, also based on different distributions than Gaussian which turns out improper for daily log returns. There is also discussed application for non-stationary time series like calculating linear time trend, or adapting coefficients to local statistical behavior.

LGApr 17, 2018
Hierarchical correlation reconstruction with missing data, for example for biology-inspired neuron

Jarek Duda

Machine learning often needs to model density from a multidimensional data sample, including correlations between coordinates. Additionally, we often have missing data case: that data points can miss values for some of coordinates. This article adapts rapid parametric density estimation approach for this purpose: modelling density as a linear combination of orthonormal functions, for which $L^2$ optimization says that (independently) estimated coefficient for a given function is just average over the sample of value of this function. Hierarchical correlation reconstruction first models probability density for each separate coordinate using all its appearances in data sample, then adds corrections from independently modelled pairwise correlations using all samples having both coordinates, and so on independently adding correlations for growing numbers of variables using often decreasing evidence in data sample. A basic application of such modelled multidimensional density can be imputation of missing coordinates: by inserting known coordinates to the density, and taking expected values for the missing coordinates, or even their entire joint probability distribution. Presented method can be compared with cascade correlations approach, offering several advantages in flexibility and accuracy. It can be also used as artificial neuron: maximizing prediction capabilities for only local behavior - modelling and predicting local connections.

LGJan 3, 2018
Polynomial-based rotation invariant features

Jarek Duda

One of basic difficulties of machine learning is handling unknown rotations of objects, for example in image recognition. A related problem is evaluation of similarity of shapes, for example of two chemical molecules, for which direct approach requires costly pairwise rotation alignment and comparison. Rotation invariants are useful tools for such purposes, allowing to extract features describing shape up to rotation, which can be used for example to search for similar rotated patterns, or fast evaluation of similarity of shapes e.g. for virtual screening, or machine learning including features directly describing shape. A standard approach are rotationally invariant cylindrical or spherical harmonics, which can be seen as based on polynomials on sphere, however, they provide very few invariants - only one per degree of polynomial. There will be discussed a general approach to construct arbitrarily large sets of rotation invariants of polynomials, for degree $D$ in $\mathbb{R}^n$ up to $O(n^D)$ independent invariants instead of $O(D)$ offered by standard approaches, possibly also a complete set: providing not only necessary, but also sufficient condition for differing only by rotation (and reflectional symmetry).

LGFeb 7, 2017
Rapid parametric density estimation

Jarek Duda

Parametric density estimation, for example as Gaussian distribution, is the base of the field of statistics. Machine learning requires inexpensive estimation of much more complex densities, and the basic approach is relatively costly maximum likelihood estimation (MLE). There will be discussed inexpensive density estimation, for example literally fitting a polynomial (or Fourier series) to the sample, which coefficients are calculated by just averaging monomials (or sine/cosine) over the sample. Another discussed basic application is fitting distortion to some standard distribution like Gaussian - analogously to ICA, but additionally allowing to reconstruct the disturbed density. Finally, by using weighted average, it can be also applied for estimation of non-probabilistic densities, like modelling mass distribution, or for various clustering problems by using negative (or complex) weights: fitting a function which sign (or argument) determines clusters. The estimated parameters are approaching the optimal values with error dropping like $1/\sqrt{n}$, where $n$ is the sample size.

ITDec 14, 2016
Lightweight compression with encryption based on Asymmetric Numeral Systems

Jarek Duda, Marcin Niemiec

Data compression combined with effective encryption is a common requirement of data storage and transmission. Low cost of these operations is often a high priority in order to increase transmission speed and reduce power usage. This requirement is crucial for battery-powered devices with limited resources, such as autonomous remote sensors or implants. Well-known and popular encryption techniques are frequently too expensive. This problem is on the increase as machine-to-machine communication and the Internet of Things are becoming a reality. Therefore, there is growing demand for finding trade-offs between security, cost and performance in lightweight cryptography. This article discusses Asymmetric Numeral Systems -- an innovative approach to entropy coding which can be used for compression with encryption. It provides compression ratio comparable with arithmetic coding at similar speed as Huffman coding, hence, this coding is starting to replace them in new compressors. Additionally, by perturbing its coding tables, the Asymmetric Numeral System makes it possible to simultaneously encrypt the encoded message at nearly no additional cost. The article introduces this approach and analyzes its security level. The basic application is reducing the number of rounds of some cipher used on ANS-compressed data, or completely removing additional encryption layer if reaching a satisfactory protection level.