Nabil Mlaiki

2 Papers

CV · Jun 1, 2024 · Code
An Effective Weight Initialization Method for Deep Learning: Application to Satellite Image Classification

Wadii Boulila, Eman Alshanqiti, Ayyub Alzahem et al.

The growing interest in satellite imagery has triggered the need for efficient mechanisms to extract valuable information from these vast data sources, providing deeper insights. Although deep learning has shown significant progress in satellite image classification, only a few results on weight initialization techniques can be found in the literature. These techniques traditionally involve initializing the networks' weights before training on extensive datasets, as distinct from fine-tuning the weights of pre-trained networks. In this study, a novel weight initialization method is proposed in the context of satellite image classification. The proposed method is mathematically detailed for the forward and backward passes of the convolutional neural network (CNN) model. Extensive experiments are carried out using six real-world datasets. Comparative analyses against existing weight initialization techniques on various well-known CNN models reveal that the proposed technique outperforms the previous competitive techniques in classification accuracy. The complete code of the proposed technique, along with the obtained results, is available at https://github.com/WadiiBoulila/Weight-Initialization

6.9 · LG · May 9
VORT: Adaptive Power-Law Memory for NLP Transformers

Nabil Mlaiki

Standard Transformers impose near-exponential decay on the influence of distant tokens, conflicting with the power-law structure of long-range dependencies in natural language. We introduce the Variable-Order Retention Transformer (VORT), a memory architecture in which each ingested token is assigned a learnable fractional order $\alpha_i \in [\delta, 1]$ that governs a Grünwald–Letnikov power-law retention kernel. Because the fractional weighted sum is non-Markovian, we approximate it through a sum-of-exponentials (SOE) decomposition computed by Gauss–Laguerre quadrature on a Laplace-type integral representation of the kernel weights. Each exponential component admits a one-step Markovian recurrence at $O(S d_v)$ per step, where $S = O(\log(T/\varepsilon))$ terms suffice for $\varepsilon$-uniform accuracy on the horizon $[1, T]$. Retrieval is keyed and associative via a linear-attention accumulator with an exact $O(K S d_\phi d_v)$-per-step recurrence. Four results are established: (i) an SOE approximation theorem with geometric convergence rate from the analyticity of the integrand after a log-change of variables; (ii) a quantisation bound valid on $[\delta, 1]$ with correct analysis near $\alpha = 0$; (iii) a direct $L^2$ energy argument (Proposition) showing that for $\alpha > 1/2$ any mixture with fixed minimum decay rate $\Lambda > 0$ incurs $L^2([1, T])$ error at least $N_\alpha(T) - C(\Lambda) \to \infty$, with the $\Lambda$-dependence made explicit; and (iv) linear convergence of a gradient plasticity rule under the Polyak–Łojasiewicz condition. Two synthetic experiments confirm the architectural advantage: a Zipf-distributed retrieval benchmark and an entity label-copy task with uniform lag distribution, the latter ruling out prior-matching as an explanation for the power-law kernel's advantage.
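
The sum-of-exponentials idea in the abstract above can be illustrated with a minimal sketch. It is not the paper's construction: it assumes a plain power-law kernel `t**(-beta)` rather than the Grünwald–Letnikov weights, and it swaps Gauss–Laguerre quadrature for the simpler trapezoidal rule applied after the log-change of variables `s = exp(u)` in the Laplace-type identity `t**(-beta) = (1/Gamma(beta)) * integral of s**(beta-1) * exp(-s*t) ds` over `s > 0`. The function names (`soe_power_law`, `soe_retention`, `direct_retention`) and the cutoff choices are hypothetical and exist only for this example.

```python
import math
import numpy as np


def soe_power_law(beta, T, eps=1e-8, h=0.5):
    """Return (c, lam) with t**(-beta) ~= sum_j c[j] * exp(-lam[j] * t) on [1, T].

    Assumption: trapezoidal rule in u = log(s), not Gauss-Laguerre quadrature.
    """
    # Left cutoff: the neglected tail is at most exp(beta*u_min) / (beta*Gamma(beta));
    # choose u_min so that this equals eps * T**(-beta).
    u_min = math.log(eps * beta * math.gamma(beta)) / beta - math.log(T)
    # Right cutoff: exp(-t*exp(u)) with t >= 1 is negligible once exp(u) >> -log(eps).
    u_max = math.log(-math.log(eps)) + 1.0
    u = np.arange(u_min, u_max + h, h)
    lam = np.exp(u)                               # decay rates of the components
    c = h * np.exp(beta * u) / math.gamma(beta)   # quadrature weights
    return c, lam


def soe_retention(x, c, lam):
    """Approximate y[n] = sum_{i<=n} (n - i + 1)**(-beta) * x[i] in O(S) per step."""
    decay = np.exp(-lam)
    state = np.zeros_like(lam)   # one scalar memory per exponential component
    y = np.empty(len(x))
    for n, x_n in enumerate(x):
        # Markovian update: every stored input ages by one step.
        state = decay * (state + x_n)
        y[n] = c @ state
    return y


def direct_retention(x, beta):
    """Reference O(n) per step evaluation of the same power-law weighted sum."""
    x = np.asarray(x, dtype=float)
    y = np.empty(len(x))
    for n in range(len(x)):
        lags = np.arange(n + 1, 0, -1, dtype=float)   # lag of x[0], ..., x[n]
        y[n] = np.sum(lags ** (-beta) * x[: n + 1])
    return y


if __name__ == "__main__":
    c, lam = soe_power_law(beta=0.5, T=1000)
    rng = np.random.default_rng(0)
    x = rng.standard_normal(200)
    gap = np.max(np.abs(soe_retention(x, c, lam) - direct_retention(x, 0.5)))
    print(f"{len(c)} exponential terms, max abs gap {gap:.2e}")
```

In this sketch the number of terms grows like `log(T) + log(1/eps) / beta`, which matches the logarithmic scaling in `T/eps` that the abstract states for S, while the per-step cost of `soe_retention` is independent of how many inputs have been seen.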