3.5 CR Apr 10
Stringology-Based Cryptanalysis for EChaCha20 Stream Cipher
Victor Kebande
Stringology-Based Cryptanalysis (SBC) offers a structurally aligned approach for uncovering patterns in stream ciphers that traditional statistical tests may fail to detect. Despite \texttt{EChaCha20}'s design enhancements, no systematic investigation has determined whether its expanded 6$\times$6 state matrix and modified Quarter-Round Function (\texttt{QR-F}) introduce subtle keystream patterns, rotational biases, or partial collisions that could serve as statistical distinguishers. Addressing this gap is critical to ensure that the cipher's modifications do not unintentionally reduce its security margin. This paper therefore applies the Knuth-Morris-Pratt (\texttt{KMP}) and Boyer-Moore (\texttt{BM}) algorithms to \texttt{EChaCha20}, a variant of ChaCha20. The author developed and optimized adaptations of \texttt{KMP} and \texttt{BM} for 32-bit word-level pattern analysis and used them to investigate $m$-bit pattern frequency distributions in order to assess \texttt{EChaCha20}'s resistance to rotational-differential attacks. Experimental results on large-scale, one-million-keystream datasets confirm that \texttt{EChaCha20} maintains strong pseudorandomness at the 16-bit and 32-bit levels, with minor irregularities observed in the 8-bit domain. The differential tests also indicate rapid diffusion, with an avalanche effect after two \texttt{QR-F} rounds, and no statistically significant rotational collisions were observed within the evaluated bounds, consistent with expected ARX diffusion behavior beyond 3 rounds. This work puts forward SBC as a complementary tool for ARX cipher evaluation and provides new insights into the security properties of \texttt{EChaCha20}.
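To make the word-level matching idea concrete, the following is a minimal sketch of textbook KMP applied to a sequence of 32-bit words rather than characters. It is not the author's optimized adaptation; the function names (`failure_table`, `kmp_find_all`, `bytes_to_words`) and the little-endian word layout are assumptions made for illustration.

```python
import struct
from typing import List, Sequence


def failure_table(pattern: Sequence[int]) -> List[int]:
    """Length of the longest proper prefix that is also a suffix, per position."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_find_all(text: Sequence[int], pattern: Sequence[int]) -> List[int]:
    """Return every start index where `pattern` occurs in `text` (overlaps included)."""
    if not pattern:
        return []
    table = failure_table(pattern)
    hits = []
    k = 0  # number of pattern words currently matched
    for i, word in enumerate(text):
        while k > 0 and word != pattern[k]:
            k = table[k - 1]
        if word == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = table[k - 1]  # keep going so overlapping matches are found
    return hits


def bytes_to_words(data: bytes) -> List[int]:
    """Split a keystream into 32-bit words (little-endian is an assumption here)."""
    usable = len(data) - len(data) % 4  # drop any trailing partial word
    return list(struct.unpack("<%dI" % (usable // 4), data[:usable]))
```

Because the comparison is on whole words, the same routine works for any symbol alphabet; only `bytes_to_words` would change if a different word size or byte order were wanted.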
49.3 CR Apr 17
Stringology Based Cryptology
Victor Kebande
Modern cryptographic primitives generate large volumes of sequential data such as keystreams, ciphertext blocks, and hash outputs. Traditional cryptographic evaluation methods rely primarily on statistical randomness tests and algebraic cryptanalysis techniques. This paper introduces the concept of Stringology-Based Cryptology (SBC), which applies classical string processing and pattern matching techniques to analyze structural properties of cryptographic outputs. By interpreting cryptographic outputs as symbolic sequences, stringology algorithms can be used to detect pattern recurrence, substring distributions, and structural correlations. In addition, the paper demonstrates how pattern frequency analysis and substring recurrence metrics can be applied to evaluate keystream outputs generated by stream ciphers. Experimental results illustrate that SBC analysis provides complementary insights into structural characteristics of cryptographic sequences and may support future research in structural cryptanalysis and cryptographic evaluation.
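As a rough illustration of pattern frequency analysis and a substring recurrence metric, the sketch below counts non-overlapping byte-aligned patterns, computes Pearson's chi-square statistic against a uniform distribution, and reports the share of patterns that recur. The function names and the restriction to byte-aligned widths are assumptions for this example, not the paper's definitions.

```python
from collections import Counter


def pattern_counts(data: bytes, m_bits: int) -> Counter:
    """Count non-overlapping m-bit patterns; only byte-aligned widths are handled."""
    if m_bits <= 0 or m_bits % 8 != 0:
        raise ValueError("this sketch only handles byte-aligned pattern widths")
    width = m_bits // 8
    usable = len(data) - len(data) % width  # ignore a trailing partial pattern
    return Counter(data[i:i + width] for i in range(0, usable, width))


def chi_square_uniform(counts: Counter, m_bits: int) -> float:
    """Pearson chi-square statistic against a uniform distribution over 2**m_bits."""
    total = sum(counts.values())
    n_patterns = 2 ** m_bits
    expected = total / n_patterns
    seen = sum((obs - expected) ** 2 / expected for obs in counts.values())
    # Each pattern that never occurred contributes (0 - expected)**2 / expected.
    unseen = (n_patterns - len(counts)) * expected
    return seen + unseen


def recurrence_rate(counts: Counter) -> float:
    """Fraction of observed patterns whose value occurs more than once."""
    total = sum(counts.values())
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total if total else 0.0
```

The chi-square statistic is only meaningful when the expected count per pattern is reasonably large, so 32-bit patterns would need a different treatment (for example, collision counting) at realistic sample sizes.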
9.1 CR Apr 14
Neural Stringology Based Cryptanalysis of EChaCha20
Victor Kebande
Modern stream ciphers rely on strong diffusion and pseudorandom keystream generation (PKG) to resist cryptanalysis. While conventional evaluation methods such as statistical randomness tests and differential analysis provide important security assurances, they may fail to detect localized structural patterns embedded within cipher outputs. This paper introduces a Neural Stringology Cryptanalysis (NSC) framework that combines classical string pattern analysis with machine learning techniques to investigate potential structural anomalies in stream cipher keystreams. The proposed approach first applies stringology-inspired feature extraction methods, such as m-gram frequency analysis, substring recurrence detection, and positional pattern statistics, aligned with the internal operations of Add-Rotate-XOR (ARX) based stream ciphers. The extracted features are then analyzed using a neural learning model to identify deviations from expected random behavior and to detect subtle structural patterns that traditional statistical tests may not capture. Experimental evaluation is conducted on keystream outputs generated by the EChaCha20 stream cipher under multiple configurations, including reduced-round variants. The results demonstrate that the proposed NSC framework can identify distinguishable structural characteristics in the keystream data under controlled conditions, suggesting that integrating machine learning with stringology-based analysis is a promising complementary methodology for evaluating the structural robustness of modern ARX-based stream cipher designs.
LG Sep 30, 2023
Beyond Random Noise: Insights on Anonymization Strategies from a Latent Bandit Study
Alexander Galozy, Sadi Alawadi, Victor Kebande et al.
This paper investigates the issue of privacy in a learning scenario where users share knowledge for a recommendation task. Our study contributes to the growing body of research on privacy-preserving machine learning and underscores the need for tailored privacy techniques that address specific attack patterns rather than relying on one-size-fits-all solutions. We use the latent bandit setting to evaluate the trade-off between privacy and recommender performance by employing various aggregation strategies, such as averaging, nearest neighbor, and clustering combined with noise injection. More specifically, we simulate a linkage attack scenario leveraging publicly available auxiliary information acquired by the adversary. Our results on three open real-world datasets reveal that adding noise using the Laplace mechanism to an individual user's data record is a poor choice: it yields the highest regret for any noise level, relative to de-anonymization probability and the ADS metric. Instead, one should combine noise with appropriate aggregation strategies. For example, using averages from clusters of different sizes provides flexibility not achievable by varying the amount of noise alone. Generally, no single aggregation strategy consistently achieves the optimum regret for a given desired level of privacy.
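The contrast between perturbing a single record and perturbing a cluster average can be sketched as follows. This is an illustrative toy, not the authors' experimental pipeline: the function names are hypothetical, records are plain lists of floats, and the choice of noise scale is left to the caller.

```python
import random
from typing import List, Sequence


def laplace_scale(sensitivity: float, epsilon: float) -> float:
    """Standard Laplace-mechanism scale b = sensitivity / epsilon."""
    return sensitivity / epsilon


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb_record(record: Sequence[float], scale: float,
                   rng: random.Random) -> List[float]:
    """Add independent Laplace noise to every coordinate of one record."""
    return [x + laplace_noise(scale, rng) for x in record]


def cluster_average(records: Sequence[Sequence[float]]) -> List[float]:
    """Coordinate-wise mean of the records in one cluster."""
    n = len(records)
    return [sum(column) / n for column in zip(*records)]


def noisy_cluster_average(records: Sequence[Sequence[float]], scale: float,
                          rng: random.Random) -> List[float]:
    """Aggregate first, then add noise to the shared representative."""
    return perturb_record(cluster_average(records), scale, rng)
```

Here the cluster size acts as a second knob alongside the noise scale, which is the kind of flexibility the abstract attributes to cluster-based averaging.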
9.4 CR May 18
Structural Analysis of Cryptographic Sequences using Stringology-Based Fingerprinting
Victor Kebande
Cryptographic primitives such as stream ciphers, Pseudorandom Number Generators (PRNGs), and block cipher modes produce sequences that are designed to be statistically indistinguishable from random data. As a result, traditional evaluation techniques rely primarily on statistical randomness tests to assess the quality of generated sequences. While these tests verify global statistical properties, they do not address whether structural characteristics of sequences can reveal information about the underlying generator. In this paper, we introduce a stringology-based fingerprinting (SBF) framework for the structural analysis of cryptographic sequences. The proposed SBF framework interprets cryptographic outputs as symbolic strings and applies pattern-based feature extraction to capture structural statistics such as substring frequency distributions, recurrence patterns, and entropy characteristics. These structural features are aggregated into fingerprint vectors that characterize sequence generators. The experimental evaluation is conducted using datasets composed of Cipher-Generated Sequences (CGS) and Uniformly Random Sequences (URS). The results demonstrate that stringology-based pattern analysis can reveal measurable structural signatures across different sequence sources. Although these signals do not imply practical cryptographic weaknesses, they provide an additional analytical perspective for evaluating the structural behavior of cryptographic generators.
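A fingerprint vector of the kind described could look like the sketch below, which combines byte-level Shannon entropy, the ratio of distinct overlapping 2-byte substrings, and the relative frequency of the most common byte. The specific three features and all function names are assumptions chosen for illustration; the paper's actual feature set is not given in the abstract.

```python
import math
from collections import Counter
from typing import List, Sequence


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(data).values())


def distinct_ratio(data: bytes, width: int) -> float:
    """Distinct overlapping substrings of `width` bytes over all such substrings."""
    grams = [data[i:i + width] for i in range(len(data) - width + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0


def fingerprint(data: bytes) -> List[float]:
    """Aggregate a few structural statistics into one feature vector."""
    peak = Counter(data).most_common(1)[0][1] / len(data) if data else 0.0
    return [shannon_entropy(data), distinct_ratio(data, 2), peak]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two fingerprint vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Comparing `fingerprint(...)` vectors from two sources with `distance` gives a crude measure of how structurally different their outputs look; the features are on different scales, so a real comparison would likely normalize them first.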
SE May 31, 2023
FedCSD: A Federated Learning Based Approach for Code-Smell Detection
Sadi Alawadi, Khalid Alkharabsheh, Fahed Alkhabbas et al.
This paper proposes a Federated Learning Code Smell Detection (FedCSD) approach that allows organizations to collaboratively train federated ML models while preserving their data privacy. The approach is supported by three experiments on three manually validated datasets aimed at detecting and examining different code smell scenarios. In experiment 1, a centralized training experiment, dataset two achieved the lowest accuracy (92.30%) with fewer smells, while datasets one and three achieved the highest accuracies, with a slight difference between them (98.90% and 99.5%, respectively). Experiment 2 concerned cross-evaluation: each ML model was trained on one dataset and then evaluated on the other two. Results from this experiment show a significant drop in the model's accuracy (lowest accuracy: 63.80%) where fewer smells exist in the training dataset, which has a noticeable reflection (technical debt) on the model's performance. Finally, the third experiment evaluates our approach by splitting the dataset across 10 companies. The ML model was trained at each company's site, and the updated model weights were then transferred to the server. Ultimately, the global model, trained across the 10 companies for 100 training rounds, achieved an accuracy of 98.34%. The results reveal a slight difference between the global model's accuracy and the highest accuracy of the centralized model, which can be ignored in favour of the global model's comprehensive knowledge, lower training cost, preservation of data privacy, and avoidance of the technical debt problem.
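The server-side step, where weights from each company are combined into a global model, can be sketched with sample-size-weighted averaging (the FedAvg rule). The abstract does not say which aggregation rule FedCSD uses, so FedAvg is an assumption here, as are the function name and the flat-list representation of model weights.

```python
from typing import List, Sequence


def federated_average(client_weights: Sequence[Sequence[float]],
                      client_sizes: Sequence[int]) -> List[float]:
    """Combine per-client weight vectors, weighting each by its local sample count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client weight vector")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_weights[0])
    global_weights = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all clients must share the same model shape")
        for j, w in enumerate(weights):
            global_weights[j] += w * size / total
    return global_weights
```

In one training round, each company would train locally, send its weights and sample count, and receive the output of `federated_average` as the starting point for the next round; only weights leave each site, never the code-smell data itself.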