LG Sep 11, 2022
Efficient Approximate Kernel Based Spike Sequence Classification
Sarwan Ali, Bikram Sahoo, Muhammad Asad Khan et al.
Machine learning (ML) models, such as SVM, for tasks like classification and clustering of sequences, require a definition of distance/similarity between pairs of sequences. Several methods have been proposed to compute the similarity between sequences, such as the exact approach that counts the number of matches between $k$-mers (sub-sequences of length $k$) and an approximate approach that estimates pairwise similarity scores. Although exact methods yield better classification performance, they pose high computational costs, limiting their applicability to a small number of sequences. The approximate algorithms are proven to be more scalable and perform comparably to (sometimes better than) the exact methods -- they are designed in a "general" way to deal with different types of sequences (e.g., music, protein, etc.). Although general applicability is a desired property of an algorithm, it is not the case in all scenarios. For example, in the current COVID-19 (coronavirus) pandemic, there is a need for an approach that can deal specifically with the coronavirus. To this end, we propose a series of ways to improve the performance of the approximate kernel (using minimizers and information gain) in order to enhance its predictive performance on coronavirus sequences. More specifically, we improve the quality of the approximate kernel using domain knowledge (computed using information gain) and efficient preprocessing (using minimizers computation) to classify coronavirus spike protein sequences corresponding to different variants (e.g., Alpha, Beta, Gamma). We report results using different classification and clustering algorithms and evaluate their performance using multiple evaluation metrics. Using two datasets, we show that our proposed method helps improve the kernel's performance compared to the baseline and state-of-the-art approaches in the healthcare domain.
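As a rough illustration of the terms used in this abstract (not the authors' implementation), the sketch below extracts $k$-mers, computes an exact match count between two sequences, and picks a minimizer for a $k$-mer. The function names are hypothetical, and the minimizer here is assumed to be the lexicographically smallest length-$m$ substring of the forward $k$-mer only; the paper may define it differently.

```python
from collections import Counter

def kmers(seq, k):
    """Return all overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def minimizer(kmer, m):
    """Lexicographically smallest length-m substring of a k-mer.

    Assumption: only the forward k-mer is considered, and m <= len(kmer).
    """
    return min(kmers(kmer, m))

def kmer_match_count(a, b, k):
    """Exact similarity: number of matching k-mer pairs between a and b."""
    counts_a = Counter(kmers(a, k))
    counts_b = Counter(kmers(b, k))
    # A Counter returns 0 for k-mers that never occur in b.
    return sum(c * counts_b[km] for km, c in counts_a.items())
```

The exact count compares every $k$-mer of one sequence against the other, which is the cost that approximate kernels try to avoid on large collections.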
LG Nov 1, 2022
Impact Of Missing Data Imputation On The Fairness And Accuracy Of Graph Node Classifiers
Haris Mansoor, Sarwan Ali, Shafiq Alam et al.
Analysis of the fairness of machine learning (ML) algorithms has recently attracted many researchers' interest. Most ML methods show bias toward protected groups, which limits the applicability of ML models in many applications such as crime rate prediction. Data may have missing values which, if not appropriately handled, are known to further harm fairness. Many imputation methods have been proposed to deal with missing data. However, the effect of missing data imputation on fairness has not been well studied. In this paper, we analyze the effect on fairness in the context of graph data (node attributes) imputation using different embedding and neural network methods. Extensive experiments on six datasets demonstrate severe fairness issues in missing data imputation under graph node classification. We also find that the choice of the imputation method affects both fairness and accuracy. Our results provide valuable insights into graph data fairness and how to handle missingness in graphs efficiently. This work also provides directions regarding theoretical studies on fairness in graph data.
AI Jul 14, 2023
CAMP: A Context-Aware Cricket Players Performance Metric
Muhammad Sohaib Ayub, Naimat Ullah, Sarwan Ali et al.
Cricket is the second most popular sport after soccer in terms of viewership. However, the assessment of individual player performance, a fundamental task in team sports, is currently primarily based on aggregate performance statistics, including average runs and wickets taken. We propose Context-Aware Metric of player Performance, CAMP, to quantify individual players' contributions toward a cricket match outcome. CAMP employs data mining methods and enables effective data-driven decision-making for selection and drafting, coaching and training, team line-ups, and strategy development. CAMP incorporates the exact context of performance, such as opponents' strengths and specific circumstances of games, such as pressure situations. We empirically evaluate CAMP on data of limited-over cricket matches between 2001 and 2019. In every match, a committee of experts declares one player as the best player, called Man of the Match (MoM). The top two rated players by CAMP match with MoM in 83\% of the 961 games. Thus, the CAMP rating of the best player closely matches that of the domain experts. By this measure, CAMP significantly outperforms the current best-known players' contribution measure based on the Duckworth-Lewis-Stern (DLS) method.
PE Aug 18, 2021
Effective and scalable clustering of SARS-CoV-2 sequences
Sarwan Ali, Tamkanat-E-Ali, Muhammad Asad Khan et al.
SARS-CoV-2, like any other virus, continues to mutate as it spreads, according to an evolutionary process. Unlike any other virus, the number of currently available sequences of SARS-CoV-2 in public databases such as GISAID is already several million. This amount of data has the potential to uncover the evolutionary dynamics of a virus like never before. However, a million is already several orders of magnitude beyond what can be processed by the traditional methods designed to reconstruct a virus's evolutionary history, such as those that build a phylogenetic tree. Hence, new and scalable methods will need to be devised in order to make use of the ever increasing number of viral sequences being collected. Since identifying variants is an important part of understanding the evolution of a virus, in this paper, we propose an approach based on clustering sequences to identify the current major SARS-CoV-2 variants. Using a $k$-mer based feature vector generation and efficient feature selection methods, our approach is effective in identifying variants, as well as being efficient and scalable to millions of sequences. Such a clustering method allows us to show the relative proportion of each variant over time, giving the rate of spread of each variant in different locations -- something which is important for vaccine development and distribution. We also compute the importance of each amino acid position of the spike protein in identifying a given variant in terms of information gain. Positions of high variant-specific importance tend to agree with those reported by the USA's Centers for Disease Control and Prevention (CDC), further demonstrating our approach.
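The information gain of an amino acid position, as mentioned in this abstract, can be sketched with the standard entropy-based definition. This is a minimal sketch, not the authors' code: it assumes `column` holds the amino acid observed at one spike position for each sequence and `labels` holds the variant of each sequence, and both names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(column, labels):
    """Reduction in label entropy after splitting on the value at one position."""
    n = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

A position whose amino acid fully determines the variant gets the maximum gain (the label entropy), while a position that is identical across all sequences gets a gain of zero.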
CV Feb 2, 2020
Effect of Analysis Window and Feature Selection on Classification of Hand Movements Using EMG Signal
Asad Ullah, Sarwan Ali, Imdadullah Khan et al.
Electromyography (EMG) signals have been successfully employed for driving prosthetic limbs of a single or double degree of freedom. This principle works by using the amplitude of the EMG signals to decide between one or two simpler movements. This method underperforms compared to contemporary advances at the mechanical, electronics, and robotics end, and it lacks intuition. Recently, research on myoelectric control based on pattern recognition (PR) shows promising results with the aid of machine learning classifiers. In this approach, termed EMG-PR, EMG signals are divided into analysis windows, and features are extracted for each window. These features are then fed to the machine learning classifiers as input. By offering multiple class movements and intuitive control, this method has the potential to empower an amputated subject to perform everyday life movements. In this paper, we investigate the effect of the analysis window and feature selection on classification accuracy of different hand and wrist movements using time-domain features. We show that effective data preprocessing and optimum feature selection help to improve the classification accuracy of hand movements. We use a publicly available hand and wrist gesture dataset of $40$ intact subjects for experimentation. Results computed using different classification algorithms show that the proposed preprocessing and feature selection outperform the baseline and achieve up to $98\%$ classification accuracy.
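The windowing and feature extraction step described in this abstract can be sketched as follows. This is an illustrative sketch only: the abstract does not say which time-domain features or window settings the authors use, so the four features here (mean absolute value, root mean square, waveform length, zero crossings) and the function names are assumptions.

```python
import math

def analysis_windows(signal, size, step):
    """Split a 1-D signal into windows of `size` samples, advancing by `step`."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]

def time_domain_features(window):
    """Compute a few common time-domain EMG features for one window."""
    n = len(window)
    pairs = list(zip(window, window[1:]))  # consecutive sample pairs
    return {
        "mav": sum(abs(x) for x in window) / n,            # mean absolute value
        "rms": math.sqrt(sum(x * x for x in window) / n),  # root mean square
        "wl": sum(abs(b - a) for a, b in pairs),           # waveform length
        "zc": sum(1 for a, b in pairs if a * b < 0),       # zero crossings
    }
```

Each window yields one feature vector, and these vectors are what a classifier would receive as input. A `step` smaller than `size` gives overlapping windows.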
CR Feb 2, 2020
Permissioned Blockchain-Based Security for SDN in IoT Cloud Networks
Safi Faizullah, Muhammad Asad Khan, Ali Alzahrani et al.
The advancement in cloud networks has enabled connectivity of both traditional networked elements and new devices from all walks of life, thereby forming the Internet of Things (IoT). In an IoT setting, improving and scaling network components as well as reducing cost is essential to sustain exponential growth. In this domain, software-defined networking (SDN) is revolutionizing the network infrastructure with a new paradigm. SDN splits the control/routing logic from the data transfer/forwarding. This splitting causes many issues in SDN, such as vulnerability to DDoS attacks. Many solutions (including blockchain-based ones) have been proposed to overcome these problems. In this work, we offer a blockchain-based solution that is provided in redundant SDN (load-balanced) to service millions of IoT devices. Blockchain is considered tamper-proof and impossible to corrupt due to the replication of the ledger and consensus for verification and addition to the ledger. Therefore, it is a perfect fit for SDN in IoT Networks. Blockchain technology provides everyone with a working proof of decentralized trust. The experimental results show gain and efficiency with respect to the accuracy, update process, and bandwidth utilization.
SP Dec 28, 2019
Short-Term Load Forecasting Using AMI Data
Haris Mansoor, Sarwan Ali, Imdadullah Khan et al.
Accurate short-term load forecasting is essential for the efficient operation of the power sector. Forecasting load at a fine granularity such as hourly loads of individual households is challenging due to higher volatility and inherent stochasticity. At the aggregate levels, such as monthly load at a grid, the uncertainties and fluctuations are averaged out; hence predicting load is more straightforward. This paper proposes a method called Forecasting using Matrix Factorization (FMF) for short-term load forecasting (STLF). FMF only utilizes historical data from consumers' smart meters to forecast future loads (does not use any non-calendar attributes, consumers' demographics or activity patterns information, etc.) and can be applied to any locality. A prominent feature of FMF is that it works at any level of user-specified granularity, both in the temporal (from a single hour to days) and spatial dimensions (a single household to groups of consumers). We empirically evaluate FMF on three benchmark datasets and demonstrate that it significantly outperforms the state-of-the-art methods in terms of load forecasting. The computational complexity of FMF is also substantially less than known methods for STLF such as long short-term memory neural networks, random forest, support vector machines, and regression trees.
LG Dec 27, 2019
Predicting Attributes of Nodes Using Network Structure
Sarwan Ali, Muhammad Haroon Shakeel, Imdadullah Khan et al.
In many graphs such as social networks, nodes have associated attributes representing their behavior. Predicting node attributes in such graphs is an important problem with applications in many domains like recommendation systems, privacy preservation, and targeted advertisement. Attribute values can be predicted by analyzing patterns and correlations among attributes and employing classification/regression algorithms. However, these approaches do not utilize readily available network topology information. In this regard, interconnections between different attributes of nodes can be exploited to improve the prediction accuracy. In this paper, we propose an approach to represent a node by a feature map with respect to an attribute $a_i$ (which is used as input for machine learning algorithms) using all attributes of neighbors to predict attribute values for $a_i$. We perform extensive experimentation on ten real-world datasets and show that the proposed feature map significantly improves the prediction accuracy as compared to baseline approaches on these datasets.
CR Dec 27, 2019
Detecting DDoS Attack on SDN Due to Vulnerabilities in OpenFlow
Sarwan Ali, Maria Khalid Alvi, Safi Faizullah et al.
Software Defined Networking (SDN) is a network paradigm shift that facilitates comprehensive network programmability to cope with emerging new technologies such as cloud computing and big data. SDN facilitates simplified and centralized network management enabling it to operate in dynamic scenarios. Further, SDN uses the OpenFlow protocol for communication between the controller and its switches. OpenFlow creates vulnerabilities to network attacks, especially Distributed Denial of Service (DDoS). DDoS attacks are launched from the compromised hosts connected to the SDN switches. In this paper, we introduce a time- and space-efficient solution for the identification of these compromised hosts. Our solution consumes fewer computational resources and less space and does not require any special equipment.
LG Dec 27, 2019
Efficient Data Analytics on Augmented Similarity Triplets
Sarwan Ali, Muhammad Ahmad, Umair ul Hassan et al.
Data analysis requires a pairwise proximity measure over objects. Recent work has extended this to situations where the distance information between objects is given as comparison results of distances between three objects (triplets). Humans find the comparison tasks much easier than the exact distance computation, and such data can be easily obtained in large quantities via crowd-sourcing. In this work, we propose triplets augmentation, an efficient method to extend the triplets data by inferring the hidden implicit information from the existing data. Triplets augmentation improves the quality of kernel-based and kernel-free data analytics. We also propose a novel set of algorithms for common data analysis tasks based on triplets. These methods work directly with triplets and avoid kernel evaluations, and are thus scalable to big data. We demonstrate that our methods outperform the current best-known techniques and are robust to noisy data.
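One plausible way to infer implicit information from triplets is transitivity, sketched below. This is an assumption for illustration, not necessarily the augmentation rule used in the paper. Here a triplet `(a, b, c)` is taken to mean "a is closer to b than to c", so `(a, b, c)` and `(a, c, d)` together imply `(a, b, d)`. The function name `augment_triplets` is hypothetical.

```python
def augment_triplets(triplets):
    """Add triplets implied by transitivity for each anchor object.

    Assumes (a, b, c) means d(a, b) < d(a, c) and that the input is consistent.
    """
    # farther[a][b] = objects directly stated to be farther from a than b is
    farther = {}
    for a, b, c in triplets:
        farther.setdefault(a, {}).setdefault(b, set()).add(c)

    augmented = set(triplets)
    for a, graph in farther.items():
        for b in graph:
            # Depth-first search over the "farther than" relation of anchor a.
            stack, seen = list(graph[b]), set()
            while stack:
                c = stack.pop()
                if c in seen:
                    continue
                seen.add(c)
                stack.extend(graph.get(c, ()))
            augmented.update((a, b, c) for c in seen)
    return augmented
```

With noisy input that contains contradictory comparisons, this closure would also propagate the contradictions, so a practical method needs some way to handle conflicts.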
CR Apr 5, 2015
CRT and Fixed Patterns in Combinatorial Sequences
Muhammad Asad Khan, Amir Ali Khan, Fauzan Mirza
In this paper, a new context for Chinese Remainder Theorem (CRT) based analysis of combinatorial sequence generators is presented. CRT is exploited to establish fixed patterns in LFSR sequences and underlying cyclic structures of finite fields. A new methodology for direct computation of DFT spectral points in higher finite fields from known DFT spectral points of smaller constituent fields is also introduced. A novel approach to CRT-based structural analysis of LFSR-based combinatorial sequences is given in both the time and frequency domains. The proposed approach is demonstrated on some examples of combiner generators and is scalable to general configurations of combiner generators.
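For readers unfamiliar with the CRT itself, the sketch below solves a system of congruences with pairwise coprime moduli. It illustrates only the theorem, not the paper's analysis, and the function name `crt` is hypothetical. It relies on Python 3.8+, where `pow(m, -1, n)` returns a modular inverse.

```python
from math import gcd

def crt(residues, moduli):
    """Return (x, M) with x = r_i (mod n_i) for all i and M the product of moduli."""
    x, m = 0, 1
    for r, n in zip(residues, moduli):
        if gcd(m, n) != 1:
            raise ValueError("moduli must be pairwise coprime")
        # Choose t so that x + m*t is congruent to r modulo n.
        t = ((r - x) * pow(m, -1, n)) % n
        x += m * t
        m *= n
    return x, m
```

For example, two LFSRs with coprime periods 7 and 31 have a joint period of 217, and `crt([3, 10], [7, 31])` recovers the single index in that joint period at which the first register is at phase 3 and the second at phase 10.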
CR Mar 3, 2015
Transform Domain Analysis of Sequences
Muhammad Asad Khan, Amir A Khan, Fauzan Mirza
In cryptanalysis, security of ciphers vis-a-vis attacks is gauged against three criteria of complexities, i.e., computations, memory and time. Some features may not be so apparent in a particular domain, and their analysis in a transformed domain often reveals interesting patterns. Moreover, the complexity criteria in different domains are different and performance improvements are often achieved by transforming the problem into an alternate domain. Owing to the results of coding theory and signal processing, Discrete Fourier Transform (DFT) based attacks have proven to be more efficient than algebraic attacks in terms of their computational complexity. Motivated by DFT based attacks, we present a transform domain analysis of Linear Feedback Shift Register (LFSR) based sequence generators. The time and frequency domain behavior of non-linear filter and combiner generators is discussed along with some novel observations based on the Chinese Remainder Theorem (CRT). CRT is exploited to establish patterns in LFSR sequences and underlying cyclic structures of finite fields. Application of DFT spectra attacks on combiner generators is also demonstrated. Our proposed method saves on the last stage computations of selective DFT attacks for combiner generators. The proposed approach is demonstrated on some examples of combiner generators and is scalable to general configurations of combiner generators.
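As background for the LFSR-based generators discussed in this abstract, the sketch below generates the time-domain output of a small binary LFSR and measures its period. It does not implement the DFT attack. The function names and the tap convention (indices into the current state that are XORed to form the next bit) are assumptions made for this example.

```python
def lfsr(taps, state, n):
    """Generate n output bits of a binary LFSR.

    taps: indices into the current state whose XOR forms the feedback bit.
    state: initial register contents as a list of 0/1 values.
    """
    state = list(state)
    out = []
    for _ in range(n):
        out.append(state[0])
        feedback = 0
        for t in taps:
            feedback ^= state[t]
        state = state[1:] + [feedback]  # shift left, append the feedback bit
    return out

def least_period(seq):
    """Smallest shift p under which the observed sequence repeats."""
    for p in range(1, len(seq)):
        if all(seq[i] == seq[i + p] for i in range(len(seq) - p)):
            return p
    return len(seq)
```

With `taps=(0, 1)` and a 3-bit state, the recurrence is s[n+3] = s[n] XOR s[n+1], which corresponds to the primitive polynomial x^3 + x + 1, so any nonzero initial state produces a sequence of period 7.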