Haris Mansoor

h-index6

6papers

38citations

Novelty52%

AI Score36

Ranked #101,177 of 194,257 authors (top 52%)#22,282 in LG (top 55%)

6 Papers

6.9LGNov 1, 2022Code

Impact Of Missing Data Imputation On The Fairness And Accuracy Of Graph Node Classifiers

Haris Mansoor, Sarwan Ali, Shafiq Alam et al.

Analysis of the fairness of machine learning (ML) algorithms recently attracted many researchers' interest. Most ML methods show bias toward protected groups, which limits the applicability of ML models in many applications like crime rate prediction etc. Since the data may have missing values which, if not appropriately handled, are known to further harmfully affect fairness. Many imputation methods are proposed to deal with missing data. However, the effect of missing data imputation on fairness is not studied well. In this paper, we analyze the effect on fairness in the context of graph data (node attributes) imputation using different embedding and neural network methods. Extensive experiments on six datasets demonstrate severe fairness issues in missing data imputation under graph node classification. We also find that the choice of the imputation method affects both fairness and accuracy. Our results provide valuable insights into graph data fairness and how to handle missingness in graphs efficiently. This work also provides directions regarding theoretical studies on fairness in graph data.

2.6LGOct 16, 2024

Position Specific Scoring Is All You Need? Revisiting Protein Sequence Classification Tasks

Sarwan Ali, Taslim Murad, Prakash Chourasia et al.

Understanding the structural and functional characteristics of proteins are crucial for developing preventative and curative strategies that impact fields from drug discovery to policy development. An important and popular technique for examining how amino acids make up these characteristics of the protein sequences with position-specific scoring (PSS). While the string kernel is crucial in natural language processing (NLP), it is unclear if string kernels can extract biologically meaningful information from protein sequences, despite the fact that they have been shown to be effective in the general sequence analysis tasks. In this work, we propose a weighted PSS kernel matrix (or W-PSSKM), that combines a PSS representation of protein sequences, which encodes the frequency information of each amino acid in a sequence, with the notion of the string kernel. This results in a novel kernel function that outperforms many other approaches for protein sequence classification. We perform extensive experimentation to evaluate the proposed method. Our findings demonstrate that the W-PSSKM significantly outperforms existing baselines and state-of-the-art methods and achieves up to 45.1\% improvement in classification accuracy.

4.1LGOct 1, 2025

Breaking the Euclidean Barrier: Hyperboloid-Based Biological Sequence Analysis

Sarwan Ali, Haris Mansoor, Murray Patterson

Genomic sequence analysis plays a crucial role in various scientific and medical domains. Traditional machine-learning approaches often struggle to capture the complex relationships and hierarchical structures of sequence data when working in high-dimensional Euclidean spaces. This limitation hinders accurate sequence classification and similarity measurement. To address these challenges, this research proposes a method to transform the feature representation of biological sequences into the hyperboloid space. By applying a transformation, the sequences are mapped onto the hyperboloid, preserving their inherent structural information. Once the sequences are represented in the hyperboloid space, a kernel matrix is computed based on the hyperboloid features. The kernel matrix captures the pairwise similarities between sequences, enabling more effective analysis of biological sequence relationships. This approach leverages the inner product of the hyperboloid feature vectors to measure the similarity between pairs of sequences. The experimental evaluation of the proposed approach demonstrates its efficacy in capturing important sequence correlations and improving classification accuracy.

2.6LGDec 19, 2024

Computing Gram Matrix for SMILES Strings using RDKFingerprint and Sinkhorn-Knopp Algorithm

Sarwan Ali, Haris Mansoor, Prakash Chourasia et al.

In molecular structure data, SMILES (Simplified Molecular Input Line Entry System) strings are used to analyze molecular structure design. Numerical feature representation of SMILES strings is a challenging task. This work proposes a kernel-based approach for encoding and analyzing molecular structures from SMILES strings. The proposed approach involves computing a kernel matrix using the Sinkhorn-Knopp algorithm while using kernel principal component analysis (PCA) for dimensionality reduction. The resulting low-dimensional embeddings are then used for classification and regression analysis. The kernel matrix is computed by converting the SMILES strings into molecular structures using the Morgan Fingerprint, which computes a fingerprint for each molecule. The distance matrix is computed using the pairwise kernels function. The Sinkhorn-Knopp algorithm is used to compute the final kernel matrix that satisfies the constraints of a probability distribution. This is achieved by iteratively adjusting the kernel matrix until the marginal distributions of the rows and columns match the desired marginal distributions. We provided a comprehensive empirical analysis of the proposed kernel method to evaluate its goodness with greater depth. The suggested method is assessed for drug subcategory prediction (classification task) and solubility AlogPS ``Aqueous solubility and Octanol/Water partition coefficient" (regression task) using the benchmark SMILES string dataset. The outcomes show the proposed method outperforms several baseline methods in terms of supervised analysis and has potential uses in molecular design and drug discovery. Overall, the suggested method is a promising avenue for kernel methods-based molecular structure analysis and design.

2.6LGOct 21, 2024

MIK: Modified Isolation Kernel for Biological Sequence Visualization, Classification, and Clustering

Sarwan Ali, Prakash Chourasia, Haris Mansoor et al.

The t-Distributed Stochastic Neighbor Embedding (t-SNE) has emerged as a popular dimensionality reduction technique for visualizing high-dimensional data. It computes pairwise similarities between data points by default using an RBF kernel and random initialization (in low-dimensional space), which successfully captures the overall structure but may struggle to preserve the local structure efficiently. This research proposes a novel approach called the Modified Isolation Kernel (MIK) as an alternative to the Gaussian kernel, which is built upon the concept of the Isolation Kernel. MIK uses adaptive density estimation to capture local structures more accurately and integrates robustness measures. It also assigns higher similarity values to nearby points and lower values to distant points. Comparative research using the normal Gaussian kernel, the isolation kernel, and several initialization techniques, including random, PCA, and random walk initializations, are used to assess the proposed approach (MIK). Additionally, we compare the computational efficiency of all $3$ kernels with $3$ different initialization methods. Our experimental results demonstrate several advantages of the proposed kernel (MIK) and initialization method selection. It exhibits improved preservation of the local and global structure and enables better visualization of clusters and subclusters in the embedded space. These findings contribute to advancing dimensionality reduction techniques and provide researchers and practitioners with an effective tool for data exploration, visualization, and analysis in various domains.

10.3SPDec 28, 2019

Short-Term Load Forecasting Using AMI Data

Haris Mansoor, Sarwan Ali, Imdadullah Khan et al.

Accurate short-term load forecasting is essential for the efficient operation of the power sector. Forecasting load at a fine granularity such as hourly loads of individual households is challenging due to higher volatility and inherent stochasticity. At the aggregate levels, such as monthly load at a grid, the uncertainties and fluctuations are averaged out; hence predicting load is more straightforward. This paper proposes a method called Forecasting using Matrix Factorization (\textsc{fmf}) for short-term load forecasting (\textsc{stlf}). \textsc{fmf} only utilizes historical data from consumers' smart meters to forecast future loads (does not use any non-calendar attributes, consumers' demographics or activity patterns information, etc.) and can be applied to any locality. A prominent feature of \textsc{fmf} is that it works at any level of user-specified granularity, both in the temporal (from a single hour to days) and spatial dimensions (a single household to groups of consumers). We empirically evaluate \textsc{fmf} on three benchmark datasets and demonstrate that it significantly outperforms the state-of-the-art methods in terms of load forecasting. The computational complexity of \textsc{fmf} is also substantially less than known methods for \textsc{stlf} such as long short-term memory neural networks, random forest, support vector machines, and regression trees.