Jay Kumar

CL
11 papers
452 citations
Novelty 27%
AI Score 25

11 Papers

LG Oct 29, 2023
Multi-Path Long-Term Vessel Trajectories Forecasting with Probabilistic Feature Fusion for Problem Shifting

Gabriel Spadon, Jay Kumar, Derek Eden et al.

This paper addresses the challenge of boosting the precision of multi-path long-term vessel trajectory forecasting on engineered sequences of Automatic Identification System (AIS) data using feature fusion for problem shifting. We have developed a deep auto-encoder model and a phased framework approach to predict the next 12 hours of vessel trajectories using 1 to 3 hours of AIS data as input. To this end, we fuse the spatiotemporal features from the AIS messages with probabilistic features engineered from historical AIS data referring to potential routes and destinations. As a result, we reduce the forecasting uncertainty by shifting the problem into a trajectory reconstruction problem. The probabilistic features have an F1-Score of approximately 85% and 75% for the vessel route and destination prediction, respectively. Under such circumstances, we achieved an R2 Score of over 98% with different layer structures and varying feature combinations; the high R2 Score is a natural outcome of the well-defined shipping lanes in the study region. However, our proposal stands out among competing approaches as it demonstrates the capability of complex decision-making during turnings and route selection. Furthermore, we have shown that our model achieves more accurate forecasting with average and median errors of 11km and 6km, respectively, a 25% improvement from the current state-of-the-art approaches. The resulting model from this proposal is deployed as part of a broader Decision Support System to safeguard whales by preventing the risk of vessel-whale collisions under the smartWhales initiative and acting on the Gulf of St. Lawrence in Atlantic Canada.
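The abstract reports average and median forecasting errors in kilometres. As a minimal sketch of how such figures could be computed, the example below measures great-circle (haversine) distance between predicted and actual positions. The abstract does not state the exact error metric, so the haversine choice, the function names, and the sample coordinates are all assumptions made for illustration.

```python
import math
from statistics import mean, median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def forecast_errors(predicted, actual):
    """Return (mean, median) distance in km over paired (lat, lon) points."""
    distances = [haversine_km(p[0], p[1], a[0], a[1])
                 for p, a in zip(predicted, actual)]
    return mean(distances), median(distances)


# Hypothetical positions, not taken from the paper.
predicted = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
actual = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
avg_err, med_err = forecast_errors(predicted, actual)
print(f"mean error: {avg_err:.2f} km, median error: {med_err:.2f} km")
```

Reporting the median alongside the mean, as the abstract does, makes the summary less sensitive to a few badly mispredicted trajectories.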

CL Aug 28, 2024
An Evaluation of Sindhi Word Embedding in Semantic Analogies and Downstream Tasks

Wazir Ali, Saifullah Tumrani, Jay Kumar et al.

In this paper, we propose a new word embedding based corpus consisting of more than 61 million words crawled from multiple web resources. We design a preprocessing pipeline for the filtration of unwanted text from crawled data. Afterwards, the cleaned vocabulary is fed to state-of-the-art continuous-bag-of-words, skip-gram, and GloVe word embedding algorithms. For the evaluation of pretrained embeddings, we use popular intrinsic and extrinsic evaluation approaches. The evaluation results reveal that continuous-bag-of-words and skip-gram perform better than GloVe and the existing Sindhi fastText word embeddings on both intrinsic and extrinsic evaluation approaches.

CL Nov 28, 2019 Code
Word Embedding based New Corpus for Low-resourced Language: Sindhi

Wazir Ali, Jay Kumar, Junyu Lu et al.

Representing words and phrases as dense vectors of real numbers that encode semantic and syntactic properties is a vital constituent of natural language processing (NLP). The success of neural network (NN) models in NLP largely relies on such dense word representations learned from large unlabeled corpora. Sindhi is a morphologically rich language spoken by a large population in Pakistan and India, yet it lacks corpora, which play an essential role as a test-bed for generating word embeddings and developing language-independent NLP systems. In this paper, a large corpus of more than 61 million words is developed for the low-resourced Sindhi language for training neural word embeddings. The corpus is acquired from multiple web resources using web-scrappy. Due to the unavailability of open-source preprocessing tools for Sindhi, preprocessing such a large corpus becomes a challenging problem, especially the cleaning of noisy data extracted from web resources. Therefore, a preprocessing pipeline is employed for the filtration of noisy text. Afterwards, the cleaned vocabulary is utilized for training Sindhi word embeddings with the state-of-the-art GloVe, Skip-Gram (SG), and Continuous Bag of Words (CBoW) word2vec algorithms. The intrinsic evaluation approaches of cosine similarity matrix and WordSim-353 are employed for the evaluation of the generated Sindhi word embeddings. Moreover, we compare the proposed word embeddings with the recently released Sindhi fastText (SdfastText) word representations. Our intrinsic evaluation results demonstrate the high quality of our generated Sindhi word embeddings using SG, CBoW, and GloVe as compared to SdfastText word representations.
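The intrinsic evaluation mentioned above relies on cosine similarity between word vectors. The following is a minimal sketch of that idea, not the authors' evaluation code: the function names are illustrative, and the three-dimensional toy vectors (with English placeholder words) are hypothetical stand-ins for trained Sindhi embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def nearest_neighbors(word, embeddings, k=3):
    """Rank all other words by cosine similarity to `word`, highest first."""
    target = embeddings[word]
    scored = [(other, cosine_similarity(target, vec))
              for other, vec in embeddings.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy vectors; real embeddings have hundreds of dimensions.
toy_embeddings = {
    "king": [0.9, 0.1, 0.3],
    "queen": [0.85, 0.15, 0.35],
    "apple": [0.1, 0.9, 0.2],
}
print(nearest_neighbors("king", toy_embeddings, k=2))
```

A word-similarity benchmark such as WordSim-353 would then compare these cosine scores against human similarity judgments for the same word pairs.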

IR Aug 15, 2024
Evolving Text Data Stream Mining

Jay Kumar

A text stream is an ordered sequence of text documents generated over time. A massive amount of such text data is generated by online social platforms every day. Designing an algorithm to extract useful information from such text streams is a challenging task due to unique properties of the stream such as infinite length, data sparsity, and evolution. Consequently, learning useful information from such streaming data under the constraint of limited time and memory has gained increasing attention. Although many text stream mining algorithms have been proposed during the past decade, some potential issues still exist. First, high-dimensional text data heavily degrades the learning performance unless the model either works on a subspace or reduces the global feature space. The second issue is to extract semantic text representations of documents and capture evolving topics over time. Moreover, the problem of label scarcity exists, whereas existing approaches assume the full availability of labeled data. To deal with these issues, in this thesis, new learning models are proposed for clustering and multi-label learning on text streams.

CR Apr 22, 2021
Blockchain based Privacy-Preserved Federated Learning for Medical Images: A Case Study of COVID-19 CT Scans

Rajesh Kumar, WenYong Wang, Cheng Yuan et al.

Medical health care centers are envisioned as a promising paradigm to handle the massive volume of COVID-19 patient data using artificial intelligence (AI). Traditionally, AI techniques require centralized data collection and model training within a single organization, which is a common weakness because communicating raw data raises privacy and security concerns. To solve this challenging task, we propose a blockchain-based federated learning framework that provides collaborative data training solutions by coordinating multiple hospitals to train and share encrypted federated models without leaking private data. The blockchain ledger technology decentralizes the federated learning model, removing the need for any central server. The proposed homomorphic encryption scheme encrypts and decrypts the gradients of the model to preserve privacy. More precisely, the proposed framework: i) trains the local model using a novel capsule network to segment and classify COVID-19 images, ii) uses the homomorphic encryption scheme, which encrypts and decrypts the gradients, to secure the local model, and iii) shares the model over a decentralized platform through the proposed blockchain-based federated learning algorithm. The integration of blockchain and federated learning leads to a new paradigm for medical image data sharing in a decentralized network. The experimental results demonstrate the performance of the proposed scheme.

LG Apr 2, 2021
A Comparative Analysis of Machine Learning and Grey Models

Gang He, Khwaja Mutahir Ahmad, Wenxin Yu et al.

Artificial Intelligence (AI) has recently shown its capabilities in almost every field of life. Machine Learning, a subset of AI, is a 'hot' topic for researchers. Machine Learning outperforms other classical forecasting techniques in almost all natural applications and is a crucial part of modern research. However, modern Machine Learning algorithms are hungry for big data, so researchers may prefer not to use them when only small datasets are available. To tackle this issue, the main purpose of this survey is to illustrate, through related studies, the significance of a semi-parametric Machine Learning framework called Grey Machine Learning (GML). This kind of framework is capable of handling large datasets as well as small datasets for time series forecasting. This survey presents a comprehensive overview of the existing semi-parametric machine learning techniques for time series forecasting. In this paper, a primer survey on the GML framework is provided for researchers. To give readers an in-depth understanding, a brief description of Machine Learning, as well as of various forms of conventional grey forecasting models, is provided. Moreover, a brief description of the importance of the GML framework is presented.

CR Feb 26, 2021
Collective Intelligence: Decentralized Learning for Android Malware Detection in IoT with Blockchain

Rajesh Kumar, WenYong Wang, Jay Kumar et al.

The widespread significance of Android IoT devices is due to their flexibility and hardware support features, which have revolutionized the digital world by introducing exciting applications in almost all walks of daily life, such as healthcare, smart cities, smart environments, safety, remote sensing, and many more. Such versatile applicability gives incentive for more malware attacks. In this paper, we propose a framework which continuously aggregates multiple user-trained models on non-overlapping data into a single model. Specifically, for the malware detection task, (i) we propose a novel user (local) neural network (LNN) which trains on the local distribution and (ii) to assure model authenticity and quality, we propose a novel smart contract which enables the aggregation process over a blockchain platform. The LNN model analyzes various static and dynamic features of both malware and benign applications, whereas the smart contract verifies the malicious applications for both uploading and downloading processes in the network using stored aggregated features of local models. In this way, the proposed model not only improves malware detection accuracy using a decentralized model network but also improves model efficacy with blockchain. We evaluate our approach against three state-of-the-art models and perform deep analyses of the extracted features of the relative model.

CV Feb 19, 2021
Trends in Vehicle Re-identification Past, Present, and Future: A Comprehensive Review

Zakria, Jianhua Deng, Muhammad Saddam Khokhar et al.

Vehicle re-identification (re-id) over a surveillance camera network with non-overlapping fields of view is an exciting and challenging task in intelligent transportation systems (ITS). Due to its versatile applicability in metropolitan cities, it has gained significant attention. Vehicle re-id matches a targeted vehicle over non-overlapping views in a multiple-camera network. However, it becomes more difficult due to inter-class similarity, intra-class variability, viewpoint changes, and spatio-temporal uncertainty. In order to draw a detailed picture of vehicle re-id research, this paper gives a comprehensive description of the various vehicle re-id technologies, applicability, datasets, and a brief comparison of different methodologies. Our paper specifically focuses on vision-based vehicle re-id approaches, including vehicle appearance, license plate, and spatio-temporal characteristics. In addition, we explore the main challenges as well as a variety of applications in different domains. Lastly, a detailed comparison of the performance of current state-of-the-art methods on the VeRi-776 and VehicleID datasets is summarized along with future directions. We aim to facilitate future research by reviewing the work done on vehicle re-id to date.

CL Dec 30, 2020
Enhancing Sindhi Word Segmentation using Subword Representation Learning and Position-aware Self-attention

Wazir Ali, Jay Kumar, Saifullah Tumrani et al.

Sindhi word segmentation is a challenging task due to space omission and insertion issues. The Sindhi language itself adds to this complexity. It's cursive and consists of characters with inherent joining and non-joining properties, independent of word boundaries. Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features. However, these methods have limitations, such as difficulty handling out-of-vocabulary words, limited robustness for other languages, and inefficiency with large amounts of noisy or raw text. Neural network-based models, in contrast, can automatically capture word boundary information without requiring prior knowledge. In this paper, we propose a Subword-Guided Neural Word Segmenter (SGNWS) that addresses word segmentation as a sequence labeling task. The SGNWS model incorporates subword representation learning through a bidirectional long short-term memory encoder, position-aware self-attention, and a conditional random field. Our empirical results demonstrate that the SGNWS model achieves state-of-the-art performance in Sindhi word segmentation on six datasets.

IV Jul 10, 2020
Blockchain-Federated-Learning and Deep Learning Models for COVID-19 detection using CT Imaging

Rajesh Kumar, Abdullah Aman Khan, Sinmin Zhang et al.

With the increase of COVID-19 cases worldwide, an effective way is required to diagnose COVID-19 patients. The primary problem in diagnosing COVID-19 patients is the shortage and reliability of testing kits; due to the quick spread of the virus, medical practitioners are facing difficulty identifying the positive cases. The second real-world problem is to share the data among hospitals globally while keeping in view the privacy concerns of the organizations. Building a collaborative model and preserving privacy are major concerns for training a global deep learning model. This paper proposes a framework that collects a small amount of data from different sources (various hospitals) and trains a global deep learning model using blockchain-based federated learning. Blockchain technology authenticates the data, and federated learning trains the model globally while preserving the privacy of the organization. First, we propose a data normalization technique that deals with the heterogeneity of the data, as the data is gathered from different hospitals having different kinds of CT scanners. Secondly, we use Capsule Network-based segmentation and classification to detect COVID-19 patients. Thirdly, we design a method that can collaboratively train a global model using blockchain technology with federated learning while preserving privacy. Additionally, we collected real-life COVID-19 patient data, which is open to the research community. The proposed framework can utilize up-to-date data, which improves the recognition of computed tomography (CT) images. Finally, our results demonstrate better performance in detecting COVID-19 patients.
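To illustrate how a global model can be trained from several hospitals without pooling their raw data, the sketch below shows plain federated averaging, where each client's parameters are weighted by its local sample count. This is an assumption made for illustration: the abstract does not state the aggregation rule, and the blockchain and capsule network components are omitted entirely. The function name and the numbers are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Average flat parameter lists, weighting each client by its sample count."""
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("client_sizes must sum to a positive number")
    n_params = len(client_weights[0])
    averaged = [0.0] * n_params
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != n_params:
            raise ValueError("all clients must share the same parameter shape")
        for i, w in enumerate(weights):
            averaged[i] += w * size / total
    return averaged


# Two hypothetical hospitals: only parameters and counts are shared,
# never the underlying CT scans.
hospital_a = [1.0, 2.0]   # parameters trained on 100 local scans
hospital_b = [3.0, 4.0]   # parameters trained on 300 local scans
global_model = federated_average([hospital_a, hospital_b], [100, 300])
print(global_model)
```

Weighting by sample count lets hospitals with more data contribute proportionally more to the global model; other weighting schemes are possible.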

IR Feb 26, 2020
Fake Review Detection Using Behavioral and Contextual Features

Jay Kumar

User reviews carry significant value for products in the e-market world. Many firms or product providers hire spammers to mislead new customers by posting spam reviews. There are three types of fake reviews: untruthful reviews, brand reviews, and non-reviews. All three types mislead new customers. The multinational company "Yelp" has been separating fake reviews from non-fake reviews for the last decade. However, many e-commerce sites do not filter fake reviews from non-fake ones. Researchers have focused on automatic fake review detection for the last ten years. Many approaches and feature sets have been proposed to improve classification models for fake review detection. Two types of datasets are commonly used in this research area: pseudo-fake and real-life reviews. The literature reports lower performance of classification models on real-life datasets than on pseudo-fake reviews. After investigation, behavioral and contextual features have proved important for fake review detection. Our research exploits an important behavioral feature of reviewers named "reviewer deviation". Our study investigates reviewer deviation together with other contextual and behavioral features. We empirically demonstrate the importance of the selected feature set for a classification model to identify fake reviews. We ranked the features in the selected feature set, where reviewer deviation achieved the ninth rank. To assess the viability of the selected feature set, we scaled the dataset and concluded that scaling the dataset can improve recall as well as accuracy. Our selected feature set contains a contextual feature that captures text similarity between the reviews of a reviewer. We experimented with the NNC, LTC, and BM25 term weighting schemes for calculating the text similarity of reviews. We report that BM25 outperformed the other term weighting schemes.
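BM25 term weighting, which the abstract reports as the best-performing scheme, can be sketched as follows. This is a minimal Okapi BM25 scorer with commonly used defaults (k1 = 1.5, b = 0.75) and a smoothed IDF. The abstract does not give the exact variant or parameters, so these choices, the function name, and the toy reviews are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, documents, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(documents)
    if n_docs == 0:
        return []
    # Fall back to 1.0 so empty collections do not divide by zero.
    avg_len = sum(len(doc) for doc in documents) / n_docs or 1.0
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count documents containing each term
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in set(query_tokens):
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Hypothetical reviews by one reviewer, already tokenized.
reviews = [
    ["great", "phone", "great", "battery"],
    ["great", "phone", "great", "battery", "life"],
    ["terrible", "service"],
]
# Use the first review as the query to see how similar the others are.
print(bm25_scores(reviews[0], reviews))
```

A reviewer whose reviews score highly against one another reuses much of the same wording, which is the kind of signal a text-similarity feature is meant to capture.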