Prashant Gupta

LG
6papers
1,100citations
Novelty47%
AI Score37

6 Papers

LGAug 23, 2025
Learning ON Large Datasets Using Bit-String Trees

Prashant Gupta

This thesis develops computational methods in similarity-preserving hashing, classification, and cancer genomics. Standard space partitioning-based hashing relies on Binary Search Trees (BSTs), but their exponential growth and sparsity hinder efficiency. To overcome this, we introduce Compressed BST of Inverted hash tables (ComBI), which enables fast approximate nearest-neighbor search with reduced memory. On datasets of up to one billion samples, ComBI achieves 0.90 precision with 4X-296X speed-ups over Multi-Index Hashing, and also outperforms Cellfishing.jl on single-cell RNA-seq searches with 2X-13X gains. Building on hashing structures, we propose Guided Random Forest (GRAF), a tree-based ensemble classifier that integrates global and local partitioning, bridging decision trees and boosting while reducing generalization error. Across 115 datasets, GRAF delivers competitive or superior accuracy, and its unsupervised variant (uGRAF) supports guided hashing and importance sampling. We show that GRAF and ComBI can be used to estimate per-sample classifiability, which enables scalable prediction of cancer patient survival. To address challenges in interpreting mutations, we introduce Continuous Representation of Codon Switches (CRCS), a deep learning framework that embeds genetic changes into numerical vectors. CRCS allows identification of somatic mutations without matched normals, discovery of driver genes, and scoring of tumor mutations, with survival prediction validated in bladder, liver, and brain cancers. Together, these methods provide efficient, scalable, and interpretable tools for large-scale data analysis and biomedical applications.

LGNov 7, 2020
Enhash: A Fast Streaming Algorithm For Concept Drift Detection

Aashi Jindal, Prashant Gupta, Debarka Sengupta et al.

We propose Enhash, a fast ensemble learner that detects \textit{concept drift} in a data stream. A stream may consist of abrupt, gradual, virtual, or recurring events, or a mixture of various types of drift. Enhash employs projection hash to insert an incoming sample. We show empirically that the proposed method has competitive performance to existing ensemble learners in much lesser time. Also, Enhash has moderate resource requirements. Experiments relevant to performance comparison were performed on 6 artificial and 4 real data sets consisting of various types of drifts.

LGMay 14, 2020
A Weighted Mutual k-Nearest Neighbour for Classification Mining

Joydip Dhar, Ashaya Shukla, Mukul Kumar et al.

kNN is a very effective Instance based learning method, and it is easy to implement. Due to heterogeneous nature of data, noises from different possible sources are also widespread in nature especially in case of large-scale databases. For noise elimination and effect of pseudo neighbours, in this paper, we propose a new learning algorithm which performs the task of anomaly detection and removal of pseudo neighbours from the dataset so as to provide comparative better results. This algorithm also tries to minimize effect of those neighbours which are distant. A concept of certainty measure is also introduced for experimental results. The advantage of using concept of mutual neighbours and distance-weighted voting is that, dataset will be refined after removal of anomaly and weightage concept compels to take into account more consideration of those neighbours, which are closer. Consequently, finally the performance of proposed algorithm is calculated.

LGSep 2, 2019
Guided Random Forest and its application to data approximation

Prashant Gupta, Aashi Jindal, Jayadeva et al.

We present a new way of constructing an ensemble classifier, named the Guided Random Forest (GRAF) in the sequel. GRAF extends the idea of building oblique decision trees with localized partitioning to obtain a global partitioning. We show that global partitioning bridges the gap between decision trees and boosting algorithms. We empirically demonstrate that global partitioning reduces the generalization error bound. Results on 115 benchmark datasets show that GRAF yields comparable or better results on a majority of datasets. We also present a new way of approximating the datasets in the framework of random forests.

CGAug 19, 2019
Continuous Toolpath Planning in Additive Manufacturing

Prashant Gupta, Bala Krishnamoorthy, Gregory Dreifus

We develop a framework that creates a new polygonal mesh representation of the sparse infill domain of a layer-by-layer 3D printing job. We guarantee the existence of a single, continuous tool path covering each connected piece of the domain in every layer. We present a tool path algorithm that traverses each such continuous tool path with no crossovers. The key construction at the heart of our framework is an Euler transformation which converts a 2-dimensional cell complex K into a new 2-complex K^ such that every vertex in the 1-skeleton G^ of K^ has even degree. Hence G^ is Eulerian, and a Eulerian tour can be followed to print all edges in a continuous fashion. We start with a mesh K of the union of polygons obtained by projecting all layers to the plane. We compute its Euler transformation K^. In the slicing step, we clip K^ at each layer using its polygon to obtain a complex that may not necessarily be Euler. We then patch this complex by adding edges such that any odd-degree nodes created by slicing are transformed to have even degrees again. We print extra support edges in place of any segments left out to ensure there are no edges without support in the next layer. These support edges maintain the Euler nature of the complex. Finally we describe a tree-based search algorithm that builds the continuous tool path by traversing "concentric" cycles in the Euler complex. Our algorithm produces a tool path that avoids material collisions and crossovers, and can be printed in a continuous fashion irrespective of complex geometry or topology of the domain (e.g., holes). We implement our test our framework on several 3D objects. Apart from standard geometric shapes, we demonstrate the framework on the Stanford bunny.

IRJul 1, 2019
Pentagon at MEDIQA 2019: Multi-task Learning for Filtering and Re-ranking Answers using Language Inference and Question Entailment

Hemant Pugaliya, Karan Saxena, Shefali Garg et al.

Parallel deep learning architectures like fine-tuned BERT and MT-DNN, have quickly become the state of the art, bypassing previous deep and shallow learning methods by a large margin. More recently, pre-trained models from large related datasets have been able to perform well on many downstream tasks by just fine-tuning on domain-specific datasets . However, using powerful models on non-trivial tasks, such as ranking and large document classification, still remains a challenge due to input size limitations of parallel architecture and extremely small datasets (insufficient for fine-tuning). In this work, we introduce an end-to-end system, trained in a multi-task setting, to filter and re-rank answers in the medical domain. We use task-specific pre-trained models as deep feature extractors. Our model achieves the highest Spearman's Rho and Mean Reciprocal Rank of 0.338 and 0.9622 respectively, on the ACL-BioNLP workshop MediQA Question Answering shared-task.