Bhavesh Neekhra

h-index: 2
2 papers

2 Papers

LG · Apr 24, 2023
SynthPop++: A Hybrid Framework for Generating A Country-scale Synthetic Population

Bhavesh Neekhra, Kshitij Kapoor, Debayan Gupta

Population censuses are vital to public policy decision-making. They provide insight into human resources, demography, culture, and economic structure at local, regional, and national levels. However, such surveys are very expensive (especially for low- and middle-income countries with high populations, such as India) and time-consuming, and they may raise privacy concerns, depending on the kinds of data collected. In light of these issues, we introduce SynthPop++, a novel hybrid framework that can combine data from multiple real-world surveys (with different, partially overlapping sets of attributes) to produce a real-scale synthetic population of humans. Critically, our population maintains family structures comprising individuals with demographic, socioeconomic, health, and geolocation attributes: our "fake" people live in realistic locations, have realistic families, etc. Such data can be used for a variety of purposes; we explore one such use case, agent-based modelling of infectious disease in India. To gauge the quality of our synthetic population, we use both machine learning and statistical metrics. Our experimental results show that the synthetic population can realistically simulate the population of various administrative units of India, producing real-scale, detailed data at the desired level of zoom, from cities to districts to states, eventually combining to form a country-scale synthetic population.

LG · Aug 5, 2025
On the (In)Significance of Feature Selection in High-Dimensional Datasets

Bhavesh Neekhra, Debayan Gupta, Partha Pratim Chakrabarti

Feature selection (FS) is assumed to improve predictive performance and identify meaningful features in high-dimensional datasets. Surprisingly, small random subsets of features (0.02-1%) match or outperform the predictive performance of both full feature sets and FS on 28 of 30 diverse datasets (microarray, bulk and single-cell RNA-Seq, mass spectrometry, imaging, etc.). In short, any arbitrary set of features is as good as any other (with surprisingly low variance in results), so how can a particular set of selected features be "important" if it performs no better than an arbitrary set? These results challenge the assumption that computationally selected features reliably capture meaningful signals, emphasizing the importance of rigorous validation before interpreting selected features as actionable, particularly in computational genomics.
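
The comparison described in this abstract (random feature subsets versus the full feature set) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy data generator, the nearest-centroid classifier, the half-and-half train/test split, and the names `make_dataset`, `nearest_centroid_accuracy`, and `random_subset_baseline` are all hypothetical. The toy data is not expected to reproduce the reported results; the sketch only shows the shape of the protocol.

```python
import random


def make_dataset(n_samples=200, n_features=500, n_informative=50, seed=0):
    """Toy two-class data (hypothetical): the first n_informative features
    have a shifted mean for class 1; the rest are pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # alternate labels so any contiguous split is balanced
        row = [rng.gauss(0.5 * label if j < n_informative else 0.0, 1.0)
               for j in range(n_features)]
        X.append(row)
        y.append(label)
    return X, y


def nearest_centroid_accuracy(X_train, y_train, X_test, y_test, cols):
    """Fit one centroid per class on the chosen columns, then score the
    held-out rows by assigning each to its closest centroid."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [sum(r[c] for r in rows) / len(rows) for c in cols]
    correct = 0
    for x, t in zip(X_test, y_test):
        pred = min(
            centroids,
            key=lambda lab: sum((x[c] - m) ** 2
                                for c, m in zip(cols, centroids[lab])),
        )
        correct += pred == t
    return correct / len(y_test)


def random_subset_baseline(X, y, fraction=0.01, n_repeats=20, seed=0):
    """Return (accuracy with all features, accuracies of random subsets)."""
    rng = random.Random(seed)
    n_features = len(X[0])
    k = max(1, round(fraction * n_features))  # subset size, at least 1
    split = len(X) // 2  # first half trains, second half tests
    X_tr, y_tr, X_te, y_te = X[:split], y[:split], X[split:], y[split:]
    full_acc = nearest_centroid_accuracy(
        X_tr, y_tr, X_te, y_te, list(range(n_features)))
    subset_accs = []
    for _ in range(n_repeats):
        cols = rng.sample(range(n_features), k)  # uniform, no replacement
        subset_accs.append(
            nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te, cols))
    return full_acc, subset_accs


if __name__ == "__main__":
    X, y = make_dataset()
    full_acc, subset_accs = random_subset_baseline(X, y, fraction=0.01)
    mean_acc = sum(subset_accs) / len(subset_accs)
    print(f"full feature set: {full_acc:.3f}")
    print(f"random 1% subsets: mean {mean_acc:.3f}, "
          f"min {min(subset_accs):.3f}, max {max(subset_accs):.3f}")
```

A fuller comparison would add a third arm that runs a feature-selection method on the training split only, so that all three feature sets are scored on the same held-out rows.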