IT NA IT NA PR ST TH
Jan 22, 2018

Differential Message Importance Measure: A New Approach to the Required Sampling Number in Big Data Structure Characterization

arXiv:1801.07083, 8 citations, h-index: 49

Originality: Synthesis-oriented

AI Analysis

For researchers in big data and statistics, this offers a distribution-free criterion for sample size determination. The contribution is incremental, since it builds on existing concepts such as differential entropy and the Kolmogorov-Smirnov (KS) statistic.

The paper proposes a new measure, the differential message importance measure (DMIM), to determine how many samples are required to characterize big data structures without assuming a distribution. It shows that the deviation of DMIM is equivalent to the Kolmogorov-Smirnov statistic and provides high-precision approximate DMIM values for the normal distribution.

Data collection is a fundamental problem in big data, where the size of the sampling set plays a very important role, especially in characterizing data structure. This paper considers the information collection process by taking message importance into account, and gives a distribution-free criterion to determine how many samples are required in big data structure characterization. Similar to differential entropy, we define the differential message importance measure (DMIM) as a measure of message importance for a continuous random variable. The DMIM for many common densities is discussed, and high-precision approximate values for the normal distribution are given. Moreover, it is proved that the change of DMIM can describe the gap between the distribution of a set of sample values and a theoretical distribution. In fact, the deviation of DMIM is equivalent to the Kolmogorov-Smirnov statistic, but it offers a new way to characterize distribution goodness-of-fit. Numerical results show some basic properties of DMIM and the accuracy of the proposed approximate values. Furthermore, the empirical distribution approaches the real distribution as the DMIM deviation decreases, which helps in selecting suitable sampling points in an actual system.
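The abstract ties the DMIM deviation to the Kolmogorov-Smirnov statistic, so the following minimal sketch illustrates only that standard statistic, not the paper's DMIM computation, whose formula is not given in this summary. It computes the KS distance between an empirical distribution and a theoretical normal CDF. As a separately swapped-in technique, it also shows the classical Dvoretzky-Kiefer-Wolfowitz (DKW) bound as one well-known distribution-free way to pick a sample size. The function names `normal_cdf`, `ks_statistic`, and `dkw_sample_size` are illustrative and do not come from the paper.

```python
import math
import random


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic(samples, cdf):
    """Kolmogorov-Smirnov distance sup_x |F_n(x) - F(x)|.

    The empirical CDF F_n jumps at each sorted sample, so the supremum
    is checked just before and just after every jump.
    """
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def dkw_sample_size(eps, alpha):
    """Smallest n with 2 * exp(-2 * n * eps**2) <= alpha (DKW inequality).

    This is a classical distribution-free bound, used here as a stand-in;
    it is NOT the DMIM-based criterion proposed in the paper.
    """
    return math.ceil(math.log(2.0 / alpha) / (2.0 * eps ** 2))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    for n in (10, 100, 1000, 10000):
        draws = [rng.gauss(0.0, 1.0) for _ in range(n)]
        print(n, ks_statistic(draws, normal_cdf))
    print("DKW sample size for eps=0.05, alpha=0.05:",
          dkw_sample_size(0.05, 0.05))
```

Running the script prints the KS distance for growing sample sizes; it typically shrinks as the sample grows, which matches the abstract's remark that the empirical distribution approaches the real one as the deviation decreases.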
