On the Statistical Complexity of Sample Amplification
This work addresses a foundational statistical problem with potential applications in data augmentation and privacy, but it appears incremental as it builds on existing notions without claiming broad SOTA breakthroughs.
The paper tackles the sample amplification problem, which asks when it is possible to generate additional indistinguishable samples from an unknown distribution given a smaller set, and provides generally applicable procedures and lower bounds, establishing a rigorous connection to distribution learning for a large class of distributions like the exponential family.
The ``sample amplification'' problem formalizes the following question: Given $n$ i.i.d. samples drawn from an unknown distribution $P$, when is it possible to produce a larger set of $n+m$ samples which cannot be distinguished from $n+m$ i.i.d. samples drawn from $P$? In this work, we provide a firm statistical foundation for this problem by deriving generally applicable amplification procedures, lower bound techniques and connections to existing statistical notions. Our techniques apply to a large class of distributions including the exponential family, and establish a rigorous connection between sample amplification and distribution learning.