Provable Confidence Intervals on the Expected Value of a Distribution through Sampling


In two lines: Estimating the expected value of a distribution through sampling is fundamental to econometrics and has broad applications. Although the empirically estimated mean is an unbiased estimator of the expected value (that is, it is accurate on average), understanding its precision is crucial. This post explores techniques for proving high precision, including:

  • Tractable estimates of the mean through Chernoff bounds
  • Tighter confidence intervals through the DKW inequality for light-tailed distributions
  • Fancier (potentially tighter) concentration bounds by modifying the DKW

____________________________________________________________________________________________

Consider estimating the expected value of a distribution empirically through finite sampling: we want to give a confidence interval [\underline \mu, \overline \mu] such that the mean \mu of the distribution lies inside this interval with high probability. Given n samples s_1, s_2, \dots, s_n drawn i.i.d. from a distribution F, the average \tfrac{s_1 + \dots + s_n}{n} of the n samples is an unbiased estimator of the expected value \mu of F, i.e., \mathbb{E}_{s \sim F}[s] = \mathbb{E}_{s_1, \dots, s_n \sim F}[\tfrac{s_1 + \dots + s_n}{n}].

For a distribution supported on [a, b], the above empirical estimate also concentrates quite well around the expected value. Concretely, a standard Chernoff bound gives

Pr_{s_1, \dots, s_n \sim F}(|\frac{s_1 + \dots + s_n}{n} - \mu| \geq \epsilon) < 2e^{-\tfrac{2n\epsilon^2}{(b-a)^2}}.
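
To make the bound concrete, here is a minimal Python sketch (the helper name is ours) that inverts it: solving 2 \, e^{-2n\epsilon^2/(b-a)^2} = \eta for \epsilon gives the half-width of a confidence interval that fails with probability at most \eta.

    import math

    def chernoff_interval(samples, a, b, eta=0.05):
        """Two-sided Chernoff confidence interval for the mean of a
        distribution supported on [a, b]: solve
        2 * exp(-2 * n * eps**2 / (b - a)**2) = eta for eps."""
        n = len(samples)
        mean = sum(samples) / n
        eps = (b - a) * math.sqrt(math.log(2 / eta) / (2 * n))
        return mean - eps, mean + eps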

However, the above concentration guarantee depends heavily on the extremes of the support of the distribution F. For instance, if we knew that the probability of a sample being drawn outside [\frac{a+b}{2} - \delta, \frac{a+b}{2} + \delta] is an extremely small constant \eta, Bernstein’s inequality gives a much stronger bound. The variance \sigma^2 of the distribution can be bounded above by \eta \, (\tfrac{b-a}{2})^2 + (1-\eta) \, \delta^2, and Bernstein’s inequality yields

Pr_{s_1, \dots, s_n \sim F} (|\frac{s_1 + \dots + s_n}{n} - \mu| \geq \epsilon) < 2e^{- \tfrac{1}{2} \cdot \tfrac{n \, \epsilon^2}{\sigma^2 + \frac{1}{3} \, (b-a) \, \epsilon}}.
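
To see the gap numerically, here is a small sketch (the helper names and the instantiated numbers are our own illustration) evaluating both tail bounds.

    import math

    def chernoff_tail(n, eps, a, b):
        """Chernoff bound on Pr(|empirical mean - mu| >= eps) over [a, b]."""
        return 2 * math.exp(-2 * n * eps**2 / (b - a)**2)

    def bernstein_tail(n, eps, sigma2, a, b):
        """Bernstein bound for a distribution on [a, b] with variance <= sigma2."""
        return 2 * math.exp(-0.5 * n * eps**2 / (sigma2 + (b - a) * eps / 3))

    # Illustration: [a, b] = [0, 1], delta = 0.05, eta = 1e-3, so that
    # sigma2 <= eta * 0.25 + (1 - eta) * 0.05**2 < 2.8e-3.  With n = 1000
    # and eps = 0.05: chernoff_tail is ~1.3e-2 while bernstein_tail < 1e-27.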

However, there might be applications where we suspect that the distribution is heavily concentrated around the mean, but with no provable guarantees on the variance \sigma^2. Bernstein’s inequality can no longer be applied, but we would like to bootstrap the concentration bound with our beliefs to potentially get stronger bounds than Chernoff’s inequality.

Attempt 1: The DKW Inequality

Along with the expected value, it turns out, the entire cumulative distribution function (CDF) concentrates quite well when sufficiently many samples are drawn. For samples sorted in decreasing order s_1 \geq s_2 \geq \dots \geq s_n, let \hat{F} be the empirical distribution with CDF \hat F(x) = \frac{\text{\#samples less than } x}{n}. Then, by the DKW inequality,

Pr_{s_1, \dots, s_n \sim F}(\sup_{x \in \mathbb{R}} \, (\hat F(x) - F(x)) > \epsilon) \leq e^{-2n\epsilon^2}.

In other words, except for a very small probability, the CDF F and the empirical CDF \hat F differ only by a small constant \epsilon on all points x on the support of F. We will use the concentration of the entire distribution to get a tighter concentration guarantee around the estimated mean.
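
For concreteness, here is a small sketch (helper names ours) of the empirical CDF from the definition above, together with the \epsilon obtained by inverting the DKW tail probability e^{-2n\epsilon^2} = \eta.

    import math
    from bisect import bisect_left

    def dkw_epsilon(n, eta):
        """Solve exp(-2 * n * eps**2) = eta for eps (one-sided DKW)."""
        return math.sqrt(math.log(1 / eta) / (2 * n))

    def empirical_cdf(samples):
        """F_hat(x) = (#samples less than x) / n, as defined in the text."""
        s, n = sorted(samples), len(samples)
        return lambda x: bisect_left(s, x) / n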

Consider inflating the empirical distribution \hat F by throwing away the smallest n\epsilon samples s_{n(1-\epsilon) + 1}, \dots, s_n and replacing them with the supremum b of the distribution F. The inflated distribution \overline F has support [s_{n(1-\epsilon)}, b] and CDF \overline F(x) = \hat F(x) - \epsilon for all x in the support of \overline F (we have essentially moved an \epsilon mass below s_{n(1-\epsilon)} all the way up to b). Combined with the DKW inequality, except for a small probability, \overline F(x) \leq F(x) for all x, and thus \overline F has a higher expected value than F.

Similarly, one can construct a deflated distribution \underline F by dropping the largest n \epsilon samples from \hat F and replacing them with the infimum a of the distribution F. By a symmetric version of the DKW inequality, we can conclude that the expected value of the distribution \underline F is smaller than that of F with high probability.

Theorem 1: Except for a probability 2e^{-2n\epsilon^2}, \mathbb{E}_{s \sim \underline F}[s] \leq \mathbb{E}_{s \sim F}[s] \leq \mathbb{E}_{s \sim \overline F}[s] where \underline F and \overline F are constructed by deflating and inflating an empirical distribution constructed from F with n samples.
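
As a sketch of how Theorem 1 can be used in practice (the function name and the choice to split the failure probability as 2 \, e^{-2n\epsilon^2} = \eta are ours), assuming the support [a, b] is known:

    import math

    def theorem1_interval(samples, a, b, eta=0.05):
        """Confidence interval from Theorem 1, failing with probability
        at most 2 * exp(-2 * n * eps**2) = eta."""
        n = len(samples)
        eps = math.sqrt(math.log(2 / eta) / (2 * n))
        k = math.ceil(n * eps)          # number of samples to replace
        s = sorted(samples)             # s[0] smallest, ..., s[-1] largest
        # Inflate: replace the k smallest samples with the supremum b.
        upper = (sum(s[k:]) + k * b) / n
        # Deflate: replace the k largest samples with the infimum a.
        lower = (sum(s[:n - k]) + k * a) / n
        return lower, upper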

Theorem 1 gives a tighter confidence interval than the one obtained from the Chernoff bound. Setting \epsilon = (b-a) \, \delta in the Chernoff bound, we get

Pr(\mathbb{E}_{s \sim \hat F}[s] - (b-a) \, \delta \leq \mu \leq \mathbb{E}_{s \sim \hat F}[s] + (b-a) \, \delta) > 1 - 2e^{-2n \delta^2},

where \mu is the expected value of the distribution F. For the same probability, Theorem 1 guarantees that \mathbb{E}_{s \sim \underline F}[s] \leq \mathbb{E}_{s \sim F}[s] \leq \mathbb{E}_{s \sim \overline F}[s]. We will observe that \mathbb{E}_{s \sim \overline F}[s] \leq \mathbb{E}_{s \sim \hat F}[s] + (b-a) \, \delta. Remember that \overline F was constructed from \hat F in two steps:

  1. Add n \delta copies of the supremum b. This increases the expected value by b \delta.
  2. Delete the smallest n \delta samples. Since every sample is at least the infimum a, this decreases the expected value by at least a \delta (see the display below).
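
Since every deleted sample is at least a, the two steps combine to give

\mathbb{E}_{s \sim \overline F}[s] = \mathbb{E}_{s \sim \hat F}[s] + b \, \delta - \tfrac{1}{n} \sum_{i > n(1-\delta)} s_i \leq \mathbb{E}_{s \sim \hat F}[s] + (b-a) \, \delta,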

which proves the observation. A similar argument shows that \mathbb{E}_{s \sim \hat F}[s] - (b-a) \, \delta \leq \mathbb{E}_{s \sim \underline F}[s]. If the smallest few samples thrown away in step 2 are much larger than the infimum a (which is the case when we suspect that the lower tail of the distribution is extremely light and little probability mass sits near a), then Theorem 1 performs much better than the Chernoff bound, achieving the goal that we set out to reach.

However, we are trading simplicity for performance. The confidence interval given by the Chernoff bound is extremely elegant (compute the empirical average and take the interval extending \epsilon on either side) and is much more useful if the concentration bound is to be fit inside a larger theoretical proof looking for closed-form expressions. Theorem 1, on the other hand, is useful if estimating the confidence interval is the ultimate goal.

Attempt 2: Bootstrapping the DKW

The above approach already presents non-trivial improvements over the guarantees given by the Chernoff bound. However, the inflation and deflation procedures are themselves quite wasteful and can be improved further.

Consider the inflation procedure. We throw away the smallest few samples and replace them with the supremum b of the distribution F. What if we had a better idea of the distribution of the “larger values” (the stronger percentiles) of the distribution F? Instead of replacing all the thrown-away samples with b, we can replace at least some of them with samples from the distribution of the larger values. Fortunately for us, we have drawn n samples from F, and some of these are going to have a strong percentile! We will duplicate our extra-large samples to replace the smallest few samples that we throw away.

Let s_0 = b. Consider the largest few samples s_0, s_1, \dots, s_{\epsilon / \omega} for some \omega \ll \epsilon. Insert \approx n \, \omega copies of each of these \frac{\epsilon}{\omega} + 1 samples, so that a total of n \epsilon large samples are injected to replace the n \epsilon small samples that have been thrown away. Note that we are risking that the largest samples we ended up drawing do not really belong to the stronger percentiles of F. However, it can be proved that such an event occurs with an extremely small probability; its magnitude can be made smaller than the probability (e^{-2n\epsilon^2}, given by DKW) that the empirical distribution \hat F and the population distribution F are not within \epsilon of each other. We will call the above variant of inflating the empirical distribution stratified inflation, and we can define an analogous stratified deflation procedure. We summarize the above discussion in the theorem below; readers can find the formal proof of Theorem 2 in Appendix F.3 of the paper cited below.

Theorem 2 (Ferreira et al., 2024): Except for a probability 2 \, (e^{-2n\epsilon^2} + \frac{\epsilon}{\omega} \cdot e^{-n \omega}), the empirical distributions \overline F and \underline F achieved through stratified inflation and deflation respectively satisfy \mathbb{E}_{s \sim \underline F}[s] \leq \mathbb{E}_{s \sim F}[s] \leq \mathbb{E}_{s \sim \overline F}[s].
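
To make the procedure concrete, here is a minimal sketch of stratified inflation (stratified deflation is symmetric); the function name and the handling of leftover copies after rounding are our own choices, not the paper's.

    import math

    def stratified_inflate_mean(samples, b, eps, omega):
        """Mean of the stratified-inflated empirical distribution: drop
        the ~n*eps smallest samples and inject ~n*omega copies of each of
        s_0 = b and the eps/omega largest samples drawn."""
        n = len(samples)
        s = sorted(samples, reverse=True)    # s[0] is the largest sample
        k = math.ceil(n * eps)               # number of samples replaced
        strata = max(1, round(eps / omega))  # number of large samples reused
        donors = [b] + s[:strata]            # s_0 = b plus the strata largest
        copies = k // len(donors)            # roughly n*omega copies of each
        injected = donors * copies + donors[:k % len(donors)]
        return (sum(s[:n - k]) + sum(injected)) / n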

The stratified inflation and deflation procedures satisfy a much stronger property than described in Theorem 2. Similar to vanilla inflation, the probability that a sample s drawn from the distribution \overline F constructed through stratified inflation is larger than some value x is at least 1-F(x), the probability that a sample drawn from F is larger than x. Stratified inflation and deflation have even less dependence on the supremum and infimum of F and give tighter confidence intervals than (vanilla) inflation and deflation for distributions with extremely light tails. Except for the additional 2 \, \frac{\epsilon}{\omega} \, e^{-n \omega} probability of failure, stratified inflation and deflation give better empirical bounds on the expected value than both (vanilla) inflation and deflation and the tractable bounds given by the Chernoff bound.

References

Matheus V. X. Ferreira, Aadityan Ganesh, Jack Hourigan, Hannah Huh, S. Matthew Weinberg, and Catherine Yu. 2024. Computing Optimal Manipulations in Cryptographic Self-Selection Proof-of-Stake Protocols. In The 25th ACM Conference on Economics and Computation (EC ’24), July 8–11, 2024, New Haven, CT, USA. ACM, New York, NY, USA, 43 pages. https://doi.org/10.1145/3670865.3673602

aadityanganesh (at) princeton (dot) edu