Chernoff bound

In probability theory, a Chernoff bound is an exponentially decreasing upper bound on the tail of a random variable based on its moment generating function. The minimum of all such exponential bounds forms the Chernoff or Chernoff–Cramér bound, which may decay faster than exponential (e.g. sub-Gaussian).[1][2] It is especially useful for sums of independent random variables, such as sums of Bernoulli random variables.[3][4]

The bound is commonly named after Herman Chernoff, who described the method in a 1952 paper,[5] though Chernoff himself attributed it to Herman Rubin.[6] In 1938, Harald Cramér had published an almost identical concept, now known as Cramér's theorem.

It is a sharper bound than the first- or second-moment-based tail bounds such as Markov's inequality or Chebyshev's inequality, which only yield power-law bounds on tail decay. However, when applied to sums the Chernoff bound requires the random variables to be independent, a condition that is not required by either Markov's inequality or Chebyshev's inequality.

The Chernoff bound is related to the Bernstein inequalities. It is also used to prove Hoeffding's inequality, Bennett's inequality, and McDiarmid's inequality.

Generic Chernoff bounds

[Figure: Two-sided Chernoff bound for a chi-square random variable]

The generic Chernoff bound for a random variable <math>X</math> is attained by applying Markov's inequality to <math>e^{tX}</math> (which is why it is sometimes called the exponential Markov or exponential moments bound). For positive <math>t</math> this gives a bound on the right tail of <math>X</math> in terms of its moment-generating function <math>M(t) = \operatorname{E}(e^{tX})</math>:

:<math>\Pr(X \ge a) = \Pr(e^{tX} \ge e^{ta}) \le M(t) e^{-ta} \qquad (t > 0)</math>

Since this bound holds for every positive <math>t</math>, we may take the infimum:

:<math>\Pr(X \ge a) \le \inf_{t > 0} M(t) e^{-ta}</math>

Performing the same analysis with negative <math>t</math> we get a similar bound on the left tail:

:<math>\Pr(X \le a) = \Pr(e^{tX} \ge e^{ta}) \le M(t) e^{-ta} \qquad (t < 0)</math>

and

:<math>\Pr(X \le a) \le \inf_{t < 0} M(t) e^{-ta}</math>

The quantity <math>M(t) e^{-ta}</math> can be expressed as the expected value <math>\operatorname{E}(e^{tX}) e^{-ta}</math>, or equivalently <math>\operatorname{E}(e^{t(X-a)})</math>.
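
As a numerical illustration, the infimum can be evaluated directly. The following is a minimal sketch (our own example, not from the article), assuming SciPy and the standard normal MGF <math>M(t) = e^{t^2/2}</math>; the helper name chernoff_bound is ours. It recovers the bound <math>e^{-a^2/2}</math> and compares it with the exact tail.

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def chernoff_bound(mgf, a, t_max=50.0):
    """Numerically minimize M(t) * exp(-t*a) over t > 0."""
    result = minimize_scalar(
        lambda t: mgf(t) * np.exp(-t * a),
        bounds=(1e-9, t_max),
        method="bounded",
    )
    return result.fun

mgf_std_normal = lambda t: np.exp(t**2 / 2)  # M(t) for N(0, 1)

for a in [1.0, 2.0, 3.0]:
    bound = chernoff_bound(mgf_std_normal, a)
    exact = norm.sf(a)  # exact P(X >= a)
    print(f"a={a}: Chernoff {bound:.5f} (= exp(-a^2/2)), exact tail {exact:.5f}")
</syntaxhighlight>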

Properties

The exponential function is convex, so by Jensen's inequality <math>\operatorname{E}(e^{tX}) \ge e^{t\operatorname{E}(X)}</math>. It follows that the bound on the right tail is greater than or equal to one when <math>a \le \operatorname{E}(X)</math>, and therefore trivial; similarly, the left bound is trivial for <math>a \ge \operatorname{E}(X)</math>. We may therefore combine the two infima and define the two-sided Chernoff bound:

:<math>C(a) = \inf_t M(t) e^{-ta}</math>

which provides an upper bound on the folded cumulative distribution function of <math>X</math> (folded at the mean, not the median).

The logarithm of the two-sided Chernoff bound is known as the rate function (or Cramér transform) <math>I = -\log C</math>. It is equivalent to the Legendre–Fenchel transform or convex conjugate of the cumulant generating function <math>K = \log M</math>, defined as:

:<math>I(a) = \sup_t \left(at - K(t)\right)</math>

The moment generating function is log-convex, so by a property of the convex conjugate, the Chernoff bound must be log-concave. The Chernoff bound attains its maximum at the mean, <math>C(\operatorname{E}(X)) = 1</math>, and is invariant under translation: <math>C_{X+k}(a) = C_X(a - k)</math>.
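
For example, for a normal variable <math>X \sim N(0, \sigma^2)</math> one has <math>K(t) = \tfrac{1}{2}\sigma^2 t^2</math>, so <math>I(a) = \sup_t \left(at - \tfrac{1}{2}\sigma^2 t^2\right) = \tfrac{a^2}{2\sigma^2}</math>, with the supremum attained at <math>t = a/\sigma^2</math>; exponentiating gives <math>C(a) = e^{-a^2/(2\sigma^2)}</math>, the first row of the table below.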

The Chernoff bound is exact if and only if <math>X</math> is a single concentrated mass (degenerate distribution). The bound is tight only at or beyond the extremes of a bounded random variable, where the infima are attained for infinite <math>t</math>. For unbounded random variables the bound is nowhere tight, though it is asymptotically tight up to sub-exponential factors ("exponentially tight"). Individual moments can provide tighter bounds, at the cost of greater analytical complexity.[7]

In practice, the exact Chernoff bound may be unwieldy or difficult to evaluate analytically, in which case a suitable upper bound on the moment (or cumulant) generating function may be used instead (e.g. a sub-parabolic CGF giving a sub-Gaussian Chernoff bound).

Exact rate functions and Chernoff bounds for common distributions
{| class="wikitable"
! Distribution !! <math>\operatorname{E}(X)</math> !! <math>K(t)</math> !! <math>I(a)</math> !! <math>C(a)</math>
|-
| Normal distribution
| <math>0</math>
| <math>\tfrac{1}{2}\sigma^2 t^2</math>
| <math>\tfrac{1}{2}\left(\tfrac{a}{\sigma}\right)^2</math>
| <math>\exp\left(-\tfrac{a^2}{2\sigma^2}\right)</math>
|-
| Bernoulli distribution (detailed below)
| <math>p</math>
| <math>\ln\left(1 - p + pe^t\right)</math>
| <math>D_{KL}(a \parallel p)</math>
| <math>\left(\tfrac{p}{a}\right)^a \left(\tfrac{1-p}{1-a}\right)^{1-a}</math>
|-
| Standard Bernoulli (<math>H</math> is the binary entropy function)
| <math>\tfrac{1}{2}</math>
| <math>\ln\left(1 + e^t\right) - \ln 2</math>
| <math>\ln 2 - H(a)</math>
| <math>\tfrac{1}{2}\, a^{-a} (1-a)^{-(1-a)}</math>
|-
| Rademacher distribution
| <math>0</math>
| <math>\ln \cosh t</math>
| <math>\ln 2 - H\left(\tfrac{1+a}{2}\right)</math>
| <math>\left((1+a)^{1+a}(1-a)^{1-a}\right)^{-1/2}</math>
|-
| Gamma distribution
| <math>\theta k</math>
| <math>-k \ln(1 - \theta t)</math>
| <math>-k \ln\tfrac{a}{\theta k} - k + \tfrac{a}{\theta}</math>
| <math>\left(\tfrac{a}{\theta k}\right)^k e^{k - a/\theta}</math>
|-
| Chi-squared distribution
| <math>k</math>
| <math>-\tfrac{k}{2}\ln(1 - 2t)</math>
| <math>\tfrac{k}{2}\left(\tfrac{a}{k} - 1 - \ln\tfrac{a}{k}\right)</math>[8]
| <math>\left(\tfrac{a}{k}\right)^{k/2} e^{k/2 - a/2}</math>
|-
| Poisson distribution
| <math>\lambda</math>
| <math>\lambda(e^t - 1)</math>
| <math>a \ln(a/\lambda) - a + \lambda</math>
| <math>(a/\lambda)^{-a}\, e^{a - \lambda}</math>
|}
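
A quick sanity check of the table (our own illustration, assuming SciPy): the tabulated Poisson bound <math>C(a) = (a/\lambda)^{-a} e^{a-\lambda}</math> must dominate the exact upper tail for <math>a > \lambda</math>.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import poisson

lam = 10.0
for a in [12, 15, 20, 30]:
    chernoff = (a / lam) ** (-a) * np.exp(a - lam)
    exact = poisson.sf(a - 1, lam)  # P(X >= a) for integer a
    assert exact <= chernoff
    print(f"a={a}: exact {exact:.3e} <= Chernoff {chernoff:.3e}")
</syntaxhighlight>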

Bounds from below from the MGF

Using only the moment generating function, a bound from below on the tail probabilities can be obtained by applying the Paley–Zygmund inequality to <math>e^{tX}</math>, yielding:

:<math>\Pr(X > a) \ge \sup_{t > 0,\, M(t) \ge e^{ta}} \left(1 - \frac{e^{ta}}{M(t)}\right)^2 \frac{M(t)^2}{M(2t)}</math>

(a bound on the left tail is obtained for negative <math>t</math>). Unlike the Chernoff bound however, this result is not exponentially tight.

Theodosopoulos[9] constructed a tight(er) MGF-based bound from below using an exponential tilting procedure.

For particular distributions (such as the binomial) bounds from below of the same exponential order as the Chernoff bound are often available.

Sums of independent random variables

When <math>X</math> is the sum of <math>n</math> independent random variables <math>X_1, \ldots, X_n</math>, the moment generating function of <math>X</math> is the product of the individual moment generating functions, giving that:

:<math>\Pr(X \ge a) \le \inf_{t > 0} e^{-ta} \prod_i \operatorname{E}\left[e^{tX_i}\right] \qquad (1)</math>

and:

:<math>\Pr(X \le a) \le \inf_{t < 0} e^{-ta} \prod_i \operatorname{E}\left[e^{tX_i}\right]</math>

Specific Chernoff bounds are attained by calculating the moment-generating function <math>\operatorname{E}\left[e^{tX_i}\right]</math> for specific instances of the random variables <math>X_i</math>.

When the random variables are also identically distributed (iid), the Chernoff bound for the sum reduces to a simple rescaling of the single-variable Chernoff bound. That is, the Chernoff bound for the average of n iid variables is equivalent to the nth power of the Chernoff bound on a single variable (see Cramér's theorem).
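
A small sketch of this rescaling (our own example): for i.i.d. standard normals, the bound on the sample mean is exactly the <math>n</math>-th power of the single-variable bound.

<syntaxhighlight lang="python">
import numpy as np

a, n = 1.5, 20
single = np.exp(-a**2 / 2)      # Chernoff bound on P(X_1 >= a) for N(0, 1)
mean_bound = single ** n        # bound on P((X_1 + ... + X_n)/n >= a)
direct = np.exp(-n * a**2 / 2)  # from the mean's own MGF, exp(t^2 / (2n))
assert np.isclose(mean_bound, direct)
print(f"single-variable bound {single:.4f}, mean-of-{n} bound {mean_bound:.3e}")
</syntaxhighlight>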

Sums of independent bounded random variables

Main article: Hoeffding's inequality

Chernoff bounds may also be applied to general sums of independent, bounded random variables, regardless of their distribution; this is known as Hoeffding's inequality. The proof follows a similar approach to the other Chernoff bounds, but applying Hoeffding's lemma to bound the moment generating functions (see Hoeffding's inequality).

Hoeffding's inequality. Suppose <math>X_1, \ldots, X_n</math> are independent random variables taking values in <math>[a, b]</math>. Let <math>X</math> denote their sum and let <math>\mu = \operatorname{E}[X]</math> denote the sum's expected value. Then for any <math>t > 0</math>,

:<math>\Pr(X \le \mu - t) < e^{-2t^2/(n(b-a)^2)},</math>
:<math>\Pr(X \ge \mu + t) < e^{-2t^2/(n(b-a)^2)}.</math>
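
A Monte Carlo check (our own illustration; the Uniform[0, 1] choice is arbitrary) comparing the Hoeffding bound against an empirical tail estimate:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 100, 50_000, 5.0    # values in [a, b] = [0, 1], so (b - a)^2 = 1
sums = rng.random((trials, n)).sum(axis=1)
mu = n * 0.5                       # expected value of the sum
empirical = np.mean(sums >= mu + t)
hoeffding = np.exp(-2 * t**2 / n)  # e^{-2 t^2 / (n (b - a)^2)}
print(f"empirical tail {empirical:.4f} <= Hoeffding bound {hoeffding:.4f}")
</syntaxhighlight>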

Sums of independent Bernoulli random variables

The bounds in the following sections for Bernoulli random variables are derived by using that, for a Bernoulli random variable <math>X_i</math> with probability <math>p</math> of being equal to 1,

:<math>\operatorname{E}\left[e^{tX_i}\right] = (1-p)e^0 + pe^t = 1 + p(e^t - 1) \le e^{p(e^t - 1)}.</math>

One can encounter many flavors of Chernoff bounds: the original additive form (which gives a bound on the absolute error) or the more practical multiplicative form (which bounds the error relative to the mean).

Multiplicative form (relative error)

Multiplicative Chernoff bound. Suppose <math>X_1, \ldots, X_n</math> are independent random variables taking values in <math>\{0, 1\}</math>. Let <math>X</math> denote their sum and let <math>\mu = \operatorname{E}[X]</math> denote the sum's expected value. Then for any <math>\delta > 0</math>,

:<math>\Pr(X \ge (1+\delta)\mu) \le \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu.</math>

A similar proof strategy can be used to show that for <math>0 < \delta < 1</math>

:<math>\Pr(X \le (1-\delta)\mu) \le \left(\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right)^\mu.</math>

The above formula is often unwieldy in practice, so the following looser but more convenient bounds[10] are often used, which follow from the inequality <math>\textstyle\frac{2\delta}{2+\delta} \le \log(1+\delta)</math> from the list of logarithmic inequalities:

:<math>\Pr(X \ge (1+\delta)\mu) \le e^{-\delta^2\mu/(2+\delta)}, \qquad 0 \le \delta,</math>
:<math>\Pr(X \le (1-\delta)\mu) \le e^{-\delta^2\mu/2}, \qquad 0 \le \delta \le 1,</math>
:<math>\Pr(|X - \mu| \ge \delta\mu) \le 2e^{-\delta^2\mu/3}, \qquad 0 \le \delta \le 1.</math>

Notice that the bounds are trivial for <math>\delta = 0</math>.
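
Evaluating the three convenient bounds at illustrative parameters (our own numbers) shows the scale of the guarantees:

<syntaxhighlight lang="python">
import numpy as np

mu, delta = 500.0, 0.1  # e.g. the sum of 1000 fair coin flips, 10% deviation
upper = np.exp(-delta**2 * mu / (2 + delta))  # P(X >= (1 + delta) mu)
lower = np.exp(-delta**2 * mu / 2)            # P(X <= (1 - delta) mu)
two_sided = 2 * np.exp(-delta**2 * mu / 3)    # P(|X - mu| >= delta mu)
print(f"upper {upper:.4f}, lower {lower:.4f}, two-sided {two_sided:.4f}")
</syntaxhighlight>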

In addition, based on the Taylor expansion for the Lambert W function,[11]

:<math>\Pr(X \ge R) \le 2^{-xR}, \qquad x > 0,\ \mu > 0,\ R \ge (2^x e - 1)\mu.</math>

Additive form (absolute error)

The following theorem is due to Wassily Hoeffding[12] and hence is called the Chernoff–Hoeffding theorem.

Chernoff–Hoeffding theorem. Suppose <math>X_1, \ldots, X_n</math> are i.i.d. random variables, taking values in <math>\{0, 1\}</math>. Let <math>p = \operatorname{E}[X_1]</math> and <math>\varepsilon > 0</math>.

:<math>\Pr\left(\frac{1}{n}\sum X_i \ge p + \varepsilon\right) \le \left(\left(\frac{p}{p+\varepsilon}\right)^{p+\varepsilon} \left(\frac{1-p}{1-p-\varepsilon}\right)^{1-p-\varepsilon}\right)^n = e^{-D(p+\varepsilon \parallel p)n}</math>
:<math>\Pr\left(\frac{1}{n}\sum X_i \le p - \varepsilon\right) \le \left(\left(\frac{p}{p-\varepsilon}\right)^{p-\varepsilon} \left(\frac{1-p}{1-p+\varepsilon}\right)^{1-p+\varepsilon}\right)^n = e^{-D(p-\varepsilon \parallel p)n}</math>

where

:<math>D(x \parallel y) = x \ln\frac{x}{y} + (1-x)\ln\left(\frac{1-x}{1-y}\right)</math>

is the Kullback–Leibler divergence between Bernoulli distributed random variables with parameters <math>x</math> and <math>y</math> respectively. If <math>p \ge \tfrac{1}{2}</math>, then <math>D(p+\varepsilon \parallel p) \ge \tfrac{\varepsilon^2}{2p(1-p)}</math>, which means

:<math>\Pr\left(\frac{1}{n}\sum X_i > p + x\right) \le \exp\left(-\frac{x^2 n}{2p(1-p)}\right).</math>

A simpler bound follows by relaxing the theorem using <math>D(p + \varepsilon \parallel p) \ge 2\varepsilon^2</math>, which follows from the convexity of <math>D(p + \varepsilon \parallel p)</math> and the fact that

:<math>\frac{d^2}{d\varepsilon^2} D(p+\varepsilon \parallel p) = \frac{1}{(p+\varepsilon)(1-p-\varepsilon)} \ge 4 = \frac{d^2}{d\varepsilon^2}\left(2\varepsilon^2\right).</math>

This result is a special case of Hoeffding's inequality. Sometimes, the bounds

:<math>\begin{align}
D((1+x)p \parallel p) &\ge \tfrac{1}{4}x^2 p, & &{-\tfrac{1}{2}} \le x \le \tfrac{1}{2},\\
D(x \parallel y) &\ge \frac{3(x-y)^2}{2(2y+x)},\\
D(x \parallel y) &\ge \frac{(x-y)^2}{2y}, & &x \le y,\\
D(x \parallel y) &\ge \frac{(x-y)^2}{2x}, & &x \ge y
\end{align}</math>

which are stronger for <math>p < \tfrac{1}{8}</math>, are also used.
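
The theorem is easy to evaluate numerically. A sketch (our own illustration, assuming SciPy) comparing <math>e^{-D(p+\varepsilon \parallel p)n}</math> with the exact binomial tail:

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import binom

def kl_bernoulli(x, y):
    """Kullback-Leibler divergence D(x || y) between Bernoulli(x) and Bernoulli(y)."""
    return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))

n, p, eps = 200, 0.3, 0.1
bound = np.exp(-kl_bernoulli(p + eps, p) * n)
exact = binom.sf(np.ceil(n * (p + eps)) - 1, n, p)  # P(sum >= n(p + eps))
print(f"exact {exact:.3e} <= Chernoff-Hoeffding {bound:.3e}")
</syntaxhighlight>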

Applications

Chernoff bounds have very useful applications in set balancing and packet routing in sparse networks.

The set balancing problem arises while designing statistical experiments. Typically while designing a statistical experiment, given the features of each participant in the experiment, we need to know how to divide the participants into 2 disjoint groups such that each feature is roughly as balanced as possible between the two groups.[13]

Chernoff bounds are also used to obtain tight bounds for permutation routing problems which reduce network congestion while routing packets in sparse networks.[13]

Chernoff bounds are used in computational learning theory to prove that a learning algorithm is probably approximately correct, i.e. with high probability the algorithm has small error on a sufficiently large training data set.[14]

Chernoff bounds can be effectively used to evaluate the "robustness level" of an application/algorithm by exploring its perturbation space with randomization.[15] The use of the Chernoff bound permits one to abandon the strong—and mostly unrealistic—small perturbation hypothesis (the perturbation magnitude is small). The robustness level can be, in turn, used either to validate or reject a specific algorithmic choice, a hardware implementation or the appropriateness of a solution whose structural parameters are affected by uncertainties.

A simple and common use of Chernoff bounds is for "boosting" of randomized algorithms. If one has an algorithm that outputs a guess that is the desired answer with probability <math>p > 1/2</math>, then one can get a higher success rate by running the algorithm <math>n = \log(1/\delta)\, 2p/(p - 1/2)^2</math> times and outputting a guess that is output by more than <math>n/2</math> runs of the algorithm. (There cannot be more than one such guess.) Assuming that these algorithm runs are independent, the probability that more than <math>n/2</math> of the guesses is correct is equal to the probability that the sum of independent Bernoulli random variables <math>X_k</math> that are 1 with probability <math>p</math> is more than <math>n/2</math>. This can be shown to be at least <math>1 - \delta</math> via the multiplicative Chernoff bound (Corollary 13.3 in Sinclair's class notes, <math>\mu = np</math>):[16]

:<math>\Pr\left[X > \frac{n}{2}\right] \ge 1 - e^{-n(p - 1/2)^2/(2p)} \ge 1 - \delta</math>
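
A simulation sketch of this boosting argument (our own example; <math>p</math> and <math>\delta</math> are illustrative):

<syntaxhighlight lang="python">
import numpy as np

p, delta = 0.6, 0.01
n = int(np.ceil(np.log(1 / delta) * 2 * p / (p - 0.5) ** 2))  # runs needed
rng = np.random.default_rng(1)
trials = 20_000
correct = rng.binomial(n, p, size=trials)  # correct guesses per boosted run
success = np.mean(correct > n / 2)         # majority vote gives the answer
print(f"n = {n} runs; empirical success {success:.4f} >= target {1 - delta}")
</syntaxhighlight>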

Matrix Chernoff bound

Main article: Matrix Chernoff bound

Rudolf Ahlswede and Andreas Winter introduced a Chernoff bound for matrix-valued random variables.[17] The following version of the inequality can be found in the work of Tropp.[18]

Let <math>M_1, \ldots, M_t</math> be independent matrix-valued random variables such that <math>M_i \in \mathbb{C}^{d_1 \times d_2}</math> and <math>\mathbb{E}[M_i] = 0</math>. Let us denote by <math>\lVert M \rVert</math> the operator norm of the matrix <math>M</math>. If <math>\lVert M_i \rVert \le \gamma</math> holds almost surely for all <math>i \in \{1, \ldots, t\}</math>, then for every <math>\varepsilon > 0</math>

:<math>\Pr\left(\left\| \frac{1}{t}\sum_{i=1}^t M_i \right\| > \varepsilon\right) \le (d_1 + d_2)\exp\left(-\frac{3\varepsilon^2 t}{8\gamma^2}\right).</math>

Notice that in order to conclude that the deviation from 0 is bounded by <math>\varepsilon</math> with high probability, we need to choose a number of samples <math>t</math> proportional to the logarithm of <math>d_1 + d_2</math>. In general, unfortunately, a dependence on <math>\log(\min(d_1, d_2))</math> is inevitable: take for example a diagonal random sign matrix of dimension <math>d \times d</math>. The operator norm of the sum of <math>t</math> independent samples is precisely the maximum deviation among <math>d</math> independent random walks of length <math>t</math>. In order to achieve a fixed bound on the maximum deviation with constant probability, it is easy to see that <math>t</math> should grow logarithmically with <math>d</math> in this scenario.[19]
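
An empirical sketch of this example (our own code): for diagonal sign matrices, the operator norm of the sum is just the largest endpoint, in absolute value, of <math>d</math> random walks.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
d, t = 1000, 400
signs = rng.choice([-1.0, 1.0], size=(t, d))  # row i = diagonal of M_i
walk_ends = signs.sum(axis=0)                 # d independent random walks
op_norm = np.abs(walk_ends).max()             # operator norm of sum of M_i
print(f"||sum M_i|| = {op_norm:.0f}, single-walk scale sqrt(t) = {t**0.5:.1f}")
</syntaxhighlight>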

The following theorem can be obtained by assuming M has low rank, in order to avoid the dependency on the dimensions.

Theorem without the dependency on the dimensions

Let <math>0 < \varepsilon < 1</math> and <math>M</math> be a random symmetric real matrix with <math>\lVert \operatorname{E}[M] \rVert \le 1</math> and <math>\lVert M \rVert \le \gamma</math> almost surely. Assume that each element on the support of <math>M</math> has at most rank <math>r</math>. Set

:<math>t = \Omega\left(\frac{\gamma \log(\gamma/\varepsilon^2)}{\varepsilon^2}\right).</math>

If <math>r \le t</math> holds almost surely, then

:<math>\Pr\left(\left\|\frac{1}{t}\sum_{i=1}^t M_i - \operatorname{E}[M]\right\| > \varepsilon\right) \le \frac{1}{\mathbf{poly}(t)}</math>

where <math>M_1, \ldots, M_t</math> are i.i.d. copies of <math>M</math>.

Sampling variant

The following variant of Chernoff's bound can be used to bound the probability that a majority in a population will become a minority in a sample, or vice versa.[20]

Suppose there is a general population <math>A</math> and a sub-population <math>B \subseteq A</math>. Mark the relative size of the sub-population (<math>|B|/|A|</math>) by <math>r</math>.

Suppose we pick an integer <math>k</math> and a random sample <math>S \subset A</math> of size <math>k</math>. Mark the relative size of the sub-population in the sample (<math>|B \cap S|/|S|</math>) by <math>r_S</math>.

Then, for every fraction <math>d \in [0, 1]</math>:

:<math>\Pr\left(r_S < (1 - d)\cdot r\right) < \exp\left(-r\cdot d^2 \cdot \frac{k}{2}\right)</math>

In particular, if <math>B</math> is a majority in <math>A</math> (i.e. <math>r > 0.5</math>) we can bound the probability that <math>B</math> will remain a majority in <math>S</math> (<math>r_S > 0.5</math>) by taking <math>d = 1 - 1/(2r)</math>:[21]

:<math>\Pr\left(r_S > 0.5\right) > 1 - \exp\left(-r\cdot\left(1 - \frac{1}{2r}\right)^2 \cdot \frac{k}{2}\right)</math>

This bound is of course not tight at all. For example, when r = 0.5 we get a trivial bound Prob > 0.
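
Evaluating the majority-preservation bound at a few sample sizes (our own numbers, with <math>r = 0.6</math>):

<syntaxhighlight lang="python">
import numpy as np

r = 0.6
d = 1 - 1 / (2 * r)  # the choice d = 1 - 1/(2r) from above
for k in [50, 100, 500]:
    bound = 1 - np.exp(-r * d**2 * k / 2)
    print(f"k={k}: P(majority preserved) > {bound:.4f}")
</syntaxhighlight>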

Proofs

Multiplicative form

Following the conditions of the multiplicative Chernoff bound, let <math>X_1, \ldots, X_n</math> be independent Bernoulli random variables, whose sum is <math>X</math>, each having probability <math>p_i</math> of being equal to 1. For a Bernoulli variable:

:<math>\operatorname{E}\left[e^{tX_i}\right] = (1 - p_i)e^0 + p_i e^t = 1 + p_i(e^t - 1) \le e^{p_i(e^t - 1)}</math>

So, using (1) with <math>a = (1+\delta)\mu</math> for any <math>\delta > 0</math> and where <math>\mu = \operatorname{E}[X] = \textstyle\sum_{i=1}^n p_i</math>,

:<math>\begin{align}
\Pr(X > (1 + \delta)\mu) &\le \inf_{t \ge 0} \exp(-t(1+\delta)\mu) \prod_{i=1}^n \operatorname{E}\left[\exp(tX_i)\right]\\
&\le \inf_{t \ge 0} \exp\left(-t(1+\delta)\mu + \sum_{i=1}^n p_i(e^t - 1)\right)\\
&= \inf_{t \ge 0} \exp\left(-t(1+\delta)\mu + (e^t - 1)\mu\right).
\end{align}</math>

If we simply set <math>t = \log(1+\delta)</math> so that <math>t > 0</math> for <math>\delta > 0</math>, we can substitute and find

:<math>\exp\left(-t(1+\delta)\mu + (e^t - 1)\mu\right) = \frac{\exp\left((1+\delta-1)\mu\right)}{(1+\delta)^{(1+\delta)\mu}} = \left[\frac{e^\delta}{(1+\delta)^{(1+\delta)}}\right]^\mu.</math>

This proves the result desired.

Chernoff–Hoeffding theorem (additive form)

Let <math>q = p + \varepsilon</math>. Taking <math>a = nq</math> in (1), we obtain:

:<math>\Pr\left(\frac{1}{n}\sum X_i \ge q\right) \le \inf_{t>0} \frac{\operatorname{E}\left[e^{t \sum X_i}\right]}{e^{tnq}} = \inf_{t>0} \left(\frac{\operatorname{E}\left[e^{tX_i}\right]}{e^{tq}}\right)^n.</math>

Now, knowing that <math>\Pr(X_i = 1) = p</math> and <math>\Pr(X_i = 0) = 1 - p</math>, we have

:<math>\left(\frac{\operatorname{E}\left[e^{tX_i}\right]}{e^{tq}}\right)^n = \left(\frac{pe^t + (1-p)}{e^{tq}}\right)^n = \left(pe^{(1-q)t} + (1-p)e^{-qt}\right)^n.</math>

Therefore, we can easily compute the infimum, using calculus:

:<math>\frac{d}{dt}\left(pe^{(1-q)t} + (1-p)e^{-qt}\right) = (1-q)pe^{(1-q)t} - q(1-p)e^{-qt}</math>

Setting the equation to zero and solving, we have

:<math>\begin{align}
(1-q)pe^{(1-q)t} &= q(1-p)e^{-qt}\\
(1-q)pe^{t} &= q(1-p)
\end{align}</math>

so that

:<math>e^t = \frac{(1-p)q}{(1-q)p}.</math>

Thus,

:<math>t = \log\left(\frac{(1-p)q}{(1-q)p}\right).</math>

As <math>q = p + \varepsilon > p</math>, we see that <math>t > 0</math>, so our bound is satisfied on <math>t</math>. Having solved for <math>t</math>, we can plug back into the equations above to find that

:<math>\begin{align}
\log\left(pe^{(1-q)t} + (1-p)e^{-qt}\right) &= \log\left(e^{-qt}\left(1-p+pe^t\right)\right)\\
&= \log\left(e^{-q\log\left(\frac{(1-p)q}{(1-q)p}\right)}\right) + \log\left(1-p+pe^{\log\left(\frac{1-p}{1-q}\right)}e^{\log\frac{q}{p}}\right)\\
&= -q\log\frac{1-p}{1-q} - q\log\frac{q}{p} + \log\left(1-p+p\left(\frac{1-p}{1-q}\right)\frac{q}{p}\right)\\
&= -q\log\frac{1-p}{1-q} - q\log\frac{q}{p} + \log\left(\frac{(1-p)(1-q)}{1-q}+\frac{(1-p)q}{1-q}\right)\\
&= -q\log\frac{q}{p} + \left(-q\log\frac{1-p}{1-q} + \log\frac{1-p}{1-q}\right)\\
&= -q\log\frac{q}{p} + (1-q)\log\frac{1-p}{1-q}\\
&= -D(q \parallel p).
\end{align}</math>

We now have our desired result, that

:<math>\Pr\left(\frac{1}{n}\sum X_i \ge p + \varepsilon\right) \le e^{-D(p+\varepsilon \parallel p)n}.</math>

To complete the proof for the symmetric case, we simply define the random variable <math>Y_i = 1 - X_i</math>, apply the same proof, and plug it into our bound.

An elementary proof of the Chernoff–Hoeffding theorem (additive form)

The following proof is from an article by Wolfgang Mulzer.[22] Let <math>q \ge p</math>. The proof analyzes two distributions <math>D_p</math> and <math>D_q</math>, both over <math>n</math>-tuples of bits <math>X = (X_1, \ldots, X_n)</math>. In the distribution <math>D_p</math> each <math>X_i</math> is an independent Bernoulli random variable with expectation <math>p</math>, and <math>D_q</math> is defined analogously. When <math>\textstyle\sum X_i = k</math>, the ratio <math>D_q(X)/D_p(X)</math> is

:<math>\left(\frac{q}{p}\right)^k \left(\frac{1-q}{1-p}\right)^{n-k}.</math>

Note that this is monotone in <math>k</math>, and so whenever <math>\textstyle\sum X_i \ge qn</math>, the ratio <math>D_q(X)/D_p(X)</math> is at least

:<math>\left(\frac{q}{p}\right)^{qn}\left(\frac{1-q}{1-p}\right)^{n-qn} = e^{D(q \parallel p)n}.</math>

This shows us that <math>\textstyle\sum X_i \ge qn</math> is unlikely in <math>D_p</math> since

:<math>\Pr_{X \sim D_p}\left(\sum X_i \ge qn\right) \le e^{-D(q \parallel p)n}\, \Pr_{X \sim D_q}\left(\sum X_i \ge qn\right) \le e^{-D(q \parallel p)n}.</math>

As in the previous proof, for the symmetric case we simply define the random variable <math>Y_i = 1 - X_i</math>, apply the same proof, and plug it into our bound.

References


  1. Boucheron, Stéphane; Lugosi, Gábor; Massart, Pascal (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford: Oxford University Press. p. 21. ISBN 978-0-19-953525-5. OCLC 837517674.
  2. Wainwright, M. (January 22, 2015). "Basic tail and concentration bounds" (PDF). https://www.stat.berkeley.edu/~mjwain/stat210b/Chap2_TailBounds_Jan22_2015.pdf. Archived from the original on 2016-05-08.
  3. Vershynin, Roman (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge: Cambridge University Press. p. 19. ISBN 978-1-108-41519-4. OCLC 1029247498.
  4. Tropp, Joel A. (2015). "An Introduction to Matrix Concentration Inequalities". Foundations and Trends in Machine Learning. 8 (1–2): 60. arXiv:1501.01571. doi:10.1561/2200000048.
  5. Chernoff, Herman (1952). "A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the Sum of Observations". The Annals of Mathematical Statistics. 23 (4): 493–507. doi:10.1214/aoms/1177729330. JSTOR 2236576.
  6. Chernoff, Herman (2014). "A career in statistics". In Lin, Xihong; Genest, Christian; Banks, David L.; Molenberghs, Geert; Scott, David W.; Wang, Jane-Ling (eds.). Past, Present, and Future of Statistics. CRC Press. p. 35. ISBN 978-1-4822-0496-4.
  7. Philips, Thomas K.; Nelson, Randolph (1995). "The Moment Bound Is Tighter Than Chernoff's Bound for Positive Tail Probabilities". The American Statistician. 49 (2): 175–178. doi:10.1080/00031305.1995.10476140.
  8. Ghosh, Malay (2021). "Exponential Tail Bounds for Chisquared Random Variables". Journal of Statistical Theory and Practice. 15 (2): article 35. doi:10.1007/s42519-020-00156-x.
  9. Theodosopoulos, Ted (2007). "A reversion of the Chernoff bound". Statistics & Probability Letters. 77 (5): 558–565. doi:10.1016/j.spl.2006.09.003.
  10. Mitzenmacher, Michael; Upfal, Eli (2005). Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press. ISBN 978-0-521-83540-4.
  11. Information Processing Letters. 187 (2024): article 106516. doi:10.1016/j.ipl.2024.106516.
  12. Hoeffding, Wassily (1963). "Probability Inequalities for Sums of Bounded Random Variables". Journal of the American Statistical Association. 58 (301): 13–30. doi:10.1080/01621459.1963.10500830.
  13. Refer to this book section for more info on the problem.
  14. Kearns, M.; Vazirani, U. (1994). An Introduction to Computational Learning Theory. MIT Press. Chapter 9 (Appendix), pp. 190–192. ISBN 0-262-11193-4.
  15. Alippi, C. (2014). "Randomized Algorithms". Intelligence for Embedded Systems. Springer. ISBN 978-3-319-05278-6.
  16. Sinclair, Alistair (Fall 2011). Class notes for the course "Randomness and Computation".
  17. Ahlswede, R.; Winter, A. (2002). "Strong Converse for Identification via Quantum Channels". IEEE Transactions on Information Theory. 48 (3): 569–579. arXiv:quant-ph/0012127. doi:10.1109/18.985947.
  18. Tropp, J. A. (2012). "User-Friendly Tail Bounds for Sums of Random Matrices". Foundations of Computational Mathematics. 12 (4): 389–434. arXiv:1004.4389. doi:10.1007/s10208-011-9099-z.
  19. Magen, A.; Zouzias, A. (2011). "Low Rank Matrix-Valued Chernoff Bounds and Approximate Matrix Multiplication". arXiv:1005.2724 [cs.DM].
  20. Lemma 6.1.
  21. See graphs of the bound as a function of r when k changes, and of the bound as a function of k when r changes.
  22. Mulzer, Wolfgang (February 2018). "Five Proofs of Chernoff's Bound with Applications". Bulletin of the EATCS. 124. arXiv:1801.03365.

