Pinsker's inequality

In information theory, Pinsker's inequality, named after its inventor Mark Semenovich Pinsker, is an inequality that bounds the total variation distance (or statistical distance) in terms of the Kullback–Leibler divergence. The inequality is tight up to constant factors.[1]

Formal statement

Pinsker's inequality states that, if P and Q are two probability distributions on a measurable space (X,Σ), then

\delta(P,Q) \le \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(P \parallel Q)},

where

\delta(P,Q) = \sup\bigl\{\, |P(A) - Q(A)| : A \in \Sigma \text{ is a measurable event} \,\bigr\}

is the total variation distance (or statistical distance) between P and Q and

D_{\mathrm{KL}}(P \parallel Q) = \mathbb{E}_P\!\left(\log \frac{dP}{dQ}\right) = \int_X \left(\log \frac{dP}{dQ}\right) dP

is the Kullback–Leibler divergence in nats. When the sample space X is a finite set, the Kullback–Leibler divergence is given by

D_{\mathrm{KL}}(P \parallel Q) = \sum_{i \in X} \left(\log \frac{P(i)}{Q(i)}\right) P(i)

Note that in terms of the total variation norm \|P - Q\| of the signed measure P - Q, Pinsker's inequality differs from the one given above by a factor of two:

\|P - Q\| \le \sqrt{2 D_{\mathrm{KL}}(P \parallel Q)}.

A proof of Pinsker's inequality uses the partition inequality for f-divergences.
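As a quick numerical sanity check, both stated forms of the inequality can be verified on a small finite sample space (the distributions P and Q below are arbitrary illustrative choices, not taken from any source):

```python
# Sanity check of Pinsker's inequality on a finite sample space.
# Verifies both forms: delta(P,Q) <= sqrt(D_KL/2) and ||P - Q||_1 <= sqrt(2 D_KL),
# where the total variation norm satisfies ||P - Q||_1 = 2 * delta(P,Q).
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in nats for finite distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    """Total variation distance: sup_A |P(A) - Q(A)| = (1/2) sum_i |p_i - q_i|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

P = [0.5, 0.3, 0.2]  # illustrative distribution
Q = [0.2, 0.4, 0.4]  # illustrative distribution

dkl = kl_divergence(P, Q)
tv = total_variation(P, Q)

assert tv <= math.sqrt(0.5 * dkl)      # delta(P,Q)  <= sqrt(D_KL / 2)
assert 2 * tv <= math.sqrt(2 * dkl)    # ||P - Q||_1 <= sqrt(2 D_KL)
print(f"delta = {tv:.4f}, sqrt(D_KL/2) = {math.sqrt(0.5 * dkl):.4f}")
```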

Alternative version

Note that the expression of Pinsker's inequality depends on which base of logarithm is used in the definition of the KL-divergence. Here D_{\mathrm{KL}} is defined using \ln (logarithm base e), whereas D is typically defined with \log_2 (logarithm base 2). Then,

D(P \parallel Q) = \frac{D_{\mathrm{KL}}(P \parallel Q)}{\ln 2}.

Given the above comments, there is an alternative statement of Pinsker's inequality in some literature that relates information divergence to variation distance:

D(P \parallel Q) = \frac{D_{\mathrm{KL}}(P \parallel Q)}{\ln 2} \ge \frac{1}{2 \ln 2} V^2(p,q),

i.e.,

D_{\mathrm{KL}}(P \parallel Q) \ge \tfrac{1}{2} V(p,q)^2,

in which

V(p,q) = \sum_{x \in \mathcal{X}} |p(x) - q(x)|

is the (non-normalized) variation distance between two probability density functions p and q on the same alphabet 𝒳.[2]

This form of Pinsker's inequality shows that "convergence in divergence" is a stronger notion than "convergence in variation distance".
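The base-2 form can be checked numerically as well (again with illustrative distributions chosen here, not from the source):

```python
# Check of the base-2 form: D(P||Q) = D_KL(P||Q)/ln 2 >= V(p,q)^2 / (2 ln 2),
# equivalently D_KL >= V^2 / 2, where V is the non-normalized variation distance.
import math

p = [0.5, 0.3, 0.2]  # illustrative distribution
q = [0.2, 0.4, 0.4]  # illustrative distribution

dkl_nats = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
d_bits = dkl_nats / math.log(2)                  # KL divergence in bits
V = sum(abs(pi - qi) for pi, qi in zip(p, q))    # V(p,q) = sum_x |p(x) - q(x)|

assert d_bits >= V**2 / (2 * math.log(2))        # base-2 statement
assert dkl_nats >= 0.5 * V**2                    # equivalent nat statement
```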

A simple proof, due to John Pollard, proceeds by letting r(x) := P(x)/Q(x) - 1 \ge -1:

\begin{aligned}
D_{\mathrm{KL}}(P \parallel Q) &= \mathbb{E}_Q\bigl[(1 + r(x)) \log(1 + r(x)) - r(x)\bigr] \\
&\ge \frac{1}{2} \, \mathbb{E}_Q\!\left[\frac{r(x)^2}{1 + r(x)/3}\right] \\
&\ge \frac{1}{2} \, \frac{\mathbb{E}_Q[|r(x)|]^2}{\mathbb{E}_Q[1 + r(x)/3]} && \text{(from Titu's lemma)} \\
&= \frac{1}{2} \, \mathbb{E}_Q[|r(x)|]^2 && \text{(as } \mathbb{E}_Q[1 + r(x)/3] = 1\text{)} \\
&= \frac{1}{2} V(p,q)^2.
\end{aligned}

Here Titu's lemma, also known as Sedrakyan's inequality, gives \mathbb{E}_Q[X^2/Y] \ge \mathbb{E}_Q[|X|]^2 / \mathbb{E}_Q[Y] for Y > 0.

Note that the bound from Pinsker's inequality is vacuous for any distributions where D_{\mathrm{KL}}(P \parallel Q) > 2, since \sqrt{D_{\mathrm{KL}}/2} then exceeds 1 while the total variation distance is at most 1. For such distributions, an alternative bound can be used, due to Bretagnolle and Huber[3] (see, also, Tsybakov[4]):

\delta(P,Q) \le \sqrt{1 - e^{-D_{\mathrm{KL}}(P \parallel Q)}}.
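A small numerical example (with an illustrative pair of far-apart distributions) shows the regime where the Bretagnolle–Huber bound is informative while Pinsker's is not:

```python
# When D_KL > 2 nats, Pinsker's bound sqrt(D_KL/2) exceeds 1 and says nothing,
# since total variation is always <= 1; the Bretagnolle-Huber bound
# sqrt(1 - exp(-D_KL)) remains below 1 and still constrains delta(P,Q).
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

P = [0.99, 0.01]  # illustrative, nearly point masses on opposite outcomes
Q = [0.01, 0.99]

d = kl(P, Q)                       # roughly 4.5 nats, well above 2
pinsker = math.sqrt(0.5 * d)       # exceeds 1: vacuous here
bh = math.sqrt(1 - math.exp(-d))   # below 1: still informative

assert d > 2 and pinsker > 1 and bh < 1
assert tv(P, Q) <= bh              # the Bretagnolle-Huber bound holds
```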

History

Pinsker first proved the inequality with a greater constant. The inequality in the above form was proved independently by Kullback, Csiszár, and Kemperman.[5]

Inverse problem

A precise inverse of the inequality cannot hold: for every \varepsilon > 0, there are distributions P_\varepsilon, Q with \delta(P_\varepsilon, Q) \le \varepsilon but D_{\mathrm{KL}}(P_\varepsilon \parallel Q) = \infty. An easy example is given by the two-point space \{0,1\} with Q(0) = 0, Q(1) = 1 and P_\varepsilon(0) = \varepsilon, P_\varepsilon(1) = 1 - \varepsilon.[6]

However, an inverse inequality holds on finite spaces X with a constant depending on Q.[7] More specifically, it can be shown that with the definition \alpha_Q := \min_{x \in X : Q(x) > 0} Q(x) we have for any measure P which is absolutely continuous with respect to Q

\frac{1}{2} D_{\mathrm{KL}}(P \parallel Q) \le \frac{1}{\alpha_Q} \, \delta(P,Q)^2.

As a consequence, if Q has full support (i.e. Q(x)>0 for all xX), then

\delta(P,Q)^2 \le \frac{1}{2} D_{\mathrm{KL}}(P \parallel Q) \le \frac{1}{\alpha_Q} \, \delta(P,Q)^2.
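The two-sided sandwich for a fully supported Q can be checked numerically (the distributions below are illustrative choices):

```python
# Check of the two-sided bound on a finite space with full support:
# delta^2 <= D_KL/2 <= delta^2 / alpha_Q, where alpha_Q = min_x Q(x).
import math

P = [0.4, 0.35, 0.25]  # illustrative distribution
Q = [0.3, 0.3, 0.4]    # illustrative distribution with full support

dkl = sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)
delta = 0.5 * sum(abs(p - q) for p, q in zip(P, Q))
alpha_q = min(Q)       # smallest probability Q assigns to any point

# Lower bound is Pinsker; upper bound is the reverse inequality.
assert delta**2 <= 0.5 * dkl <= delta**2 / alpha_q
```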

Proof of Pinsker’s inequality

Lemma 1.1 (Pinsker's inequality). Let P and Q be two distributions defined on the universe U. Then

D(P \| Q) \ge \frac{1}{2 \ln 2} \|P - Q\|_1^2.[8]

Proof: First consider the special case where P and Q are distributions on two points:

P = \begin{cases} 1 & \text{w.p. } p \\ 0 & \text{w.p. } 1-p \end{cases}

and

Q = \begin{cases} 1 & \text{w.p. } q \\ 0 & \text{w.p. } 1-q \end{cases}

We assume p \ge q (the other case is similar), and let

f(p,q) = p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q} - \frac{1}{2 \ln 2} \bigl(2(p-q)\bigr)^2.

Since

\frac{\partial f}{\partial q} = \frac{q-p}{\ln 2} \left(\frac{1}{q(1-q)} - 4\right) \le 0 \quad \text{for } q \le p

(because q(1-q) \le 1/4), and f = 0 when q = p, we conclude that f(p,q) \ge 0 for q \le p. Thus we have that D(P\|Q) \ge \frac{1}{2 \ln 2} \|P - Q\|_1^2 in this special case.
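The sign claim f(p,q) \ge 0 for the binary case can be checked on a grid (a brute-force sketch, not part of the original proof):

```python
# Grid check of the binary special case: for q <= p,
# f(p,q) = p*log2(p/q) + (1-p)*log2((1-p)/(1-q)) - (2(p-q))^2 / (2 ln 2) >= 0,
# with equality at q = p.
import math

def f(p, q):
    kl_bits = p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))
    return kl_bits - (2 * (p - q)) ** 2 / (2 * math.log(2))

grid = [i / 100 for i in range(1, 100)]  # interior points 0.01 .. 0.99
worst = min(f(p, q) for p in grid for q in grid if q <= p)

assert worst >= -1e-12  # f never goes below zero (up to float rounding)
print(f"minimum of f over the grid: {worst:.3e}")
```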

General case: Let P and Q be distributions on U. Let A \subseteq U be

A = \{x \mid p(x) \ge q(x)\},

and let P_A and Q_A be

P_A = \begin{cases} 1 & \text{w.p. } \sum_{x \in A} p(x) \\ 0 & \text{w.p. } \sum_{x \notin A} p(x) \end{cases}
\qquad
Q_A = \begin{cases} 1 & \text{w.p. } \sum_{x \in A} q(x) \\ 0 & \text{w.p. } \sum_{x \notin A} q(x) \end{cases}

Then,

\begin{aligned}
\|P - Q\|_1 &= \sum_x |p(x) - q(x)| \\
&= \sum_{x \in A} (p(x) - q(x)) + \sum_{x \notin A} (q(x) - p(x)) \\
&= \Bigl|\sum_{x \in A} p(x) - \sum_{x \in A} q(x)\Bigr| + \Bigl|\sum_{x \notin A} p(x) - \sum_{x \notin A} q(x)\Bigr| \\
&= \|P_A - Q_A\|_1. \qquad (1)
\end{aligned}

Define a random variable Z by Z = 1 if x \in A and Z = 0 if x \notin A. By the chain rule for KL divergence, D(P\|Q) = D(P(Z)\|Q(Z)) + D(P\|Q \mid Z). Since D(P(Z)\|Q(Z)) = D(P_A\|Q_A) and D(P\|Q \mid Z) \ge 0, we have

\begin{aligned}
D(P\|Q) &\ge D(P_A \| Q_A) \\
&\ge \frac{1}{2 \ln 2} \|P_A - Q_A\|_1^2 && \text{(by the special case)} \\
&= \frac{1}{2 \ln 2} \|P - Q\|_1^2 && \text{(by equation (1))}.
\end{aligned}
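The reduction step above can be checked numerically: collapsing to the event A preserves the L1 distance while the divergence can only decrease (the distributions below are illustrative choices):

```python
# Numerical check of the general-case reduction: collapsing to the event
# A = {x : p(x) >= q(x)} preserves the L1 distance (equation 1), while the
# KL divergence can only decrease under this coarsening.
import math

p = [0.5, 0.3, 0.2]  # illustrative distribution
q = [0.2, 0.4, 0.4]  # illustrative distribution

A = [i for i in range(len(p)) if p[i] >= q[i]]
pA = sum(p[i] for i in A)   # P_A puts this mass on outcome 1
qA = sum(q[i] for i in A)   # Q_A puts this mass on outcome 1

l1 = sum(abs(pi - qi) for pi, qi in zip(p, q))
l1_A = abs(pA - qA) + abs((1 - pA) - (1 - qA))
assert abs(l1 - l1_A) < 1e-12   # equation (1): ||P - Q||_1 = ||P_A - Q_A||_1

def kl_bits(ps, qs):
    return sum(x * math.log2(x / y) for x, y in zip(ps, qs) if x > 0)

# Coarsening cannot increase the divergence: D(P||Q) >= D(P_A||Q_A).
assert kl_bits(p, q) >= kl_bits([pA, 1 - pA], [qA, 1 - qA])
```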

References

  1. Script error: No such module "citation/CS1".
  2. Script error: No such module "citation/CS1".
  3. Bretagnolle, J.; Huber, C, Estimation des densités: risque minimax, Séminaire de Probabilités, XII (Univ. Strasbourg, Strasbourg, 1976/1977), pp. 342–363, Lecture Notes in Math., 649, Springer, Berlin, 1978, Lemma 2.1 (French).
  4. Tsybakov, Alexandre B., Introduction to Nonparametric Estimation, revised and extended from the 2004 French original, translated by Vladimir Zaiats, Springer Series in Statistics, Springer, New York, 2009, xii+214 pp., Equation 2.25.
  5. Script error: No such module "citation/CS1".
  6. The divergence becomes infinite whenever one of the two distributions assigns probability zero to an event while the other assigns it a nonzero probability (no matter how small).
  7. See Remark 1 in Script error: No such module "citation/CS1".
  8. Tulsiani, Madhur, and Gökalp Demirci. “Information and Coding Theory Lecture 5: October 14, 2014.” University of Chicago.

Further reading

  • Thomas M. Cover and Joy A. Thomas: Elements of Information Theory, 2nd edition, Wiley-Interscience, 2006
  • Nicolò Cesa-Bianchi and Gábor Lugosi: Prediction, Learning, and Games, Cambridge University Press, 2006