<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://debianws.lexgopc.com/wiki143/index.php?action=history&amp;feed=atom&amp;title=Conditional_mutual_information</id>
	<title>Conditional mutual information - Revision history</title>
	<link rel="self" type="application/atom+xml" href="http://debianws.lexgopc.com/wiki143/index.php?action=history&amp;feed=atom&amp;title=Conditional_mutual_information"/>
	<link rel="alternate" type="text/html" href="http://debianws.lexgopc.com/wiki143/index.php?title=Conditional_mutual_information&amp;action=history"/>
	<updated>2026-05-12T19:25:22Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.43.1</generator>
	<entry>
		<id>http://debianws.lexgopc.com/wiki143/index.php?title=Conditional_mutual_information&amp;diff=7173893&amp;oldid=prev</id>
		<title>128.29.17.11: Moved punctuation outside of math notation and removed distracting box.</title>
		<link rel="alternate" type="text/html" href="http://debianws.lexgopc.com/wiki143/index.php?title=Conditional_mutual_information&amp;diff=7173893&amp;oldid=prev"/>
		<updated>2025-05-16T15:00:06Z</updated>

		<summary type="html">&lt;p&gt;Moved punctuation outside of math notation and removed distracting box.&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;{{Short description|Information theory}}&lt;br /&gt;
{{Information theory}}&lt;br /&gt;
&lt;br /&gt;
[[Image:VennInfo3Var.svg|thumb|256px|right|[[Venn diagram]] of information theoretic measures for three variables &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;y&amp;lt;/math&amp;gt;, and &amp;lt;math&amp;gt;z&amp;lt;/math&amp;gt;, represented by the lower left, lower right, and upper circles, respectively. The conditional mutual informations &amp;lt;math&amp;gt;I(x;z|y)&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;I(y;z|x)&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;I(x;y|z)&amp;lt;/math&amp;gt; are represented by the yellow, cyan, and magenta regions, respectively.]]&lt;br /&gt;
&lt;br /&gt;
In [[probability theory]], particularly [[information theory]], the &amp;#039;&amp;#039;&amp;#039;conditional mutual information&amp;#039;&amp;#039;&amp;#039;&amp;lt;ref name = Wyner1978&amp;gt;{{cite journal|last=Wyner|first=A. D. |title=A definition of conditional mutual information for arbitrary ensembles|journal=Information and Control|year=1978|volume=38|issue=1|pages=51–59|doi=10.1016/s0019-9958(78)90026-8|doi-access=free}}&amp;lt;/ref&amp;gt;&amp;lt;ref name = Dobrushin1959&amp;gt;{{cite journal|last=Dobrushin|first=R. L. |title=General formulation of Shannon&amp;#039;s main theorem in information theory|journal=Uspekhi Mat. Nauk|year=1959|volume=14|pages=3–104}}&amp;lt;/ref&amp;gt; is, in its most basic form, the [[expected value]] of the [[mutual information]] of two random variables given the value of a third.&lt;br /&gt;
&lt;br /&gt;
==Definition==&lt;br /&gt;
For random variables &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt;, and &amp;lt;math&amp;gt;Z&amp;lt;/math&amp;gt; with [[Support (mathematics)|support sets]] &amp;lt;math&amp;gt;\mathcal{X}&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;\mathcal{Y}&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\mathcal{Z}&amp;lt;/math&amp;gt;, we define the conditional mutual information as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
I(X;Y|Z) = \int_\mathcal{Z} D_{\mathrm{KL}}( P_{(X,Y)|Z} \| P_{X|Z} \otimes P_{Y|Z} ) dP_{Z}&lt;br /&gt;
&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This may be written in terms of the expectation operator: &amp;lt;math&amp;gt;I(X;Y|Z) = \mathbb{E}_Z [D_{\mathrm{KL}}( P_{(X,Y)|Z} \| P_{X|Z} \otimes P_{Y|Z} )]&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Thus &amp;lt;math&amp;gt;I(X;Y|Z)&amp;lt;/math&amp;gt; is the expected (with respect to &amp;lt;math&amp;gt;Z&amp;lt;/math&amp;gt;) [[Kullback–Leibler divergence]] from the conditional joint distribution &amp;lt;math&amp;gt;P_{(X,Y)|Z}&amp;lt;/math&amp;gt; to the product of the conditional marginals &amp;lt;math&amp;gt;P_{X|Z}&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;P_{Y|Z}&amp;lt;/math&amp;gt;. Compare with the definition of [[mutual information]].&lt;br /&gt;
&lt;br /&gt;
==In terms of PMFs for discrete distributions==&lt;br /&gt;
For discrete random variables &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt;, and &amp;lt;math&amp;gt;Z&amp;lt;/math&amp;gt; with [[Support (mathematics)|support sets]] &amp;lt;math&amp;gt;\mathcal{X}&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;\mathcal{Y}&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\mathcal{Z}&amp;lt;/math&amp;gt;, the conditional mutual information &amp;lt;math&amp;gt;I(X;Y|Z)&amp;lt;/math&amp;gt; is as follows&lt;br /&gt;
:&amp;lt;math&amp;gt;&lt;br /&gt;
I(X;Y|Z) = \sum_{z\in \mathcal{Z}} p_Z(z) \sum_{y\in \mathcal{Y}} \sum_{x\in \mathcal{X}}&lt;br /&gt;
      p_{X,Y|Z}(x,y|z) \log \frac{p_{X,Y|Z}(x,y|z)}{p_{X|Z}(x|z)p_{Y|Z}(y|z)}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
where the marginal, joint, and/or conditional [[probability mass function]]s are denoted by &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt; with the appropriate subscript. This can be simplified as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
I(X;Y|Z) = \sum_{z\in \mathcal{Z}} \sum_{y\in \mathcal{Y}} \sum_{x\in \mathcal{X}} p_{X,Y,Z}(x,y,z) \log \frac{p_Z(z)p_{X,Y,Z}(x,y,z)}{p_{X,Z}(x,z)p_{Y,Z}(y,z)}&lt;br /&gt;
&amp;lt;/math&amp;gt;.&lt;br /&gt;
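&lt;br /&gt;
The simplified sum above translates directly into code. The following is a minimal Python sketch (the function name and the dictionary representation of the joint PMF are illustrative choices, not a standard library interface):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
from collections import defaultdict&lt;br /&gt;
from math import log2&lt;br /&gt;
&lt;br /&gt;
def conditional_mutual_information(pmf):&lt;br /&gt;
    # I(X;Y|Z) in bits from a joint PMF given as {(x, y, z): probability}.&lt;br /&gt;
    p_z = defaultdict(float)&lt;br /&gt;
    p_xz = defaultdict(float)&lt;br /&gt;
    p_yz = defaultdict(float)&lt;br /&gt;
    for (x, y, z), p in pmf.items():&lt;br /&gt;
        p_z[z] += p&lt;br /&gt;
        p_xz[(x, z)] += p&lt;br /&gt;
        p_yz[(y, z)] += p&lt;br /&gt;
    # Simplified form: sum of p(x,y,z) log[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ].&lt;br /&gt;
    return sum(p * log2(p_z[z] * p / (p_xz[(x, z)] * p_yz[(y, z)]))&lt;br /&gt;
               for (x, y, z), p in pmf.items() if p &amp;gt; 0)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;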
&lt;br /&gt;
==In terms of PDFs for continuous distributions==&lt;br /&gt;
For (absolutely) continuous random variables &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt;, and &amp;lt;math&amp;gt;Z&amp;lt;/math&amp;gt; with [[Support (mathematics)|support sets]] &amp;lt;math&amp;gt;\mathcal{X}&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;\mathcal{Y}&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\mathcal{Z}&amp;lt;/math&amp;gt;, the conditional mutual information &amp;lt;math&amp;gt;I(X;Y|Z)&amp;lt;/math&amp;gt; is as follows&lt;br /&gt;
:&amp;lt;math&amp;gt;&lt;br /&gt;
I(X;Y|Z) = \int_{\mathcal{Z}} \bigg( \int_{\mathcal{Y}} \int_{\mathcal{X}}&lt;br /&gt;
      \log \left(\frac{p_{X,Y|Z}(x,y|z)}{p_{X|Z}(x|z)p_{Y|Z}(y|z)}\right) p_{X,Y|Z}(x,y|z) dx dy \bigg) p_Z(z) dz&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
where the marginal, joint, and/or conditional [[probability density function]]s are denoted by &amp;lt;math&amp;gt;p&amp;lt;/math&amp;gt; with the appropriate subscript. This can be simplified as&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
I(X;Y|Z) = \int_{\mathcal{Z}} \int_{\mathcal{Y}} \int_{\mathcal{X}} \log \left(\frac{p_Z(z)p_{X,Y,Z}(x,y,z)}{p_{X,Z}(x,z)p_{Y,Z}(y,z)}\right) p_{X,Y,Z}(x,y,z) dx dy dz&lt;br /&gt;
&amp;lt;/math&amp;gt;.&lt;br /&gt;
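&lt;br /&gt;
For jointly Gaussian variables this integral has a closed form, since each differential entropy equals &amp;lt;math&amp;gt;\tfrac{1}{2}\log\big((2\pi e)^k \det\Sigma\big)&amp;lt;/math&amp;gt; and the &amp;lt;math&amp;gt;(2\pi e)^k&amp;lt;/math&amp;gt; factors cancel in &amp;lt;math&amp;gt;I(X;Y|Z) = h(X,Z)+h(Y,Z)-h(X,Y,Z)-h(Z)&amp;lt;/math&amp;gt;, the differential-entropy analogue of an identity given in the next section. A minimal NumPy sketch, assuming scalar &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;Z&amp;lt;/math&amp;gt; and a positive-definite covariance matrix ordered as &amp;lt;math&amp;gt;(X,Y,Z)&amp;lt;/math&amp;gt;:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
&lt;br /&gt;
def gaussian_cmi_nats(cov):&lt;br /&gt;
    # I(X;Y|Z) in nats via log-determinants of the 3x3 covariance matrix.&lt;br /&gt;
    logdet = lambda m: np.linalg.slogdet(m)[1]&lt;br /&gt;
    s_xz = cov[np.ix_([0, 2], [0, 2])]&lt;br /&gt;
    s_yz = cov[np.ix_([1, 2], [1, 2])]&lt;br /&gt;
    return 0.5 * (logdet(s_xz) + logdet(s_yz) - logdet(cov) - np.log(cov[2, 2]))&lt;br /&gt;
&lt;br /&gt;
# Equicorrelated example with pairwise correlation 0.5.&lt;br /&gt;
cov = np.array([[1.0, 0.5, 0.5],&lt;br /&gt;
                [0.5, 1.0, 0.5],&lt;br /&gt;
                [0.5, 0.5, 1.0]])&lt;br /&gt;
print(gaussian_cmi_nats(cov))  # 0.5 * ln(9/8), about 0.0589 nats&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;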
&lt;br /&gt;
==Some identities==&lt;br /&gt;
Alternatively, we may write in terms of joint and conditional [[Entropy (information theory)|entropies]] as&amp;lt;ref&amp;gt;{{cite book |last1=Cover |first1=Thomas |author-link1=Thomas M. Cover |last2=Thomas |first2=Joy A. |title=Elements of Information Theory |edition=2nd |location=New York |publisher=[[Wiley-Interscience]] |date=2006 |isbn=0-471-24195-4}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
:&amp;lt;math&amp;gt;\begin{align}&lt;br /&gt;
I(X;Y|Z) &amp;amp;= H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z) \\&lt;br /&gt;
         &amp;amp;= H(X|Z) - H(X|Y,Z) \\&lt;br /&gt;
         &amp;amp;= H(X|Z)+H(Y|Z)-H(X,Y|Z).&lt;br /&gt;
\end{align}&amp;lt;/math&amp;gt;&lt;br /&gt;
This can be rewritten to show its relationship to mutual information&lt;br /&gt;
:&amp;lt;math&amp;gt;I(X;Y|Z) = I(X;Y,Z) - I(X;Z)&amp;lt;/math&amp;gt;&lt;br /&gt;
usually rearranged as &amp;#039;&amp;#039;&amp;#039;the chain rule for mutual information&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
:&amp;lt;math&amp;gt;I(X;Y,Z) = I(X;Z) + I(X;Y|Z)&amp;lt;/math&amp;gt;&lt;br /&gt;
or&lt;br /&gt;
:&amp;lt;math&amp;gt;I(X;Y|Z) = I(X;Y) - (I(X;Z) - I(X;Z|Y))\,.&amp;lt;/math&amp;gt;&lt;br /&gt;
Another equivalent form of the above is&lt;br /&gt;
:&amp;lt;math&amp;gt;\begin{align}&lt;br /&gt;
I(X;Y|Z) &amp;amp;= H(Z|X) + H(X) + H(Z|Y) + H(Y) - H(Z|X,Y) - H(X,Y) - H(Z)\\&lt;br /&gt;
         &amp;amp;= I(X;Y) + H(Z|X) + H(Z|Y) - H(Z|X,Y) - H(Z)&lt;br /&gt;
\end{align}\,.&amp;lt;/math&amp;gt;&lt;br /&gt;
Another equivalent form of the conditional mutual information is&lt;br /&gt;
:&amp;lt;math&amp;gt;\begin{align}&lt;br /&gt;
I(X;Y|Z) = I(X,Z;Y,Z) - H(Z)&lt;br /&gt;
\end{align}\,.&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Like mutual information, conditional mutual information can be expressed as a [[Kullback–Leibler divergence]]:&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt; I(X;Y|Z) = D_{\mathrm{KL}}[ p(X,Y,Z) \| p(X|Z)p(Y|Z)p(Z) ]. &amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Or as an expected value of simpler Kullback–Leibler divergences:&lt;br /&gt;
:&amp;lt;math&amp;gt; I(X;Y|Z) = \sum_{z \in \mathcal{Z}} p( Z=z ) D_{\mathrm{KL}}[ p(X,Y|z) \| p(X|z)p(Y|z) ]&amp;lt;/math&amp;gt;,&lt;br /&gt;
:&amp;lt;math&amp;gt; I(X;Y|Z) = \sum_{y \in \mathcal{Y}} p( Y=y ) D_{\mathrm{KL}}[ p(X,Z|y) \| p(X|Z)p(Z|y) ]&amp;lt;/math&amp;gt;.&lt;br /&gt;
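&lt;br /&gt;
These identities are straightforward to check numerically. The sketch below draws a random joint PMF on &amp;lt;math&amp;gt;\{0,1\}^3&amp;lt;/math&amp;gt; and compares the direct definition against the entropy form (all names are illustrative):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import itertools&lt;br /&gt;
import random&lt;br /&gt;
from math import log2&lt;br /&gt;
&lt;br /&gt;
random.seed(0)&lt;br /&gt;
keys = list(itertools.product([0, 1], repeat=3))  # coordinates: (x, y, z)&lt;br /&gt;
w = [random.random() for _ in keys]&lt;br /&gt;
pmf = {k: v / sum(w) for k, v in zip(keys, w)}&lt;br /&gt;
&lt;br /&gt;
def marginal(axes):&lt;br /&gt;
    # Marginal PMF over the chosen coordinates (0 = X, 1 = Y, 2 = Z).&lt;br /&gt;
    m = {}&lt;br /&gt;
    for k, p in pmf.items():&lt;br /&gt;
        sub = tuple(k[a] for a in axes)&lt;br /&gt;
        m[sub] = m.get(sub, 0.0) + p&lt;br /&gt;
    return m&lt;br /&gt;
&lt;br /&gt;
def H(axes):&lt;br /&gt;
    # Entropy in bits of the marginal over the chosen coordinates.&lt;br /&gt;
    return -sum(p * log2(p) for p in marginal(axes).values() if p &amp;gt; 0)&lt;br /&gt;
&lt;br /&gt;
# Entropy identity: I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z).&lt;br /&gt;
via_entropies = H((0, 2)) + H((1, 2)) - H((0, 1, 2)) - H((2,))&lt;br /&gt;
&lt;br /&gt;
# Direct definition: sum of p(x,y,z) log[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ].&lt;br /&gt;
p_z, p_xz, p_yz = marginal((2,)), marginal((0, 2)), marginal((1, 2))&lt;br /&gt;
direct = sum(p * log2(p_z[(z,)] * p / (p_xz[(x, z)] * p_yz[(y, z)]))&lt;br /&gt;
             for (x, y, z), p in pmf.items() if p &amp;gt; 0)&lt;br /&gt;
&lt;br /&gt;
assert abs(via_entropies - direct) &amp;lt; 1e-12&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;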
&lt;br /&gt;
==More general definition==&lt;br /&gt;
A more general definition of conditional mutual information, applicable to random variables with continuous or other arbitrary distributions, will depend on the concept of &amp;#039;&amp;#039;&amp;#039;[[regular conditional probability]]&amp;#039;&amp;#039;&amp;#039;.&amp;lt;ref&amp;gt;D. Leao, Jr. et al. &amp;#039;&amp;#039;Regular conditional probability, disintegration of probability and Radon spaces.&amp;#039;&amp;#039; Proyecciones. Vol. 23, No. 1, pp. 15–29, May 2004, Universidad Católica del Norte, Antofagasta, Chile [http://www.scielo.cl/pdf/proy/v23n1/art02.pdf PDF]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Let &amp;lt;math&amp;gt;(\Omega, \mathcal F, \mathfrak P)&amp;lt;/math&amp;gt; be a [[probability space]], and let the random variables &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt;, and &amp;lt;math&amp;gt;Z&amp;lt;/math&amp;gt; each be defined as a Borel-measurable function from &amp;lt;math&amp;gt;\Omega&amp;lt;/math&amp;gt; to some state space endowed with a topological structure.&lt;br /&gt;
&lt;br /&gt;
Consider the Borel measure (on the σ-algebra generated by the open sets) in the state space of each random variable defined by assigning each Borel set the &amp;lt;math&amp;gt;\mathfrak P&amp;lt;/math&amp;gt;-measure of its preimage in &amp;lt;math&amp;gt;\mathcal F&amp;lt;/math&amp;gt;.  This is called the [[pushforward measure]] &amp;lt;math&amp;gt;X _* \mathfrak P = \mathfrak P\big(X^{-1}(\cdot)\big).&amp;lt;/math&amp;gt;  The &amp;#039;&amp;#039;&amp;#039;support of a random variable&amp;#039;&amp;#039;&amp;#039; is defined to be the [[Support (measure theory)|topological support]] of this measure, i.e. &amp;lt;math&amp;gt;\mathrm{supp}\,X = \mathrm{supp}\,X _* \mathfrak P.&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Now we can formally define the [[conditional probability distribution|conditional probability measure]] given the value of one (or, via the [[product topology]], more) of the random variables.  Let &amp;lt;math&amp;gt;M&amp;lt;/math&amp;gt; be a measurable subset of &amp;lt;math&amp;gt;\Omega&amp;lt;/math&amp;gt; (i.e. &amp;lt;math&amp;gt;M \in \mathcal F&amp;lt;/math&amp;gt;), and let &amp;lt;math&amp;gt;x \in \mathrm{supp}\,X&amp;lt;/math&amp;gt;.  Then, using the [[disintegration theorem]]:&lt;br /&gt;
:&amp;lt;math&amp;gt;\mathfrak P(M | X=x) = \lim_{U \ni x}&lt;br /&gt;
  \frac {\mathfrak P(M \cap \{X \in U\})}&lt;br /&gt;
        {\mathfrak P(\{X \in U\})}&lt;br /&gt;
  \qquad \textrm{and} \qquad \mathfrak P(M|X) = \int_M d\mathfrak P\big(\omega|X=X(\omega)\big),&amp;lt;/math&amp;gt;&lt;br /&gt;
where the limit is taken over the open neighborhoods &amp;lt;math&amp;gt;U&amp;lt;/math&amp;gt; of &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;, which are allowed to become arbitrarily small with respect to [[Subset|set inclusion]].&lt;br /&gt;
&lt;br /&gt;
Finally we can define the conditional mutual information via [[Lebesgue integration]]:&lt;br /&gt;
:&amp;lt;math&amp;gt;I(X;Y|Z) = \int_\Omega \log&lt;br /&gt;
  \Bigl(&lt;br /&gt;
  \frac {d \mathfrak P(\omega|X,Z)\, d\mathfrak P(\omega|Y,Z)}&lt;br /&gt;
        {d \mathfrak P(\omega|Z)\, d\mathfrak P(\omega|X,Y,Z)}&lt;br /&gt;
  \Bigr)&lt;br /&gt;
  d \mathfrak P(\omega),&lt;br /&gt;
  &amp;lt;/math&amp;gt;&lt;br /&gt;
where the integrand is the logarithm of a [[Radon–Nikodym derivative]] involving some of the conditional probability measures we have just defined.&lt;br /&gt;
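&lt;br /&gt;
For discrete random variables this reduces to the earlier PMF formula (a consistency check, not spelled out in the cited source). In that case &amp;lt;math&amp;gt;\mathfrak P(M | X=x, Z=z) = \mathfrak P(M \cap \{X=x, Z=z\}) / \mathfrak P(X=x, Z=z)&amp;lt;/math&amp;gt;, so&lt;br /&gt;
:&amp;lt;math&amp;gt;\frac{d \mathfrak P(\omega|X,Z)}{d \mathfrak P(\omega)} = \frac{1}{p_{X,Z}\big(X(\omega),Z(\omega)\big)}&amp;lt;/math&amp;gt;&lt;br /&gt;
and similarly for the other conditional measures; the integrand then equals &amp;lt;math&amp;gt;\log\big(p_Z \, p_{X,Y,Z} / (p_{X,Z} \, p_{Y,Z})\big)&amp;lt;/math&amp;gt; evaluated at &amp;lt;math&amp;gt;\big(X(\omega),Y(\omega),Z(\omega)\big)&amp;lt;/math&amp;gt;, recovering the sum for discrete distributions given above.&lt;br /&gt;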
&lt;br /&gt;
==Note on notation==&lt;br /&gt;
In an expression such as &amp;lt;math&amp;gt;I(A;B|C),&amp;lt;/math&amp;gt; &amp;lt;math&amp;gt;A,&amp;lt;/math&amp;gt; &amp;lt;math&amp;gt;B,&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;C&amp;lt;/math&amp;gt; need not necessarily be restricted to representing individual random variables, but could also represent the joint distribution of any collection of random variables defined on the same [[probability space]].  As is common in [[probability theory]], we may use the comma to denote such a joint distribution, e.g. &amp;lt;math&amp;gt;I(A_0,A_1;B_1,B_2,B_3|C_0,C_1).&amp;lt;/math&amp;gt;  Hence the use of the semicolon (or occasionally a colon or even a wedge &amp;lt;math&amp;gt;\wedge&amp;lt;/math&amp;gt;) to separate the principal arguments of the mutual information symbol.  (No such distinction is necessary in the symbol for [[joint entropy]], since the joint entropy of any number of random variables is the same as the entropy of their joint distribution.)&lt;br /&gt;
&lt;br /&gt;
== Properties ==&lt;br /&gt;
===Nonnegativity===&lt;br /&gt;
It is always true that&lt;br /&gt;
:&amp;lt;math&amp;gt;I(X;Y|Z) \ge 0&amp;lt;/math&amp;gt;,&lt;br /&gt;
for discrete, jointly distributed random variables &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;Z&amp;lt;/math&amp;gt;. This follows because &amp;lt;math&amp;gt;I(X;Y|Z)&amp;lt;/math&amp;gt; is an expected value of Kullback–Leibler divergences, each of which is non-negative. This result has been used as a basic building block for proving other [[inequalities in information theory]], in particular, those known as Shannon-type inequalities. Conditional mutual information is also non-negative for continuous random variables under certain regularity conditions.&amp;lt;ref&amp;gt;{{cite book |last1=Polyanskiy |first1=Yury |last2=Wu |first2=Yihong |title=Lecture notes on information theory |date=2017 |page=30 |url=http://people.lids.mit.edu/yp/homepage/data/itlectures_v5.pdf}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Interaction information===&lt;br /&gt;
Conditioning on a third random variable may either increase or decrease the mutual information: that is, the difference &amp;lt;math&amp;gt;I(X;Y) - I(X;Y|Z)&amp;lt;/math&amp;gt;, called the [[interaction information]], may be positive, negative, or zero, even when the random variables are pairwise independent. For example, let &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;Z&amp;lt;/math&amp;gt; be independent with &amp;lt;math display=&amp;quot;block&amp;quot;&amp;gt;X \sim \mathrm{Bernoulli}(0.5), \quad Z \sim \mathrm{Bernoulli}(0.5), \quad Y=\left\{\begin{array}{ll} X &amp;amp; \text{if }Z=0\\ 1-X &amp;amp; \text{if }Z=1 \end{array}\right.&amp;lt;/math&amp;gt;Then &amp;lt;math&amp;gt;X&amp;lt;/math&amp;gt;, &amp;lt;math&amp;gt;Y&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;Z&amp;lt;/math&amp;gt; are pairwise independent and in particular &amp;lt;math&amp;gt;I(X;Y)=0&amp;lt;/math&amp;gt;, but &amp;lt;math&amp;gt;I(X;Y|Z)=1.&amp;lt;/math&amp;gt;&lt;br /&gt;
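&lt;br /&gt;
This example is easy to verify in code (a small self-contained sketch; the helper &amp;lt;code&amp;gt;mi&amp;lt;/code&amp;gt; is an illustrative name):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
from math import log2&lt;br /&gt;
&lt;br /&gt;
def mi(pmf, a, b):&lt;br /&gt;
    # I(.;.) in bits between coordinates a and b of a joint PMF.&lt;br /&gt;
    pa, pb, pab = {}, {}, {}&lt;br /&gt;
    for k, p in pmf.items():&lt;br /&gt;
        pa[k[a]] = pa.get(k[a], 0.0) + p&lt;br /&gt;
        pb[k[b]] = pb.get(k[b], 0.0) + p&lt;br /&gt;
        pab[(k[a], k[b])] = pab.get((k[a], k[b]), 0.0) + p&lt;br /&gt;
    return sum(p * log2(p / (pa[u] * pb[v])) for (u, v), p in pab.items() if p &amp;gt; 0)&lt;br /&gt;
&lt;br /&gt;
# X and Z are independent fair bits; Y = X XOR Z. Coordinates: (x, y, z).&lt;br /&gt;
pmf = {(x, x ^ z, z): 0.25 for x in (0, 1) for z in (0, 1)}&lt;br /&gt;
&lt;br /&gt;
print(mi(pmf, 0, 1))  # I(X;Y) = 0.0: X and Y are independent&lt;br /&gt;
for z in (0, 1):      # but given Z = z, Y determines X; p(Z = z) = 0.5 here&lt;br /&gt;
    cond = {(x, y): p / 0.5 for (x, y, zz), p in pmf.items() if zz == z}&lt;br /&gt;
    print(mi(cond, 0, 1))  # 1.0 each; averaging over z gives I(X;Y|Z) = 1 bit&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;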
&lt;br /&gt;
===Chain rule for mutual information===&lt;br /&gt;
The chain rule (as derived above) provides two ways to decompose &amp;lt;math&amp;gt;I(X;Y,Z)&amp;lt;/math&amp;gt;:&lt;br /&gt;
:&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{align}&lt;br /&gt;
I(X;Y,Z) &amp;amp;= I(X;Z) + I(X;Y|Z) \\&lt;br /&gt;
         &amp;amp;= I(X;Y) + I(X;Z|Y)&lt;br /&gt;
\end{align}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
The [[data processing inequality]] is closely related to conditional mutual information and can be proven using the chain rule.&lt;br /&gt;
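&lt;br /&gt;
A numerical illustration of that connection (a sketch with made-up transition probabilities, not a proof): for a Markov chain &amp;lt;math&amp;gt;X \to Y \to Z&amp;lt;/math&amp;gt; we have &amp;lt;math&amp;gt;I(X;Z|Y)=0&amp;lt;/math&amp;gt;, so equating the two decompositions gives &amp;lt;math&amp;gt;I(X;Z) = I(X;Y) - I(X;Y|Z) \le I(X;Y)&amp;lt;/math&amp;gt; by nonnegativity. The sketch below reuses the &amp;lt;code&amp;gt;mi&amp;lt;/code&amp;gt; helper defined earlier:&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import itertools&lt;br /&gt;
&lt;br /&gt;
# Markov chain X -&amp;gt; Y -&amp;gt; Z over {0,1}: p(z|y) does not depend on x.&lt;br /&gt;
p_x = {0: 0.3, 1: 0.7}&lt;br /&gt;
p_y_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}&lt;br /&gt;
p_z_y = {0: {0: 0.75, 1: 0.25}, 1: {0: 0.4, 1: 0.6}}&lt;br /&gt;
pmf = {(x, y, z): p_x[x] * p_y_x[x][y] * p_z_y[y][z]&lt;br /&gt;
       for x, y, z in itertools.product((0, 1), repeat=3)}&lt;br /&gt;
&lt;br /&gt;
def cmi(pmf, a, b, c):&lt;br /&gt;
    # I(coordinate a; coordinate b | coordinate c) in bits.&lt;br /&gt;
    pc = {}&lt;br /&gt;
    for k, p in pmf.items():&lt;br /&gt;
        pc[k[c]] = pc.get(k[c], 0.0) + p&lt;br /&gt;
    return sum(pcv * mi({(k[a], k[b]): p / pcv&lt;br /&gt;
                         for k, p in pmf.items() if k[c] == cval}, 0, 1)&lt;br /&gt;
               for cval, pcv in pc.items())&lt;br /&gt;
&lt;br /&gt;
print(cmi(pmf, 0, 2, 1))               # I(X;Z|Y) = 0, up to rounding&lt;br /&gt;
print(mi(pmf, 0, 2) &amp;lt;= mi(pmf, 0, 1))  # data processing: True&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;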
&lt;br /&gt;
==Interaction information==&lt;br /&gt;
{{main|Interaction information}}&lt;br /&gt;
The conditional mutual information is used to inductively define the &amp;#039;&amp;#039;&amp;#039;interaction information&amp;#039;&amp;#039;&amp;#039;, a generalization of mutual information, as follows:&lt;br /&gt;
:&amp;lt;math&amp;gt;I(X_1;\ldots;X_{n+1}) = I(X_1;\ldots;X_n) - I(X_1;\ldots;X_n|X_{n+1}),&amp;lt;/math&amp;gt;&lt;br /&gt;
where&lt;br /&gt;
:&amp;lt;math&amp;gt;I(X_1;\ldots;X_n|X_{n+1}) = \mathbb{E}_{X_{n+1}} [D_{\mathrm{KL}}( P_{(X_1,\ldots,X_n)|X_{n+1}} \| P_{X_1|X_{n+1}} \otimes\cdots\otimes P_{X_n|X_{n+1}} )].&amp;lt;/math&amp;gt;&lt;br /&gt;
Because the conditional mutual information can be greater than or less than its unconditional counterpart, the interaction information can be positive, negative, or zero, which makes it hard to interpret.&lt;br /&gt;
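&lt;br /&gt;
For the three-variable case and the pairwise-independent example above, the definition gives &amp;lt;math&amp;gt;I(X;Y;Z) = I(X;Y) - I(X;Y|Z) = 0 - 1 = -1&amp;lt;/math&amp;gt; bit, a standard illustration of a negative value (reusing the &amp;lt;code&amp;gt;mi&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;cmi&amp;lt;/code&amp;gt; sketches above):&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
# X and Z independent fair bits, Y = X XOR Z; mi and cmi defined earlier.&lt;br /&gt;
pmf = {(x, x ^ z, z): 0.25 for x in (0, 1) for z in (0, 1)}&lt;br /&gt;
print(mi(pmf, 0, 1) - cmi(pmf, 0, 1, 2))  # I(X;Y) - I(X;Y|Z) = -1.0&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;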
&lt;br /&gt;
==References==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[[Category:Information theory]]&lt;br /&gt;
[[Category:Entropy and information]]&lt;/div&gt;</summary>
		<author><name>128.29.17.11</name></author>
	</entry>
</feed>