Information content: Difference between revisions

From Wikipedia, the free encyclopedia
imported>Renamed user 9f73a25afea510471a19e014e88a5cf4
Monotonically decreasing function of probability: changed a word for the sake of clarity and consistency regarding the previous statement
imported>Lexiconaut
Remove Goodreads as a reference per WP:GOODREADS
 
Line 1: Line 1:
{{Short description|Quantity derived from the probability of a particular event occurring from a random variable}}
{{cleanup|reason=unclear terminology|date=June 2017}}


Line 13: Line 13:


== Definition ==
[[Claude Shannon]]'s definition of self-information was chosen to meet several [[Axiom|axioms]]:

* An event with probability 100% is perfectly unsurprising and yields no information.
* The less probable an event is, the more surprising it is and the more information it yields.
* If two independent events are measured separately, the total amount of information is the sum of the self-informations of the individual events.


The detailed derivation is below, but it can be shown that there is a unique function of probability that meets these three axioms, up to a multiplicative scaling factor. Broadly, given a real number <math>b>1</math> and an [[Event (probability theory)|event]] <math>x</math> with [[probability]] <math>P</math>, the information content is defined as the negative [[log probability]]:
<math display="block">\mathrm{I}(x) := - \log_b{\left[\Pr{\left(x\right)}\right]} = -\log_b{\left(P\right)}. </math>
The base <math>b</math> corresponds to the scaling factor above. Different choices of ''b'' correspond to different units of information: when <math>b=2</math>, the unit is the [[Shannon (unit)|shannon]] (symbol '''Sh'''), often called a 'bit'; when <math>b = e</math>, the unit is the [[Nat (unit)|natural unit of information]] (symbol '''nat'''); and when <math>b = 10</math>, the unit is the [[Hartley (unit)|hartley]] (symbol '''Hart''').

Formally, given a discrete random variable <math>X</math> with [[probability mass function]] <math>p_{X}{\left(x\right)}</math>, the self-information of measuring <math>X</math> as [[Outcome (probability)|outcome]] <math>x</math> is defined as:<ref name=":0">{{Cite book|title=Quantum Computing Explained|last=McMahon|first=David M.|publisher=Wiley-Interscience|year=2008|isbn=9780470181386 |location=Hoboken, NJ|oclc=608622533}}</ref>
<math display="block">\operatorname{I}_{X}(x) := - \log{\left[p_{X}{\left(x\right)}\right]} = \log{\left(\frac{1}{p_{X}{\left(x\right)}}\right)}. </math>
The use of the notation <math>I_X(x)</math> for self-information above is not universal. Since the notation <math>I(X;Y)</math> is also often used for the related quantity of [[mutual information]], many authors use a lowercase <math>h_X(x)</math> for self-entropy instead, mirroring the use of the capital <math>H(X)</math> for the entropy.


== Properties ==
{{Expand section|date=August 2025}}


=== Monotonically decreasing function of probability ===
For a given [[probability space]], the measurement of a rarer [[event (probability theory)|event]] is intuitively more "surprising", and yields more information content, than the measurement of a more "common" event. Thus, self-information is a [[Monotonic function|strictly decreasing monotonic function]] of the probability, sometimes called an "antitonic" function.<ref name="CoverThomas">{{cite book |last1=Cover |first1=T.M. |last2=Thomas |first2=J.A. |title=Elements of Information Theory |edition=2nd |publisher=Wiley-Interscience |year=2006 |isbn=978-0471241959 |page=20}}</ref>
 
While standard probabilities are represented by real numbers in the interval <math>[0, 1]</math>, self-information values are non-negative [[extended real number]]s in the interval <math>[0, \infty]</math>. Specifically:

* An event with probability <math>\Pr(x) = 1</math> (a certain event) has an information content of <math>\mathrm{I}(x) = -\log_b(1) = 0</math>. Its occurrence is perfectly unsurprising and reveals no new information.
* An event with probability <math>\Pr(x) = 0</math> (an impossible event) has an information content of <math>\mathrm{I}(x) = -\log_b(0)</math>, which is undefined but is taken to be <math>\infty</math> by [[limit (mathematics)|convention]]. This reflects that observing an event believed to be impossible would be infinitely surprising.<ref name="MacKay">{{cite book |last=MacKay |first=David J.C. |title=Information Theory, Inference, and Learning Algorithms |publisher=Cambridge University Press |year=2003 |isbn=978-0521642989 |page=32 |url=http://www.inference.org.uk/itila/}}</ref>

This monotonic relationship is fundamental to the use of information content as a measure of uncertainty. For example, learning that a one-in-a-million lottery ticket won provides far more information than learning that it lost (see also ''[[Lottery mathematics]]''). This also establishes an intuitive connection to concepts like [[statistical dispersion]]; events that are far from the mean or typical outcome (and thus have low probability in many common distributions) have high self-information.
=== Relationship to log-odds ===
The Shannon information is closely related to the [[log-odds]]. The log-odds of an event <math>x</math>, with probability <math>p(x)</math>, is defined as the logarithm of the [[odds]], <math>\frac{p(x)}{1-p(x)}</math>. This can be expressed as a difference of two information content values:
<math display="block">\begin{align} \text{log-odds}(x)
&= \log_b\left(\frac{p(x)}{1-p(x)}\right) \\
&= \log_b(p(x)) - \log_b(1-p(x)) \\
&= \mathrm{I}(\lnot x) - \mathrm{I}(x), \end{align}</math>
where <math>\lnot x</math> denotes the event ''not'' <math>x</math>.

This expression can be interpreted as the amount of information gained (or surprise) from learning the event did ''not'' occur, minus the information gained from learning it ''did'' occur. This connection is particularly relevant in [[statistical modeling]] where log-odds are the core of the [[logit]] function and [[logistic regression]].<ref name="Bishop">{{cite book |last=Bishop |first=Christopher M. |title=Pattern Recognition and Machine Learning |publisher=Springer |year=2006 |isbn=978-0387310732 |page=205}}</ref>
=== Additivity of independent events ===
The information content of two [[independent events]] is the sum of each event's information content. This property is known as [[Additive map|additivity]] in mathematics. Consider two [[independent random variables]] <math>X</math> and <math>Y</math> with [[probability mass function]]s <math>p_X(x)</math> and <math>p_Y(y)</math>. The [[joint probability]] of observing the outcome <math>(x, y)</math> is given by the product of the individual probabilities due to [[Independence (probability theory)|independence]]:
<math display="block"> p_{X, Y}(x, y) = \Pr(X=x,\, Y=y) = p_X(x) \, p_Y(y).</math>
The information content of this joint event is:
<math display="block">\begin{align} \operatorname{I}_{X,Y}(x, y)
&= -\log_b \left[p_{X,Y}(x, y)\right] \\
&= -\log_b \left[p_X(x) \, p_Y(y)\right] \\
&= -\log_b \left[p_X(x)\right] - \log_b \left[p_Y(y)\right] \\
&= \operatorname{I}_X(x) + \operatorname{I}_Y(y). \end{align}</math>
This additivity makes information content a more mathematically convenient measure than probability in many applications, such as in [[coding theory]], where the amount of information needed to describe a sequence of independent symbols is the sum of the information needed for each symbol.<ref name="CoverThomas" />

The corresponding property for [[likelihood]]s is that the [[log-likelihood]] of independent events is the sum of the log-likelihoods of each event. Interpreting log-likelihood as "support" or negative surprisal (the degree to which an event supports a given model: a model is supported by an event to the extent that the event is unsurprising, given the model), this states that independent events add support: the information that the two events together provide for [[statistical inference]] is the sum of their independent information.
==Relationship to entropy==
The [[Shannon entropy]] of the random variable <math>X</math> is [[Shannon entropy#Definition|defined as]]:
<math display="block">\begin{align}
\mathrm{H}(X) &= \sum_{x} -p_{X}(x) \log p_{X}(x) \\
&= \sum_{x} p_{X}(x) \, \operatorname{I}_X(x) \\
&\ {\overset{\underset{\mathrm{def}}{}}{=}}\ \operatorname{E}{\left[\operatorname{I}_X (X)\right]},
\end{align}</math>
by definition equal to the [[Expected value|expected]] information content of measurement of <math>X</math>.<ref>{{cite book|url=https://books.google.com/books?id=Lyte2yl1SPAC&pg=PA11|title=Fundamentals in Information Theory and Coding|author=Borda, Monica|publisher=Springer|year=2011|isbn=978-3-642-20346-6}}</ref>{{rp|11}}<ref>{{cite book|url=https://books.google.com/books?id=VpRESN24Zj0C&pg=PA19|title=Mathematics of Information and Coding|publisher=American Mathematical Society|year=2002|isbn=978-0-8218-4256-0|author1=Han, Te Sun |author2=Kobayashi, Kingo }}</ref>{{rp|19–20}}
The expectation is taken over the [[discrete random variable|discrete values]] over its [[Support (mathematics)|support]].


Sometimes, the entropy itself is called the "self-information" of the random variable, possibly because the entropy satisfies <math>\mathrm{H}(X) = \operatorname{I}(X; X)</math>, where <math>\operatorname{I}(X;X)</math> is the [[mutual information]] of <math>X</math> with itself.<ref>Thomas M. Cover, Joy A. Thomas; Elements of Information Theory; p. 20; 1991.</ref>

For [[Continuous Random Variables|continuous random variables]] the corresponding concept is [[differential entropy]].
Line 92: Line 75:
== Notes ==
This measure has also been called '''surprisal''', as it represents the "[[surprise (emotion)|surprise]]" of seeing the outcome (a highly improbable outcome is very surprising). This term (as a log-probability measure) was introduced by [[Edward W. Samson]] in his 1951 report "Fundamental natural concepts of information theory".<ref name="samson53">
{{Cite journal| volume = 10| issue = 4, Summer 1953, special issue on information theory| pages = 283–297| last = Samson| first = Edward W.| title = Fundamental natural concepts of information theory| journal = ETC: A Review of General Semantics| date = 1953| url = https://www.jstor.org/stable/42581366| jstor = 42581366|orig-date = Originally published October 1951 as Tech Report No. E5079, Air Force Cambridge Research Center}}
</ref><ref name="attneave">{{cite book | last=Attneave | first=Fred | title=Applications of Information Theory to Psychology: A Summary of Basic Concepts, Methods, and Results |publisher=Holt, Rinehart and Winston | publication-place=New York| edition=1 | date=1959 }}</ref> An early appearance in the Physics literature is in [[Myron Tribus]]' 1961 book ''Thermostatics and Thermodynamics''.<ref name="Bernstein1972">{{cite journal | url=https://aip.scitation.org/doi/abs/10.1063/1.1677983 | doi=10.1063/1.1677983 | title=Entropy and Chemical Change. I. Characterization of Product (And Reactant) Energy Distributions in Reactive Molecular Collisions: Information and Entropy Deficiency | date=1972 | last1=Bernstein | first1=R. B. | last2=Levine | first2=R. D. | journal=The Journal of Chemical Physics | volume=57 | issue=1 | pages=434–449 | bibcode=1972JChPh..57..434B | url-access=subscription }}</ref><ref name="Tribus1961">[http://www.eoht.info/page/Myron+Tribus Myron Tribus] (1961) '''Thermodynamics and Thermostatics:''' ''An Introduction to Energy, Information and States of Matter, with Engineering Applications'' (D. Van Nostrand, 24 West 40 Street, New York 18, New York, U.S.A) Tribus, Myron (1961), pp. 64–66 [https://archive.org/details/thermostaticsthe00trib borrow].</ref>


Line 167: Line 150:
\right\}</math> correspond to the event <math>C_k = 2</math> and a [[total probability]] of {{Sfrac|6}}.  These are the only events that are faithfully preserved with identity of which die rolled which outcome, because the outcomes are the same.  Without knowledge to distinguish the dice rolling the other numbers, the other <math display="inline"> \binom{6}{2} = 15</math> [[combination]]s correspond to one die rolling one number and the other die rolling a different number, each having probability {{Sfrac|18}}. Indeed, <math display="inline"> 6 \cdot \tfrac{1}{36} + 15 \cdot \tfrac{1}{18} = 1</math>, as required.


Unsurprisingly, the information content of learning that both dice were rolled as the same particular number is more than the information content of learning that one die was one number and the other was a different number.  Take for example the events <math> A_k = \{(X, Y) = (k, k)\}</math> and <math> B_{j, k} = \{c_j = 1\} \cap \{c_k = 1\}</math> for <math> j \ne k, 1 \leq j, k \leq 6</math>. For example, <math> A_2 = \{X = 2 \text{ and } Y = 2\}</math> and <math> B_{3, 4} = \{(3, 4), (4, 3)\}</math>.


The information contents are
Line 201: Line 184:
\end{cases}</math>


For the purposes of information theory, the values <math>s \in \mathcal{S}</math> do not have to be [[number]]s; they can be any [[Mutually exclusive#Probability|mutually exclusive]] [[Event (probability theory)|events]] on a [[measure space]] of [[finite measure]] that has been [[Normalization (statistics)|normalized]] to a [[probability measure]] <math>p</math>.  [[Without loss of generality]], we can assume the [[categorical distribution]] is supported on the set <math display="inline">[N] = \left\{1, 2, \dots, N \right\}</math>; the [[mathematical structure]] is [[Isomorphism|isomorphic]] in terms of [[probability theory]] and therefore [[information theory]] as well.


The information of the outcome <math>X = x</math> is given
Line 210: Line 193:


==Derivation==
By definition, information is transferred from an originating entity possessing the information to a receiving entity only when the receiver had not known the information [[A priori and a posteriori|a priori]]. If the receiving entity had previously known the content of a message with certainty before receiving the message, the amount of information of the message received is zero. Only when the advance knowledge of the content of the message by the receiver is less than 100% certain does the message actually convey information.


For example, quoting a character (the Hippy Dippy Weatherman) of comedian [[George Carlin]]:<blockquote>''Weather forecast for tonight: dark.''

''Continued dark overnight, with widely scattered light by morning.'' </blockquote>Assuming that one does not reside near the [[Polar regions of Earth|polar regions]], the amount of information conveyed in that forecast is zero because it is known, in advance of receiving the forecast, that darkness always comes with the night.


Accordingly, the amount of self-information <math>\operatorname{I}</math> contained in a message conveying an occurrence of [[event (probability theory)|event]] <math>\omega_n</math> depends only on the probability <math>\Pr(\omega_n)</math> of that event:
<math display="block">\operatorname{I}(\omega_n) = f(\Pr(\omega_n)), </math>
for some function <math>f</math> to be determined. If <math>\Pr(\omega_n) = 1</math>, then <math>\operatorname{I}(\omega_n) = 0</math>. If <math>\Pr(\omega_n) < 1</math>, then <math>\operatorname{I}(\omega_n) > 0</math>.

Further, by definition, the [[Measure (mathematics)|measure]] of self-information is nonnegative and additive. If an event <math>C</math> is the '''intersection''' of two [[statistical independence|independent]] events <math>A</math> and <math>B</math>, then the information of event <math>C</math> occurring is the '''sum''' of the amounts of information of the individual events <math>A</math> and <math>B</math>:
<math display="block">\operatorname{I}(C) = \operatorname{I}(A \cap B) = \operatorname{I}(A) + \operatorname{I}(B).</math>
Because of the independence of events <math>A</math> and <math>B</math>, the probability of event <math>C</math> is
<math display="block">\Pr(C) = \Pr(A \cap B) = \Pr(A) \cdot \Pr(B).</math>
Relating the probabilities to the function <math>f</math>:
<math display="block">f(\Pr(A) \cdot \Pr(B)) = f(\Pr(A)) + f(\Pr(B)).</math>
This is a [[Cauchy's functional equation|functional equation]]. The only continuous functions <math>f</math> with this property are the [[logarithm]] functions. Therefore, <math>f(p)</math> must be of the form
<math display="block">f(p) = K \log_b(p),</math>
for some base <math>b</math> and constant <math>K</math>. Since a low-probability event must correspond to high information content, the constant <math>K</math> must be negative. We can write <math>K = -1</math> and absorb any scaling into the base <math>b</math> of the logarithm. This gives the final form:
<math display="block">\operatorname{I}(\omega_n) = -\log_b(\Pr(\omega_n)) = \log_b \left(\frac{1}{\Pr(\omega_n)} \right). </math>
The smaller the probability of event <math>\omega_n</math>, the larger the quantity of self-information associated with the message that the event indeed occurred. If the above logarithm is base 2, the unit of <math>\operatorname{I}(\omega_n)</math> is the [[Shannon (unit)|shannon]]. This is the most common practice. When using the [[natural logarithm]] of base <math>e</math>, the unit will be the [[Nat (unit)|nat]]. For the base 10 logarithm, the unit of information is the [[Hartley (unit)|hartley]].

As a quick illustration, the information content associated with an outcome of 4 heads (or any specific outcome) in 4 consecutive tosses of a coin would be 4 shannons (probability 1/16), and the information content associated with getting a result other than the one specified would be <math>-\log_2(15/16) \approx 0.09</math> shannons. See above for detailed examples.

== See also ==
* [[Kolmogorov complexity]]
* [[Surprisal analysis]]
Line 260: Line 217:


== External links ==
* [http://www.umsl.edu/~fraundor/egsurpri.html Examples of surprisal measures]
* [https://web.archive.org/web/20120717011943/http://www.lecb.ncifcrf.gov/~toms/glossary.html#surprisal "Surprisal" entry in a glossary of molecular information theory]
* [http://ilab.usc.edu/surprise/ Bayesian Theory of Surprise]



Latest revision as of 02:04, 18 December 2025


In information theory, the information content, self-information, surprisal, or Shannon information is a basic quantity derived from the probability of a particular event occurring from a random variable. It can be thought of as an alternative way of expressing probability, much like odds or log-odds, but which has particular mathematical advantages in the setting of information theory.

The Shannon information can be interpreted as quantifying the level of "surprise" of a particular outcome. As it is such a basic quantity, it also appears in several other settings, such as the length of a message needed to transmit the event given an optimal source coding of the random variable.

The Shannon information is closely related to entropy, which is the expected value of the self-information of a random variable, quantifying how surprising the random variable is "on average". This is the average amount of self-information an observer would expect to gain about a random variable when measuring it.[1]

The information content can be expressed in various units of information, of which the most common is the "bit" (more formally called the shannon), as explained below.

The term 'perplexity' has been used in language modelling to quantify the uncertainty inherent in a set of prospective events.

Definition

Claude Shannon's definition of self-information was chosen to meet several axioms:

  • An event with probability 100% is perfectly unsurprising and yields no information.
  • The less probable an event is, the more surprising it is and the more information it yields.
  • If two independent events are measured separately, the total amount of information is the sum of the self-informations of the individual events.

The detailed derivation is below, but it can be shown that there is a unique function of probability that meets these three axioms, up to a multiplicative scaling factor. Broadly, given a real number b > 1 and an event x with probability P, the information content is defined as the negative log probability:
<math display="block">\mathrm{I}(x) := -\log_b\left[\Pr(x)\right] = -\log_b(P).</math>
The base b corresponds to the scaling factor above. Different choices of b correspond to different units of information: when b = 2, the unit is the shannon (symbol Sh), often called a 'bit'; when b = e, the unit is the natural unit of information (symbol nat); and when b = 10, the unit is the hartley (symbol Hart).

Formally, given a discrete random variable X with probability mass function p_X(x), the self-information of measuring X as outcome x is defined as:[2]
<math display="block">\operatorname{I}_X(x) := -\log\left[p_X(x)\right] = \log\left(\frac{1}{p_X(x)}\right).</math>
The use of the notation I_X(x) for self-information above is not universal. Since the notation I(X;Y) is also often used for the related quantity of mutual information, many authors use a lowercase h_X(x) for self-entropy instead, mirroring the use of the capital H(X) for the entropy.
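
As a concrete illustration of the definition above, the following short Python sketch computes the information content of an event from its probability; the function name information_content and the example probabilities are ours, chosen purely for illustration:

    import math

    def information_content(p, base=2):
        # Self-information -log_b(p): shannons for base 2, nats for base e, hartleys for base 10.
        return -math.log(p, base)

    print(information_content(1/2))           # 1.0 shannon (e.g. a fair coin toss)
    print(information_content(1/2, math.e))   # ~0.693 nats
    print(information_content(1/2, 10))       # ~0.301 hartleys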

Properties


Monotonically decreasing function of probability

For a given probability space, the measurement of a rarer event is intuitively more "surprising", and yields more information content, than the measurement of a more "common" event. Thus, self-information is a strictly decreasing monotonic function of the probability, sometimes called an "antitonic" function.[3]

While standard probabilities are represented by real numbers in the interval [0, 1], self-information values are non-negative extended real numbers in the interval [0, ∞]. Specifically:

  • An event with probability Pr(x) = 1 (a certain event) has an information content of I(x) = −log_b(1) = 0. Its occurrence is perfectly unsurprising and reveals no new information.
  • An event with probability Pr(x) = 0 (an impossible event) has an information content of I(x) = −log_b(0), which is undefined but is taken to be ∞ by convention. This reflects that observing an event believed to be impossible would be infinitely surprising.[4]

This monotonic relationship is fundamental to the use of information content as a measure of uncertainty. For example, learning that a one-in-a-million lottery ticket won provides far more information than learning that it lost (see also Lottery mathematics). This also establishes an intuitive connection to concepts like statistical dispersion; events that are far from the mean or typical outcome (and thus have low probability in many common distributions) have high self-information.
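
A minimal numerical check of this monotonic behaviour, using base 2 and a few arbitrary probabilities:

    import math

    for p in (1.0, 0.5, 0.1, 1e-6):
        print(p, -math.log2(p))   # ~0, 1, ~3.32, ~19.93 shannons: rarer events carry more information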

Relationship to log-odds

The Shannon information is closely related to the log-odds. The log-odds of an event x, with probability p(x), is defined as the logarithm of the odds, p(x)/(1 − p(x)). This can be expressed as a difference of two information content values:
<math display="block">\begin{align} \text{log-odds}(x) &= \log_b\left(\frac{p(x)}{1-p(x)}\right) \\ &= \log_b(p(x)) - \log_b(1-p(x)) \\ &= \mathrm{I}(\lnot x) - \mathrm{I}(x), \end{align}</math>
where ¬x denotes the event not x.

This expression can be interpreted as the amount of information gained (or surprise) from learning the event did not occur, minus the information gained from learning it did occur. This connection is particularly relevant in statistical modeling where log-odds are the core of the logit function and logistic regression.[5]
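
The identity above can be checked numerically; in this sketch the probability 0.8 is an arbitrary example value:

    import math

    p = 0.8
    log_odds = math.log2(p / (1 - p))                  # logarithm of the odds, base 2
    difference = -math.log2(1 - p) - (-math.log2(p))   # I(not x) - I(x)
    print(log_odds, difference)                        # both ~2.0 Sh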

Additivity of independent events

The information content of two independent events is the sum of each event's information content. This property is known as additivity in mathematics. Consider two independent random variables X and Y with probability mass functions p_X(x) and p_Y(y). The joint probability of observing the outcome (x, y) is given by the product of the individual probabilities due to independence:
<math display="block">p_{X,Y}(x, y) = \Pr(X=x,\, Y=y) = p_X(x)\, p_Y(y).</math>
The information content of this joint event is:
<math display="block">\begin{align} \operatorname{I}_{X,Y}(x, y) &= -\log_b\left[p_{X,Y}(x, y)\right] \\ &= -\log_b\left[p_X(x)\, p_Y(y)\right] \\ &= -\log_b\left[p_X(x)\right] - \log_b\left[p_Y(y)\right] \\ &= \operatorname{I}_X(x) + \operatorname{I}_Y(y). \end{align}</math>
This additivity makes information content a more mathematically convenient measure than probability in many applications, such as in coding theory, where the amount of information needed to describe a sequence of independent symbols is the sum of the information needed for each symbol.[3]
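
A quick numerical illustration of additivity, assuming two independent events with probabilities 1/6 and 1/2 (arbitrary example values):

    import math

    p_x, p_y = 1/6, 1/2
    joint = p_x * p_y                             # independence: probabilities multiply
    print(-math.log2(joint))                      # ~3.585 Sh
    print(-math.log2(p_x) + -math.log2(p_y))      # same value: information contents add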

The corresponding property for likelihoods is that the log-likelihood of independent events is the sum of the log-likelihoods of each event. Interpreting log-likelihood as "support" or negative surprisal (the degree to which an event supports a given model: a model is supported by an event to the extent that the event is unsurprising, given the model), this states that independent events add support: the information that the two events together provide for statistical inference is the sum of their independent information.

Relationship to entropy

The Shannon entropy of the random variable X is defined as:
<math display="block">\begin{align} \mathrm{H}(X) &= \sum_{x} -p_X(x) \log p_X(x) \\ &= \sum_{x} p_X(x)\, \operatorname{I}_X(x) \\ &\overset{\mathrm{def}}{=}\ \operatorname{E}\left[\operatorname{I}_X(X)\right], \end{align}</math>
by definition equal to the expected information content of measurement of X.[6]: 11 [7]: 19–20 

The expectation is taken over the discrete values over its support.
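
For example, the entropy of a distribution can be computed directly as the probability-weighted average of the self-information of each outcome; the biased coin below is an arbitrary example distribution:

    import math

    pmf = {"H": 0.9, "T": 0.1}
    entropy = sum(p * -math.log2(p) for p in pmf.values())   # E[I_X(X)]
    print(entropy)                                            # ~0.469 Sh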

Sometimes, the entropy itself is called the "self-information" of the random variable, possibly because the entropy satisfies H(X)=I(X;X), where I(X;X) is the mutual information of X with itself.[8]

For continuous random variables the corresponding concept is differential entropy.

Notes

This measure has also been called surprisal, as it represents the "surprise" of seeing the outcome (a highly improbable outcome is very surprising). This term (as a log-probability measure) was introduced by Edward W. Samson in his 1951 report "Fundamental natural concepts of information theory".[9][10] An early appearance in the Physics literature is in Myron Tribus' 1961 book Thermostatics and Thermodynamics.[11][12]

When the event is a random realization (of a variable) the self-information of the variable is defined as the expected value of the self-information of the realization.

Examples

Fair coin toss

Consider the Bernoulli trial of tossing a fair coin X. The probabilities of the events of the coin landing as heads H and tails T (see fair coin and obverse and reverse) are one half each, p_X(H) = p_X(T) = 1/2 = 0.5. Upon measuring the variable as heads, the associated information gain is
<math display="block">\operatorname{I}_X(\text{H}) = -\log_2 p_X(\text{H}) = -\log_2\!\tfrac{1}{2} = 1,</math>
so the information gain of a fair coin landing as heads is 1 shannon.[2] Likewise, the information gain of measuring tails T is
<math display="block">\operatorname{I}_X(\text{T}) = -\log_2 p_X(\text{T}) = -\log_2\!\tfrac{1}{2} = 1 \text{ Sh}.</math>

Fair die roll

Suppose we have a fair six-sided die. The value of a die roll is a discrete uniform random variable X ~ DU[1, 6] with probability mass function
<math display="block">p_X(k) = \begin{cases} \frac{1}{6}, & k \in \{1, 2, 3, 4, 5, 6\} \\ 0, & \text{otherwise} \end{cases}</math>
The probability of rolling a 4 is p_X(4) = 1/6, as for any other valid roll. The information content of rolling a 4 is thus
<math display="block">\operatorname{I}_X(4) = -\log_2 p_X(4) = -\log_2\!\tfrac{1}{6} \approx 2.585 \text{ Sh}</math>
of information.
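
The same calculation in a couple of lines of Python, as a sanity check:

    import math

    print(-math.log2(1/6))   # ~2.585 Sh for any particular face of a fair die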

Two independent, identically distributed dice

Suppose we have two independent, identically distributed random variables X, Y ~ DU[1, 6], each corresponding to an independent fair 6-sided dice roll. The joint distribution of X and Y is
<math display="block">p_{X,Y}(x, y) = \Pr(X = x,\, Y = y) = p_X(x)\, p_Y(y) = \begin{cases} \frac{1}{36}, & x, y \in \{1, \dots, 6\} \\ 0, & \text{otherwise.} \end{cases}</math>

The information content of the random variate (X, Y) = (2, 4) is
<math display="block">\operatorname{I}_{X,Y}(2, 4) = -\log_2\left[p_{X,Y}(2, 4)\right] = \log_2 36 = 2\log_2 6 \approx 5.169925 \text{ Sh},</math>
and can also be calculated by additivity of events:
<math display="block">\operatorname{I}_{X,Y}(2, 4) = -\log_2\left[p_{X,Y}(2, 4)\right] = -\log_2\left[p_X(2)\right] - \log_2\left[p_Y(4)\right] = 2\log_2 6 \approx 5.169925 \text{ Sh}.</math>

Information from frequency of rolls

If we receive information about the value of the dice without knowledge of which die had which value, we can formalize the approach with so-called counting variables
<math display="block">C_k := \delta_k(X) + \delta_k(Y) = \begin{cases} 0, & \neg\,(X = k \vee Y = k) \\ 1, & X = k\ \veebar\ Y = k \\ 2, & X = k\ \wedge\ Y = k \end{cases}</math>
for k ∈ {1, 2, 3, 4, 5, 6}. Then C_1 + ⋯ + C_6 = 2 and the counts have the multinomial distribution
<math display="block">\begin{align} f(c_1, \ldots, c_6) &= \Pr(C_1 = c_1 \text{ and } \dots \text{ and } C_6 = c_6) \\ &= \begin{cases} \frac{1}{18}\,\frac{1}{c_1! \cdots c_6!}, & \text{when } \sum_{i=1}^{6} c_i = 2 \\ 0, & \text{otherwise} \end{cases} \\ &= \begin{cases} \frac{1}{18}, & \text{when two of the } c_k \text{ are } 1 \\ \frac{1}{36}, & \text{when exactly one } c_k = 2 \\ 0, & \text{otherwise.} \end{cases} \end{align}</math>

To verify this, the 6 outcomes (X, Y) ∈ {(1,1), (2,2), (3,3), (4,4), (5,5), (6,6)} correspond to the event C_k = 2 and have a total probability of 1/6. These are the only events that are faithfully preserved with identity of which die rolled which outcome, because the outcomes are the same. Without knowledge to distinguish the dice rolling the other numbers, the other (6 choose 2) = 15 combinations correspond to one die rolling one number and the other die rolling a different number, each having probability 1/18. Indeed, 6 · (1/36) + 15 · (1/18) = 1, as required.

Unsurprisingly, the information content of learning that both dice were rolled as the same particular number is more than the information content of learning that one die was one number and the other was a different number. Take for example the events A_k = {(X, Y) = (k, k)} and B_{j,k} = {c_j = 1} ∩ {c_k = 1} for j ≠ k, 1 ≤ j, k ≤ 6. For example, A_2 = {X = 2 and Y = 2} and B_{3,4} = {(3, 4), (4, 3)}.

The information contents are
<math display="block">\operatorname{I}(A_2) = -\log_2\!\tfrac{1}{36} = 5.169925 \text{ Sh}</math>
<math display="block">\operatorname{I}(B_{3,4}) = -\log_2\!\tfrac{1}{18} = 4.169925 \text{ Sh}.</math>

Let Same = A_1 ∪ ⋯ ∪ A_6 be the event that both dice rolled the same value, and let Diff = ¬Same be the event that the dice differed. Then Pr(Same) = 1/6 and Pr(Diff) = 5/6. The information contents of the events are
<math display="block">\operatorname{I}(\text{Same}) = -\log_2\!\tfrac{1}{6} = 2.5849625 \text{ Sh}</math>
<math display="block">\operatorname{I}(\text{Diff}) = -\log_2\!\tfrac{5}{6} = 0.2630344 \text{ Sh}.</math>
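
These probabilities and information contents can be reproduced by brute-force enumeration of the 36 equally likely ordered outcomes; the variable names below are ours:

    import math
    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))              # all ordered rolls (X, Y)
    p_same = sum(1 for x, y in outcomes if x == y) / 36           # 6/36 = 1/6
    p_34 = sum(1 for x, y in outcomes if {x, y} == {3, 4}) / 36   # 2/36 = 1/18
    print(-math.log2(p_same))      # ~2.585 Sh  (event "Same")
    print(-math.log2(1 - p_same))  # ~0.263 Sh  (event "Diff")
    print(-math.log2(p_34))        # ~4.170 Sh  (event B_{3,4})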

Information from sum of dice

The probability mass or density function (collectively probability measure) of the sum of two independent random variables is the convolution of each probability measure. In the case of independent fair 6-sided dice rolls, the random variable Z = X + Y has probability mass function
<math display="block">p_Z(z) = p_X(x) * p_Y(y) = \frac{6 - |z - 7|}{36},</math>
where * represents the discrete convolution. The outcome Z = 5 has probability p_Z(5) = 4/36 = 1/9. Therefore, the information asserted is
<math display="block">\operatorname{I}_Z(5) = -\log_2\!\tfrac{1}{9} = \log_2 9 \approx 3.169925 \text{ Sh}.</math>
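
The convolution and the resulting information content can likewise be checked by enumeration; this is a sketch, and the Counter-based tally is just one convenient way to do it:

    import math
    from collections import Counter
    from itertools import product

    sums = Counter(x + y for x, y in product(range(1, 7), repeat=2))
    p_z5 = sums[5] / 36           # 4/36 = 1/9
    print(-math.log2(p_z5))       # ~3.170 Sh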

General discrete uniform distribution

Generalizing the fair die roll example above, consider a general discrete uniform random variable (DURV) X ~ DU[a, b], with a and b integers and b ≥ a. For convenience, define N := b − a + 1. The probability mass function is
<math display="block">p_X(k) = \begin{cases} \frac{1}{N}, & k \in [a, b] \\ 0, & \text{otherwise.} \end{cases}</math>
In general, the values of the DURV need not be integers, or for the purposes of information theory even uniformly spaced; they need only be equiprobable.[2] The information gain of any observation X = k is
<math display="block">\operatorname{I}_X(k) = -\log_2\!\tfrac{1}{N} = \log_2 N \text{ Sh}.</math>
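
For instance, with N equiprobable outcomes the information gain is log2 N shannons regardless of which outcome is observed:

    import math

    N = 6
    print(-math.log2(1 / N))   # log2(6) ~ 2.585 Sh; N = 1 would give 0 Sh (the constant case below)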

Special case: constant random variable

If b = a above, X degenerates to a constant random variable with probability distribution deterministically given by X = b and probability measure the Dirac measure p_X(k) = δ_b(k). The only value X can take is deterministically b, so the information content of any measurement of X is
<math display="block">\operatorname{I}_X(b) = -\log_2 1 = 0.</math>
In general, there is no information gained from measuring a known value.[2]

Categorical distribution

Generalizing all of the above cases, consider a categorical discrete random variable with support 𝒮 = {s_1, …, s_N} and probability mass function given by

<math display="block">p_X(k) = \begin{cases} p_i, & k = s_i \in \mathcal{S} \\ 0, & \text{otherwise.} \end{cases}</math>

For the purposes of information theory, the values s ∈ 𝒮 do not have to be numbers; they can be any mutually exclusive events on a measure space of finite measure that has been normalized to a probability measure p. Without loss of generality, we can assume the categorical distribution is supported on the set [N] = {1, 2, …, N}; the mathematical structure is isomorphic in terms of probability theory and therefore information theory as well.

The information of the outcome X=x is given

<math display="block">\operatorname{I}_X(x) = -\log_2 p_X(x).</math>

From these examples, it is possible to calculate the information of any set of independent DRVs with known distributions by additivity.
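
A small sketch for the categorical case, where the distribution is represented as a dictionary from outcomes to probabilities; the weather categories and their probabilities are made up for illustration:

    import math

    def surprisal(pmf, outcome, base=2):
        # Information content of one outcome of a categorical random variable.
        return -math.log(pmf[outcome], base)

    weather = {"sunny": 0.7, "cloudy": 0.2, "rain": 0.1}
    print(surprisal(weather, "sunny"))   # ~0.515 Sh
    print(surprisal(weather, "rain"))    # ~3.322 Sh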

Derivation

By definition, information is transferred from an originating entity possessing the information to a receiving entity only when the receiver had not known the information a priori. If the receiving entity had previously known the content of a message with certainty before receiving the message, the amount of information of the message received is zero. Only when the advance knowledge of the content of the message by the receiver is less than 100% certain does the message actually convey information.

For example, quoting a character (the Hippy Dippy Weatherman) of comedian George Carlin:

Weather forecast for tonight: dark. Continued dark overnight, with widely scattered light by morning.

Assuming that one does not reside near the polar regions, the amount of information conveyed in that forecast is zero because it is known, in advance of receiving the forecast, that darkness always comes with the night.

Accordingly, the amount of self-information I contained in a message conveying an occurrence of event ω_n depends only on the probability Pr(ω_n) of that event:
<math display="block">\operatorname{I}(\omega_n) = f(\Pr(\omega_n)),</math>
for some function f to be determined. If Pr(ω_n) = 1, then I(ω_n) = 0. If Pr(ω_n) < 1, then I(ω_n) > 0.

Further, by definition, the measure of self-information is nonnegative and additive. If an event C is the intersection of two independent events A and B, then the information of event C occurring is the sum of the amounts of information of the individual events A and B:
<math display="block">\operatorname{I}(C) = \operatorname{I}(A \cap B) = \operatorname{I}(A) + \operatorname{I}(B).</math>
Because of the independence of events A and B, the probability of event C is
<math display="block">\Pr(C) = \Pr(A \cap B) = \Pr(A) \cdot \Pr(B).</math>
Relating the probabilities to the function f:
<math display="block">f(\Pr(A) \cdot \Pr(B)) = f(\Pr(A)) + f(\Pr(B)).</math>
This is a functional equation. The only continuous functions f with this property are the logarithm functions. Therefore, f(p) must be of the form
<math display="block">f(p) = K \log_b(p),</math>
for some base b and constant K. Since a low-probability event must correspond to high information content, the constant K must be negative. We can write K = −1 and absorb any scaling into the base b of the logarithm. This gives the final form:
<math display="block">\operatorname{I}(\omega_n) = -\log_b(\Pr(\omega_n)) = \log_b\left(\frac{1}{\Pr(\omega_n)}\right).</math>
The smaller the probability of event ω_n, the larger the quantity of self-information associated with the message that the event indeed occurred. If the above logarithm is base 2, the unit of I(ω_n) is the shannon. This is the most common practice. When using the natural logarithm of base e, the unit will be the nat. For the base 10 logarithm, the unit of information is the hartley.
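
The functional equation f(pq) = f(p) + f(q) can be spot-checked numerically for the logarithmic form obtained above; the probabilities 0.3 and 0.25 are arbitrary test values:

    import math

    f = lambda p: -math.log2(p)   # K = -1, base 2
    p, q = 0.3, 0.25
    print(f(p * q))               # ~3.737
    print(f(p) + f(q))            # same value, as the derivation requires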

As a quick illustration, the information content associated with an outcome of 4 heads (or any specific outcome) in 4 consecutive tosses of a coin would be 4 shannons (probability 1/16), and the information content associated with getting a result other than the one specified would be −log_2(15/16) ≈ 0.09 shannons. See above for detailed examples.
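
The two numbers in this illustration follow directly from the definition:

    import math

    p_specific = (1/2) ** 4             # one particular sequence of 4 fair-coin tosses
    print(-math.log2(p_specific))       # 4.0 Sh
    print(-math.log2(1 - p_specific))   # ~0.093 Sh for "any other result"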

See also

  • Kolmogorov complexity
  • Surprisal analysis

References


  1. Jones, D.S. (1979). Elementary Information Theory. Oxford: Clarendon Press. pp. 11–15.
  2. McMahon, David M. (2008). Quantum Computing Explained. Hoboken, NJ: Wiley-Interscience. ISBN 9780470181386. OCLC 608622533.
  3. Cover, T.M.; Thomas, J.A. (2006). Elements of Information Theory (2nd ed.). Wiley-Interscience. p. 20. ISBN 978-0471241959.
  4. MacKay, David J.C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press. p. 32. ISBN 978-0521642989.
  5. Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer. p. 205. ISBN 978-0387310732.
  6. Borda, Monica (2011). Fundamentals in Information Theory and Coding. Springer. ISBN 978-3-642-20346-6.
  7. Han, Te Sun; Kobayashi, Kingo (2002). Mathematics of Information and Coding. American Mathematical Society. ISBN 978-0-8218-4256-0.
  8. Thomas M. Cover, Joy A. Thomas; Elements of Information Theory; p. 20; 1991.
  9. Samson, Edward W. (1953). "Fundamental natural concepts of information theory". ETC: A Review of General Semantics. 10 (4): 283–297. JSTOR 42581366. Originally published October 1951 as Tech Report No. E5079, Air Force Cambridge Research Center.
  10. Attneave, Fred (1959). Applications of Information Theory to Psychology: A Summary of Basic Concepts, Methods, and Results (1st ed.). New York: Holt, Rinehart and Winston.
  11. Bernstein, R. B.; Levine, R. D. (1972). "Entropy and Chemical Change. I. Characterization of Product (And Reactant) Energy Distributions in Reactive Molecular Collisions: Information and Entropy Deficiency". The Journal of Chemical Physics. 57 (1): 434–449. doi:10.1063/1.1677983.
  12. Tribus, Myron (1961). Thermodynamics and Thermostatics: An Introduction to Energy, Information and States of Matter, with Engineering Applications. New York: D. Van Nostrand. pp. 64–66.


Further reading

External links

  • Examples of surprisal measures: http://www.umsl.edu/~fraundor/egsurpri.html
  • "Surprisal" entry in a glossary of molecular information theory: https://web.archive.org/web/20120717011943/http://www.lecb.ncifcrf.gov/~toms/glossary.html#surprisal
  • Bayesian Theory of Surprise: http://ilab.usc.edu/surprise/
