Shannon's source coding theorem


In information theory, Shannon's source coding theorem (or noiseless coding theorem) establishes the statistical limits to possible data compression for data whose source is an independent identically-distributed random variable, and the operational meaning of the Shannon entropy.

Named after Claude Shannon, the source coding theorem shows that, in the limit as the length of a stream of independent and identically distributed (i.i.d.) data tends to infinity, it is impossible to compress such data so that the code rate (average number of bits per symbol) is less than the Shannon entropy of the source without it being virtually certain that information will be lost. However, it is possible to get the code rate arbitrarily close to the Shannon entropy with negligible probability of loss.

The source coding theorem for symbol codes places an upper and a lower bound on the minimal possible expected length of codewords as a function of the entropy of the input word (which is viewed as a random variable) and of the size of the target alphabet.

Note that, for data that exhibits more dependencies (whose source is not an i.i.d. random variable), the Kolmogorov complexity, which quantifies the minimal description length of an object, is more suitable to describe the limits of data compression. Shannon entropy takes into account only frequency regularities while Kolmogorov complexity takes into account all algorithmic regularities, so in general the latter is smaller. On the other hand, if an object is generated by a random process in such a way that it has only frequency regularities, entropy is close to complexity with high probability (Shen et al. 2017).^[1]

Statements

Source coding is a mapping from (a sequence of) symbols from an information source to a sequence of alphabet symbols (usually bits) such that the source symbols can be exactly recovered from the alphabet symbols (lossless source coding) or recovered within some distortion (lossy source coding). This is one approach to data compression.

Source coding theorem

In information theory, the source coding theorem (Shannon 1948)^[2] informally states that (MacKay 2003, pg. 81,^[3] Cover 2006, Chapter 5^[4]):

$N$ i.i.d. random variables each with entropy $H (X)$ can be compressed into more than $N H (X)$ bits with negligible risk of information loss, as $N \to \infty$; but conversely, if they are compressed into fewer than $N H (X)$ bits it is virtually certain that information will be lost.

The coded sequence of length $N H (X)$ represents the compressed message in a biunivocal (one-to-one) way, under the assumption that the decoder knows the source. From a practical point of view, this assumption is not always true. Consequently, when entropy encoding is applied, the transmitted message may need to include information characterizing the source, usually inserted at the beginning of the transmitted message.
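As a rough numerical illustration of the $N H (X)$ figure, the following sketch computes the entropy of a discrete distribution and the corresponding asymptotic bit budget. The function names `entropy_bits` and `min_bits` and the biased-coin source are assumptions made for this example only; they do not come from the cited references.

```python
import math

def entropy_bits(probs):
    """Shannon entropy H(X), in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def min_bits(probs, n):
    """Asymptotic bit budget N * H(X) for N i.i.d. symbols."""
    return n * entropy_bits(probs)

# Hypothetical source: a coin that lands heads 90% of the time.
biased_coin = [0.9, 0.1]
h = entropy_bits(biased_coin)         # a little under 0.47 bits per flip
budget = min_bits(biased_coin, 1000)  # a little under 470 bits for 1000 flips
```

The budget is a limit that is only approached as $N$ grows; for any finite $N$ a practical code needs somewhat more bits.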

Source coding theorem for symbol codes

Let $Σ_{1}, Σ_{2}$ denote two finite alphabets and let $Σ_{1}^{*}$ and $Σ_{2}^{*}$ denote the set of all finite words from those alphabets (respectively).

Suppose that $X$ is a random variable taking values in $Σ_{1}$ and let $f$ be a uniquely decodable code from $Σ_{1}^{*}$ to $Σ_{2}^{*}$ where $|Σ_{2}| = a$. Let $S$ denote the random variable given by the length of codeword $f (X)$.

If $f$ is optimal in the sense that it has the minimal expected word length for $X$, then (Shannon 1948):

\frac{H (X)}{\log_{2} a} \leq 𝔼 [S] < \frac{H (X)}{\log_{2} a} + 1

where $𝔼$ denotes the expected value operator.
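The bound can be checked numerically for a binary code alphabet ($a = 2$). The sketch below uses Huffman's algorithm, which is not named in the text above but is a standard way to build a minimal-expected-length prefix code; the helper name `huffman_lengths` and the example distribution are assumptions for illustration.

```python
import heapq
import math

def huffman_lengths(probs):
    """Binary Huffman codeword lengths for the given symbol probabilities."""
    if len(probs) == 1:
        return [1]
    # Heap items: (subtree probability, unique tie-breaker, symbols in subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1  # each merge adds one bit to every symbol below it
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

# Hypothetical dyadic source; entropy and expected length coincide here.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = huffman_lengths(probs)
entropy = -sum(p * math.log2(p) for p in probs)
expected_length = sum(p * s for p, s in zip(probs, lengths))
```

For this dyadic distribution the lower bound is met with equality; for other distributions the expected length falls strictly between $H (X)$ and $H (X) + 1$.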

Proof: source coding theorem

Given $X$ is an i.i.d. source, its time series $X_{1}, \dots, X_{n}$ is i.i.d. with entropy $H (X)$ in the discrete-valued case and differential entropy in the continuous-valued case. The source coding theorem states that for any $ε > 0$, i.e. for any rate $H (X) + ε$ larger than the entropy of the source, there is a large enough $n$ and an encoder that takes $n$ i.i.d. repetitions of the source, $X^{1:n}$, and maps them to $n (H (X) + ε)$ binary bits such that the source symbols $X^{1:n}$ are recoverable from the binary bits with probability of at least $1 - ε$.

Proof of achievability. Fix some $ε > 0$, and let

p (x_{1}, \dots, x_{n}) = \Pr [X_{1} = x_{1}, \dots, X_{n} = x_{n}] .

The typical set, $A_{n}^{ε}$, is defined as follows:

A_{n}^{ε} = \left\{ (x_{1}, \dots, x_{n}) : \left| - \frac{1}{n} \log p (x_{1}, \dots, x_{n}) - H (X) \right| < ε \right\} .

The asymptotic equipartition property (AEP) shows that for large enough $n$, the probability that a sequence generated by the source lies in the typical set, $A_{n}^{ε}$, as defined approaches one. In particular, for sufficiently large $n$, $P ((X_{1}, X_{2}, \dots, X_{n}) \in A_{n}^{ε})$ can be made arbitrarily close to 1, and specifically, greater than $1 - ε$ (see AEP for a proof).

The definition of typical sets implies that those sequences that lie in the typical set satisfy:

2^{- n (H (X) + ε)} \leq p (x_{1}, \dots, x_{n}) \leq 2^{- n (H (X) - ε)}

The probability of a sequence $(X_{1}, X_{2}, \dots, X_{n})$ being drawn from $A_{n}^{ε}$ is greater than $1 - ε$.
$| A_{n}^{ε} | \leq 2^{n (H (X) + ε)}$, which follows from the left hand side (lower bound) for $p (x_{1}, x_{2}, \dots, x_{n})$.
$| A_{n}^{ε} | \geq (1 - ε) 2^{n (H (X) - ε)}$, which follows from the upper bound for $p (x_{1}, x_{2}, \dots, x_{n})$ and the lower bound on the total probability of the whole set $A_{n}^{ε}$.
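These three properties can be checked numerically for a simple source. The sketch below assumes a Bernoulli($p$) source (an illustrative choice, not part of the proof) and exploits the fact that all sequences with the same number of ones are equiprobable, so the typical set can be counted without enumerating every sequence. The name `typical_set_stats` and the parameter values are hypothetical.

```python
import math

def typical_set_stats(p, n, eps):
    """Entropy, size and probability of the typical set for n i.i.d. Bernoulli(p) bits."""
    h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    size, prob = 0, 0.0
    for k in range(n + 1):  # k = number of ones in the sequence
        log_p_seq = k * math.log2(p) + (n - k) * math.log2(1 - p)
        if abs(-log_p_seq / n - h) < eps:
            count = math.comb(n, k)  # sequences with exactly k ones
            size += count
            prob += count * 2.0 ** log_p_seq
    return h, size, prob

# Assumed example parameters.
n, eps = 500, 0.1
h, size, prob = typical_set_stats(p=0.3, n=n, eps=eps)
upper = 2 ** (n * (h + eps))              # bound on |A| from above
lower = (1 - eps) * 2 ** (n * (h - eps))  # bound on |A| from below
```

With these parameters the typical set carries almost all of the probability while containing only a small fraction of the $2^{n}$ possible sequences.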

Since $| A_{n}^{ε} | \leq 2^{n (H (X) + ε)}$, $n (H (X) + ε)$ bits are enough to point to any string in this set.

The encoding algorithm: the encoder checks if the input sequence lies within the typical set; if yes, it outputs the index of the input sequence within the typical set; if not, the encoder outputs an arbitrary $n (H (X) + ε)$ digit number. As long as the input sequence lies within the typical set (with probability at least $1 - ε$), the encoder does not make any error. So, the probability of error of the encoder is bounded above by $ε$.
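The encoding algorithm can be sketched for a toy case as follows. This is a minimal illustration assuming a Bernoulli($p$) source and a block length small enough to enumerate every sequence; the names `build_typical_codebook`, `encode` and `decode` are hypothetical, and the arbitrary word emitted for atypical inputs is taken to be index 0.

```python
import itertools
import math

def build_typical_codebook(p, n, eps):
    """Enumerate the typical set of n i.i.d. Bernoulli(p) bits (small n only)."""
    h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    typical = []
    for seq in itertools.product((0, 1), repeat=n):
        ones = sum(seq)
        log_p = ones * math.log2(p) + (n - ones) * math.log2(1 - p)
        if abs(-log_p / n - h) < eps:
            typical.append(seq)
    width = math.ceil(n * (h + eps))  # codeword length in bits
    index = {seq: i for i, seq in enumerate(typical)}
    return typical, index, width

def encode(seq, index, width):
    """Fixed-width index of seq in the typical set; atypical input maps to index 0."""
    return format(index.get(tuple(seq), 0), f"0{width}b")

def decode(bits, typical):
    """Look the index back up in the typical set."""
    return typical[int(bits, 2)]

# Assumed toy parameters: 16 source bits are coded with 11 bits.
typical, index, width = build_typical_codebook(p=0.1, n=16, eps=0.2)
message = (1,) + (0,) * 15  # one 1 in 16 bits: a typical sequence
recovered = decode(encode(message, index, width), typical)
```

At such a short block length the typical set covers only part of the probability mass, so decoding errors are frequent; the theorem concerns the regime where $n$ is large enough for that mass to exceed $1 - ε$.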

Proof of converse: the converse is proved by showing that any set of size smaller than $A_{n}^{ε}$ (in the sense of exponent) would cover a set of probability bounded away from $1$.

Proof: Source coding theorem for symbol codes

For $1 \leq i \leq n$ let $s_{i}$ denote the word length of each possible $x_{i}$. Define $q_{i} = a^{- s_{i}} / C$, where $C$ is chosen so that $q_{1} + \dots + q_{n} = 1$. Then

\begin{aligned} H (X) & = - \sum_{i = 1}^{n} p_{i} \log_{2} p_{i} \\ & \leq - \sum_{i = 1}^{n} p_{i} \log_{2} q_{i} \\ & = - \sum_{i = 1}^{n} p_{i} \log_{2} a^{- s_{i}} + \sum_{i = 1}^{n} p_{i} \log_{2} C \\ & = - \sum_{i = 1}^{n} p_{i} \log_{2} a^{- s_{i}} + \log_{2} C \\ & \leq - \sum_{i = 1}^{n} - s_{i} p_{i} \log_{2} a \\ & = 𝔼 [S] \log_{2} a \end{aligned}

where the second line follows from Gibbs' inequality and the fifth line follows from Kraft's inequality:

C = \sum_{i = 1}^{n} a^{- s_{i}} \leq 1

so $\log C \leq 0$.

For the second inequality we may set

s_{i} = ⌈ - \log_{a} p_{i} ⌉

so that

- \log_{a} p_{i} \leq s_{i} < - \log_{a} p_{i} + 1

and so

a^{- s_{i}} \leq p_{i}

and

\sum a^{- s_{i}} \leq \sum p_{i} = 1

and so by Kraft's inequality there exists a prefix-free code having those word lengths. Thus the minimal $S$ satisfies

\begin{aligned} 𝔼 [S] & = \sum p_{i} s_{i} \\ & < \sum p_{i} (- \log_{a} p_{i} + 1) \\ & = \sum - p_{i} \frac{\log_{2} p_{i}}{\log_{2} a} + 1 \\ & = \frac{H (X)}{\log_{2} a} + 1 \end{aligned}
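The construction used for the upper bound can be tried out directly. The sketch below computes the word lengths $s_{i} = ⌈ - \log_{a} p_{i} ⌉$, the Kraft sum, and the resulting expected length; the function names and the example distribution are assumptions for illustration. Floating-point logarithms can round badly when some $p_{i}$ is an exact power of $1 / a$, so the example avoids such probabilities.

```python
import math

def shannon_code_lengths(probs, a=2):
    """Word lengths s_i = ceil(-log_a p_i) for a code alphabet of size a."""
    return [math.ceil(-math.log(p) / math.log(a)) for p in probs]

def kraft_sum(lengths, a=2):
    """Left-hand side of Kraft's inequality, sum of a^(-s_i)."""
    return sum(a ** -s for s in lengths)

def expected_length(probs, lengths):
    """Expected codeword length E[S]."""
    return sum(p * s for p, s in zip(probs, lengths))

# Hypothetical four-symbol source.
probs = [0.4, 0.3, 0.2, 0.1]
entropy = -sum(p * math.log2(p) for p in probs)

lengths_binary = shannon_code_lengths(probs, a=2)   # [2, 2, 3, 4]
lengths_ternary = shannon_code_lengths(probs, a=3)  # [1, 2, 2, 3]
```

In both cases the Kraft sum is below 1, so a prefix-free code with these lengths exists, and the expected length stays below $H (X) / \log_{2} a + 1$. These lengths are generally not optimal; they only witness the upper bound.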

Extension to non-stationary independent sources

Fixed rate lossless source coding for discrete time non-stationary independent sources

Define the typical set $A_{n}^{ε}$ as:

A_{n}^{ε} = \left\{ x_{1}^{n} : \left| - \frac{1}{n} \log p (x_{1}, \dots, x_{n}) - \overline{H_{n}} (X) \right| < ε \right\} .

Then, for given $δ > 0$, for $n$ large enough, $\Pr (A_{n}^{ε}) > 1 - δ$. Now we just encode the sequences in the typical set, and the usual methods in source coding show that the cardinality of this set is smaller than $2^{n (\overline{H_{n}} (X) + ε)}$. Thus, on average, $\overline{H_{n}} (X) + ε$ bits per symbol suffice for encoding with probability greater than $1 - δ$, where $ε$ and $δ$ can be made arbitrarily small by making $n$ larger.
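The concentration behind this typical set can be observed in a small simulation. The sketch below assumes that $\overline{H_{n}} (X)$ denotes the average of the per-symbol entropies, $\frac{1}{n} \sum_{i} H (X_{i})$, which is an interpretation not spelled out in the text above, and uses independent Bernoulli symbols whose biases vary over time. All names and parameter values are hypothetical.

```python
import math
import random

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) symbol."""
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def sample_information_rate(ps, rng):
    """Draw independent X_i ~ Bernoulli(ps[i]) and return -(1/n) log2 p(x_1..x_n)."""
    total = 0.0
    for p in ps:
        x = 1 if rng.random() < p else 0
        total -= math.log2(p if x == 1 else 1 - p)
    return total / len(ps)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000
# Assumed non-stationary source: biases cycle through values in [0.1, 0.9].
ps = [0.1 + 0.8 * (i % 10) / 9 for i in range(n)]

avg_entropy = sum(binary_entropy(p) for p in ps) / n  # assumed meaning of the averaged entropy
rate = sample_information_rate(ps, rng)               # close to avg_entropy for large n
```

A single draw only suggests the behaviour; the statement above is about the probability of the whole typical set exceeding $1 - δ$.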

References

[1] Shen et al. (2017).
[2] C. E. Shannon, "A Mathematical Theory of Communication", Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, July, October 1948.
[3] David J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press, 2003.
[4] Cover (2006), Chapter 5.
