INFORMATION THEORY FOR CONTINUOUS SYSTEMS
Shunsuke Ihara, Department of Mathematics, College of General Education, Nagoya University
World Scientific
Singapore • New Jersey • London • Hong Kong
Published by
World Scientific Publishing Co. Pte. Ltd.
P O Box 128, Farrer Road, Singapore 9128
USA office: Suite 1B, 1060 Main Street, River Edge, NJ 07661
UK office: 73 Lynton Mead, Totteridge, London N20 8DH

Library of Congress Cataloging-in-Publication Data
Information theory for continuous systems / author, Shunsuke Ihara.
p. cm.
Includes bibliographical references and index.
ISBN 9810209851
1. Entropy (Information theory)--Mathematical models. 2. Neural transmission--Mathematical models. I. Ihara, Shunsuke.
Q370.I56 1993
003'.54--dc20    93-23179 CIP

Copyright © 1993 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 27 Congress Street, Salem, MA 01970, USA.

Printed in Singapore by Utopia Press.
Acknowledgments

I would like to express my gratitude to Professor Takeyuki Hida for introducing me to this important research area, and for his valuable suggestions and encouragement, without which I could not have completed the present work. Among the many others who have made contributions, I would particularly like to thank Professors Tunekiti Sirao, Hisao Nomoto, Nobuyuki Ikeda, Hiroshi Kunita, Izumi Kubo, Masuyuki Hitsuda, Charles R. Baker and Kenjiro Yanagi, for valuable comments.
Preface

This book is intended to provide a comprehensive and up-to-date treatment of information theory, with special emphasis on continuous rather than discrete information transmission systems. Information theory in the strict sense originated with C.E. Shannon. In the paper titled "A mathematical theory of communication" [103], published in 1948, Shannon defined such fundamental quantities as entropy and mutual information, introduced general models of communication systems, and proved coding theorems. Since then, Shannon's basic ideas and results have been extended and generalized by information theorists in both mathematics and engineering. There are two large trends in the development of information theory. The first is the analysis of information theoretic quantities such as entropy and mutual information. The concept of entropy was proposed, in information theory, as a measure of the uncertainty or randomness of a random variable, in other words, as a measure of the information involved in the random variable. A generalization of the idea of entropy called relative entropy was introduced by S. Kullback (see [83]). This form of information measure, which is also referred to as information divergence or the Kullback-Leibler information number, can be interpreted as a measure of similarity between two probability distributions. Many results for entropy and mutual information can be viewed as particular cases of results for relative entropy.
The second trend is the generalization of coding theorems to more general and more complicated information transmission systems. The field of information theory has grown considerably. Modern information theory deals with various problems of information transmission, is based on probability theory and other branches of mathematical analysis, and is more than a mathematical theory of communication. It is tied to estimation theory, filtering theory, and decision theory more closely than ever.
Information theory is based on mathematics, especially on probability theory and mathematical statistics. Information transmission is, more or less, disturbed by noise arising from a number of causes. Mathematically, the noise can be represented by suitable random variables or stochastic processes. Moreover, a significant aspect of communication theory is that the actual message or signal to be transmitted is one selected from a set of possible messages or signals. The system must be designed to operate for every possible selection, not just the one which will actually be chosen, since this is unknown at the time of design. The sets of possible messages and signals are given with probability distributions. Thus messages and signals are also considered to be represented by random variables or stochastic processes. This indicates why the stochastic approach is quite useful for the development of information theory. At the same time, information theory has made fundamental contributions not only to communication theory but also to statistical mechanics, probability theory, and statistics. The contents of this book fall roughly into two parts. In the first part, entropy, mutual information and relative entropy are systematically analyzed, and a unified treatment of these quantities in information theory, probability theory and mathematical statistics is presented. Links between information theory and topics from probability theory or mathematical statistics are frequently expressed in terms of relative entropy. It is shown that relative entropy is of central importance in large deviation theorems, hypothesis testing, and maximum entropy methods in statistical inference. The second part deals mostly with information theory for continuous information transmission systems, based on the most recently available methods in probability theory and on the analysis developed in the first part. The Gaussian channel is a communication channel disturbed by an additive noise with Gaussian distribution. It is known, by the central limit theorem, that a noise caused by a large number of small random effects has a probability distribution that is approximately Gaussian. This indicates that the study of Gaussian channels is especially important not only from the theoretical point of view but also from the viewpoint of applications. One of the most interesting and complicated information transmission systems is a system with feedback. We can handle feedback systems on the basis of the causal calculus and the innovation approach which have been developed in probability theory.
While most of them
deal w i t h discrete information transmission systems, few of them treat continuous (both i n state spaces and time parameter spaces) systems. One of the major aims of this book is to cover the most recent developments i n information theory for continuous communication systems including Gaussian channels.
Topics in information theory and stochastic processes are dealt with in a moderately rigorous manner from the mathematical point of view. At the same time, much effort is put into a rapid approach to the essentials of the matter treated and the maintenance of a spirit favorable to application, avoiding extremes of generality or abstraction. Most of the results in this book are given as theorems with proofs. It is not necessary to understand the proofs completely to gain a general grasp of the subject; the proofs can be skipped on a first reading and studied more carefully on a second. It is my hope that this book is accessible to engineers who are interested in the mathematical aspects and general models of the theory of information transmission, and to mathematicians who are interested in probability theory, mathematical statistics and their applications to information theory.

In Chapter 1, we introduce the fundamental quantities of information theory - entropy, relative entropy and mutual information - and study the relationships among them and a few of their interpretations. They play important roles throughout the text. These quantities are defined for stochastic processes in Chapter 2; in particular, stationary processes are studied from the information theoretical point of view. Chapter 3 is devoted to the study of large deviation theorems, maximum entropy and minimum relative entropy spectral analysis, and hypothesis testing. Relative entropy plays a fundamental role in the investigation of these topics. A large deviation theorem, which is a limit theorem in probability theory, shows that the speed of convergence is described in terms of relative entropy. A model selection that maximizes the uncertainty, or entropy, yields the maximum entropy approach to statistical inference, and the minimum relative entropy method can be viewed as a generalization of the maximum entropy method. Interpretations of basic concepts in communication theory, such as channel capacity and achievable rate of information transmission, are given in Chapter 4, where fundamental problems of information transmission over a communication channel are also investigated. The last two chapters are devoted to a study of information transmission over communication channels with additive noise, especially Gaussian channels with or without feedback. Such subjects as mutual information, channel capacity, optimal coding schemes, and coding theorems are investigated in detail. Chapter 5 treats the discrete time case; Chapter 6, on continuous time Gaussian channels, is in particular unique to this book. Some terminology and results from probability theory that are used in the text are collected in the Appendix.
Contents

Preface
Chapter 1. Entropy
1.1 Information Transmission
1.2 Entropy
1.3 Entropy for Continuous Distributions
1.4 Relative Entropy
1.5 Properties of Relative Entropy
1.6 Mutual Information
1.7 Rate-Distortion Function
1.8 Entropy for Gaussian Distributions
Historical Notes
Chapter 2. Stochastic Processes and Entropy
2.1 Entropy Rate
2.2 Discrete Time Stationary Processes
2.3 Discrete Time Markov Stationary Processes
2.4 Entropy Rate of a Stationary Gaussian Process
2.5 Continuous Time Stationary Processes
2.6 Band Limited Processes
2.7 Discrete Time Observation of Continuous Time Processes
Historical Notes
Chapter 3. Maximum Entropy Analysis
3.1 Maximum Entropy and Minimum Relative Entropy
3.2 Large Deviation Theorems
3.3 Maximum Entropy Spectral Analysis
3.4 Hypothesis Testing
3.5 Hypothesis Testing for Stationary Gaussian Processes
Historical Notes
Chapter 4. Theory of Information Transmission
4.1 Model of Communication Systems
4.2 Information Stability
4.3 Source Coding Theorems
4.4 Channel Capacity
4.5 Channel Coding Theorems
4.6 Fundamental Theorem in Information Transmission
Historical Notes
Chapter 5. Discrete Time Gaussian Channels
5.1 Mutual Information in Channels with Additive Noise
5.2 Discrete Gaussian Channels
5.3 Mutual Information in Gaussian Systems
5.4 Capacity of Discrete White Gaussian Channels
5.5 Capacity of Discrete Gaussian Channels without Feedback
5.6 Optimal Codings in Discrete White Gaussian Channels
5.7 Capacity of Discrete Gaussian Channels with Feedback
5.8 Rate-Distortion Function of a Discrete Gaussian Message
5.9 Coding Theorem for Discrete Gaussian Channels
Historical Notes
Chapter 6. Continuous Time Gaussian Channels
6.1 Continuous Gaussian Channels
6.2 Mutual Information in White Gaussian Channels (I)
6.3 Mutual Information in White Gaussian Channels (II)
6.4 Capacity of White Gaussian Channels
6.5 Optimal Codings in White Gaussian Channels
6.6 Gaussian Channels without Feedback
6.7 Stationary Gaussian Channels
6.8 Gaussian Channels with Feedback
6.9 Rate-Distortion Function of a Gaussian Process
6.10 Coding Theorem for Gaussian Channels
Historical Notes
Appendix. Preparation from Probability Theory
A.1 Probability and Random Variable
A.2 Conditional Probability and Conditional Expectation
A.3 Stochastic Integral
Bibliography
List of Symbols
Subject Index
Chapter 1. Entropy

1.1 INFORMATION TRANSMISSION
A mathematical foundation of information theory was presented by C.E. Shannon in his famous paper "A mathematical theory of communication", published in 1948. In a strict sense, information theory may be understood as a mathematical theory of communication, or a theory of information transmission. As the field has grown and developed, we shall consider information theory here in a somewhat broader sense. Communication theory deals with systems that transmit information or data from one place to another. Physically speaking, there are various types of communication systems. Most such systems can be described by the block diagram in Figure 1.1, which is known as Shannon's scheme. A message from the information source is encoded at the encoder. The encoded signal is input into the channel. Then the received signal is decoded at the decoder. The fundamental problem of communication is that of reproducing the message either exactly or approximately. It is significant to note that the actual message is one selected from a set of possible messages with a probabilistic structure. The system must be designed so as to be ready to operate for each possible selection, not just the one which will actually be chosen. It is essential to evaluate the performance of communication over all possible selections as a whole, not only for the one actually chosen. The fundamental problem of information transmission will be discussed in detail in Chapter 4. It is essential in information theory to measure the quantity of "information".
For this purpose, Shannon introduced the concepts of entropy and mutual information from the viewpoint of communication theory. Entropy is defined as a measure of the "uncertainty" or "randomness" of a random phenomenon. Suppose that some information about a random variable is received. Then the quantity of uncertainty is reduced, and this reduction in uncertainty can be regarded as the quantity of transmitted information, which is called the mutual information. The capacity of a communication channel is defined to be the maximum value of the information that can be transmitted over the channel. Before Shannon's paper, it was generally thought that decreasing the required probability of error decreases the achievable transmission rate of information over the channel. However, Shannon showed that this is not true: an arbitrarily small probability of error is achievable at any transmission rate less than the channel capacity.
Figure 1.1. A communication system: Source -> Encoder -> Channel -> Decoder -> User, with Noise entering the Channel.
Developing the idea of entropy, the concept of relative entropy was introduced by Kullback to discriminate between two probability distributions and to apply information theory to statistical inference. This form of information measure is also referred to as information divergence, the Kullback-Leibler information number, or information gain. Relative entropy is best interpreted as a measure of "distance" between two probability distributions. Mathematically speaking, relative entropy is not a true metric, but it enjoys some of the properties of a metric. Many results for entropy and mutual information can be viewed as special cases of results for relative entropy, and the formula for relative entropy arises naturally in some proofs. The three mathematical notions entropy, mutual information, and relative entropy, introduced as measures of "information", have played significant roles in information theory. We note that these quantities, in particular entropy and relative entropy, play important roles not only in information theory but also in relevant fields such as probability theory, mathematical statistics, ergodic theory, statistical mechanics, and so on. Entropy, relative entropy and mutual information play central roles throughout
the text. In this section we introduce them only briefly; we shall study each of them in detail in later sections.

Let X be a random variable taking values in a set {a_1, ..., a_m} with probabilities

P(X = a_i) = p_i,   i = 1, ..., m.

Then

H(X) = H(p_1, ..., p_m) = -\sum_{i=1}^{m} p_i \log p_i   (1.1.1)

is called the entropy of the random variable X or the entropy of the probability distribution (p_1, ..., p_m). The quantity H(X) was named entropy by Shannon because he recognized some analogies between H(X) and the entropy in statistical mechanics, and the symbol H came from the H in Boltzmann's famous H-theorem.

Let X and Y be two random variables with probabilities

P(X = a_i) = p_i,   P(Y = b_j) = q_j,   P(X = a_i, Y = b_j) = r_{ij},   i = 1, ..., m,  j = 1, ..., n.

Then the mutual information I(X, Y) between X and Y is defined by

I(X, Y) = \sum_{i=1}^{m} \sum_{j=1}^{n} r_{ij} \log \frac{r_{ij}}{p_i q_j}.   (1.1.2)

It can be expressed as

I(X, Y) = H(X) - H(X|Y),   (1.1.3)

where

H(X|Y) = -\sum_{j=1}^{n} \sum_{i=1}^{m} r_{ij} \log \frac{r_{ij}}{q_j}.

The quantity H(X|Y) is called the conditional entropy of X given Y. It can be interpreted as the uncertainty of X that remains after we are informed of the outcome of Y. Hence we can say that I(X, Y) is the reduction of the entropy of X due to the knowledge of Y. In other words, I(X, Y) describes the amount of information about X contained in Y.
Let \mu = (p_1, ..., p_m) and \nu = (q_1, ..., q_m) be probability distributions. Then the relative entropy H(\mu; \nu) of \mu with respect to \nu is defined by

H(\mu; \nu) = \sum_{i=1}^{m} p_i \log \frac{p_i}{q_i}.
We shall see that both the entropy and the mutual information can be expressed as relative entropies. Here we have defined relative entropy and mutual information only for discrete probability distributions; we shall repeat these definitions in later sections, where we deal with probability distributions of more general types.
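The following sketch (again in Python, with toy distributions of my own) illustrates the remark above: relative entropy is nonnegative and vanishes for identical distributions, and the mutual information of the toy pair (X, Y) coincides with the relative entropy of its joint distribution with respect to the product of the marginals.

```python
import math

def relative_entropy(p, q):
    # H(mu; nu) = sum_i p_i log(p_i / q_i), in nats; assumes q_i > 0 wherever p_i > 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative joint distribution (same toy numbers as in the previous sketch)
r = [[0.30, 0.10], [0.05, 0.25], [0.10, 0.20]]
p = [sum(row) for row in r]
q = [sum(r[i][j] for i in range(3)) for j in range(2)]

joint = [r[i][j] for i in range(3) for j in range(2)]
product = [p[i] * q[j] for i in range(3) for j in range(2)]

I_XY = relative_entropy(joint, product)   # mutual information written as a relative entropy
print(I_XY)
print(relative_entropy(p, p))             # 0: vanishes only when the distributions coincide
print(relative_entropy([0.5, 0.5], [0.9, 0.1]) >= 0)   # nonnegativity (Gibbs' inequality)
```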
1.2 ENTROPY
This section is devoted to a study of various properties of entropy and to an axiomatic characterization of entropy. Suppose that we have a set {a_1, ..., a_m} of possible states whose probabilities of occurrence are given by (p_1, ..., p_m). Then the random experiment is described by a random variable X with probability

P(X = a_i) = p_i,   i = 1, ..., m.   (1.2.1)

We call (p_1, ..., p_m) the probability distribution of X. Throughout this book, we denote by P(A) the probability of an event A. Here we repeat the definition of entropy.

Definition 1.2.1. Let X be a random variable with distribution (p_1, ..., p_m). Then the entropy of the random variable X or the entropy of the distribution (p_1, ..., p_m) is defined by

H(X) = H(p_1, ..., p_m) = -\sum_{i=1}^{m} p_i \log p_i.   (1.2.2)
Suppose that a probability distribution (p_1, ..., p_m) is known and that we do not know which event will occur. Then the entropy H(p_1, ..., p_m) shows how much freedom one is given in the selection of an event, or how uncertain the outcome is, or how difficult it is to predict the outcome. The logarithm may be taken to the base e or to the base 2. In the case of base e the entropy is measured in units of nats, while in the case of base 2 it is measured in units of bits. We take the logarithm to the base e throughout the text unless otherwise stated. Bits are usually more convenient for practical use, and nats are more convenient for theoretical developments.
We use the notational convention 0 log 0 = 0. Quite often in the present text we simply write \sum_i for \sum_{i=1}^{m} if no confusion occurs. Some basic properties of the entropy are now in order.
Theorem 1.2.1. (Properties of Entropy) The following properties (H.1) - (H.7) hold for a probability distribution (p_1, ..., p_m).

(H.1) H(p_1, ..., p_m) \ge 0. The equality H(p_1, ..., p_m) = 0 holds if and only if p_i = 1 for some i and p_j = 0 (j \ne i).

(H.2) H(p_1, ..., p_m) is a continuous function of (p_1, ..., p_m).

(H.3) The entropy is symmetric:

H(p_1, ..., p_m) = H(p_{\sigma(1)}, ..., p_{\sigma(m)}),

where \sigma denotes a permutation of (1, ..., m).

(H.4) Let {(q_{1|i}, ..., q_{n|i}); i = 1, ..., m} be a set of probability distributions. Then it holds that

H(p_1 q_{1|1}, ..., p_1 q_{n|1}, p_2 q_{1|2}, ..., p_m q_{n|m}) = H(p_1, ..., p_m) + \sum_{i=1}^{m} p_i H(q_{1|i}, ..., q_{n|i}).

(H.5) If p_m = q + r > 0 and q \ge 0, r > 0, then

H(p_1, ..., p_{m-1}, q, r) = H(p_1, ..., p_{m-1}, p_m) + p_m H\left( \frac{q}{p_m}, \frac{r}{p_m} \right).

(H.6) Let (q_1, ..., q_m) be an arbitrary probability distribution. Then

H(p_1, ..., p_m) \le -\sum_{i=1}^{m} p_i \log q_i,

with equality if and only if p_i = q_i for all i = 1, ..., m.

(H.7) H(p_1, ..., p_m) \le \log m, with equality if and only if p_i = 1/m for all i = 1, ..., m.

Proof. We first note the elementary inequality

\log x \le x - 1,   x > 0.   (1.2.3)
in
+ ^p,log , = ^p.log —
m
9
i=l
.=1
P
I
The equality holds i n (1.2.4) if and only i f p, — ••-,P]«n|l.P2?l|2, •;Pmq \m) n
(1.2.7)
Once we are informed that X = a_i has occurred, the uncertainty of X disappears, while a reduced uncertainty H(q_{1|i}, ..., q_{n|i}) of Y still remains. Therefore the second term on the right-hand side of (H.4) can be interpreted as the average value of the entropy of Y when X is known, in other words, the uncertainty of Y remaining after X is known. Thus (H.4) says that the uncertainty of (X, Y) is equal to the sum of the uncertainty of X and the remaining uncertainty of Y after X is known.

We now introduce the concept of conditional entropy.

Definition 1.2.2. For a pair (X, Y) of random variables with probabilities (1.2.1), (1.2.5) and (1.2.6), the conditional entropy H(X|Y) of X given Y is defined by

H(X|Y) = -\sum_{j=1}^{n} \sum_{i=1}^{m} r_{ij} \log p_{i|j},   (1.2.8)

where p_{i|j} = r_{ij} / q_j. Using (H.4), we see that H(X|Y) has the alternative expression

H(X|Y) = H(X, Y) - H(Y).   (1.2.9)
The quantity

H(X|Y = b_j) = H(p_{1|j}, ..., p_{m|j}) = -\sum_{i=1}^{m} p_{i|j} \log p_{i|j}   (1.2.10)

is called the conditional entropy of X given Y = b_j. Then H(X|Y) is the expectation of H(X|Y = \cdot):

H(X|Y) = \sum_{j=1}^{n} q_j H(X|Y = b_j).   (1.2.11)
We collect some of the basic properties of conditional entropy.

Theorem 1.2.2. Let X, X_1, ..., X_n, Y be random variables with discrete probability distributions. Then

(H.8) H(X|Y) \le H(X).

(H.9) H(X_1, ..., X_n, Y) = H(Y) + H(X_1, ..., X_n | Y).

(H.10) H(X_1, ..., X_n | Y) = H(X_1|Y) + H(X_2|Y, X_1) + \cdots + H(X_n|Y, X_1, ..., X_{n-1}).

Proof. Using (1.2.9) repeatedly,

H(X_1, ..., X_n, Y) = H(X_1, Y) + H(X_2, ..., X_n | Y, X_1) = H(Y) + H(X_1|Y) + H(X_2, ..., X_n | Y, X_1).

Consequently

H(X_1, ..., X_n | Y) = H(X_1|Y) + H(X_2, ..., X_n | Y, X_1).

Repeating the same procedure, we obtain (H.10). •
On the other hand, it seems to be reasonable to say
that a function to measure the uncertainty should possess at least the properties ( H . l ) - (H.4).
More importantly Shannon showed that the entropy defined by
(1.2.2) is only the function that satisfies ( H . l ) - (H.4). To state the Shannon's characterization of entropy, we introduce weaker conditions ( H . l ' ) and (H.2') than (H.l)and(H.2). (H.l')
There exists 0 < p < 1 such that H(p, 1 - p) > 0.
(H.2')
H(p, 1 — p) is a continuous function of p e [0,1].
T h e o r e m 1.2.3. Assume that a real valued function H(pt,
. . . , p ) , defined on m
the class of all discrete probability distributions, satisfies ( H I ' ) , (H.2 ), (H.3) and 1
(H.5). Then H(p,,...,p )
is the function of the form
m
m H(pt,...,p ) m
= -c^Tpilogpi,
(1.2.12)
i=i
where c is a positive constant.
P r o o f . A t first we shall prove (1.2.12) in the case of m = 2. For this purpose we put f( ) P
= 7/(p, 1 - p),
0 < p < l .
10
ENTROPY
Let p, q, r be non-negative numbers w i t h r > 0, p + q + r = 1. Then, by (H.3) and (H.5), H(p,q,r)
( j ^
= H(p,q + r) + (q + r)H = H
i
,
q
p
+
r
)
+
i
p
r
+
)
H
t
^
^
)
.
Accordingly, /(p) satisfies /(p) + (1 - ) f (j^j
= f(q) + ( 1 - ) / ( ^ )
,
5
P
(1.2.13)
where p , q > Q and p + q < 1. Let p be fixed and integrate both sides of (1.2.13) from 0 t o 1 — p w i t h respect t o q. Then we obtain (l-p)/ 0. Thus we have proved (1.2.12) for m = 2. Next, we shall prove (1.2.12) by induction w i t h respect to m , the number of components of distribution ( p i , . . . , p ) . Suppose that (1.2.12) holds for all m < k. m
Let ( p i , . . . , p t ) be a probability distribution. We may assume that p * i > 0. I t + 1
+
follows from (H.5) that H(pi,
. . . , p t - i , p * , p t + i ) =H(pi, ...,pk-i, k P
+p*+i)
By the assumption of induction, this can be rewritten as
tf(pi.—iP*+i)
= -
c
£pi
I o
g P i - (P* + P * + l ) ' ° 6 ( P i t C
+Pk+l)
•=i
meaning that (1.2.12) is true for m = k -f 1. Thus the proof is complete.
•
Here we prove an interesting inequality which is known as Fano's inequality. T h e o r e m 1.2.4. Let X and Y be random variables taking values aj,a , m
and
denote by *
e
=
P(X^Y)
the probability of error. Then (1.2.18)
ENTROPY
12
where h(x) = —x log x — (1 — x ) l o g ( l — i } P r o o f . Let Z be a random variable defined by 1,
Y*X,
I o,
Y = X.
Then P ( Z = 1) = 7r . Using the chain rule (H.10), we expand H(X, e
Z\Y) i n two
different ways to obtain fi(X,
Z\Y) = H(X\Y)
+ H(Z\Y, X) =
H(X\Y)
and H(X,
Z\Y) = ff (Z|K) + H(X\Y,
Z).
Thus //(XIV) = H{Z\Y)
+ H{X\Y, Z).
(1.2.19)
Using (H.8) and (H.7), we have H(Z\Y)
< H(Z)
= /»(«•«)
(1.2.20)
and H{X\Y,Z)
= P{Z
= 0)H(X\Y,Z
= 0) + P{Z
= \)H{X\Y, Z = 1)
= F ( Z = l ) / J ( X | r , Z = 1) < jt, b g m .
(1.2.21)
The desired inequality (1.2.18) follows from (1.2.19) - (1.2.21).
•
Fano's inequality w i l l play an important role to establish the channel coding theorem (cf. Proof of Theorem 4.5.2). If X is a random variable taking infinitely many values a i , a > — w i t h probabil2
ities pi,|>2,—, then as a natural extension of (1.2.2), the entropy of X is defined by //(*)
=
Note that, in this case, H{X)
ff(p,,p ,...) 2
CO = -X>logp .
is not necessarily
;
finite.
(1.2.22)
I n the case where Y is
also a random variable taking values of at most countable number, the conditional entropy H(X\Y)
is defined by (1.2.9).
1.3 ENTROPY
FOR
CONTINUOUS
1.3 E N T R O P Y F O R C O N T I N U O U S
13
DISTRIBUTIONS
DISTRIBUTIONS
I n this section we introduce the notion of continuous entropy (or differential entropy), which is defined for random variables w i t h continuous probability distributions. Let Xi,...,Xi (Xi,...,Xj)
If the probability of X
be real (valued) random variables.
S
e A is given by = J
P(X£A)
— j
p{x„...,x )dx,...dx , d
i
A
for any d-dimensional
Borel set A C R
( R = (—°o> cc) denotes the set of all real
d
numbers), then we say that X = IXi, ...,X )
is with d-dimensional
d
distribution, or simply that X is a continuous random variable. p(x) = p{x\, . . . , i d ) (x = [xi,Xi)
continuous
The function
e W*) is called the probability density. The
density function is nonnegative; p(x) > 0, and satisfies
J tL*
j p(x ,...,x )dx ...dx 1
d
l
d
= 1.
D e f i n i t i o n 1.3.1. The c o n t i n u o u s e n t r o p y of a if-dimensional continuous random variable X — (Xt,...,Xj)
w i t h probability density function p(x) is defined
as h(X)
= k(p) = -
[
p{x)logp(x)dx,
(1.3.1)
provided the integral exists. I t is often referred to simply as the e n t r o p y of A" or the e n t r o p y of p ( x ) . The continuous entropy k(X) tropy. We w i l l write j f(x)
has been traditionally called the differential en-
dx as the domain R
d
of the integral /
R J
/(x) dx if no
confusion occurs. Here we give two examples to calculate continuous entropy. E x a m p l e 1 . 3 . 1 . (Uniform Distribution)
Let X be a real random variable with
uniform distribution on an interval (a, a + L\ of length L. Then the probability density function p(x) is given by •L,
a 0. Hence, for a p a r t i t i o n A = {A,
V) = fiA)
log ^ |
+ rtA*) log f@
A ),
= cc,
c
1.1 RELATIVE
25
ENTROPY
meaning that (1.4.11) holds i n the sense of oo = oo. Next we assume that ji is absolutely continuous w i t h respect to v. Denote by tp — dji/dv the Radon-Nikodym derivative. Given an arbitrary partition A
=
{Aj,.... A } of X , we define a probability measure v& by m
VA(A)
=
A e 0(X),
^(x)<Mx),
where
Note that H&(fi\ v) can be written i n the form Ho.(p\v) =
/ log^ (x) Jx A
d)i{x).
Moreover i t can be shown that ji is absolutely continuous with respect to
and
the Radon-Nikodym derivative is given by P
d
I) x
-
By Theorem 1.4.1
This is equivalent to H (fi;v)=
/ log \b (x)dfi(x) Jx
A
A
< / logi^(i)d/i(i) = Jx
H(fi,v).
Since the partition A is arbitrary, we have H < H(ii-v).
(1.4.12)
We now proceed to prove the inverse inequality of (1.4.12). We put C„ = {x e X ; |logp(r)| > 7i}. Since Yim„ —
ao
n(C ) n
« ( c ) i o g 4£4 n
= 0, for any e > 0, there exists JV such that > M o i o g / i c c . ) > -e,
y» > m
(i.4.i3)
26
ENTROPY
Fix an integer n > N and define a partition A = {An i = 0, ± 1 , ± m ] At = ix e X ; ( A_ A_
m + i
m
=
( l
- ^ " < logv(i) < — } , m mj
by
- m + 2 < i <m,
= { z 6 X; - r . < log^(z)
J) dv'x)
(1.4.15)
where the supremum is taken over the set B of all bounded measurable functions defined on X . In the large deviation theory, D(fi, v) is called the entropy function or the rate function. T h e o r e m 1.4.4. Let fi and V be the probability measures on ( X , S ( X ) ) . Then H(p,i>) =
D(r,»).
(1.4.16)
1.4 RELATIVE
27
ENTROPY
To prove the theorem, we introduce auxiliary notations. Define classes U U ( f ) and V = \(u)
=
by
U = {u G Z ^ ) ;
> 0 (>-a. .), e
f u(x)dt>(x) = 1}, Jx
V = { u € U ; logueB}, where L ' ( f )
is the class of all v-integrable functions.
And define D(p;v)
and
D (ft; u) by 0
5 ( / i ; e ) = sup / l o g u ( x ) d u ( r ) , u€U J x D ( u ; c ) = s u p / logii{x)d/i(x). aev *ev J x 0
For every # € B , there corresponds a function u 6 V given by -l
«(*) =
exp(#(x)).
[^exp(