INFORMATION THEORY FOR CONTINUOUS SYSTEMS
Shunsuke Ihara, Department of Mathematics, College of General Education, Nagoya University
World Scientific
Singapore • New Jersey • London • Hong Kong
Published by
World Scientific Publishing Co. Pte. Ltd.
P O Box 128, Farrer Road, Singapore 9128
USA office: Suite 1B, 1060 Main Street, River Edge, NJ 07661
UK office: 73 Lynton Mead, Totteridge, London N20 8DH

Library of Congress Cataloging-in-Publication Data
Information theory for continuous systems / author, Shunsuke Ihara.
p. cm.
Includes bibliographical references and index.
ISBN 9810209851
1. Entropy (Information theory)--Mathematical models. 2. Neural transmission--Mathematical models. I. Ihara, Shunsuke.
Q370.I56 1993
003'.54--dc20    93-23179 CIP

Copyright © 1993 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 27 Congress Street, Salem, MA 01970, USA.

Printed in Singapore by Utopia Press.
Acknowledgments

I would like to express my gratitude to Professor Takeyuki Hida for introducing me to this important research area, and for his valuable suggestions and encouragement, without which I could not have completed the present work. Among the many others who have made contributions, I would particularly like to thank Professors Tunekiti Sirao, Hisao Nomoto, Nobuyuki Ikeda, Hiroshi Kunita, Izumi Kubo, Masuyuki Hitsuda, Charles R. Baker and Kenjiro Yanagi, for valuable comments.
Preface

This book is intended to provide a comprehensive and up-to-date treatment of information theory, with special emphasis on continuous rather than discrete information transmission systems. Information theory in the strict sense originated with C.E. Shannon. In the paper titled "A mathematical theory of communication" [103], published in 1948, Shannon defined such fundamental quantities as entropy and mutual information, introduced general models of communication systems, and proved coding theorems. Since then, Shannon's basic ideas and results have been extended and generalized by information theorists in both mathematics and engineering. There are two large trends in the development of information theory. The first is the analysis of information theoretic quantities such as entropy and mutual information. The concept of entropy was proposed, in information theory, as a measure of the uncertainty or randomness of a random variable, in other words, as a measure of the information involved in the random variable. A generalization of the idea of entropy called relative entropy was introduced by S. Kullback (see [83]). This form of information measure, which is also referred to as information divergence or the Kullback-Leibler information number, can be interpreted as a measure of similarity between two probability distributions. Many results for entropy and mutual information can be viewed as particular cases of results for relative entropy.
The second trend is the generalization of coding theorems to more general and more complicated information transmission systems. The field of information theory has grown considerably. Modern information theory deals with various problems of information transmission, is based on probability theory and other branches of mathematical analysis, and is more than a mathematical theory of communication. It is tied to estimation theory, filtering theory, and decision theory more closely than ever.
Information theory is based on mathematics, especially on probability theory and mathematical statistics. Information transmission is, more or less, disturbed by noise arising from a number of causes. Mathematically, the noise can be represented by suitable random variables or stochastic processes. Moreover, a significant aspect of communication theory is that the actual message or signal to be transmitted is one selected from a set of possible messages or signals. The system must be designed to operate for every possible selection, not just the one which will actually be chosen, since this is unknown at the time of design. The sets of possible messages and signals are given with probability distributions. Thus messages and signals are also considered to be represented by random variables or stochastic processes. This indicates why the stochastic approach is quite useful for the development of information theory. At the same time, information theory has made fundamental contributions not only to communication theory but also to statistical mechanics, probability theory, and statistics. The contents of this book fall roughly into two parts. In the first part, entropy, mutual information and relative entropy are systematically analyzed, and a unified treatment of these quantities in information theory, probability theory and mathematical statistics is presented. Links between information theory and topics from probability theory or mathematical statistics are frequently expressed in terms of relative entropy. It is shown that relative entropy is of central importance in large deviation theorems, hypothesis testing, and maximum entropy methods in statistical inference. The second part deals mostly with information theory for continuous information transmission systems, based on the most recently available methods in probability theory and on the analysis developed in the first part. The Gaussian channel is a communication channel disturbed by an additive noise with Gaussian distribution. It is known, by the central limit theorem, that a noise caused by a large number of small random effects has a probability distribution that is approximately Gaussian. This indicates that the study of Gaussian channels is especially important not only from the theoretical point of view but also from the viewpoint of applications. One of the most interesting and complicated information transmission systems is a system with feedback. We can handle feedback systems on the basis of the causal calculus and the innovation approach which have been developed in probability theory.
While most of them
deal w i t h discrete information transmission systems, few of them treat continuous (both i n state spaces and time parameter spaces) systems. One of the major aims of this book is to cover the most recent developments i n information theory for continuous communication systems including Gaussian channels.
Topics in information theory and stochastic processes are dealt with in a moderately rigorous manner from the mathematical point of view. At the same time, much effort is put into a rapid approach to the essentials of the matter treated and the maintenance of a spirit favorable to application, avoiding extremes of generality or abstraction. Most of the results in this book are given as theorems with proofs. It is not necessary to understand the proofs completely to gain a general grasp of the subject; the proofs can be skipped on a first reading and studied more carefully on a second. It is my hope that this book is accessible to engineers who are interested in the mathematical aspects and general models of the theory of information transmission, and to mathematicians who are interested in probability theory, mathematical statistics and their applications to information theory.

In Chapter 1, we introduce the fundamental quantities of information theory - entropy, relative entropy and mutual information - and study the relationships among them and a few of their interpretations. They play important roles throughout the text. These quantities are defined for stochastic processes in Chapter 2; in particular, stationary processes are studied from the information theoretical point of view. Chapter 3 is devoted to the study of large deviation theorems, maximum entropy and minimum relative entropy spectral analysis, and hypothesis testing. Relative entropy plays a fundamental role in the investigation of these topics. A large deviation theorem, which is a limit theorem in probability theory, shows that the speed of convergence is described in terms of relative entropy. A model selection that maximizes the uncertainty, or entropy, yields the maximum entropy approach to statistical inference, and the minimum relative entropy method can be viewed as a generalization of the maximum entropy method. Interpretations of basic concepts in communication theory, such as channel capacity and achievable rate of information transmission, are given in Chapter 4, where fundamental problems of information transmission over a communication channel are also investigated. The last two chapters are devoted to a study of information transmission over communication channels with additive noise, especially Gaussian channels with or without feedback. Such subjects as mutual information, channel capacity, optimal coding schemes, and coding theorems are investigated in detail. Chapter 5 treats the discrete time case; Chapter 6, on continuous time Gaussian channels, is in particular unique to this book. Some terminology and results from probability theory that are used in the text are collected in the Appendix.
Contents

Preface
Chapter 1. Entropy
1.1 Information Transmission
1.2 Entropy
1.3 Entropy for Continuous Distributions
1.4 Relative Entropy
1.5 Properties of Relative Entropy
1.6 Mutual Information
1.7 Rate-Distortion Function
1.8 Entropy for Gaussian Distributions
Historical Notes
Chapter 2. Stochastic Processes and Entropy
2.1 Entropy Rate
2.2 Discrete Time Stationary Processes
2.3 Discrete Time Markov Stationary Processes
2.4 Entropy Rate of a Stationary Gaussian Process
2.5 Continuous Time Stationary Processes
2.6 Band Limited Processes
2.7 Discrete Time Observation of Continuous Time Processes
Historical Notes
Chapter 3. Maximum Entropy Analysis
3.1 Maximum Entropy and Minimum Relative Entropy
3.2 Large Deviation Theorems
3.3 Maximum Entropy Spectral Analysis
3.4 Hypothesis Testing
3.5 Hypothesis Testing for Stationary Gaussian Processes
Historical Notes
Chapter 4. Theory of Information Transmission
4.1 Model of Communication Systems
4.2 Information Stability
4.3 Source Coding Theorems
4.4 Channel Capacity
4.5 Channel Coding Theorems
4.6 Fundamental Theorem in Information Transmission
Historical Notes
Chapter 5. Discrete Time Gaussian Channels
5.1 Mutual Information in Channels with Additive Noise
5.2 Discrete Gaussian Channels
5.3 Mutual Information in Gaussian Systems
5.4 Capacity of Discrete White Gaussian Channels
5.5 Capacity of Discrete Gaussian Channels without Feedback
5.6 Optimal Codings in Discrete White Gaussian Channels
5.7 Capacity of Discrete Gaussian Channels with Feedback
5.8 Rate-Distortion Function of a Discrete Gaussian Message
5.9 Coding Theorem for Discrete Gaussian Channels
Historical Notes
Chapter 6. Continuous Time Gaussian Channels
6.1 Continuous Gaussian Channels
6.2 Mutual Information in White Gaussian Channels (I)
6.3 Mutual Information in White Gaussian Channels (II)
6.4 Capacity of White Gaussian Channels
6.5 Optimal Codings in White Gaussian Channels
6.6 Gaussian Channels without Feedback
6.7 Stationary Gaussian Channels
6.8 Gaussian Channels with Feedback
6.9 Rate-Distortion Function of a Gaussian Process
6.10 Coding Theorem for Gaussian Channels
Historical Notes
Appendix. Preparation from Probability Theory
A.1 Probability and Random Variable
A.2 Conditional Probability and Conditional Expectation
A.3 Stochastic Integral
Bibliography
List of Symbols
Subject Index
Chapter 1. Entropy

1.1 INFORMATION TRANSMISSION
A mathematical foundation of information theory was presented by C.E. Shannon in his famous paper "A mathematical theory of communication", published in 1948. In a strict sense, information theory may be understood as a mathematical theory of communication, or a theory of information transmission. As the field has grown and developed, we shall consider information theory here in a somewhat broader sense. Communication theory deals with systems that transmit information or data from one place to another. Physically speaking, there are various types of communication systems. Most such systems can be described by the block diagram in Figure 1.1, which is known as Shannon's scheme. A message from the information source is encoded at the encoder. The encoded signal is input into the channel. Then the received signal is decoded at the decoder. The fundamental problem of communication is that of reproducing the message either exactly or approximately. It is significant to note that the actual message is one selected from a set of possible messages with a probabilistic structure. The system must be designed so as to be ready to operate for each possible selection, not just the one which will actually be chosen. It is essential to evaluate the performance of communication over all possible selections as a whole, not only for the one actually chosen. The fundamental problem of information transmission will be discussed in detail in Chapter 4. It is essential in information theory to measure the quantity of "information".
For this purpose, Shannon introduced the concepts of entropy and mutual information from the viewpoint of communication theory. Entropy is defined as a measure of the "uncertainty" or "randomness" of a random phenomenon. Suppose that some information about a random variable is received. Then the quantity of uncertainty is reduced, and this reduction in uncertainty can be regarded as the quantity of transmitted information, which is called the mutual information. The capacity of a communication channel is defined to be the maximum value of the information that can be transmitted over the channel. Before Shannon's paper, it was generally thought that decreasing the required probability of error decreases the achievable transmission rate of information over the channel. However, Shannon showed that this is not true: an arbitrarily small probability of error is achievable at any transmission rate less than the channel capacity.
Figure 1.1. A communication system: Source -> Encoder -> Channel -> Decoder -> User, with Noise entering the Channel.
Developing the idea of entropy, the concept of relative entropy was introduced by Kullback to discriminate between two probability distributions and to apply information theory to statistical inference. This form of information measure is also referred to as information divergence, the Kullback-Leibler information number, or information gain. Relative entropy is best interpreted as a measure of "distance" between two probability distributions. Mathematically speaking, relative entropy is not a true metric, but it enjoys some of the properties of a metric. Many results for entropy and mutual information can be viewed as special cases of results for relative entropy, and the formula for relative entropy arises naturally in some proofs. The three mathematical notions entropy, mutual information, and relative entropy, introduced as measures of "information", have played significant roles in information theory. We note that these quantities, in particular entropy and relative entropy, play important roles not only in information theory but also in relevant fields such as probability theory, mathematical statistics, ergodic theory, statistical mechanics, and so on. Entropy, relative entropy and mutual information play central roles throughout
the text. In this section we introduce them only briefly; we shall study each of them in detail in later sections.

Let X be a random variable taking values in a set {a_1, ..., a_m} with probabilities

P(X = a_i) = p_i,   i = 1, ..., m.

Then

H(X) = H(p_1, ..., p_m) = -\sum_{i=1}^{m} p_i \log p_i   (1.1.1)

is called the entropy of the random variable X or the entropy of the probability distribution (p_1, ..., p_m). The quantity H(X) was named entropy by Shannon because he recognized some analogies between H(X) and the entropy in statistical mechanics, and the symbol H came from the H in Boltzmann's famous H-theorem.

Let X and Y be two random variables with probabilities

P(X = a_i) = p_i,   P(Y = b_j) = q_j,   P(X = a_i, Y = b_j) = r_{ij},   i = 1, ..., m,  j = 1, ..., n.

Then the mutual information I(X, Y) between X and Y is defined by

I(X, Y) = \sum_{i=1}^{m} \sum_{j=1}^{n} r_{ij} \log \frac{r_{ij}}{p_i q_j}.   (1.1.2)

It can be expressed as

I(X, Y) = H(X) - H(X|Y),   (1.1.3)

where

H(X|Y) = -\sum_{j=1}^{n} \sum_{i=1}^{m} r_{ij} \log \frac{r_{ij}}{q_j}.

The quantity H(X|Y) is called the conditional entropy of X given Y. It can be interpreted as the uncertainty of X that remains after we are informed of the outcome of Y. Hence we can say that I(X, Y) is the reduction of the entropy of X due to the knowledge of Y. In other words, I(X, Y) describes the amount of information about X contained in Y.
Let \mu = (p_1, ..., p_m) and \nu = (q_1, ..., q_m) be probability distributions. Then the relative entropy H(\mu; \nu) of \mu with respect to \nu is defined by

H(\mu; \nu) = \sum_{i=1}^{m} p_i \log \frac{p_i}{q_i}.
We shall see that both the entropy and the mutual information can be expressed as relative entropies. Here we have defined relative entropy and mutual information only for discrete probability distributions; we shall repeat these definitions in later sections, where we deal with probability distributions of more general types.
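The following sketch (again in Python, with toy distributions of my own) illustrates the remark above: relative entropy is nonnegative and vanishes for identical distributions, and the mutual information of the toy pair (X, Y) coincides with the relative entropy of its joint distribution with respect to the product of the marginals.

```python
import math

def relative_entropy(p, q):
    # H(mu; nu) = sum_i p_i log(p_i / q_i), in nats; assumes q_i > 0 wherever p_i > 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative joint distribution (same toy numbers as in the previous sketch)
r = [[0.30, 0.10], [0.05, 0.25], [0.10, 0.20]]
p = [sum(row) for row in r]
q = [sum(r[i][j] for i in range(3)) for j in range(2)]

joint = [r[i][j] for i in range(3) for j in range(2)]
product = [p[i] * q[j] for i in range(3) for j in range(2)]

I_XY = relative_entropy(joint, product)   # mutual information written as a relative entropy
print(I_XY)
print(relative_entropy(p, p))             # 0: vanishes only when the distributions coincide
print(relative_entropy([0.5, 0.5], [0.9, 0.1]) >= 0)   # nonnegativity (Gibbs' inequality)
```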
1.2 ENTROPY
This section is devoted to a study of various properties of entropy and to an axiomatic characterization of entropy. Suppose that we have a set {a_1, ..., a_m} of possible states whose probabilities of occurrence are given by (p_1, ..., p_m). Then the random experiment is described by a random variable X with probability

P(X = a_i) = p_i,   i = 1, ..., m.   (1.2.1)

We call (p_1, ..., p_m) the probability distribution of X. Throughout this book, we denote by P(A) the probability of an event A. Here we repeat the definition of entropy.

Definition 1.2.1. Let X be a random variable with distribution (p_1, ..., p_m). Then the entropy of the random variable X or the entropy of the distribution (p_1, ..., p_m) is defined by

H(X) = H(p_1, ..., p_m) = -\sum_{i=1}^{m} p_i \log p_i.   (1.2.2)
Suppose that a probability distribution (p_1, ..., p_m) is known and that we do not know which event will occur. Then the entropy H(p_1, ..., p_m) shows how much freedom one is given in the selection of an event, or how uncertain the outcome is, or how difficult it is to predict the outcome. The logarithm may be taken to the base e or to the base 2. In the case of base e the entropy is measured in units of nats, while in the case of base 2 it is measured in units of bits. We take the logarithm to the base e throughout the text unless otherwise stated. Bits are usually more convenient for practical use, and nats are more convenient for theoretical developments.
We use the notational convention 0 log 0 = 0. Quite often in the present text we simply write \sum_i for \sum_{i=1}^{m} if no confusion occurs. Some basic properties of the entropy are now in order.
Theorem 1.2.1. (Properties of Entropy) The following properties (H.1) - (H.7) hold for a probability distribution (p_1, ..., p_m).

(H.1) H(p_1, ..., p_m) \ge 0. The equality H(p_1, ..., p_m) = 0 holds if and only if p_i = 1 for some i and p_j = 0 (j \ne i).

(H.2) H(p_1, ..., p_m) is a continuous function of (p_1, ..., p_m).

(H.3) The entropy is symmetric:

H(p_1, ..., p_m) = H(p_{\sigma(1)}, ..., p_{\sigma(m)}),

where \sigma denotes a permutation of (1, ..., m).

(H.4) Let {(q_{1|i}, ..., q_{n|i}); i = 1, ..., m} be a set of probability distributions. Then it holds that

H(p_1 q_{1|1}, ..., p_1 q_{n|1}, p_2 q_{1|2}, ..., p_m q_{n|m}) = H(p_1, ..., p_m) + \sum_{i=1}^{m} p_i H(q_{1|i}, ..., q_{n|i}).

(H.5) If p_m = q + r > 0 and q \ge 0, r > 0, then

H(p_1, ..., p_{m-1}, q, r) = H(p_1, ..., p_{m-1}, p_m) + p_m H\left( \frac{q}{p_m}, \frac{r}{p_m} \right).

(H.6) Let (q_1, ..., q_m) be an arbitrary probability distribution. Then

H(p_1, ..., p_m) \le -\sum_{i=1}^{m} p_i \log q_i,

with equality if and only if p_i = q_i for all i = 1, ..., m.

(H.7) H(p_1, ..., p_m) \le \log m, with equality if and only if p_i = 1/m for all i = 1, ..., m.

Proof. We first note the elementary inequality

\log x \le x - 1,   x > 0.   (1.2.3)
in
+ ^p,log , = ^p.log —
m
9
i=l
.=1
P
I
The equality holds i n (1.2.4) if and only i f p, — ••-,P]«n|l.P2?l|2, •;Pmq \m) n
(1.2.7)
Once we are informed that X = a_i has occurred, the uncertainty of X disappears, while a reduced uncertainty H(q_{1|i}, ..., q_{n|i}) of Y still remains. Therefore the second term on the right-hand side of (H.4) can be interpreted as the average value of the entropy of Y when X is known, in other words, the uncertainty of Y remaining after X is known. Thus (H.4) says that the uncertainty of (X, Y) is equal to the sum of the uncertainty of X and the remaining uncertainty of Y after X is known.

We now introduce the concept of conditional entropy.

Definition 1.2.2. For a pair (X, Y) of random variables with probabilities (1.2.1), (1.2.5) and (1.2.6), the conditional entropy H(X|Y) of X given Y is defined by

H(X|Y) = -\sum_{j=1}^{n} \sum_{i=1}^{m} r_{ij} \log p_{i|j},   (1.2.8)

where p_{i|j} = r_{ij} / q_j. Using (H.4), we see that H(X|Y) has the alternative expression

H(X|Y) = H(X, Y) - H(Y).   (1.2.9)
The quantity

H(X|Y = b_j) = H(p_{1|j}, ..., p_{m|j}) = -\sum_{i=1}^{m} p_{i|j} \log p_{i|j}   (1.2.10)

is called the conditional entropy of X given Y = b_j. Then H(X|Y) is the expectation of H(X|Y = \cdot):

H(X|Y) = \sum_{j=1}^{n} q_j H(X|Y = b_j).   (1.2.11)
We collect some of the basic properties of conditional entropy.

Theorem 1.2.2. Let X, X_1, ..., X_n, Y be random variables with discrete probability distributions. Then

(H.8) H(X|Y) \le H(X).

(H.9) H(X_1, ..., X_n, Y) = H(Y) + H(X_1, ..., X_n | Y).

(H.10) H(X_1, ..., X_n | Y) = H(X_1|Y) + H(X_2|Y, X_1) + \cdots + H(X_n|Y, X_1, ..., X_{n-1}).

Proof. Using (1.2.9) repeatedly,

H(X_1, ..., X_n, Y) = H(X_1, Y) + H(X_2, ..., X_n | Y, X_1) = H(Y) + H(X_1|Y) + H(X_2, ..., X_n | Y, X_1).

Consequently

H(X_1, ..., X_n | Y) = H(X_1|Y) + H(X_2, ..., X_n | Y, X_1).

Repeating the same procedure, we obtain (H.10). •
On the other hand, it seems to be reasonable to say
that a function to measure the uncertainty should possess at least the properties ( H . l ) - (H.4).
More importantly Shannon showed that the entropy defined by
(1.2.2) is only the function that satisfies ( H . l ) - (H.4). To state the Shannon's characterization of entropy, we introduce weaker conditions ( H . l ' ) and (H.2') than (H.l)and(H.2). (H.l')
There exists 0 < p < 1 such that H(p, 1 - p) > 0.
(H.2')
H(p, 1 — p) is a continuous function of p e [0,1].
T h e o r e m 1.2.3. Assume that a real valued function H(pt,
. . . , p ) , defined on m
the class of all discrete probability distributions, satisfies ( H I ' ) , (H.2 ), (H.3) and 1
(H.5). Then H(p,,...,p )
is the function of the form
m
m H(pt,...,p ) m
= -c^Tpilogpi,
(1.2.12)
i=i
where c is a positive constant.
P r o o f . A t first we shall prove (1.2.12) in the case of m = 2. For this purpose we put f( ) P
= 7/(p, 1 - p),
0 < p < l .
10
ENTROPY
Let p, q, r be non-negative numbers w i t h r > 0, p + q + r = 1. Then, by (H.3) and (H.5), H(p,q,r)
( j ^
= H(p,q + r) + (q + r)H = H
i
,
q
p
+
r
)
+
i
p
r
+
)
H
t
^
^
)
.
Accordingly, /(p) satisfies /(p) + (1 - ) f (j^j
= f(q) + ( 1 - ) / ( ^ )
,
5
P
(1.2.13)
where p , q > Q and p + q < 1. Let p be fixed and integrate both sides of (1.2.13) from 0 t o 1 — p w i t h respect t o q. Then we obtain (l-p)/ 0. Thus we have proved (1.2.12) for m = 2. Next, we shall prove (1.2.12) by induction w i t h respect to m , the number of components of distribution ( p i , . . . , p ) . Suppose that (1.2.12) holds for all m < k. m
Let ( p i , . . . , p t ) be a probability distribution. We may assume that p * i > 0. I t + 1
+
follows from (H.5) that H(pi,
. . . , p t - i , p * , p t + i ) =H(pi, ...,pk-i, k P
+p*+i)
By the assumption of induction, this can be rewritten as
tf(pi.—iP*+i)
= -
c
£pi
I o
g P i - (P* + P * + l ) ' ° 6 ( P i t C
+Pk+l)
•=i
meaning that (1.2.12) is true for m = k -f 1. Thus the proof is complete.
•
Here we prove an interesting inequality which is known as Fano's inequality. T h e o r e m 1.2.4. Let X and Y be random variables taking values aj,a , m
and
denote by *
e
=
P(X^Y)
the probability of error. Then (1.2.18)
ENTROPY
12
where h(x) = —x log x — (1 — x ) l o g ( l — i } P r o o f . Let Z be a random variable defined by 1,
Y*X,
I o,
Y = X.
Then P ( Z = 1) = 7r . Using the chain rule (H.10), we expand H(X, e
Z\Y) i n two
different ways to obtain fi(X,
Z\Y) = H(X\Y)
+ H(Z\Y, X) =
H(X\Y)
and H(X,
Z\Y) = ff (Z|K) + H(X\Y,
Z).
Thus //(XIV) = H{Z\Y)
+ H{X\Y, Z).
(1.2.19)
Using (H.8) and (H.7), we have H(Z\Y)
< H(Z)
= /»(«•«)
(1.2.20)
and H{X\Y,Z)
= P{Z
= 0)H(X\Y,Z
= 0) + P{Z
= \)H{X\Y, Z = 1)
= F ( Z = l ) / J ( X | r , Z = 1) < jt, b g m .
(1.2.21)
The desired inequality (1.2.18) follows from (1.2.19) - (1.2.21).
•
Fano's inequality w i l l play an important role to establish the channel coding theorem (cf. Proof of Theorem 4.5.2). If X is a random variable taking infinitely many values a i , a > — w i t h probabil2
ities pi,|>2,—, then as a natural extension of (1.2.2), the entropy of X is defined by //(*)
=
Note that, in this case, H{X)
ff(p,,p ,...) 2
CO = -X>logp .
is not necessarily
;
finite.
(1.2.22)
I n the case where Y is
also a random variable taking values of at most countable number, the conditional entropy H(X\Y)
is defined by (1.2.9).
1.3 ENTROPY
FOR
CONTINUOUS
1.3 E N T R O P Y F O R C O N T I N U O U S
13
DISTRIBUTIONS
DISTRIBUTIONS
I n this section we introduce the notion of continuous entropy (or differential entropy), which is defined for random variables w i t h continuous probability distributions. Let Xi,...,Xi (Xi,...,Xj)
If the probability of X
be real (valued) random variables.
S
e A is given by = J
P(X£A)
— j
p{x„...,x )dx,...dx , d
i
A
for any d-dimensional
Borel set A C R
( R = (—°o> cc) denotes the set of all real
d
numbers), then we say that X = IXi, ...,X )
is with d-dimensional
d
distribution, or simply that X is a continuous random variable. p(x) = p{x\, . . . , i d ) (x = [xi,Xi)
continuous
The function
e W*) is called the probability density. The
density function is nonnegative; p(x) > 0, and satisfies
J tL*
j p(x ,...,x )dx ...dx 1
d
l
d
= 1.
D e f i n i t i o n 1.3.1. The c o n t i n u o u s e n t r o p y of a if-dimensional continuous random variable X — (Xt,...,Xj)
w i t h probability density function p(x) is defined
as h(X)
= k(p) = -
[
p{x)logp(x)dx,
(1.3.1)
provided the integral exists. I t is often referred to simply as the e n t r o p y of A" or the e n t r o p y of p ( x ) . The continuous entropy k(X) tropy. We w i l l write j f(x)
has been traditionally called the differential en-
dx as the domain R
d
of the integral /
R J
/(x) dx if no
confusion occurs. Here we give two examples to calculate continuous entropy. E x a m p l e 1 . 3 . 1 . (Uniform Distribution)
Let X be a real random variable with
uniform distribution on an interval (a, a + L\ of length L. Then the probability density function p(x) is given by •L,
a 0. Hence, for a p a r t i t i o n A = {A,
V) = fiA)
log ^ |
+ rtA*) log f@
A ),
= cc,
c
1.1 RELATIVE
25
ENTROPY
meaning that (1.4.11) holds i n the sense of oo = oo. Next we assume that ji is absolutely continuous w i t h respect to v. Denote by tp — dji/dv the Radon-Nikodym derivative. Given an arbitrary partition A
=
{Aj,.... A } of X , we define a probability measure v& by m
VA(A)
=
A e 0(X),
^(x)<Mx),
where
Note that H&(fi\ v) can be written i n the form Ho.(p\v) =
/ log^ (x) Jx A
d)i{x).
Moreover i t can be shown that ji is absolutely continuous with respect to
and
the Radon-Nikodym derivative is given by P
d
I) x
-
By Theorem 1.4.1
This is equivalent to H (fi;v)=
/ log \b (x)dfi(x) Jx
A
A
< / logi^(i)d/i(i) = Jx
H(fi,v).
Since the partition A is arbitrary, we have H < H(ii-v).
(1.4.12)
We now proceed to prove the inverse inequality of (1.4.12). We put C„ = {x e X ; |logp(r)| > 7i}. Since Yim„ —
ao
n(C ) n
« ( c ) i o g 4£4 n
= 0, for any e > 0, there exists JV such that > M o i o g / i c c . ) > -e,
y» > m
(i.4.i3)
26
ENTROPY
Fix an integer n > N and define a partition A = {An i = 0, ± 1 , ± m ] At = ix e X ; ( A_ A_
m + i
m
=
( l
- ^ " < logv(i) < — } , m mj
by
- m + 2 < i <m,
= { z 6 X; - r . < log^(z)
J) dv'x)
(1.4.15)
where the supremum is taken over the set B of all bounded measurable functions defined on X . In the large deviation theory, D(fi, v) is called the entropy function or the rate function. T h e o r e m 1.4.4. Let fi and V be the probability measures on ( X , S ( X ) ) . Then H(p,i>) =
D(r,»).
(1.4.16)
1.4 RELATIVE
27
ENTROPY
To prove the theorem, we introduce auxiliary notations. Define classes U U ( f ) and V = \(u)
=
by
U = {u G Z ^ ) ;
> 0 (>-a. .), e
f u(x)dt>(x) = 1}, Jx
V = { u € U ; logueB}, where L ' ( f )
is the class of all v-integrable functions.
And define D(p;v)
and
D (ft; u) by 0
5 ( / i ; e ) = sup / l o g u ( x ) d u ( r ) , u€U J x D ( u ; c ) = s u p / logii{x)d/i(x). aev *ev J x 0
For every # € B , there corresponds a function u 6 V given by -l
«(*) =
exp(#(x)).
[^exp(