Fazlollah M. Reza 1961
DOVER PUBLICATIONS, INC. New York
Statistical theory of communication is a broad new field com...
425 downloads
2800 Views
47MB Size
Report
This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
Report copyright / DMCA form
Fazlollah M. Reza 1961
DOVER PUBLICATIONS, INC. New York
Statistical theory of communication is a broad new field comprised of methods for the study of the statistical problems encountered in all types of communications. The field embodies many topics such as radar detection, sources of physical noise in linear and nonlinear systems, filtering and prediction, information theory, coding, and decision theory. The theory of probability provides the principal tools for the study of problems in this field. Information theory as outlined in the present work is a part of this broader body of knowledge. This theory, originated by C. E. ~n~tnn.on9 introduced several important new concepts and, although a of applied communications sciences, has acquired the unique distinction of opening a new path of research in pure mathematics. The communication of information is generally of a statistical nature, and a current theme of information theory is the study of ideal statistical communication models. The first objective of information theory is to define different types of sources and channels and to devise statistical parameters describing their individual and ensemble operations. The concept of Shannon's communication entropy of a source and the transinformation of a channel provide most useful means for studying simple communication models. In this respect it appears that the concept of communication entropy is a type of describing function that is most appropriate for the statistical models of communications. This is similar in principle to the way that an impedance function describes a linear network, or a moment indicates certain properties of a random variable. The introduction of the concepts of communication entropy, transinformation, and channel capacity is a basic contribution of information theory, and these concepts are of such fundamental significance that they may parallel in importance the concepts of power, impedance, and moment. Perhaps the most important theoretical result of information theory is Shannon's fundamental theorem, which implies that it is possible to communicate information at an ideal rate with utmost reliability in the presence of "noise." This succinct but deep statement and its consequences unfold the limitation and complexity of present and future iii
iv
PREFACE
methods of communications. of the offers many benefits to those interested in the analysis and synthesis of communication networks. For this reason we have included several methods of of this theorem (see Chapters 4 and 12). the reader entails much preparation which may prove is forewarned that the to be burdensome. This book originated about five years ago from the author's lecture notes on information theory. In presenting the subject to engineers. the need for preliminary lectures on probability theory was observed. A course in probability, even now, is not included in the curriculum of a majority of engineering schools. This fact motivated the inclusion of an introductory treatment of probability for those who wish to pursue the general study of statistical theory of communications. The present book, directed toward an engineering audience, has a threefold purpose: 1. To present elements of modern probability theory (discrete, continuous, and stochastic) 2. To present elements of information theory with emphasis on its basic roots in probability theory 3. To present elements of coding theory Thus this book is offered as an introduction to probability, mtormation. and coding theory. It also provides an adequate treatment of probability theory for those who wish to pursue topics other than information theory in the field of statistical theory of communications. One feature of the book is that it requires no formal prerequtsites except the usual undergraduate mathematics included in an engineering or science program. Naturally, a willingness to consult other references or authorities, as necessary, is presumed. The subject is presented in the light of applied mathematics. The immediate involvement in technological specialities that may solve specific problems at the expense of a less thorough basic understanding of the theory is thereby avoided. A most important, though indirect, application of information theory has been the development of codes for transmission and detection of information. Coding literature has grown very rapidly since it, presumably, applies to the growing field of data processing. Chapters 4 and 13 present an introduction to coding theory without recourse to the use of codes. The book has been divided into four parts: (1) memorylessdiscrete schemes, (2) memoryless continuum, (3) schemes with memory, and (4) an outline of some of the recent developments. The appendix contains some notes which may help to familiarize the reader with some of the literature in the field. The inclusion of many reference tables and a bibliography with some 200 entries may also prove to be useful.
PREFACE
The the book is on such basic as the probability measure associated with sets, space, random ables, information measure, and capacity. These concepts from set theory to probability theory and then to information and theories. The of the to such as detection, optics, and linguistics was not undertaken. We make no pretension for H usefulness" and immediate of information theory. From an educational standpoint, it appears, however, that the topics discussed should provide a suitable training ground for communication scientists. The most rewarding aspect of this undertaking has been the pleasure of learning about a new and fascinating frontier in communications. working on this book, I came to appreciate fully many subtle points and ingenious procedures set forth in the papers of the original contributors to the literature. I trust this attempt to integrate these many contributions will prove of value. Despite pains taken by the author, inaccuracies, original or inherited, may be found. Nevertheless, I the reader will find this work an existence proof of Shannon's fundamental theorem; that "information" can be transmitted with a reliability at a rate close to the channel capacity despite all forms of " noise." At any rate, there is an eternal separation between what one strives for and what one actually achieves. As Leon von Montenaeken La vie est breve, Un peu Un peu de reve, Et puis-bonsoir. Fazlollah M. Reza
The author wishes to acknowledge indebtedness to all those who have directly or indirectly contributed to this book. Special tribute is due to Dr. C. E. Shannon who primarily initiated information theory. Dr. P. Elias of the Massachusetts Institute of Technology, has been generous in undertaking a comprehensive reading and reviewing of the manuscript. His comments, helpful criticism, and stimulating discussions have been of invaluable assistance. Dr. L. A. Cote of Purdue University has been very kind to read and criticize the manuscript with special emphasis upon the material probability theory. His knowledge of technical Russian literature and his unlimited patience in reading the manuscript in its various stages of development have provided a depth that otherwise would not have been attained. Thanks are due to Dr. N. Gilbert of Bell Telephone Laboratories and Dr. J. P. Costas of General Electric Company for comments on the material on coding theory; to Prof. W. W. Harman of Stanford University for reviewing an early draft of the manuscript; to Mr. L. Zafiriu of Syracuse University who accepted the arduous task of proofreading and who provided many suggestions. In addition numerous scientists have generously provided reprints, reports, or drafts of unpublished manuscripts. The more recent material on information theory and coding has been adapted from these current sources but integrated in our terminology and frame of reference. An apology is tendered for any omission or failure to reflect fully the thoughts of the contributors. the past four years, I had the opportunity to teach and lecture in this field at Syracuse University, International Business Machines Corp., General Electric Co., and the Rome Air Development Center.. The keen interest, stimulating discussions, and friendships of the scientists of these centers have been most rewarding. Special acknowledgment is due the United States Air Force Air Development Center and the Cambridge Research Center for supporting several related research projects. vii
viii
ACKNOWLEDGMENTS
am indebted to my colleagues in the of Electrical neering at Syracuse University for many helpful discussions and to Mrs. H. F. Laidlaw and Miss M. J. Phillips for their patient typing of the manuscript. I am particularly grateful to my wife and family for the patience which they have shown.
PREFACE
iii
CHAPTER 1 Introduction 1-1. 1-2. 1-3. 1-4. 1-5. 1-6. 1-7.
Communication Processes. A Model for a Communication System . A Quantitative Measure of Information. A Binary Unit of Information. Sketch of the Plan. Main Contributors to Information Theory . An Outline of Information Theory Part 1: Discrete Schemes without
""
3 5 7 9
11
..
CHAPTER 2 Basic Concepts of Probability 2-1. 2-2. 2-3. 2-4. 2-5. 2-6. 2-7. 2-8. 2-9. 2-10. 2-11. 2-12. 2-13. 2-14. 2-15. 2-16. 2-17. 2-18. 2-19. 2-20.
Intuitive Background . Sets . Operations on Sets . Algebra of Sets . Functions Sample Space Probability Measure Frequency of Events . Theorem of Addition . Conditional Probability Theorem of Multiplication. Bayes's Theorem Combinatorial Problems in Probability . Trees and State Diagrams. Random Variables . Discrete Probability Functions and Distribution . Bivariate Discrete Distributions . Binomial Distribution . Poisson's Distribution . Expected Value of a Random Variable . ix
19 23 24 30 34 36 38 40 42 44 46 49 52 58 59 61 63
65 67
x
CONTENTS
3
3-1. 3-2. 3-3. 3-4. 3-5. 3-6. 3-7. 3-8. 3-9. 3-10. 3-11. 3-12. 3-13. 3-14. 3-15. 3-16. 3-17. 3-18. 3-19.
Basic Concepts of Information
A Measure of Uncertainty. An Intuitive 78 Formal Requirements for the Average 80 II Function as a Measure of Uncertainty 82 Function An Alternative Proof That the Possesses a Maximum 86 89 Sources and Binary Sources Measure of Information for Two-dimensional Discrete Finite Probability Schemes. 91 Conditional Entropies . 94 A Sketch of a Communication Network. 96 99 Derivation of the Noise Characteristics of a Channel. Some Basic Relationships among Different Entropies. A Measure of Mutual Information 104 Set-theory Interpretation of Shannon's Fundamental Inequalities 106 Efficiency, and Channel 108 Capacity of Channels with Noise Structures BSC and BEC Capacity of Binary Channels. 115 Binary Pulse Width Communication Channel 122 124 Uniqueness of the Entropy Function.
CHAPTER 4 Elements of 4-1. 4-2. 4-3. 4-4. 4-5. 4-6. 4-7. 4-8. 4-9. 4-10. 4-11. 4-12. 4-13. 4-14.
1kl''ll''\A.. ",rII,.,..,,1'!I'
The Purpose of Encoding . 131 Separable Binary Codes 137 Shannon-Fane Encoding 138 Necessary and Sufficient Conditions for Noiseless Coding. 142 A Theorem on Decodability 147 Average Length of Encoded Messages 148 Shannon's Binary Encoding 151 Fundamental Theorem of Discrete Noiseless 154 Huffman's Minimum-redundancy Code. 155 Gilbert-Moore Encoding 158 Fundamental Theorem of Discrete Encoding in Presence of Noise 160 Error-detecting and Error-correcting Codes. 166 168 Geometry of the Binary Code Space . Hamming's Single-error Correcting Code 171
CONTENTS
4-15. Elias's Iteration 4-16. A Mathematical Proof of the Fundamental Theorem of Information Theory for Discrete BSC. 4-17. Encoding the English Alphabet .
: Continuum without
180
.a,."..... ".ll.JII..I.,Jf.l.
CHAPTER 5 Continuous Probability Distribution and Density 5-1. Continuous Sample Space. 191 5-2. Probability Distribution Functions . 194 5-3. Probability Density Function. 5-4. Normal Distribution 196 198 5-5. Cauchy's Distribution . 5-6. Exponential Distribution . 199 5-7. Multidimensional Random Variables 200 5-8. Joint Distribution of Two Variables: Marginal Distribution . 202 5-9. Conditional Probability Distribution and Density . 5-10. Bivariate Normal Distribution 206 208 5-11. Functions of Random Variables . 5-12. Transformation from Cartesian to Polar Coordinate System. CHAPTER 6 Statistical Averages 6-1. Expected Values; Discrete Case 220 6-2. Expectation of Sums and Products of a Finite Number of Discrete Random Variables . 222 224 6-3. Moments of a Univariate Random Variable. 6-4. Two Inequalities 227 229 6-5. Moments of Bivariate Random Variables 6-6. Correlation Coefficient. 230 232 6-7. Linear Combination of Random Variables . 6-8. Moments of Some Common Distribution Functions 234 238 6-9. Characteristic Function of a Random Variable . 6-10. Characteristic Function and Moment-generating 239 Function of Random Variables. 6-11. Density Functions of Sum of Two Random Variables . 242 CHAPTER 7 Normal Distributions and Limit Theorems 7-1. Bivariate Normal Considered as an Extension of Onedimensional Normal Distribution . 248 250 7-2. Multinormal Distribution. 7-3. Linear Combination of Normally Distributed Independent Random Variables . 252
xii
CONTENTS
Central-limit Theorem . Random-walk 258 of the Binomial Distribution the Normal Distribution 259 7-7. Approximation of Poisson Distribution a Normal Distribution . 7-8. The Laws of Large Numbers . 263
7-5. 7-6.
8 8-1. 8-2. 8-3. 8-4. 8-5. 8-6. 8-7. 8-8. 8-9. 8-10.
8-11. 8-12.
Continuous Channel without Memory
Definition of Different Entropies. The Nature of Mathematical Difficulties Involved. Infiniteness of Continuous Entropy . The Variability of the Entropy in the Continuous Case with Coordinate Systems A Measure of Information in the Continuous Case. Maximization of the Entropy of a Continuous Random Variable . Entropy Maximization Problems. Gaussian Noisy Channels . Transmission of Information in Presence of Additive Noise . Channel Capacity in Presence of Gaussian Additive Noise and Specified Transmitter and Noise Average Power. Relation between the of Two Related Random Variables Note on the Definition of Mutual Information.
267 269 270 273 275
278 279 282 283
285 287 289
CHAPTER 9 Transmission of Band-limited Signals 9-1. 9-2. 9-3. 9-4. 9-5. 9-6. 9-7. 9-8. 9-9. 9-10. 9-11. 9-12.
Introduction Entropies of Continuous Multivariate Distributions . Mutual Information of Two Gaussian Random Vectors A Channel-capacity Theorem for Additive Gaussian Noise . Digression Sampling Theorem . A Physical Interpretation of the Sampling Theorem . The Concept of a Vector Space . Fourier-series Signal Space Band-limited Signal Space Band-limited Ensembles . Entropies of Band-limited Ensemble in Signal Space.
292 293 295 297 299 300 305 308 313 315 317 320
CONTENTS
9-13. A 9-14. 9-15. 9-16. 9-17.
9-18.
for Communication Continuous Signals . OptimalDecoding A Lower Bound for the Probability of Error An Upper Bound for the Probability of Fundamental Theorem of Continuous Channels in Presence of Additive Noise .. Thomasian's Estimate .
Part 3 : Schemes with
325 327 329 330
I ' l l g'lIl'TIn"l!""U'
CHAPTER 10 Stochastic Processes Stochastic Theory . Examples of a Stochastic Process. Moments and Expectations Stationary Processes Ergodic Processes . Correlation Coefficients and Correlation Functions. Example of a Normal Stochastic Process Examples of Computation of Correlation Functions Some Elementary Properties of Correlation Functions of Stationary Processes 10-10. Power Spectra and Correlation Functions . 10-11. Response of Linear Lumped Systems to Excitation 10-12. Stochastic Limits and Convergence . 10-13. Stochastic Differentiation and Integration . 10-14. Gaussian-process Example of a Stationary Process. 10-15. 'I'he Over-all Mathematical Structure of the Stochastic Processes . 10-16. A Relation between Positive Definite Functions and Theory of Probability
10-1. 10-2. 10-3. 10-4. 10-5. 10-6. 10-7. 10-8. 10-9.
CHAPTER 11
343 344 347 349 352 353 356 357 359 363 365 367 368 370
Communication under Stochastic Regimes
Stochastic Nature of Communication Finite Markov Chains. A Basic Theorem on Regular Markov Chains . Entropy of a Simple Markov Chain. Entropy of a Discrete Stationary Source Discrete Channels with Finite Memory. Connection of the Source and the Discrete Channel with Memory . 11-8. Connection of a Stationary Source to a Stationary Channel
11-1. 11-2. 11-3. 11-4. 11-5. 11-6. 11-7.
338
374 376 377 380 384 388 389 391
xiv
CONTENTS
CHAPTER 12 The Fundamental Theorem of mrormation .
B · h ........ _,.......
PRELIMINARIES
12-1. A Decision Scheme. 12-2. The Probability of Error in a Decision Scheme . A Relation between Error Probability and 12-4. The Extension of Discrete Memoryless Noisy onenneis
398 398 400
FEINSTEIN'S PROOF
12-5. On Certain Random Variables Associated with a Communication System . 12-6. Feinstein's Lemma. 12-7. Completion of the Proof .
403 405 406
SHANNON'S PROOF
12-8. Ensemble Codes. 12-9. A Relation between Transinformation and Error Probability 12-10. An Exponential Bound for Error Probability W OLFOWITZ '8 PROOF 12-11. The Code Book. 12-12. A Lemma and Its Application. 12-13. Estimation of Bounds . 12-14. Completion of Wolfowitz's Proof.
409 412
416 419 421
CHAPTER 13 Group Codes 13-1. 13-2. 13-3. 13-4. 13-5. 13-6. 13-7. 13-8.
Introduction The Concept of a Group . 425 Fields and Rings 428 Algebra for Binary n-Digit Words 429 Hamming's Codes . 431 Group Codes 435 A Detection Scheme for Group Codes 437 Slepian's Technique for Single-error Correcting Group Codes . 438 13-9. Further Notes on Group Codes . 442 13-10. Some Bounds on the Number of Words in a Systematic Code . 446
APPENDIX Additional Notes and Tables N-1. The Gambler with a Private Wire. N-2. Some Remarks on Sampling Theorem. N-3. Analytic Signals and the Uncertainty Relation
.
452 454
CONTENTS
N-4. N-5. N-6. N-7. T-l. T-2. T-3.
Elias's Proof of the Fundamental Theorem for Further Remarks on Coding Theory . Partial Ordering of Channels Information Theory and Radar Problems Normal Probability Integral Normal Distributions A Summary of Some Common Probability Functions Probability of No Error for Best Group Code T-5. Parity-check Rules for Best Group Alphabets T-6. Logarithms to the Base 2 T-7. Entropy of a Discrete Binary Source .
460 462 464 465 .
467 468 469 471 476
BIBLIOGRAPHY. .
481
NAME INDEX.
~1
SUBJECT INDEX .
493
dH,APTER
-1
Information theory is a new branch of probability theory with extensive potential applications to communication systems. Like several other branches of mathematics, information theory has a physical origin. It was initiated by communication scientists who were studying the statistical structure of electrical communication equipment. Our subject is about a decade old. It was principally originated by Claude Shannon through two outstanding contributions to the mathematical theory of communications in 1948 and 1949. These were followed by a flood of research papers speculating upon the possible appucations of the newly born theory to a broad spectrum of research areas, such as pure mathematics, radio, television, radar, psychology, semantics, economics, and biology. The immediate application of this new discipline to the fringe areas was rather premature. In fact, research in the past 5 or 6 years has indicated the necessity for deeper investigations into the foundations of the discipline itself. Despite this hasty generalization which produced several hundred research papers (with frequently unwarranted conclusions), one thing became evident. The new scientific discovery has stimulated the interest of thousands of scientists and engineers around the world. Our first task is to present a bird's-eye view of the subject and to specify its place in the engineering curriculum. In this chapter a heuristic exposition of the topic is given. No effort is made to define the technical vocabulary. Such an undertaking requires a detailed logical presentation and is out of place in this informal introduction. However, the reader will find such material presented in a pedagogically prepared sequence beginning with Chap. 2. This introductory chapter discusses generalities, leaving a more detailed and precise treatment to subsequent chapters. * The specialist interested in more concrete statements may wish to forgo this introduction and begin with the body of the book. * 1-1. Communication Processes. Communication processes are concerned with the flow of some sort of information-carrying commodity in
* With the exception of Sec. 1-7, which gives a synopsis of information theory for the specialist. 1
2
INTRODUCTION
some network. The need not be tangible; for example, process which one mind affects another mind is a communication procedure. This may be the sending of a message by telegraph, visual communication from artist to viewer, or any other means by which tion is conveyed from a transmitter to a receiver. The subject matter deals with the gross aspects of communication models rather than with their minute structure. That is, we concentrate on the over-all performance of such systems without being restrained to any n~::ll1rof'.ltf"ilnl(~"iI" equipment or organ. Common to all communication processes is the flow of some, commodity in some network. While the nature of the commodity can be as varied as electricity, words, pictures, music, and art, one could suggest at least three essential parts of a communication system (Fig. 1-1): 1. Transmitter or source 2. Receiver or sink 3. Channel or transmission network which conveys the communique from the transmitter to the receiver l]ty)tl1l)Ol)'jnn_
FIG.
1-1. The model of a communication system.
This is the simplest communication system that one can visualize. Practical cases generally consist of a number of sources and receivers and a complex network. A familiar analogous example is an electric power system using several interconnected power plants to supply several towns. In such problems one is concerned with a study of the distribution of the commodity in the network, defining some sort of efficiency of transmission and hence devising schemes leading to the most efficient transmission. When the communique is tangible or readily measurable, the problems encountered in the study of the communication system are of the types somewhat familiar to engineers and operational analysts (for instance, the study of an electric circuit or the production schedule of a manufacturing plant). When the communique is "intelligence" or "information," this general familiarity cannot be assumed. How does one define a measure for the amount of information? And having defined a suitable measure, how does one apply it to the betterment of the communication of information? To mention an analog, consider the case of an electric power network transmitting electric- energy from a source to a receiver (Fig. 1-2). At the source the electric energy is produced with voltage Va. The receiver requires the electric energy at some prescribed voltage V r • One of the
INTRODUCTION
involved is the heat loss the cnanner In words, of the wires acts as a narasmc 1"t:l>f"i·~l":T~'r One of the many tasks of the designer isto theloss mission lines. This can be accomplished the of the transmission lines. A parallel method of transmission imnrovement is to increase the voltage at the terminals of the line. As is well .known, this the of transmission energy losses in the line. A transformer installed at the input terminals of the line is appropriate. At the terminals another transformer (step-down) can provide the specified voltage to the receiver. Without being concerned about mathematical discipline in this ductory chapter, let us ask if similar could be to the transmission of information. If the channel of transmission of information is a lossy one, can one still improve the efficiency of the transmission
n1"~"'\hlt:l>Ttf'H;!
FIG.
1-2. An example of a communication system.
by procedures similar to those in the above case? This of course uenenns, in the first place, on whether a measure for the of transmission of information can be defined. 1-2. A Model for a Communication The communication systems considered here are of a statistical nature. That is, the performance of the system can never be described in a deterministic sense; rather, it is always given in statistical terms. A source is a device that selects and transmits sequences of symbols from a given alphabet. Each selection is made at random, although this selection may be based on some statistical rule. The channel transmits the incoming symbols to the receiver. The performance of the channel is also based on laws of chance. with a of A} and If the source transmits a symbol, say the channel lets through the letter A with a denoted P {AlA}, then the probability of transmitting A and receiving A is PtA} · P{AIA}
The communication channel is generally lossy; i.e., a part of the transmitted commodity does not reach its destination or it reaches the destination in a distorted form. There are often unwanted sources in a com-
munication cnannei, such as noise radio television or passaae vehicle in direction a one-way These sources noise. disturbance are referred to as noise sources or task of the is the the loss and ontrmum recovery of the commodity when it is the effect of noise. deterministic model of that one device which may be used to the of the is called a transformer. In the vocabulary of information theory a device 1'lYVl1nn"t"1"1QI11''\1"_
FIG.
1-3. General structure of a communication system used in information theory.
that is used to the of the channel may be called an encoder. An encoded message is less to channel noise. At the receiver's terminal a decoder is employed to transform the encoded form which is acceptable to the It messages into the could be said that, in a certain sense, for more" efficient" commumcation, the encoder a one-to-one mathematical mapping or an F on the I, while the decoder the inverse of that ",,,,,,r.,..n1""I1"'~ 'U'!IJ.>J ..........
Encoder: Decoder:
F
r-:
I F(I)
F(I)
v ....' U' ........
(1-1)
This perfect procedure is, of course, hypothetical; one has to face the ultimate effect of noise which in physical systems will prevent perfect communication. This is clearly seen in the case of the transmission of electrical energy where the transformer decreases the heat loss but an efficiency of 100 per cent cannot be expected. The step-up transformer acts as a sort of encoder and the step-down transformer as a decornnz Thus, in any practical situation, we have to add at least three more basic parts to our mathematical model: source of noise, encoder, and decoder (Fig. 1-3). The model of Fig. 1-3 is of a general nature; it may be applied to a variety of circumstances. A novel application of such a model was made by Wiener and Shannon in their discussions of the statistical nature of the communication of messages. It was pointed out that a radio, television, teletype, or speech
selects sequences of at rnmnom.
such communication the source, cnanner, encoder, uecouer. defined. This source, and receiver must be itself constitutes a significant contribution to the communication sciences, In of this one comes to realize a basic tion systems some of n ....r'\hcJl hl.11"-.:r cation theories cannot be studied without background of readers fundamentals of can most erncientiv research in the field of communication. In the macroscopic study of communication systems, some of the basic questions us are these: for 1. How does one measure information and define a suitable such measurements? 2. defined such a how does one define an information source, or how does one measure the rate at which an information source supplies information? channel? does define the 3. What is the which a channel transmits information? how does one the 4. Given a source and a transmission of information and how does one go about rate? How far can the rate be . . ,. ". ,. . . . . . . . ., . ..,.,.,.rII 5. To what extent does the presence of noise limit the rate sion of information without the communication To present answers to these task. This is undertaken in the following chapters. for the benefit we of those who wish to acquire a heuristic introduction to the include a brief discussion of it here. In our we deal with ideal mathematical models of communication. We confine ourselves to models that are statistically defined. That is, the most nificant feature of our model is its The source, instance, transmits at random anyone of a set of prespecified messages. as to which message will be transmitted We have no next. But we know the each message directly, or to that effect. If the behavior of the model were an amount predictable (deterministic), then recourse to information would hardly be necessary. When the model is statistically defined, while we have no concrete a sense, assurance of its detailed performance, we are able to describe, its "over-all" or "average" performance in the light of its statistical description. In short, our search for an amount of information is virtu'1I"1!"nn.... 4""l."iTl"no-
Lv.UlarU..ll..ll.D.V
6
INTRODUCTION
a
associated a The indicate a relative measure of uncertainty relevant to the occurrence of each particular message in message ensemble. We shall illustrate how one goes about defining the amount of information by a well-known rudimentary example. Suppose that you are with selection of equipment from a catalog which indieatesn distinct models: 1J.1 \.HJQIU.11..llJ_Y
The desired amount of information I(xk) associated with the selection of a particular model Xk must be a function of the probability of choosing Xk: (1-2)
for simplicity, we assume that each one of these models is selected with an equal probability, then the desired amount of information is only a function of n.
Next assume that each piece of equipment listed in the catalog can be ordered in one of m distinct colors. If for simplicity we assume that the selection of colors is also equiprobable, then the amount of information associated with the selection of a color Cj among all equiprobable colors . . . ,Cm] is
where the function f(x) must be the same unknown function used in (1-2a).
Finally, assume that the selection is done in two ways: 1. Select the equipment and then select the color, the two selections being independent of each other. 2. Select the equipment and its color at the same time as one selection from mn possible equiprobable choices. The search for the function f(x) is based on the intuitive choice which requires the equality of the amount of information associated with the selection of the model Xk with color Cj in both schemes (1-2c) and (1-2d). I(xk and Cj) == Ii(xk) I(xk and c;)
=
+ 1 (cj) = f(~) + f
f(~n)
2
1
(1-2c) (1-2d)
,INTRODUCTION
f
=1
+1
1
This functional equation has several solutions, for our purpose, is f(x) == -- log x
most imnortant
To give a numerical example, let n = 18 and m = 8. I 1(Xk) = log 18 1 2 (cj) = log 8 I(xk and Cj) = Il(~k) I(xk and Cj) = log 18
+ 1 (cj) 2
+ log 8 = log 144
Thus, when a statistical experiment has n equiprobable outcomes, the average amount of information associated with an outcome is log n. The logarithmic information measure has the desirable property of additivity for independent statistical experiments. These ideas will be elaborated upon in Chap. 3. 1.. .4. A Unit of Information. The simplest case to consider is a selection between two equiprobable events E 1 and be, say, head or tail in a throwing of an "honest" coin. (1-4), the amount of information associated with the selection of one out of two equiprobable events is -- log % = log 2 An arbitrary but convenient choice of the base of the logarithm is 2. In that case, -- log, ~ = 1 provides a unit of information. This unit. is commonly known as a bit. t
FIG. 1-4. A probability space with two equiprobable events.
* Another solution is f(x)
== number of factors in decomposition of ! in product of primes with minus sign x
For example, let n == 18, m == 8; then n==2·3·3 m==2·2·2
mn == 2 . 2 . 2 . 2 . 3 . 3
fa) = -3 f(l) = -3 f(~n) ~
-6
In information theory we require thatf(x) be a decreasing function of the probability of choices. This narrows down the solution of Eq. (1-4) to k log », where k is a constant multiplier. [For an axiomatic derivation, see Sec. 3-19 or A. Feinst-ein (1).] t When the logarithm is taken to the base 10, the unit of information corresponds to the selection of one out of ten equiprobable cases. This unit is sometimes referred to as a Hartley since it was suggested by Hartley in 1928. When the natural base is used, the unit of information is abbreviated as nat.
8
INTRODUCTION
consider the selection of one out of , a into two prnUUII"ir choices. successively likely selections, we come to the conclusion that the amounts of tion associated with the I L selection schemes are, respectively, U- 2, 3, 4, . . . , N bits. In a slightly more case, consider a source with a finite, number of messages and their corresponding transmission probabilities. l"nT,n'r'l1nC_
[Xl,X2, . . . ,xn ] [P{Xl},P{X2}, ... ,P{xn } ]
The source selects at random each one of these messages. Successive selections are assumed to be statieFIG. 1-5. Successive partitioning of the tically independent. The probabilprobability space. associated with the selection of message Xk is P {Xk} • The amount of information associated with the transmission of message Xk is defined as
is also called the amount of self-information of the message average information per message for the source is I = statistical average of
Xk.
The
(1-5)
For instance, the amount of information associated with a source of the above type, transmitting two symbols 0 and 1 with equal probability, is I = - (% log
%+~
log
72) =
1 bit
If the two symbols were transmitted with probabilities ex and 1 -- ex, then the average amount of information per symbol becomes I = - a log a - (1 -- ex) log (1 -- a)
(1-6)
The average information per message I is also referred to as the entropy (or the communication entropy) of the source and is usually denoted by the letter H. For instance, the entropy of a simple source of the above type is H(Pl,P2, · .. ,Pn) = - (PI log PI
+ P2 log P2 + · · · + P» log Pn)
INTRODUCTION
where (Pl,P2, . refers to a discrete commtete h't'B
A :> B} ACB
A = B
In many instances, when dealing with specific problems, it is most convenient to confine the discussion to objects that belong to a fixed class of elements. This is referred to as a universal set. For example, suppose that, in a certain problem dealing with the study of it may be required to define the set of all integers I, or the set of positive numbers or the set of perfect square integers S. All these sets can be looked upon as subsets of the larger set of all real numbers. This latter set may be considered as the set U, a definition which is useful U dealing with the under discussion. In problems concerned with the interrelationship of sets, an IIR.I~ILI-t1, tive diagram called a Venn * diagram is of considerable visual assistance. FIG. 2-1. Example of a Venn diagram. The elements of the universal set in a Venn diagram are generally shown by points in a rectangle. The elements of any set under consideration are commonly shown by a circle or by any other simple closed contour inside the universal set. The universe associated with the aforesaid is illustrated in 2-1. A set may contain a finite or an infinite numbe-r of elements. a set has no element, it is said to be an empty or a null set. For example, the set of the real roots of the equation 11n'I",(7.o1"'00
2z 2
+1= 0
is a null set.
* Named after the English logician
John Venn (1834-1923).
BASIC CONCEPTS OF DISCRETE PROBABILITY
Sets. Consider a universal set U elements. U contains all possible elements under consideration. The universal set may contain a number of subsets A, C, . . which individually are well-defined sets. The operation of union, mtersection. and complement is defined as follows: The union. or sum of two sets A and B is the set of all those elements that belong to A or B or both. The intersection or product of two sets A and B is the set of all those elements that belong to both A and B. The difference B - A of any set A relative to the set B is a set consisting of all elements of B that are not elements of A.
FIG.
2-2. Sum or union A
FIG.
+ B.
2-4. Complement.
FIG.
2-3. Intersection or product A .
FIG.
2-5. Difference A-B.
The complement or negation of any set A is the set A I containing all elements of the universe that are not elements of A. In the mathematical literature the following notations are commonly used in conjunction with the above definitions.
A V B A (\ B A - B B A A I'J
A union B, or A cup B A intersection B, or A cap B relative complement of B in A B is contained in A complement of A
In the engineering literature the notations given below are . . . used. A+B sum or union A· B or AB intersection or product A-B difference A' complement
(2'-8) (2-9)
J1.J1..JLJl.J1.'-'U'J1.J1.J1..J
(2-.12) (2-13) (2...14) (2-15)
24
DISCRETE SCIIEMES WITHOUT MEMORY
For the convenience of the we shall to latter notations. However, where any confusion in notation may occur we resort to mathematical notation. The universe and the empty set will be denoted U and 0,.. When the product of two sets A and B is an set, that is, "'''-1I..,VV ......
the two sets are said to be mutually exclusive. When the product of the two sets A and B is equal to B, then B is a subset of A.
RCA
implies
Af'\B=B
(2-17)
The sum, the product, and the difference of two sets and the complement of any set A are illustrated in the shaded areas of the Venn diagrams
o 2-6. AB == O.
FIG.
Mutually
exclusive
sets.
FIG.
2-7. Subset B
C
A.
2-2 to 2-5. Figures 2-6 and 2-7 illustrate the sets and (2-17). Example 2-1.
AB ==
1"ot.01"1""I"O'
to
Let the universe consist of the set of all positive integers, and let A = {1,2,3,6,7,lO} B = {3,4,8,lOI 0= [z ]
where x is any positive integer larger than 5. Find A + B, A . B, A - B, A . 0, B . 0, 0', and A + B + C. Solution A + B = {1,2,3,4,6,7,8,10J A·B={3,10}
A - B = {1,2,6,7} A . C = {6,7,lO} (A
+
B·C={8,10} 0' = {1,2,3,4,51 B) + 0 = U - {5}
2.. .4. of Sets. We now state certain important properties concerning operations with sets. Let A, B, and C be subsets of a universal set U; then the following laws hold.
BASIC CONCEPTS OF DISCRETE PROBABILITY
FIG.
AB
2-8. Distributive law.
A (B
+ AC.
+ C)
=
FIG.
2-9. Distributive law.
..---..... /'
" A '\.
"-
2-10.
FIG.
--.
,,
~~~ (A
/'
-
....... '\./
r
\ '\.
.,A
"-
"'-
+ B)' =
FIG.
A'B'.
A'
.........
...""
B'.
A+B=B+A AB = BA
Associaiioe Laws:
+C =
A
+
+ C)
(AB)C = A(BC)
Distributive Laws: A(B
A
+
+ BC
+
C) = AB AC = (A B)(A C)
+
+
Complementarity:
A
+ A'
== U
0 A+U=U AA' =
AU = A A +0= A
A0
=
-
2-11. Dualisation.
+
Commutative Laws:
+ B)
-
0
Difference Law: CAB) + (A - B) = A (AB)(A - B) = 0 A - B = AB'
"'"
+ BC
'\.
l--B
A--{
..."""
Dualization.
A
+ B) (A + C).
(A
./
(AB)' ==
=
26
DISCRETE SCHEMES WITHOUT MEMORY
Dualization. or De
Law:
(AB)' = A'
+ B'
Involution Law:
=A The of the set A' is the set A. iaemooteni Law: For all sets
A AA
A=A = A
While the afore-mentioned laws are not meant to offer an axiomatic presentation of set theory, they are of a fundamental nature for deriving a large variety of identities on sets. The agreement of all these laws with the laws of thought can be verified. One assumes that an element x is a member of the set of the left side of each identity, and then one has to prove that x will necessarily be a member of the set of the right side of the same equation. For instance, in order to prove the distributive law let
+
x
Then
A
x (B
x
+ C)
Then at least one of the following three cases must be true: x
x
A B
(b) x x
A C
(c) x
E A
x x
C
These are in turn equivalent to (a) x
AB
(b)
x
AC
(c) x
E ABC
but ABC CAB Therefore it is sufficient to require x
E AB
AC
Similarly, one can show that x E AB + AC implies x A(B + The Venn is often a very useful visual aid. Its use is of ble assistance in solving problems, as long as the formal are not overlooked. .tiX~Lm1Jlle
2.. .2. Verify the following relation: (A
Solution.
+ B)
- AB == AB'
+ A'B
virtue of the third relation of Eqs. (2-24), (A
+ B)
- AB == (A
+ B) (AB)'
BASIC CONCEPTS OF DISCRETE PROBABILITY Appll(~at:lon of
(A
De Morgan's law yields
(A + B) (AB)' = (A + B)(A' + B') + B)(A' + B') = A'A + A'B + B'A + B'B = AB' + A'B
For an alternative proof, let
z
[(A
+ B)
-AB]
Then only one of the following two cases is possible: (a) x
A
(b) x
x
B
x
B A
These cases are equivalent to (b) :
~ ~,}x E
A'B
Note that AB' and A'B are mutually exclusive sets. Similarly, one can show that all the elements belonging to the set at the right side of the above equation also belong to the set of the left side. Thus the two sides present equivalent sets. Example 2-3. Express the set composed of the hatched region of Fig. E2-3 in terms of specified sets. Solution. The desired set A is
See Fig. E2-3.
FIG.
Example 2-4.
E2-3
Verify the relation (A
+ B)'C = C -
C(A
+ B)
Solution. We may wish to verify the validity of this relation by using the Venn diagram of Fig. E2-4. The left side of this equation represents the part of the set C that is not in A or B. The right side represents C - CA - CB, that is, the part of C that is not included either in A or in B.
FIG.
E2-4
28
DISCRETE SCHEMES WITHOUT MEMORY
Example Consider the circuit of Fig. E2-5. The setup CUIJ.~~l11!:'j which must be activated for or opening the corresponding relay. are normally open relays and A', B', and C' are normally closed relays which are respectively activated by the same controlling source. For instance, when relay is open because of the effect of its activating coil, A' is closed. In order to have a current flow between the terminals M and N, we must have the set of relay operations indicated by ABC + AB'C + A'B'C. With this in mind, the question is to the given network by a less complex equivalent circuit. A
(a)
(b) FIG.
Solution.
E2-5
A way of simplifying the above expression is the following:
F = C(AB + AB' + A'B') F = C[A(B + B') + A'B'] F = C(AU + A'B') F = C(A + A'B') F = C(A + B') A circuit presentation of this example is illustrated in Fig. E2 ..5b. Example 2-6. Verify the equivalence of the two relay circuits of Fig. E2-6. A
B'
B
(a)
(b) FIG.
Solution.
E2..6
The set that corresponds to the operation of the circuit in Fig. E2 ...6b is (A
+ B)(A' + B')
Direct multiplication gives
AA'
+ AB' + BA' + BB'
== AB'
+ A'B
The latter set can be immediately identified with the circuit of Fig. E2-6b.
Sheffer-stroke Operation. Examples 2-5 and 2-6 have illustrated some use of Boolean algebra in relay circuits. As another example of the use of
BASIC CONCEPTS OF DISCRETE PROBABILITY
0+-=--FIG.
(XI Y)
2-12. Sheffer stroke.
Boolean algebra in problems, we discuss what is referred to as the Sheffer-stroke operation. This operation for two sets X Y is denoted by (XI Y) and is defined by the equation (XI Y) = X' U Y'
not X, or not Y, or not X and Y
The Sheffer stroke commonly illustrated by the three-port diagram of Fig. 2-12 has the distinct property that it can replace all three basic
ot----xnY FIG.
2-13. Product operation by two Sheffer strokes.
x OoooI---XU Y
y
FIG.
2-14. Summing operation by three Sheffer strokes.
operations of Boolean algebra (sum, product, and negation). validity of this statement can be exhibited in a direct manner.
FIG.
2-15. Operation of negation with a Sheffer stroke.
PRODUCT OPERATION.
Reference to the diagram of Fig. 2-13 suggests
that «XIY)\(XIY» SUMMING OPERATION.
= =
(X' U Y')' «Xt) V)')' = X t> Y
The diagram of Fig. 2-14 suggests
«XIX)I(YIY»
= (XIX)' V
(Y\Y)'
=XVY NEGATION.
The
Reference is made to the diagram of Fig. 2-15. (XIX)
= X' V X' = X'
DISCRETE SCHEMES WITHOUT MEMORY
Functions. In this section, some or numbers will be associated with each and every element of a given set. The on which this relationship is based is commonly known as function.
Range
Domain
FIG. 2-16. Domain X and range Y.
If X = {x} is a set and y = f(x) is a rule, that is, a sequence of specified operations and correspondence for assigning a well-defined object y to every member of then by applying this rule to the set X, we obtain a set Y = {y}. The set X is called the domain and Y the range. When x covers the elements of X, then y will correspondingly cover the elements of Y. For example, let X be the set of all persons living in the state of California on January 1, 1959, and let the function be defined as follows: anyone who is the father of a person described by X and is in the state of Colorado on January 1, 1959. Assuming that all the words appearing in the rule, such as father, California, Colorado, are well-defined words, this may be considered as a well-defined function. To each member of X there corresponds an object in the set Y. In this example, element zero in Y corresponds to some of the elements of X, and several members of X might have a unique correspondent in Y. As another simple example, consider the set X = {1,2,O,-2,-I,31',10}
and the function f(x)
= x 2 --
1
which lead to the set Y
= {O,3,-1,3,O,-%,99}
The domain of x and the range of yare shown in Fig. 2-16, the correspondence being one-to-one from X to the Y set. Example 2-7. A set of ordered pairs 8 = {(x,y)}, that is, a set of points in the rectangular coordinate system, is given in Fig. E2-7a.
BASIC CONCEPTS OF DISCRETE PROBABILITY J
s
Y
4
3
4
-
.-~(7~
4 =
15 >3·~5
+ J~l + ~5)
-
21 61
Bayes's" theorem comprises one of the most misused, concepts of probability theory. In many problems an event may occur as an "effect" of several "causes." From a number of observations on the occurrence of the effect, one can make an estimate on the occurrence of the cause leading to that effect. This rule is applied to communication problems, particularly in the detection of signals from an additive mixture of signals and noise. When the uetectmz instrument indicates a signal, we have to make a decision whether the received signal is a true one or a false alarm due to undesired signals (noise) in the system. Such decisions are generally made an application of Bayes's rule which is also called the rule of inverse bility. The decision criterion may be made more effective the some kind of weighting coefficients called l088 matrix and over-all "loss." 2-13. Combinatorial Problems in many involving choice and probability, the number of possible ways of arranging a given number of objects on a line is of interest. For example, if and C are standing in a the that three persons A, remains next to B can be calculated as follows: There are six different arrangements possible: +,..D.11l11Q1n11-1"lr'F'
ABC
ACB
BAC
BCA
CAB
CBA
Of these arrangements, there are four desirable ones. concept of equiprobable measures is assumed, the probability in \...IU.~00lJ.I.V.J..J. is ;~. Combinatorial problems have a limited use in our subsequent studies. For this reason, we shall give only a review of the most pertinent definitions in this section. The reader interested in combinatorial nroruems will find a considerable amount of information in Feller 2 to Permutation: A permutation of the elements of a finite set is a one-toone correspondence between elements of that set (such a corresnondence is also called a mapping of the set onto itself). For example, if a set
* Reverend Thomas Bayes's article An Essay towards Solving a Problem in the Doctrine of Chances was published in Philosophical Transactions of the Royal Society of London (vol. 1, no. 3, p. 370, 1763). However, Bayes's work remained rather unknown until 1774, when Laplace discussed it in one of his memoirs.
50
DISCRETE SCHEMES WITHOU'T MEl\10RY
B, C, and D, we rnay write two
four objects
contains sets
A, B,
The ordered sets;
c, D
[~ ~
and
i ~]
and
B, C,
D
[~ ~
A 3
~]
O.Nl1'i"tT'JiI.cn'\i"
are two per-
mutations of the elements of the original set, since
A--tB B--tC A D--tD
(2-99)
c-.
The following definition is of considerable assistance in dealing with combinatorial problems. Factorial: The factorial function for a positive integer n is defined as
n! = n(n - l)(n - 2) · · · 4 ·3 · 2 · 1
(2-100)
with the additional convention
O!
=
1
The number of different permutations of a set with n distinct elements is P; = n(n - l)(n - 2) . . . 4 ·3 ·2·1 = n!
(2-102)
Combination: The number of different permutations of r objects selected from n objects is Prn = n(n - l)(n - 2) . . . (n - r
+
n'
1) = (n _. r)!
(2-103)
Every permutation of elements of a set contains the same elements but in different order. When two sets of objects are in one-to-one correspondence so that some of the elements of one do not appear in the other they are called different combinations. For example, if we combine the members of the set {A,B,C,D} two by two, AB, AC, DB are different binations but AB and BA are not. The number of different combinations of n objects taken r at a time is
P»
c» = rT =
n!
r!(n _ r)!
(2-104)
When confusion will not result, one may use the notation (;) for C/'. Note that
BASIC CONCEPTS OF DISCRETE PROBABILITY
=n
+
+
+
(2-108)
The following theorem is often used in combinatorial problems. a set contain k mutually exclusive subsets of objects: with
Let
{A 1,A 2, • • • ,A k } Ai = {ail,ai2, . . . ,ain,} i = 1, 2, . . . , k
n, being the number of elements in the set Ai. tions of the total number of elements n is
The number of
'In.O'''''"tY\·lIl11''€"l_
n! In fact, one has to divide the number of permutations of n objects (for i = 1, 2, . . . , k) since the permutations of the identical objects of the set cannot be distinguished from each other. :For 'V~(;lu the number of color of three black and two white balls is ..U .•JJl.'V9
5! 5 X4 3121 = -"2T- = 10
Binomial Expansion: Let n be a positive integer; then (a
+ b)n = an + (~) an-1b + (;) an- + . · · 2b 2
or (a
+ b)n = an + nan-1b + n(n 2~ 1) an+ n(n - ~!(n - 2) an- + ... + b" 2b 2
3b3
(2-111)
A useful display of a binomial coefficient is given in a table which is called Pascal' 8 triangle:
52
DISCRETE SCHEMES 'VITHOUT MEMORY
1 1
121
3 3 1 464 1
In the following a number of simple examples dealing with permutations and combinations are presented. In these examples, the primary assumption is that the probability is given by the frequency of the event under consideration; that is, the concept of equiprobable measure prevails. Hence, such problems are reduced to a study of the ratio of the favorable cases to all possible cases. In this respect the formula of combinatorial analysis will be used. Example 2...22. What is the probability of a person having four aces in a bridge hand? Solution. The number of all possible different hands equals the combination of 13 from 52 cards. For the number of favorable cases one may think of first removing the four aces from the deck and then dealing all possible combinations of hands 9 by 9. The addition of the four aces to each one of these latter hands gives a favorable case. 10 ·11 ·12 ·13 11 48) (52) (9 : 13 = 49 . 50 . 51 . 52 = 4,165 Example 2..23. Two cards are drawn from a regular deck of cards. What is the probability that neither is a heart? Solution. Let A and B be the events that the first and the second card are hearts, respectively; then we wish to know P {A'B' }. PIA'} P{B'IA'}
= 1 - PIA} = 1 = P{A'R'}
- }~
=
,~
PIA'}
Therefore
P{B'!A'} = PIA B'} =
3%1 3%1 •
% =
1~34
If we wish to apply combinatorial principles, we may say that the number of all possible cases of selecting two cards is
(:2).
The number of favorable cases is
Therefore the probability in question is 39) . (52) (2 . 2
=
39! 2!50! 2 !37! 52!
= 39 ·38 = 19 51 . 52
34
2-14. Trees and State Diagrams. The material of this section is intended to offer a graphical interpretation for certain simple problems of probability which arise in dealing with repeated trials of an experiment.
BASIC CONCEPTS OF DISCRETE PROBABILITY
For suppose that a biased coin is tossed once; be denoted by Hand T and shown by the of larly, if the coin is tossed twice, the second set of outcomes may be shown in the same treelike diagram. If the probability of a head is p, then the probability of say, HT can be 1111"t.'lllb11".1" denoted computed from the weighted length of the associated tree p(l - p)
If it is desired to obtain the probability of getting a head and a tail B
HH H
I-p
p
T
A
HT
2/1
3/5
TH
B
TT
W
B W
FIG. 2-20. A simple tree diagram.
FIG. 2-21. An example of a probability tree.
irrespective of their order, then the answer to the problem is summing up the two weighted tree paths. p(l - p)
This problems.
+ (1 -
p)p = 2p(1 - p)
graphical procedure can be used in certain The following are examples of such problems.
Example 2-24. The urn A contains five black and two white balls. The urn B contains three black and two white balls. If one urn is selected at random, what is the probability of drawing a white ball from that urn? Solution. From the tree diagram of Fig. 2-21 one can see that the probability of the event of interest is the sum of the following measures:
.r"Xlllmple 2-26. Find the probability that at least three heads are obtained in sequence of four throws of an honest coin. Solution. From a tree diagram or from the binomial expansion one obtains
(:) .
U~)4 +
G) ·
(%)3(%)1
= ~6 + ~6
=
%6
If a coin is tossed n times, we note that the probability of getting, say, exactly r heads (r < n) is the sum of the tree measures of all tree leading to r heads and n -- rtails.
Since there are
such states, it is
54
DISCRETE SCHEMES WITHOUT
found that the desired """'.". . .
' l h . . - . . h .....
+ ......
.A.lI.11..&:.ls(15 log 5
+ 12 log 3 -
32) bits
H(>'5,%) = >15(15 log 5 - 24) bits H(>~,%)
= >15(15 log 3 - 10) bits
(c) It is a matter of numerical computation to verify that
H(>i,%IS,%5) = H(>'5,%)
+ %H(~:1,%)
Example 3...2. Verify the rule of additivity of entropies for the following pro bability schemes (Fig. E3-2a).
(a)
(b)
(c)
(d) FIG.
(a) [A,B,C,D] (Fig. E3-2b). [BIA', CIA', DIA'] (b) [A,A'] (0) (Fig. E3-2d.)
E3-2
(Fig. E3-20).
Numerical example:
[Prritlltiy] = [~--~-~-~~] Solution. The object of the problem is to demonstrate that the average uncertainty in a system is not affected by the arrangement of the events, as long as the probabnrties of the individual events do not change. (a)
H = -PA log PA -PBlogPB -PologPo -PDlogP D PA = ~'2
where H
=
PB
= }4
Po
= }8
PD
=>8
~ ~'2 log ~'2 - ~~ log ~4 - ~8 log ~'8 - ~'8 log ~'8 = ~2 log 2 ~~ log 4 ~'8 log 8 ~'8 log 8
=
~2
+
+ ~'2 + % +%
= 1~~ bits
+
+
DISCRETE SCHEMES
additivity
(b) Accornmg
=
[-P A log
n1"nn.a1"1r:u
ll./I"'R:i'lJ Ulrnnv ·
T1iTTMn1l:TnTTfn
[Eq. (3-19)] of the
- (1
+ =
-PA
P A - (1 -PA) log (1 -PA) -PBlog
- Pc log
log
-PA log P A. - (P B + Pc + PD) log (1 - P A) - P B log P E + P B log (1 - PA) - Po log Po + Po log (1 - PA) - PDlogP D + PDlog (1 - PAl = -PAlogPA -PBlogPE -PalogPc -PDlogPD
=
= .~
where
Pc = ~'8
PD
=
~'8
- }21og }2 - }2 1og }2 - }2(}2 Iog }2 + }~ log }i + }~ log }2 log 2 + }21og 2 + }2(}2Iog 2 + }~ log 4 + }i log 4) = }2 + ~2 +}2(}2 + }2 + }~) = 1% bits (c) H == -(PA +PB ) log (P A +PE ) - (Po +PD) log (Po +P D)
H
=
}~)
=
==
+ (P A + P B) + + P D) ( -(P + P A
B)
-
log -=------==--
-
log (P A
+ PE)
-
(Po
+ P D)
log (Pc
- Pc log
- p B log
log
log =------=:---
+ P n)
=----:::-
+ P B ) - (Pc + P D ) log + PB + P B log (P A + P B) - PA logP', + PA log (P A + P B) - P E log Po + Po log (Po + P D) log P D + P D log (Po + PD) -PA log P A - P E log P B - Po log Po - P D log P D
= -(PA + PE) log (PA
=
P B = ~~
where
H =
-C%:) log % -
~i
Po =}S
log}i + %(-% log % -
P D =}S
Ji
log }i) + }~( -}2Iog ~ - ~2log ~}) = -% log 3 + % log 4 + ~~ log 4 + %( - % log 2 + log 3) + }i(}21og 2 + }2log 2) = -%log 3 + % +}2 + %(-;~ 3) + ~~(}~ }2} = - % log 3· + % - }~ + % 3 = 1j~ bits
+
+
3-5. Alternative Proof That Function Possesses Maximum. The Shannon-Wiener of information is strongly linked with the logarithmic it is desirable to some time investigating some of the basic mathematical properties of logarithmic function. Such mathematicalpresentations may seem distant from an immediate engineering application;· however, they are of prime significance to those who are interested in basic research in the field..
BASIC CONCEPTS OF INFORMATION TH.EORY
First we shall prove a lemma on the convexity of the function. Then the lemma will be employed in giving an alternative proof for property 3 of the previous section, Lemma 1. The logarithmic function is a convex function. The reader will recall that a function of the real variable y = is said to be convex upward in a real interval if for any Xl and X2 in that interval one has i
"Bnt?Ql" l 1". t'1 'lnt"\ 11 O"
72[f(XI) +f(X2)] ::; f
(Xl t X2)
Geometrically this relation can be simply interpreted by saying that the chord connecting points 1 and 2 lies below the curve. An equivalent definition can be given for a curve that is convex upward in an interval. That is, af(xl)
+ (1 --
a)f(x2)
~
f[axl
+ (1 -
a)x2]
0
::s; a
~
1
(3-38)
The geometrical interpretation of Eq. (3-38) is that in the interval under consideration the chord lies everywhere below the curve (see Fig. 3-3a). y
y=rnx
......---:;IC---....L-----~x
(a) FIG. 3-3. (a) An upward convex function. convex.
(b) Logarithmic function is upward
A necessary and sufficient condition for y = f(x) to be convex on the real axis is that
for every point of the real axis, provided that the second derivative exists. This requirement is satisfied for the function y = In X d 2y 1
In fact,
(3-40) (3-41)
dx 2 = - x 2
2
d y2
dx
< 0 -
for 0
.·K'~•..lI1"1!o1l11T"T
'tnlllnU71THJ'
scheme:
The sum of the elements of this matrix is unity; that is, the probability scheme thus described in not only finite but also complete. Therefore an entropy may be directly associated with such a situation. n
= - ~ p{k,j} log p{k,j}
H(Xly')
~ P{Yi}
J
P{Yi}
k=1
(3-86) Now one may take the average of this conditional entropy for all admissible values of Yh in order to obtain a measure of average conditional of the system. m
H(XIY) == H(X\Yi) == ~ p{Yi}[H(X\Yi)] isl m
n
i=1
k=1
L p{Yj} L p{xklYil log p{xklYil .
= m
Y) = -
(3-87)
n
L L p{Yj}p{xkIYi} log p{xkIYj}
(3-88)
i=1 k=1
Similarly, one can evaluate the average conditional entropy H(Y\X): n
H(YIX) == -
L
m
~ P{Xk}P{Yilxk} log p{Yilxk}
(3-89)
k=1 i=1
The two conditional entropies (the word "average" will be omitted for briefness) can be written as m
Y) ==
n
L L P{Xk,Yi} log p{xkIYi}
i=1 k=1
n
H(Y\X) == -
m
L L p{xk,Yil log p{Yilxk}
(3-91)
k=l i=1
The conditional entropies along with marginals and the joint entropy compose the five principal entropies pertaining to a joint distribution. All logarithms are taken to the base 2 in order to obtain units in binary
DISCRETE SCHEMES WITHOUT MEMORY
Note that all are essentially as are sums of numbers. The of the different entropies will be discussed section. in the Deterrnine five entropies pertaining to the joint probability matrix of Example 2-.30.
Solution 6 6
H(X,Y)
= -
LLP
ij
log
%6
-log ~~6 =2(1
=
+ log 3)
1 1
6
HeX) = H(Y) = -
LPi log ~'6 =
- log
>'6
= 1
+ log 3
1
HeXI Y) = H(YIX) = -
6
6
1
1
L LP
ij
log
>'6
= 1
+ log 3
Sketch Communication Network. In this section, we wish to an informal sketch of a model for a communication network. In contrast to the material of the previous sections, the content of this in a strict mathematical frame, The words section is not source, transducer, transmitter, and receiver are used in their common engineering sense. Later on, we shall assign a strict mathematical description to some of these words, but for the present the reader is cautioned against any identification of these terms with similar terms defined in the professional literature. In the of physical systems from a systems engineering point of view, we generally focus our attention on a number of points of entry to the system. For example, in ordinary electric networks, we may be interested in the study of voltage-current relationships at the same port of entry in the network (Fig. 3-8a). This is generally known as a one-port system. When the voltage-current relationships between two ports of entries are of interest, the situation is that of a two-port system. In a two-port system, a physical driving force is applied to one port and its effect observed at a second port. The second port may be connected to a "receiver" or "load" (Fig. 3-8b). Such a system is usually known as a or a loaded transducer. More generally, in many physical problems we may be interested in the study of an n-port network (Fig. 3-8c). From linear network theory, we know that a complete study of n-port systems requires a knowledge of transmission functions between different ports. For example, if we concentrate on different impedances of a network, the following matrices are considered for a general study of a one-port, two-port, and n-port, respectively.
BASIC CONCEPTS OF INFORMATION THEORY
Z l1
[Z1t] [
·
~~1. ~~2. Znl
Zn2
(The impedances are used in the ordinary circuit sense, the transfer impedance between the kth and the }th port.) of An equivalent interpretation can be made for the systems. In fact, the systems point of view does not rely on the deter2
Load
(a)
(b)
(c) FIG. 3-8. (a) A one-port network. (b) A two-port analog of a channel connecting a source and a receiver. (c) An n-port analog of a comrnunication system consisting of several sources, channels, and sinks.
ministic or probabilistic description of the performance. It is based on the ports of application of stimuli and observation of responses. For instance, consider a source of communication with a given alphabet. The source is linked to the receiver via a channel. The system may be described by a joint probability matrix, that is, by giving the probability of the joint occurrence of two symbols, one at the input and the other at the output. The joint probability matrix may be designated by
p{Xl,Yn}] P{X2,Yn}
......
(3-93)
P{Xm,Yn} But in a product space of the two random variables X and Y there are
98
DISCRETE SCHEMES WITHOUT MEMORY
five basic probability schemes [P{X,Y}] [P{X}] [P{ Y}] [P{X\Y}] [P{ Y\X}]
interest.
These are
joint probability matrix marginal probability matrix of marginal probability matrix of Y conditional probability conditional probability matrix
(3-95)
Thus we are naturally led to five distinct functions in the study a communication modeL This idea can be generalized to n-port communication systems. The problem is similar to the study of an n-dimensional discrete random variable or product space. In each product probability space there are a finite humber of basic probability schemes (marginals and conditionals different orders). With each of these schemes, we may associate an entropy and directly interpret its physical significance. A source of information is in a way similar to the driving source in a circuit; the receiver is similar to the load, and the channel acts as the network connecting the load to the source. The following interpretations of the different entropies for a two-port communication system seem pertinent. H(X)
H(Y) H(X,Y)
H(YIX)
H(X\Y)
Average information per character at the source, or the entropy of the source. Average information per character at the destination, 'or the entropy at the receiver. Average information per pairs of transmitted and received characters, or the average uncertainty of the communication system as a whole. A specific character Xi being transmitted; one of the permissible Yi may be received with a given probability. The entropy associated with this probability scheme when Xi covers sets of all transmitted symbols, that is, H(Ylxi), is the conditional entropy H(Y\X), a measure of information about the receiving port, where it is known that X is transmitted. A specific character Yi being received; this may be a result transmission of one of the Xi with a given probability. The entropy associated with this probability scheme when Yi covers all the received symbols, that is, H(XIYi), is the entropy H(X\ Y) or equivocation, a measure of information about the source, where it is known that Y is received.
H(X) and H(Y) give indications of the probabilistic nature of the transmissionand reception ports, respectively. H(Y\X) givesan indica-
CONCEPTS
munication problems in the matrix is not It is customary to specify the noise characteristics of a cnanner source alphabet From these we can " '_.,' the the l.)rlJuC:tJUH.ILY matrix is
which can be written as YIX}] = this form we assume that the marzmai prooaounv matrix is convenience
is
"ll"lrT1I"'IT'TCI,"ln
'lr'lIl'T'1I't'I-f-+I"I,_
a
also be a row matrix naC~'N'lfllo1"'lnn' the orooeoinues ot the output alphabets. This section offers for discussion two nl.l·r1".1~"1l1'111.l-rI~T tionohannels: Discrete noise-free channel 2. Discrete channel with mdependent HIUI,II,-',HIII,UI.IIJ Discrete Channel. every a of the The the channel probability matrix, is of the oiaconer
o
o
P{X2,Y2}
o
0
100
DISCRETE .SCHEMES WITHOUT MEMORY
o ... [P{XI
1
] = [PI YIX}] =
o ."" For a noise-free channel the entropies are n
Y) =
= -
L P{Xi,Yi}
i=l
H(YIX)
= H(XIY) = 0
(3-101)
The interpretation of these formulas for a communication system is rather clear. To each transmitted symbol in a noise-free channel there corresponds one, and only one, received symbol. The average uncertainty at the receiving end is exactly the same as at the sending end. The individual conditional entropies are all equal to zero, a fact that reiterates a nonambiguous or noise-free transmission. Discrete Channel with Independent Input-Output. In a similar fashion, one can visualize a channel in which there is no correlation between input and symbols. That is, an input letter Xi can be received as any one of the symbols Yj of the receiving alphabet with equal probability. As will be shown, such a system is a degenerate one as it does not transmit any information. The joint probability matrix has n identical columns.
""y X[P [P{X,Y}] = :2.
PI
pm
pm
P2
P]
P2
1 Pi = -
n
(3-102)
pm
The input and output symbol probabilities are statistically independent of each other, that is, (3-103)
This can be shown directly by calculation:
From this one concludes that (3-105) (3-106)
BASIC CONCEP'J,'SOF INFORMATION
The different
,...,..,.+"""".....,.,...lI..n.C'I
can be
nr.."'I''''''''If'\.·'lI+r"N
.rll·.....n .... +B'Y ... •
= -n
H(X,
n
m
HeXI Y) = -
1 npi log tip, = i=l
H(YIX) = -
1 n
npi log - = log n =
The interpretation of the above formula is that a channel with independent input and output ports conveys no information whatsoever. To mention a network analogy, this channel seems to have the internal "loss," like a resistive network, in contrast to the noise-free channel which resembles a "lossless' network. among this 3.. .11.. Some Basic section we should like first to investigate some of the fundamental mathematical relations that exist among different entropies in a out their t.:lJ.~.I..ll..a..l.J'V'-'lJLILV'l.l two-port communication system and then communication theories. OUf starting is the evident fact that the different probabilities in a two-dimensional distribution are interrelated, plus the fact that the chosen logarithmic weighting function is a convex function on the positive real axis. We begin with the basic relationship that exists among the joint, marginal, and conditional probabilities, that is, 1-n1l"Arl1111fl1"
P{Xk,Yj} = p{xkIYj} · p{Yj} = P{YjIXk} . p{xkl log P{Xk,Yj} = log p{xkIYj} + log p{Yj} = log P{YjIXk} + log P{Xk}
The direct substitution of these relations in the entropies leads to the basic identities: H(X,Y) = H(XIY) H(X,Y) = H(YIX)
(3-112) (3-113)
of the
+
+ H(X)
(3-115)
Next we should like to establish a fundamental inequality first shown by Shannon, namely, (3-116) H(X) ~ H(XI Y)
ME~MO~RY
DISCRETE SCHEMES WITHOUT ATnnlnlV
P{Xk,YiJ log k=l
once
P{XkJ
{ I J P Xk Yi
But "the right side of this inequality is identically zero as (P{Yi} - P{Yi})
e=
~H(XI
and snmlarlv one shows that ~
The equality signs hold
and only if, X and Yare statistically independent. It is only in 'such a case that our key inequality beco~es (at Y1 that is,
'2 '3
p{xkl == 1 p{XklYil for all permissible values of k and is the case of betweenX and Y. :£xample 3-4. A transmitter has an alphabet consisting of five .letters . {Xl,X2;X~,X4,Xi} and the receiver has an alphabet .of four letters {Yl,Y2,Ya,Y4}. The joint probabilities for the communication are See Fig. E3-4.
%5
FIG.E3-4
VI
Xl X2 X3 X4 X5
Y2
Y3
Y4
[0.25 0 0 0] .0.10 0.30 0 O.
0 0 0
0.05 0 0
0.10 0 . 0.050.10 0
0.05
Determine the, different entropies for this channel.
BASIC CONCEPTS, OF 'INFORMATION THEORY
Solution 11 (~l)
== 0.25
+ 0.30 =0.40 == 0.05 + 0.10 == 0.15 = 0.05 + 0.10 = 0.15
!1(X2) == 0.10 11(xa)
f 1 (X4) !1 (X6) == 0.05
== == == ==
!2(YI) !2(Y2) !2(Ya) f2(Y4)
0.25 0.30 0.10 0.10
+ 0.10
== 0.35
+ 0.05 == 0.35 + 0.05 + 0.05
== 0.20
. '" ) - !(Xl,Yl)_ 0.25 _ ~ f( Xl Yl - !2(Yl) - 0.35 - 7 !(x2IY2) == 0.30 ==
0.35
!(Xa!Ya)
~
7
= ~:~g = ~
!(x4IY4) == 0.10 == 1
0.10
f(X2IYl) == 0.1
~
0.35 7 0.05 1 f(X3IY2) == 0.35 == "7 ==
!4
!(x5IYa) '== 0.05 ==
!4
f(X4IYa)
= 0.05 0.20
0.20
H(X,Y) == -
L
L!(X,y) logf(x,y)
z 11
== -0.25 log 0.25 - 0.10 log 0.10 - 0.30 log 0.30 - 0.05 log 0.05 - 0.10 log 0.10 - 0.05 log 0.05 - 0.10 log 0.10 - 0.05 log 0.05
== 2.665 H(X) == L!(x,y) logfl(x)
L :x;
11
== -0.25 log 0.25 - 0.10 log 0.40 - 0.30 log 0.40 - 0.05 log 0.15 - 0.10 log 0.15 - 0.05 log 0.15 - 0.10 log 0.15 - 0.05 log 0.05
== 2.066 H(Y) == LLf(x,y) logf2(Y) :x;
11
:;;: -0.25 log 0.35 - 0.10 log 0.35 - 0.30 log 0.35 - 0.05 log 0.35 - 0.10 log 0.20 - 0.05 log 0.20 - 0.05 log 0.20 - 0.10 log 0.10 == 1.856 H(YIX) == - \' \' f(x y) log f(x,y) ~ ~' flex) :x;
11
== -0.10 log - 0.10 log == 0.600
>~
- 0.30 log % -
3'3 -
0.05 log
~'3
-
0.05 log ~~ 0.10 log %
H(XIY) == - \' \' f(x Y) logf(x,y) " ~'~' f2(y) :x;
11
== -0.25 log ;f - 0.10 log ;7 - 0.30 log ;7 - 0.05 JOg - 0.10 log >~ == 0.809
-
0.05 log >~
-
0.05 log
>t
Y1
DISCRETE SCIIEMES WITHOUT MEMORY
Note that H(X,Y)
2.665
and
HEX, Y) = H(Y)
2.665 = 1.856
< H(X) + H(Y) < 2.066 + 1.856
+ H(XI Y) = H(X) + H(YIX) + 0.809 = 2.066 + 0.600
.I..'1A.~'Q,~lU.Jl.~ of Information. Consider a discrete communication system with given joint probabilities between its and output terminals. Each transmitted symbol Xi while going through the channel has a certain probability P {Yi)Xi} of being received as a particular symbol Yi. In the light of previous developments, one may look for a function relating a measure of mutual information between Xi and Yi. In other words, how many bits of information do we obtain in knowing that Yi corresponds to Xi when we know the over-all probability of Xi happening along with different y? In order to avoid a complex mathematical presentation, we follow a procedure similar to that of Sec. 3-3. We assume a definition for mutual information and justify its agreement with that of the previously adopted definition of the entropy. Finally, we shall investigate some of the properties of the suggested measure of mutual information. A measure for the mutual information contameu in (xiIYi) can be given as
(3-122) This expression gives a reasonable measure of mutual information conveyed a pair of symbols (Xi,Yi). For a moment, we concentrate on the received symbol Yi. Suppose that an observer is stationed at the receiver end at the position of the signal Yi. His a priori knowledge that a symbol Xi is being transmitted is the marginal probability p {Xi}, that is, the sum of the probabilities of Xi being transmitted and received as any one of the possible Yi. The a posteriori knowledge of our observer is based on the conditional probability of Xi being transmitted, given that a particular Yi is received, that is, p {xilYi[. Therefore, loosely speaking, for this observer the gain of information is the logarithm of the ratio of his final and initial ignorance or uncertainties. However, the mathemaucauv inclined reader may wish to forgo such justification use (3-122) as a definition. The following elementary properties can be derived for the mutual information function: 1. Continuity. I(xi;Yi) is a continuous function of p{xiIYil. 2. Sym1netry or reciprocity. The information conveyed by Yi about Xi is the same as the information conveyed by Xi about Yh that is, I(xi;Yj) = I(Yi;xi)
Obviously, Eq. (3-122) is symmetric with respect to
(3-123) Xi
and Yj.
BASIC CONCEP'l'S OF INFORMATION THEORY
3. Mutual and self-information. The function may the self-information of a symbol Xi. That is, if an observer is the position of the symbol Xi his a priori knowledge of the situation is that Xi will be transmitted with the probability p{xil and his a Irni,\urltPn(rtP is the certainty that Xi has been . thus C11"'O''iI"''In''ll"nrl
nnc::!1"o't'1In't'1I
=
I(xi)
Obviously,
I(xi;xi)
1
= log -{-I P
I(xi;Yj) ~ I(xi;xi) I(xi;Yj) ~ I(Yj;Yj)
= =
Xi
I(xi) I(Yj)
(3-125) (3-126)
Aninteresting interpretation of the concept of mutual information can be given by obtaining the average of the mutual information per pairs, that is, I(X;Y)
=
I(xi;Yj)
=
LL p{xi,Yj}I(xi;Yj) i i
I(X;Y) =
P{Xi,Yi} log
p;{~;IJ
(3-128)
It could be ascertained that this definition provides a proper measure for the mutual information of all the pairs of symbols. On the other hand, the definition ties in with our previously defined basic entropy formulas. Indeed, by direct application of the defining equations one can show that I(X;Y) I(X;Y) I(X;Y)
= = =
+
H(X) - H(X\Y) H(Y) - H(Y\X)
(3-131)
The entropy corresponding to the mutual information, that is, indicates a measure of the information transmitted through the channel. For this reason it is referred to as transferred information or traneinformation of t~e channeL Note that, f)"ased 00 the fu~d~~~i;tarequatTon (3-116);the ~ight~"~fd~~of Eq. (3-130) is a nonnegative number. Hence, the average mutual information is also nonnegative, while the individual mutual-information quantities may become negative for some pairs. For a noise-free cnannei, I(X;Y) I(X;Y)
= =
H(X) = H(Y) H(X,Y)
(3-133)
For a channel where the output and the input symbols are independent, l(X;Y)
= H(X) =
- H(X\Y) lI(X) - H(X) = 0
no information is transmitted through the channel.
(3-134)
106
DISCRETE SCHEMES WITHOUT MEMORY
Exampl~ 3... (S.. The joint probability matrix of a channel with output is' given below:
Find the different entropies and the mutual information. Solution. The marginal probabilities are P{xd == P{X2} == P{yd ==.P{Y2} ==
,'2 ,~
The entropies are H(X) == H(Y) == 1 H(X,Y) == 2 I(X;Y) == H(X) H(Y) - H(X,Y) == 0
+
The transinformation is zero, as the input and the output symbols are independent. In other words, there is no dependence or correlation between the symbols at the output" and the input of the channeL
3-131! Set-theory Interpretation of Shannon's Fundamental mequauties. A set-theory interpretation of Shannon's fundamental inequalities
1(%;y) FIG. 3-9. A set-theory presentation of a simple communication system. (X, Y) represents the joint operation of the source and the channel.
3-10. A set-theory presentation of different entropies associated with a simple communication model.
FIG.
along with the material discussed previously may be illuminating. Consider the variables A and B as sets. We may symbolically write m(A) and m(B) ,as some kind of measure (say the area) associated with sets X and Y. The entropies of discrete schemes are essentially nonnegative, and they possess the property of Eqs. (3-114) and (3-119). Thus one may observe that, in a sense, the law of ·"additivity" of the entropies holds for disjoint sets. Thus, the following symbolism may be useful in visualizing the interrelationships (see Figs. 3-9 and 3-10)~ m(A) m(B) m(A U B) m(AB')
H(X) H(Y) H(X,Y) H(X\ Y)
(3-1~5)
(3~136)
(3-137) (3-138)
BASIC CONCEPTS OF INFORMATION THEORY
r, V B)
~ meA)
m(B)
~m(A)
~
meA V B) = m(AB') m(BA') + meA
r.
When the channel is noise-free, the two sets become "coincident" as follows: meA) = m(B) meA V B) = meA) = m(B) m(AB') = 0 m(BA') = 0 meA n B) = meA) = m(B) = meA V B)
Y) H(XIY)
= =0
I(X;Y)
= H(X)
=0
When the channel is such that and pendent, the two sets A and B are considered mutuanv meA V B) = meA) m(AB') = meA) meA n B) == 0
m(B)
H(X,Y) H(XIY)
= =
+
=0
This procedure may be extended to the case of channels with several ports. For example, for three random variables (X,Y,Z) one may write
+
H(X,Y,Z) ~ H(X) H(Y) H(ZIX,Y) ~ H(ZlY)
+ H(Z)
(3-153) (3-154)
see See Fig. 3-11. For a formal proof of Eqs. (3-153) and Khinchin. Similarly, one may give formal proof for thefollowing inter-
H(X,Y,Z)
FIG. 3-11'. A set-theory presentation of the entropies associated with (X, Y,Z) space.
108
DISCRETE SCH"EMES WITHOUrr MEMORY
y)+ I(Y,Z;X) = I(Y;X)
+
The set diagrams for these relations are given in Figs. I(X;Y) =I( Y;X)
and 3-13. I(X;ZIY)
11(Y)
FIG. 3-12. A set-theory presentation of the entropies associated with (X, Y,Z) space.
FIG. 3-13. A set-theory presentation of the transinformations I (X; Y) and Y).
In conclusion, it seems worthwhile to present a simple set-theory rule for deriving relationships between different entropy functions of discrete schemes: Draw a set for each random variable of the multidimensional random variable (X 1,X2, • • • ,Xn ) . When two variables X k and are independent, their representative sets will be mutually exclusive. Two random variables Xi and X h describing a noise-free channel (with a diagonal probability matrix) will have overlapping set representation. The following symbolic correspondences are suggested: AkA~.
A k V Ai A k ( \ Ai AkA i = 0
H(Xk\Xj ) H(Xk,Xj ) I(Xk;Xj ) I(Xk;Xj ) = 0 I(X 1 ;X 2 ; • • • ;X n ) R(B) ~ H(C)
(3-157) (3-158) (3-159) (3-160) (3-161) (3-162) (3-163)
3-14. Redundancy, Efficiency, and Channel Capacity. In Sec. 3-9 we have presented an interpretation of different entropies in a communication system. It was also pointed out that the transinformation I(X; Y) indicates a measure of the average information per symbol transmitted in the system. Thesignificance of this statement is made clear by referring
BASIC CONCEPTS OF INFORMATION THEORY
to In this section, it is intended to introduce a ~U.i.lJUlIVJl.v measure for efficiency of transmission of information by a comparison between the actual rate and the upper bound of the rate of transmission of information for a given channel. In this respect, Shannon has introduced the significant concept of channel capacity. Shannon, in a discrete communication system the channel caoocuu maximum of transinformation. C
= max I(X;Y) = max [H(X) - H(XIY)]
of pr()DaJD1l1tH~S amnaoet, that is, memoryless sources. of application and computation of channel capacity, a somewhat analogous concept from linear network theory may be worth mentioning. Consider a linear, resistive, passive, two-port network connected to a linear resistor R at its terminals. The power dissipated in R under a given regime depends on the network and the load. The maximum power dissipated in the load occurs when there is a matching between the load and the network, when the resistance of the network seen from the output terminals is identical with R. This situation can be further analyzed by that, for a given network, the power transferred to the load depends on the value of the load; the maximum power transfer occurs only when the load and the source are properly matched through a transducer.. In a discrete communication channel, with prespecified noise characteristics, i.e., with a given transition probability matrix, the rate of information transmission depends on the source that drives the channeL Note in the network analogy, one could specify the load and determine the class of transducers that would match the given load to a specified class of sources. The maximum (or the upper bound) of the rate of information transmission corresponds to a proper matching of the source and the channel. This ideal characterization of the source depends in turn on the probability transition characteristics of the given channel. Discrete Noiseless Channels. The following is an example of the evaluationof the channel capacity of the simplest type of sources. Let X = {Xi} be the alphabet of a source containing n symbols. Since the transition probability matrix is of the diagonal type, we according to Eq. (3-132), AnC::lLyr"Ir'TlI1ntrlf'
n
C= max I(X;Y) = max [H(X)] = max [ -
L p{xd log p{xd ]
(3-165)
i=l
According to Eq. (3-14), the maximum of H(X) occurs when all symbols are equiprobable ; thus the channel capacity is
C
= log
n
bits per symbol
(3-166)
DISCRETE SCHEMES
a cnannei, as cnannet, can be orn:l1"l:T~ expressed instead of bits per For 1l"Or"1I11l'r.r~n for the transmission hnlQ have common duration of t seconus. second lo·n1". I "l:T
Q"{Tl'TI
bits per the have
}~or
noise-free communication
1
=-0=t t
n
described
bits per
The difference between the actual rate transmission of information I (X ;Y) and its possible value is defined as of the communication The ratio of aosorute redunuancv as the ~!!!:flffQ~~~~~1&?1f1&!'l1&y. afore-mentioned "'lV"",.''£II'il_
Absolute reuunuancv for noise-free channel = 0 Relative reounuancv for noise-free channel
efficiency of the above ~nlCH3nc:y
= ----::----
can be defined in an obvious fashion as
of noise-free channel = - - - -
When the time for the transmission of symbols is not necessaruv a similar procedure may be applied. Let ti be the time associated with the symbol Xi; then the average transinformation of a noise-free channel per unit time is
set is known as the rate of transmission of mtormation. For a of if, (i = 1, 2, . . . ,n), one can evaluate the Xi to the maximum The computation will rate of transmission of information per second not be undertaken here. JL'\JGNVZoJLJL1LP'\
BASIC CONCEPTS OF INFORMATION THEORY
Discrete Noisy Channel. The channel capacity is the maximum average mutual information when the noise characteristic p{YilxiJ of the channel is prespecified. n
C
=
max(l: i=l
where the maximization is with respect to PI {Xi} • Note that the marginal probabilities P2 {Yi} are related to the independent variables PI {Xi} through the familiar relation n
P2{Yi} =
L Pl{Xi}P{YiI Xi }
(3-174)
i=l
Furthermore, the variables are, of course, restricted by the following constraints : i = 1,2, . . . ,n (3-175) Pl{xd ~ 0 n
L Pl{Xi} = 1
i=l
The maximization of (3-173) with respect to the input probabilities does not necessarily lead to a set of admissible source-symbol probabilities. From the physical point of view, the problem of channel capacity isa rather complex one. The communication channels are not generally of the aforesaid simplest types. When there is an interdependence between successive channel symbols, the statistical identification of the source and the maximization problem are more cumbersome. In these more general cases, the system will exhibit a stochastic nature. Therefore, more elaborate techniques need to be introduced for deriving the channel capacity of such systems. Because of this complexity, Shannon's fundamental channel-capacity theorems require adequate preliminary preparation. These will be considered in a later chapter. 3...16. Capacity of Channels with Symmetric Noise Structures. The computation of the channel capacity in general is a tedious mathematical problem, although its formulation is straightforward. The of maximization requires some special mathematical techniques such as the method of Lagrangian multipliers. In the present section we should like to compute the capacity of some special channels with symmetric noise characteristics as considered by Shannon. Consider a channel such that each input letter is transformed into a finite number of output letters with a similar set of probabilities for all the .input letters. In this case the channel characteristic matrix P{Yilxi} contains identical rows and identical columns but not necessarily in the
112
DISCRETE SCHEMES WITHOUT MEMORY
same
p,Jv~'J.\JA.'vJ.J\..
See
where we have
For such channels the capacity can be computed without any The key to the simplification is the fact that the conditional entropy H(YIX) is independent of the probability distribution at the Indeed, for a letter Xi with marginal probability a; we may write rht-lhn ......Ii-'\I[7'
P{Yilxi} = Olii P {Xi,Yi} = aiOlii
(3-176) (3-177)
The conditional entropy pertinent to the letter Xi will be m
H(Ylxi) = -
L p{Yilxi} log p{Yilxi} i=l
(3-178)
Now let H(Ylxi) = const = h
for i = 1, 2, . . . ,n
Thus, Eq. (3-89) yields H(YIX)
= (at
+ a2 + · . · + an)h = h
That is, the average conditional entropy is a constant number independent of the probabilities of the letters at the input of the channel. Transmitter
10 20
Receiver
Transmitter
1
1
2
2
Receiver
3
o--....-...,;.-------~j
m
no FIG.
n
om
3-14. A channel with a particularly symmetric structure.
instead of maximizing the expression H(Y) we may simply maximize the expression H(Y) - h, or H(Y), as h is a constant. But the maximum of H(Y) occurs when all the received letters have the same probabilities, that is, ...... JLJL'VJLV.L'Vf.Lv.
C = log m - h
(3-180)
We may wish to investigate further what restriction Eq. (3-180) imposes on the channel. For this, reference can be made to the channel probability matrix and the conditional probability matrix P {XI Y} of
BASIC CONCEPTS OF INFORMATION THEORY
Let p {Xi\Yi} =
the system.
{3ij
and note that
P{Xi }aij = P{Yi}(3ii It can be shown that when I(X;Y) = C, the {3ii matrix will also identical rows, that is, the tree of all probabilities at the output of the channel assumes similar symmetry for all sets of the received letters (Fig. 3-14). Furthermore it can be shown that the of the transmitted letters p {x i} will have to be equal for i = Conversely, if the situation of Fig. 3-14 prevails, then C
= max = max
[H(X) - H(X\Y)] [H(X)] - h' = log n - h'
where h' is the conditional entropy H(X\Y). Example 3-6.
Find the capacity of the channel illustrated in Fig. E3-6.
FIG.
Solution.
E3-6
Applying Eq. (3-180), one finds C = log 4 - ,~ log 3 C = % - log 3 bits
Example 3-7.
>~
log 6
A binary channel has the following noise characteristic:
o
1
o [;~ >~J 1
>'3 %
(a) If the input symbols are transmitted with respective probabilities of j~ and ~t,
find
H(X), H(Y), H(XIY), H(YIX), I(X;Y) (b) Find the channel capacity and the corresponding input probabilities.
Solution H(X) == 0~'81 H(XIY) == 0.75 I(X;Y) == 0.06
(a)
(b)
C == 1
H(Y) :z 0.98 H(YIX) == 0.92
+ % log % +>~ log >~ p{O}
=
p{l} ==
=%~~
log 3 == 0.08
DISCRETE SCHEMES WITHOUT MEMORY
BSC and BEe. The simplest type of source to }. In this section we assume that the sidered is binary such a source is transmitted via a binary symmetric (BSC) or a binary erasure (BEe) channel. Figure 3-15 shows a BSC and Fig, 3-16 a BEC.
3-15. i\. binary symmetric channel (BSC).
FIG.
FIG. 3-16. A binary erasure channel (BEC).
The rate of transmission of information and the capacity will be derived for each case. BSC. Let prO} = a P{l} = 1 - a P{OIO} = P{111} = p P{Oll} = P{110} = q Then H(X) H(YIX) I(X;Y) C
= H(a, 1 - a) = -a log a - (1 - a) log (1 - a)
= -- (p log p + q log q) = H(Y) + p log p + q log q = 1
+ p log p + q log q
(3-181)
BEC. The channel has two input {O,l} and three output symbols {O,y,l}. The letter y indicates the fact that the output is erased and no deterministic decision can be made as to whether the transmitted letter was 0 or 1. Let P{O} = a P{l} = 1 - a P{OIO} = P{111} = p P{yIO} = P{yll} = q H(X) H(XIY) I(X;Y) C
= II(a, 1 - a) = (1 - p)H(X)
= pH(X) =
p
(3-182)
Equations (3-181) and (3-182), specifying the capacity of BSC and BEC, respectively, will be referred to frequently in the subsequent discussion. Example 3-8. Consider the BSC shown in Fig. E3-8. Assume P {O} == a and that the successive symbols are transmitted independently. If the channel transmits, all possible binary words U of length 2 which are received as binary words V, derive
BASIC CONCE,PTS OF INFORMATION THEORY
The input entropy H(U). (b) The equivocation entropy (c) The capacity of the new channel (called the second-order extension of channel).
FIG.
E3-8
(d) Generalize the results for the case of transmitting words each n binary digits long. Solution. Let U be a random variable encompassing all the binary words 00, 01, 10, and 11 at the input. Let Xl and X 2 be random variables referring to symbols in the first and the second position of each word, respectively. Similarly, let V, Y 1, and Y 2 correspond to the output. Symbolically we may write
U = Xl, X 2 V = Y 1 , Y2
Because of lack of memory, the probability distributions are given by P{U}
=
P{XdP{X 2 }
P{V'} == P{YdP{Y 2 } PIVIU} = P{YIIXdP{Y 2 !X2 } PIUIV} = P{XIIYdI~{X2IY2} P{U,V} = P{U}P{ } = P{X1 , Y d P { X2,Y2}
(a) The source entropyH(U) can be thought of as the entropy associated with the two independent random variables Xl and X'2. Thus H(U)
=
H(X I ,X 2)
=
H(X 1 )
+ H(X
2)
=
-2[01. log
Of.
+ (1
- a) log (1 - a)]
since II(X 1 ) = H(X 2 ) . (b)
H(U, V) = H(Xl , Y I ) H(VIU) = H(YIIX I) H(UIV) = H(XIIY I)
+ H(}{2, Y2)
+ H(Y 2IX2) + H(X 2IY2)
(c) The transinforrnation becomes I(U;V)
=
H(U) -
The extended channel capacity is twice the capacity of the original channel. (d) Similarly, one can show that the capacity of the nth-order extension of the channel equals nc, where c is the capacity of the original channel. Note that this statement is independent of the structure of the channel; that is, it holds for any memoryless channel. ,="Q,I.PQ,,,,.u.llty of Binary channels are of considerable interest in the transmission and storage of information. The vast field of digital computers offers many examples of such information
DISCRETE. SCHEMES WITHOUT MEMORY
channels. The problem undertaken in this section is the evaluation the maximum rate of transmission of information of binary channels. The source transmits independently two o 0 bols, say 0 and 1, with respective pronaourues PI and P2. The channel characteristic is known as (see Fig. 3-17) P II [ P21 FIG.
PI'-]
P22
3-17. BO.
In order to evaluate the capacity of such a channel, when the entropy curve is available a simple geometric procedure can be devised (see Fig. 3-18). The points A l and A 2 on the segment OM are selected so that MAl
=
Pl1
OA 2 = P22
The ordinates of the entropy curve at A 1 and A 2 are BlA I = H(Pll)
B 2 A 2 = H(P22)
Now, for any given channel output probabilities such as OA = p and
~-----P22- - - - ~ ~
1cE-------p--FIG. 3-18. A geometric determination of different entropies, transinformation, and channel capacity of aBO.
MA = 1 - p, where p is the probability of receiving 1, the transinformation can be geometrically identified. In fact, I(X;Y) = H(Y) - H(YIX) I(X;Y) = H(p) - P1H(Pll) - P2H(P22) I(X;Y) = BA - FA
Of course, the point A corresponding to the desired mode of operation is not known. A glance at Fig. 3-18 suggests that the largest value of
BASIC CONCEPTS OF INFORMATION THEORY
transinformation is obtained when the at the are represented by point A o p t corresponding to point of the entropy curve at point R o p t is parallel to At R o p t the "iTOl"'t'l/flQ i segment representing the transinformation assumes its value. The corresponding source probabilities can be derived in a direct manner. C. E. Shannon has generalized this procedure to 3 X 3 and more /fl£'\'VY\'Ir'lllnv channels. * His procedure is based on the use of a barycentric coordinate system. For complex channels, however, an analytic approach is often more desirable than a geometric procedure. The following method for evaluation of the channel capacity has been suggested by S. Muroga. First one introduces auxiliary variables and Q2 which satisfy the following equations: 'li"nn.n."II,.,... ...",np
PIIQl + PI2Q2 P2IQI + P22Q2
= + (PII log PII + P12 log P12) = + (P21 log P2I + P22 log P22)
(3-183)
The rate of transmission of information I (X; Y) can be written as
+
I(X;Y) = H(Y) - H(YIX) = -(pi log pi P~ log p~) PI(PII log PII PI2 log P12) + P2(P21 log P21 P22
+
+
P22)
where pi and P~ are the probabilities of 0 and at the port, respectively. Next, we introduce Ql and into through Eq. (3-183): I (X; Y)
= - (pi log pi = -
+ P~ log p~) + (pi log
+ P2P21) QI + (PIP12 +
+ p~
+
·U'XU9.9.H:.",9.
+
The maximization of I(X;Y) is now done with respect to pi and the probabilities at the output. In order to do this, we may use the method of Lagrangian multipliers. This method suggests maxirmsmg the function
u=
-
(pi log pi
+
p~ log p~)
+ piQl + P~Q2 + p.(pi + p~)
(3-185)
through a proper selection of the constant number p.. Therefore one requires
au au ap~ =
- (log e +
- (log e + log
+p.=O p~)
+
Q2
+ P. = 0
The simultaneous validity of these equations requires that p.=- Ql
+ (log e + log pi) = -
Q2 + (log e + log
* C. E. Shannon, Geometrische Deutung einiger Ergebnisse bei der Berechnung der Kanalkapasitet, NTZ..Nachr~ech· Z., vol. 10, no. 1; pp. 1-4, 1957.
DISCRETE SCHEMES WITHOUT MEMORY
Pll
for binary andP22.
FIG. 3-19. A chart for determining values of Ql in terms of PH and channels. The corresponding value of Q2 is obtained by an interchange
0.8
I-+~+---+---+--
0.6
I---+/-o--/'r---il------lf--I-----il---t----".I---"-+l---I
0.2
0.2
0.4
0.6
0.8
1.0
Pll FIG.
3-20. Capacity of 3, binary channel in terms of P u and P22.
BASIC CONCEPTS OF INFORMATION THEORY
The channel
".nw...n"......
~,.
is found to be
C = max [I(X;Y)] =
The values of But note that
log
-
pi =
Q2 - log
and Q2 may be obtained from the set of Eqs.
= exp (Qi -
p~
i
C)
1,2
=
(3-188)
Thus
C = log [exp (Ql)
exp (Q2)]
i=2
L exp (Qi) = log (2
C = log
Q1
+2 Q 2)
i=l
A similar result was obtained earlier in a different way by Shannon. Later on Silverman and Chang derived further additional interesting results. The chart of Fig. 3-19 gives the value of Ql and Q2 for a binary channel. The chart of 'Fig. 3-20 gives the corresponding capacities (IRE Trans. on Inform. Theory, vol. IT-4, p. 153, December, 1958). Note that the capacity of a binary channel is greater than zero except when pu
+
P22
= 1
Example 3-9. Find the capacity of the following three binary channels, first directly and then from the graph of Figs. 3-19 and 3-20, in each of the following three cases: (a) (b) (c)
= = »» = pu pu
P22
=
1
P12 = P21 P 12
=
P22
~~
=
P21
= ~~ = ~~
P22
=%
Solution (a)
PH = P22
== 1
P12 := P21
=0
Direct computation yields Ql = Q2 = 0 C = log (2Ql
+2
( 2)
= 1 bit
This channel capacity is achieved when the input symbols are equiprcbable. (b)
PH =
Pl2
= P21 = P22
:=
7~
The noise matrix is singular and leads to Ql
+ Q2 =
-2
= Q2 = -1 C = log (2- 1 + 2- 1) = 0
Ql
Any input probability distribution will lead to zero transinformation as the input and the output are independent. This result can be verified by checking with Figs. 3-19 and 3-20.
120
DISCRETE SCIIEMES WrrHOUT MEMORY
(c)
PH P21
~'2]
%
C
= =
P12
= }'2
~:4:
P22
=%
[QIJ [~'2 log ~'2 + ~'2 log P-2J Q2 = ~~ log ~:4: + J:4: log J:4: = -2 + % log l] [ 1 - J-2 log 3J [ -1.378J [Q Q2 = -3 + J-2 log 3 = -0.622
= log
(2-1.378
+ 2-
0 •622 )
log 1.0345
=
=
0.048 bit
This answer can be verified from the graph of Fig. 3-20.
The generalization of the above method for a channel with an m X m noise matrix is straightforward. In fact, let m
PIIQl
+ . · . + PlmQm
2:
=
Plj log Plj
;'=1
(3-190)
+ PmmQm
m
L pmj log pmj
=
j=l
and assume that the solution to this set exists. mission of information, as before, will become m
m
i=l
i=l
Then the rate of trans-
- L P~ log P~ + L P~Qi The use of the Lagrangian multiplier method will lead to C = Qi - log P~ m
C = log
L
2Qi
i= 1
It is to be kept in mind that the values of C thus obtained may not necessarily correspond to a set of realizable input probabilities
(0 ~
Pi ~
1, LPi = 1) i
In the latter case the calculation of the channel capacity is more complicated. Also, the solution to the set [Eq. (3-190)] may not exist, or the channel matrix may not even be a square matrix. In such cases some modifications of the above method are suggested in the afore-mentioned references. At any rate, although the formulation of the equations leading to the channel capacity is straightforward, computational difficulties exist and the present methods are not completely satisfactory. The capacity of a general binary channel has been computed by R. A. Silverman, S. Chang, and J. Loeb. A straightforward computation leads to C(a,{3)
= -(3H~)_+a aH({3) + log [1 + exp H(aJ'
=:({3~]
BASIC CONCEPTS OF INFORMATION THEORY
where parameter a == Pl1, (3 == binary source. Note that
stands for the
and
P2h
C(a,{3) == C({3,a) == C(l - a, 1 - (3) = C(l
o'n't"'t"n:ln'u'
(3, 1,-
The input probability P{O} leading to the channel capacity is Silverman and Chang as
P
=
P(a,(3) = (3«(3 - a)-l - «(3 - a)-l [ 1
O.37~! ~ e
~
P{O}
!e
1-
+ exp - - - -
rv
O.63
The probability of receiving zeros when the channel capacity is achieved is
Example 3-10. below:
Find the capacity of the channel with the noise matrix as shown
[
~2
~i
o o
o ~i] o 0
1 0
~~
0
1
0
~~
~2
Solution
C = log (2- 2 Example 3-11. matrix
Ql = Q4 = -2 Q2 = Qa = 0 20 2- 2) = log 5 - 1 = 1.321 bits
+2 + 0
+
Determine the capacity of a ternary channel with the stochastic
[P] =
[~~
1
o
~
~~]
a
I-a
a
O:::;a:::;1
Solution.
Since the channel matrix is a square matrix, Eqs. (3-190) yield [PHQ] [Q]
1 2a 1
Ql Q2
[H] [P]-l[H] I
I
-~
1
Ol
a-I 1 - 2a
Qa
where
= = -
1
1
~
h
1 h
h = -a log a - (1 - a) log (1 - a)
Q C
= log
=
(2Q1
l- ~~~] 1 -1
Ol
+ 2Q2 + 2
Q3 )
= log
(
1
- h) + exp a-.I-a
DISCRETE ·SCHEMES WITHOUT MEMORY
According to the method applied by Muroga, Silverman, and putation leads to the following values for the probability of the ith achieving channel capacity. m
L p;l2
Pi =2-0
Qk
1 0.3 > 0.2 > 0.1 0.4 2- 2 2- 2 2- 3 2- 4
2:: 2:: 2:: 2::
a3
= 0.7 n.; = 2 n2 = 2 na = 3 n4 = 4 Xl
= 0111 = 10111 = 11101
X2
Xa X4
00 01 101 1110
DISCRETE SCHEMES WITHOUT MEMORY
rmnortant theorem due to C. E. Shannon which states: Theorem. Let S be a discrete source without memory with a munieation and a noiseless channel with C bits per message. It is possible to encode the output of S so that, if the encoded messages are transmitted through the channel, the rate of transmission information approaches C per symbol as closely as desired. Proof. We have already seen that for a source with messages the length of each encoded message may be constrained the following inequalities: _ log P{Xi} < n: log D - ~
< _ log P{xil + 1
i = 1,2, . . . ,N
log D
Furthermore it can be seen that the average length
satisfies the relation
H(X l ) < log D -
Now suppose that we consider the source ® X«, that is, a source which transmits independently the IOlJcOWlnQ' messages:
[X 2] =
X IX l
XIX2
XIXa
~2~1.
X2X2
X2Xa
• •• • ••
XIXN] X2XN
• ••
XNXN
· . ..
[
XNXI
.
it is assumed that the successive messages are responding probability matrix is
P {Xl }P {Xl } [
P{XI}P{X2}
~!X.2}:.{~1~ . ~!X.2}:.{~2~
. . . P{XI}P{XN}] ..• P{X2}P{XN}
P{XN}P{XI}
. . . P{XN}P{XN}
P{XN}P{X2}
the
(4-43)
This source will be referred to as a second-order extension of the original source. N ow if this message ensemble is encoded, we expect that the average length £2 of the messages of the new source will satisfy the relation H(X 2) < L < H(X 2) + 1 log D - 2 log D As the successive messages are independent, one finds
H(X 2 )
= H(X l )
+ H(X
I)
= 2H(X 1)
(4-45)
Thus 2H(X 1) < log D , -
L < 2H(X 1) 2
log D
+1
(4-46)
ELEMENTS OF ENCODING
In a similar fashion we consider the Mth-order extension of source. If the message ensemble of encoded, we conclude that
or Finally, H(X 1) < L M log D - M
< H(X l ) +
(4-48)
log D
When Ill" is made infinitely large,we obtain
lim M-+QO
This completes the proof of the so-called first fundamental coding theorem. It should be kept in mind while we asymptotically the above limit, the does not a m0 increasing improvement. That is, it is to have a situation where
1nO"tOIJllC2LllY
1
L M_ >
4-90 Huffman's Code. Huffman has suggested a simple method for constructing separable codes with minimum redunThe of the dancy for a set of discrete messages latter term will be described shortly. Let [X] be the message ensemble, [P] the corresponding probability matrix, [D] the encoding alphabet, and L(Xk) the length of the encoded message Xk. Then N
L = E[L(Xk)] =
L P{xk}L(Xk)
(4-50)
k=l
A minimum redundancy or an optimum code is one that leads to the lowest possible value of L for a given D. This definition is accepted, having in mind the irreducibility requirements. That is, distinct meswords with the sages must be encoded in property. To comply with these requirements, Huffman derives the following results: 1. For an optimum encoding, the longer code word should correspond to a message with lower probability; thus if for convenience the messages are numbered in order of nonincreasing probability, then
P{Xl} ~ P{X2} ~ P{Xa} ~ . . . L(Xl) ~ I.J(x2) :::; L(xa) :::; . . .
~ ~
P{XN} L(XN)
(4-51) (4-52)
DISCRETE SCHEMES WITHOUT MEMORY
is not met for two messages Xk and mtercnanze their codes and arrive at a lower value Thus such codes cannot be of the optimum type. code it is necessary 2. For an
If we similar words to XN and XN-l the final our purpose is served. additional digit for XN and XN-l unnecessarily increases Therefore, at least two messages XN-l and XN should be encoded in words of identical length. However, not more than D such messages could have equal length. It can be shown that, for an optimum encoding, no, the number of least probable messages which should be encoded in words of length, is the integer satisfying the requirements
N-
-=--- =
integer
2 ::; no ::; D
3. Each sequence of length L(XN) - 1 digits either must be used as an encoded word or must have one of its prefixes used as an encoded word. In the following we shall restrict ourselves to the binary case A similar procedure applies for the general case as shown in Example 4-8. Condition 2 now requires that the two least probable messages have the same Condition 2 specifies that the two encoded messages of length m are identical except for their last digits. We shall select these two messages to be the nth and (n - l)st original messages. After such a selection we form a composite message out of these two messages with a probability equal to the sum of their probabilities. The set of messages X in which the composite message is replacing the afore-mentioned two messages will be referred to as an auxiliary ensemble of order 1 or simply AEI. Now we shall apply the rules for finding optimum codes to AEl; this will lead to AE2, AE3, and so on. The code words for each two least probable members of any ensemble AEK are identical except for their last digits, which are 0 for one and 1 for the other. The iteration cycle is continued up to the time that AEM has only two messages. A final 0 is assigned to one of the messages and 1 to the other. Now we shall trace back our path and remember each two messages which have to differ only in their last digits. The optimality of the procedure is a direct consequence of the previously described optimal steps. (For additional material see Fano [1].) , Huffman's method provides an optimum encoding in the described sense. The methods suggested earlier by Shannon and Fano do not necessarily lead to an optimum code.
EI.JEMENTS OF ENCODING
Example Given the following set of messages and their corresponding transmission probabilities [m 11m 2,ma] [~~,~,~~]
(c) Construct a binary code satisfying the prefix condition and having the minimum possible average length of encoded digits. Compute the efficiency of the code. (b) Next consider a source transmitting meesages m Im I
man, [ m.an,
mIma] m2m2 m2m a mam 2 mama
mlm2
Construct a binary code with the prefix property and minimum average length and compute its efficiency. Solution (a) If the binary code must have the prefix property, then we assign the following code:
The average length of the code word is ~~
o and
.1
+ ~~ . 2 + ~~ . 2
1 appear with probabilities of ,~ and
. Efficiency =
= % binits
%, respectively.
log 3 og
~l 2
73
= 0 .95
Shannon's encoding also leads to the same result. (b) We construct a Huffman code:
1 0 101 100 011 010 001
000 p{O}
=
1~~9
1 1 1 1 1 110
158
DISCRETE SCHEMES WITHOUT MEMORY
As all messages are equrproname,
Average length = 7
4 = 2: = 3.22 binits
. 2 log 3 Efficiency = 3.22 = 0.98 Example 4...8. Huffman's encoding procedure to the following message ensemble and determine the average length of the encoded message.
=
{X}
{Xl,X2,Xa,X4,X6,X6,X7,XS,X9,XlO}
= {0.18,O.17,O.16,0.15,0.10,O.08,0.'05,O.05,0.04,0.02}
p {X}
The encoding alphabet is {A} = {0,1,2,3}. Solution
Xl X2
0.18 0.17
0.18 0.17
°rl.
OO
0.18 1 0.491
1
0.17 2
2
0.16 3 0.16 .0 00 %4 0.15 1 01 X's 0.10 2 02 X6 0.08 3 03 X1 O.05} 30 Xs 0.05 31 %9 '0.04 32 Xu 0.02 3 33 = 0.18 X 1 + 0.17 X 1 + 0.16 X 2 + 0.15 X' 2 + 0.18 X 2 + 0.05 X 2 0.05 X 2 + 0.04 X 2 + 0.02 X 2 = 1.65 Xa
L
+
Gilbert-Moore The Gilbert-Moore alphabetical encoding is an interesting and simple procedure. While Shannon's encoding discussed in Sec. 4-7 was based on (4-31), the Gilbert-Moore method is based on the inequality 2 1-
n
k
~
P{Xk}
P{Xj}
(a) is valid, then n; ~ nj, but since OI.j
+ + 2-n + 2-
OI.i+
OI.j ~ OI.i
i
11,
j
we find-that the jth code word cannot be identical with the first nj places of the ith code word. A similar conclusion can be reached if (b) is true.. Thus the code has the prefix property. As an example, consider the first four letters of the English alphabet and their corresponding probabilities: [space,A,B,C] [0.1859,0.0642,0.0127,0.0218]
The corresnonomz ni and [4,5,8,7]
OI.i
and
are found to be [0.09295,0.2180,0.25635,0.2736]
Thus the desired code is Space: A: B: C:
0001 00110 01000001 0100011
160
DISCRETE SCH~MES WITHOUT MEMORY
A similar n"r\,nnr'1'l"ll1n_
'Y\''It"n.n,on1111'1t"O
has been
su~~ge:st€~a
the first n. digits of the binary expansion of This procedure, which preserves the original message order in a order and has also the is to as an alphabetical encoding. The amount of computation for an alphabetical encoding is very little, but the existing method for finding the alphabetical encoding with the least average cost is rather complex. One may wish to apply this latter procedure to the English alphabet in its ordinary alphabetical order. The Gilbert-Moore answer to this problem is given in Table 4-2. In the code listed in this table, word lengths have been shortened to a minimum without losing the Such codes have been referred to as the best alphabetic codes. The average length of the best alphabetic code can be made reasonably close to the best possible average length obtained by Huffman's technique. Fundamental Theorem Discrete Sec. 4-8 we discussed the first fundamental theorem of information was shown there that, for a given discrete noiseless memoryless channel with capacity C and a given source (without memit is possible to devise proper encoding proory) with an entropy cedures such that the encoded output of the source can be transmitted the channel with a rate as close to C as desired. In this section we wish to extend the foregoing to cover the case of discrete channels when independent noise affects each symbol. It will be shown that the output of the source can be encoded in such a way that, when transrnitted over a noisy channel, the rate of transmission may approach the channel capacity C with the probability of error as small as desired, This statement is referred to as the second fundamental theorem of information theory. Its full meaning will be minutely restated at the end of this section, where a more analytic statement is derived. Second Fundamental Encoding .Theorem. Let C be the capacity of a discrete channel without memory, R any desired rate of transmission of < C), and S a discrete independent source with a sneeineu information I t is possible to find an appropriate encoding procedure to encode the output of S so that the encoded output can be transmitted through the channel at the rate R and decoded with as small a probability of error or equivocation as desired. Conversely, such a reliable transmission for R > C is not possible. From a mathematical standpoint, the proof of this fundamental theorem and its converse is the central theme of information theory. nh.1t"n"n'11n1l....
11'1'1'111'11'1 1r'\01l'"11 Tl
log No log No
nco, + log
(n
+ ~) log
+ 2~) - log {3 ~
e + (na
+ ~) log
+ 2~a)
V2?rena~ -
< nco + log V2'n-na{3
+ (n{j + %)
+
a
1
The entropy per transmitted symbol satisfies the inequalities
11og N 0 -n
-
Co
> - 1 1og 21rena«(3
- a)7
(3
2n
+
~ log No -
Co
< 2~ log
2?rna{3e
2
+ (a +;n) log
1
+2~a)
+ ({3 + 2~) log
+ 2~(3)
By applying the above detection scheme and greatly increasing the word length, one finds lim
n~oo
(! log NO) - Co = 0 n
(4-100)
The upper bound of the number of words in the transmitter's vocamuarv approaches 2 n Co as n is made larger and larger. Meanwhile, the error in decoding each message a, is kept under controL If the message a; is transmitted with a probability pi, for i = 1, 2, . . . ,N, the average error remains bounded, that is, I
No
Average error
2,
>~, %0]
(b) Encode the output of the second-order extension of the source to the channel
in an optimum binary code. (c) Determine the coding efficiency in (a) and (b). (d) What is the smallest order of the extension of the channel if we desire to reach an efficiency of 1 - 10-3 and 1 - 10-4, respectively? 4-16. A pulse-code communication channel has eight distinct amplitude levels [XI,X2, • • • ,xsl. The respective probabilities of these levels are [Pl,P2, • • • ,Ps]. The messages are encoded in sequences of three binary pulses (that is, the third-order extension of the source). The encoded messages are transmitted over a binary channel (p,q). (a) Compute H(X). (bY Compute H(XIY). (c) Compute I(X;Y). (d) Calculate (a), (b), and (c) for the numerical case, where
= P2 = P3 = ~S P4 = Pi = P6 = ~16 P7 = >i ps = ;}6 P == 0.9
(1) PI
(2) PI = P4 = P7 = Ps = P ==
P2 po
= P3 = ~8 =
~~
;}6 0.99
~16
188
DISCR.ETE S'CHEMES WITHOUT MEMORY
We wish to transmit eight blocks of binary digits over a BEe. three positions are used for the information and the rest for checks. The following equations indicate the relations between information parity digits. 1 1 1 (a) Determine how many combinations of single and double erasures may be
corrected. (b) Find the average erasure per block after correcting all possible single and double errors. (c) What is the average rate of information over the channel? 4...17. Apply Hamming's single-error correcting in the following cases: (a) (b) (c) (d) (e)
m = 2
=4 m = 4 m = 6 m = 11 nt
k=3 k=3 k=4 k=4
k=5
4-18. Show that for a Huffman binary code H~L~H+l
Prove that Huffman's encoding for a given alphabet has a cost which is less than or equal to that of any uniquely decipherable encoding for that alphabet (see Gilbert-Moore, theorem 11).
PART
2
The science of physics does not only give us [mathematicians] an opportunity to solve problems, but helps us also to discover the means of solving them, and it does this in two ways: it leads us to anticipate the solution and suggests suitable lines of argument. Henri Poincare La valeur de 131 science
5
In Sec. 2-15 we presented the COIlLCel)t of a discrete sample space and its associated discrete random variable. In this section we should like to introduce the idea of a random variable assuming a continuum of values. Consider, for instance, X to be a random noise which can assume any value between zero and 1 volt. Since by assumption the outcomes of this experiment are points on the real line interval clearly X assumes a continuum of values. Furthermore, if we make this assumption, then we may state that X is a random variable a continuum of values. The preceding intuitive approach in defining a random variable is unavoidable in an introductory treatment of the subject. On the other hand, a mathematically rigorous treatment of this more or less concept requires extensive preparation in the professional field of measure theory. Such a presentation is the scope of this for a complete coverage see Halmos and Loeve, For the time being, the reader may satisfy himself with the following. As in the case of a discrete sample space, an event is interpreted as a subset of a continuous sample space. In the former case we have already given methods for calculating probabilities of events. For the continuous case, however, it is not possible to give a probability measure satisfying all four requirements of Eqs. (2-36) to (2-39) such that every subset has a probability. The proof of this statement is involved with a number of mathematical complexities among which is the so-called hypothesis." Because of these difficulties in the of continuous sample space, one has to confine oneself to a family of subsets of the sample space which does not contain all the subsets but which has subsets so that set algebra can be worked out within the members of that family (for example, union and intersection of subsets, etc.). Such a family of subsets of the sample space n will be denoted by ff (ff stands for the mathematical term field). More specifically, the events of if must satisfy the following two requirements: J."'.LJlJ.J..U.CCllIJ.
V.II.... 'J\A.Sioo•.L.a.
191
192
CONTINUUM WITHOUT MEMORY
E ff, then eo
U
ff
i=l
2. If A
ff, then
U
E
ff
The first simply implies that the union of a denumerable sequence of events must also be an event. The second n1"l"\no'ri"'u reouires that the complement of an event also be an event. With such a family in mind, the next step will be to define a probability measure P{ A} for every event A of that family. This can be done in a way similar to the definition of a probability measure over the discrete sample space, namely, 1. For each A E ff, (5-3) o ~ PtA} 2. For all denumerable unions of disjoint events of ff family, co
U i=l
3.
Ad {U} = 1
We assume the validity of these axioms and then proceed with defining the probability distribution and density of a continuous random variable. It is to be noted that, in the strict sense, a random variable need not be real-valued. One can directly define a random variatwo real-valued variables: ble x+v~lY
This simply requires the measure space to be a complex two-dimensional space rather than an ordinary real space. 5.. .2. Probability Distribution Functions. For simplicity, consider first the case of a random variable taking values in a one-dimensional real coordinate space. The probability that the random variable X assumes values such that E = {a < X ~ b} (5-6) is shown < X ~ b} = PtE} The event E in Eq. (5-6) consists of the set of all subevents such that their corresponding values of X satisfy the above inequality. In particular, consider the event E 1 defined by Note that
E 1 = {X ~ a} E1E = 0
00
E U E 1 = {X
1i1.·""TQ1rmV
(0) a, the prooamntv
X assum-
<x=:;
lim E~O
== lim
F(a......
== 0
«5-.0
For continuous random variables the probability of the random variable being' in an interval decreases with the of that interval and in the limit becomes zero. a discrete random if X = a is a value for the random variable, then =
lim [F(a) e-+O
The mathematical implication of this eouation is rather clear. However, the reader may find convenient for his F(x)
C
FIG. 5-1. Example of a discrete CDF.
FIG. 5-2. Example of the discrete distribution corresponding to Fig. 5-1 in terms of impulse functions. P(X = = a. P(X = b) =~. P(X =c) 0I.+~+'Y=1.
use to illustrate the density distribution function in the discrete case of Dirac or A unit effective with the at a x = a will be denoted uo(x The discrete probability distribution function of 5-1 leads to the density distribution function of Fig. 5-2. When dealing with continuous probabilities due to Eq. (5-15), we understand that the expressions P {X < a} and P {X :::; a} are equivalent. t While the rigorous use of Dirac "functions" requires special mathematical consideration, their employment is frequent and commonplace in electrical engineering literature. In this respect, Fig. 5-2 may be of interest to the electrical engineer.
CONTINUUM WITHOUT MEMORY
Example A random process gives measurements z between 0 and nrobamntv density function I(x) I(x)
with
= l2x 3 - 21x 2 + lOx =0 elsewhere
(a) Find P{X :::; >~} and P{X > >~). (b) Find a number K such that P{X :::; K}
=
~~.
Solution P{X :::; >~}
(a)
P{X
> >~}
K
10
(b)
r~ = Jo (12x 3 - 21x 2 = 1 -- 716 = X6
(12x 3
-
2lx 2
+ lOx) dx
+ lOx) dx + 5K2
=
== 'l6
~~
= >~
3K4 - 7K3
The permissible answer is the root of this equation between 0 and 1; this is found to be
K = 0.452
5.. .4. Normal Distribution. A random variable X with a cumulative distribution function given by F(x)
1
= P{X < z ] = 00
exp [ - (t - 2a)2J dt 20-
0-
is called a variable with normal or gaussian distribution. sponding density function is 1
. [
f(x) = u y12,; exp
(5-18)
The corre-
(x - a)2]
-
2u 2
which is symmetrical about x = a. The numbers a and 0- are called the average and the standard deviation of the random variable, respectively; their significance' is discussed in Chap. 6. One may be interested in checking the suitability of the function f(x) of Eq. (5-19) as a density function. In other words, one has to show that
1-"'", f(x) dx =
1
For this purpose, we may shift the density curve to the left by a units. Next, we consider the double integral
(f '" ----.!yI 211'" -
00
E- z'/2u'
dX)2
= ~ 211'"0"
0-
=
foo
e-·:r;2/2q 2
-- 00 -
00
foo -
-~ J'oo J''''
211'"0-
dx
-
00
e-y2/2q2
dy
00
e-(z'+u'l/2u3
dx dy
(5-20)
CONTINUOUS PROBABILITY DISTRIBUTION AND DENSITY
coordinates
mtroducmz the familiar 1
e- r 'l/ 2Q2 r dr dO =
In the next it will be shown that the a in denotes the "average" value of the random variable with a normal distribution. When the parameters of the normal distribution have values
a=O
and
a
=
1
the distribution function is called standard normal distribution. F(x)
= P{X < x}
1
=
e-(t 2 / 2)
dt
Tables of standard normal distribution are commonly available. use such tables one first employs a transformation of variable of the
To
x-a
t=-(J
in Eq. (5-19) in order to transform the normal curve into a standard normal curve. Standard normal distributions are given in Table T-2 of the Appendix. The use of this table for the of a random variable being in an interval is self-explanatory. All one has to suggests the between remember is that bility and the area under the density curve between points of interest. For example, if X has a standard normal probability density distribution, then P{O < X < 2} = 0.47725 P{ -2 < X < 2} = 0.95450 P{ (X < -2) U (X > 2)} = 1 - 0.95450 = 0.04550 P{X < 2} = 0.97725 P{X > 2} = 1 - 0.97725 = 0.02275
More detailed information is
in Table T-3 of the
.L.1.U·UV.l.
.I.\A..I.~.
Example 6·,4.. The average life of a certain type of electric bulb is 1,200 hours. What percentage of this type of bulb is expected to fail in the first 800 working hours? What percentage is expected to fail between 800 and 1,000 hours? Assume a normal distribution with (1' = 200 hours. Solution. Referring to Sec. 2-8, one notes that in a large number of samples the frequency of the failures is approximately equal to the probability of failure. In this connection the word percentage is used synonymously with frequency . Using the average life of 1,200 hours for a, we make a change of variable y = (x - a) ItT which
198
CONTINUUM WITHOUT MEMORY
allows the normal curve in y to be symmetrical about y == 0, of a table of normal probability. Yo ==
_ fx-soo
1
x=1,200
200
V 211" exp
x -
(J .
==
-q-
800 - 1,200 2 200· == -
f
[(X -
1200)2J 80 000 dx == ,
-
2
1 ~ /- e- y 2 / 1 dy == 0.477 211"
0 V
The area under the whole normal curve being unity, the desired probability is 0.500 - 0.477 == 0.023 For the second part of the problem, let , _ 1,000 - 1,200 200
-1
Yo -
fo
1
fCy) dy = 0.341
0.500 - 0.341 == 0.159 0.159 - 0.023 == 0.136
The reader should note that, in view of our assumption of normal distribution, there is a fraction of bulbs with negative life expectancy (- eo to zero). This fraction is included in the above calculation; that is, the number 0.023 == P { - eo < X < 800} will be somewhat larger than
o
,J
t
P{O
< X < 800}
~
Of course, if we had used a density distribution bounded between 0 and %=1,000 infinity, we should not be confronted with y=-l the problem of negative life expectancy. On the other hand, tables of such distriFIG. E5-4 butions are not readily available. The calculation of P{800 < X < 1,200} in lieu of P{O < X < 800} was a simple matter of using Table T-2 of normal distributions. %=800 y=-2
life of the bulb, '%=1,200 hours y=O
5-5. Cauchy's Distribution. Cauchy distribution if F(x)
A random variable X is said to have a
< x}
=
t
-00
r(I dt + t2)
The corresponding density function is f(x) = reI
Note that
f(x) dx
I
+
x 2)
= r-I [tan- 1 x]~oo = 1
(5-24)
CONTINUOUS
DISTRIBUTION AND
E5...5.
_ _ _ _---*-
__l o__
....
X
a
FIG. 5-3. A normal PDF.
FIG. 5-4. A normal CDF.
Example 6-5. Consider a point M on the vertical axis of a two-dimensional rectangular system with OM == 1. A straight line M N is drawn at a random (Fig. E5-5). What is the probability distribution of the random variable ON == X?
I(x)
F(x)
o
o
FIG. E5-5
Solution. The problem suggests that the angle (J in Fig. E5-5 has a uniform ability distribution. Thus the probability of drawing a line in a particular d8 is The random variable of interest is ON == X == tan 8. Accordingly, dx
dB == 1 P{x - dx ~ X ~ z] == P{8 - d8
+ x2 < angle
~
8}
1
f(x) == r(l F(x)
==
F(x) ==
6-6. the
~X]ponlen·t18.1
:c
+ x:l) :c
f(x) dx ==
! ltan- 1 x]:oo
Distribution.
==
!
r
dx x
+
A probability density distribution
j(z) = ae-Ga: d» a >0 f(x) = 0 elsewhere
z>o
(5-25)
200
CONTINUUM WITHOUT MEMORY
is referred to as an exoonenuaa distribution. The corresnondme by p < x} = ae:" dt = [- e......t]~
is
h'"
F(x)
= 1 - e-ax
A graph of the exponential density and its CDF are given in and b, respectively.
FIG. 5-5. (a) Example of an exponential PDF. in Fig. 5-5a.
5-5a
(b) CDF of the density illustrated
5.. . 7. Multidimensional Random Variables. The coordinate space can be a multidimensional space. In this case the random variable X assumes values of the type (Xl,X2,X3, • • • ,x n ) , that is, n-tuples of real numbers. For example, if four dice of different colors are thrown simultaneously, the random variable associated with the outcome takes. certainnumber quadruples as values. In fact, we are considering a sample space that is the cartesian product of a finite number of other sample spaces. If the outcome E; is in the sample space nk , the n-fold outcome E is defined as a point in the cartesian product space n, that is,
Ek E
nk
k = 1, 2, E = {E 1,E2,
n = n1 ® n2 ® Then
,n ,En}
· · · ®
(5..27)
nn
EEn
If the outcome E is a permissible point of the product sample space n and if the events E k are mutually independent (this is not always the case), then the probability measure associated with E equals the product of the individual probability measures, Le., (5-28)
It is to be noted that by the event E; E nk we understand the set of all events in the n space where the kth random variable assumes a specified value but variables other than that are arbitrary. Such a set of events is usually called a cylinder set. The probability measure associated with this cylinder is defined as the probability of the event Elk. By analogy with the one-dimensional case, we define the cumulative probability distribution function (CDF) of the n-dimensional random
CONTINUOUS PROBABILITY DISTRIBUTION AN'D DENSITY
as
variable
.z,) = P{ -
< Xl::;
00
= P{X 1 :S
Xl,
X2
00 < ::; ... ,-00
f(x,y)
dy
= f-«>~ o exp [ -
(:: - 2
Xi + t:)] dy
(5-65)
208
CONTINUUM WITHOUT MEMORY
of variable
the following results: 1
flex)
e-(x 2/2(J':x;2)
(J',;;
1
f2(Y)
e-(y2/2u y2)
Uy
with
U,;;2
=
'12 -----
u
y
2 -
1
2' k 2
b2k2 a2 b2
(5-68)
The marginal density distributions are also normal distributions. 3. The conditional probability densities, as obtained from the joint and marginal densities, are
(5-69) _ !(x,y) fy(y\x) - !l(xf
exp
[-
k2
2(k 2 - a2b2)
(~y - abx)2] ku,;;
For any given value of x or y, the associated conditional densities as well as the marginal densities are normally distributed. 5-11. Functions of Random Variables. One of the most fundamental problems in mathematics and physics is the problem of transforming a set of given data from one coordinate frame to another. For instance, we may have some information concerning a variate X = (X 1,X2, • • • ,Xn ) , and, knowing a function of this variate, say Y = g(X), we wish to obtain comparable information on function Y. The simplest examples of such functions are given by ordinary mathematical functions of one or more variables. In the field of probability, knowing the probability density of a random variable X, we desire to find the density distribution of a new random variable Y = g(X). A moment of reflection is sufficient to realize the significance of such queries in physical problems. In almost any physical problem, we express the result of a complex observation or experiment in terms of certain of its basic constituents. We express the current in a system in terms of certain parameters, say resistances, voltages, etc. Thus, the problem generally requires computing the value of an assumed function, knowing the value of its arguments. The computation of interest may be of a deterministic or a probabilistic origin. Our present interest in the problem is, of course, in the latter direction. First we shall consider the case of a real single-valued continuous strictly increasing function. Then the procedure will be extended to cover the more general cases.
CONTINUOUS PROBABILITY DISTRIBUTION AND DENSITY
One-dimensional Case. Let X be a random variable and let g(x) be a real single-valued continuous strictly increasing function. The CDF of the new random variable Y = g(X) can be easily calculated as G(y) = P{g(X) :::; y} = 1 G(y) = P{g(X) ~ y} = F[g-l(y)] G(y) = 0 g-l (y) being the inverse of g (y) .
Thus, G(y), the CD:F of the new variable, is completely determined in terms of F(x) and the transformation g(x). If X has a density f(x), the density of Y = g(X) can easily be found as
When this latter integral exists, the density function of Y is p(y) = 0 p(y) =
y ~ g( - 00)
for f(x)
ldY7dXl
g( - 00)
p(y) = 0
~6
·252
=:
7
This direct calculation checks with the result E(X + Y) = 'J1 + 'J1 = 7. The product of the two face numbers assumes the following values, with their corresponding probabilities. 1
2
3
4
5
6
8
9
10 12 15 16 18 20 24 25 30 36
~6%6%6%6%6%6%6~6%6%6%6~6%6%6%6~6%6%6 E(XY) = ~~6(1 + 4 + 6 + 12 + 10 + 24 + 16 + 9
+ 20 + 48 + 30 + 16 + + 48 + 25 + 60 + 36)
36
+ 40
E(XY) = >'36 . 441 = 4% E(X) ·E(Y) = ~ % = 4%
a Univariate Random Variable. Equations (6-1) and (6-5) describe the mathematical expectation associated with a general function of a random variable g(X). The particularly simple function g(X) = Xr
plays an important role in the theory of probability and application problems. The expectation of Xr, that is,
or is called the rth moment about the origin of the random variable X. The definition is contingent on the convergence of the series or the existence of the integral, i.e., on the finiteness of the rth-order moment. For r = 0 one has the zero-order moment about the origin.
STATISTICAL AVERAGES
For r = the first-order moment about the origin or the mean value of random variable is 00
E(Xl) = E(X) = m, =
L
PiXi
i=l
+00
m, =
or
co
xf(x) dx
For r = 2 the second-order moment about the origin is E(X 2) =
or
E(X
2
)
=
PIX12
+ P2X22 + · · · + Pn Xn 2 + ·
J-+"," x
2f(x)
dx
The rth-order moment about a point c is defined by (6-17)
A very useful and familiar case is the moment of the variable centered about the point ml, the mean value of the variable. Such moments are called central moments. JJr = central moment of order r = E[X - E(X)]r = E(X
ml)r
The values of central moments of first and second order in terms of ordinary moments are discussed below.
First-order Central Moment Pl
= first-order central moment = expectation of deviation of random variable from its mean value m;
= E(X fJ.l = E(X -
ml) ml)
= E(X) -
E(ml)
=
ml -
m, = 0
(6-19)
Second-order Central Moment JJ-2
= second-order central moment = E[(X -
fJ.2 =
E(X2) - 2E(Xm l )
+ E(m1
2)
=
ml)2]
E(X2) - m12
(6-20)
The second-order central moment is also called the variance of the random variable X. fJ.2 = var(X) = E[(X - m)2] = E(X2) - m12 = m« - m12 The nonnegative square root of the variance is called the standard deviation of the random variable X. Standard deviation =
(J'X
=
vi fJ.2 = vi m2 -
m1 2
(6-22)
The physical interpretation of the first and second moments in engineering problems is self-evident. The first moment m, is the ordinary
226
CONTINUUM WITHOUT MEMORY
mean or the average value of the under conaideration, the average of the square of that quantity (mean square). For mstance, if X is an electric current, ml and m2 are the average (or d-e level) of the current and the power dissipated in the unit resistance, resnecuverv Similarly, the standard deviation is the root mean square (rms) of current, about its mean value. Example Solution
In part (a) of Example 6-1, find mi, me, JJ2, and E(X) == m, == 7.6 E(X2) JJ2
a
== m2 == ~o ·100 + %0 ·49 + ~lo ·16 == 61.0 == m2 - m1 2 == 61.00 - 57.76 == 3.24 = V;; == 1.8
The above definition can be extended to the sum of two or more random variables. The pertinent algebraic operations will be simplified by the use of some familiar mathematical formalism. Let E(X) and u(X) stand for the expectation and the standard deviation of a random variable X, respectively; then for the sum of two random variables we write var (X
+
Y)
= E[(X + Y)2J - [E(X + Y)]2 = E(X2) + E(Y2) + 2E(XY) - [E2(X)
+ E2(Y) + 2E(X)
If the two variables are independent, then Eq. (6-23) yields
· E(Y)}
= E(X) ·
(6-23)
and
This result can be extended to obtain the standard deviation of the sum of a finite number of independent random variables. (6-24) Example 6-4. X is a random electric current normally distributed, with g == 0 and a == 1. If this current is passed through a full-wave rectifier, what is the expected value of the output? Solution. Let the output of the rectifier be Y; then
Y ==
IXI
Applying Eqs. (5-73), the probability density of Y is
The density distribution has the shape of a normal curve for y where.
~
0 and is zero else-
amperes
STATISTICAL AVERAGES
The internal noise of an amplifier has an rms value of 2 volts. When the signal is added, the rms is 5 volts. What would be the rID.S value of the output when the signal is Solution. This simple example is rather familiar to electrical engineers. the solution may look obvious, we shall give the hypotheses under which the familiar solution is obtained. Let Sand N be independent random variables (signal and noise) with zero means and given standard deviations (J'x and ««. Root mean square of S = yE(S2) = Root mean square of N = VE(N2) =
=
(fa
= (fn = 2
As the random variables are assumed to be independent, by Eq. (6-23a) we find E[(S
+ N)2] = E(S2) + E(N2) = (fa"l.
The problem asks for Now
+ N)2 yE(3S + N)2 E(3S
= =
(fa'},
+
= 52
(fn?
= 52 - 2 2 = 21
yE(3S E(9S 2 )
+ N)2 + E(N2)
= 9 X 21 = 13.9 volts
+2
2
Note that no additional assumption is necessary as to the nature of the distribution functions of the signal and the noise.
An Inequality Second-order Moments. We should like to compare the second-order moment of a random variable about a c with the second-order central moment J.l.2 of the same variable. E[(X C)2] E[(X - C)2] - C)2]
= E[(X - m, + m, - C)2] = E[(X ml)2] + - ml)(ml = J.l.2 + (ml - C)2
The second-order moment about any central second moment: E[(X
C)2]
c
~
c)]
+ E[(ml
- C)2]
m, is larger than the
>
The fact that the second-order central moment of a random variable is the smallest of all second-order moments is of basic significance in the theory of error. In the analogy with electrical engineering, note that the smallest possible root mean square of a fluctuating current or voltage J(t) is obtained when that quantity is measured with respect to its mean rather than any other level. value Chebyshev Inequality. The Chebyshev inequality suggests an mterestmz relation that exists between the variance and the spreading out of the probability density. Let X be a random variable with a probability density distribution J(x), first moment m-, and standard deviation cr. Chebyshev's inequality states that the probability of X - ml assuming values larger than ka is less than 1/k 2 •
P {lX - mIl
.~ ku}
:s;
1 k2
K
>0
(6-25)
228
to
CONTINUUM WITHOUT MEMORY
To prove the validity of this inequality, let Y 6-1. The desired probability is P{ y
2:
ku} =
r.:
f(x) dx
-00
=
IX
+ f,ml+kcr 00 f(x) dx
Multiplication by k 2u 2 yields k 2u 2P{lX -
mll 2::
ku} =
i:: k -00
2u 2
f (x) dx
+ f,oo k ml+kcr
2u 2
f (x) dx
Note that in each of the ranges of integration ko :::;
Thus k 2u 2Pl!X -
mll ~
Ix - mll
ku} :::; r:~-ku (x - ml)2f(x) dx
+ f,ml+kcr 00 (x -
ml)2f(x) dx
(6-27)
But the right side is certainly not greater than the second-order central moment of X; therefore, k 2u 2P{iX - mll ~ ku} :::; (x - ml)2f(x) dx = u 2
f-"'",
Dividing both sides of this inequality by k 2u 2 gives the desired inequality. I(x)
FIG.
~y
I
6-1. Illustration of Chebyshev's inequality.
The Chebyshev inequality expresses interesting bounds on the probability of the centralized random variable exceeding any units of standard deviation. For example, P {IX - mll 2:: 2X (t) dt for X = flex) =
for Z = fez) =
is to be noted that the between a function its associated characteristic function is that of the familiar Fourier I this in we recall the convolution theorem of the Fourier Laplace) transforms. According to the theorem for Fourier transforms, the inverse Fourier Laplace) transform of the two functions to the convolution integrals of the two inverse if functions. That ln1",OO"l"Q
flex) = ff[c/>x(t)] f2(Y) = ff[c/>y(t)] = ff[c/>x(t)c/>y(t)] = flex)
then where
flex)
* f2(Y)
(6-93)
* f2(Y)
fl(z - y)f2(Y) dy
=
(6-95)
The result of (6-95) can be generalized in order to obtain the probability density function of the sum of a finite number of muenenuent random variables. Let n
=L 1
where the
are all independent.
The probability density of X is (6-96)
If every variable is distributed, then their sum will also normally distributed. The same is true when all variables have distributions of the Poisson or binomial type (see Probs. 6-14 and 6-7).
244
CONTINUUM WITHOUT MEMORY
Example 6-12. X and Yare two independent random variables 'with standard normal distributions. Derive the distribution of their sum. Solution. According to Eq. (6-94), the probability density of Z == X + Y is g(z)
== !l(X) * f2(X)
1_ J2; = i.1
X)2J} J21r [exp ( -
00
g(z) =
{ex p [-(z 2-
00
_0000
2 1
(x - ~r
exp [ -
e-"/
~
4
1_
-
~J dx
00 00
~) J dx
exp [ -
(x - ~rJ dx
Make the change of variable: z v x--=-
(V dv exp - '2 V 2 == 1 2
co -co
)
== -1-
g(z)
Thus
y!2
2
y; 1 1
2Y;
e-(z'J/4)
The density distribution of Z is also a normal distribution with zero mean but a standard deviation of V2. Note that the mean and standard deviation could have been predicted by applying Eq. (6-23a).
+ E(Y) == 0 + E(Y2) == 2
E(Z) == E(X) E(Z2) == E(X 2)
Example 6-13. A gun fires at a target point which is assumed to be the center of the rectangular coordinate system. Let (x,Y) be the coordinate of the point at which the bullet hits, and assume that the associated random variables X and Yare independent. We furthermore assume that X and Yare normally distributed with 0 mean and standard deviation 1. Find the probability distribution of the random variable R2 == X 2 Y2
+
Solution.
It was shown in Sec. 5-11 that the density of X 2 is
Application of the rille of convolution yields f(r 2) ==
r
r
10
_1_.
e- z / 2
VZ;Z
1
== 2r e-
r i /
t'
10
1 V21r(r - e)
e-(r-z)/2
dz yz(r - z)
From an integral table one finds
r 1 fco yz(r. -
z)
Jr ==
. 2z - r dx == sln- 1 - r 0
Finally we find r~O
11"
dz
STATISTICAL AVERAGES
Example The random variables X and Yare uniformly distributed over real interval zero to a (a > 0). Find the density distribution of (a)
z=x+y
(b)
Z==X-
Solution (a) Since it is likely that most readers are more familiar with the ordinary (onesided) Laplace transform than with the Fourier transform, in this example we shall change the parameter t to s, which is the symbol commonly used in engineering texts in conjunction with Laplace transforms. We also assume that the readers are familiar f(z)
f(z)
1. u-2 (z-2a)
a2
1
a
o~-------~-------z a
FIG.
E6-14
with the concept of a singularity function, such as is exemplified by the unit step Based on this notation,
U-1(t) and unit ramp U-2(t). f1(X) f2(Y)
a1 U_1(X) - a1 U-l(X - a) density of X 1 1 == aU-l(Y) - aU-l(Y - a) density of Y 1 "':I:(s) = £ aU-l(X) - £ aU_l(X - a)
==
"':I:(s)
= !as - !as e- a ,
"'1/S) == !as - .! eas
a8
Since the two variables are independent, Eq. (6-69) gives
The desired density function is the inverse Laplace transform of "'is); thus, f(z) = £-1 22 (1 - 2e- aa
a
8
+ e- 2a a)
1
= '2 [U_2(Z) - 2U_2(Z - a)
a
+ U_2(Z
- 2a)]
The graphical representation of f(z) follows directly from the definition of the step and unit ramp. This is illustrated in Fig. E6-14. (b)
",,(s)
1
== - a 28 2 (1 - e- a s) (1 - e 1
G
== - a28 2 (2' - e- a , f(z) == -
~ [2U-2(Z)
a
-
' )
eGa)
- U_2(Z - a) - U-2(Z
+ a)]
246
CON'rINUUM WITHOUT MEMORY
It should be noted that for the convenience of calculation we have interchanged . ., defining terms of the Laplace transform with its inverse. Also note that function is similar to that of the previous case but displaced by a distance - a along the axis of the random variable. J
The simplicity of computation here lies in the symbolism used for describing the rectangular-shape function in a closed form through the use of singularity functions. Such methods are commonly practiced in electrical engineering problems. The reader, however, is warned against applying such techniques to density functions that are defined for -- 00 to + 00 • The proper extension of the above technique to such a problem requires the use of two-sided Laplace (or Fourier) transforms. PROBLEMS
6...1. X and Yare random variables uniformly distributed in the interval 0 to 1. Find (a) PIX < Y < + I} (b) (c) (d) (e)
P {X
s
Y
P{IX - YI
< -
IA
0
11
A 12 ~
A 22
A 18 A 21 A 22 A 23 A 3! A 32 A 33 Determinant of [A]
>0
1
>
0
>0
In order that Eq. (7-14) represent a normal distribution, we must have
J-"'", J-"'", . · · J-"'", !(Xl,X2,
en
· · • ,x n ) dXl dX2 · · · dx; = 1
(7-21)
J_"'", J_"'", ···("'", exp [- ~ Q(Yl,Y2, · · · ,Yn)] dYI dY2 · · . dYn = 1
(7-22)
The value of C« can be determined from Eq. (7-22) for any given matrix [A] and set of real numbers aI, a2, . . . ,an. The detailed calculation of C; requires space which is not presently available. The interested reader is referred to Cramer, Laning and Battin, or Wilks. It can be shown that the constant has the value
en
Cn
VIAl
----;;[2
(21r)
Joo- Joo 00
-
00
. · ·
Joo -
exp
v-IAI
(7-23)
= (21r )n/2
[-
%
00
(Xi - ao)(Xj - aj)] dXl · · • dXn
=
1
(7-24)
CONTINUUM WITHOUT MEMORY
The average value and covariance of
The variance are _ ak)2] = cofactor of
Covariance of
and
In the particular case, when all pertinent random variables , X n are mutually independent, we have Covariance of Xi and X, that is, A is a diagonal matrix.
=
0
The variances become k
= 1,2, . . . ,n
The normal multivariate density of n variables is given by
1
=
(27r)n/20"10'2
.&..&..&.lULVlUU.JlIJI..&._,
(7-28)
random
1 • Un
n
(7-29)
That is, in the case of mutually independent normal variables the joint density distribution is the product of n one-dimensional normal distributions. Distributed In(lelllen,aellt The object of this section is to exhibit a most useful property of gaussian distributions. It will be shown that any random variable consisting of the linear combination of several normally distributed independent random variables has itself a normal distribution. Let Xl and X 2 be independent random variables with normally distributed density functions with zero means and standard deviations 0'1 and 0"2: 1'\1 1"\11"114\"'110 U",y
NORMAL DISTRIBUTIONS AND LIMIT THEOREMS
and
fl(xl) =
1
e-
X 12 /
20' 12
e-
X 2 2/
20' 2 'J.
0'1
!2(X2) =
1 0"2
If
problem is to find the density function for the random variable (7-32)
iles Xl, fere al and a2 are real numbers. According to the rules of transforma""on of variables (Sec. 5-11), the density functions for random variables 1 = a 1X 1 and Y2 = a 2X 2 are, respectively,
(7-2 nd
1 1
(7-2~
~he
nt randol
(7-33) (7-34)
density function for the random variable (7-35)
.an be obtained by convolution, as described in Chap. 6, since Y 1 and Lre independent variables. Thus,
] (7-29) bIes the normal
t useful random
lly disbution. lly dis"iations
(7-36)
254
CONTINUUM WITHOUT MEMORY
desired
l'iO'1l'''C1Ii"''IrT
function
is
That is, Y also has a normal distribution with zero mean and standard deviation (J'=
The above procedure can be applied repeatedly in order to obtain the density function of the linear combinations of several distributed independent random variables. Thus the theorem can be established by induction. Theorem. Let X k be distributed random variables with zero means and variance Uk')" that f(Xk)
1= --
k = 1,2, . . . ,n
e-Xk2/2uk2
Uk
The random variable Y obtained
p(y)
(J',
distributed random variable with zero mean and standard that
Variance of Y = 1
exp [ - 2(a1 2 where
+ ... +
+
y= is a deviation
a linear combination of
OI.k
u
2
=
a1
+ a22 ~2
=
akUk
2(J'12
+
••
a22u22
+ ··.+
+ ar,2)]
k = 1, 2,
an
2u 2
n
1 =
,
n
In the sections we have .discussed how the linear combination of n random variables with normal distribution leads to a normal distribution in n-dimensional space. The engineering implication of this theorem is of great significance, In many applications we may be able to assume linearity or superposition of the effects of several independent causes. In such lems, if each cause has a normal distribution, the of the above theorem is justified. We may be concerned with the addition of signals in an adder, or the series or parallel combination of a number of
NORMAL DISTRIBUTIONS AND LIMIT THEOREMS
In each case some kind of law may and this law may be to obtain of the over-all sum of the effects.. In the stated theorem of Sec. 7-3 we have assumed a normal distribution for each variable. In this section we remove this restriction and show that under reasonably circumstances the density distribution of the sum of n random variables approaches normal distribution as n is greatly increased. To be more FIG. 7-1. (a) A number of random quanexact, the following very significant tities summed up in an adder; (b) a theorem, called the central-limit series combination of a number of physical elements. theorem, holds. Theorem. Let X k (k = 1,2, . . . ,n) be mutually independent random variables with identical density distribution functions with a finite average m, and standard deviation
(12
k} ~ nk 2
But, for a given pair of (1 and k, the ratio of (12/nk 2 tends to zero as n ---7 00 Therefore (7-72) P{ISn - ml > k} ~ 0 n--'OO This inequality can also be written in the alternative form P{18 n
-
ml
S; k} ~ n--.OO
Strong Lono of Large Numbers. The following theorem is given without a proof. Theorem. Let X 1, X 2, • • • , X n, . . . be an infinite sequence of independent random variables such that i = 1,2, . i = 1,2, .
Let w be the set of points of the sample space for which
. Xl(W) + X + .. + Xn(w) = m -----11m - - - - -2(w) n
n--' OO
Then W has probability 1. (is said to be almost certain). The strong law implies that the limit of the average approaches the common expected value of the afore-mentioned independent variables. For proof see Loeve and Fortet. It should be kept in mind that the laws of large numbers and the central-limit theorem can be stated under more general circumstances than those assumed in this section.
NORMAL DISTRIBU'l'IONS AND LIMIT THEOREMS
PROBLEMS 7-1. An honest coin is tossed 1,000 times. (a) Find the probability of a head occurring in less than 500 times. (b) Find the probability of a head occurring less than 500 but more than 450 times. Use the normal approximation. (c) Same question as (b) but use Chebyshev's inequality. 7-2. X is a random variable with binomial distribution (n = 300, p = ~'3). Approximate P{x S 100}. 7-3. For a random voltage V assume the following values with their respective pro babilities : [-2, -1, 0, 1, 2 ] [0.05, 0.25, 0.15, 0.45, 0.10] (a) Find V. (b) Find the standard deviation of V. (c) If the voltage Vis applied across a 20-ohm resistor, determine the average power W dissipated in the resistor in unit time.. (d) What is the probability of the power W being less than 0.10, 0.20, 0.50 watt, respectively? (e) The voltage V is applied to the same resistor but only for a period of ~2 microsecond. N ow suppose that this experiment is repeated 10 6 times. Find the probability that the total dissipated power "will remain between WI and W2 watts. (I) Calculate part (e) for WI WI
WI
= 0.01 = 0.05 = 0.10
W2
= 0.02
W2
= 0.10
W2
= 0.15
7-4. The independent noise voltages VI and V 2 are added to a d-c signal of 10 volts after going through amplifiers of parameters k: and k 2, respectively. Find the density of the output V = kIV I +k 2V2 + 10 (a)
VI and V 2 are normal with parameters (3,4) and (-2,3).
(b) V I and V 2 are uniformly distributed between 0 and 1 volt.
7-6. Let U and V be the output of two adders. The input to each adder is obtained through a number of linear networks not necessarily independent, that is, n
U
L
=
n
V
akXk
=
k=l
I
bkxk
n=l
(a) Determine the standard deviation of U and V.
(b) Show that the covariance of the output of the adders is n
n
LL
i=li=l
where
Pij
is the covariance of
Xi
and X j.
aib;l'ii
CONTINUUM WITHOUT
Let S« be the average n identically distributed binomial distribution (p,q). Prove that
7-1" Let X be a random signal with (m,cr). (a)
PIIX -
(b) pt\X -
m\ ~ a} ~ 95 per cent
m\:::;
(c) P {fluctuation
(d) P
{I
X
~
I
99 per cent
== X
I
~m s
~ 95 per cent
~ m I ~ a} ~ 99 per cent
for a for a
~ ~
for a
~ 4.5 I~I
for
4.5u lOu
a~ 10 I~I
(e) When the signal is approximately normally distributed, the numbers 4.5 and 10 in the above inequalities could be respectively reduced to 1.96 and 2.58 [see Parzen (Chap. 8)]. '1-8" Show that the characteristic function of an n-dimensional normal distribution is
t/>(tth, . . . ,t..)
where
Olit
is the second joint moment of =
0
~
== exp ( -
for k
and Xi, and
=
1, 2, . . . , n
Consider a normally distributed variable with the probability
where, as usual,
Q=
OI.ii
(X i
-
Find the probability density of Q by using the concept of moment-generating function and transformation of variables.
CHAPTER
8
8-1. Definition of Different Entropies. In a preceding chapter we studied the transmission of information by discrete symbols. In many practical applications the information is transmitted by continuous signals, such as continuous.elec~r~c'Y~Y~s.. That is, the transmittedsIgn-ir---~mi8-1)i~--ac~~t~u~u-s--fli~cti~'~ 'of-time during a finite time interval. During that interval the amplitude of the signal assumes a continuum of values with a specified probability density function. The main object of this chapter is to outline some results for continuous channels similar to those discussed for discrete systems, principally the entropy associated with a random variable assuming a continuum of values. The extension of mathematical results obtained for finite, discrete systems to infinite systems or systems with continuous parameters is quite frequent in problems of mathematics and physics. Such extensions require a certain amount of care if mathematical difficulties and inaccuracies are to be avoided. For example, matrix algebra is quite familiar to most scientists in so far as finite matrices are concerned. The same concept can be extended to cover infinite matrices and Hilbert spaces, the extension being subject to special mathematical disciplines that require time and preparation to master. Similarly, in network theory, one is familiar with the properties of the -.l----.....-T---+----Time axis rational driving-point impedance functions associated with lumped FIG. 8-1. A continuous signal. linear networks; but when dealing with impedances associated with transmission lines, one is far less knowledgeable as to the class of pertinent transcendental functions Ut;i::;lil1U.l.ll~ the impedances. (In fact, as yet there is a very limited body of work available on the extension of existing methods of network synthesis from lumped-parameter to distributed-parameter systems.) This pattern of increased complexity of analysis, requiring special mathematical consideration for passage from the finite to the infinite from the discrete to the continuous, also prevails in the field of information theory. 267
CONTINU1JM WITHOUT MEl\ifORY
One method of is to extend the definitions from discrete to continuous cases in a way similar to the nresentation the of discrete and continuous random variables. used such a for the .n..n·t-n-t- .. we have a continuous random variable. has merit of and convenience, but at the expense of not being well defined from a strictly mathematical point of view. the significance of the entropy of a continuous random variable becomes somewhat obscure, as is shown later. The mathematically inclined reader may find it more tenable to start this discussion with the definition of the mutual information between two random objects each assuming a continuum of values. 'I'he procedure has been outlined in Shannon's original paper as well as in a fundamental paper of A. N. Kolmogorov (see also the reference cited on page 289). The definitions of the different entropies in the discrete case were based on the concept of different expectations encountered in the case of two-dimensional discrete distributions. In a similar way, we may introduce different in the case of one-dimensional or multidimensional random variables with continuous distributions. For a one-dimensional random .n. ....,...,.....
r.."!n
....... lIJt.............. 'v ....... v
H(X) =
f(X)] = -
+00 00
f(x) log f(x) dx
(8-1)
The different entropies associated with a two-dimensional random variable possessing a f(x,y) and densities flex) and f2(Y) are Y) = E[ - log f(X,Y»)
+00 00 f(x,y) log f(x,y) dx dy H(X) =
fleX) log fleX) dx
H(Y) = E[ - log f2(Y)] = H(XI Y) =
(8-2) (8-3)
(8-4)
log fll(X! Y)] [(x,y) dx f2(Y) - log fx(YIX)]
= - f+OO -00
f(x y) log f(x,y) dx dy , fleX)
(8-6)
Finally, as will be out in Chap. 9, for an n-dimensional random variable possessing a probability density function f(Xl,X2, . . . ,x n ) , the entropy is defined as
CONTINUOUS CHANNEL WITHOUT MEMORY
+00
f(x1, · · · log f(xl' . . . ,xn ) 00
All definitions here are upon the existence of the corresnonding integrals. The Nature Mathematical .uttficulties Bn-':lrnnrlQ!r\l tion we describe the mathematical difficulties encountered in extendmz the concept of self-information from discrete to continuous models. There are at least three basic points to be discussed: 1. The entropy of a random variable with continuous distribution may be negative. 2. The entropy of a random variable with continuous distribution may become infinitely large. Furthermore, if the probability scheme under consideration is "approximated" by a discrete scheme, it can be shown that the entropy of the discrete scheme will always tend to infinity as the quantization is made finer and finer. 3. In contrast to the discrete case, the entropy of a continuous does not remain necessarily invariant under the transformation of the coordinate systems. Of these three difficulties, perhaps the first one is the most apparent. The second and third require more explanation. For this reason we treat item 1 in this section but defer discussion of topics 2 and 3 "a later section. Negative Entropies. In the discrete case all the involved are positive quantities because the probability of the occurrence of an event in the discrete case is a positive number less than or equal to 1. In the continuous case, co 00
f(x) dx = 1
. . f-.. . f(x,y) dx dy = 1
(8-8)
Evidently, the density functions need not be less than 1 for all values of the random variable; this fact may lead to a negative entropy. A situation leading to a negative entropy is illustrated in Example where the entropy associated with the density function depends on the value of a parameter. This is a reason why the concept of self-information no longer can be associated with H (X) as in the discrete case. We call H (X) the entropy function, but H(X) .no longer indicates the average self-information of the source. Similar remarks are valid for conditional entropies. Thus it follows that the individual entropies may assume negative values. How-
CONTINUUM WITHOUT MEMORY
ever, it will be shown that the mutual information is Example 8...1. A random variable has the density function shown in Fig. Find the corresponding entropy. {(x)
FIG.
E8-1
Solution 2h
f(x)
==
b _ a (x - a)
f(z)
=
b~ a (b
H(X)
== _
- z)
(a+b) /2
s x s -a+b 2a b for ~ x s b fora
2h b-a
2h ( b - a x
a) In - - - (x - a) dx
-
b
~(a+b). /2 -b - .a (b 2h
The above integrals can be evaluated by parts.
z ln Axdx ==
x
2
2
2h z) In - - (b - x) dx .b - a
so doing, note that
x2 1n AX -"4
Thus, H(X) == ~ [(x - a)21n ~- (z - a) _ (x - a)2J(a+b)/2 b-a b-a 2 a + _h_ [(b _ x)21n~ (b _ x) ~ (b - X)2Jb b- a b- a 2 (a +b) /2 H(X) == ~ {[(b - a)t ln h __ (b - a)SJ + In h _ (b - a)2J} b-a 4 8 8
= h(b ;and since thus
a) ( _ In h
+
h(b - a)
2
H(X)
=
J
co -co
f(x) dx == 1
== -In h
+ ~~
I
The entropy depends on the parameter h, but a translation of the probabinty curve along the x axis does not change its value. Note also that H(X) > 0 H(X) == 0 H(X) < 0
for h < for h = for h >
ve ve ve
8...3. Infiniteness of Let X be a one-dimensional random variable with a well-defined range [a,b] and a probability density
CONTINUOUS CHANNEL·· WITHOUT MEMORY
function
that
s d}
<X
P{e
f(x) dx
= F(d)
-
We propose to examine the entropy associated with this random a familiar mathematical routine. That is, we divide the "tTQ'rl0hla
I
f(,,)
I
I
I ~~a2,\ I
P1 I::J. a1 I I I
I
I I I
I
FIG.
I I I I
I
I
I
I
I 1
I I
~~aan
I I
8-2. A quantization of a continuous signal for computing entropy.
interval of interest between a and b into nonoverlappmg subintervals (see 8-2): (a,at]; (at,a2]; . . . ; (an,b] a
al - a
=
P(a
peal
< at < a2 < · · · < an < an+l =
. . .
< X ::::; <X ~ <X
b
(8-11)
,ale - ak-l =
al)
= faa.
a2) =
~ b)
=
. ,b - an = k = 1, 2, . . . ,n 1 (8-13) f(x) dx = F(al) - F(a) ~ Pl Llal
fat f(x) dx at
b
=F(a2) - F(al) = P2 L1a2
(8-14)
f(x) dx = F(b)
Now we may define another random variable X d, assuming discrete set of values
the
with respective probabilities (8-17)
CONTINUUM WITHOUT MEMORY
Accoroma to (8-9) the events under consideration form a plete probability scheme, as
com-
n+l
2:
Pk aak
= F(b)
- F(a)
(8-18)
L Pk aak log Pk aak
(8-19)
k=l
n+l
Thus
H(Xd )
= -
k=l
Now, let the length of each interval in Eq. (8-19) become infinitely small by infinitely increasing n. It is reasonable to anticipate that in the limit, when every interval becomes vanishingly small, the entropy of this discrete scheme should approach that of the continuous model. The process can be made more evident by adopting an arbitrary level of quantization, say (8-20) k = 1, 2, . . . ,n + 1 and evaluating the above entropy, In+l
H(Xd ) = -
n+l
L Pk ax log Pk - L Pk ax log ax
k=l
But
H(X)
k=l
= lim H(Xd )
(8-22)
.6x-+o
Therefore when ax is made smaller and smaller, while Pk aak, the area under the curve between ak-l and ak, tends to zero, the ratio of the area to aak remains finite for a continuous distribution. In the limit , lim H(X d ) = .6x-+O
fb f(x) logf(x) dx a
n+l
lim
L Pk ~x log ~x
(8-23)
.6x-+O k = 1
assuming that the first integral exists. As the subintervals are made smaller "by making n larger, the Pk ax become smaller but the entropy H(Xd ) increases. Thus, in the limit when an infinite number of infinitesimal subintervals are considered, the entropy becomes infinitely large. The interpretation is that the continuous distribution can potentially convey infinitely large amounts of information. We have used the word "potentially" since the informa.. . tion must be received by a receiver or an observer. The observer can receive information with a bounded accuracy. Thus H(X) should preferably be written as HE(X), indicating the bounded level of accuracy of the observer. If the observer had an infinitely great level of accuracy, he could detect an infinitely large amount of information from a random signal assuming a continuum of values. In a manner similar to the definition of entropy in the discrete case,
CONTINUOUS CHANNEL WITHOUT MEMORY
we may define the [a,b] as
A"\-n+1r>A'1n,"'I:Y
H(X)
of a complete continuous scheme b
=
f(x) log f(x) dx
It is important to note that the integral of entropy of a continuous random variable is not necessarily infinite. However, the above limiting process, which introduces the tflA1Y"~tfl.nll~1"' discrete analog model with an infinitely large number of states, leads to an infinite entropy. 8-4. Variability of the Entropy in the Continuous Case with coorcmate Systems. Consider a one-dimensional continuous random variable X with a density function f(x). Let the variable X be transformed into a new variable Y by a continuous one-to-one transformation. The density function p(y) is p(y)
= f(x)
I~; I
If it is assumed that the transformation Y = g(X) is monotone and valued, the entropy associated with Y is H(Y) = -
f-:"
p(y) log p(y) dy
H(Y) = -
f-+
[f(X)
Y)
=-
f-
H(Y) = H(X)
f(x)
I~; I]
log [f(X)
I~; I]
f(x) dx -
+ f-+.. . f(x)
log
f(x)
I~~ Idx
dy I dx
Idx (8-28)
The entropy of the new system depends on the associated function log Idx/dyl. This is in contrast with the discrete case. In the discrete case, the values associated with the random variable do not enter the computation. For instance, the entropy associated with the throwing of an ordinary die is
[X] = [1,2,3,4,5,6] = [%,%,%,%,~'6,%]
= 6( -% log %) = log 6 A change of variable, say Y = X2, does not produce any change in the H(X)
entropy since [P] remains unchanged: [X2] = [1,4,9,16,25,36] [P] = [~'6,7'6,%,%,%,7'6] H(X2) = log 6 = H(X)
274
CONTINUUM WITHOUT
When the transformation of the continuous random is, Y == AX + B (8-28) 00
co
'IlTO't"1I0ihlo
IAI
f(x)
+ log IAI
= H(X)
in bits Equation (8-31) suggests that the entropy of a continuous random variable subjected to a linear transformation of the axis remains invariant within a constant log IA I. Example 8...2. Find the entropy of a continuous random variable with the density function as illustrated in Fig. E8-2. O~x~a
f(x) = bx 2
=0 Determine the entropy H(X 1) when the transformation X2 = 2x.
Xl
elsewhere
= X + d, d > O.
Answer same
for
f(x)
2a
a FIG.
E8-2
The value of b which makes the above f(x) a permissible nrobabtnty density function is given by Solution.
aJ(x) dx H(X) Note that Thus
xJ
== b 3"
= - JoG bx2ln
x 2 In "Xx dx = H(X)
=
-2b
H(X)
=
-2
]a0 = 3ba x3
=
I~ -
-
-2b
x3
"9
~ (In
a
= ~3 + In ~3
-~a A
= 1
bx2.dx
3" In "Xx
[~ In
3
The entropy may be positive, negative, or zero, dependmg on the parameter a. H(X) H(X) H(X)
>0
a
>
3e-%
0 0
d
The probabihtv density curve will be simply d units shifted to the left. becomes
----,-...-In
Vb (Xl
d) - ...;..----~
-
2
a
= 3 + In 3 = This result could have been predicted from
H(X)
(8-31), as
log A = log 1 = 0 For the transformation
X2
=
2x, one finds
p(X2) 2a
This result is
b
b
8 x22 10 g 8 X2 2 dX2
agreement with H(X 2 )
=
= 8b X2 2
=
H(X)
(8-31), that is,
+
a
bx 2
2 dx
=
H(X)
+
2
J.V.1E~aS1Ure of Information in Continuous Case. The materiar precedmz two sections lead one to think that the concept of loses its usefulness for continuous systems. On the concept of entropy is as in the continuous case as in the discrete case. To put this concept into focus, one has slightly to reorient one's thoughts by the emphasis on the transinformation rather than on the individual entropies. The different entropies associated with a continuous that H(YIX), and no direct as as the information 1l"'\1I"'/.,..n.nCH::~.n,N, the channel is concerned. it will be shown that the transits significance. to we use the concept of the transinformation I(X;Y) of the . . variables X and Y as the point in defining the entropy of continuous systems. ""'JI...lI.'-f. ....JI..IU!o...
f(x,y) f( ) log fl(X)f2(Y) dx dy x,y
(8-32)
276
CONTINUUM WITHOUT MEMORY
Note
as before, we have
I(X;Y) = H(X) - H(X!Y) = H(Y) - H(YIX) = lI(X)
+
N ow we should be able to demonstrate how this measure of transmtormation does not face the three mentioned difficulties encountered in with individual entropies. Transinformation Is Nonnegative. A proof of the validity of this property can be obtained by using the basic inequality for convexity of a logarithmic function. r1I£H:lll'YlI(ll"
.
- /00 !(x,Y) -00 /00 -00 f(x,y) log !l(x)f2(y) dx dy 00 flex) = / -00 -00 !(x,y) log f(xly) dx dy 00 [fl(X) ] ~ _", _", f(x,y) f(x\y) - 1 log e dx dy /
T(X,Y) -
/00 /00
= -
f-"'", f-"'",
f2(y)h(x) log e dx dy
+ f-"'", = 1 · 1 · log Hence
e - log e
!(x,y) log e dx
=0
(8-34)
I(X;Y) ~ 0
Transinformation Is Generally Finite.
Consider the expression
P {Xi < X < Xi + ~X, Yi < Y < Yi + ~y} I og P{Xi < X < Xi + ~X}PtYi < Y < Yi + ~y}
(8-35)
which is a direct extension of the definition of mutual information in the discrete case. As ~X and ~Y are made smaller, each of these probability terms tends to zero. This was the reason for the individual continuous entropies to tend to infinity in our passage from the discrete to the continuous models. While each of the individual terms tends to zero, the above ratio remains finite for all cases of interest. In fact, in the limit, the expression becomes
It is certainly reasonable to exclude the degenerate cases corresponding to densities which are not absolutely continuous. (See references given the footnotes on pages 277 and 295.) To sum up, in passing from the discrete to the continuous model, each one of the entropies H(Y), H(X), and H(X,Y) leads to the calculation the logarithm 'of some infinitesimal probabilities, thus leading to infinite
CONTINUOUS
'\J .... JL.Lll..J.'1J. •• .&.:.IJ!..il
nnw~'vp.r" the of and will lead to a finite measure of
WITHOUT MEMOIty
.on+'rA-n~",(T
I(X;Y).* Invariance
.iramemiormouon under Linear '1''Y'I'1.n.(,~trvr-m.atnrvn. we should like to show in contrast with our ure of mutual mrormation, (8-32), remains invariant under all scale transrormattons and the of the channel. ~V""JLV'" '''''JI. eouations of transformation
= aX+b YI = cY d Then H(X 1)
= -
H(Y l)
=
Y 1)
=
1_
00 00
co
Pl(Xl) log Pl(Xl) dXl
P2(YI) co co
P2(Yl) dYl
p(Xl,Yl) log p(XI,Yl) dXl dYl
where PI (Xl) and P2(YI) are the functions associated and and p(XI,YI) is their with the variables density function. But according to Sec. 5-12, Pl(Xl)- =
P2 (Yl)
=
-rar
fleX)
f2(Y)
ICI
-I g 1-
f(x,y) _ f(x,y) _ f(x,y) X ) ( P 1,Yl - IJ(Xl,Yl/X,y)! ~ lacr H(X I) = H(X) log lal H(Y I) = H(Y) log lei H(XI,Y I) = H(X,Y) log [ec] H(X l) H(Y I) -- H(Xl,Y l) H(X) H(Y) -Y) log
+
Thus
I(XI;Y I)
= =
+ +
+
+ lal + lel-
While the individual entropies may change under linear transtormauons. the transinformation remains intact. I (X; Y) is finite. when all densities are absolutely continuous-the transmtormarion may become infinite in some extrinsic circumstances (see, for instance, I BE Trans. on Inform. Theory, December, 1956, pp. 102-108, or theorem 1.1 of the Gel'fand and Iaglom reference cited on page 295.-
WITHOUT MEMORY
Thus we have removed all three ObJlec1jlOI1S transinformation for the basis of our discussions. * The following elementary properties of transinformation self-evident: I(X;Y) I(X;Y) ~ 0 I (X; Y) = 0 if X and Y are mnenennent variables Fl(X) ~ H(XI Y) H(Y) ~ H(Y\X) lI(X) H(Y) ~ H(X, Y)
+
(8-45) (8-46) (8-47)
8-6. Maximization of the of a Continuous Random The maximum entropy of a complete discrete scheme occurs when all the events are equiprobable. This statement is not meaningful in the case of a random variable assuming a continuum of values. In this case, it is quite possible to have entropies which may not be finite. This might be interpreted as a pitfall for the definition of the channel capacity. However, the situation can be improved by assuming some plausible constraint on the nature of the density distributions. For if the random variable has a finite range, then one may ask what of density distribution leads to the greatest value of entropy. Such questions can be answered by using mathematical maximization techniques from the calculus of variations, such as the method of Lagrange" multisections for the pliers. We shall employ this method in following three basic constraints. Case 1. What type of probability density distribution gives maximum entropy when the random variable is bounded by a, finite interval, say a ~ X ~ b?· Case 2. Let X assume only nonnegative values, and let the first moment of X be a prespecified number a (a > 0). What probability density distribution leads to the maximum-entropy? Case 3. Given a random variable with a specified second central moment (or a specified standard deviation 0), determine the probability density distribution that has the maximum entropy. "'-II'Il"1I'-II...,,16
* This is indeed in accordance with our fundamental frame of reference. measure something, one must have a basis of comparison. The "arithmetical ratio" of this' comparison gives an indication of the relative measure of the thing that is being measured 'with respect to some adopted unit. -, Similarly, in information theory, it is the difference of some a priori and a posteriori expectation of the system that provides us with a measure of the average gain or loss of knowledge or uncertainty about a system. In problems of information theory, the above "arithmetical ratio" in turn is translated into -the difference of two entropies. This is of course due to the use of the logarithmic scale.
Case 3. "".. . the sneemed class the gaussian distribution ,.I"'\
.lA.JL.... ....., .....
Let A be a constant mumpner ; the
uU•..II.J!.'L.JLll'LJ' '11'11
x
The use of the
is
CONTINUUM WITHOUT MEMORY
the expected value, that is, the first moment the continuous X ~ 0, is specified as E(X) = a, the unknown function f(x) is subject to the following constraints: JI.·'-II.LJLUVJLA..L
WOLalJLJl.lL.,..' . . ."
H(X) = -
10'" f(x} In f(x) dx 10'" f(x) dx = 1 10'" xf(x) dx = a
(8-52)
a
>0
Using the method of Lagrangian multipliers, we find
a
df (-f In f) +
a
p. df
(f)
a + A af (fx)
= - (1 + In f) +
f(x) =
p. +
AX = 0
el&-l+Ax
(8-53) (8-54)
The desired density distribution is of an exponential type. values of A and p. can be determined by direct substitution of f(x) in the constraint relations:
10'" eX., dx = 1 el'-l 10 xex" dx = a el'-l
ee
(8-55)
Note that A must not be positive; otherwise the probability constraint cannot be satisfied. Based on this remark, the above equations yield
or
(8-56)
Finally
(8-57)
The extremal entropy has a value of H(X)
~ ~
oo 1 1 - e- x / a In - e- x / a dx o a a oo 1 x a 1 foo X x a = In a - e- / dx + - e- / dx o a a 0 a =Ina+l
= -
(8-58)
CONTINUOUS CHANNEL WITHOUT MEMORY
Thus the maximum for all continuous random variables with prespecified first moment in [0,00] is H(X) = In ae
ogarithm is here computed to the natural base. f(x) be a one-dimensional probability density and let o and zero random variable X have a preassigned standard ? mean. Which function f(x) gives the maximum of the entropy Following the outlined procedure of the calculus of variations, one would maximize a linear combination of the constraints through evaluation of the constant multipliers of these constraints. To be specific, co co
f(x) dx
=1
x 2f(x) dx =
H(X)
co co
q2
f(x) In f(x) dx
According to the previously mentioned technique, -
~ (f In f) + :f (p.f) + :f (Xx2f) = 0 - (1 + In f) + JL + Xx2 = 0 f(x) = eP,- le"x 2
But
co
. (8-61) (8-62)
eP,- le"x 2 dx = 1
00
(8-64)
The latter equations yield (8-65)
ep,-l
= (J
Finally, Among all one-dimensional density distributions with prespecified secondorder moment (average power), the gaussian (normal) distribution provides the largest entropy. The maximum value of the entropy can be found directly.
natural Iozantnnuc
The above three maximization fashion to the case of muttnnmensionat constraints. For ~ sional all second-order moments the different variables are lIl(H~pe:ndlent. will to an n-dimensional zaussian As nrececunz consider a continuous cnannei have a -r\"rI'"\hlan-lO
"7IOiI"\'1nNATVlI
"...""".£..
...... I",
"I""V"lI"I,1"'Il"lO I I U
.I..ll..i.6AJ\JIV.i.J1.UJ.i.,
margmat uensrnes can
The
aplPll(~at]lOn of
the aetLnlILe: e()UatlollS for "'.. .
-t-"lI"l....... v .. "'iId"\1f'I
+
dx
283
CONTINUOUS CHANNEL WITHOUT MEMORY
The double "lI",+,niN'~n I moments, that is, 2(1
~
p
the last eouation is the sum of three . .' '-' ..... ~ . ,,''~_~ .
2) [-\ E(X2) (1'x
(fx(fy
E(XY)
,,!T... 'II
..
~lI'.
+
where P,l1 = N ow, recall that p
= correlation coefficient = .!!.!.!
(J'x(J'y
Thus Y)
= In 21r(f = In 21i"(J'
x(J' y
(8-76)
x(J' ye
The mutual information in this channel is
+
or
I(X;Y) = H(X) H(Y) Y) = In y21re (J'x + In (J'y In (1 - p2)
This equation indicates a measure of transinformation for the zaussian channel. The transinformation depends solely on the correlation coefficient between the transmitted and the received signals. the noise is such that the received signal is independent of the transmitted signal we have and p=o =0 When the correlation coefficient is increased, the mutual information will increase. 8-9. Transmission of Information in Presence of Additive Noise. The rate of transmission of information in a channel may be defined as the mean or the expected value of the function: !(X,Y) I
= log !l(X)!2(Y)
(8-78)
That is, E(l)
= I(X;Y)
=
00
!(x,y) !(x,y) log fl(X)j2(Y) dx dy
The rate of transmission in bits provides a measure for the average information processed in the channel in the sense described previously. The maximum of I(X;Y) with respect to all possible input probability densities, but under some additional constraints, leads to the concept of channel capacity in continuous channels. In other words, in contrast
284
CONTINUUM WITHOUT MEMORY
discrete case, the channel in the continuous case an absolute on the constraint. The evaluation the channel capacity is a difficult and no ~'V method can be to cover all circumstances. in cases we are able to study the rate of transmission and evaluate its maximum. One such case which is also of much nracucar ~lIi1'll'·n1h is the with additive noise.
...... 'V ... '"",A
....
n"ll"\n.r.
.... x
------+----&-..---~
7° I
I
Probability density of y'
ep(z)
I I
I
FIG. 8-3. An illustration of the performance of a continuous channel in the presence of additive noise.
Let X be the random variable the transmitted signal and Y the received signal. We assume that the noise in the channel is of X: additive and
Y=X+z + xix) = c/>(z)
(8-80)
fx(z
where Z is a random variable, with a probability density function c/>(z). JiiQluatlon (8-80) suggests some simplification in the relations among ferent density functions associated with X and Y. In fact, reference to 8-3 that
0
elsewhere
(a) Determine the source entropy.
(b) Determine the equivocation entropy.. (c) Determine the entropy at the receiver. (d) Determine the transinformation. 8...4.. Let X and Y, the cartesian coordinates of a point M, be independent random variables uniformly distributed in [0,1]. (a) Study the marginal and the joint distribution of
and
Y
X
CONTINUOUS CHANNEL WITHOUT MEMORY
(b) different for the communication model (c) Evaluate the transinformation.
A continuous channel has the following characteristics: f(ylx) ==
The
to the channel is a random voltage 2a
X and Y may assume any values from - 00 to + 00. (a) The entropy of the source (b) The conditional entropy of the channel (c) The transinformation 8..6. Answer the same questions as in Probe 8-5 for the sources and the channels described below: (a)
flex) == x(4 - 3x) = 6y(2 4 fl(x) = e:» f(ylx) = xe- Xll
-=-;;
f(ylx) (b) (c)
f(x,y) =
1
~
0 ~ x ~
0
y)
0 0
x
Iwo\. But the transmission characteristics of all "physical systems" for all "practical' purposes " vanish for "very large frequencies." it is reasonable to confine ourselves to the. class of all signals {j(t)} such that their Fourier integrals have no some range (-wo,+wo). More specifically, frequency content OJl,.·.CJlOJIH.,,,r
1h."'II.rlI,,,,,£'II'
.n.v\4''lP.VI>."A"IVO
IlJJl.Jl.JI.,Jl.Jl..llJl>.JLJl.Jl.F\
J1.Jl.V.ll.Jl.'-L'""Jl..L,-,Q..J.Jl.Jl.'VJl.UllIJJl._,
'-''-'Jl,Jl..LIJ'Jl.V''''''
'Ir\ .... '-"hl,o'1l"'ll"\
TRANSMISSION OF BAND-LIMITED SIGNALS
a
is said to be band-limited when
F(jw) = ff[f(t)] =
co co
!Ct)e-iwt dt
=0
for Iwl
>Iwol
o
The frequencies of the human voice are between a few and cycles per second. The frequency range of the human eye and ear is also limited. The bandwidths of telephone, and television are other examples of band-limited communication equipment. While it takes a continuum of values to identify an arbitrary continuous signal in the real interval 00 ,+ 00], we shall showthat the restriction of Eq. (9-16) will reduce the identification problem to that of a real function specified by a denumerable set of values. The sampling theorem below states that, if a signal is band-limited, it can be completely specified by its values at a sequence of discrete points. This theorem serves as a basis for the transition from a problem of continuum to a of discrete domain. Theorem. Let J(t) be a function of a real variable, possessing a bandlimited Fourier integral transform F(jw) such that*
F(jw) = 0
for
Iwl > Iwol
Thenj(t) is completely determined by knowledge of its value at a sequence of points with abscissas equal to 1rn/Wo,
...]
[n] = [... ,
Furthermore, jet) can be expressed in the following form: j
jet) =
(1rn) Wo
n=
sin wo(t - 1rn/wo) wo(t - 1rn/ wo)
Proof. Consider the pair of Fourier integrals: F(jw) = f(t)
=
j(t)e-i6Jt dt
1
f~:. F(jw)ei
°
The corresponding time function can be obtained from the Fourier integral tables (or, in such a simple example, directly):
jet) = wok sin wot (9-26) 7rWot Note that, because of the band-limited character of F(jw), jet) should be completely determined by the following domain arid range, _ z., 0, s; 27r, [ . . .>: 27r, Wo Wo Wo Wo
...
[ . . . ,0,0, w;k, 0,0, . . . ]
This result is indeed in agreement with the expression for jet) suggested by the sampling theorem.. co
J(t) =
~ J (7rn) sin wo(t - 7mjwo) = wok sin wot
'-'
n=
-co
Wo
wo(t - 7rn/wo)
7rWot
(9-27)
TRANSMISSION OF BAND-LIMITED SIGNALS
interpretation of the theorem can be pose that a continuous band-limited voltage signal vet) quantize vet) at times
'In'll''r\'nIAet.nrll
[
••• , _
~7f',
_
Wo
~,o,.!!:-., 27f', .
Wo
Wo Wo
The quantized voltages are successively applied to an ideal filter as impulses of appropriate magnitude at the specified times. The response of an ideal low-pass filter with cutoff at Wo to a unit impulse uo(t) is found to be k(sin wot) /1rt, where k is the constant of the filter. Thus
v(~) Input
Output
FIG.
9-4. Physical interpretation of the sampling theorem.
the total output of the filter, vo(t), will represent the original time function vet) within some scale factor. Vo(t) = k Wo 1r
v
sin wo(t - 1rn/wo) wo(t - 1rn/wo)
(9-28)
_oJ
n=
If the constant of the filter is 7f'/wo, then vo(t) and vet) become The concept the theorem has been p.minl()Vfl~(j communication such as the extensive work done at the Telephone Laboratories on speech transmission and also other inves... tigators prior to Shannon, such as Nyquist, Kupfrntiler, and Gabor. The mathematical statement of the sampling theorem has been made E. T. Whittaker, J. M. Whittaker, and several other mathematicians. Subsequent to Shannon's use of the sampling a number of mterestmz articles on this subject have appeared in the literature of as well as in mathematical journals (see N-2 of the Appendix). ..lVl.'VJ...lU..ll..,W..l.
306
CONTINUUM WITHOUT MEMORY
Example We illustrate the sampling theorem with the followinz ~..I~U.uJ.l.Jl'!d. Consider a band-limited signal with a triangular frequency distribution, as shown Fig. E9-1. From a table of Fourier transforms (or by performing the we the time-domain function describing such a (j W ) signal is given by
I
IF
f(t)
K
=
(a)
Now, the sampling theorem states that f (1rn) sin wo(t - 1rn/wo) (b) wo wo(t - 1rn/ wo)
=
f(t)
-00
where f(1rn/wo) is the value of f(t) at the respective sampling points. We illustrate the theorem by showing that the summation of (b) does, in fact, yield the function in Eq. (a) when the appropriate values are inserted in Eq. (b). First, evaluatef(1rn/wo) at the sampling points by setting tin Eq. (a) equal to 1rn/wo. FIG.
E9-1
=
f(O)
(c)
Kwo
f where m == n/2, n
Kwo 211"
=0
=~
(d)
= ±2, ±4, ±6, ....
f (n1r) Wo
= Kwo 21r
where n is odd.
[Sin(2n/4)1r (~n/4)1rJ2
= Kwo
4
211"
=
ni
'(;(t
Kwo 4 ~ n 21r 2
(e)
n
Next, evaluate sin /')0) for each sampling point by letting take on integral Wo - n1r Wo values from - 00 to 00. By a trigonometric identity, sin wo(t - n1r/wo) = sin wot cos n1r - cos wot sin n«. Therefore sin wo(t - n1r/wo) wo(t - n1r/ wo)
=
. SIn
(-1)n wot wot - n1r
Equation (b) can now be written 00
fCt)
= sin
wot
\'
~
n=-oo
Since f(n1r/wo) = f( -n1r/wo) [from same Inl and take the summation from written 1rn ) [ (-l)n f ( Wo wot - 1m
(-l)n ]
+ woe + 1rn
Then, substituting Eq. (g) into I(t)
= sin
Cdot
f (n1r) ( -1)n Wo woe - 1rn
(/)
(e)], we can add pairs of terms
n=
=
to
f
n=
00.
The
the sums can be
for n
>0
(g)
(/),
[Cd~t 1(0) + 2Cdot
f
(h)
TRANSMISSION
BAND-LIMITED SIGNALS
But from (c) to (e) we know the values of f(n1r/wo) to substitute Furthermore, we know that f(n1r/wo) =: 0 (n even) from (d), and so we can stitute a new index, m, into the summation, where n
=:
2m - 1
(i)
thus summing only the odd values of Performing these substitutions in (h) we obtain f(t)
sin wot
(j)
We substitute the following trigonometric identity for sin wot: (k) f(t)
=:
2 sin wot cos wot
2
1 - 2wot
2
(l)
Performing some algebraic manipulations, we get f(t)
=:
(m)
Then by bringing (wOt)2/1r 2 inside the summation and dividing both numerator and denominator by Eq. (d), f(t)
=:
sin (wot/2) cos (wot/2) wot/2 (n)
Rewriting the second term in the summation, using the following identity, a2
= 2- - 1 2
a
x
-
1 1r 2 (m - >2)2 1r2(m [ 1r2(m ~elpaI'at]mg
f(t)
the parts of
~2)2
(wot/2)2 -
(0)
summation,
sin (wotj2) cos (wot/2) wot/2
+2 But
~2)2
tan z
1
1r 2
=:
(m - ~2)2 -
2z
8 (wot/2)2 - ;2
1
(p)
(q)
308
CONTINUUM WITHOUr,r MEMORY
(see, for instance, E. C. Titohmarsh, "Theory of Functions," p. 113, sity Press, New York, 1950), which can be rewritten with the substitution m
Univer-
=n +
00
tan z
'\'
k
-z- = 2
m=l
1 7I"2(m - ~~)2 -
(r)
Z2
Widder* shows 00
~
1
~ (2m - 1)2 = 1
1
1
+ 32 + 52 + . · .
71"2
(8)
= 8"
m=l
Substituting z = CJJ ot/2 in Eq. (r) and substituting Eqs. (r)and (8) in Eq. (p), we get J(t) = KCJJo sin (CJJ ot/ 2) cos (CJJ ot/ 2) 271" CJJotj2
[1 + tanwot/2 (CJJ ot/ 2) -
~~] 8
(t)
71"2
and finally J(t) = KCJJo 271"
[SinCJJot/2 (CJJ ot/2)]2
(u)
This solution was derived by several students, in particular, S. Rubin, during a course on information theory.
9-8. The Concept of a Vector The use of n-dimensional real or complex space is quite common in engineering problems. For instance, in the study of signals in communication theories one frequently employs the concept of vector space, although this may not appear in .its strict mathematical frame of reference. The concept of power content of signals in engineering texts is an example of tacitly exploiting a basic of vector-space theory. It is safe to assume that not all the readers are familiar with these concepts. For the benefit of such readers, we include this section as a digression from the main stream of thought. The section is designed to provide a brief glimpse into vector-space theory. Meanwhile the use of basic axioms required by a vector space and the properties specified for a norm are presented in a way parallel to the treatment of some of the material presented in Chap. 2. Spaces. A set or collection of elements, generally called points, is said to form a . space. This space is not what is generally understood by geometrical points. This may be a collection of points, or vectors, or functions, etc. For example, the set of all functions y
=
x cos a
where x is a given real number and Ol a real variable, forms a space S. For each value of the parameter x we have a point in this space. The
* u Advanced Calculus," p. 340, Example B, Prentice-Hall, Inc., Englewood Cliffs, N.J.
TRANSMISSION OF BAND-LIMITED SIGNALS
Yl = COS Ot, Y2 = siderationa This is .... e-t .. ·. ..n
.........
3 cos a are elements of the set written in the
Yl E S
Y2
S
Vector Space. A vector space is defined in the way. be the set of real or complex numbers. V is said to be a vector space for every pair of points x and y, x E y E and a number a E the operations of addition and multiplication by a number are defined so that x+y E V (9-29) ax E V 'v, the following properties of the space are 1. There exists an element "zero" in V such that, for each x lfOorHll1rorl
V..L..L'-,..L..L....,.. ' - ' ....
°
= x. 2. For every x E V there corresponds a unique point - x E V such that -x x = 0. x
+
3. Addition is associative, for example,
+ y) + z = x + (y + z) = x + y + z x + y = y + x (commutative law) for all x, y E V. a(x + y) = aX + aYe (a + (3)x = ax + (3x. (x
4. 5. 6.
7. (a(3)x = a((3x). 8. 1 · x = z. 9. 0· X = 0. It is worthwhile to note here that the above properties are those of ordinary vectors as encountered in undergraduate courses in sciences, and for this reason any set of elements that has these is called a vector space. In mathematical terminology, V is said to form an abstract space over F. EXAMPLE. The most familiar example of vector space is given by an ordinary two-dimensional rectangular coordinate, that is, the set of all ordered pairs (al,a2) of real numbers, addition being defined as follows: Given y= x= WRJI-.1.l.l'-,U
T\1"'£'T\01"'f".lA!:llQ
then x
+y
= (al
bl, a2
Vector 0 = (0,0)
b2) and ax = (aal,aa2). x = (al,a2)
-x = (-al,-a2)
All the properties listed above being satisfied, the set forms a vector space. One has to be rather careful here in interpreting the pair (al,a2). It is familiar from analytical geometry that such a pair represents a "point" in the plane, and so addition of pairs as defined above will be
CONTINUUM WITHOUT MEMORY
meanmgtess in But there is no harm pair (al,a2), say, represent a vector issuing from the origin in the point (al,a2); then any misunderstanding can be obviated. Linear and Linear N ormed Spaces. A space S is said to be (3 E F) and any x, y of for every of numbers (a aX + (3y E S. In the application of the theory of linear vector spaces to physical problems, a most significant type of vector space is encountered. This is called normed linear space. Normed linear spaces exhibit a natural generalization of the familiar euclidean space. We first define what is meant by norm. The norm of an element x of V, denoted by IIxll, is a nonnegative real number satisfying the following properties. (I) Ilxll = 0 if x = 0 (II) Ilxll ¢ 0 if x ¢ 0 (9-30) (III) Ilx + yll ~ Ilxll + lIyll (IV) lIaxll = lal IIxll A vector space with a norm is called normed linear space. Inner Product. As is known from vector analysis, the "inner product" of two vectors x = (al,a2) and y = (b1,b 2) in the real two-dimensional rectangular system is defined as (x,y) = alb! + a 2b 2• If we adopt the V(x,x) of a vector as its norm, we have (x,x) = + a22 = IIxl1 2 If the two-dimensional normed linear space is a space, then a natural extension of the above definition for the inner product leads to the following. Let x = (~1"YJ1) Y
where ~1, 111, x and y is
t2,
112
=
(~2;172)
are complex numbers. (x,y) = ~1~2
+ 111r;2
where the bar stands for a complex conjugate. (x,x) =
IIxl1 2
Then the inner product of
The norm of x is defined as
=
Note that when the space is real this more definition of the inner product is obviously valid. A vector space for which the inner product is defined is sometimes called an inner-product space. The following properties for the inner product are given: (I) (II)
(III)
(x,y) = (y,x) (al x l + a2X 2, y) = al(xl,y) a2(x2,y) (x,x) ~ 0 (x,x) = 0 if, and only if, x = 0
(9-31)
TR./~Nf'MI:SSl:ON
OF
Nvdimeneionol
defined spaces with an like to reiterate and establish the notation real spaces.
the space to the absolute of two x and y is xy. As consider the set of all numbers. The space is the n-dimensional sum of two elements x E Rn and y E IU'..LV''-AI.'UI.'LJl!
n'D1'rh,'n'''ll'D1'''rl''
x
'VU~lIU.\,,i.'V(;4I.!..lI.
y=
is defined as
+ one sees
. . . ,an
+
a . . . . """",J ............,... y
is
Hilbert where R~ denotes a nenumeranrv as
the
space.
312
CONTINUUM WITHOUT MEMORY
of elements x and y is defined as usual:
This definition has a meaning because
Iaibi I :s; Llaibil :s;
(ai2
QO
Thus
ai2
+b +
2 i )
bi 2 )
it
(Pei
+ P ei) (9-67)
This reasoning can be further extended in order to show that the above decoding scheme (also called maxiFIG. 9-10. Minimum-distance decoder: mum-likelihood detector) is an optimal All received words closest to U; are decoding procedure [see Gilbert (I)]. decoded as Us: An example of this partitioning is suggested in Fig. 9-10. If we had only three signal points U 1, U 2, and U 3 in the signal space, the three would determine the regions V 1, V 2, and V 3. Note that, if one makes the further assumption that all the sig-
TRANSMISSION OF
.JjJ)"QI,..L'lAJ---Jl..lJl.J.VJl.J1.JI..J!.:..lJlJ
SIGNALS
have power, will consist of n-dimensional nnll~Tlharil'I"O with apexes at the origin of the coordinates. Each is bounded J.L [en ~ Mdimensional space). An Encoding Problem. We have now established a model of a continuous channel. For instance, if it is desired to transmit waveforms each word selected from a finite set of band-limited Wk may be chosen as the vector representing the totality of values of the kth signal. The central problem for this communication model is to devise suitable codes, that is, a decision scheme that minino additional conmizes E subject to certain plausible constraints. straints were imposed, then the best solution would consist of some sort of equidistant placement of M points in an n-dimensional space. Most of the practically useful and theoretically interesting constraints stem from the fact that the original signals must have a limited power content. The following set of three hypotheses was considered Shannon. 1. All signals (words) have the same power content 2. The power content of signals (words) is smaller than or to a specified power P. 3. The average power content of all signals (words) is smaller than or equal to a specified P. Thus the corresponding is to minimize E nracement of M in the n-dimensional space with the constraints: 1. M points on the sphere centered at the origin with radius respectively 2. M points within or on the centered at the origin with radius V nP, respectively 3. M points such that their average squared distance to the origin is nP A Lower Bound for of Error. We focus our attention, along with Shannon, on the communication of continuous signals in the presence of additive when the code words lie on a Let N be the variance of radius (case 1) centered at the noise at each sampling point, n the number of sampling points, and M the number of code words. Familiarity with the material of Chap. 8 makes it clear that, for an optimum encoding, the average error probability will Because of the additivity of noise, be a function of M, n, and only the ratio of the variances of P and N will enter the picture. Shannon's procedure for obtaining some bounds of E can be described in terms of a geometric model, The details of his derivations require much more mtLm'~l1~~el]lnOIOa
rlr.,ntr\.£iI'1I"Y\.nP
.... ,JSO='... , J ....... I-.I
326
CONTINUUM WITHOUT MEMORY
space than that available here. We shall quote the method of attack and some of the results obtained. For details of the proof the reader is referred to the original reference [Shannon (III)]. The suggested method for obtaining a lower bound for the error probability is rather interesting. It is based on the following two direct steps: 1. Consider the error associated with this minimum-distance decoding scheme, that is, when the signal space is divided into appropriate polyhedra. Then evaluate the corresponding probability of error. 2. The exact evaluation of the error probability for the above geometric model seems to be a complex problem. Therefore, one may approximate the probability of error for this scheme by comparing it with the error in any similar model which may be subjected to G simpler computation. The probability of error associated with a decoding scheme, for a specific word Wi, is equal to the product of P {Wi} and P {received word not in the ith polyhedron Wi}. In our adopted model, P {w;j = 1/M, and the other probability is rather difficult to compute. The following basis for comparison may be established. FIG. 9-11. Comparison of Consider a two-dimensional monotonically probability within a circle decreasing probability density function f(r) and a polygon of equal with f( 00) = O. Compare the probability of area for a monotonically the random variable R assuming values in a decreasing probability density. circle C centered at the origin with the probability of the random variable assuming values in a polygon G of the same area (see Fig. 9-11). Let J stand for the common area between C and G; then P{R E C} = P{R E J}
+ P{R E
C (\ R ~ G}
(9-68)
Now, if we compare the probability associated with any two elements of equal area, one in the polygon but not in the circle and the other in the circle but not in the polygon, we shall conclude that a smaller probability is associated with the former element. Thus P{(R E G) r, (R ~ C)} ~ P{(R E C) r. (R ~ G)}
(9-69)
This reasoning can be extended to an analogous situation in the n-dimensional space. We may compare the probability of a signal point being in the n-dimensional polyhedron (or pyramid) with the similar probability for a right-angle n-dimensional cone, both with solid angle flj. (fl j is the area cut by the pyramid or the cone on the unit n-dimensional sphere.
327
TRANSMISSION OF BAND-LIMITED SIGNALS
Both the pyramid and the cone have their apexes at the origin.) The signal point W; lies on the axis of the cone at a distance.ynp. Thus, we arrive at a practical method for obtaining a lower bound for the probability of error, as the computation of the latter probability can be directly accomplished. E 2::
if 2: P (signal being moved outside ith cone} = i1 2: M
M
;=1
;=1
F(n;)
The probability function F(n;) is a monotonically decreasing function of distance and also convex in the n-dimensional space. Thus M
if 2:
F(n;) 2:: F
;=1
(~)
(9-70)
M
no =
where
Ln;
;=1
Shannon finds it easier to compute the probability of a signal point being displaced to outside a cone of half angle 8 than to designate
FIG. 9-12. A schematic diagram of the minimum-distance decoding pyramid in n-space.
the cone by its solid angle. He refers to this probability function by Q(8). Having this new variable 8 in mind, we find E
2:: F
(~)
= Q(81)
(9-71)
where 81 corresponds to the solid angle (l/M)n(ll'), that is, 1
n(8 1) = M n(lI')
The actual computation of Q(8) turns out to be quite long and more complex than can be presented in a few pages. In Sec. 9-17, we shall quote Shannon's results without taking the space for their derivations. 9-16. An Upper Bound for the Probability of Error. We have pointed out more than once that the proper selection of the words in the code book is a most important factor in reducing the probability of error. In this section it will be shown that, even if the code-points are selected at
328
CONTINUUM WITHOUT MEMORY
average of error of the code cannot Q'lI'lI1"'na~::!Q limiting value. The evaluation of this bound is the subject of the present section. Vi, Consider a circular cone with half angle 8 about a 'received corresponding to a transmitted word Wi. If the cone does not surround any signal point, the word will be unambiguously detected as ui; However, if there are other signal points in this cone, they may be incorrectly decoded as the original message. The probability 'of a code-point being in this cone is the ratio of the solid angle of the cone to that of the total space surrounding Vi, that is, 0(8) P {one code-point inside cone} = Q(1r)
(9-72)
Assuming the code-points are independently (at random) distributed on the sphere of radius vi nP, we find
]M-l O(O)]M-l = [1 - 0(1r)
0 (0) P{M - 1 code-points in cone} = [ n(1r)
P{none of M - 1 code-points in cone}
Actually, in order to compute the average error probability for the decoding scheme, we must first compute the probability of the transmitted signal being displaced in the region between a cone of half angle 0 and one with half angle 0 dO. This latter probability of the original signal being displaced noise has already been designated by the notation -dQ(O). Thus, the average probability of error for a random code is given exactly by
+
E =
t: _ {I _[1 _0(1r) O(8)]M-l} dQ(O)
}o=o
(9-74)
Some simplification will occur if this exact formula is somewhat weakened by using the inequality (1 - X)2 2:: 1 - nx, which suggests
{1 - [1 - ~~nM-l} ~ (M - 1) [~~n ~ M [~~n By dividing the range of integration into two parts [0,0'] and [0' finds E
~
-
(O' M }o
t:
[0(02] dQ«(J) dQ«(J) 0(1r) } 0'
(9-75)
one (9-76)
Finally, we obtain an upper bound for the average probability of error by assigning to the arbitrary angle 0' the value of the cone half angle that corresponds to finding only one signal point in the cone, on an aver-
TRANSMISSION OF BAND-LIMITED SIGNALS
age.
That is,
(J'
=
(}1,
where 81 is defined
Thus, the average probability of error satisfies the inequalities Q((h)
~E~
Q(81) -
~81 ~g:) dQ(8)
In the following section, we discuss the asymptotic behavior of this error probability when n is increased indefinitely. 9...17. Fundamental Theorem of Continuous Memoryless Channels in Presence of Additive Noise. Using the inequalities of Eq. (9-77), we wish to study the asymptotic behavior of the error probability when the rate of transmission of information approaches the channel capacity. Without going into Shannon's elaborate derivation of the probability function Q(8 1) , we merely quote his result when n asymptotically approaches infinity. Shannon shows that if n ~ 00 and if the signaling rate approaches the channel capacity (per degree of freedom), that is,
R
=
1
-log M n
~
1
-log 2
+
=c
then the upper and the lower bound of Eq. (9-77) will coincide. case he has shown that the probability of error approaches the standard normal distribution: ep [
~
+ N) vn- '\j/2P(P N(P + 2N) (R -
C)
]
In this of a
(9-78)
Expression (9-78) exhibits Shannon's fundamental theorem for continuous channels in the presence of additive gaussian noise. For any arbitrarily small but positive E = C - R, the probability of error for the' codes approaches 0 as n ~ 00. From expression (9-78) it can also be noted that, with a fixed positive E, it is not possible to encode messages such that one may transmit at a rate greater than the channel capacity with arbitrarily small error probability. Note. An early conception of this idea appears in Shannon's article in the 1949 Proceedings of the IRE. The treatment in the recent (1959) Bell System Technical Journal article is quite elaborate. In the preceding pages we have tried to present Shannon's basic method of proof without recourse to the more complex techniques initiated in that paper. The omission was felt necessary in order to remain within the bounds of an elementary presentation. The reader who wishes to appreciate fully Shannon's techniques should consult the original paper.
CONTINUUM
suppiementarv articles for this ~I-"''MI and Thomasian. first article makes use of mstance and power for the transmission of a class of limited in the space. The second is the closest one to Shannon contains several mterestmc For it is shown that the of the transmission transin the presence of mission rate is close to the channel +n.II""'IM''''''II'''l,",,
nih ..........
nn",,\nnll-w--.r.,.
~
+
Shannon's equivalent result suggests a
olh$"Jl~ru:h~
estimate
1"-.I.11.Jl.IUUJL.Il..ll.JL'-'..Il..Il.
The third afore-mentioned paper derives a UUi:::loUJ..UlJV of error. So the results have been obtained in Shannon A. Thomasian has a of channels in the presence of additive noise case power contents of the signal and noise are limited A summary of his results is below. Let be a nossi DIe
word.
We assume n
As before, we designate the corresnondmz
word
If the average noise power per coordmate, N > 0, is snecmed muenendent g:aUE;SI3/ll noise mean zero and variance the channel is constrained n
P{vlu}
exp [ -
= j=l
The tollO'Wlna munication
".,......,.
TRANSMISSION OF BAND-LIMITED SIGNALS
Subject to the above assumnuons, there exist words [Ul, U2, 1
n
k = 1, . . . ,M such
LP{D~luk} M
~
3
=:; exp { -
i [~1 + 0.3(0 _R],2(P + N) - I]}
k=l
where C = channel capacity in bits per signal coordinate R = suitable rate of signaling; i.e., 0 < R < C = log (1 + D k = output word decoded when input word is Uk and optimal decoding procedure is employed; D~ is set complement to This theorem gives an upper bound for the average error probability in the described communication model. The decoding criterion is based on minimum distance. For the sake of simplicity we have the case where several words can be associated with a single transmitted word. Although Thomasian's formulation of the problem is similar to Shannon's approach, his derivation and proofs are quite different. Thomasian's proof is based on some basic lemmas and inequalities similar to those employed by Feinstein and Wolfowitz (see A full derivation of this interesting theorem requires more time and space than is available at present. The interested reader is referred to the original article. * PROBLEMS 9-1. A microphone does not let through sounds above 4,000 cycles per second. Determine the sampling interval, allowing reconstruction of the output waveform from sampled values. 9..2. Prove the sampling theorem for a time function f(t) which is identically zero outside the interval a < t < b. 9-3. (a) Show that the volume of an n-dimensional hypersphere of radius Ris
where
J(R)
=: - - " - - - - - - - -
n (b) For large n, plot the integrand in the denominator as a function of R.
* A. J. Thomasian, Error Bounds for Continuous Channels, Proc. Fourth International Symposium on I nJormation Theory, to be published in 1961. See also Black... well, Breiman, and "I'homasian.
332
CONTINUUM WITHOUT MEMORY
the ratio of volumes of two hyperspheres of the same radius dimensions nand n - 2, resnectrvelv Let X = [X 1,X 2, • • ~ be a random vector with gaussian distribution: '-'VJLlJl.IJ'¢::lIJl.O
Mean of Xk = 0 Moment XkXi = 0 Standard deviation of X k
k = 1,·2,·.
=
., n j
'¢
k
(1
X may be by a X in an n-dimensionel space. (a) Given a point X of this space with a distance R from the discuss the relation between R and the power content of the signal. (b) What is the probability of having points representing signals of this ensemble within a sphere of radius R about the origin? (c) Same question as in part (b) for the point being between two spheres of respective radius Rand R + dR. 9...6. Verify if a bandpass function restricted to the frequency interval "1\w, ("1\ + l)w can be expanded as
n
f
J(t) =
+ 1) (t
sin 21f'w("1\
n/2w) - sin 21f'w"1\(t - n/2w) 21f'w(t - n/2w)
Verify the validity of the following identity due to O. R. Cahn. If J(t) is a periodic function of period T and if all the Fourier coefficients vanish above the nth harmonic, then 2N
f(t)
=
I
sin {(2N
f
(2N
k=O
+
(S. Goldman, "Information N.J., 1953.) Show that the three vectors
83,
[~~,
l)(n.;~r)[t -
+ 1) sin
kT /(2N {(1f'/T)[t - kT /(2N
r rerrraee-ma.u.
-%, -%]
[%,
form an orthonormal basis for R3. Show the validity of
IIX + YII
s IIXII +IIYII
in R3 by a simple study of an associated triangle. 9...10. Prove that, if
IIX + YII = IIXII
+ IIYI\
then X and are independent. Let J(t) be a function of 21f' such that J{t) = 1 J(t) = -1 J(t) = 0
OYS,nev
Quoted by Khinchin in Uspekhi Mat. Nouk, vol. 8, no. 3, p. 3, 1953.
CHAPTER
The mathematical theory of probability has grown during the past three decades. Today probability encompasses several such as game professional fields of mathematical chains. decision theory, time series, and Since 1940 several mathematicians have made fundamental contribucomtions to the establishment of the new science of statistical munications. Perhaps two of the most significant landmarks of comon this are the munication theory which have immediate of and and Shannon's In both cases, stochastic processes occupy important place. In communication theories, messages that are transmitted in time intervals are generally dealt with; that is, the raw material consists of time series. This immediately gives rise to on the statistical nature of these time series and their accurate rlD':!~lt'1Int""1{"\n at the entry and at the exit of The theory of and has been concerned with problems of determining optimum linear systems in the sense the least-square criterion, for extraction of signals from mixtures in communication of signals and noise. This is a it seems that this with frequent practical application. topic has been somewhat overemphasized. At present a broader outlook, the general study of linear systems under stochastic regime, appears desirable. Such a study seems to be the most natural extension of the ordinary network theory which is concerned with the study of linear systems under time functions. this wpll-(i p."\irp.I()np~li area of deterministic linear an . . . . ,., ·. . . ""+ POSItIon technological development of our sciences. knowledge of stochastic processes is essential to physical scientists of deterministic networks to pr()D2~DIJllS interested in extending the tic systems. Information theory deals messages and their ensemble transformations. There again a fundamental of the nlt'4""hhB£l~'inI"liQ requires a knowledge of stochastic processes. r.. .....i-n ..
337
SCHEMES n-rc.c!o 1". chanter constitutes an Introcluctor'v of stochastic processes" theme of the statistical of communications.. is the This is aimed at the communications enzmeer treatment of the subject without too mathematical Those interested this find a number of recent and articles available for a more OIJ~::i\.iJ.(~.U4:JCiU coverage of Middleton and U2~vp,nnort Root). 10.. .1. Stochastic In a first to some ofprobability theory, the subject must be confined to what may be called the static part of probability theory. This will an expose of the dynamic, or, more theory. It will study probabilistic which on time or any other real parameter. In the mathematical literature this referred to as the study of time series, or, more a is to stochastic processes. The immediate of time series. This will reader with a method of introduction of some new terms and a the more eiementarv time concepts of to there will be a study of averages and we relate study of the averages to the well-established theory of Fourier 1I1Llljg.r;t:;t1l':'5. in much the same line of as the of moments characteristic functions Fourier '1n+nnl"1I"n lei The intuitive definition of a stochastic process can be A stochastic process is a random process whose rancomes are infinite sequences or functions, in contrast to a dom variable whose outcomes are numbers or vectors. In other a stochastic process {X(t)} is a process whose outcomes are functions of t. The values of the process at times ti, t2, ". , that X(t 1) , X(t 2) , • • • , X(t n ) , form a sequence of random variables. At each instant of time ti there is a random variable X(t£) with a a specified probability distribution. The stochastic process can discrete or continuous time parameter. A discrete process consists of a finite or an infinite of random variables each defined at crete time. Without loss assume the time sequence 1, 1, 2, 3, . .. Then the process integers .. , 1''1
1l"1'"\4n."txrIOriIO"Q
1I.AIIIJIIJJL'-'IIJJL.LIl.NV'V.Il.
X(t)
= {.
.
}
For a continuous stochastic process, the outcome at any desired See J"
Am.. Math. Monthly, vol.
no. lO,pp. 649-653,1942.
JII,,&,,\\,""''''_....''''''
PROCESSES
E t, is a variable. of a stochastic discrete and continuous are sketched in order to a random process, the distribution of any number of random variables of the process must be n and any set of real known. To be for any is, t-, . . . , t; to the time interval of the process there must be a set of random variables . , with a known distribution * "in'1'11"nr\ h,n"?t10
.. , As an biased
of a discrete process, consider the which are head or tail, with respective
VA(:tI.L.l.lIJ.lD
FIG. 10-1. (a) l!ix:ample of a discrete-time stochastic process; (b) example of a continuous-time stocnastic process.
- p. Call the result of the kth throw a random variable which assumes one of the two numerical values, say corresponding to a head and 0 to a tail. Here the of the random variables X consists of
X(t)
= ... , X- 2,
x, Xl,
x
3,
•••
continuation of the throws both numerical directions is a matter of mathematical convenience rather than anything else. In this example. are of each other. note that any two members of the these variables not be random variable of the has a well-defined and all the joint probabilities for two or more members of the ..I...I...I.~''-G. by the binomial law. '-G.V.JL.
It is to be noted that, while for the continuous processes the selection of the time sequence . . . ,i-I, to, t-, t«; la, . . . is arbitrary, for a discrete process the time sequence must be selected appropriately from the doubly infinite sequence of the discrete times of the process. In the latter case, the number of variables is, at most, denumerable.
340
SCHEMES
MEMORY
For a simple illustration of a continuous process, consider a ber of radio receivers registering a stochastic noise voltage, as saetcneu in Fig. 10-2. The voltage registered by each receiver is an outcome of the stochastic process, or a member of the ensemble function. Select an infinite number of time-sampling points on the taxis some time reference t = 0, such as t- 2, t-I, to, t-, t2, .•. ' At a OWJLJutJJLJLJLJL~ firstreceiver
Second receiver
'FIG.
10-2. An illustration of different outcomes of a process.
time ti the values of the registered noise for different receivers will be designated by i=
-1,0,1,2,
Assuming that a great many receivers are available, it should be possible, at least theoretically, to estimate the probability distribution of the noise amplitude A (ti) = X(ii): (10~1)
the probability that at time ti the registered noise voltage lies between x and x + dx volts. This is called the first probability density distribution
~VAA~4~'~AV
of the process. tion is
To be more
PROCESSES
the first dx
for all sampling times. distribution is defined as
'.JI,JVVVUIJ ..
(I)
E[Y(t)]2
,QO
This random variable is somewhat indicative of the average values of all the occurring random variables of the family. For stationary processes it is generally true that this limit exists. Furthermore, assume that, for a large value of r, the above random variable does not further indicate any randomness but approaches a constant number. For an process these requirements must be met. For every outcome of the ergodic process the time average X, should exist and should equal the expected value of any specific sampled random variable of the sequence: for every k According to the nnl{;;! stated that the first-order .. w·n·,,,~,,. cn:Y1Yl
'-V'UL'VIJ\JV'UL
=
'Yyrn.n,,01''T'tr
(X(t)
for every k
Similarly, the ergodicity of the second order implies = (X(t) .
+ T)
From the mathematical of the above "definition" for ergodicity remains somewhat incomplete. To be more exact, one should qualify the equality of the ensemble average and time average for the stationary process with the reservation "almost everywhere." The latter terminology has a specific mathematical meaning which will not be discussed here. The reader interested in the practical application of the theory may rest assured that he is not faced with such unusual circumstances in the study of ordinary physical systems. Another mathematical point to be brought out here is the fact that, while the most important implication of has been there is a tacit omission of the delicate mathematical definition for such a process. The readers interested in a more precise definition may find the following presentation more satisfactory. Let {X} = (... ,X- 1,XO,X1,X 2,
•••)
* The material of this section was communicated in essence by Dr. L. Cote. The section may be omitted in a first reading. All equations referring to ergodicity imply "almost everywhere."
MEMORY 'lI"O'l"ilrfn'lI"II'\ t::UlIl"'\I"II'ilO'Vlld'lQ
of a discrete . .
-t·n-l-·D.n."It'1ln ...'r.,. 'In.1ronn.niC''lt't
of r variables. sampled randorn variables of the sequence ~lnVI~-VHI.IHf~n function
Now define a runenon
N
k=O
Let
lim
s-.»
The function G(X) in the limit mayor may not exist. If it exists and has a constant value G independent of X, and moreover if this constant value of G is equal to E[g(X 1,X2 , • • • ,Xr ) ] for all choices of sequence X k+ 1 , • • • , X k +r , then the sequence is called ergodic. It is easy to see that the above-mentioned two cases are encompassed this more definition. In fact, letting g(Yl,Y2, · .. ,Yn) = g(y) = y
n
=
there results N
n=O
lim G(N,X) = E(X) N-+co
The process lim used here differs from the usual limit since G is a runcnon of a random variable. a discussion of the see Sec. the second-order can be illustrated
- E(X)][Xi +K
-
E(X)]
Since the process is stationary to begin with, it follows that lim G(N,X) = p(K) N-+co
does not but on the time interval The employs the concept of a discrete process, for convenience. The same conceptual pattern applies for defin... ing continuous ergodic processes. During the past three decades a vast amount of mathematical work has been produced on the subject of ergodicity. The famous Birkhoff's ergodic theorem is a classical landmark in this specialized field. The interested reader is referred to specialized articles on the subject. O'll"l'Il'.r1""''1"Id'l
STOCHASTIC PROCESSES
The of 10-3 is a reminder that the as defined here are a subclass of the stationary processes which are in turn included in the more general class of stochastic processes. This introductory presentation is not concerned with nonstationary processes.
FIG.
10-3. A classification of different processes as discussed in the text.
In fact, the subsequent study of linear systems will be confined to ergodic processes. Example 10-3.
tJ.L.L.L.LV.LtJUI.L.L.1
Determine whether or not the stochastic process {X(t)}
=A
sin t
+ B cos t
is ergodic. A and B are normally distributed independent random variables with zero means and equal standard deviation. Solution. The second-order central moment of the process was computed in Example 10-2: E[X(t l)X(t2)] = q2 cos (t l - t2) In order to obtain the time average, consider two specific members of the ensemble: and
A 1 sin t A 2 sin t
+ B cos t + B'), cos t 1
where A 1, A 2 and B 1, B 2 are some specific permissible values that the variables A and B may assume. N ext compute the second-order time average. Time average = «AI sin t
+B
1
cos t)[A 2 sin (t
+ 7) + B
2
cos (t
+ 7)])
It can be shown that this time average depends on the selection of the member in the ensemble. In fact, let 7 = 0, and inspection will show that Time average = (A 1A 2 sin" t)
+ (B
lB 2
cos" t) =
~'2(AIA2
+B
1B 2)
It has been found that there is at least one second-order time average that depends on the selected members of the ensemble, rather than being a constant. Thus the process is not ergodic although it is stationary.
10...6. Correlation Coefficients and Correlation Functions. This section begins with the introduction of the correlation coefficient p which is commonly used in the study of the interdependence of two random variables X and Y. Correlation coefficient = p =
V' E[(X
!)·
(Y - Y)] _ E[(X - X)2] · E[(Y - Y)2]
(10-23)
SCHEMES
bles, If the two variables are mdependent, the correlation coefficient is zero, the variables are uncorreiateu. necessarily of course does pendence.) In the same trend of it is natural to useful to two random variables of a process. stochastic process, covariance of } be a two sampled random variables and is defined as Covariance {X(ti),Y(t k )
}
=
E{[X(ti) - X(ii)J[Y(t k )
-
When the two variables are selected from a process, the covariance coefficient is more called the autocovariance. The autocovariance indicates a measure of mteruepenuence or coherence between the two random variables of the process ,X(ti) and X(t k ) , that is,
An obvious simpnncation occurs if a convenient choice is made·
=0 =0
=0 =0
This can be considered as an effect of a the variables. Such an has no effect on our studies except the simpnncation of results. In such a case, ""Cl"ll1l"Y\'I',I"Il"t"llno+1I1I"1."nl
Covariance of {XCii), Y(t k ) } = Autocovariance of }= For a stationary process of second order with a the autocovariance depends only on the time lag ti - tk and gives a measure of the effect of the of the process on its and future states: n""\W,,\C111"''YT 1"111I1I"'\4"l .... ,11"1.'1"\
}= The autocovariance of a process is sometimes autocorrelation function in enzmeennz literature. This is not generally in with the definition of correlation coefficient:
Under the be menucar with the autocovariance within a constant. order to avoid the normalized autocovariance function of a stationarv process, that RXX(T)/Rxx(O) , will be called the autocorrelation of that process. This will be in the engmeermg literature, I t will be assumed that the first-order averages zero and a normalization is done to make
With this reservation in function is
the autocorrelation of a stationarv process
a cross-correlation
a
function may be defined as
+ To sum up, the autocorrelation cross-correlation IUllCtlOIlS ,~ for processes of the second order or provided that the means of the variables are taken to be zero and = 1, ensembles in linear since later studies the one for erQ~OQ]lC systems, there is one more on with zero first-order averages: 'If
,
v
IOlI'"lfll"lrl,,,rll"Hn
C:!O"II'V\ h i A:U:::l
the
of the correlation may be from either probabilistic considerations (joint of samores) or from the deterministic of that is, the time averof the two specific time functions under consideration. equivalence of the two results is assumed ergodic hypothesis. with two or more sampled random variables of a or nOC::ln'YI,C'ff'
SCHEMES WITH MEMORY
stocnasuc process (T units of time mterdependenoe of the drtterentsamntes
lO1rn'nn'ln VJu.~'-'A.A..IL'iJ.lLVI~. the correlation functions based on the concept of time averaging; that is, form the product of one -.. /""". member of the ensemble by another member shifted by a time T. The product should be averaged as indicated previously. Care be exercised, however, to determine that the process is to and this is not a simple task. Process. Consider the stoa ehastic process nA... ,..,.,.·..... '11'111hn.
{X(t) }
{A cos wt
+
sin
distributed random variables with zero means, vananccs, and zero correlation · then nn"1'"'lI"V\OII'rT
= A cos wti B sin wti = = A cos wtk + B sin wt k =
akA
+ {3kB
The distribution for the random variable is as variable is the linear sum of two norma] variables. The same is true for the distribution of the random normal with zero mean Ui = V ai 2 + ~ = 1 standard deviation normal with zero mean Uk = yak'}, + 13k'}, = 1 standard deviation distribution of and is a two-dimensional normal also with (0,0) mean and a covariance matrix which can be de1jerlnlIledinthe way: .....n""llCi' ...'V'l"
'""'....."-Jv ..... JL.r..J'U!..v .....v' ........
P12
PI2
is the covariance coerncient between -
Xi)(Xk
-
cos wti cos wtk
+
and
Xk ) ] =
+ AB cos wti sin wtk sin wti cos wtk + B2 sin wti sin,wtk]
Since A and Bare uncorrelated random variables, E(AB)
= E(B4) ==
E(A)E(I!) == 0
STOCHASTIC PROCESSES
Thus PI2 PI2
= =
cos Wt'i cos wtk cos W(t& - tk)
cos 1
the and a direct of in the form associated two-dimensional normal variable. This distribution function is normal with (0,0) mean and covarianees which depend solely on the time difference ti - tk; that is, choice of the origin of time is irrelevant. Thus the process is stationary of the second order. Meanwhile it has been shown in the above that the normalized correlation function PI2 depends on the time difference ti - tk, as it should be in a second-order stationary process. A generalization of this normal process can also serve disas an example of a strictly stationary process; that is, the tribution In for any number n of sampled variables remains intact under any translation of the time scale. The is left for the reader. lM+lI~I-"lI'1I+''IlA'1n
FIG.
10-4. A Poisson-type pulse process, described by Eqs, (10-42).
Consider the following stochastic process, which has many engineering applications. A pulse generator transmits pulses with random durations but with heights of or - E. The occurrence of a pulse is supposed to follow the Poisson law, with k/2 the average number of pulses in the unit time. In order to compute the autocorrelation function of this process, the familiar probabilistic definition of the ensemble averages is employed: PI1(r)
=
E[X(t l ) • X(t l
+ r)]
=
~'l;XIX2/(XI,X2)
The variables are of a discrete each one assuming values of plus or minus E. It is necessary to have the probability function for the two variables. For this, the probabilistic nature of the n .... should first be clarified. The mean number of zero crossings in a time interval T is kT. probability of getting a specified number n of crossings in a time interval T assumes the Poisson distribution with an average of A = kT. probability of n crossings in time ']1 is [(kT)n/n!]e- k T • Thus the proba... 4'"\hl,01l'Y'll
at t and
at t
at
1'}
=~[ at t and -- E at t
1'}
at
n=O,2,4, •••
=%
Note that the joint probability function The factor % appears as a consequence of the fact the probability of having or -E is assumed to be d"fa'r,ant"lt:::l
on an average, Thus
(kr)n
nl-
(kr) 2
2T this
""i"Il'''iln1"''IIn.",
FIG.
it is
(kr)3
+
3t
assumed that
is a
nn'Ql1"11"tTO
number.
10-5. Autocorrelation function of the Poisson process of Fig. 10-4.
same of course two cases, write
The process is indeed a stationary process of the second and the normalized autocorrelation function + 1')] remains invariant under any translation of the time axis. A sketch of the autocorrelation function is given in Fig. 10-5. As a second example, consider a noise which
STOCHASTIC PROCESSES
distinct desired to 1. First-order 2. First-order 3. Second-order
'If'''II'lI'''i!"\lnOInBO
outcomes, as illustrated
nnl'Y\'1l"'\11111"a
rhC::!1"1"'lh11l1"lt'hn
1I-'.I.'-JI'JILAII'J.Il.Jl..Il.V_Y
1I-'.Il.'-JI'JILAII'J.Il.,A..Il.V_Y
t2
=
4. 5. Autocorrelation coefficient :OO
the dennmz equation of power snectrum and the relation dw we
nn'l:!l1".ll'fTO
e} = 0
n~oo
Convergence in the Mean-square Sense (Abbrev1:ated Form m.s.). convergence is defined as follows:
l.i.m.
=X
This
(10-73)
n~oo
if
- XI2 = 0
lim
where l.i.m. stands for the limit in the mean. Almost Certain Convergence (Abbreviated For a.c. convergence, the set of realized sequences of converges to X with when n The a.c. convergence, sometimes called the strong convergence, implies convergence not conversely). There are also other modes of convergence defined in the literature (for necessary and sufficient conditions for each mode convergence, see, for M. S. Introduction to Stochastic sera vol. 2, Processes," or J. E. 1949). The above concept of convergence applied to a discrete sequence can be extended to the case of stochastic functions. For example, for a process {X(i)}, define X(r) as the stochastic limit for XCi), abbreviated as l.i.m. X(t) = Xo(r) t~1"
when it satisfies the condition lim E[X(t) - X o(r )]2 = 0
(10-74)
t~1"
This is the definition for stochastic m.s. convergence of the process. are also a.c. convergence and convergence in stochastic functions, but a discussion will not be undertaken Example 10...5. Consider the tossing of an honest coin. Let X (n) be the number heads in n throwings divided by n; that is, X(n) is a random variable obeying the binomial distribution: E[X(n)]
=
~~
1 u [X (n )] = 4n 2
average standard deviation
STOCHASTIC PROCESSES
It can be shown that the sequence X(I), X(2), . . . , X(n), ..•
converges, in the sense of 1, 2, and 3 below, to ~'2. 1. By Chebyshev's inequality (Feller, p. 219, or Loeve, p. 14): PtlX(n) - ~'21
2:: t} =::;
1/4n
~-
In the limit: PtlX(n) - ~'21 n-4CO
2:: e}
= 0
1
E[X(n) - ~'2]2 == q2[X(n)] == 4n
2. For n -+ 00:
l.i.m. X(n) =
~'2
as lim E[X(n) n-4CO
~'2]2
=0
3. Using the strong law, one can also prove that [X(n)] converges to
~~
in the a.c,
n-4CO
sense.
The proof is somewhat complex and will be omitted here (see Loeve, p. 19).
Stochastic Differentiation and The ment of modes of convergence, continuity, differentiability, and nrteurability for stochastic processes is beyond the scope of this text. For the immediate purpose, it seems sufficient to give rudimentary definitions similar to the concepts acquired in ordinary courses in analysis. . A real process XCi) is said to be continuous at a time t in m.s. if '1tOl f'lI'A'1f8f"\l'1 0
lim E\X(t h-40
+ h)
-- X(t)\2 = 0
The m.s. derivative of a continuous process shown by X(t) is defined as X(t · 1.i.m, h-40
+ h)h -- XCi) =
X'(t)
(10-76)
As in the case of the integration of deterministic functions, different types of stochastic integrals can be defined. The Riemann stochastic integral as well as the Lebesgue and the Stieltjes stochastic integrals is the technical literature. At present, this discussion is frequently used the definition of stochastic integrals in the Riemann confined to sense. Divide the real interval of integration (a,b) into n arbitrary subintervals di; The Riemann integral of a stochastic process {X(t)} can be approached in the following way:
L n
b
X(t) dt
= l.i.m. n-+co
X(ti) d4
1
=
R(a,b)
(10-77)
SCHEMES
is contingent upon the convergence the sum in the ID.S. sense" It can be shown that the double of the autocovariance function the . . . . . . . exists over the square (a,b)] in the J. Statist. Soc., sere vol. no . 2, It is not intended to delve further into this interest to note once the of differentiation and is In1jr0(1Uce
00
n --?>
00
That is, the basic relations among different entropies hold under a stationary regime. H(X,Y) H(X,Y)
= H(X) = H(Y)
+ H(YIX) + H(X\Y)
The important mathematical conclusion to be pointed out is that, for stationary regimes: 1. The existence of different entropies is a fact. 2. The significance of the entropies and their interrelationships is just the same as in the case of a regime under an independent source and memoryless channel. 3. Finally, the rate of transinformation exists and is the same as before. 'lAv........JLJI. ..... ''Uiio
I(X;Y)
=
+ H(Y)
- H(X,Y)
4. The stationary capacity of the channel 0 is the max I(X;Y) over all stationary sources. Without including more complex mathematical machinery, a few suggestions for further reading are offered. Shannon's original contributions were primarily concerned with Markovian sources. In the light of 8
COMMUNICATION UNDER STOCHAS':rIC REGIMES
recent contributions it has been established that the first and fundamental theorems hold for stationary (not necessarily lVlarkovian or ergodic) sources and channels with or without memory. These subsequent results are primarily due to McMillan, Feinstein, and """ According to Khinehin, up to 1953 no adequate proofs of these theorems existed. Some of the more recent developments are due to Kolmogorov, Blackwell, Feinstein, Rozenblatt-Rot, Perez, and Driml. The transactions of the first and the second Prague Conference (Czechoslovak Academy of Sciences, 1958) contain a considerable amount of research related to the entropy of the stochastic signals. These developments are of a mathematical nature, outside the scope of the present undertaking. v
.
PROBLEMS 11...1. Determine which of the following transition probability matrices correspond to regular chains. Find the T matrix. (a) [~ (e)
[>J
(c) [~
(b)
1
%
1
~]
(f)
[>~ ~]
(g)
0 0
%] ~~
0 1
%
%]>~
(d)
0
(h)
[>~
>~] 1 0
%
11-2. Which of the following Markov transition probability matrices are regular? x means any positive nonzero number. The rows are assumed to add to 1.
(a)
(d)
U ~] z
0 0
[~
z 0
(b)
[Xx
x
0]
x 0 o 0 x
(c)
[0o
0
0]
x 0 x o x 0 x 0 o x o x
° ° 0]
o 0 x x o x x x x x x 0 o x 0
11.. . 3. Study the Markov chain of Fig. PII-3. states. Is the chain ergodic?
FIG.
Pll..3
Determine transient and ergodic
SCHEMES WITH MEMORY
is
Show that the chain illustrated in
FIG.
Pll-4
11-6" Consider the chain represented by the transition pronamntv matrix 1 (a) Is the chain regular? (b) Is the chain ergodic? (c) Find the T matrix.
11.. .6" Draw a Markov diagram for a game of tennis. Assume that a player has a fixed probability P for winning each point. Determine: (a) Probability of winning a game when = 0.60. (b) Probability of winning a set when = 0.60. (c) Parts (a) and (b) for P = 0.51. 1.1. Snell, Finite Markov Chains and Their Applications, Am. llfath. Monthly, vol. 66, no. 2, pp. 99-104, February, (a) Find the T matrix for the transition matrix below.
o (b) Assuming an initial probability matrix of [~~,~~,~i], compute the entropies H'», H('}.), and H(!) and compare H(3) with H(l) H('},). (0) Discuss and verify the situation when H(l) H('},) = H(3).
+
+
For Probe determine the entropy for each regular Markov chain. Show that, for a regular Markov chain, there is a probability being in the state Sj independent of the starting state; Oli is also the fraction that the process can be expected in state Sf (see and
PART
4
L'importance d'un fait se mesure done Ason rendement, c'est-a-dire, ala quantite de pensee qu'elle permet d'economisero Si un resultat nouveau a du prix, c'est quand, en reliant des elements connus depuis longtemps, mais [usque-Is epars et paraissant etrangers les uns aux autres, il introduit subitement I'ordre IA OU regnait I'apparence du desordre. II nous permet alors de voir d'un coup d'ceil chacun de ces elements et la place qu'il occupe dans I'ensemble. Ce fait nouveau non-seulement est precieux par lui-meme, mais lui seul donne leur valeur a tous les faits anciens qu'il relie, Henri Poincare Atti IV conqr. intern. vol. 1, p. 169
CHAPTER
Frequent reference has been made to Shannon's significant statement that by proper encoding it is possible to send information at arate arbi ... trarily close to C through the channel, with as small a probability of error or equivocation as desired. In the context of our work, we have put in focus the accurate meaning of this statement for discrete as well as con... tinuous channels. The major objective of this chapter is to give a detailed statement and proof of the fundamental theorem for discrete noisy memoryless channels. Unless otherwise specified, in this VJLJL~'lIIlJV'VJL we are concerned only with discrete noisy memoryless channels. A glance at the above statement reveals that there are several to be clarified before the full implication of the statement is realized. These points are as follows: 1. A quantitative definition of the words "error" and "frequency" or "probability of error" in a communication system 2. The relation between error and the of the channel The definition of error (1) requires the description of a detection or a decoding scheme. Once we have a method for decoding the received signals, then we are in a position to discuss the probability of error associ... ated with that method. Thus, our immediate plan for the first part of this chapter consists of the following: 1. The definition of a decision (or detection) scheme (Sec. 12-1) 2. The definition of the error associated with a detection scheme (Sec. 12-2) 3. A discussion of the relation between error and equivocation (Sec. 12-3) 4. A study of the transmission of information in the extended channel (Sec. 12-4) Subsequent to the presentation of these preliminaries, we shall turn our attention to the fundamental theorem of discrete memoryless channels. From an organizational point of view, the chapter is divided in the following four parts: Preliminaries, Feinstein's Proof, Shannon's Proof, and Wolfowitz's Proof. 397
Wolfowitz's proof is the most recent of the three. The first, Feinstein's, uses a certain ingenious procedure; Shannon's contains a good deal of insight as well as new ideas. In the following presentation we have made an effort to retain the notation of the original contributors as far as possible; this will be found convenient by those readers who wish to consult the original articles.
12-1. Decision Scheme. A decision scheme is a method that the receiver employs for deciding, after a word has been received, which message word was transmitted. Let us review the mathematical model of our discrete communication system. Each transmitted and received word is an n-letter sequence
"rOt'l•.oll"tTOrl
'lI"'£'l>A':b£'l>"I"'I!'Tfl-r
Uk Vk
where
Xkl, Xk2,
• • • ,Xkn
Ykl, Yk2, •
Ykn
are letters selected a the transmission and we wish to devise a detection scheme. when a word Vk is received we must associate or one of transmitted words with »». This calls for a decision that may be substantiated some statistical decision is not excluded. At letters at our out of are selected for transmission. N sets [Ul,U2, • ,UN]. Whenever a word Vn E that the word Uj was transmitted. This is to as a decision scheme, we have not stated the statistical criterion on which the partruonmz has been based. A decision and a criterion nronaouitv distribution space. '!'he maximum-likelihood prmcrpte decision .... "'-">"...J
....
Xki
.,,'u .....
and
= =
l ' " . , , , , ,0 •• _
Yki
Ir'1l"'\ •..,......... Y·,,'Y\nt'
"r!pi"'.ol,rTl"n,n'
v'V'A....... s:....JL'V'A...
VA ... v'V .... , , _ .
Next we wish to define the error probabilities associated with a decision scheme. An error in detection occurs whenever the received word falls in a region A_k while the transmitted word was not u_k. The probability of this error, conditioned on the received word falling in A_k, is

e_k = 1 − P{u_k | v ∈ A_k}          (12-1)
The error associated with a decision scheme can be conveniently regarded as a random variable taking the values e_k, as stated in Eq. (12-1), with probability P{v ∈ A_k}. Thus the average value of the error is

E = Σ_{k=1}^{N} P{v ∈ A_k}[1 − P{u_k | v ∈ A_k}]          (12-2)

The meaning of the above notation should be made clear. We find the error for each detection region and then let the index k run over all N possible detection regions. It is instructive to note that the average error probability can be equivalently derived by considering an alternative method of bookkeeping. An error occurs when a word u_k is transmitted but the corresponding received word is not in A_k. This error probability can also be considered as a value taken by a random variable:

P{v ∉ A_k | u = u_k} = 1 − P{v ∈ A_k | u = u_k}          (12-3)

The average value of this variable is

E = Σ_{k=1}^{N} P{u_k}[1 − P{v ∈ A_k | u = u_k}]          (12-4)

The equivalence of the expressions in Eqs. (12-2) and (12-4) can be exhibited by appropriate manipulation of the terms. Equation (12-2) can be written as

E = 1 − Σ_{k=1}^{N} P{v ∈ A_k} P{u_k | v ∈ A_k} = 1 − Σ_{k=1}^{N} P{u_k, v ∈ A_k} = 1 − Σ_{k=1}^{N} P{u_k} P{v ∈ A_k | u = u_k}
To sum up, observe that we have arrived at a clear understanding of the average error for any decision scheme in general. The average error, as anticipated earlier, is a function of the input distribution, the channel, and the decision scheme. A decision scheme which, upon reception of the word v_i, chooses that transmitted symbol u_j whose conditional probability P{u_j | v_i} is the greatest is generally referred to as an ideal observer. Note that this definition is dependent upon knowledge of the message probabilities:

P{u_k | v_i} = max_{j} P{u_j | v_i}
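A minimal numerical sketch of these definitions, using an assumed toy channel and input distribution (not an example from the text): it forms the ideal-observer partition by assigning each received word to the transmitted word of largest a posteriori probability, and checks that the two bookkeepings of Eqs. (12-2) and (12-4) give the same average error.

```python
import numpy as np

# Assumed toy data: three transmitted words, four received words.
P_u = np.array([0.5, 0.3, 0.2])                 # input distribution P{u_k}
P_v_given_u = np.array([[0.7, 0.1, 0.1, 0.1],   # channel matrix P{v_j | u_k}
                        [0.1, 0.6, 0.2, 0.1],
                        [0.1, 0.1, 0.2, 0.6]])

P_uv = P_u[:, None] * P_v_given_u               # joint probabilities P{u_k, v_j}
P_v = P_uv.sum(axis=0)
P_u_given_v = P_uv / P_v                        # a posteriori probabilities P{u_k | v_j}

# Ideal observer: place each v_j in the region A_k of the most probable u_k.
decide = np.argmax(P_u_given_v, axis=0)

# Eq. (12-4): average over transmitted words of P{v not in A_k | u_k}.
E_by_u = sum(P_u[k] * P_v_given_u[k, decide != k].sum() for k in range(len(P_u)))

# Eq. (12-2), evaluated word by word; grouping the terms by regions gives the same value.
E_by_v = sum(P_v[j] * (1.0 - P_u_given_v[decide[j], j]) for j in range(len(P_v)))

print("average error via (12-4):", round(E_by_u, 6))
print("average error via (12-2):", round(E_by_v, 6))
```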
Feinstein introduced the useful concept of uniform error bounds. A decision scheme is said to be uniformly error bounding with bound λ if there exists a number 0 ≤ λ < 1 such that

e_k ≤ λ          k = 1, 2, . . . , N          (12-6)
Obviously, if a decision scheme has a uniform error bound, then Eqs. (12-2) and (12-4) yield

E ≤ λ
That is, the average detection error probability will not exceed the uniform error bound of the channel, provided that the latter exists. For a discrete memoryless channel the following theorem holds:
Theorem. Let E be the average error probability of a decision scheme with N detection regions. Then

H(U|V) ≤ −E log E − (1 − E) log (1 − E) + E log (N − 1)
Proof. A proof of this theorem follows from the convexity of the entropy function. [See Feinstein (I, pp. 35-36).] The proof based on the law of additivity [Eq. (3-34)] is also of interest. Consider a received word v_k and the entropy H(U|v_k). The original word associated with v_k will be denoted by u_{ik}. Of course, several received words may correspond to the same original word u_{ik}. Using Eq. (3-34), we obtain

H(U|v_k) = −P{u_{ik}|v_k} log P{u_{ik}|v_k} − (1 − P{u_{ik}|v_k}) log (1 − P{u_{ik}|v_k})
           − (1 − P{u_{ik}|v_k}) Σ_{i=1, i≠ik}^{N} [P{u_i|v_k}/(1 − P{u_{ik}|v_k})] log [P{u_i|v_k}/(1 − P{u_{ik}|v_k})]

However, the entropy of the N − 1 remaining alternatives cannot exceed log (N − 1); hence

H(U|v_k) ≤ −P{u_{ik}|v_k} log P{u_{ik}|v_k} − (1 − P{u_{ik}|v_k}) log (1 − P{u_{ik}|v_k}) + (1 − P{u_{ik}|v_k}) log (N − 1)
Averaging over the received words, H(U|V) = Σ_{k=1}^{M} P{v_k} H(U|v_k), where M is the number of all possible words in the receiving space. Thus

H(U|V) ≤ Σ_{k=1}^{M} P{v_k}[−P{u_{ik}|v_k} log P{u_{ik}|v_k} − (1 − P{u_{ik}|v_k}) log (1 − P{u_{ik}|v_k})] + Σ_{k=1}^{M} P{v_k}(1 − P{u_{ik}|v_k}) log (N − 1)
The summation over the M possible received signals can be done conveniently by first summing over a region A_k and then proceeding over all such regions:

H(U|V) ≤ −Σ_{k=1}^{N} P{v ∈ A_k}[P{u_k|v ∈ A_k} log P{u_k|v ∈ A_k} + (1 − P{u_k|v ∈ A_k}) log (1 − P{u_k|v ∈ A_k})] + Σ_{k=1}^{N} P{v ∈ A_k}(1 − P{u_k|v ∈ A_k}) log (N − 1)
Using the values of e_k and E as defined in Eqs. (12-1) and (12-2) and the convexity argument, one finds

H(U|V) ≤ −Σ_{k=1}^{N} P{v ∈ A_k}[e_k log e_k + (1 − e_k) log (1 − e_k)] + E log (N − 1)

or

H(U|V) ≤ −E log E − (1 − E) log (1 − E) + E log (N − 1)
When the channel is not lossless, the equivocation will be positive, at least for some input distribution. That is, for a uniformly error-bounding scheme with bound λ the equivocation entropy will be bounded by

0 < H(U|V) ≤ −λ log λ − (1 − λ) log (1 − λ) + λ log (N − 1)
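As a numerical illustration (with an assumed toy channel and input distribution, not an example from the text), the sketch below computes the equivocation H(U|V) and the average error E of the ideal-observer scheme and confirms the inequality of the theorem above.

```python
import numpy as np

P_u = np.array([0.4, 0.35, 0.25])                # assumed input distribution
P_v_given_u = np.array([[0.8, 0.1, 0.1],         # assumed channel matrix P{v | u}
                        [0.15, 0.7, 0.15],
                        [0.1, 0.2, 0.7]])
N = len(P_u)

P_uv = P_u[:, None] * P_v_given_u
P_v = P_uv.sum(axis=0)
P_u_given_v = P_uv / P_v

# Equivocation H(U|V), in bits.
H_UV = -np.sum(P_uv * np.log2(P_u_given_v))

# Ideal-observer decision and its average error probability E.
decide = np.argmax(P_u_given_v, axis=0)
E = 1.0 - sum(P_uv[decide[j], j] for j in range(len(P_v)))

# Right-hand side of the theorem: -E log E - (1-E) log(1-E) + E log(N-1).
bound = -E * np.log2(E) - (1 - E) * np.log2(1 - E) + E * np.log2(N - 1)

print("H(U|V) =", round(H_UV, 4), " E =", round(E, 4), " bound =", round(bound, 4))
print("inequality satisfied:", bool(H_UV <= bound))
```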
We now introduce an auxiliary set whose elements u_i' describe the possible transmitted words associated with a received word; thus for each received word v_k we can find the corresponding conditional probabilities. Note that the resulting probability of error is less than the bound obtained above. It remains to see how the error probability and the equivocation of the extended channel are interrelated; in addition, these relations, applied to words of length n, lead to Feinstein's proof of the fundamental theorem.
Ensemble Codes. Subsequent to Feinstein's proof of the fundamental theorem of encoding for discrete channels in the presence of noise, Shannon provided an alternative proof which does not employ Feinstein's inequalities. Shannon's proof is based on the concept of ensemble codes. He considers a class of distinct codes selected at random and evaluates the average error probability not for any member of this ensemble but for their totality. This average probability of error may be bounded between some limiting values which are not too difficult to derive. Thus, since there is at least one member of the ensemble of codes which has a probability of error less than the average error for the ensemble, there is a code whose error probability is less than the upper bound. In other words, if we were given an encoding machine which selects any one of, say, k specific encoding procedures, then, on the average, the machine cannot do better or worse than certain limiting behaviors. The limiting average error values for the ensemble, of course, depend on the over-all nature of the coding-decoding scheme. Shannon's method has the merit of exhibiting the behavior of the probability of error as a function of word length, as will be discussed later. In order to appreciate fully Shannon's technique, we shall devote this section to the formulation of the problem. Consider a set of M messages which are to be encoded and transmitted via a discrete channel with a finite alphabet in the presence of noise. The messages and the encoding alphabet are
[m] = [m_1, m_2, . . . , m_M]
[a] = [a_1, a_2, . . . , a_D]
The channel is defined as usual by its stochastic matrix:
[P{a_k received | a_j transmitted}]          k, j = 1, 2, . . . , D
As before, in order to combat noise, we use code words of n symbols selected from the alphabet [a]. We know that the rate of transmission of information is closely related to M, the number of messages, and to n, the word length. For the moment, assume that we have a transinformation rate R in mind that is not beyond the realm of possibility. Then the word lengths are selected according to the equation

R = (1/n) ln M          M = e^{nR}
The code vocabularies are selected in the following manner: We assign, at random, a word u_1 to correspond to the message m_1 (assumed to have a probability P{m_1}); we then select a word, still at random, to correspond to the message m_2, and so on. This code book will be referred to as code 1. Of course, if we do this all over again, it is possible to arrive at a code that is not identical with code 1. Indeed, there are B^M such codes (B being the number of possible words of length n), some of which may be good and others rather poor; for instance, a code book in which the messages m_1, m_2, . . . , m_M have all been assigned the same word u_i is a trivial and useless code. Our next task is to devise a decoding scheme for every one of these distinct codes and to make an estimate of the average error, not for any particular code but for the whole ensemble of such codes.
Consider a particular code [u_1, u_2, . . . , u_M], with probabilities [P{u_1}, P{u_2}, . . . , P{u_M}], fed into the channel. Knowing the noise characteristics of the channel, we can compute the matrix of conditional probabilities required for the detection of received messages. For this composite scheme we employ the maximum-likelihood detection criterion as before: the receiving space is partitioned into a number of disjoint regions, and these regions are assigned to the transmitted words in a one-to-one manner, based on the conditional probabilities. That is, we assign a received word v_k to a transmitted word u_i whenever

P{v_k | u_i} ≥ P{v_k | u_j}          j = 1, 2, . . . , M;  j ≠ i
The decoding procedure will be modified to fit each of the B^M codes. Suppose that a word v_0 has been received. The receiver refers to the conditional probabilities P{v_0 | u_i} listed for the code book in use and decides in favor of the code word for which this probability is largest. For any pair (u,v) of transmitted and received words, consider the mutual information per letter,

z = (1/n) log [P{u,v} / (P{u} P{v})]

It is convenient to think of this quantity as a random variable Z; each pair (u_k, v_j) yields a value of the variable, and each such value has a probability associated with it. The CDF of this random variable will be designated by F(z).
We can now state the main encoding theorem of this section.
Theorem. For transmission over a discrete memoryless channel, given any rate R and any θ > 0, there exists a block code of M = e^{nR} equiprobable code words of length n such that the average probability of error satisfies

P_e ≤ F(R + θ) + e^{−nθ}
We divide the pairs (u,v) into two sets, T and T'. If for a chosen pair the mutual information per letter is larger than the chosen reference value R + θ, we include that pair in the set T; otherwise in T':

(u,v) ∈ T     if z > R + θ
(u,v) ∈ T'    if z ≤ R + θ

The total probability of the random variable Z associated with code-point pairs in T' is, by the definition of F(z), equal to F(R + θ). Now we prove the result stated above. Consider the ensemble of codes and its average error probability. The average error for the code ensemble can be rewritten by dividing the summation
into two parts, first over all the elements in T' and then over those in T. For a given pair (u,v), let S_v(u) denote the set of words u' for which P{v|u'} ≥ P{v|u}, and let Q_v{u} denote the probability that a code word selected at random falls in S_v(u). An error can occur for the pair (u,v) only if at least one of the other M − 1 randomly chosen code words falls in S_v(u); hence

P_e = Σ_{T'} P{u,v}[1 − (1 − Q_v{u})^{M−1}] + Σ_{T} P{u,v}[1 − (1 − Q_v{u})^{M−1}]

But 1 − (1 − Q_v{u})^{M−1} ≤ 1; therefore, using also 1 − (1 − x)^{M−1} ≤ (M − 1)x ≤ Mx,

P_e ≤ F(R + θ) + Σ_{T} P{u,v}[1 − (1 − Q_v{u})^{M−1}]
    ≤ F(R + θ) + M Σ_{T} P{u,v} Q_v{u}
[See the analogous derivation of Eq. (9-75).] The final step required is an estimation of Q_v{u} for (u,v) ∈ T. We note that

log [P{u,v} / (P{u} P{v})] > n(R + θ)          for (u,v) ∈ T

or

P{v|u} > P{v} e^{n(R+θ)}          (12-60)

In order to relate this inequality to Q_v{u}, observe that

P{v|u'} ≥ P{v|u} > P{v} e^{n(R+θ)}          for u' ∈ S_v(u)
P{u'|v} > P{u'} e^{n(R+θ)}
1 ≥ Σ_{S_v(u)} P{u'|v} > e^{n(R+θ)} Q_v{u}          (12-62)
Finally we find

Q_v{u} < e^{−n(R+θ)}

and consequently

P_e ≤ F(R + θ) + M Σ_{T} P{u,v} e^{−n(R+θ)}
    ≤ F(R + θ) + e^{nR} e^{−n(R+θ)}
    = F(R + θ) + e^{−nθ}          (12-64)
Since the average error for the ensemble of codes is smaller than the quantity F(R + θ) + e^{−nθ}, there must exist at least one code with an error probability less than or equal to this same quantity. Thus the theorem is proved:

P_e ≤ F(R + θ) + e^{−nθ}          (12-65)
As n is increased, for a fixed θ (θ can be selected arbitrarily small), the term e^{−nθ} and also R become smaller and smaller. Similarly, for a fixed M, one may select n large enough to keep R and F(R + θ) arbitrarily small. Note that the probability of error decreases approximately exponentially with the increase of word length. The exponential feature of this interrelationship is an interesting result which will be further elaborated upon in the next section (see also theorems 2 and 3 of Shannon [4]).
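A Monte Carlo sketch of the ensemble argument for a binary symmetric channel (all parameter values are illustrative assumptions, not taken from the text): random code books of M = ⌈e^{nR}⌉ words are drawn, maximum-likelihood decoding is performed, and the ensemble-average error is seen to fall with increasing n, in the spirit of Eq. (12-65).

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.05     # assumed BSC crossover probability
R = 0.25     # assumed rate in nats/symbol, below the BSC capacity of about 0.49 nats

def ensemble_error(n, trials=200):
    """Average block-error probability over randomly drawn code books of M = ceil(e^{nR}) words."""
    M = int(np.ceil(np.exp(n * R)))
    errors = 0
    for _ in range(trials):
        code = rng.integers(0, 2, size=(M, n))      # one random code book
        u = code[rng.integers(M)]                   # transmit a randomly chosen code word
        v = u ^ (rng.random(n) < p)                 # pass it through the BSC
        d = (code != v).sum(axis=1)                 # ML decoding on a BSC = minimum Hamming distance
        if not np.array_equal(code[np.argmin(d)], u):
            errors += 1
    return errors / trials

for n in (10, 20, 30, 40):
    print(f"n = {n:2d}   ensemble-average error ~ {ensemble_error(n):.3f}")
```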
The following inequality, due to Chernov, is useful for a more extensive application of these ideas. Let S_n = X_1 + X_2 + · · · + X_n be the sum of n independent, identically distributed random variables; then the CDF of S_n will satisfy the inequality

P{S_n ≤ a} ≤ e^{−ta} E(e^{tS_n})          for every t ≤ 0

Chernov's derivation is somewhat involved and proves stronger results than the one quoted here; this simpler version of Chernov's bound suffices for our purposes.
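As an illustration (a sketch with assumed numbers, using only the elementary inequality stated above, not Chernov's sharper result), the code below compares the bound for a lower tail of a binomial sum with the exact tail probability.

```python
import math

# X_i are Bernoulli(p); S_n = X_1 + ... + X_n.  Bound P{S_n <= n*a} for a < p.
p, a, n = 0.5, 0.3, 100                      # assumed illustrative values

def chernov_lower_tail(p, a, n):
    # Minimize e^{-ta} E(e^{tX}) = e^{-ta} (1 - p + p e^t) over a grid of t <= 0,
    # then raise the minimum to the nth power.
    best = 1.0
    for k in range(1, 2001):
        t = -k / 100.0
        best = min(best, math.exp(-t * a) * (1 - p + p * math.exp(t)))
    return best ** n

exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(int(n * a) + 1))
print("exact P{S_n <= na}  =", f"{exact:.3e}")
print("Chernov-type bound  =", f"{chernov_lower_tail(p, a, n):.3e}")
```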
Coding has become a specialized field. Applications to detection problems, such as radar detection, embrace another professional area. The areas of application to some aspects of computers, optics, psychology, and others seem to follow the same pattern of specialization. The mathematical foundation of information theory constitutes another specialized area. The aim of this book has been an introductory presentation of the essentials of information theory. We have attempted to present fundamentals that are necessary for an understanding of the subject prior to its application. The areas of specialized applications are outside the scope of the present work. Fortunately, an abundance of literature dealing with the application of information theory to many specialized fields is available, and more books on the subject are forthcoming. It is hoped that the reader will be stimulated to a more inclusive study of the subject.
The following diversified notes may be of additional interest. The notes are not necessarily complete or self-sustaining. They are inserted as a few examples of the multitude of topics available for further study. For each note, adequate reference for further pursuit is provided. The bibliography at the end of the book also may prove helpful for this purpose.
N-1. The Gambler with a Private Wire. J. L. Kelly, Jr.,* has suggested an interesting model which exhibits the significance of the rate of transmission of information in a different way. Consider the case of a gambler with a private wire who places bets on the outcomes of a game of chance. We assume that the side information which he receives has a probability p of being true and a probability 1 − p of being false. Let the original capital of the gambler be V_0 and his capital after the Kth betting V_K. Since the gambler is not certain that the side information is entirely reliable, he wagers only a fraction of his capital on each bet.
* J. L. Kelly, Jr., A New Interpretation of Information Rate, Bell System Tech. J., vol. 35, pp. 917-926, 1956.
The exponential rate of growth of the successive capitals,

G = lim_{K→∞} (1/K) log (V_K / V_0)

is the natural figure of merit for his betting policy. When the betting fraction is chosen optimally, the maximum growth rate turns out to equal the rate of transmission of information over the gambler's private wire:

G_max = C

Before leaving this model, a remark concerning the critical probability p_crit is in order: so long as the probability p of receiving correct information exceeds p_crit, the capital grows exponentially; below p_crit it does not pay to bet at all.
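A small sketch of this model under the usual even-odds assumptions (the numbers are illustrative, not from the text): with a tip that is correct with probability p, betting the fraction f = 2p − 1 of the current capital maximizes the exponential growth rate, and the maximized rate equals 1 − H(p), the capacity of the corresponding binary symmetric channel.

```python
import math, random

p = 0.60                                  # assumed probability that a tip is correct
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
G_max = 1 - H                             # capacity of the corresponding BSC, bits per bet
f = 2 * p - 1                             # Kelly fraction of the capital to wager at even odds

random.seed(1)
K, log2_growth = 20000, 0.0
for _ in range(K):
    log2_growth += math.log2(1 + f) if random.random() < p else math.log2(1 - f)

print("theoretical G_max = 1 - H(p)    =", round(G_max, 4), "bits/bet")
print("simulated (1/K) log2(V_K / V_0) =", round(log2_growth / K, 4))
```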
Further Remarks: The Bose-Chaudhuri t-error Correcting Codes. A good deal of recent coding work has been derived from the results of R. C. Bose and D. K. R. Chaudhuri,† who have derived the necessary and sufficient conditions for a group code to be a t-error correcting code. Their work contains material of theoretical as well as practical interest. We shall quote some of their main results with little (if any) indication of their method of derivation. For the details the reader is referred to the original article.†
Theorem 1. The necessary and sufficient condition for the existence of a δ-error correcting group code (n,k) is the existence of an n × r matrix [A], with r = n − k, such that any 2δ row vectors of [A] are linearly independent.
Theorem 2. If n = 2^m − 1, there exists a δ-error correcting group code (n,k) with

k ≥ n − δm

The proof of Theorem 1 is based in part on a lemma which states: The necessary and sufficient condition for a binary δ-error correcting group code is that each word of the null space have a norm (weight) of at least 2δ + 1. Theorem 2 is a stronger statement than a result obtained earlier by Varsamov.‡
* See Elias, cited above, and R. M. Fano, The Statistical Theory of Information, Nuovo cimento, ser. X, vol. 13, suppl. 2, pp. 353-372.
† R. C. Bose and D. K. R. Chaudhuri, On a Class of Error Correcting Binary Group Codes, Inform. and Control, vol. 3, no. 1, pp. 68-79, 1960.
‡ R. R. Varsamov, The Evaluation of Signals in Codes with Correction of Errors, Doklady Akad. Nauk S.S.S.R., new series, vol. 117, pp. 739-741, 1957.
Varsamov has shown that if k satisfies the inequality

C_{n−1}^{2δ−1} + C_{n−1}^{2δ−2} + · · · + C_{n−1}^{1} + 1 < 2^{n−k}

where C_{n−1}^{i} denotes the binomial coefficient, then a δ-error correcting binary group code (n,k) exists.
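A hedged sketch of this sufficient condition: for a given word length n and desired number δ of correctable errors, the code below finds the largest k for which the inequality guarantees that a δ-error correcting binary group code (n,k) exists (the particular n and δ are arbitrary illustrative choices).

```python
from math import comb

def largest_k(n, delta):
    """Largest k such that C(n-1,1) + ... + C(n-1, 2*delta - 1) + 1 < 2**(n-k)."""
    lhs = sum(comb(n - 1, i) for i in range(0, 2 * delta))   # i = 0, 1, ..., 2*delta - 1
    k = 0
    while lhs < 2 ** (n - (k + 1)):
        k += 1
    return k

for n, delta in [(15, 1), (15, 2), (31, 2), (31, 3)]:
    print(f"n = {n:2d}, delta = {delta}:  a ({n},{largest_k(n, delta)}) group code is guaranteed")
```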
The merit of the proof suggested by Bose and Chaudhuri lies particularly in the fact that it provides a constructive method for obtaining the codes. Also, the implementation of these codes does not seem to be too complex (see, for instance, Peterson*). Peterson points out that these codes have a cyclic property and hence can be implemented with what is called a shift-register generator, as demonstrated earlier by Prange.† (The cyclic property implies that, if a word u = a_1, a_2, . . . , a_n ∈ S, the words obtained by cyclically shifting the digits of u are also members of S; for example, u_1 = a_n, a_1, a_2, . . . , a_{n−1} ∈ S.) An early study of shift-register generators can be found in a report by Zierler.‡
Dependent Error Correction. In all error correcting schemes thus far presented, the occurrence of error was assumed to be a statistically independent phenomenon. In many data-processing systems the occurrence of an error in a particular binary digit is conditional on the occurrence of error in the preceding digits. Several interesting procedures for the detection and correction of interdependent errors have appeared in the literature for special error patterns, although a general solution has not yet been devised. An interesting class of these codes has been investigated by N. M. Abramson and several other authors, some of whom devise codes for correction and detection of a "burst" of errors (for example, when lightning may knock out several adjacent telegraph pulses). A brief discussion of Abramson's approach is presented below without reference to the practically important problem of instrumentation.
Abramson has suggested a code which corrects any single error or double adjacent error (SEC-DAEC). Let m be the number of information digits and k the number of parity checks; the number of distinct single and double-adjacent errors in a word with n = m + k digits is n + n = 2n. The parity-check number k then must satisfy

2^k ≥ 2(m + k) + 1

or

m ≤ 2^{k−1} − k − ½
* W. W. Peterson, "Error Correcting Codes," John Wiley & Sons, Inc., New York.
† E. Prange, "Some Cyclic Error-correcting Codes with Simple Decoding Algorithms," Air Force Cambridge Research Center, AFCRC-TN-58-156, April, 1958.
‡ N. Zierler, Several Binary-sequence Generators, Mass. Inst. Technol., Lincoln Lab. Tech. Rept. 95, September, 1955.
To devise a particular code, one has to set up a set of 2n equations whose solution determines the parity checks. The instrumentation technique is described in the above-cited reference.
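A short sketch of this counting condition (illustrative only): for a given number m of information digits it returns the smallest number k of parity checks satisfying 2^k ≥ 2(m + k) + 1, the requirement stated above.

```python
def sec_daec_checks(m):
    """Smallest k with 2**k >= 2*(m + k) + 1 (single + double-adjacent error correction)."""
    k = 1
    while 2 ** k < 2 * (m + k) + 1:
        k += 1
    return k

for m in (4, 11, 26, 57):
    k = sec_daec_checks(m)
    print(f"m = {m:2d} information digits -> k = {k} parity checks (n = {m + k})")
```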
Convolutional Codes. Most of the codes discussed thus far are block codes employing words of a fixed length n. In convolutional coding the encoding proceeds continuously rather than in separate blocks; hence the equipment required can be comparatively simple. The encoding-decoding procedure and the attendant behavior of the error probability have been studied in the literature.
* C. E. Shannon, A Note on a Partial Ordering for Communication Channels, Inform. and Control, vol. 1, pp. 390-397, Academic Press, New York, 1958.
The cascading of two such channels is equivalent to a single channel whose stochastic matrix is the product of the two individual matrices. In terms of the stochastic matrices of such channels a partial ordering can be defined; the pertinent results can be found in the reference cited above.
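A minimal sketch of the cascading operation (the two channel matrices are arbitrary assumed examples): the stochastic matrix of the cascade is the matrix product of the individual stochastic matrices, and its rows still sum to one.

```python
import numpy as np

# Assumed stochastic matrices of two binary channels (rows: inputs, columns: outputs).
P1 = np.array([[0.9, 0.1],
               [0.2, 0.8]])
P2 = np.array([[0.95, 0.05],
               [0.15, 0.85]])

P_cascade = P1 @ P2                          # stochastic matrix of the two channels in cascade
print(P_cascade)
print("row sums:", P_cascade.sum(axis=1))    # each row still sums to 1
```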
TABLE OF THE NORMAL DISTRIBUTION FUNCTION

Φ(z) = P{0 < X ≤ z} = (1/√(2π)) ∫_0^z e^{−t²/2} dt

Φ(z) is tabulated to four decimal places for z = 0.00 to 2.00 in steps of 0.01, for z = 2.00 to 3.00 in steps of 0.02, and for selected values up to z = 5.00. Representative entries:

   z      Φ(z)        z      Φ(z)        z      Φ(z)
  0.00   0.0000      1.00   0.3413      2.00   0.4772
  0.25   0.0987      1.25   0.3944      2.50   0.4938
  0.50   0.1915      1.50   0.4332      3.00   0.49865
  0.75   0.2734      1.75   0.4599      4.00   0.499968
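Values of the tabulated function can be regenerated with the error function; a minimal sketch (the particular z values chosen are arbitrary):

```python
import math

def phi(z):
    """Area under the standard normal density from 0 to z, i.e. P{0 < X <= z}."""
    return 0.5 * math.erf(z / math.sqrt(2.0))

for z in (0.50, 1.00, 1.50, 1.96, 2.00, 2.58, 3.00):
    print(f"z = {z:4.2f}   Phi(z) = {phi(z):.5f}")
```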
FREQUENTLY USED VALUES OF THE NORMAL DISTRIBUTION

    z      P{0 < X ≤ z}   P{−z < X ≤ z}   P{|X| > z}   P{X ≤ z}   P{X > z}
  1.000      0.34134         0.68268        0.31732     0.84134    0.15866
  1.960      0.47500         0.95000        0.05000     0.97500    0.02500
  2.000      0.47725         0.95450        0.04550     0.97725    0.02275
  2.576      0.49500         0.99000        0.01000     0.99500    0.00500
  3.000      0.49865         0.99730        0.00270     0.99865    0.00135
  3.291      0.49950         0.99900        0.00100     0.99950    0.00050
SOME COMMON PROBABILITY DISTRIBUTIONS
(σ² = E(X²) − [E(X)]²)

Distribution   Probability law                          Mean   Variance σ²   Characteristic E(e^{itX})
Binomial       P{X = k} = C(n,k) p^k q^{n−k}             np     npq           (p e^{it} + q)^n
               k = 0, 1, 2, . . . , n
Poisson        P{X = k} = e^{−λ} λ^k / k!                λ      λ             e^{λ(e^{it} − 1)}
               k = 0, 1, 2, . . .
Normal         f(x) = (1/(σ√(2π))) e^{−½[(x−m)/σ]²}      m      σ²            e^{itm − σ²t²/2}
               −∞ < x < ∞
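The entries of this table can be checked numerically; the sketch below (with assumed illustrative parameter values) verifies the mean, the variance σ² = E(X²) − [E(X)]², and the characteristic function at one point for the binomial and Poisson distributions.

```python
import cmath, math

def binomial_pmf(n, p):
    return {k: math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

def poisson_pmf(lam, kmax=60):
    # Truncated at kmax, which is ample for moderate lambda.
    return {k: math.exp(-lam) * lam**k / math.factorial(k) for k in range(kmax)}

def moments_and_cf(pmf, t):
    mean = sum(k * pr for k, pr in pmf.items())
    var = sum(k * k * pr for k, pr in pmf.items()) - mean**2      # E(X^2) - [E(X)]^2
    cf = sum(pr * cmath.exp(1j * t * k) for k, pr in pmf.items())
    return mean, var, cf

n, p, lam, t = 10, 0.3, 4.0, 0.7                                  # assumed values

m, v, cf = moments_and_cf(binomial_pmf(n, p), t)
print("binomial:", round(m, 4), round(v, 4), cf, (p * cmath.exp(1j * t) + 1 - p) ** n)

m, v, cf = moments_and_cf(poisson_pmf(lam), t)
print("poisson :", round(m, 4), round(v, 4), cf, cmath.exp(lam * (cmath.exp(1j * t) - 1)))
```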