Basic Probability Theory

-.,.p BASIC PROBABILITY THEORY Robert B. Ash Department of Mathematics University of Illinois DOVER PUBLICATIONS, IN...

Author: Robert B. Ash

405 downloads 2451 Views 8MB Size Report

This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
Report copyright / DMCA form

DOWNLOAD PDF

-.,.p

BASIC PROBABILITY THEORY

Robert B. Ash Department of Mathematics University of Illinois

DOVER PUBLICATIONS, INC. Mineola, New York

Copyright

Copy right © 1970, 2008 by Robert B. Ash

All rights reserved.

Bibliographical Note This Dover edition, first published in 2008, is an unabridged republication of

the work originally published in 1970 by John Wiley & Sons, Inc., New York. Readers of this book who would like to receive the solutions to the exercises not

published in the book 1nay request the1n fro1n the publisher at the following e-1nail address: [email protected]

Library of Congress Cataloging-in-Publication Data Ash, Robert B.

Basic propability theory I Robert B. Ash. - Dover ed. p. c1n.

Includes index. Originally published: New York : Wiley, [1970] ISBN-13: 978-0-486-46628-6 ISBN-10: 0-486-46628-0 1. Probabilities. I. Title. QA273.A77 2008 519.2-dc22 200800473

Manufactured in the United States of A1nerica Dover Publications, Inc., 31 East 2nd Street, Mineola, N.Y 11501

Preface

This book has been written for a first course in probability and was developed from lectures given at the University of Illinois during the last five years. Most of the students have been juniors, seniors, and beginning graduates, from the fields of mathematics, engineering and physics. The only formal prerequisite is calculus, but an additional degree of n1athematical maturity may be helpful. In talking about nondiscrete probability spaces, it is difficult to avoid measure-theoretic concepts. However, to develop extensive formal machinery from measure theory before going into probability (as is done in most graduate programs in mathematics) would be inappropriate for the particular audience to whom the book is addressed. Thus I have tried to suggest, when possible, the underlying measure-theoretic ideas, while emphasizing the probabilistic way of thinking, which is likely to be quite novel to anyone studying this subject for the first time. The major field of application considered in the book is statistics (Chapter 8). In addition, some of the problems suggest connections with the physical sciences. Chapters 1 to 5, and Chapter 8 will serve as the basis for a one semester or a two-quarter course covering both probability and statistics. If probability alone is to be considered, Chapter 8 may be replaced by Chapter 6 and Chapter 7, as time permits. An asterisk before a section or a problem indicates material that I have normally omitted (without loss of continuity) , either because it involves subject matter that many of the students have not been exposed to (for example, complex variables) or because it represents too concentrated a dosage of abstraction. A word to the instructor about notation. In the most popular terminology , P{X < x} is written for the probability that the random variable X assumes a value less than or equal to the number x. I tried this once in my class, and I found that as the semester progressed, the capital X tended to become smaller in the students ' written work, and the small x larger. The following semester, I switched to the letter R for random variable, and this notation is used throughout the book. v

vi

PREFACE ..

Fairly detailed solutions to some of the problems (and numerical answers to others) are given at the end of the book. I hope that the book will provide an introduction to more advanced courses in probability and real analysis and that it makes the abstract ideas to be encountered later more meaningful. I also hope that nonmathematics majors who come in contact with probability theory in their own areas find the book useful. A brief list of references, suitable for future study, is given at the end of the book. I am grateful to the many students and colleagues who have influenced my own understanding of probability theory and thus contributed to this book. I also thank Mrs. Dee Keel for her superb typing, and the staff of Wiley for its continuing interest and assistance. Urbana, Illinois, 1969

Robert B. Ash

Co nte nts

1. B A SIC C O N C E P T S

1.1 1 .2 1 .3 1 .4 1 .5 1 .6 1 .7 1 .8 2.

Introduction Algebra of Events (Boolean Algebra) Probability Combinatorial Problems Independence Conditional Probability Some Fallacies in Combinatorial Problems Appendix: Stirling's Formula

1 3 10 15 25 33 39 43

R AN D O M V A RI A B L E S

Introduction Definition of a Random Variable Classification of Random Variables Functions of a Random Variable Properties of Distribution Functions Joint Density Functions Relationship Between Joint and Individual Densities ; Independence of Random Variables 2.8 Functions of More Than One Random Variable 2.9 Some Discrete Examples

2.1 2.2 2.3 2.4 2.5 2.6 2. 7

46 48 51 58 66 70 76 85 95 vii

CONTENTS

·viii 3.

EXP ECTATION

3.1 3.2 3.3 3.4 3.5 3.6 3.7

Introduction Terminology and Examples Properties of Expectation Correlation The Method of Indicators Some Properties of the Normal Distribution Chebyshev' s Inequality at?- d the Weak Law of Large Numbers

100 107 1 14 1 19 122 124 126

4. CO NDITI O NAL P R O BABILITY AND EXP ECTATI O N

4. 1 4.2 4.3 4.4 4.5 S.

Introduction Examples Conditional Density Functions Conditional Expectation Appendix : The General Concept of Conditional Expectation

130 1 33 1 35 140 1 52

C HAR ACT E R I S TIC F UNCT I O N S

5.1 5.2 5.3 5.4

Introduction Examples Properties of Characteristic Functions The Central Limit Theorem

1 54 1 58 1 66 169

6. INFINI TE S E Q UE N C E S O F RANDO M VARIAB L E S

6.1 Introduction 6.2 The Gambler ' s Ruin Problem 6.3 Combinatorial Approach to the Random Walk; the Reflection Principle 6.4 Generating Functions 6.5 The Poisson Random Process 6.6 The Strong Law of Large Numbers

1 78 1 82 1 86 191 196 203

CONTENTS 7.

MARKOV CHAINS

7.1 7.2 7. 3 7.4 7.5 8.

ix

IntrQduction Stopping Times and the Strong Markov Property Classification of States Limiting Probabilities Stationary and Steady-State Distributions

21 1 21 7 220 230 236

INTRODUCTION TO STATISTICS

8.1 8.2 8. 3 8.4 8.5 8.6 8.7

Statistical Decisions Hypothesis Testing Estimation Sufficient Statistics Unbiased Estimates Based on a Complete Sufficient Statistic Sampling from a Normal Population The Multidimensional Gaussian Distribution

241 243 258 264 268 274 279

Tables

286 289 290 333

I

A

Brief Bibliography

Solutions to Problems Index

BASIC PROBABILITY THEORY

Basic Co n ce p ts

1. 1 I N T R O D U C T I O N

The origin of probability theory lies in physical observations associated with games of chance. It was found that if an "unbiased" coin is tossed independ ently n times , where n is very large, the relative frequency of heads , that is, the ratio of the number of heads to the total number of tosses , is very likely to be very close to 1/2. Similarly, if a card is drawn from-a perfectly shuffled deck and then is replaced, the deck is reshuffled, and the process is repeated over and over again , there is (in some sense) convergence of the relative frequency of spades to 1/4. In the card experiment there are 52 possible outcomes when a single card is drawn. There is no reason to favor one outcome over another (the principle of "insufficient reason" or of "least astonishment") , and so the early workers in probability took as the probability of obtaining a spade the number of favorable outcomes divided by the total number of outcomes, that is , 1 3/52 or 1/4. This so-called "classical definition" of probability, (the probability of an event is the number of outcomes favorable to the event, divided by the total number of outcomes, where all outcomes are equally likely) is first of all restrictive (it considers only experiments with a finite number of outcomes) and , more seriously, circular (no matter how you look at it, "equally likely"

2

BASIC CONCEPTS

essentially means "equally probable," and thus we are using the concept of probability to define probability itself). Thus we cannot use this idea as the basis of a mathematical theory of probability; however, the early proba bilists were not prevented from deriving many valid and useful results. Similarly, an attempt at a frequency definition of probability will cause trouble. If Sn is the number of occurrences of an event in n independent performances of an experiment, we expect physically that the relative fre quency Snfn should coverge to a limit ; however, we cannot assert that the limit exists in a mathematical sense. In the case of the tossing of an unbiased coin , we expect that Snfn � 1/2, but a conceivable outcome of the process is that the coin will keep coming up heads forever. In other words it is possible that Snfn � 1 , or that Snfn � any number between 0 and 1 , or that Snfn has no limit at all. In this chapter we introduce the concepts that are to be used in the con struction of a mathematical theory of probability. The first ingredient we need is a set Q, called the sample space, representing the collection of possible outcomes of a random experiment. For example, if a coin is tossed once we may take Q = {H, T}, where H corresponds to a head and T to a tail. If the coin is tossed twice, this is a different experiment and we need a different Q, say {HH, HT, TH, TT}; in this case one performance of the experiment corresponds to two tosses of the coin. If a single die is tossed, we may take Q to consist of six points, say Q = {1, 2, . . . , 6}. However, another possible sample space consists of two points, corresponding to the outcomes "N is even" and "N is odd," where N is the result of the toss. Thus different sample spaces can be associated with the same experiment. The nature of the particular problem under considera tion will dictate which sample space is to be used. If we are interested, for example, in whether or not N > 3 in a given performance of the experiment, the second sample space, corresponding to "N even" and "N odd," will not be useful to us. In general, the only physical requirement on Q is that a given performance of the experiment musl produce a result corresponding to exactly one of the p oints ofO. We have as yet no mathematical requirements on 0; it is simply a set of points. Next we come to the notion of event. An "event" associated with a random experiment corresponds to a question about the experiment that has a yes or no answer, and this in turn is associated with a subset of the sample space. For example , if a coin is tossed twice and Q = {HH, HT, TH, TT}, "the number of heads is < 1 " will be a condition that either occurs or does not occur in a given performance of the experiment. That is , after the experiment is performed , the question "Is the number of heads < 1 ?" can be answered yes or no. The subset of Q corresponding to a "yes" answer is A = {HT, TH, TT}; that is , if the outcome of the experiment is HT, TH, or TT, the answer

1.2

B

=

ALGEBRA OF EVENTS (BOOLEAN ALGEBRA)

{first toss = second toss} A=

3

{number of heads 3}

{N is even or N > 3} = {2, 3, 4, 5 , 6} {N is even and N > 3} = {4, 6} {N is not even} = {1 , 3 , 5} {N is not > 3} = {N < 3} = { 1 , 2} -..

=

Schematic representations (called Venn diagrams) of unions, intersections, and complements are shown in Figure 1 .2. 1 . Define the union of n events A 1 , A 2 , , A n (notation : A 1U UA n , or U; 1 A i) as the set consisting of those points which belong to at least one of the events A 1 , A 2 , , A n . Similarly define the union of an infinite se quence of events A 1 , A 2 , as the set of points belonging to at least one of the events A 1 , A 2 , (notation : A 1u A 2u , or U t) 1 A i). Define the intersection of n events A 1 , , A n as the set of points belonging to all of the events A 1 , n A n , or n� 1 Ai). , A n (notation : A 1 n A 2 n Similarly define the intersection of an infinite sequence of events as the set of •

•

•

•

•

•

•

•

·

•

•

·

•

•

•

•

•

•

•

·

·

•

·

·

·

·

·

1.2

5


points belonging to all the events in the sequence (notation : A1 n A2 n , or n� 1 Ai). In the above example , with A = {N is even} = {2, 4, 6}, B = {N > 3} = {3 , 4, 5, 6}, C = {N = 1 or N = 5} = { 1 , 5}, we have ·

·

·

Au Bu C = Q, A nB nC= 0 Au Beu C = {2, 4, 6} u {1 , 2 }u {1 , 5} = {1 , 2, 4, 5, 6} (AU C) n [(A n B)c] = { 1 , 2, 4, 5, 6 } n {4, 6 }c = {1 , 2, 5}

Two events in a sample space are said to be mutually exclusive or disjoint if A and B have no points in common , that is , if it is impossible that both A and B occur during the same performance of the experiment. In symbols, A and B are mutually exclusive if A n B = 0. In general the events A1, A2, , An are said to be mutually exclusive if no two of the events have a point in common ; that is , no more than one of the events can occur during •

•

•

FIGURE 1 .2.2 A

(")

(B u C)

=

(A

(")

B) u (A (l C).

the same performance of the experiment. Symbolically, this condition may be written for i ¥- j are said to be mutually exclusive if Similarly, infinitely many events A1, A2, Ai n A; = 0 for i ¥- j. In some ways the algebra of events is similar to the algebra of real numbers , with union corresponding to addition and intersection to multiplication. For example, the commutative and associative properties hold. •

•

•

Au B

=

Bu A,

Au (Bu C)

=

(Au B)u C

A nB

=

B n A,

A n (B n C)

=

(A n B) n C

(1.2.1)

'

Furthermore, we can prove that for events A, B, and C in the same sample space we have A n (Bu C) = (A n B) u (A n C) (1.2.2) There are several ways to establish this ; for example, we may verify that the sets of both the left and right sides of the equality above are represented by the area in the Venn diagram of Figure 1 .2.2.

6

BASIC CONCEPTS

Another approach is to use the definitions of union and intersection to show that the sets in question have precisely the same members ; that is, we show that any point which belongs to the set on the left necessarily belongs to the set on the right, and conversely. To do this , we proceed as follows. x

E A n (Bu C)=> x E A =>X E A

X E Bu c (x E B or x E C)

and and

( The symbol=> means "implies , '' and -¢=>means "implies and is implied by.") CASE 1.

x

E B. Then

x

E A and

x

E B, so

x

E A n B, so

x

E (A n B)u

2.

x

E C. Then

x

E A and

x

E C, so

x

E A n C, so

x

E (A n B)u

(A n C).

CAsE

(A n C).

Thus x E A n (B u C) => x E (A n B)u (A n C); that is , A n (Bu C) c ( A n B)u (A n C). (The symbol c is read "is a subset of"; we say that A 1 c A 2 provided that x E A 1 => x E A 2 ; see Figure 1 .2.3. Notice that, according to this definition , a set A is a subset of itself: A c A.) Conversely: Let

x

E (A n B)u (A n C). Then x E A n B or

x

E A n C.

E A n B. Then x E B, so x E Bu C, so x E A n (Bu C).

CAsE

1.

CAsE

2. x E A n C. Then x E C, so x E Bu C, so x E A n (Bu C).

x

c

Thus (A n B)u (A n C)

A n (Bu C) ; hence

A n (Bu C) = (A n B)u (A n C)

As another example we show that (A 1 u A 2u

·

·

·

u A ) c - A 1c n A 2c n n

FIGURE 1 .2.3 Al

·

c

·

·

A 2.

nA c n

(1 .2.3)

1.2


7

The steps are as follows. X

E (AlU···UAn)c-¢:> X ¢ A1U···UAn

it is not the case that x belongs to at least one of the Ai x E none of the Ai X E A/

for all i

x E A1cn···nAn c

An identical argument shows that (1.2.4) and similarly (1.2.5) Also (1 .2.6) The identities (1 .2.3)-(1 .2.6) are called the DeMorgan laws. In many ways the algebra of events differs from the algebra of real numbers, as some of the identities below indicate. AUA = A

Au Ac = Q

AnA = A

AnAc = 0

AnO = A

AU0 = A

AUO = Q

Arl0 = 0

(1 .2. 7)

Another method of verifying relations among events involves algebraic manipulation, using the identities already derived. Four examples are given below ; in working out the identities, it may be helpful to write Au B as A + B and AnB as AB. 1 . Au (AnB) = A

(1 .2.8)

PROOF.

A + AB = AQ + AB = A(Q + B) = AO. = A

2. (Au B)n (Au C) = Au (Bn C)

(1.2.9)

8

BASIC CONCEPTS

PROOF.

(A + B) (A + C) = = = = =

(A + B)A + (A + B)C AA + AB + A C + BC A (Q + B + C) + BC AO. + BC A + BC

(note AB = BA)

(1 .2. 1 0) PROOF.

4. (A nBe)u (A nB)u (AcnB) = A u B

(1.2.11)

PROOF.

ABC + AB + ACB = ABC + AB + AB + ACB

[see (1.2 .7)]

= A (Bc + B) + (A + Ac)B = AQ + QB =A+B

(see Figure 1 .2.4). As another example, let Q be the set of nonnegative real numbers. Let n = 1, 2, ...

(This will be anoth�r common way of describing an event. It is to be read: "A n is the set consisting of those points x in Q such that 0 < x < 1 - 1/n." If there is no confusion about what space Q we are considering, we shall simply write A n = {x: 0 < x < 1 - 1/n}.) Then 00

U A n = [0, 1) = {X: 0

n=l

0}. List the points that belong to the following events. =

=

=

(A u B) rl cc, (A rlB) rl [(A u C)c] A rlB rl C, A u (B rl cc), 2. Let A, B, and C be arbitrary events in the same sample space. Let D1 be the event that at least two of the events A, B, C occur; that is, D1 is the set of points common to at least two of the sets A, B, C. Let D2 {exactly two of the events A, B, C occur} D3 {at least one of the events A, B, C occur} {exactly one of the events A, B, C occur} D4 D5 {not more than two of the events A, B, C occur} Each of the events D1 through D5 can be expressed in terms of A , B, and C by using unions, intersections, and complements. For example, D3 A u B u C. Find suitable expressions for D1, D2, D4, and D5• 3. A public opinion poll (circa 1 850) consisted of the following three questions: (a) Are you a registered Whig? (b) Do you approve of President Fillmore's performance in office? (c) Do you favor the Electoral College system? A group of 1 000 people is pol led. Assume that the answer to each question must be either "yes" or "no. " It is found that: 550 people answer "yes" to the third question and 450 answer "no." 325 people answer "yes" exactly twice; that is, their responses contain two "yeses" and one ''no. " =

=

= =

=

I0

4.

5.

BASIC CONCEPTS

100 people answer "yes" to all three questions. 125 registered Whigs approve of Fillmore's performance. How many of those who favor the Electoral College system do not approve of Fillmore's performance, and in addition are not registered Whigs ? HINT: Draw a Venn diagram. If A and B are events in a sample space, define A -B as the set of points which belong to A but not to B; that is, A -B = A rl B e. Establish the following. (a) A rl (B - C) = (A rl B) -(A rl C) (b) A - (B u C) = (A -B)- C Is is true that (A -B) u C = (A u C) B? Let Q be the reals. Establish the fo �lowing.

-

. b) = lJ (a, b - �] - lJ [a+�, b ) n=1 n=1 [a, b] = nn=1 [a, b+�) - nn=1 (a-�' b]

(a,

8.

9.

n

n

n

A and B are disjoint events, are Ae and Be disjoint? Are A rl C and B rl C disjoint? What about A C and B C? If An An-1 A1 , show that nr 1 Ai = An , Uf 1 Ai = A1. Suppose that A1 , A2, is a sequence of subsets of n, and we know that for each nf 1 Ai is not empty. Is it true that nf 1 Ai is not empty ? (A related questi on about real numbers: if, for each we have :Li 1 ai < b, is it true that :Lt) 1 ai < b?) If A, B 1 , B2, are arbitrary events, show that A 0 for every A E :F (1 .3.4) P(Q) = 1 (1 .3. 5) are disjoint sets in /F, then If A1, A2, •

•

•

P(A1u A2u

·

·

·

)

=

P(A1) + P(A2) +

·

·

·

(1 .3.6) We may now give the underlying mathematical framework for probability theory. ·

A probability space is a triple (0., ff, P), where 0. is a set, :F a sigma field of subsets of n, and P a probability measure on :F.

DEFINITION.

We shall not, at this point, embark on a general study of probability measures. However, we shall establish four facts from the definition. (All sets in the arguments to follow are assumed to belong to :F.) 1 . P(0)

=

0

(1 .3. 7)

Au 0 = A; hence P(Au 0) = P(A) . But A and 0 are disjoint and so P(Au 0) = P(A) + P(0). Thus P(A) =P(A) + P(0); conse quently P(0) = 0. PROOF.

2. P(Au B)

=

P(A) + P(B) -P(AnB)

(1 .3.8)

A = (AnB)u (An Be), and these sets are disjoint (see Figure 1 .2.4). Thus P(A) = P(AnB) + P(AnBe). Similarly P(B) = P(AnB) + P(AenB). Thus P(A) + P(B) - P(A n B) = P(A n B) + P(AnBe) + P(Ae n B) = P(Au B). Intuitively, if we add the outcomes in A to those in B, we have counted those in A nB twice ; subtracting the outcomes in A n B yields the outcomes in A u B. PROOF.

3. If B

c

A, then P(B) < P(A); in fact, P(A - B)

=

P(A) - P(B)

(1 .3.9)

where A -B is the set of points that belong to A but not to B. P(B) + P(A -B), since B c A (see Figure 1 .3. 1), and the result follows because P(A -B) > 0. Intuitively, if the occurrence of B PROOF. P(A)

=

1 .3 PROBABILITY

13

A-B

FIGURE 1 .3 . 1

0

always implies the occurrence of A , A mus t occur at least as often as B in any sequence of performances of the experiment. (1 .3. 10) That is, the probability that at least one of a finite or countably infinite collection of events will occur is less than or equal to the sum of the prob abilities ; note that, for the case of two events, this follows from P(Au B) = P(A) + P(B) - P(A nB) < P(A) + P(B). We make use of the fact that any union may be written as a disjoint union, as follows. PROOF.

A1u A2u · · · = A1u (A1enA2) u (A1e nA2enA3)u · · · u (A1enA2en · · · nA �_1nAn) u · · ·

( 1. 3. 1 1)

To see this, observe that if x belongs to the set on the right then x E A1en · · · nA�_1nA n for some n ; hence x E A n . Thus x belongs to the set on the left. Conversely, if x belongs to the set on the left, then x E An for some n. Let n0 be the smallest such n. Then x E A1en · · · nA�0_1nA n 0 , and so x belongs to the set on the right. Thus 00

00

n=l

n=l

P(Alu A2u . . · ) = I P(Ale n ...nA�-1 nAn) < I P(An)

using (1 .3.9) ; notice that

The basic difficulty with the classical and frequency definitions of probability is that their approach is to try somehow to prove mathematically that, for example, the probability of picking a heart from a perfectly shuflled deck is 1/4, or that the probability of an unbiased coin coming up heads is 1/2. This cannot be done. All we can say is that if a card is picked at random and then replaced, and the

REMARKS.

14

BASIC CONCEPTS

process is repeated over and over again , the result that the ratio of hearts to total number of drawings will be close to 1/4 is in accord with our intuition and our physical experience. For this reason we should assign a probability 1/4 to the event of obtaining a heart, and similarly we should assign a probability 1/52 to each possible outcome of the experiment. The only reason for doing this is that the con sequences agree with our experience. If you decide that some mysterious factor caused the ace of spades to be more likely than any other card, you could incorporate this factor by assigning a higher probability to the ace of spades. The mathematical development of the theory would not be affected ; however, .the conclusions you might draw from this assumption would be at variance with experimental results. One can never really use mathematics to prove a specific physical fact. For example, we cannot prove mathematically that there is a physical quantity called "force. " What we can do is postulate a mathematical entity called "force" that satisfies a certain differential equation. We can build up a collection of mathematical results that, when interpreted properly, provide a reasonable description of certain physical phenomena (reasonable until another mathematical theory is constructed that provides a better description). Similarly, in probability theory we are faced with situations in which our intuition or some physical experiments we have carried out suggest certain results. Intuition and experience lead us to an assignment of probabilities to events. As far as the mathematics is concerned, any assignment of probabilities will do , subject to the rules of mathen1atical con sistency. However, our hope is to develop mathematical results that, when interpreted and related to physical experience, will help to make precise such notions as "the ratio of the number of heads to the total number of observations in a very large number of independent tosses of an unbiased coin is very likely to be very close to 1/2. " We emphasize that the insights gained by the early workers in prob ability are not to be discarded, but instead cast in a more precise form.

PRO B L E MS

1. 2.

Write down some examples of sigma fields other than the collection of all subsets of a given set n. Give an example to show that P(A - B) need not equal P(A) - P(B) if B is not a subset of A.

1 .4

COMBINATORIAL PROBLEMS

IS

1 . 4 C O M BI N A T O RI A L P R O B L E M S "

We consider a class of problems in which the assignment of probabilities can be made in a natural way. Let Q be a finite or countably infinite set, and let :F consist of all subsets of

n.

For each point wi E Q, i = 1 , 2, . . . , assign a nonnegative number pi , with "!iPi = 1 . If A is any subset of 0., let P(A) = "!wiEA Pi· Then it may be verified that P is a probability measure ; P {wi} = pi, and the probability of any event A is found by adding the probabilities of the points of A. An (Q , :F , P) of this type is called a discrete probability space. Let E1

=

Hence

1.

Throw a (biased) coin twice (see Figure 1 .4. 1). {at least one head}. Then E1 = A1 u A 2u A3

.. Example

Let E 2

=

=

P(A1) + P(A 2) + P(A3) = .36 + .24 + .24 = .84 {tail on first toss} ; then E2 = A3u A 4 P(E1)

=

{HH}, A 3 = {TH },

A1

Assign P(A1) P(A2) P(A4)

=

=

=

.36 P(A3)

=

.24

.16

FIGURE 1.4.1 Coin-Tossing Problem.

16

BASIC CONCEPTS

and In the special case when n = {w1, ..., wn} and Pi = 1/n , i = 1 , 2, . . . , n, we have number of points of A favorable outcomes P(A) = total number of points in n total outcomes corresponding to the classical definition of probability. Thus, in this case, finding P(A) · simply involves counting the number of outcomes favorable to A. When n is large, counting by hand may not be feasible ; combinatorial analysis is simply a method of counting that can often be used to avoid writing down the entire list of favorable outcomes. There is only one basic idea in combinatorial analysis, and that is the following. Suppose that a symbol is selected from the set {a1, ..., an}; if ai is chosen, a symbol is selected from the set {bil, ..., him}.Each pair of selections (ai, bii) is assumed to determine a "result" f(i, j). If all results are distinct, the number of possible results is nm, since there is a one-to-one correspondence between results and pairs of integers (i, j) , i = 1 , . . . , n, j = 1 , . . . , m. If, after the symbol bii is chosen, a symbol is selected from the set {cii1, ci12, , ciiv}, and each triple (ai, bii, ciik) determines a distinct result f(i, j, k) , the number of possible results is nmp. Analogous statements may be made for any finite sequence of selections. Certain standard selections occur frequently, and it is convenient to classify them. Let a1, ..., an be distinct symbols. •

•

•

Ordered samples of size

r,

with replacement

The number of ordered sequences (ai1, , ai), where the aik belong to {a1, ..., an}, is n x n x · · · X n ( r times) , or (1 .4. 1 ) •

•

•

(The term "with replacement" refers to the fact that if the symbol aik is selected at step k it may be selected again at any future time.) For example, the number of possible outcomes if three dice are thrown is 6 X 6 X 6 = 216. Ordered Samples of Size

r,

without Replacement

The number of ordered sequences (ai1, , ai), where the aik belong to {a1, ..., an}, but repetition is not allowed (i.e. , no ai can appear more than •

•

•

1 .4

once in the sequence) , is

n(n- 1) · · · (n - r + 1)

COMBINATORIAL PROBLEMS

n'· , (n- r) !

=

=

r

17

1, 2, . . . , n (1. 4 .2)

(The first symbol may be chosen in n ways , and the second in n - 1 ways, since the first symbol may not be used again, and so on.) The above number is sometimes called the number of permutations of r objects out of n, written (n) r ·

For example, the number of 3-digit numbers that can be formed from 1 , 2, , 9, if no digit can be repeated, is 9(8) (7) = 504. .

.

.·

Unordered Samples of Size r, without Replacement

The number of unordered sets {a i1 , , air} , where the aik' k = 1 , . . . , r, are distinct elements of {a1 , . . . , an } (i.e. , the number of ways of selecting r distinct objects out of n) , if order does not count, is •

•

•

G) - r ! (:� r) !

(1 .4.3)

To see this, consider the following process. (a) Select r distinct objects out of n without regard to order ; this can be done in c:) ways, where (�) is to be determined. (b) For each set selected in (a) , say {ai1 , , a ir }, select an ordering of , air· This can be done in (r) r = r ! ways (see Figure 1 .4.2 for n = 3, ai 1 , •

•

r

=

•

•

2). •

•

The result of performing (a) and (b) is a permutation of r objects out of n; hence

or

( n) r ! r

(�)

=

(n)r

=

n'·

(n - r) !

n r ! (n r) ! '

�

r

=

1, 2,

. .

.

,n

We define (�) to be n ! /0 ! n ! = 1 , to make the formula for C:) valid for r = 0, 1 , . . . , n. Notice that ( �) = ( nk). c:) is sometimes called the number of combinations ofr objects out ofn. 11

12

(a) (b) 12

/ ""'

FIGURE

13

/ ""'

21 13

23

/ ""'

31 23

1.4.2 Determination of (;)

.

32

18

BASIC CONCEPTS

Unordered Samples

of

Size r, with Replacement

We wish to find the number of unordered sets {ai1 , . . . , a,;}, where the "r aik belong to {a1, . . . , an } and repetition is allowed. As an example, le t n = 3 and r = 3. Let the symbols be 1 , 2, and 3. List all arrangements in a column so that a precedes b if and only if a, read as an ordinary 3-digit number, is 0]. If A and B are independent, then P(A () B) = P (A)P(B), so that P(B I A) = P(B) and P(A I B) ( = P(A () B)/P (B)) =
0. Thus P(A) is a weighted average of the conditional probabilities P(A I Bi). PROOF.

P(A)

=

P(A n Q)

=

I P(A n Bi) i

=

( (Y Bi ))

P A n =

=

P

I P(Bi)P(A I Bi) i

( Y (A n Bi))

Notice that under the above assll:mptions we have

P(Bk)P(A I Bk) ( 1 .6.6) = = I P(Bi)P(A I Bi) i This formula is sometimes referred to as Bayes' theorem ; P(Bk I A) is some ' times called an a posteriori probability. The reason for this terminology may be seen in the example below . P(Bk I A)

P(A n Bk) P(A)

..- Example 3.

Two coins are available, one unbiased and the other two headed. Choose a coin at random and toss it once ; assume that the unbiased coin is chosen with probability 3 /4. Given that the result is heads, find the probability that the two-headed coin was chosen. The "tree diagram" shown in Figure 1 .6.2 represents the experiment. We may take Q to consist of the four possible paths through the tree , with each path assigned a probability equal to the product of the probabilities assigned to each branch. Notice that we are given the probabilities of the H

unbiased T H

two-headed T FIGUR E 1 . 6.2

Tree Diagram.

1 .6

CONDITIONAL PROBABILITY

37

events B1 = {unbiased coin chosen} and B2 = {two-headed coin chosen}, as well as the conditional probabilities P (A I Bi) where A = {coin comes up heads}. This is sufficient to determine the probabilities of all events. Now we can compute P(B2 1 A) using Bayes ' theorem ; this is facilitated if, instead of trying to identify the individual terms in (1.6.6) , we simply look at the tree and write ,

P(B2 I A)

= -

P(B2nA) P(A) P{ two-headed coin chosen and coin comes up heads } P{ coin comes up heads} (1/4)(1) = � ... (3/4)(1/2) � (1/4)(1) 5

There are many situations in which an experiment consists of a sequence of steps , and the conqitional probabilities of events happening at step n � 1 , given outcomes at step n , are specified. In such cases a description by means of a tree diagram may be very convenient (see Problems) . .,._

Example 4.

A loaded die is tossed once ; if N is the result of the toss, then P {N = i} = pi , i = 1 , 2, 3 , 4, 5 , 6. If N = i, an unbiased coin is tossed independently i times. F ind the conditional probability that N will be odd, given that at least one head is obtained (see Figure 1 . 6.3). at least one head

FIGURE 1 .6.3

38

BASIC CONCEPTS

Let A = {at least one head obtained}, B P (AnB)/P(A) . Now

P(A () B)

=

.! P{ N

i=l,3,5

=

i

=

{N odd}. Then P(B I A)

=

and at least one head obtained}

- l p + 1.p + 3 1 P s a -2 1 32 s \

since when an unbiased coin is tossed independently i times , the probability of at least one head is 1 - ( l/2) i. Similarly,

P(A)

=

=

Thus

6

_! P{N

i=l 6

_! Pi( l

i=l

=

i

-

2-i)

and at least one head obtained}

PRO B LE M S

l.

2. 3.

4.

5.

In 10 Bernoulli trials find the conditional probability that all successes will occur consecutively (i.e. , no two successes will be separated by one or more failures), given that the number of successes is between four and six. If X is the number of successes in n Bernoulli trials, find the probability that X > 3 given that X > 1 . An unbiased die is tossed once. If the face is odd, an unbiased coin is tossed repeatedly; if the face is even, a biased coin with probability of heads p :#- 1/2 is tossed repeatedly. (Successive tosses of the coin are independent in each case.) If the first n throws result in heads, what is the probability that the unbiased coin is being used ? A positive integer I is selected, with P{I = n} = (1/2)n , n = 1 , 2, . . . . If I takes the value n , a coin with probability of heads e-n is tossed once. Find the probability that the resulting toss will be a head. A bridge player and his partner are known to have six spades between them. Find the probability that the spades will be split (a) 3-3 (b) 4-2 or 2-4 (c) 5-1 or 1-5 (d) 6-0 or 0-6.

1.7

39

SOME FALLACIES IN COMBINATORIAL PROBLEMS

6. An

7.

urn contains 30 white and 1 5 black balls. If 10 balls are drawn with (respec tively without) replacement, find the probability that the first two balls will be white , given that the sample contains exactly six white balls. Let C1 be an unbiased coin, and C2 a biased coin with probability of heads 3/4. At time t 0, C1 is tossed. If the result is heads, then C1 is tossed at time t 1 ; if the result is tails, C2 is tossed at t 1 . The process is repeated at t 2, 3 , . . . . In general, if heads appears at t n, then C1 is tossed at t n + 1 ; if tails appears at t n , then c2 is tossed at t n + 1. Find Yn the probability that the toss at t n will be a head (set up a difference equation). In the switching network of Figure P. 1 .6_. 8, the switches operate independently. =

=

=

=

=

=

=

=

8.

=

=

In D

c FIGURE P . 1 .6.8

Each switch closes with probability p, and remains open with probability 1 p. (a) Find the probability that a signal at the input will be received at the output. (b) Find the conditional probability that switch E is open given that a signal is received. In a certain village 20 % of the population has disease D. A test is administered which has the property that if a person has D , the test will be positive 90 % of the time , and if he does not have D , the test will still be positive 30 % of the time. All those whose test is positive are given a drug which invariably cures the disease , but produces a characteristic rash 25 % of the time. Given that a person picked at random has the rash , what is the probability that he actually had D to begin with ? -

9.

1.7

S O M E F A L L A C I E S I N C O M BI N A T O R I A L P R O B L E M S

In this section we illustrate some common traps occurring in combinatorial problems. In the first three examples there will be a multiple count . ..- Example 1.

Three cards are selected from an ordinary deck, without replacement. Find the probability of not obtaining a heart.

40

BASIC CONCEPTS

The total number of selections is (532). To find the number of favorable outcomes, notice that the first card cannot be a heart; thus we have 39 choices at step 1. Having removed one card, there are 38 nonhearts left at step 2 (and then 37 at step 3). The desired probability is c39) (3 8) (3 7) 1(532). FALLACY. In computing the number of favorable outcomes, a particular selection might be: 9 of diamonds, 8 of clubs, 7 of diamonds. Another selection is: 8 of clubs, 9 of diamonds, 7 of diamonds. In fact the 3! = 6 possible orderings of these three cards are counted separately in the numera tor (but not in the denominator). rhus the proposed answer is too9 high by a factor of 3!; the actual probability is (39)(38)(37)/3! (532) = (� )/(5i) (see example 2, Section 1 .6) . ...._ PROPOSED SoLUTION.

.,._

Example 2.

Find the probability that a five-card poker hand will result in three of a kind (three cards of the same face value x, plus two cards of face values and with x , and distinct). ( PROPOSED SoLUTION. Pick the face value to appear three times 13 possibilities). Pick three suits out of four for the "three of a kind" ((:) choices). Now one face value is excluded, so that 48 cards are left in the deck. Pick one of them as the fourth card; the fifth card can be chosen in 44 ways, since the fourth card excludes another face value. Thus the desired prob ability is (13)(:)(48)(44)/(552). FALLACY. Say the first three cards are aces. The fourth and fifth cards might be the jack of clubs and the 6 of diamonds, or equally well the 6 of diamonds and the jack of clubs. These possibilities are counted separately in the numerator but not in the denominator, so that the proposed answer is too high by a factor of 2. The actual probability is 13(:)(48)(44)/2(552) = 13(:)(\2)16/(552) [see Problem 2, Section 1.4; the factor (\2)1 6 corresponds to the selection of two distinct face values out of the remaining 12, then one card from each of these face values]. REMARK. A more complicated approach to this problem is as follows. Pick the face value x to appear three times, then pick three suits out of four, as before. Forty-nine cards remain in the deck,9 and the total number of ways of selecting two remaining cards is (� ). However, if the two face values are the same, we obtain a full house; there are 12(�) selections in which this happens (select one face value out of 12, then two suits out of four). Also, if one of the two cards has face value we obtain four of a kind; since there is only one remaining card with face value x and 48 cards remain after this one is chosen, there y

x,

z,

y,

z

1.7

SOME FALLACIES IN COMBINATORIAL PROBLEMS

41

are . 48 possibilities. Thus the probability of obtaining three of a kind

IS

(This agrees with the previous answer.) ......- Example Ten cards are drawn without replacement from an ordinary deck. Find the probability that at least nine will be of the same suit. PROPOSED SoLUTION. Pick the suit in any one of four ways, then choose nine of 1 3 face values. Forty-three cards now remain in the deck, so that the desired probability is 4(193)43/(��). FALLACY. Consider two possible selections. 1 . Spades are chosen, then face values A K Q J 10 9 8 7 6 . The last card is the 5 of spades. 2. Spades are chosen, then face values A K Q J 10 9 8 7 5. The last card is the 6 of spades (see Figure 1 .7. 1). Both selections yield the same 10 cards, but are counted separately in the computation. To find the number of duplications, notice that we can select 10 cards out of 1 3 to be involved in the t duplication; each choice of one card (ou of 1 0) for the last card yields a distinct path in Figure 1 . 7 . 1 . Of the 10 possible paths corresponding to a given selection of cards, nine are redundant. Thus the actual probability is 4 [C:) 43 - G�) 9J 3.

G�)

6 of spades

5 of spades

FIGURE

1 .7.1

Multiple Count.

42

BASIC CONCEPTS

Now

� ) c:)39 G so that the probability is [ c:)39 G�) J (i�) =

4

+

+

4

,

_....

as obtained in a straightforward manner in Problem Section 1 .4. An urn contains 10 balls bb . . . , b1 0. Five balls are drawn ..- Example without replacement. Find the probability that b8 and b9 will be included in the sample. PROPOSED SoLUTION. We are drawing half the balls, so that the proba bility that a particular ball will be included is 1/2. Thus the probability of including both b8 and b9 is (1/2)(1 /2) = 1/4. FALLACY. Let A = {b 8 is included}, B = {b 9 is included}. The 1difficulty is simply that A and B are not independent. For P(A () B) = (�)/( �) = 2/9 (after b8 and b9 are chosen, three balls are to be selected from the remaining eight). Also P(A) = P(B) = (:)/(�) = 1/2, so that P(AnB) =/= P(A)P(B) . ..- Example Two cards are drawn independently, with replacement, from an ordinary deck; at each selection all 52 cards are equally likely. Find the probability that the king of spades and the king of hearts will be chosen (in some order). PROPOSED SoLUTION. The2 number of unordered samples of size 2 out of 1 52, with replacement, is (5 +22- ) = (5;) [see (1 .4.4)]. The kings of spades and hearts constitute one such sample, so that the desired probability is 1/ (5:). FALLACY. It is not legitimate to assign equal probability to all unordered samples with replacement. If we do this we are saying, for example, that the outcomes "ace of spades, ace of spades" and "king of spades, king of hearts" have the same probability. However, this cannot be the case if independent 4.

_....

5.

1 .8 APPENDIX: STIRLING'S FORMULA

43

sampling is assumed. For the probability that the ace of spades is chosen twice is ( 1 /52)2 , while the probability that the spade and heart kings will be chosen (in some order) is P{first card is the king of spades, second card is the king of hearts} + P{first2 card is the king of hearts, second card is the king of spades} = 2 (1/52) , which is the desired probability. The main point is that we must use ordered samples with replacement in order to capture the idea of independence. _....

1 .8

A P P E N D I X : S TI R LI N G ' S F O R M U L A

An estimate of n! that is of importance both in numerical calculations and theoretical analysis is Stirling'sformula in the sense that Define (2n) ! ! (read 2n semifactorial) as 2n(2n - 2)(2n - 4) · · · 6(4)(2) , and (2n + 1) ! ! as (2n + 1) (2n - 1) · · · (5) (3) (1). We first show that '1T (2n - 1) ! ! (2n - 2) ! ! (2n) ! ! (a) < -----'-__;, -- < __� (2n + 1) ! ! 2 (2n) ! ! (2n - 1) ! ! Let Ik = J�12 (cos x)k dx, k 20, 1 , 2, . k. .1. Then 10 = 7T/2,2 11 = 1 . Integrating by parts, we obtain Ik = J�' (cos x) - d(sin x) = J�' (k - 1)(cos x)k-2 sin2 x dx. Since sin2 x = 1 - cos2 x, we have Ik = (k - 1)7TIk_2 - (k - 1)Ik or Ik = [(k - 1 )fk]lk_2• By iteration, we obtain l2n =k ( /2) [ (2n - 1) ! !/ (2n) ! ! ] and 12n+1 = [ (2n) ! ! / (2n + 1 ) ! ! ] . Since (cos x) decreases with k, so does lk, and hence 12n+1 < 12n < 12n_1 , and (a) is proved. (b) Let Qn = (2:)/22n . Then lim QnJn'1T = 1 n � oo To prove this, write PROOF.

=

-

(2n) ! (2n - 1) ! ! (2 n) ! ! ((2n)(2n - 2 ) · · · (4) (2) )2

44

BASIC CONCEPTS

Thus, by (a),

'1T 2) ! ! (2n) ! ! (2n - _;__ < Qn < (2n - 1) ! ! (2n + 1) ! ! 2

--..._ .;.. _

Multiply this inequality by to obtain

(2n - 1) ! ! (2n - 2) ! !

_

-

...;._ ...._

(2n - 1) ! ! (2n) ! ! = Qn(2n) (2n) ! ! (2n - 2) ! !

2n < n'1TQ n2 < 1 2 n + 1n'1TQ n2 � 1 , ---

If we let n � we obtain proving (b). (c) Proof of Stirling's formula. Let en = n !fnne-n�. We must show that en � 1 as n � Consider (n + 1) !/n ! = n + 1 . We have oo ,

oo .

n n en+ 1 (n + 1) + le-< + t >.J2'1T(n + 1)

(n + 1) ! n!

Thus

(

)

(

)

n - < n+ l / 2 ) .J -;, en+ l n .! = (e) 1 + = (n + 1)(e) 312 en n + 1 (n + 1) n n (1 + 1 /n) n+l/ 2 > e n. e n+ 1/e n < 1 en � c.

for sufficiently large (take logarithms and expand Now for large enough Since every mono in a power series); hence tone bounded sequence converges, a limit We must show e = 1 . By (b), 2n ) n ( .J n '1T 2-2 = 1 lim n � oo n But

(2nn ) .Jn TT T2n

(2n) ! & c 2n(2nfer.J2TT(2n) & c 2 n n = 2 n = n 2 2 2 en 2 n! n! 2 ( en(nfe) .J2'1Tn ) en 2 � e 2 , e2n fe n 2 � 1 . e2n � e cfc 2 = 1 , e = 1. =

and However, and consequently so that The theorem is proved. REMARK. The last step requires that e be >O. To see this, write Therefore

1.8

APPENDIX: STIRLING'S FORMULA

45

where c0 is defined as 1. To show that en � a nonzero limit, it suffices to show that the limit of In cn+l is finite, and for this it is sufficient to show that In In (cn+1/cn) converges to a finite limit. Now n + 1 In c;: = In [e( l + !f l/2)] = (n + D in ( 1 + !) =

1

1 -

_

(n +

�) ( n1 2n1 2 + On(n3 )) _

_

where O (n) is bounded by a constant independent of n. This is the order of 1/n2 ; hence In In (cn+1/cn) converges, and the result follows.

Ra n d o m Va riab l es

2 . 1 I N TR O D U C T I O N

In Chapter 1 we mentioned that there are situations in which not all subsets of the sample space Q can belong to the event class �, and that difficulties of this type generally arise when Q is uncountable. Such spaces may arise physically as approximations to discrete spaces with a very large number of points. For example, if a person is picked at random in the United States and his age recorded, a complete description of this experiment would involve a probability space with approximately 200 million points (if the data are recorded accurately enough, no two people have the same age). A more convenient way to describe the experiment is to group the data, for example, into 10-year intervals. We may define a function q(x), x = 5 , 15, 25, . . . , so that q(x) is the number of people, say in millions, between x - 5 and x + 5 years (see Figure 2.1.1). For example, if q(15) = 40, there are 40 million people between the ages of 10 and 20 or, on the average, 4 million per year over that 10-year span. Now if we want the probability that a person picked at random will be between 14 and 16, we can get a reasonable figure by taking the average number of people per year [4 = q(15)/10)] and multiplying by the number of years (2) to obtain (roughly) 8 million people, then dividing by the total population to obtain a probability of 8/200 = .04. 46

2. 1

q (x) 40

? .....

30

v

� �v.

��

30\

� �I

20 10 0

v�

30/ �

0

� ��

47

35 million people between 20 and 30, etc.

I�

40

INTROD UCTION

"

20-.....

18

��

10'

10 14 16 20

30

40

FIGURE 2. 1 . 1

50

60

70

�

80

90

....

X

Age Statistics.

If we connect the values ofJq(x) by a smooth curve, essentially what we are doing is evaluating (1/200) �: [q(x)/10] dx to find the probability that a person picked at random will be between 14 and 16 years old. In general, we estimate the number of people between ages a and b by J� [q(x)/10] dx so that q(x)/10 is the age density, that is, the number of people per unit age. We estimate the probability of obtaining an age between a and b by f� [q(x)/2000] dx; thus q(x)/2000 is the probability density, or probability per unit age. Thus we are led to the idea of assigning probabilities by means of an integral. We are taking Q as (a subset of) the reals, and assigning P(B) = JBf(x) dx, where f is a real-valued function defined on the reals. There are several immediate questions, namely, what Jsigma field we are using, what functions fare allowed, what we mean by Bf(x) dx, and how we know that the resulting P is a probability. For the moment suppose that we restrict ourselves to continuous or J piecewise continuous f Then we can certainly talk about Bf(x) dx, at least when B is an interval, and the integral is in the Riemann sense. Thus the appropriate sigma field should contain the intervals, and hence must be at least as big as the smallest sigma field f!4 containing the intervals ( f!4 exists; it can be described as the intersection of all sigma fields containing the intervals). 1The sigma field f!4 = f!4(E1) is called the class of Borel sets of the reals £ • Intuitively we may think of f!4 being generated by .starting with the intervals and repeatedly forming new sets by taking countable unions (and countable intersections) and1 complements in all possible ways (it turns out that there are subsets of £ that are not Borel sets) . Thus our problem will be to construct probability measures on the class of Borel sets of £1. 1The reason for considering only the Borel sets rather than S all subsets of £ is this. Suppose that we require that P(B) = BJ (x) dx .:7

48

RANDOM VARIABLES

for all intervals1 B, where f is a particular nonnegative continuous function defined on E , and J oo oo1 f(x) dx 1 . There is no probability measure on the class of all subsets of E satisfying this requirement, but there is such a meas ure on the Borel sets. Before elaborating on these ideas, it is convenient to introduce the concept of a random variable; we do this in the next section. =

2.2

D E FI N I T I O N O F A R A N D O M V A R IA B LE

Intuitively, a random variable is a- quantity that is measured in connection with a random experiment. If Q is a sample space, and the outcome of the experiment is a measuring process is carried out to obtain a number R( ). Thus a random variable is a real-valued function on a sample space. (The formal definition, which is postponed until later in the section, is somewhat more restrictive. ) ..- Example Throw a coin 10 times, and let R be the number of heads.10 We take Q = all sequences of length 10 with components H and T; 2 points altogether. A typical sample point is HHTHTTHHTH. For this point R( ) = 6. Another random variable, R1 , is the number of times a head is followed immediately by a tail. For the point above, R1 ( ) = 3 . ...._ ..- Example 2. Pick a person at random from a certain population and measure2 his height and weight. We may take the sample space to be the plane E , that is, the set of all pairs (x, y) of real numbers, with the first coordinate x representing the height and the second coordinate y the weight (we can take care of the requirement that height and weight be nonnegative by assigning probability 0 to the complement of the first quadrant). Let R1 be the height of the person selected, and let R 2 be the weight. Then R1(x, y) = x, R 2 (x, y) = y . As another example, let R3 be twice the height plus the cube root of the weight; that is, R3 = 2R1 + flR2• Then R3(x, y) = 2R1(x, y) + flR2 (x, y) = 2x + fly . ...._ ..- Example Throw two dice. We may take the sample space to be the set of all pairs of integers (x, y), x, y = 1 , 2, . . . , 6 (36 points in all) . Let R1 = the result of the first toss. Then R1(x, y) = x. Let R2 = the sum of the two faces. Then R2 (x, y) = x + y. Let R3 = 1 if at least one face is an even number; R3 0 otherwise. Then R3(6, 5) = 1 ; R3(3 , 6) = 1 ; R ( 1 , 3) = 0, and so on. ...._ w,

w

1.

w =

w

w

w

3.

3

=

..- Example 4.

2.2

DEFINITION OF A RANDOM VARIABLE

49

Imagine that we can observe the times at which electrons are emitted from the cathode of a vacuum tube, starting at time t = 0. As a sample space, we may take all infinite sequences of positive real numbers, with the components representing the emission times. Assume that the emission process never stops. Typical sample points might be w1 = (.2, 1 . 5 , 6.3 , . . . ) , w 2 = (.01 , .5, .9, 1 .7, . . . ). If R1. is the number of electrons emitted before t = 1 , then R1(w1) = 1 , R1(w 2) = 3. If R 2 is the time at which the first electron is emitted, then R 2 (w1) = .2, R 2 (w 2) = .01 . ...._

If we are interested in a random variable R defined on a given sample space , we generally want to know the probability of events involving R. Physical measurements of a quantity R generally lead to statements of the form a < R < b, and it is natural to ask for the probability that R will lie between a and b in a given performance of the experiment. Thus we are looking for P{w : a < R(w) < b } (or, equally well, P{w : a < R(w) < b }, and so on). For example, if a coin is tossed independently n times, with probability p of coming up heads on a given toss , and if R is the number of heads, we have seen in Chapter 1 that P{w : a < R (w) < b} = NoTATION.

_ p)n-k k ( ±=a (n)p l k

k

{w : a < R(w) < b } will often be abbreviated to {a < R < b}.

As another example, if two unbiased dice are tossed independently, and R 2 is the sum of the faces (Example 3 above) , then P{R 2 = 6 } = P{(5, 1), ( 1 ' 5) ' (4 ' 2) ' (2' 4 ) ' (3 ' 3)} = 5/3 6. In general an "event involving R" corresponds to a statement that the value of R lies in a set B ; that is, the event is of the form { w : R( w) E B}. Intuitively, if P{w : R(w) E I} is known for all intervals I, then P{w : R(w) E B} is deter mined for any "well-behaved" set B, the reason being that any such set can be built up from intervals. For example, P{O < R < 2 or R > 3} ( = P{R E [0, 2) u (3 , oo)}) = P{O < R < 2} + P{R > 3}. Thus it appears that in order to describe the nature of R completely, it is sufficient to know P{R E I} for each interval I. We consider in more detail the problem of characterizing a random variable in the next section ; in the remainder of this section we give the formal definition of a random variable. * For the concept of random variable to fit in with our established model for a probability space, the sets {a < R < b } must be events ; that is, they must belong to the sigma field /F. Thus a first restriction on R is that for all real a, b, the sets {w : a < R(w) < b } are in /F. Thus we can talk intelligently about the event that R lies between a and b. A question now comes up : Suppose that the sets {a < R < b } are in :F

50

RANDOM VARIABLES

for all a, b. Can we talk about the event that R belongs to a set B of reals, for B more general than a closed interval ? For example, let B = [a, b) be an interval closed on the left, open on the right. Then a < R (w) < b iff a < R(w) < b - .! n

'Thus

{w : a < R(w) < b } =

for at least one n = 1, 2, . . .

U { w : a < R(w) < b - .!)

n =l

n

and this set is a countable union of sets in !F , hence belongs to !F . In a similar fashion we can handle all types of intervals. Thus {w : R(w) E B} E !F for all intervals B. In fact {w : R(w) E B} belongs to :F for all Borel sets B. The sequence of steps by which this is proved is outlined in Problem 1 . We are now ready for the formal definition. A random variable on the probability space (Q, !F , P) is a real valued function R defined on Q, such that for every Borel subset B of the reals, { w : R( w) E B} belongs to !F .

DEFINITION.

Notice that the probability P is not in:volved in the definition at all ; if R is a random variable on (Q, !F , P) and the probability measure is changed, R is still a random variable. Notice also that, by the above discussion, to check whether a given function R is a random variable it is sufficient to know that { w : a < R( w) < b} E !F for all real a, b. In fact (Problem 2) it is suffi cient that {w : R(w) < b} E !F for all real b (or, equally well, {w : R(w) < b} E !F for all real b ; or {w : R(w) > a} E !F for all real a ; or {w : R(w) > a} E :F for all real a ; the argument is essentially the same in all cases) . Notice that if :F consists of all subsets of Q, {w : R(w) E B} automatically belongs to !F , so that in this case any real-valued function on the sample space is a random variable. Examples 1 and 3 fall into this category. Now let us consider Example 2. We take Q = the plane £ 2 , !F = the class of Borel subsets of £ 2 , that is, the smallest sigma field containing all rec tangles (we shall use "rectangle" in a very broad sense, allowing open, closed, or semiclosed rectangles, as well as infinite rectangular strips). To check that R1 is a random variable, we have {(x, y) : a < R1(x, y) < b} = {(x, y) : a < x < b} which is a rectangular strip and hence a set in /F. Similarly, R2 is a random variable. For R3, see Problem 3. Example 2 generalizes as follows. Take Q = En = all n-tuples of real

2.3

CLASSIFICATION OF RANDOM VARIABLES

51

numbers, :F the smallest sigma field containing the n-dimensional "inter vals." [If a = (a1, , bn), the interval (a, b) is defined as , an) , b (b1, {x E En : ai < xi < bi, i = 1 , . . . , n} ; closed and semiclosed intervals are defined similarly.] The coordinate functions, given by R1 (x1 , , x n) = , Rn(x1, , xn) = xn, are random variables. x1 , R 2 (x1, • • • , xn) = x2 , Example 4 involves some serious complications, since the sample points are infinite sequences of real numbers. We postpone the discussion of situations of this type until much later (Chapter 6). •

•

=

.

•

.

•

.

•

•

•

•

•

.

•

•

P RO B LE M S

* 1 . Let R be a real-valued function on a sample space n , and let CC be the collection of all subsets B of E1 such that { w : R( w) E B} E �(a) Show that CC is a sigma field. (b) If all intervals belong to CC , that is, if {ro : R ( w) E B} E � when B is an interval, show that all Borel sets belong to CC. Conclude that R is a random variable. *2. Let R be a real-valued function on a sample space n , and assume { w : R ( w) < b} E � for all real b. Show that R is a random variable. *3. In Example 2 , show that R3 is a random variable. Do this by showing that if R1 and R2 are random variables, so is R1 + R2 ; if R is a random variable, so is aR for any real a ; if R is a random variable, so is {I R. 2.3

C L A S S I F I C A T I O N OF R A N D O M VARI A B L E S

If R is a random variable on the probability space (0, /F, P) , we are gener ally interested in calculating probabilities of events involving R, that is, P{w : R(w) E B} for various (Borel) sets B. The way in which these proba bilities are calculated will depend on the particular nature of R ; in this section we examine some standard classes of random variables. The random variable R is said to be discrete iff the set of possible value s are the v·alues of of R is finite or countably infinite. In this case, if x1 , x 2 , R that belong to B, then •

•

•

P{R E B } = P{R = x1 or R = x2 or · · · } = P{R = x1} + P{R = x2 } + · · · = I PR(x)

where PR(x) , x real, is the probability function of R, defined by PR(x) = P{R = x}. Thus the probability of an event involving R is found by summing

52

RANDOM VARIABLES

the probability function over the set of points favorable to the event. In particular, the probability function determines the probability of all events involving R . .,._ Example 1.

Let R be the number of heads in two independent tosses of a coin, with the probability of heads being .6 on a given toss. Take Q = {HH, HT, TH, TT} with probabilities .36, .24 , .24 , . 1 6 assigned to the four points of Q ; take ff = all subsets. Then R has three possible values , namely , 0, 1 , and 2, and P{R = 0} = . 1 6, P{R = 1 } = . 48, P{R = 2} = .36, by inspection or by using the binomial formula P{R

=

k}

=

G)Ji 0 and "''n Pn = 1 , then FR has a jump of magnitude Pn at x = xn ; FR is constant between jumps. In the discrete case, if we are given the probability function, we can con struct the distribution function, and, conversely , given FR, we can construct 1

& (x) .48

�

.16 0

1

2

X

pR(x) .48 .36

.16 0

FIGURE 2.3. 1

1

2

Distribution and Probability Functions of a Discrete Random Variable.

2.3

PR ·

CLASSIFICATION OF RANDOM VARIABLES

53

Knowledge of either function is sufficient to determine the probability of all events involving R. We now consider the case introduced in Section 2. 1 , where probabilities are assigned by means of an integral. Let f be a nonnegative Riemann integrablet function defined on E1 with S 00oo f(x) dx = 1 . Take n = E1, :F = Borel sets. We would like to write, for each B E /F,

J

P(B) = Bf (x) d x but this makes sense only if B is an interval. However, the following result is applicable. ·

Theorem 1. Let f be a nonnegative real-valued function on E l, with

j 00 f(x ) dx = 1 . There is a unique probability measure P defined on the Borel 00 subsets of E1 , such that P(B) = JBf(x ) dx for all intervals B = (a, b] .

The theorem belongs to the domain of measure and integration theory, and will not be proved here. The theorem allows us to talk about the integral of f over an arbitrary Borel set B. We simply define JBf(x) dx as P(B), where P is the probability measure given by the theorem. The uniqueness part of the theorem may �pen be phrased as follows. If Q is a probability measure on the Borel subsets of E1 and Q(B) = SB f(x) dx for all intervals B = (a, b] , then Q(B) = JBf(x) dx for all Borel sets B. If R is defined on Q by R( w) = w (so that the outcome of the experiment is identified with the value of R) , then

P{ w : R( w) E B} = P(B) =

J;(x) dx

In particular, the distribution function of R is given by FR ( x)

= P{ w : R( w) < x} = P(- oo , x] =

so that FR is represented as an integral.

fjct) dt

The random variable R is said to be absolutely continuous iff there is a nonnegative function / = fR defined on E1 such that

DEFINITION.

for al l real x

(2.3.1)

fR is called the density function of R. We shall see in Section 2.5 that FR (x) must approach 1 as x � oo ; hence J00oo fR (x) dx = 1 .

t "Integrable" will from now on mean "Riemann integrable."

54

RANDOM VARIABLES

fR,(x) I\

1 b-a

;:: X

b

a

1

b

a

FIGURE 2. 3 . 2 Variable .

Distri bution and Density Functions of a Uniformly Distributed Random

..- Example 2.

A number R is chosen at random between a and b ; R is assumed to be uniformly distributed; that is , the probability that R will fall into an interval of length c depends only on c , not on the position of the interval within [a, b ]. , 1 We take Q = E , ff = Borel sets, R(w) = w, f(x) = fR (x) = 1/(b a) , a < x b or x < a. Define P(B) = JBf(x) dx . In particu lar, if B is a subinterval of [a, b] , then P(B) = (length of B)/(b - a). The density and distribution function of R are shown in Figure 2. 3.2 . ...._ -

NoTE .

The values of FR are probabilities, but the values of fR are not ; probabilities are found by integrating fR· FR(x)

=

P {R < x }

=

fjR

(t) dt

If R is absolutely continuous, then

P { a < R < b}

=

J:

!R(x) dx,

a O

(it constant)

x 2

Find and sketch the distribution function of R2 ; is R2 absolutely continuous ? (a) Let R1 have distribution function F1 (x)

Define R2

6.

for R1 < 2

Find the density of R2 • Let R1 be as in Problem 3 , and define R2

S.

if R1 > 1

Show that R2 is absolutely continuous and find its density. An absolutely continuous random variable R1 is uniformly distributed between - 1 and + 1 . Find and sketch either the density or the distribution function of the random variable R2, where R2 = e-R�. Let R1 have density f1(x) = 1/x2 , x > 1 ; f1(x) = 0, x < 1 . Define R2

4.

if R1 < 1

= 1 - e-x ,

x >O

= 0'

x 0

= 0'

R1

1

f8xy dx = 4y( 1 - y2)

Sketches of/1 and /2 are given in Figure 2.7. 5 .

(Figure 2. 7 .4b) ...._ y

y

y

0

1

X

0

(a)

(b) FIGURE 2.7.4

79

80

RANDOM VARIABLES

4y (l - y 2)

FIGURE 2.7.5

Problem 2

The second problem posed at the beginning of this section has a negative answer ; that is, if R1 and R 2 are each absolutely continuous then (R1 , R 2) is not necessarily absolutely continuous. Furthermore , even if (R1, R 2) is absolutely continuous , ft(x) and /2 (y) do not determine ft 2 (x, y). We give examples later in the section. However, there is an affirmative answer when the random variables are independent. We have considered the notion of independence of events , and this can be used to define independence of random variables. Intuitively , the random variables R1, , R n are independent if knowledge about some of the Ri does not change the odds about the other R/s. In other words , if Ai is an event involving Ri alone , that is, if A i = { R i E Bi}, then the events A1, , A n should be independent. Formally, we define independence as follows. •

•

•

•

•

•

Let Rb . . . ' Rn be random variables on en, F, P). Rl, . . . ' R, are said to be independent iff for all Borel subsets B1, , Bn of E1 we have

DEFINITION .

•

REMARK.

For

If R1,

.

•

•

, R n are independent, so are R1,

•

•

•

•

•

, Rlc for

k < n.

P{R l E Bl , . . . ' R k E Bk } = P{R l E Bl , . . . ' R k E Bk , - oo < R�c+ t < oo , . . . , - oo < R n < oo } = P{R l E Bl} . . . P{Rk E Bk} since P{ - oo < R i < oo } = 1 . If (R i , i E the index set I) , is an arbitrary family of random variables on the space (0 , :F, P), the Ri are said to be independent iff for each finite set of distinct indices i1, . . . , ik E I, Ri 1 , , Rirc are independent. •

•

•

2.7

RELATIONSHIP BE TWEEN JOINT AND INDIVIDUAL DENSITIES

81

We may now give the solution to Problem 2 under the hypothesis of independence. Let Rb R2 , , R n be independent random variables on a_ given probability space. If each R i is absolutely continuous with density h, then (R1, R 2 , , xn , , Rn) is absolutely continuous; also, for all x1, Theorem 1.

•

•

•

•

•

•

•

•

•

h 2 ·'· · n (x1, X2 , ··· , xn) = ft(x1)/2 (x 2) ···fn (xn) Thus in this sense the joint density is the product of the individual densities. PROOF. The joint distribution function of R1, . . . , R n is given by F12 . .. n(x1 , . . . , xn) = P{ R1 < x1, . . . , R n < xn } < X1 } < Xn } = P{R1 P{R n •

=

•

f'� f�'ft(ut) · · ·

by independence

•

· · · fn(u n)

du 1

· · ·

du n

It follows fr � m the definition of absolute continuity [see (2.6.2)] that (R1, . . . , Rn) is absolutely continuous and that the joint density isft 2 ... n (x1, . . . , x n) = /1 (X1 ) ···fn (X n) · Note that we have the following intuitive interpretation (when n = 2). From the independence of R1 and R 2 we obtain P{x < R1 < x + dx, y < R 2 < y + dy } = P{x < R 1 < x + dx}P{y < R 2 < y + dy} If there is a joint density, we have (roughly) ft 2 (x, y) dx dy = ft (x) dx f2 (y) dy, so that.h 2 (x, y) = ft(x)f2 (y) . As a consequence of this result, the statement "Let R1 , . . . , Rn be inde pendent random variables, with R i having density h,' ' is unambiguous in the sense that it completely determi-n es all probabilities of events involving the random vector (R1, . . . , R n) ; if B is an n-dimensional Borel set,

P{(R t. . . . , R n) E B} =

r Jft(x1) ·

B

· · · fn( xn)

d x1 ·

· ·

d xn

We now show that Problem 2 has a negative answer when the hypothesis of independence is dropped. We have seen that if (R1, . . . , R n) is absolutely continuous then each R i is absolutely continuous, but the converse is false

82

RANDOM VARIABLES

in general if the Ri are not independent ; that is, each of the random vari , Rn can have a density without there being a density for the ables R 1 , �-tuple (R1 , . . . , Rn) . .

•

•

Let R 1 be an absolutely continuous random variable with density f, and take R2 = R 1 ; that is, R2(w) = R 1 (w) , w E 0. Then R2 is absolutely continuous, but (R1 , R 2) is not. For suppose that (R1 , R 2) has a density g. Necessarily (R1 , R 2) E L, where L is the line y = x, but

..- Example 2.

P{(R� > R 2) E L} =

Ifg(x, y) dx dy

.

L

Since L has area 0, the integral on the right is 0. But the probability on the left is 1 , a contradiction. _.... We can also give an example to show that if R1 and R 2 are each absolutely continuous (but not necessarily independent) , then even if (R1 , R2) is ab solutely continuous, the joint density is not determined by the individual densities . ...,. Example

3.

Let

-1 < X < 1

f12( x , Y) = 1 (1 + xy), =

Since

0

elsewhere

J-1x dx J-1y dy J!12(x, y) dy 1

f1( x) =

-1 < y < 1

=

1

= �.

=

f2 ( Y) = � , =0

0

=

0, -1 < x < 1

elsewhere

-1 < y < 1

elsewhere

But if

-1 < X < 1 -1 < y < 1

0 we get the same individual densities. =

_....

elsewhere

2.7

RELATIONSHIP BETWEEN JOINT AND INDIVIDUAL DENSITIES

83

FIGURE 2.7.6

Now intuitively, if R1 and R 2 are independent, then, say, eR1 and sin R 2 should be independent, since information about eR1 should not change the odds concerning R 2 and hence should no t affect sin R 2 either. We shall prove a theorem of this type, but first we need some additional terminology. If g is a function that maps points in the set D into points in the set E,t and T c E, we define the preimage of T under g as g-1 (T) . {x E D : g (x) E T}

For example, let D = {x 1 , x 2 , x3, x4} , E = {a , b , c} , g(x1) = g(x2) = g (x3) = a, g(x ) = c (see Figure 2.7.6 ) . We then have 4 g-1 {a} g 1{a , b} g-1 {a, c} g-1 {b} -

= {x 1 , = {x 1 ,

x 2 , x a} x 2 , x3} = {x1 , x 2 , x3, x4}

= 0

Note that, by definition of preimage, x E g-1 (T) iff g(x) E T. Now let R 1 , • • • , Rn be random variables on a given probability space, and let g1 , • • • , gn be functions of one variable, that is, functions from the reals to the reals. Let R� = g1 (R1) , • . . , R� = gn (Rn) ; that is, R; (w ) = gi (Ri (w) ) , w E n. We assume that the R; are also random variables ; this will be the case if the gi are continuous or piecewise continuous. Specifically, we have the following result, which we shall use without proof. If g is a real-valued function defined on the reals, and g is piecewise con tinuous, then for each Borel set B c E1 , g-1 (B) is also a Borel subset of E 1 . (A function with this property is said to be Borel measurable.) Now we show that if gi is piecewise continuous or, more generally, Borel measurable, R; is a random variable. Let B; be a Borel subset of E1 • Then R�-1 (B�) = { w : R�(w) E B�} = { w : gi(Rlw) ) E B�} 1 = { w : Ri(w) E gi (B�)} E !F

since g£ 1 (B� ) is a Borel set.

t A common notation for such a function is g : and belongs to E for each x in D .

D

� E. It means simply that g(x) is defined

84

RANDOM VARIABLES

Similarly , if g is a continuous real-valued function defined on En , then , for 1 1 each Borel set B c £ , g- (B) is a Borel subset of En . It follows that if R1, . . . , Rn are random variables , so is g(R1, . . . , Rn ) .

If R1, . . . , Rn are independent, then R�, . . . , R� are also

Theorem 2.

independent. (For short, "functions of independent random variables are

independent.") PROOF.

If B� , . . . , B � are Borel subsets of E1, then

P { R� E B� , . . . , R � E B �}

= = = =

P{ g1(�1) E B� . . . , gn(R n) E B �} P{R1 E g!1(B�), . . . , R n E gn 1(B�)} n

II P{Ri E g£1(B �)} by independence of the Ri i=1 n

II P{gi(Ri) E B � } i =1

=

n

II P{R� E B� } i=1

PRO B L E M S 1.

Let (R1 , R2) have the following density function. f12(x, y) = 4xy = 6x2 =0

if O < x < 1 , 0 < y < 1 , x > y if O < x < 1 , 0 < y < 1 , x < y elsewhere

(a) Find the individual density functions /1 and f2. (b) If A = {R1 < � } , B = { R2 < � } , find P(A u B). 2. If (R1 , R2) is absolutely continuous with /12(x, y) = 2e-<x+y) ' =0 3.

4.

elsewhere

find /1 (x) and f2(y) . Let (R1, R2) be uniformly distributed over the parallelogram with vertices ( - 1 , 0), (1 , 0) , (2 , 1 ) , and (0, 1 ) . (a) Find and sketch the density functions of R1 and R2 • (b) A new random variable R3 is defined by R3 = R1 + R2• Show that R3 is absolutely continuous, and find and sketch its density. If R1, R2 , , Rn are independent, show that the joint distribution function is the product of the individual distribution functions ; that is, •

•

•

F1 2. . . n(x1, x2, . . . , xn) = F1 (x1 )F2 (x2)

· · ·

Fn (xn)

for all real x1, . . . , xn

2.8

FUNCTIONS OF MORE THAN ONE RANDOM VARIABLE

85

[Conversely, it can be shown that if F12 ... n(x1, , xn) = F1 (x1) • Fn (xn) for , xn , then R1, . . . , Rn are independent.] all real x1, S. Show that a random variable R is independent of itself-in other words, R and R are independent-if and only if R is degenerate, that is, essentially constant (P{R c} = 1 for some c) . 6. Under what conditions will R and sin R be independent ? (Use Problem 5 and the result that functions of independent random variables are independent.) 7. If (R1, , Rn) is absolutely continuous andf1 2 ... n (x1, , xn) = /1(x1) · · · fn (xn) for all x1, . . . , xn , show that R1, . . . , Rn are independent. 8. Let (R1, R 2) be absolutely continuous with density f1 2 (x, y) = (x + y)/8, 0 < x < 2, 0 < y < 2 ; f1 2(x, y) = 0 elsewhere. (a) Find the probability that R12 + R2 < 1 . (b) Find the conditional probability that exactly one of the random variables R1, R2 is < 1 , given that at least one of the random variables is < 1 . (c) Determine whether or not R1 and R2 are independent. •

.

.

•

•

•

•

•

=

I

•

2.8

•

•

•

•

•

F U N C T I O N S OF M O R E T H A N ONE RANDOM VARIABLE

We are now equipped to consider a wide variety of problems of the following sort. If R 1 , , Rn are random variables with a given joint density, and we define R = g (R 1 , , Rn) , we ask for the distribution or density function of R. We shall use a distribution function approach to these problems ; that is, we shall find the distribution function of R directly. There is also a density function method, but it is u s u ally not as convenient ; the density function approa

•

•

•

, R n) < z}

!1 2

x} n P {R1 > x, . . . , R n > x} = 1 - II P {Ri > x} P { T1 > x} = 1

-

i= l

F p 1(x) =

n

1

-

II ( 1

i= l

-

Fi(x))

90

RANDOM VARIABLES

REMARK.

We may also calculate Fp1 (x) as follows.

P{T1 < x} = P{at least one Ri is < x} where A i = {Ri < x} = P(A 1UA 2U · · · UA n) = P(A 1) + P(A 1c nA 2) + · · · by (1.3. 1 1) But P(A 1 cn · · · nAf_1nA i) = P{R1 > x, . . . , Ri-l > x, Ri < x} = (1 - F1 (x)) · · · (1 - Fi_1 ( x)) Fi ( x) Thus F p1 ( x) = F1 (x) + (1 - F1 (x))F 2( x) + (1 - Fl(x))(1 - F2(x))Fa (x) + · · · + (1 - F1( x)) · · · (1 - F n-1( x)) F n( x) Hence 1 - F p 1 = (1 - F1)[1 - F2 - (1 - F2) Fa - · · · - (1 - F2) · · · (1 - F n-l) Fnl = (1 - F1)(1 - F2)[1 - Fa - (1 - Fa ) F4 - · · - (1 - Fa ) · · · (1 - Fn-l) Fn ] n = II (1 - Fi) i =l as above. We now make the simplifying assumption that the Ri are absolutely con tinuous (as well as independent) , each with the same density f [Note that ·

P{Ri = R ;} = Hence

IIf(xi)f(x;) dxi dx; = 0

( if i � j)

P{Ri = R1 for at l east one i � j} < I P{Ri = R 1 } = 0 i -::f::. j

Thus ties occur with probability zero and can be ignored.] We shall show that the Tk are absolutely continuous, and find the density explicitly. We do this intuitively first. We have ·

P{x < Tk < x + dx} = P{x < Tk < x + dx, Tk = R1} + P{x < Tk < x + dx, Tk = Rn} + P{x < Tk < x + dx , Tk = R 2} + ·

·

·

by the theorem of total probability. (The events {Tk = Ri } , i = 1, . . . , n, are mutually exclusive and exhaustive.) Thus P{x < Tk < x + dx} = nP{x < Tk < x + dx, Tk = R 1} = nP{Tk = R1, x < R1 < x + dx}

by symmetry

2.8

91


Now for R 1 to be the kth smallest and fall between x and x + dx, exactly k - 1 of the random variables R2, . . . , R n must be < Rr, and the remaining n - k must be > R 1 [and R 1 must lie in (x, x + dx)]. Since there are (�-i) ways of selecting k - 1 distinct objects out of n - 1 , we have n - 1 P{x < Tk < x + dx } = n P{x < R 1 < x + dx, k- 1

(

)

R 2 < R1 , . . . , Rk < R1 , Rk+1 > R1 , . . . , R n > R1 } But if R1 falls in (x, x + dx) , Ri < R 1 is essentially the same thing as Ri < x, so that n - 1 1 k P{x < Tk < x + dx} = n J (x) dx(P{ Ri < x}) k- (P{ R i > x}) nk- 1 n - 1 1 = n J(x)(F(x))k- (1 - F(x)) n-k dx k- 1

( (

Since P{x < Tk < x to exist) , we have .

+

) )

dx} = iJc(x) dx, wherefk is the density of Tk (assumed

(

)

n - 1 k 1 J (x)(F(x)) - (1 - F(x) ) n-k k- 1 [When k = n we get nf(x) (F(x)) n-1 = (dfdx)F(x) n , and when k 1 we get nf(x) (1 - F(x)) n-1 = (dfdx) (1 - (1 - F(x)) n) , in agreement with the previous results if all Ri have distribution function F and the density f can be obtained by differentiating F. ] To obtain the result formally , we reason as follows. fk (x) = n

=

n

P {Tk < x} = L P{ Tk < x, Tk = Ri} = nP{ Tk < x, Tk = R 1} i= 1 = nP{ R 1 < x, exactly k - 1 of the variables R 2 , . . . , R n are < R1 , and the remaining n - k variables are > R 1} n - 1 P{ R1 < x, R 2 < R t> . . . , Rk < Rt> Rk+ 1 > R1 , = n k by symmetry R n > R1 }

( _ l)

•

•

•

,

RANDOM VARIABLES

92

The integrand is the density of Tk , in agreement with the intuitive approach. T1, . . . , Tn are called the order statistics of R1, . . . , R,. REMARK.

All events

{ Ri1 < x , R,;" 2 < Ri 1 ,. . .. . , R,;"k < R,; , R ,;"k+ 1 > R,;"1 , . . . , R.",.n > R .:"1 } "1

-

have the same probability, namely ,

This justifies the appeal to symmetry in the above argument . ...._

P RO B L E M S

1.

2.

Let R1 and R2 be independent and uniformly distributed between 0 and 1 . Find and sketch the distribution or density function of the random variable

Ra = R2/R12· If R1 and R2 are independent random variables, each with the density function

f(x) = e-x, x > O ; f(x) = 0, x < 0, find and sketch the distribution or density function of the random variable R3, where (a) R3 = R1 + R2 (b) R3 = R2/R1 3. Let R1 and R2 be independent, absolutely continuous random variables, each normally distributed with parameters a = 0 and b 1 ; that is, =

J1(x) = [2 (x)

=

1

v2 7T

e-x2 / 2

Find and sketch the density or distribution function of the random variable

R3

4.

R2/R1• Let R1 and R2 be independent, absolutely continuous random variables, =

each uniformly distributed between 0 and 1 . Find and sketch the distribution or density function of the random variable R3, where

Ra

=

max (R1, R2) min (R1, R2)

The example in which R3 = max (R1, R2) + min (R1, R2) may occur to the reader. However, this yields nothing new, since

REMARK.

2.8


93

max (Rr, RJ + min (Rr , RJ = R1 + R 2 (the sum of two numbers is the larger plus the small e r). 2 2 5. A point-size worm is inside an apple in the form of the sphere x + y 2 + z = 4a 2 • (Its position is uniformly distributed.) If the apple is eaten down to a core determined by the intersection of the sphere and the cylinder x2 + y2 = a2 , find the probability that the worm will be eaten. 3 6. A point (R1, R 2 , RJ is uniformly distributed over the region in E described by x2 + y 2 < 4, 0 � z < 3x. Find the probability that R3 < 2R1 • 7. Solve Problem 6 under the assumption that (R1, R2 , R3) has density f(x , y, z) = kz2 over the given region and f (x, y, z) = 0 outside the region. 8. Let T1, . . . , Tn be the order statistics of R1, . . . , Rn , where R1, . . . , Rn are independent, each with density f Show that the joint density of T1, • • • , Tn is given by x1 < x2 < · · · < Xn g (x1, • • • , Xn) = n ! f(x1) • • • f(xn), elsewhere =0 HINT : Find P{T1 O = 0, X 2R2 > 3R3. 10. A man and a woman agree to meet at a certain place some time between 1 1 and 12 o'clock. They agree that the one arriving first will wait z hours, 0 < z < 1 , for the other to arrive. Assuming that the arrival times are independent and uniformly distributed, find the probability that they will meet. 11. If n points R1, • • • , R n are picked independently and with uniform density on a straight line of length L, find the probability that no two points will be less than distance d apart ; that is, find P{min IRi - Ril > d} i =l-j HINT : First find P{mini =l- i !Ri - Ri l > d, R1 < R2 < · · · < Rn} ; show that the region of integration defined by this event is Xn_1 + d < Xn < L Xn- 2 + d < Xn-1 < L - d Xn_3 + d < Xn-2 < L - 2d x1 + d < x2 < L (n - 2)d 0 < x1 < L - (n - 1)d -

94 12.

RANDOM VARIABLES

(The density function method for functions of more than one random variable.) Let . . . , xn). be absolutely continuous with density Define random variables , Wn by Wi n , i 1 , 2, . . . , n ; thus . . . , Rn). Assume that g is one-to-one, con , Wn) tinuously differentiable with a nonzero Jacobian Ju (hence g has a continuously differentiable inverse h) . Show that ( , Wn) is absolutely continuous with density fi n ( (y l , llh (Y) I , Y

(R1, . . . , Rn) (W1, • • •

W1, • • • g(Rr,

=

=

f12... n (x1, gi(R1, • • • , R ) =

W1, • • •

2 ·· y) /12 ···n (h (y)) /12·· · n (h(y)) =

·

· · · , Yn)

=

l lu (x) lx=h(y)

n and [The result is the same if g is defined only on some open subset D of E . . . , Rn) E D} 1 .] Let and be independent random variables, each normally distributed with a 0 and the same b. Define random variables and by cos (taking > O) sin Show that and are independent, and find their density functions. Let and be independent, absolutely continuous, positive random variables and let R3 Show that the density function of R3 is given by

P{(R1, 13. R1

=

=

R2

R1 R0 00 R2 R0 00

R0

=

14.

R1

R0 R2 =

00 R1R2•

fa (z)

R0

00

=

=

L"' ��(:) f2(w) dw,

z>0

z O ; f1(x) = 0 , x < 0. Let R 2 = R12 • We may compute E(R 2) in two ways. .,._

1 . E (R 2) = E (R 1 ) =

2

f-oooo x"i1(x) dx Jo( oo x2e-11J dx =

= r( 3) = 2

by (3.2.4)

2. We may find the density of R 2 by the technique of Section 2.4 (see Figure 3 .2.3) . We have

d - e-V"Y Y >o f2 (y) = J� (JY) .J y = dy 2.JY = 0, y O ; f(x) = 0, x < 0. Let R3 = max (R1, R 2 ). We compute E(R3).

E(R3) = E[ g(R 1, R 2) ] = =

f" f"

s:s: g(x, y)f12(x, y) dx dy

max (x, y)e- "'e-u dx dy

Now max (x, y) = x if x > y ; max (x, y) = y if x < y ( see Figure 3.2.4). Thus

E(R3) =

JJxe-"'e-u dx dy JJye-"'e-u dx dy xe-"' e-u dy dx fl() ye-u lY e-"' dx dy 0 0 flO0 i"' 0 +

B

A

=

+

y

on A, max (x, y) on B, max (x, y)

=

=

x y FIGURE 3 .2.4

1 12

EXPECTATION

The two integrals are equal, since one may be obtained from the other by interchanging x and y . Thus

The moments and central moments of a random variable R, especially the mean and variance, give some information about the behavior of R. In many situations it may be difficult to compute the distribution function of R explicitly, but the calculation of some of the moments may be easier. We shall examine some problems of this type in Section 3.5. Another parameter that gives some information about a random variable R is the median of R, defined when FR is continuous as a number ft (not necessarily unique) such that FR(ft) = 1 /2 (see Figure 3.2.5a and b) . tn general the median of a random variable R is a number ft such that

FR(ft) FR(ft-)

P{R < ft} > � = P{R < p,} < �

=

(see Figure 3.2. 5c) . Loosely speaking, ft is the halfway point of the distribution function of R.

(a)

J1.

a

b

(b)

J1.

(c)

FIGURE 3 .2.5 (a) Jl is the Uniq ue Median. (b) Any Number Between a and b is a Median. (c) fl is the Unique Median.

3.2

TERM INOLOG Y AND EXAMPLES

1 13

PRO B L E M S 1.

Let R be normally distributed with mean 0 and variance E(Rn) = 0, n odd = (n

-

1 )( - 3) · · · (5)(3)(1), n

2. Let R1 have the exponential density .f1 (x)

1. n

Show that even

x > 0 ; f1 (x) = 0, x < 0. Let R1 < 1 , R2 = 0; if 1 < R1 < 2,

= e-x,

R2 = g (R1) be the largest integer < R1 (if 0 < R2 = 1 , and so on). (a) Find. E(R2) by computing f 0000 g (x)/1(x) dx . (b) Find E(R2) by evaluating the probability function of R2 and then computing

"

I'Y YPR2 (y).

3.

4.

5.

Let R1 and R2 be independent random variables, each with the exponential density f(x) = e-x, x > 0 ; f(x) = 0, x < 0. Find the expectation of (a) R1R2 (b) R1 - R2 (c) IR 1 - R2 l Let R1 and R2 be independent, each uniformly distributed between - 1 and + 1 . Find E[max (Rb R2)]. Suppose that the density function for the length R of a telephone call is x >0

f(x) = xe-x , =

The cost of a call is

C(R) = 2, = 2

X
0 and E(R) = 0, then R is essentially 0 ; that is, P{R = 0} = 1 . This we can actually prove, from the previous properties. Define Rn = 0 if 0 < R < 1 /n ; Rn = 1 /n if R > 1 /n. Then 0 < Rn < R, so that, by property 3 , E(Rn ) = 0. But Rn has only two possible values, 0 and 1 /n, and so

Thus for all n But Hence

P{R

=

0}

=

1

Notice that if R is discrete, the argument is much faster : if I x>o xp R (x) = 0, then xpR (x ) = 0 for all x > 0 ; hence PR (x) = 0 for x > 0, and therefore PR (O) = 1 . CoROLLARY. PROOF.

If m

If Var R =

=

0,

then R is essentially constant.

E(R), then E[(R - m) 2]

=

0, hence P{R

=

m}

=

1.

t Since E(R) is allowed to be infinite , expressions of the form 0 · oo will occur. The most convenient way to handle this is simply to define 0 · oo = 0 ; no inconsistency will result.

1 16

EXPECTATION

5. Let

R1 , . . . , Rn be independent random variables. (a) If all the Ri are nonnegative , then R rJ = E(R1)E(R 2) E(R1R 2 E(Rn) (b) If E(Ri) is finite for all i (whether or not the Ri > 0) , then E(R 1R 2 Rn ) is finite and '

·

·

·

·

·

·

·

·

·

\

We can prove this when all the Ri are discrete , if we accept certain facts about infinite series. For Xl ,

Xl '

.

.

•

, Xn

• . . '

Xn

Under hypothesis (a) we may restrict the x/s to be > 0. Under hypothesis (b) the above series is absolutely convergent. Since a nonnegative or ab solutely convergent series can be summed in any order , we have

If (R1 , . . . , Rn) is absolutely continuo us, the argument is similar, with sums replaced by integrals.

E(R1R 2 •

• •

s:· · -s:x1 · · · xnf12 . . . n{x1, . . . , xn) dx1 · · · dxn = J:· · · s: xl · · x11 U x1) · · · fnC xn) d x1 · · · d xn = s :xJl(x1) d xl · · · s: x nfn(xn) dxn

R n) =

·

= E(R1) · · · E(R n)

6. Let R be a random variable with finite mean 1n and variance

infinite). If a and b are real numbers , then Var (aR PROOF.

we have

+

a2 (possibly

b) = a 2 a2

Since E(aR + b) = am + b by properties 1 and 2 [and (3 . 1 .2)], Var (aR + b ) = E((aR + b - (am + b)) 2 ]

= E[a2 (R - m) 2] = a2E[(R - m) 2]

by property 2

3.3 PROPERTIES OF EXPECTATION

1 17

R1, . . . , Rn be independent random variables, each with finite mean. Then Var (R1 + + Var Rn + Rn) = Var R1 + PROOF. Let mi = E(R i) . Then 2 2 Var (R I + + Rn) = E[ Ci1 Ri - ii mi) ] = E [ ( l-1 (Ri - mi)) ] i 7. Let

· · ·

· · ·

· · ·

If this is expanded, the "cross terms" are 0, since, if i � j,

E [(Ri - mi)(Ri - mi) ] = E(RiRi - miRi - miRi + mimi) = E(Ri)E(Ri ) - miE(Ri) - miE(R i) + mimi by properties 5, 1 , and 2 since E(Ri) mi , E(Ri ) = mi = 0 Thus n n + Rn) = 2:,1 E(Ri - mi) 2 = 2:,1 Var Ri Var (R 1 + i= i= CoROLLARY. If Rr , . . . , R n are independent, each with finite mean, and a1, . . . , an, b are real numbers , then Var (a1R1 + + an2 Var Rn + anRn + b) = a12 Var R1 + PROOF. This follows from properties 6 and 7. (Notice that a1Rr, . . . , anR n =

· · ·

· · ·

· · ·

are still independent ; see Problem 1.)

{31, . . . , f3n ( > 2) can be obtained from the moments cx1, . . . , cxn, provided that cx1, . . . , cxn_1 are finite and cxn exists. To see this, expand (R - m ) n by the binomial theorem. 8. The central moments

n

Thus

E[(R - m)n] = k=Oi ( n ) (-m)n-kctk Notice that since cx1, , cxn_1 are finite, no terms of the form + f3 n

= .

k

.

•

oo

oo

can appear in the summation, and thus we may take the expectation term by term, by property 1 . This result is applied most often when n = If has finite mean hence always exists since > 0] , then + =

2. R E(R2) [ R2 (R - m)2 R2 - 2mR m2; Var R = E(R2) - 2mE(R) + m 2

EXPECTATION

1 18

That is,

a2 = E (R2)

-

[E (R)] 2

(3 .3. 1 )

which is the "mean of the square" minus the "square of the mean." 9. If E(R7�) is finite and 0 < j < k, then E (R3 ) is also finite. PROOF

I R(w) l 3 1 if I R(w) l < 1

I R(w) l k

for all w

and the result follows. Notice that the expectation of a random variable is finite if and only if the expectation of its absolute value is finite ; see (3. 1 .7). Thus in property 8, if cxn-l is finite, automatically cx1, , cxn_ 2 are finite as well. •

•

•

Properties and 7 fail without the hypothesis of independence. For example, let R1 = R 2 = R, where R has finite mean. Then E (R1R 2 ) � E(R1)E(R 2 ) since E(R2 ) - [E(R)] 2 = Var R, which is > 0 unless R is essentially constant, by the corollary to property 4. Also, Var (R1 + R 2 ) = Var (2R) = 4 Var R, which is not the same as Var R1 + Var R 2 = 2 Var R unless R is essentially constant .

5

REMARK.

PRO B LE M S

1. 2. 3.

4. 5.

, If R1, , Rn are independent random variables, show that a1R1 + b1, an Rn + bn are independent for all possible choices of the constants ai and h i . If R is normally distributed with mean m and variance a2 , evaluate the central moments of R (see Problem 1 , Section 3.2). Let () be uniformly distributed between 0 and 27T. Define R1 = cos 0, R2 = sin 0. Show that E(R1R2) = E(R1)E(R2), and also Var (R1 + R2) = Var R1 + Var R2, but R1 and R2 are not independent. Thus, in properties 5 and 7, the converse assertion is false. If E(R) exists, show that IE(R)I < E(!RI). Let R be a random variable with finite mean. Indicate how and under what conditions the moments of R can be obtained from the central moments. In •

.

.

•

•

•

3.4

particular show that E(R2) < oo if and only if Var R < an is finite if and only if Pn is finite. 3.4

CORRELATION oo .

1 19

More generally,

C O RRELA TION

If R1 and R 2 are random variables on a given probability space, we may define joint moments associated with R 1 and R 2

j, k > 0 and joint central moments

{33 k = E[(R1 - m1 ) 3 (R 2 - m2) k ],

m1 = E (R1), m 2 = E(R 2)

We shall study {311 = E[ (R1 - m1) (R 2 - m2) ] = E(R1R 2) - E (R1)E(R 2) , which is called the covariance of R1 and R 2 , written Cov {R1, R 2 ). In this section we assume that E(R1) and E(R 2 ) are finite, and E(R1R 2) exists ; then the covariance of R1 and R 2 is well defined. Theorem 1.

conversely.

If R1 and R 2 are independent, then Cov (R1, R 2) = 0, but not

PROOF. By property 5 of Section 3. 3, independence of R1 and R 2 implies that E(R1R 2 ) = E(R1)E(R 2) ; hence Cov (R1 , R2 ) = 0. An example in which Cov (R1 , R 2 ) = 0 but R1 and R 2 are not independent is given in Problem 3 of Section 3.3. We shall try to find out what the knowledge of the covariance of R1 and R 2 tells us about the random variables themselves. We first establish a very useful inequality. 2

Assume that E(R12) and E(R 22 ) are finite (R1 and R 2 then automatically have finite mean, by property 9 of Section 3.3, and finite variance, by property 8) . Then E(R1R2 ) is finite, and IE(R1R 2) 1 2 0. For any real number x let

R1

Since h(x) is the expectation of a nonnegative random variable, it must be > O for all x. The quadratic equation h(x) = 0 has either no real roots or, at

120

EXPECTATION h(x)

/

FIGURE 3 .4. 1

/

/

/ //Not possible /

Proof of the Schwarz Inequality.

worst, one real repeated root ( see Figure 3.4. 1 ). Thus the discriminant must be 0. Define the correlation coefficient of R1 and R 2 as

By Theorem 1 , if R1 and R 2 are independent, they are uncorrelated ; that is, p(R1, R 2 ) = 0, but not conversely. Apply the Schwarz inequality to R1 - E(R1) and R 2 - E(R2). IE[(R1 - ER1) (R 2 - ER 2 ) ]1 2 < E[(R1 - ER1) 2 ]E[ (R 2 - ER 2) 2 ]

PROOF.

We shall show that p is a measure of linear dependence between R1 and R 2 [more precisely , between R1 - E(R1) and R 2 - E(R2) ], in the following sense. Let us try to estimate R 2 - ER2 by a linear combination c(R1 - ER1 ) + d, that is , find the c and d that minimize E{ [ (R2 - ER 2 ) - (c(R1 - ER1) + d) ] 2} = a 22 - 2c Cov (R1, R 2) + c2 a12 + d2 = a22 - 2c pa1a 2 + c2 a12 + d2

3.4

CORRELATION

121

Clearly we can do no better than to take d = 0. Now the minimum of Ax2 + 2Bx + D occurs for x = -B/A ; hence a12c2 - 2pa1a2c + a22 is minimized when Thus the mtntmum expectatiofi is a22 - 2p 2 a22 + p2 a22 = a22 (1 - p2). For a given a22 , the closer I P I is to 1 , the better R2 is approximated (i n the mean square sense) by a linear combination aR1 + b . In particular, if I PI = 1 , then so that

with probability 1 . Thus , if I PI = 1 , then R1 - E(R1) and R2 - E(R2) are linearly de pendent. (The random variables R1, , Rn are said to be linearly dependent i ff there are real numbers a1, . . . , an, n ot all 0 , such that P{a1R1 + · · · + anRn = 0} . = 1 . ) Conversely, if R1 - E(R1) and R2 - E(R2) are linearly dependent, that is , if a(R1 - ER1) + b(R2 - ER2) = 0 with probability 1 for some constants a and b, not both 0 , then I PI = 1 (Problem 1 ) . •

PRO B LE M S

1.

2.

3. 4.

•

•

If R1 - E(R1) and R2 - E(R2) are linearly dependent, show that I p(R1, R2)1 = 1 . If aR1 + b R2 = c for some constants a, b, c, where a and b are not both 0, show that R1 - E(R1) and R2 - E(R2) are linearly dependent. Thus I p(R1, R2)1 = 1 if and only if there is a line L in the plane such that (R1 ( w) , R2( w )) lies on L for "almost" all w, that is, for all w outside a set of probability 0. Show that equality occurs in the Schwarz inequality, IE(R1R2) 1 2 = E(R12)E(R22), if and only if R1 and R2 are linearly dependent. Prove the following results. (a) Schwarz inequality for sums: For any real numbers a1 , . . . , an , b1 , . . . , bn ,

127

b}

This is the desired result. The general proof is based on the same idea. Let Ab = {R > b} ; then R > RIA b · For if w ¢= Ab , this says simply that R(w) > 0 ; if w E Ab, it says that R(w) > R(w). Thus E(R) > E(RIAb ). But RIA b > b lA b since w E Ab implies that R(w) > b. Thus '

E(R) > E(RIAb ) > E(b /Ab ) = bE(IAb ) Consequently P (A b ) < E(R) /b , as desired:

_:_

bP (A b)

(b) Let R be an arbitrary random variable, c any real number, and 8 and nz positive real numbers. Then

P{ IR -

e l > e}

PROOF.

P { I R - c l > 8}

=

P {I R -

em } < E[IR : e l m ]

by (a)

8

(c) If R has finite mean m and finite variance a2 > 0, and k is a positive real number, then

P{ IR -

ml > ka}
1 . 9 6a} = . 05 . In this case Chebyshev' s inequality says only that

P {. IR -

ml >- 1 .96a} -

f. f (x, y) d x dy JxEA yEB = r ft(x) [ l h(y I x) dy] d x = l Px (B)fl( x) d x A J � yEB

P{ R 1 E A, R2 E B} = f which is (4. 1 . 1).

x

4.3

CONDITIONAL DENSITY FUNCTIONS

1 37

y)

We have seen that if (R1, R 2 ) has density f(x, and R1 has density f1(x) we have a conditional density I x) = f(x, y) /ft (x) for R2, given R1 = x. Let us reverse this process. Suppose that we observe a random variable R 1 with density ft(x) ; if R1 = x, we observe a random variable R 2 with density I x) . If we accept the continuous version of the theorem of total prob ability, we may calculate the joint distribution function of R1 and R 2 using (4. 1 . 1 ) .

h(y

h (y

f':P{R2 < Yo I R1 = x} fl(x) dx = f:[f: h(y I x) dy] !l(x) dx f: f':fi(x)h(y I x) dy dx

F(xo , Yo ) = P{R1 < Xo , R 2 < Yo } =

=

y)

Thus (R1, R 2 ) has a density given by f(x, = ft(x)h(y I x) , in agreement with (4.3. 1) . To summarize: We may look at the formula f(x, = ft(x)h(y I x) in two ways. we have a natural notion of conditional 1 . If (R1, R 2) has density f(x, probability.

y),

y)

2. If R1 has density ft(x) , and whenever R1 = x we select R 2 with density

h(y I x) , then in the natural formulation of this problem (R1 , R2) has density f(x, y) = ft(x) h( y I x).

In both cases "natural" indicates that (4. 1 . 1) , the continuous version of the theorem of total probability, is required to hold. We may extend these results to higher dimensions. For example, if (R1, R 2 , R3 , R4) has density f(x1, x 2 , x3 , x4) , we define (say) the conditional density of (R3 , R4) given (R1, R 2) , as ( xb X2 , Xa, x4) f a, h( X x4 I xb x2) = f1 2 (xb X2 ) where

f1 2 (x1, x2) =

1: i:f(xb X2 , Xa, xJ dxa dx4

The conditional probability that (R3 , R4) belongs to the two-dimensional Borel set B, given that R1 = x1, R 2 = x 2 , is defined by

P x l[c2(B) = P {(R 3 , R4) E B I R1 = xb R 2 = x2} =

ff B

h( x3 , x4 I xb x2) d x3 dx4

1 38

CONDITIONAL PROBABILITY AND EXPECTATION

The appropriate version of the theorem of total probability is

P {(Rt. R 2) E A , (R 3 , R4) E B} =

JJP.,,.,.(B)f12(xt. x2) dx1 dx2 A

If (R1, R 2 ) has density ft 2 (x1, x 2) , and having observed R1 = x1, R 2 = x 2 , we select (R3 , R4) \Vith density h(x3 , x4 1 x1, x 2), then (R1, R 2 , R3 , R4) must have density f(x1, x 2 , x3 , x4) = ft 2 (x1, x 2)h(x3 , x4 1 x1, x 2). Let us do some examples . 1.

We arrive at a bus stop at time t = 0. Two buses A and B are in operation. The arrival time R1 of bus A is uniformly distributed be tween 0 and tA minutes , and the arrival time R 2 of bus B is uniformly distrib uted between 0 and tB minutes, with tA < tB . The arrival times are independent. Find the probability that bus A will arrive first. We are looking for the probability that R1 < R 2• Since R1 and R 2 are independent (and have a joint density) , the conditional density of R 2 given R1 is f (x, y) = ( y) = f2 _!_ ' tB ft(x) ..- Example

If bus A arrives at x, 0 < x < tA, it will be first provided that bus B arrives between x and tB . This happens with probability (tB - x)ftB . Thus

By (4. 1 .2),

=

L

0

tA (

)

X 1 t 1 - - - d x = 1 - _4.. tB tA

2 tB

[Formally, taking the sample space as £2 , we have C = {R1 < R 2} = {(x, y) : x < y}, C x = {y : x < y}, Px ( Cx) = P{R1 < R 2 I R1 = x} = 1 xftB , 0 < X < tA.] Alternatively, we may simply use the joint density :

P{R 1 < R2} =

as before .

.,...

JJ f(x, y) dx dy

x 0. If R0 = it, we take n independent observations R1, R2 , , Rn , each Ri having the exponential density /;.(y) = ite-;.11 , y � 0 ( = 0 for y < 0) . Find the conditional density of R0 given (R1, R2, , Rn ) · Here we have specifiedfo(it), the density of R0, and the conditional density of (R1, R 2 , , Rn) given R0, namely, Example •

•

•

•

•

•

•

•

•

by the independence assumption The joint density of R0, R1 ,

.

.

.

, Rn is therefore

J(it, xl , . . . ' xn) = fo (A)h(xb . . . ' xn I it) The joint density of R1 , . , R n is given by .

g( xt , · . . , x n)

=

=

Ane-l ( l+x>

.

oo oo J-oo f(A, xt, . . . , x,.) dA Jfo ;.ne-- 0, and lf(x) l < A 2eK 2x for x < 0, then L1(s) is finite for K1 < Re s < K2 • This follows, since

and

L'"

i f (x) e-•"' i d x

Region of Convergence

L1(s)

[(x) u(x) e-axu(x) xne-axu(x) ' xa.e-axu(x) ,

f

0, -

1, . . . 1

1/s 1/(s + a) n!f(s + a)n+1 r(a + 1)/(s + a)a.+ 1

Re s Re s Re s Re s

>0 > -a > -a > -a

x > 0 ; u(x) = 0, x < 0. If we verify the last entry in the table the others will follow. Now ae-y y x x dy xae-a e-s d x = [with y = (s + a) x ] 1 + a o s � ( + a) o r(oc + 1 ) t = 1 + (s + a) a

oo i

(oo

Strictly speaking, these manipulations are only valid for s real and > -a. However, one can show that under the hypothesis of property 1 L1 is analytic for K1 < Re s < K2 . In the present case K1 = - a and K2 can be taken arbitrarily large, so that L1 is analytic for Re s > - a . Now L1(s) = r(a + 1)/(s + a)a+1 for s real and > -a, and therefore, by the identity theorem for analytic functions, the formula holds for all s with Re s > -a. This technique, which allows one to treat certain complex integrals as if the integrands were real-valued, will be used several times without further comment.

t

1 58

CHARACTERISTIC . FUNCTIONS

u(x) and - u( x) have the same Laplace transform 1/s, but the regions of convergence are disjoint :

REMARK.

-

100

dx

=

- u( - x) e-sx d x

=

- oo

and

100

- oo

u ( x) e

-n

1000 0 1

-sx e

dx

-s x -e

- oo

=

dx

1 S

-

Re s > 0

,

=

1

-

S

Re s < 0

,

This indicates that any statement about Laplace transforms should be accompanied by some information about the region of convergence. We need the following result in doing examples ; the proof is measure theoretic and will be omitted. 5. Let R be an absolutely continuous random variable. If h is a nonnega tive (piecewise continuous) function and Lh(s) is finite and coincides with the generalized characteristic function NR (s) for all s on the line Re s = a , then h is the density of R.

5.2

E X A MPLE S

We are going to examine some typical problems involving sums of independ ent random variables. We shall use the result, to be justified in Example 6, that if R1 , R 2 , , Rn are independent , each absolutely continuous, then R1 + + Rn is also absolutely continuous. In all examples Ni(s) will denote the generalized characteristic function of the random variable Ri . •

·

·

•

•

·

Let R1 and R 2 be independent random variables, with R1 uniformly. distributed between - 1 and + 1 , and R2 having the exponential density e-vu(y). Find the density of R0 = R1 + R 2 We have

..- Example 1.

•

all s Re s > - 1

Thus, by Theorem 1 of Section 5. 1 , N0(s)

=

N1 (s)N2 (s)

=

2s (s

1

+ 1)

(e8 - e-8)

5.2

i [e-(x- 1)

_

e

EXAMPLES

1 59

- (x + l)J

----.---�----�--� X

-1

1

0

FIGURE 5.2. 1

at least for Re s > - 1 . To find a function with this Laplace transform, we use partial fraction expansion of the rational function part of N0(s) : 2 s(s

1

1-

1)

=

1

-

2s

-

2(s

1

1)

�

Now from Table 5. 1 . 1 , u(x) l1as transform 1 /s ( Re s > 0) and e- xu(x) has transform 1/(s � 1) (Re s > - 1) . Thus (1/2) (1 - e-x) u(x) has transform 1 f2s(s � 1) (Re s > 0) . By property 2 of Laplace transforms (Section 5. 1) , (1 /2) (1 - e- < x-t1> )u(x � 1) has transform e8f2s(s � 1) and (1 /2)(1 - e- < x-l > )u(x - 1) has transform e-sj2s(s � 1) (Re s > 0). Thus a function h whose transform is N0(s) for Re s > 0 is h(x) = � (1 - e- < x+1 > ) u (x

�

1) - � (1 - e- < x-1> ) u (x - 1)

By property 5 of Laplace transforms, h is the density of R0 ; for a sketch, see Figure 5.2. 1 . ...._ Let R0 = R1 � R 2 � R3, where Rb R 2, and R3 are independ ent with densities /1(x) = /2 (x) = exu(-x), f3(x) = e- < x-1 > u (x - 1) . Find the density of R0 • We have

�

Example 2.

1

N1(s) = N2(s) = {o e "'e-•"' d x = and Thus

f1

J- oo

oo e- ( x-1 ) e-sx d ) x N3 ( s N0(s) = N1(s)N2(s)N3(s) =

-

1 -

e-s

s

e-s

�

(s - 1 ) 2(s

�

1

S

Re s < 1 Re s > - 1

, 1)

,

,

- 1 < Re s < 1

160

CHARACTERISTIC FUNCTIONS

We expand the rational function in partial fractions.

A + B + C + (s - 1 ) 2 s - 1 s 1 The coefficients may be found as follows. A = [(s - 1 ) 2G(s) ]s =1 = � = -l B = (( s - 1) 2G(s)) =1 ds s + C = [(s 1)G(s)] s = -1 = l + From Table 5.1 . 1 , the transform of xexu(x) is 1/(s 1) 2 , Re s > - 1 . By Laplace transform property 3, the transform of -xexu(-x) is 1/(1 - s) 2 , + Re s < 1 . The transform of e-xu(x) is 1/(s 1), Re s > - 1 , so that, again by property 3 , the transform of exu(-x) is 1/(1 - s) , Re s < 1 . Thus the transform of + + - !xexu(-x) l exu(-x) l e-xu(x) . 1

[!!_

J

IS

1 /4 1/2 ( s - 1) 2 s - 1 By property 2, the transform of G(s) =

.

_

+

1 /4 s

+

1

,

h(x) = [l - � (x - 1)]ex-1u( - (x - 1))

IS

- 1 < Re s < 1 +

1 le- < x- > u(x - 1)

- 1 < Re s < 1

By property 5, h is the density of R0 (see Figure 5.2.2) .

...._

h(x)

0

1

FIGURE 5.2.2

+ h(x) = (l � (1 - x))ex-1 , = } e- < x-1 > ' x>1

x< 1

.,._

Example

3.

5.2

EXAMPLES

161

Let R have the Cauchy density; that is,

fR(x) = -rr( l +1 x2) ' The characteristic function of R is

MR(u) = J:e -iuxfR(x) dx

R(

[In this case NR (s) is finite only for s on the imaginary axis.] M u) turns out to be e l l . This may be verified by complex variable methods (see Problem 9) , but instead we give a rough· sketch of another attack. If the characteristic function of a random variable R is integrable, that is,

-u

R oo 1 x ei u du (u) M fR(x) = R f 27T oo

it turns out that R has a density and in fact f is given by the inverse Fourier transform. In the present case

0 00 f- e -l u l du = f eu du + r 00 e-u du = 2
du + -27T1 i000 e-u ( l-z.x ) du J_[ 1 + 1 = 1 2 = 27T 1 + ix 1 ixJ 7T(1 + x ) .

.

-

Thus the Cauchy density in fact corresponds to the characteristic function e- l u l . This argument has a serious gap. We started with the assumption that e- l u l was the characteristic function of some random variable, and deduced We must from this that the random variable must have density 1/Tr(l + establish that e-l u l is in fact a characteristic function (see Problem 8) . Now let R0 = R1 + + Rn , where the Ri are independent, each with the Cauchy density. Let us find the density of R0 • We have

x2).

·

·

·

162


If instead we consider R0fn, we obtain

=

MRo / n (u )

E [e-i uRo f n ]

= Mo

( :) =

e- l u l

Thus R0/n has the Cauchy density. Now if R 2 = nR1 ·, then /2 (y) (1/n)ft(yfn) (see Section 2.4) , and so the density of R0 is n 1 n7r(1 + y 2fn 2) Tr(Y 2 + n 2)

fo(Y) =

REMARKS. 1 . The arithmetic average R0fn of a sequence of independent Cauchy distributed random variables has the same density as each of the components. There is no convergence of the arithmetic average to a constant, as we might expect physically. The trouble is that E(R) does not exist. 2.

If R has the Cauchy density and R1 = c1R, R 2 = c 2R, c1, c 2 constant and > 0, then I

M1(u) = E( e-iu R1) = E(e-i u c 1R) = MR (c1u) = e-c1 l u l and similarly M2 (u ) = e-c2 l u l Thus, if R0 = R1 + R 2 = (c1 + c 2) R, Mo (u) = e- < c t+C 2 ) I u l which happens to be M1(u)M2 (u). This shows that if the char acteristic function of the sum of two random variables is the prod uct of the characteristic functions, the random variables need not be independent. 3. If R has the Cauchy density and R1 = OR, fJ > 0, then by the calculation performed before Remark 1, R1 has density /1(y) = 0/Tr (y2 + 0 2) and (as in Remark 2) characteristic function M1(u) = e-8l u l : A random variabl� with this density is said to be of the Cauchy type with parameter fJ or to have the Cauchy density with parameter fJ. The formula for M1(u) shows immediately that if R1 , . . . , Rn are independent and Ri is of the Cauchy type with parameter fJ i , i = 1 , . . . , n, then R1 + · · · + Rn is of the Cauchy type with parameter f) l + . . . + f) n · ..... ..- Example 4.

If R1 , R 2 , , Rn are independent and normally distributed, then R0 = R1 + · · · + Rn is also normally distributed. •

•

•

5.2

EXAMPLES

1 63

We first show that if R is normally distributed with mean m and variance a2 , then (all s) (5.2.2) Now

NR (

[00 s) = e-sx R .. - oo

f (x) dx =

l

oo

- oo

1 e-sx e-(x-m ) f 2a dx .J 2Tr a 2

2

Let y = (x - m)/.J 2 a and cop1plete the square to obtain

by (2.8.2)

No(s) = Nt(s)N2 (s) . . . Nn(s)

=

e-s < ml+...+ mn > e s < al +···+an > 12 2

2

2

But this is the characteristic function of a normally distributed random + mn, a0 2 = variable, and the result follows. Note that m0 = m1 + a12 + · + an 2 , as we should expect from the results of Section 3.3. _.... ·

·

·

·

·

.,._ Example 5.

Let R have the Poisson distribution. PR(k)

=

e-;.Ak k!

'

k

=

0, 1, . . .

We first show that the generalized characteristic function of R is (all s)

We have

(5 .2.3)

as asserted. We now show that if R1 , . . . , Rn are independent random variables, each with the Poisson distribution, then R0 = R1 + · · · + Rn also has the poisson distribution. If Ri has the Poisson distribution with parameter Ai, then

No(s)

=

N1(s) N2 (s) · · Nn(s) = exp [(A1 + · · + An) (e-s - 1 ) ] ·

·

This is the characteristic function of a Poisson random variable, and the result follows. Note that if R has the Poisson distribution with parameter

1 64


A, then E(R) = Var R = A (see Problem 8, Section 3.2). Thus the result that the parameter of R0 is A1 + · · · + An is consistent with the fact that E(R0) = E(R1) + · · · + E(Rn) and Var R0 = Var R1 + · · · + Var Rn . ...._

.,._

Example 6.

In certain situations (especially when the Laplace transforms cannot be expressed in closed form) it may be convenient to use a convolu tion procedure rather than the transform technique to find the density of a sum of independent random variables. The method is based on the following result.

Let R1 and R 2 be independent random variables, having densities /1 and /2 , respectively. Let R0 = R1 + R 2• Then R0 has a density given by Convolution Theorem.

fo (z) =

i:Mz - x)f1(x) dx = i:.II(z - y)f2(y) dy

(5.2.4)

(Intuitively, the probability that R1 lies in (x, x + dx] is f1(x) dx ; given that R1 = x, the probability that R0 lies in (z, z + dz] is the probability that R 2 lies in (z - x, z - x + dz] , namely , /2 (z - x) dz. Integrating with respect to x, we obtain the result that the probability that R0 lies in (z, z + dz] is dz

i!2(z - x)f1(x) dx

Since the probability is f0(z) dz, (5.2.4) follows.) PROOF.

To prove the convolution theorem, observe that F0(z) = P{R1 + R 2 < z} =

=

II fi(x)f2(y) dx dy

J: [ f00" f2(y) dyJ !1(x) dx

Let y = u - x to obtain

J: [fJ2(u - x) du] f1(x) dx = foo [ J:fi(x)f2(u - x) dx] du

This proves the first relation of (5.2.4) ; the other follows by a symmetrical argument. We consider a numerical example. Let .f1(x) = 1/x2 , x > 1 ; .f1(x) = 0 , x < 1 . Let f2 (y) = 1 , 0 < y < 1 ; /2 (y) = 0 elsewhere. If z < 1 , /0(z) = 0 ; if 1 < z < 2, oo 1 fo(z) = j1(x)f2 (z - x) d x = -2 d x = 1 - 1 1X Z - oo

J

iz

-

5.2

-----L--'---'-+- X z

_._____ __.. �X

-1

_

ft(x)

ft (x)

1

1

{0 (z)

_1_

.__ .! z

---'----'---�

1

0

FIGURE 5.2. 3

If z

2

z

Application of the Convolution Theorem.

> 2, fo (z) =

(see Figure 5.2.3) .

1 65

z- 1 z

z- 1

0

EXAMPLES

l

z

1

1 2 dx =

z- 1 X

Z -

- 1

-

1

Z

The successive application of the convolution theorem shows that if R 1 , . . . , Rn are independent, each absolutely continuous , then R 1 + · · · + Rn is absolutely continuous. _....

REMARK.

P RO B L E M S

Let R1 , R2 , and R3 be independent random variables, each uniformly distrib uted between - 1 and + 1 . Find and sketch the density function of the random variable Ro = R1 + R2 + R3. 2. Two independent random variables R1 and R2 each have the density function [ (x) = 1/3 , - 1 < x < 0 ; [ (x) 2/3 , 0 < x < 1 ; f(x) = 0 elsewhere. Find and sketch the density function of R1 + R2• 2 2 3. Let R = R1 + , Rn are independent , and each Ri · + Rn , where R1 , is normal with mean 0 and variance 1 . Show that the density of R is

1.

=

·

·

•

•

•

1 n f (x) = nl2r(n/2) x< /2 )-le-x/ 2 ' 2

(R is said to have the "chi-square" distribution with n "degrees of freedom.")

1 66


4. A random variable R is said tq have the "gamma distribution" if its density is, for some a , fJ > 0, [(x)

=

xa.-Ie- a:l P

r( a){Ja. , X > 0 ; [(x)

=

0,

X

< 0

Show that if R1 and R2 are independent random variables, each having the gamma distribution with the same {J, then R1 + R2 also has the gamma distribution. 5. If R1, . . . , Rn are independent nonnegative random variables, each with density .Ae-lxu(x), find the density of R0 = R1 + · · · + Rn. 6. Let () be uniformly distributed between - 'T/'/2 and 'T/'/2. Show that tan () has the _ Cauchy density. 7. Let R have density f(x) = 1 - lxl , !xl < 1 ; f(x) = 0 , lxl > 1 . Show that MR (u) = 2(1 - cos u)fu2 • *8. (a) Suppose that f is the density of a random variable and the associated characteristic function M is real-valued, nonnegative, and integrable. Show that kf(u), oo < u < oo , is the characteristic function of a random variable with density kM(x)/271', where k is chosen so that kf(O) = 1 , that is, -:-

J:

(kM(x)/2,.] dx

=

1

(b) Use part (a) to show that the following are characteristic functions of random variables : (i) e- l ul , (ii) M(u) = 1 - l ui , l ui < 1 ; M(u) = 0 , lui > 1 . *9. Use the calculus of residues to evaluate the characteristic function of the Cauchy density. 10. Calculate the characteristic function of the normal (0, 1) random variable as follows. Differentiate e x2/2 (cos ux) .y- dx M(u) =

foo

- 00

-

271'

under the integral sign ; then integrate by parts to obtain M' (u) = - uM(u). Solve the resulting differential equation to obtain M(u) = e-u 2 12• From this, find the characteristic function of a random variable that is normal with mean 2 m and variance a •

5.3

PROPERTIES OF CHARA CTERISTIC FUNCTIONS

Let R be a random variable with characteristic function M and generalized characteristic function N. We shall establish several properties of M and N.

5.3 PROPERTIES OF CHARACTERISTIC FUNCTIONS

M(O) I M(u)l

1. = N(O) = 1 . =N This follows, since < 1 for all 2. If R has a density f, we have

M(O) u.

1 67

(O) = E(e ) . 0

I M(u) l = J:e-iure f(x) dx < J:l e-iu'i(x)l dx = Jj(x) dx = 1 The general case can be handled by replacing f(x) dx by dF(x), where

F

is the distribution function of R. This involves Riemann-Stieltjes integration, which we shall not enter into here. 3. If R has a density f, and f is even., that is , = for all then is real-valued for all For

M(u)

f( - x) f(x)

u.

x,

M(u) = Jj(x) cos ux dx - Jj(x) sin ux dx Since f(x) is an even function of x and sin ux is an odd function of x, f (x) sin ux is odd ; hence the second integral is 0. It turns out that the assertion that M(u) is real for all u is equivalent to the statement that R has a symmetric distribution, that is, P{R E B} = P{R E - B} for every Borel set B. (-B = { - x x E B}.) 4. If R is a discrete random variable taking on only integer values , then M(u + 27T) = M(u) for all u. To see this , write M(u) = E(e-iuR) = n=! ooPne-iun (5.3. 1) i

:

00

where Pn = P{R = n}. Since e- i u n = e- i < u+ 2 1T> n , the result fol lo ws . Note that the Pn are the coefficients of the Fourier series of on the interval [0, 27T] . If we multiply (5.3. 1) by and integrate, we obtain

eiuk Pk = _!_27T l M(u) eiuk du 2 7I

0

M

(5. 3.2)

We come now to the important moment-generating property. Suppose that N(s) can be expanded in a power series about s = 0.

N(s) = k!=Oaksk 00

where the series converges in some neighborhood of the origin. This is just the Taylor expansion of N; hence the coefficients must be given by

k d ak = l.k ! dsN(s)k J S=O

1 68


=

But if R has density f and we can di�erentiate N(s) J�oo e-sx f(x) dx under the integral sign, we obtain N'(s) J oooo - xe-sxf(x) dx ; if we can differentiate k times, we find that

=

Thus

(5.3.3) and hence

The precise statement is as follows. 5. If NR (s) is analytic at s 0 (i.e. , expandable in a power series in a neighborhood of s 0) , then all moments of R are finite, and

=

=

(5.3.4) within the radius of convergence of the series. In particular, (5.3.3) holds for all k. We shall not give a proof of (5.3.4). The above remarks make it at least plausible ; further evidence is presented by the following argument. If R has density f, then N(s)

= J:e-""'f (x) dx oo ( S 2X 2 S 3X3 = J-oo 1 - sx + -2 ! - --3 ! +

· · ·

(- l ) ks kxk + + k!

· · ·

)

f(x) d x

If we are allowed to integrate term by term, we obtain (5.3.4). Let us verify (5.3.4) for a numerical example. Let f(x) e-xu(x) , so that N(s) 1 / (s + 1), Re s > 1 We have a power series expansion for N(s) about s 0.

=

=

1 1+s

-

=

.

= 1 - s + s 2 - s3 +

· · ·

-

+ ( 1) ks k +

· ·

·

ls i < 1

Equation 5.3.4 indicates that we should have ( - 1)kE(Rk)fk ! E(Rk) k ! To check this, notice that

=

= (- 1) k , or

5.4

THE CENTRAL LIMIT THEOREM

1 69

Let R be a discrete random variable taking on only nonnegative integer values. In the generalized characteristic function

REMARK.

N (s)

= _2 pke-sk, 00

Pk = P{R = k}

k=O

make the substitution z = e-s . We obtain A(z)

=

N(s) ]z = e-s

=

E(zR)

= _2 p�k 00

k=O

A is called the generating function of R ; it is finite at least for lzl < 1 ,

since .2� o P k = 1 .

We consider generating functions in detail in connection with the random walk problem in Chapter 6.

PRO B L E M S

1.

Could [2/(s + 1)] - (1/s)(1 - e-8) (Re s > 0) be the generalized characteristic function of an (absolutely continuous) random variable ? Explain. 2. If the density of a random variable R is zero for x ¢= the finite interval [a , b ] , show that NR (s) is finite for all s. 3. We have stated that if MR (u) is integrable, R has a density [see (5.2. 1)]. Is the converse true ? 4. Let R have a lattice distribution ; that is, R is discrete and takes on the values a + nd, where a and d are fixed real numbers and n ranges over the integers. What can be said about the characteristic function of R ? 5. If R has the Poisson distribution with parameter .A, calculate the mean and variance of R by differentiating NR (s) . 5 . 4 THE CENTRAL LIMIT THEOREM

The weak law of large numbers states that if, for each n, R1, R 2 , , Rn are independent random variables with finite expectations and uniformly bounded variances , then , for every s > 0, •

as n �

oo

•

•

1 70


In particular, if the R i are independent observations of a random variable R (with finite mean m and finite variance a2) , then � oo

as n

The central limit theorem gives further information ; it says roughly that for large n , the sum R1 + · · · + Rn of n independent random variables is approximately normally distributed, under wide conditions on the individual

Ri .

To make the idea of "approximately normal" more precise, we need the notion of convergence in distribution. Let R1, R2, be random variables with distribution functions F1 , F2, , and let R be a random variable with distribution function F. We say that the sequence R1 , R 2, converges in •

•

•

•

•

•

•

•

•

distribution to R (notation : R n d > R) iff Fn (x) � F(x) at all points x at which F is continuous. T o se e the reason for the restriction to continuity points of F, consider the following example .

..,.. Example 1 .

Let Rn be uniformly distributed between 0 and 1 /n (see Figure 5.4. 1) . Intuitively, as n � oo , Rn approximates more and more closely a random variable R that is identically 0. But Fn(x) � F(x) when x � 0 , but not at x = 0, since Fn(O) = 0 for all n, and F(O) = 1 . Since x = 0 is not a continuity point of F, we have Rn d > R . _.... Fn(x)

F(x) 1

1 0 ��

0

----�----� x

0

1

n

f,(x) n

1

::: X

n

FIGUR.E 5.4. 1

Convergence in Distribution.

5.4


171

REMARK.

The type of convergence involved in the weak law of large numbers is called convergence in probability. The sequence R1 , R2 , is said to converge in probability to R (notation : Rn P > R) iff for every s > 0, P{IRn - Rl > s} � 0 as n � oo . Intuitively , for large n , R n is very likely to be very close to R. Thus the weak law of large numbers states that ( 1 /n ) '!,f 1 (Ri - E(Ri)) P ) 0 ; in the case in which E(Ri) = m for all i, we have •

•

•

n - !, Ri 1

n i= l

P>

m

The relation between convergence in probability and convergence in distribution is outlined in Problem 1 . The basic result about convergence in distribution is the following. Theorem 1. The sequence R1, R 2 , converges only if Mn(u) � M(u) for all u, where Mn is the Rn, and M is the characteristic function of R. •

•

.

in distribution to R if and characteristic function of

The proof is measure-theoretic , and will be omitted. Thus, in order to show that a sequence converges in distribution to a normal random variable, it suffices to show that the corresponding sequence of characteristic functions converges to a normal characteristic function. This is the technique that will be used to prove the main theorem, which we now state.

For each n, let R1, R 2 , , Rn be independent random variables on a given probability space. Assume that the Ri all have the same density function f (and characteristic function M) with finite mean m andfinite variance a2 > 0 , and finite third moment as well. Let Theorem 2. (Central Limit Theorem).

n. !, R3 - nm Tn = .Jfi a

•

•

•

i=I

(= [Sn - E(Sn)]/a(Sn) , where Sn = R1 + · · · + Rn and a(Sn) is the standard deviation of Sn) so that Tn has mean 0 and variance 1 . Then T1, converge in distribution to a random variable that is normal with mean T2 , •

0

•

•

and variance

1.

*Before giving the proof, we need some preliminaries.

172


Let f be a complex-valued function on E1 with n continuous derivatives on the interval V = (-b, b). Then, on V, n k k 1 -1 ) ( 1 O )u (1 - t) n-1 < n > ( J n f(u) = k=OL k ! + u o (n - 1) ! f (tu) dt Thus, ifiJ < n > l < M o n V, wh ere 1 0 1 < M (0 depends on u) Theorem

3.

PROOF.

Using integration by parts, we obtain

n-1 f ( n-1)(t) u + ruf ( n-1 ) (t) (u - t)n-2 dt t) (u o Jo (n - 1) ! [ (n - 1) ! (n - 2) ! J n n n ' 0 1) 1 2 ( u ) ( ) u f (u t n 1) t = + J ( (t) d (n - 1) ! Jo (n - 2) ! n n n n u 0 -1) 0 -2) 2 ( 1 ( ) ) ( ( u J J = (n - 1) ! (n - 2) ! rJouf ( n-2) (t) (u(n-- t)3)n-! 3 dt n k k 1 ) ( O u ) ( u J = - L + J J '(t) dt by iteration k=1 k ! n n k 1 1 ) ( O u t) uk ) ( ( u J n f(u) = kL=O k ! + Jo J < > (t) (n - 1) ! dt

n 1 ) (u t n c j ) f (t) dt =

Jo

_

-

+

0

Thus

The change of variables t = ut ' in the above integral yields the desired ex pression for f(u). Now if (1 I = Jo 1) ! then t (1 ( =M <M Jo 1) ! !

JIJ

Let () = In !/ l ui

nu ( 1 - t)n-1 j< nl(tu) dt (n n n 1 t) lul n J u J (n - dt n

n ; then 1 0 1 < M and the result follows.

5.4

1 73

THE CENTRAL LIM IT THEOREM

Theorem 4.

101 < 1

where

y

is an arbitrary real n u"!ber, 0, 01 depending on

PROOF.

This is immediate from Theorem 3.

(b) If z is a complex number and lzl
O dt � V21T V2 1T X

i

-

__

-

HINT : show that

by differentiating both sides. (b) Show that

in the sense that the ratio of the two sides approaches 1 as x � oo . 7. Consider a container holding n = 106 molecules. In the steady state it is reason able that there be roughly as many molecules on the left side as on the right. Assume that the molecules are dropped independently and at random into the

5.4

1 77

container and that each molecule may fall with equal probability on the left or right side. If R is the number of molecules on the right side of the container, we may invoke the central limit theorem to justify the physical assumption that for the purpose of calculating P{a < R < b} we may regard R as normally distributed with mean np n/2 and variance np (l - p) n/4. Use Problem 6 to bound P{IR n/21 > .005n}, the probability of a fluctuation about the mean of more than ± .5 % of the total number of molecules. Let R be the number of successes in 10,000 Bernoulli trials, with probability of success .8 on a given trial. Use the central limit theorem to estimate P{1940 < R < 8080}. =

=

-

8.


I n fi n i te Seq u e n ces of Ran d om Va r i a b l es

6.1 INTRODUCTION

We have not yet encountered any situation in which it is necessary to con sider an infinite collection of random variables , all defined on the same prob ability space. In the central limit theorem, for example, the basic underlying hypothesis is "For each n , let R1, . . . , R n be independent random variables." As n changes , the underlying probability space may change, but this is of no consequence, since a convergence in distribution statement is a statement about convergence of a sequence of real-valued functions on E1 . If R1 , . . . , Rn are independent, with distribution functions F1 , . . . , Fn , and Tn = (Sn - E(Sn))fa(Sn ) , Sn = R1 + · · · + Rn, the distribution function of Tn is completely determined by the Fi, and the validity of a statement about convergence in distribution of Tn is also determined by the Fi, regardless of the construction of the underlying space. However, there are occasions when it is necessary to consider an infinite number of random variables defined on the same probability space. For example, consider the following random experiment. We start at the origin on the real line, and flip a coin independently over and over again. If the result of the first toss is heads, we take one step to the right (i.e. , from x = 0 1 78

6. 1

1 79

INTRODUCTION

to x = 1), and if the result is tails, we move one step to the left (to x = - 1). We continue the process ; if we are at x = k after n trials , then at trial n + 1 we move to x = k + 1 if the (n + 1)th toss results in heads, or to x = k - 1 if it results in tails. We ask, for example, for the probability of eventually returning to the origin. Now the position after n steps is the sum 1 + + of n independ ent random variables, where = 1} = p = probability of heads, = 0 for some n > 0}. = - 1 } = 1 - p. We are looking for We must decide what probability space we are considering. If we are interested only in the first n trials, there is no problem. We simply have a sequence of n Bernoulli trials , and we have considered the assignment of probabilities in detail. However, the difficulty is that the event = 0 for some n > 0} involves infinitely many trials. We must take Q = E00 = all infinite sequences (x1, x 2 , ) of real numbers. (In this case we may restrict the xi to be ± 1 , but it is convenient to allow arbitrary xi so that the discussion will apply to the general problem of assigning probabilities to events in volving infinitely many random variables.) We have the problem of specifying the sigma field :F and the probability measure The physical description of the problem has determined all E B} that is, we know , probabilities involving finitely many for each positive integer n and n-dimensional Borel set B. What we would like to conclude is that a reasonable specification of probabilities involving finitely many determines the probability of events involving all the For example, consider {all = 1}. This event may be expressed as

Sn

P{Rk

R P{Rk P{Sn

·

·

Rn

·

{Sn

•

•

•

P.

Ri;

P{(R1 ,

•

•

•

Rn)

Ri.

Ri

Ri n { R1 1, . . . , R n 1 } n=l The sets {R 1 1 , . . . , R n 1} form a contracting sequence ; hence P{ all Ri 1 } lim P{ R1 = 1, . . . , Rn 1 } = lim pn 0 if p < 1 oo oo n -+ n -+ As another example, tRn = 1 for infinitely many n} {for every n, there exists k > n such that Rk 1 } n u { Rk = 1 } lk n= =n Thus P{ R n = 1 for infinitely many n } !��p[ l.J { Rk 1}] kn = lim lim P [ 0n {Rk = 1}] n -+ oo m-+ oo k= 00

=

=

=

=

=

=

=

=

=

=

00

00

=

=

=

1 80

INFINITE SEQUENCES OF RANDOM VARIABLES

Thus again the probability is determined once we know the probabilities of all events involving finitely many Ri . We now sketch the general situation. Let Q = E00 • A set of the form where c E , is called a cylinder with E a measurable cylinder if is a Borel subset of qase Suppose that for each n we specify a probability measure Pn on the Borel subsets of En ; is to be interpreted as P{(R1, where E ... = Suppose , for example, that we have specified P5• Then k < 5 , is determined. In particular, in the discrete case we have

Bn n

{(x1 , x2 , . . . ) : (x1 , . . . , xn) Bn}, Bn Bn , Pn (Bn) Ri (xl , x 2 , ) xi . I

( x l , X2 x3 )E B3 - oo < x4 < oo , - oo < xs < oo .

En . . . . , Rn) Bn}, Pk ,

P{R1 xb R 2 x2 , R3 x3, R4 x4, R5 x5} =

=

=

=

=

.

and in the absolutely continuous case we have

P{(Rb R2 , Ra) E Ba} r . J f(xl , x2 , Xa, x4, ) dxl dx2 dx3 dx4 dx5 Xs

=

( xl , X 2 , X3 ) EB3 -oo < x4 < oo . - oo < xs < oo

In general, once P is given, k < n, is determined. But we have specified Pk, k < n, at the beginning ; if our assignment of probabilities is to make sense, the original Pk must agree with that derived from P n > k. If, for all n = 1 , 2, . . . and all k < n, the probability measure P origi nally specified agrees with that derived from P n > k, we say that the prob ability measures are consistent. Under the consistency hypothesis , the Kolmogorov extension theorem states that there is a unique probability measure P on ff = the smallest sigma field of subsets of Q containing the measurable cylinders , such that

Pk ,

n

n'

k

n'

Bn) 1 , 2, . . . and all Borel subsets Bn of E n . P (the measurable cylinder with base

=

Bn)

Pn (

for all n = In other words , a consistent specification of finite dimensional probabilities determines the probabilities of events involving all the Ri . We now consider the case in which the Ri are discrete. Here we determine by prescribing the joint probability probabilities involving function = P{R1 = , , Rn = We may then derive th� joint probability function of R1, . . . , R

(R1 , . . . , Rn) Pt 2 · · n Cxt , xn) ·

·

P{R1 xb . . . , Rk xk} =

=

xt ,

·

=

I

XJc+ l •

•

• • ,

·

·

·

xn}

k: P{R1 xb . . . , R n xn} =

Xu

=

(6. 1 . 1)

6. 1

INTRODUCTION

181

If this coincides with the given p1 2 ... k (for all n and all k < n) we say that the system of joint probability functions is consistent. If we sum (6. 1 . 1) over (x1, . . . , xk) E Bk, we find that consistency of the joint probability functions is equivalent to consistency of the associated probability measures P · Thus in the discrete case the essential point is the consistency of the joint probability functions. In particular , suppose that we require that for each n, R1, . . . , R n be independent , with Ri having a specified probability function Pi · Then ( 6. 1 . 1) becomes �

n

P {R1 = Xb

·

·

·

,

Rk = xk} =

L

Pt ( xl) · · · P n( x n) = Pt ( xl) · · · Pk( xk)

and thus the joint probability functions are consistent. The point we are making here is that there is a unique probability measure on :#' such that the random variables R1 , R 2 , are independent, each with a specified prob ability function. In other words , the statement "Let R1 , R 2 , be independ ent random variables , where Ri is discrete and has probability function pi," is unambiguous. In the absolutely continuous case , probabilities involving R1 , . . . , Rn are determined by the joint density function /1 2 n· The joint density of R1 , . . . , Rk is then given by •

•

•

•

•

•

. . •

g( xt , . . . , xk) =

J:· · · f!l2 ·· · n(x1,

•

•

•

, xn) d xlc+ 1

•

•

•

d xn

(6 . 1 .2)

If this coincides with the given /1 2 ... k (n , k = 1 , 2 , . . . , k < n) , we say that the system of joint densities is consistent. By integrating ( 6. 1 . 2) over Borel k sets Bk c E , we find that consistency of joint density functions is equivalent to consistency of the associated probability measures P · In particular , if we require that for each n, R1 , . . . , R n be independent, with Ri having a specified density function h, then (6. 1 . 2) becomes n

g( xl , . . . ' xk) =

fXlcx: . . J!t(xt) . . . fn(xn) dxlc+l . . . dxn

= ft C xl) · · · fk( xk) Therefore the joint density functions are consistent, and the statement "Let R1 , R 2 , be independent random variables , where Ri is absolutely continuous with density function h," is unambiguous. •

•

•

PROBLEMS 1.

Pn,

By working directly with the probability measures give an argument shorter than the one above to show that the statement "Let R1 , R2 , be independent random variables, where Ri is absolutely continuous with density fi," is unambiguous. •

•

•

1 82 2.


If R1 , R2 , are independent, with P{Ri at the beginning of the section, find P{Rn P{limn--. 00 R1� 1}. (Assume 0 < p < 1 .) •

•

•

= =

P{Ri - 1} 1 p , as 1 for infinitely many n} ; also, find

1}

=

p,

=

=

-

=

6.2

T HE GAMB LER ' S RUIN - P R OBLEM

Suppose that a gambler starts with a capital of x dollars and plays a sequence of games against an opponent with b x dollars. At each trial he wins a dollar with probability p , and loses a dollar with probability q = 1 - p. (The trials are assumed independent, with 0 < p < 1 , 0 < x < b. ) The process continues until the gambler ' s capital reaches 0 (ruin) or b (victory) . We wish to find h(x), the probability of eventual ruin when the initial capital . -

IS

X.

Let A = {eventual ruin} , B1 = {win on trial 1}, B2 = {lose on trial 1}. By the theorem of total probability,

P(A) = P(Bl )P(A I Bl ) + P(B2 )P(A I B2) We are given that P(B1 ) p, P(B2 ) = q ; P(A) is the unknown probability h(x ) . Now if the gambler wins at the first trial , his capital is then x + 1 ; thus P(A I B1 ) is the probability of eventual ruin , starting at x + 1 , that is , h(x + 1). Similarly , P(A I B2 ) h(x - 1). Thus =

=

h(x) = ph(x + 1) + qh(x - 1 ) , x = 1 , 2, . . . , b

-

1

(6.2. 1)

[The intuition behind the argument leading to ( 6.2. 1) is compelling ; how ever, a formal proof involves concepts not treated in this book, and will be omitted.] We have not yet found h(x) , but we know that it satisfies (6.2. 1 ) , a linear homogeneous difference equation with constant coefficients. The boundary conditions are h(O) = 1 , h(b) = 0. To see this, note that if x = 1 , then with probability p the gambler wins on trial 1 ; his probability of eventual ruin is then h(2). With probability q he loses on trial 1 , and then he is already ruined. In other words , if ( 6.2. 1 ) is to be satisfied at x = 1 , we must have h(O) = 1 . Similarly, examination of (6.2. 1 ) at x = b 1 shows that h(b) = 0. The difference equation may be put into the standard form -

p h(x

+ 2)

-

h (x + 1) + qh(x) = 0 , X = 0, 1 , . . , b - 2, h(O) .

=

It is solved in the same way as the analogous differential equation p

d 2y

dy

- - + qy = O 2 dx dx

-

1 , h(b) = 0

6.2

THE GAMBLER'S RUIN PROBLEM

1 83

We assume an exponential solution ; for convenience, we take h(x) = it� ( = ex In ;.) . Then pit �2 - .;tx+l + qit X = .;tx (pit2 - A + q) = 0. Since .;tx is never 0, the only allowable it ' s are the roots of the characteristic equation pit2 - it + q = 0, namely, 1 it = - (1 ± J 1 - 4p q ) 2p

Now (p - q) 2 = p2 - 2pq + q2 = p 2 + 2pq + q2 - 4pq = (p + q) 2 - 4pq = 1 - 4pq (6.2.2) Hence 1 it = (1 ± I P - q l ) 2p

The two roots are

+ it l = 1

p 2p

CASE 1 .

+ it 2 = 1 q -

q = 1'

2p

p � q . Then A1 and it2 are distinct ; hence

h(x) = AA1"' + CA 2"' = A + C h(O) = A + C h (b) = A + c Solving, we obtain Therefore

1

(:r = 0

A = - (q fp) b 1 - ( q fp)b

c=

= q p

(�J

1

1 - (q fp) b

p) b ( ( f f q q p)'" = h(x) 1 - ( q fp) b

(6 .2.3)

= q = 1 /2. Then it1 = it 2 = it = 1 , a repeated root. In such a case (just as in the analogous differential equation) we may construct two linearly independent solutions by taking it x and xit x ; that is ,

CASE 2.

p

=

p

Thus

h(x) = Ait x + Cxitx = A + Cx h(O) = A = 1 so c = - -1 h( b) = A + Cb = 0 b h( x) = 1

-

x

--

=b-x b b

.:.._

(6.2.4)

1 84

INFINITE SEQUENCES OF RANDOM VARIABLES h(x)

FI GURE 6.2. 1

Probability of Eventual Ruin.

so that the probability of eventual ruin is the ratio of the adversary ' s capital to the total capital. A sketch of h(x) in the various cases is shown in Figure 6.2. 1 . Similarly, let g (x) be the probability of eventual victory, starting with a capital of x dollars. We cannot conclude immediately that g(x) = 1 - h(x), since there is the possibility that the game will never end ; that is, the gambler ' s fortune might oscillate forever within the limits x = 1 and x = b - 1 . However, we can show that this event has probability 0, as follows. By the same reasoning as that leading to (6.2. 1) , we obtain

g(x) = pg(x + 1 ) + qg(x - 1) ( 6.2. 5 ) The boundary conditions are now g(O) = 0, g(b) = 1 . But we may verify that g (x) = 1 - h (x) satisfies (6.2. 5) with the given boundary conditions ; since the solution is unique (see Problem 1 ) , we must have g (x) = 1 - h(x) ; that is, the game ends with probability 1 . We should mention, at least in passing, the probability space we are work ing in. We take Q = E00 , :F = the smallest sigma field containing the measur able cylinde rs, Ri(x1, x 2 , ) = xi, i = 1 , 2, . . . , P the probability measure determined by the requirement that R1, R 2 , be independent, with P{R i = 1 } = p , P{Ri = - 1 } = q. Thus Ri is the gambler ' s net gain on trial i, and x + If 1 Ri is his capital after n trials. We are looking for h (x) = P{for some n, x + If 1 R i = 0, 0 < x + I� 1 Ri < b, k = 1 , 2, . . . , n - 1 }. A sequence of random variables of the form x + !Y 1 Ri, n = 1 , 2, . . . , where the R i are independent and have the same distribution function (or, are more generally, R0 + If 1 Ri, n � 1 , 2, . . . , where R0, R1, R 2 , independent and R1, R 2 , have the same distribution function) , is called a random walk, a simp le random walk if Ri ( i > 1 ) takes on only the values ± 1 . The present case may be regarded as a simple random walk with absorbing •

•

•

•

•

•

•

•

•

•

•

•

6.2

THE GAMBLE.R•s RUIN PROBLEM

1 85

barriers at 0 and b, since when the gambler' s fortune reaches either of these figures, the game ends, and we may as well regard his capital as forever frozen. We wish to investigate the effect of removing one or both of the barriers. Let hb (x ) be the probability of eventual ruin starting from x, when the total capital is b. It is reasonable to expect that limb__.oo hb (x) should be the prob ability of eventual ruin when the gambler has the misfortune of playing against an adversary with infinite capital. Let us verify this. Oonsider the simple random walk with only the barrier at x = 0 present ; that is, the adversary has infinite capital. If the gambler starts at x > 0, his probability h*(x) of eventual ruin is =

P(A)

P{for some positive integer b, 0 is reached before b}

Let Ab = {0 is reached before b}. The sets Ab, b ing sequence whose union is A ; hence P(A)

But

=

=

1 , 2, . . . form an expand

lim P(A b)

b -+

00

Consequently h *( x)

=

lim hb(x)

=

1

=

(�)'"

if q > p if q < p, by (6 .2.3) and (6 . 2 . 4) (x

=

1, 2, . . . ) (6.2.6)

Thus , in fact, limb__. oo hb(x) is the probability h* (x) of eventual ruin when the adversary has infinite capital ; 1 - h * (x) is the probability that the origin will never be reached , that is, that the game will never end. If q < p, then h * (x) < 1 , and so there is a positive probability that the game will go on forever. Finally, consider a simple random walk starting at 0, with no barriers. Let r be the probability of eventually returning to 0. Now if R1 = 1 (a win on trial 1), there will be a return to 0 with probability h * (l ) . If R1 = - 1 (a loss on trial 1), the probability of eventually reaching 0 is found by evaluat ing h * (1 ) with q and p interchanged , that is, 1 for q � p, and p/q ifp < q. Thus, if q < p, , If p < q ,

r

= p

r

=

(�) + q(l)

p(l) + q

( :)

=

2q

=

2p

1 86


One expression covers both of these cases , namely, r

= 1 = 1 < 1

-

lp

ql ifp = q = � ifp � q -

(6.2. 7)

PRO B L E M S

1. Show that the difference equation arising from the gambler's ruin problem has a

unique solution subject to given boundary conditions at x = 0 and x = b. 2. In the gambler's ruin problem, let D(x) be the average duration of the game when the initial capital is x. Show that D(x) = p (1 + D(x + 1)) + q (1 + D(x - 1)) , x = 1 , 2 , . . . , b - 1 [the boundary conditions are D(O) = D(b) = 0] . 3. Show that the solution to the difference equation of Problem 2 is X

(b/(q - p )) (1 - (q/p) X) if p "#: q 1 - (q/p)b q -p = x(b - x) ifp = q . 1/2 [D(x) can be shown to be finite, so that the usual method of solution applies ; see Problem 4, Section 7.4.] REMARK. If Db(x) is the average duration of the game when the total capital is b, then lim b � oo Db(x) ( = oo if p > q , = xfq - p if p < q) can be interpreted as the average length of time required to reach 0 when the adversary has infinite capital. 4. In a simple random walk starting at 0 (with no barriers), show that the average length of time required to return to the origin is infinite. (Corollary : A couple decides to have children until the number of boys equals the number of girls. The average number of children is infinite.) 5. Consider the simple random walk starting at 0. If b > 0, find the probability that x b w ill eventually be reached. D(x) =

-

=

6 . 3 C O M B I N A T O RI A L AP P R O A C H T O T H E R A N D O M WAL K ; T H E REFLE C T I O N PRIN CIPLE

In this section we obtain, by combinatorial methods , some explicit results connected with the simple random walk. We assume that the walk starts at 0, with no barriers ; thus the position at time n is Sn = 2: 1 Rk, where are independent random variables with P{Rk = 1 } = p, R1, R2 , . •

•

6.3 COMBINATORIAL APPROACH TO THE RANDOM

WALK

1 87

P{Rk = - 1 } = q = 1 - p. We may regard the Rk as the results of an infinite sequence of Bernoulli trials ; we call an occurrence of Rk = 1 a "success," and that of R k = - 1 a "failure." ( Suppose that among the first n trials there are exactly a successes and b failures (a + b = n) ; say a > b. We ask for the (conditional) probability that the process will always be positive at times 1 , 2, . . . , n, that is, P{Sl > 0, s2 > 0, . . . ' sn > 0 I sn = a - b }

(6.3 . 1)

(Notice that the only way that Sn can equal a - b is for there to be a suc cesses and b failures in the first n trials ; for if x is the number of successes and y the number of failures, then x + y = n = a + b, x - y = a - b ; hence x = a, y = b. Thus {Sn = a - b} = { a successes, b failures in the first n trials}.) , Now (6.3. 1) may be written as

P{ S1 > 0, . . . , Sn > 0, Sn = a - b } (6.3.2) P{ Sn = a - b} A favorable outcome in the numerator corresponds to a path from (0, 0) to (n, a - b) that always lies above the axis,t and a favorable outcome in the denominator to an arbitrary path from (0, 0) to (n, a - b) (see Figure 6.3.1). Thus (6.3.2) becomes paqb[the number of paths from (0, 0) to ( n, a - b) that are always above 0] paqb[the total number of paths from (0, 0) to ( n, a - b)] A path from (0, 0) to (n, a - b) is determined by selecting a positions out of n for the successes to occur ; the total number of paths is (�) = (a!b). To count the number of paths lying above 0, we reason as follows (see Figure 6.3.2). Let A and B be points above the axis. Given any path from A to B that touches or crosses the axis, reflect the segment between A and the first zero point T, as shown. We get a path from A' to B, where A' is the reflection of A. Conversely, given any path from A' to B, the path must reach the axis at some point T. Reflecting the segment from A' to T, we obtain a path from A to B that touches or crosses the axis. The correspondence thus established is one-to-one ; hence

the number ofpaths from A to B that touch or cross the axis = the total number ofpaths from A' to

B

t Terminology : For the purpose of determining whether or not a_ path lies above the axis (or touches it, crosses it, etc.), the end points are not included in the path.

1 88


FIGURE 6.3. 1

A Path in the Random Walk . n = 6

a = 4,

b =2

P{R1 = R2 = 1 , R3 = - 1 , R4 = R5 = 1, R6 = -1} = p4q2 ; this is one contribution to

P{S1 > o, . . . , S6 > o, S6 = 2 } This is called the reflection principle. Now the number of paths from (0, 0) to (n, a - b) lying entirely above the axis = the number from (1 , 1) to (n, a - b) that neither touch nor cross the axis (since R1 must be + 1 in this case) = the total number from (1 , 1) to (n, a - b) - the number from (1 , 1) to (n, a - b) that either touch or cross the axis = the total number from (1 , 1) to (n, a - b) - the total number from (1 , - 1) to (n, a - b) (by the reflection principle) [Notice that in a path from (1 , 1) to (n, a - b) there are x successes and y failures, where x + y = n - 1 = a + b - 1 , x - y = a - b 1 , so -

B

FIGURE 6.3 .2

Reflection Principle.

6.3 COMBINATORIAL APPROACH TO THE RANDOM

WALK

1 89

x = a - 1 , y = b . Similarly, a path from {1 , - 1) to (n, a - b) must have a successes and b - 1 failures.]

(

n! a b ( n - 1) ! ( = n - 1) ! = (a - 1) ! b ! a ! (b - 1) ! a ! b ! n n

Thus

if a + b

==

)

n, the number of paths from (0, 0) to (n, a - b) a b lying entirely above th e axis =

( : ) ( :)

Therefore

P{ St > O, . . . , Sn > O I Sn = a - b} = a - b = a - b a+b n --

(6.3.3) {6.3.4)

This problem is equivalent to the ballot p roblem: In an election with two candidates, candidate 1 receives a votes and candidate 2 receives b votes, with a > b, a + b = n. The ballots are shuflled and counted one by one. The probability that candidate 1 will lead throughout the balloting is (a - b)f(a + b). [Each possible sequence of ballots corresponds to a path from {0, 0) to (n, a - b) ; a sequence in which candidate 1 is always ahead corresponds to a path from (0, 0) to (n, a - b) that is always above the axis.]

REMARK.

We now compute h1

= P{the first return to 0 occurs at time j} Since hi must be 0 for j odd , we may as well set j = 2n. Now h 2n = P {S1 � 0, . . . , S2n_1 � 0, S2n = 0} and thus h 2n is the number of paths from (0, 0) to (2n, 0) lying above the axis, times 2 (to take into account paths lying below the axis) , times p nqn , the probability of each path (see Figure 6.3.3). The number of paths from {0, 0) to (2n, 0) lying above the axis is the num ber from {0, 0) to (2n - 1 , 1) lying above the axis, which, by ( 6.3.3) , is {2 :-1) (a - b)f(2n - 1) (where a + b = 2n - 1 , a - b = 1 , hence a = n,

FIGURE 6.3.3

A First Return to 0 at Time 2n.

1 90

INFINITE SEQUENCES OF RANDOM VARIABLES (y + 2k,y)

FIGURE 6.3.4

b=

n

Thus

Comp utation of First Passage Times.

- 1), that is, · (2n - 2) ! 1 2n - 1 ! 2n - 2 = n 2n - 1 n ! (n - 1) ! n n - 1

(

)

h

(

2 2n - 2 = 2n n n - 1

)

(

(p q)

n

) (6. 3.5)

We now compute probabilities of first passage times, that is, P{the first passage through y > 0 takes place at time r}. The only possible values of r are of the form y + 2k, k = 0, 1 , . . . ; hence we are looking for h

��2k = P{the first passage through y > 0 occurs at time y + 2k}

To do the computation in an effortless manner, see Figure 6.3.4. If we look at the path of Figure 6.3.4 backward , it always lies below y and travels a vertical distance y in time y + 2k . Thus the number of favorable paths is the number of paths from tO, 0) to ( y + 2k , y) that lie above the axis ; that is, by (6.3.3) , y + 2k a - b where a + b y + 2k, a - b = y a y + 2k Thus a = y + k, b = k. Consequently y y + 2k pY+kq k h (y) = 2 (6.3.6) Y+ k k y + 2k

(

)

=

(

)

PRO B L E M S

The first five problems refer to the simple random walk starting at 0, with no barriers. 1. Show that P{Sl > 0, s2 > 0, . . . ' S2n-1 > 0, S2n = 0} = U2n/(n + 1), where u2n = P{S2n = 0} is the probability of n successes (and n failures) in 2n Bernoulli trials, that is, (2,:)(pq)n.

6.4

191

GE.NE.RATING FUNCTIONS

Let p = q 1/2. (a) Show that h2n u2n-2/2n, where u2n ( 2:)(1/2)2n . (b) Show that u2n/u2n-2 1 - 1/2n ; hence h2n u2n-2 - u2n · 3. If p q 1/2, show that "#: 0, . . . , S2n "#: 0}, the probability of no return to the origin in the first 2n steps, is u2n (2:)2-2n . Show al so that "#: 0, . . . , =

2.

=

=

=

=

2n

S

1

-

P{S1

=

=

"#:

0}

h 2n + U2n · 1/2, show that 1/2, show that

P{S1

=

=

> 0, . . , 2 > 0} If p q (2nn)2-2n . q S. If p the average length of time required to return to the origin is infinite, by using Stirling's formula to find the asymptotic expression oo . for h2n , and then showing that �� nh2n 6. Two players each toss an unbiased coin independently n times. Show that the probability that each player will have the same number of heads after n tosses is (1/2)2n ��= o (�)2 . 7. By looking at Problem 6 in a slightly different way, show that � �= o (J:) 2 ( 2nn). 8. A spider and a fly are situated at the corners of an n by n grid, as shown in Figure P.6.3.8. The spider walks only north or east, the fly only south or west ; 4.

=

=

=

=

P{S1

.

1

S

n

=

=

=

D ��--��--•

Fly

iL 1

2

Spider

Spider

FIGURE

P.6 . 3 . 8

they take their steps simultaneously, to an adjacent vertex of the grid. (a) Show that if they meet, the point of contact must be on the diagonal D. (b) Show that if the successive steps are independent, and equally likely to go in each of the two possible directions, the probability that they will meet is

( 2nn) (1/2)2n .

6.4

GE NERA T I N G FUN C T I O N S

Let {an, n > 0} be a bounded sequence of real numbers. The generating function of the sequence is defined by A(z)

=

00

n, z � a n =

n O

z complex

1 92


The series converges at least for lzl < 1 . If R is a discrete random variable taking on only nonnegative integer values, and P{R a , n 0, 1, . . . , then A(z) is called the generating function of R. Note that A(z) E(zR), the characteristic function of R with z replacing ! � 0 z P{R e-iu . We have seen that the characteristic function of a sum of independent random variables is the product of the characteristic functions. An analogous result holds for generating functions.

n

= n} = n =

= n} =

=

} and {bn} be bounded sequences of real numbers. Let n {en} be the convolution of {an} and {bn} , defined by en = ki=Oakb n-k ( = 3"i,=0b an-i) Then C(z) = :l: 0 cnzn convergent at least for lzl < 1 , and C(z) = A(z)B(z) PROOF. Suppose first that a n = P{R1 = n}, b n = P{R 2 = n } , where R1 and R 2 are independent nonnegative integer-valued random variables. Then en = P{R1 + R 2 = n} , since {R1 + R2 = n} is the disjoint union of the events {R1 = k, R 2 = n - k}, k = 0, 1 , . . . , n. Thus C(z) = E(zR1+R2 ) = E(zR1zR2 ) = E(zR1)E(zR2 ) = A(z)B(z) Theorem 1. Let {a

_

;

is

In general,

We have seen that under appropriate conditions the moments of a random variable can be obtained from its characteristic function. Similar results hold for generating functio_ns. Let A(z) be the generating function of the random variable R ; restrict z to be real and between 0 and 1 . We show that

= A'(l) A'( l ) = lim A'(z) E(R)

where

(6.4. 1)

If E(R) is finite, then the variance of R is given by Var R

= A" (1) + A ' (1) - [A' (1)]2

To establish (6 . 4 . 1) and (6.4.2) , notice that 00

A(z) = ! anzn , n=O

a n = P{R

= n}

(6.4.2)

6.4

Thus

GENERATING FUNCTIONS

1 93

z n=1na nzn-1 A '( 1 ) = I n a n = E(R) n=1 00

A'( ) = I

z

Let � 1 to obtain proving (6.4. 1). Similarly, 00

z n=1

A"( ) = I n ( n Therefore

00

1)a nzn-2,

2

1

so A"( ) = E (R ) - E(R)

Var R = E(R2) - [E(R)] 2 = A-" (1) + A' (1) - [A' ( 1)] 2 which is ( 6.4. 2). Now consider the simple random walk starting at 0, with no barriers. Let = P { = 0} , = the probability that the first return to 0 will occur at time n = P {S � 0, = 0}. Let � 0,

un

Sn

hn

. . . , Sn_1 Sn U(z) = n=OI unzn, H(z) = nI=Oh nzn (For the remainder of this section, z is restricted to real values.) 1

00

00

If we are at the origin at time n , the first return to 0 must occur at some time k, k = 1 , 2, . . . , n. If the first return to 0 occurs at time k, we must be at the origin after n - k additional steps. Since the events {first return to 0 at time k}, k = 1 , 2, . . . , n , are disjoint, we have

n un = k=I1 hkun-k' = 1, 2, . . . Let us write this as n U n = kI=Ohkun-k' n = 1, 2, . . . This will be valid provided that we define h0 = 0. Now u0 = 1 , since the walk starts at the origin, but h0u0 = 0. Thus we may write n (6.4.3) Vn = kI=Ohku n-k' n = 0, 1, . . . where V n = un, > 1 ; v 0 = 0 = u0 - 1 . Since {v n } is the convolution of {hn } and {un } , Theorem 1 yields V(z) = H(z)U(z) But V(z) = nI=OVnZn = n=I1U nZn = U(z) - 1 n

n

00

00

1 94


Thus

z)( 1 - H(z)) = 1 We may use (6. 4 . 4) to find the h n explicitly. For U(

(6.4.4)

This can be put into a closed form, as follows. We claim that

(6.4.5) where, for any real number ex( ex

ex ,

(�)

denotes

- 1) ·

· · ( ex

To see this, write

( - 1/2) n

=

- n + 1)

n!

( - 1/2)( - 3/2) · · · [-(2n - 1)/2] , n! ·

·

· · ·

n! 1 3 5 (2n - 1) ( - 1) n n! n ! 2n 1 3 5 (2n - 1) (2/2)(4/2)( 6/2) = n =

·

-

·

· · ·

n .f (2n) ! ( - 1) n n! n!

4n

n f. 2

· · ·

(2n/2)

(- 1)

n

(6.4.5). Thus U(z) = },0 ( -! '2) ( -4pqz2)n = (1 - 4pqz2)-112 by the binomial theorem. By (6. 4 . 4) we have (6.4.6) H(z) = 1 - U(z)1 = 1 - (1 - 4pqz2)112 This may be expanded by the binomial theorem to obtain the h n (see Problem 1) ; of course the results will agree with (6.3.5), obtained by the combinatorial approach of the preceding section. Notice that we must have the positive square root in ( 6.4.6) , since H(O) = h0 = 0. Some useful information may be gathered without expanding H(z). Observe that H(1) = !� 0 hn is the probability of eventual return to 0. proving

6.4 GENERATING FUNCTIONS

By (6. 4 .6 ) ,

1 95

H(1) = 1 (1 - 4pq)1 1 2 = 1 - lp - q l by (6.2.2) -

This agrees with the result ( 6.2. 7) obtained previously. Now assume p = q = 1/2, so that there is a return to 0 with probability 1. We show that the average length of time required to return to the origin is infinite. For if T is the time of first return to 0, then 00

00

E (T) =I1 n P{T = n } = I1 nh n = H' ( 1 ) n= n= as in (6. 4 . 1) . By (6.4.6) , H' (z) = - d ( 1 - z2)1 1 2 = z( 1 - z 2)-11 2 � oo as z � 1 dz Thus E(T) = oo , as asserted (see Problem 5, Section 6.3 , for another ap proach). -

PRO B LE M S

Expand (6.4.6) by the binomial theorem to obtain the hn . 2. Solve the difference equation an+ 1 - 3an 4 by taking the generating function of both sides to obtain 4z a0 A(z) (1 - z)( 1 - 3z) + 1 - 3z Expand in partial fractions and use a geometric series expansion to find an . 3. Let A(z) be the generating function of the sequence {a } ; assume that I � 0 l an - an- 1 1 < oo . Show that if lim n an exists, the limit is lim (1 - z)A(z) z--+1 4. If R is a random variable with generating function A(z), find the generating function of R + k and kR, where k is a nonnegative integer. If F(n) P{R < n}, find the generating function of {F(n)}. 5. Let R1 , R2, be independent random variables, with P{Ri 1} p, P{Ri 0} q 1 - p , i 1 , 2, . . . . Thus we have an infinite sequence of Bernoulli trials ; Ri 1 corresponds to a success on trial i, and Ri 0 is a failure. (Assume 0 < p < 1 .) Let R be the number of trials required to obtain the first success. (a) Show that P{R k} qk-Ip, k 1 , 2, . . . . (b) Use generalized characteristic functions to show that E(R) 1/p, Var R (1 - p)/p2 ; check the result by calculating the generating function of R and using (6.4. 1) and (6.4.2). R is said to have the geometric distribution.

1.

=

=

n

-+

'

oo

=

•

=

=

•

=

•

=

=

=

=

=

=

=

=

=

=

196


as in Problem 5, let , With the the rth success (r = 1 , 2, . . . ) . (a) Show that = = 0 ; f(x) = 0, x < 0 (it is a fixed positive constant) . Let An = + + n = 1 , 2, . . . . We may think of An as the arrival time of the nth customer at a serving counter, so that Tn is the waiting time between the arrival of the (n - 1)st customer and the arrival of the nth customer . . :Equally well, An may be regarded as the time at which the nth call is · made at a telephone exchange, the time at which the nth component fails on an assembly line, or the time at which the nth electron arrives at the anode of a vacuum tube. If t > 0, let Rt be the number of customers that have arrived up to and including time t; that is, Rt = n if An < t < An+ l (n = 0, 1 , . . . ; define A0 = 0) . A sketch of (Rt, t > 0) is given in Figure 6.5. 1 . Thus we have a family of random variables Rt, t > 0 (not just a sequence, but instead a random variable defined for each nonnegative real number) . A family of random variables Rt, where t ranges over an arbitrary set I, is called a random process or stochastic process. Note that if I is the set of positive integers, the random process becomes a sequence of random variables ; if I is a finite set, we obtain a finite collection of random variables ; and if I consists of only one element, we obtain a single random variable. Thus the concept of a random process includes all situations studied previously.

Ti

T1 T2 , • • •

T1

·

·

·

Tn,

6.5

•

4 •

3

2

•

1

0

THE POISSON RANDOM PROCESS

1 97

Tn = An - A n - 1

•

Al

0

A2

FIGURE 6.5. 1

Aa

A4

;:.. t

Poisson Process.

If the outcome of the experiment is w, we may regard (Rt (w), t E I) as a real-valued. function defined on I. In Figure 6.5. 1 what is actually sketched is Rt ( w ) versus t, t E I = the nonnegative reals, for a particular w . Thus, roughly speaking, we have a "random function," that is, a function that depends on the outcome of a random experiment . The particular process introduced above is called the Poisson process since, for each t > 0, Rt has the Poisson distribution with parameter At. Let us verify this. If k is a nonnegative integer, P{R t < k} = P{at most k customers have arrived by time t} = P{(k + 1)st customer arrives after time t} = P{T1 + · · · + Tk+l > t} But A k+l = T1 + · · · + Tk+l is the sum of k + 1 independent random variables, each with generalized characteristic function f: Ae-l ze-s � dx = A/(s + A) , Re s > - A ; hence A k+ l has the generalized characteristic function [A/(s + A) ]k+1 , Re s > - A. Thus (see Table 5. 1 . 1) the density of A k+ l is (6.5.1) f (x) = _!_ Ak+lxke-lxu (x) Ak+ l

k!

where u is the unit step function. Thus

P{R t < k}

=

P{T1 +

· · ·

+ Tk+l > t }

1 98


Integrating by parts successively, we obtain

Hence Rt has the Poisson distribution with parameter At. Now the mean of a Poisson random variable is its parameter, so that E(Rt) = A t . Thus A may be interpreted as the average number of customers arriving per second. We should expect that A-1 is the average number of seconds per customer, that is , the average waiting time between customers. This may be verified by computing E(Ti).

E( Ji)

=

rJ ooo Axe-'-"' dx = 1A

We now establish an important feature of the Poisson process. Intuitively, if we arrive at the serving counter at time t and the last customer to arrive came at time t h , the distribution function of the length of time we must wait for the arrival of the next customer does not depend on h , and in fact coincides with the distribution function of T1 . Thus we are essentially starting from scratch at time t ; the process does not remember that h seconds have elapsed between the arrival of the last customer and the present time. If Wt is the waiting time from t to the arrival of the next customer, we wish to show that P{ Wt < z} = P{ T1 < z } , z > 0. We have -

P{ Wt < z} = P{for some n = 1 , 2, . . . , the nth customer arrives in

( t, t + z] and the ( n + 1 )st customer arrives after time t + z} (6.5.2)

(see Figure 6. 5.2). To justify (6.5.2) , notice that if t < A n < t + z < A n+ 1 for some n, then Wt < z. Conversely, if Wt < z, then some customer arrives in (t, t + z] and hence there will be a last customer to arrive in the interval. (If not, then ! : 1 Tn < oo ; but this event has probability 0 ; see Problem 1 .) Now P{t < A n < t + Z < A n+1 } = P{ t < A n < t + z, A n + Tn+1 > t + z} . t

An •

t+z

FIGURE 6.5.2

An+ l •

6.5

Since A n ( = T1 +

· · ·

P{t < A � -< t +

z

THE POISSON RANDOM PROCESS

1 99

Tn) and Tn+l are independent, we obtain, by (6.5. 1), < An+l } JJ (n - 1) ! .;tnxn-le-lx_Ae-ly dx dy tx [el - 1 - (elt - 1) ] 1 - e- ,\ z

Since ! � r n/n ! =

0

z

Wt

=

=

has the same distribution function as T1. Thus Alternatively, we may write

P{Wt < } P [ nU0 {A n < t < An+l < t + }J For if A n < t < A n+ l < t + for some then Wt < Conversely, if Wt < then some customer arrives in (t, t + ] and there must be a first customer to arrive in this interval , say customer n + 1 . Thus A n < t < A n+l < t + for some > 0. An argument very similar to the above shows that (At)n e - l t e - ;. t+ z > ( - < ) P{A n -< t < A n+l -< t + } and therefore P{Wt < } 1 - e - lz as before. In this approach, we do not have the p roblem of showing that P {}: Tn < oo} 0. 1 z

=

z

z

z,

z

n,

z.

z ,

n

z

z

=

n .'

=

=

To justify completely the statement that the process starts from scratch at time t, we may show that if V1, V2 , are the successive waiting times ' then V1 , V2 , are independent, and Vi and starting at t (so V1 = Ti have the same distribution function for all i. To see this , observe that •

.

•

Wt), P { V1 < x1 , . . . , Vk < xk} = P [Po { n < t < A n+l < t + X1 , Tn+ 2 < x2 , •

A

•

•

•

•

•

,

Tn+k < xk} J

200


For it is clear that the set on the right is a subset of the set on the left. Con versely ' if vl < x l , . . . ' vk < x k , then a customer arrives in (t, t + xl], and hence there is a first customer in this interval, say cus tomer n + 1 . Then A n < t < A n+ t < t + x1, and also Vi = Tn+i ' i = 2, . . . , k, as desired. Therefore

P{ Vt < xl , . . . ' vk < xk }

=

=

=

Fix j and let xi --+ Consequently

oo ,

L�t{An 0 , then u nk-t � b. The set S of all such t ' s is closed under addition and has greatest common divisor 1 , since S is generated by the positive integers i for which _h > 0. Thus (Problem 1 b, Section 7.3) S contains all sufficiently large positive integers. Say unk- i � b for i > I. By (7.4.2) , 00

� r i u nk-1- i

i=O

=

1

(with u n

=

0 for

n

< 0)

(7.4. 3)

If �3� 0 r1 < oo , the dominated convergence theorem shows that we may let k ---+ oo and take limits term by term in (7. 4 .3) to obtain b � � 0 r1 = 1 . If � � 0 r1 = oo , Fatou ' s lemma gives 1 > b � ;' 0 ri ; hence b = 0. In either case, then ,

7.4

But

LIMITING PROBABILITIES

233

ro = h + /2 + fa + · · · r1 = /2 + fa + · · · r2 = fa + · · · •

Hence

00

00

'I rn = 'I nfn = fl n =l . n=O u n = 1 /fl · By an entirely

Consequently b = lim sup n lim infn u n = 1 /p , and the result follows.

symmetric argument,

We now apply Theorem 1 to gain complete information about the limiting behavior of the n-step transition probability Pii>. A recurrent state j is said to be positive iff its mean recurrence time fli is < oo , null iff fli = oo. Theorem

2.

(a) If the state j is transient then I � 1 pi;> < oo for all i, hence

P'l� 3� > -+ 0

as n -+ oo

PROOF. This is Theorem 4 of Section 7.3. (b) If j is recurrent and aperiodic, and i belongs to the same equivalence class as j, then Pi;> -+ 1 /p1• Furthermore, fli is finite iff fli is finite. If i belongs to a different class, then Pi;> -+ fulfli · PROOF.

00

k} ' f..< � >p 3� 3�P � � > = £.., 'l3

k=l

'l3

by the first entrance theorem [take p�j> = 0, r < 0]. By the dominated con vergence theorem , we may take limits term by term as n -+ oo ; since p�j-k > -+ 1 /p1 by Theorem 1 , we have p� ; > -+ fi�k) _!_ = fi j

( i: ) k=l

fli

fli

-

If i and j belong to the same recurrent class , fii = 1 . r · Now assume that jvu 3. is finite. If p'l�3� > , p��> 3 'l > 0, then p � �+ + > > p��> 'l3 p 3 , this is bounded away from 0 for large n , since p�j> -+ 1 fp1 > 0. But if fli = oo, then Pit > -+ 0 as n -+ oo , a contradiction. This proves (b). 'l'l

s

J'l

234

MARKOV CHAINS

(c) Let j be recurrent with period d > 1 . Let i be in the same class as j,

with i E the cyclically moving subclass Cr , j E Cr+a · Then pJ jd + a > � d/p,1 • Also , p,1 is fin ite iff ft i is finite, so that the property of being recurrent positive (or recurrent null) is a class property.

PROOF. First assume a = 0. Then j is recurrent and aperiodic relative to the chain with transition matrix IT d. (If A has greatest common divisor d, then the greatest common divisor of {xfd: x E A} is 1 .) By (b) ,

Now, having established the result for write

n++ Pi( i d r l )

-

_

a

= r, assume a = r + 1 and

d d + n � d r ( > --,.- £., pik £., pikp k i k k lt i lt i �

_____,_

_

_

_

as asserted. The argument that p,1 is finite iff /-t i is finite is the same as in (b) , with nd replacing n. (d) Ifj is recurrent with period d > 1 , and i is an arbitrary state, then a = 0, 1 ,

.

.

.,d- 1

The expression in brackets is the probability of reaching j from i in a number of steps that is = a mod d. Thus, ifj is recurrent null, pJt > 0 as n oo for all i. �

�

PROOF.

nd+ a +a ) = � ��d �� p 3 3 - ' 7r=l Since j has period d, pj; d + a - k > = 0 unless k - a is of the form rd (necessarlly r < n) ; hence n + + � ��d a P� 3 > = £., Jl'z.(3!d a >p 3 = [n and p �r> = un for all n.

(The renewal theorem) Let T1 , T2 , be independent random variables, all with the same distribution function, taking values on the positive integers. (Think of the Ti as waiting times for customers to arrive, or as lifetimes of a succession of products such as light bulbs. If T1 + + Tn = x, bulb n has burned out at time x, and the light must be renewed by placing bulb n + 1 in position.) Assume that the greatest common divisor of {x : P{ Tk = x} > 0} is d and let G(n) = 2� 1 P{T1 + + Tk = n} , n = 1 , 2 , . . . If p = E(Ti), show that limn __. G (nd) = d/ p ; interpret the result intuitively. oo , 3. Show that in any Markov chain, (1/n) 2 r: 1 p � � > approaches a limit as n namely, fi i/ P:r (Define p1 = oo if j is transient.) HINT :

2.

•

•

•

· ·

· · ·

·

.

00

-+-

d = period of j 4. Let Vi i be the number of visits to the state j, starting at i. (If i = j, t = 0 counts

as a visit.) (a) Show that E( Vi i) = � 'oon= O p Z� J� > . Thus i is recurrent iff E( Vii) = transient, E( Vi i) < oo for all i. (b) Let C be a transient class, Nu = E( Vu), i, j E C. Show that

N'l· J· = � 'l-J. + £.., ' P'l·kNk3· kEG

( � ii = =

1,

(},

oo ,

and if ]. is

i =j i

#: j)

In matrix form, N = I + QN, Q = II restricted to C. (c) Show that (I - Q)N = N(I - Q) = I so that N = (I - Q)-1 in the case of a finite chain (the inverse of an infinite matrix need not !Je unique). REMARK.

(a) implies that in the gambler' s ruin problem with finite capital, the average duration of the game is finite. For if the initial capital is i and D is the duration of the game, then D = 2�-I vii ' so that E(D) < 00 .

236 7.5

MARKOV CHAINS

S "f A T I O NA R Y AND S T E A D Y - S T A T E DIS T RI B U TI O N S

A stationary distribution for a numbers vi, i E S, such that vi I

Markov chain with state space > 0, Ii ES v i = 1 , and

2 ViPi i iES

= Vi,

S

is a set

of

jES

Thus , if V = (vi, i E S) , then VII = V. By induction, VII n = VII (II n-1) = VII n-1 = · · · = VII = V, so that VII n = V for all n = 0, 1 , . . . . Therefore , if the initial state distribution is V, the state distribution at all future times is still V. Furthermore , since (Problem 4 , Section 7. 1 ) the sequence {R n } is stationary ; that is, the joint probability function of , Rn+k does not depend on n. R n , R n+ 1 , Stationary distributions are closely related to limiting probabilities. The main result is the following. •

•

.

Consider a Markov chain with transition matrix [pul· Assume

Theorem 1.

I I. m

n -+ oo

n

( 'l J

) P · · = q 3.

for all states i, j (where qi does not depend on i). Then (a) Ij ES qi < 1 and Ii ES qip ii = qi , j E S. (b) Either all qi = 0, or else Ij ES qi = 1 . (c) If all qi = 0, there is no stationary distribution. If IjES q1 = 1 , then {qi } is the unique stationary distribution.

PROO F .

I q i = I li m P1;> i i n

< lim

n

inf i P1;> i

by Fatou ' s lemma ; hence Now � £,.,

i

q iPi i

= ·

� £,., ( I"t m

i

n

Pki Pi i < I"I � (

n) )

_

n+ 1 ) f" ( n) " � Pki Pii = 1 Im f £,., = qi In P1c( i n i n ·

In

·

7.5 STATIONAR Y AND STEADY-STATE DISTRIBUTIONS

which is a contradiction. This proves (a). Now if Q = (qi, i E S), then by ( a) , QII QIT n = Q, that is, 'L,i qipip > = q1 • Thus

=

237

Q ; hence, by induction,

by the dominated convergence theorem. Hence qi = C 2i qi)qi, proving (b). Finally , if {vi} is a stationary distribution , then Li vip�;> = vi. Let n -+ oo to obtain 'L,i vtq1 = v1, so that qi = vi. Consequently, if a stationary distri bution exists, it is unique and coincides with {q1}. Therefore no stationary distribution can exist if all qi = 0 ; if Li qi = 1 , then, by (a) , {qi} is station ary and the result is established. The numbers vi, i E S, are said to form a steady- state distribution iff lim n __. oo Pit > = vi for all i, j E S , and 'L,j ES vi = 1 . Thus we require that limiting prob abilities exist (independent of the initial state) and form a probability distribution. In the case of a finite chain , a set of limiting probabilities that are inde pendent of the initial state must form a steady-state distribution , that is , the case in which all q1 = 0 cannot occur in Theorem 1 . For Lj ES pi;> = 1 for all i E S; let n -+ oo to obtain, since S is finite, 'L,1 Es q 1 = 1 . If the chain is infinite, this result is no longer valid. For example , if all states are transient, then pi;> -+ 0 for all i, j. If {qi} is a steady-state distribution, {qi} is the unique stationary distribu tion, by Theorem 1 . However, a chain can have a unique stationary distribu tion without having a steady-state distribution , in fact without having limiting probabilities. We give examples later in the section. We shall establish conditions under which a steady-state distribution exists after we discuss the existence and uniqueness of stationary distributions. Let N be the number of positive recurrent classes. CASE 1 . N = 0. Then all states are transient or recurrent null. Hence p�; > -+ 0 for all i, j by Theorem 2 of Section 7.4, so that, by Theorem 1 of this section , there is no stationary distribution. CASE 2. N = 1 . Let C be the unique positive recurrent class. If C is aperiodic, then , by Theorem 2 of Section 7.4 , pip> -+ 1 /p1 , i, j E C. Ifj ¢ C, then j is transient or recurrent null, so that pi,n> -+ 0 for all i. By Theorem 1 ,

238

MARKOV CHAINS

if we assign vi = 1 /pi , j E C, vi = O, j ¢= C, then {v3} is the unique stationary distribution , and P1r > -+ vi for all i, j. Now assume C periodic , with period d > 1 . Let D be a cyclically moving subclass of C. The states of D are recurrent and aperiodic relative to the transition matrix lid. By Theorem 2 of Section 7.4, P1rd > -+ dffli, i, j E D ; hence {dffli, j E D} is the unique stationary distribution for D relative to lid (in particular, !j E D 1 /fli = 1 /d). It follows that vi = 1 /pi , j E C, vi = 0, j tt C, gives the unique station ary distribution for the original chain (see Problem 1). CASE 3.

N > 2. There is a uniqu� stationary distribution for each positive

recurrent class , hence uncountably many stationary distributions for the original chain. For if V11I = Vr, V2 1I = V2 , then , if a1 , a2 > 0, a1 + a2 = 1 , we have (at vt + a2 V2) II = at vt + a2 V2 In summary , there is a unique stationary distribution if and only if there is exactly one positive recurrent class.

Finally , we have the basic theorem concerning steady-state distributions. Theorem

2.

(a) If there is a steady-state distribution, there is exactly one positive recurrent class C, and this class is aperiodic; also, fii = 1 for all j E C and all i E S. (b) Conversely, if there is exactly one positive recurrent class C, which is aperiodic, and, in addition, fii = 1 for all j E C and all i E S, then a steady-sta te distribution exists.

PROOF. (a) Let {vi} be a steady-state distribution. By Theorem 1 , {vi} is the unique stationary distribution ; hence there must be exactly one positive recurrent class C. Suppose that C has period d > 1 , and let i E a cyclically moving subclass C0 ,j E C1 . Then p �;d + t > -+ dffli by Theorem 2 of Section 7.4, and pJ;d > = 0 for all n. Since dffli > 0, pJr > has no limit as n -+ oo , contradicting the hypothesis. If j E C and i E S, then by Theorem 2(b) of Section 7 .4, P1r > -+ h:)fli, hence vi = hi/fli· Since vi does not depend on i, we have hi = hi = 1 . (b) By Theorem 2(b) of Section 7.4, �� > -+ fi j for all i, if j E C P'l J fl j for all i if j tt C -+ 0

7.5

STATIONARY AND STEAD Y-STATE DISTRIBUTIONS

239

since in this case j is transient or recurrent null. Therefore , if fii = 1 for all i E S and j E C, the limit vi is independent of i . Since C is positive , vi > 0 for j E C ; hence , by Theorem 1 , �i vi = 1 and the result follows. [Note that if a steady state distribution exists , there are no recurrent null classes (or closed transient classes) . For if D is such a class and i E D , then since D is closed , hi = 0 for all j E C, a contradiction. Thus in Theorem 2, the statement "there is exactly one positive recurrent class , which is aperiodic" may be replaced by "there is exactly one recurrent class , which is positive and aperiodic" . ] CoROLLARY. Consider a finite chain. (a) A steady- state distribution exists iff there is exactly one closed equiva lence class C, and C is aperiodic. (b) There is a unique stationary distribution iff there is exactly one closed equivalence class. PROOF. The result follows from Theorem 2, with the aid of Theorem 8c of Section 7. 3 , Theorem 2e of Section 7.4 , and the fact that if B is a finite set of transient states , the probability of remaining forever in B is 0 (see the remark after Theorem 4 of Section 7.3). [It is not difficult to verify that a finite chain has at least one closed equivalence class. Thus a finite chain always has at least one stationary distribution. ] REMARK. Consider a finite chain with exactly one closed equivalence class, which is periodic. Then , by the above corollary , there is a unique stationary distribution but no steady-state distribution, in fact no limiting probabilities (see the argument of Theorem 2a). For example , consider the chain with transition matrix

The unique stationary distribution is rr rr

n n

[� �} = [� �} =

(1 /2, 1 /2) , but n

even

n

odd

and therefore rr n does not approach a limit. Usually the easiest way to find a steady-state distribution

{vi}, if it exists ,

240

MARKO V CHAINS

is to use the fact th�t a steady-state distribution must be the unique stationary

· distribution. Thu� we solve the equations

L ViPii = v jEs iE S under the conditions that all vi > 0 and LiES vi = j'

1.

P RO B L E M S

Show that if there is a single positive recurrent class C, then {1/ p,1, j E C}, with probability 0 assigned to states outside C, gives the unique stationary distribution for the chain. HINT : p�:;d> = Zk EC p �r;d-1 >pki' i E C. Use Fatou's lemma to show that 1/ lti > Lk E0(1/ flk)P ki· Then use the fact that Lj EC 1/ p,1 = 1 . 2. (a) If, for some N, IIN has a column bounded away from 0, that is, if for some j0 and some � > 0 we have p�fo> > � > 0 for all i, show that there is exactly one recurrent class (namely, the class ofj0) ; this class is positive and aperiodic. (b) In· the case of a finite chain, show that a steady-state distribution exists iff II1V has a positive column for some N. 3. Classify the states of the following Markov chains. Discuss the limiting behavior of the transition probabilities and the existence of steady-state and stationary distributions. 1 . Simple random walk with no barriers. 2. Simple random walk with absorbing barrier at 0. 3 . Simple random walk with absorbing barriers at 0 and b. 4 . Simple random walk with reflecting barrier at 0. 5. Simple random walk with reflecting barriers at 0 and I + 1 . 6. The chain of Example 2 , Section 7. 1 . 7. The chain of Problem 4c, Section 7 .3. 8. The chain of Problem 4d, Section 7 .3. 9. A sequence of independent random variables ( Problem 5, Section 7 .3). 1 0. The chain with transition matrix 1.

1

2

3 4 5 6 7 1 -o 0 1 0 0 0 o2 0 0 0 1 0 0 0 0 3 0 0 0 0 4 3 4

II =

1

4 0 0 0 0 0 0

1

5 0 1 0 0 0 0 0 6 7

0 1 0 0 0 0 0 1

� ...

_

1

Q ...

0 0 0 0 0

I nt rod uct i o n to Stat i st i cs

8. 1

S T A T I S T I C A L D E CI S I O N S

Suppose that the number of telephone calls made per day at a given exchange is known to have a Poisson distribution with parameter fJ , but fJ itself is unknown. In order to obtain some information about fJ , we observe the number of calls over a certain period of time , and then try to come to a decision about 0. The nature of the decision will depend on the type of in formation desired. For example , it may be that extra equipment will be needed if fJ > 00, but not if fJ < 00• In this case we make one of two possible decisions : we decide either that fJ > 00 or that fJ < 00• Alternatively , we may want to estimate the actual value of fJ in order to know how much equipment to install. In this case the decision results in a number B, which we hope is as close to fJ as possible. In general , an incorrect decision will result in· a loss , which may be measurable in precise terms , as in the case of the cost of un necessary equipment, but which also may have intangible components. For example , it may be difficult to assign a numerical value to losses due to customer complaints , unfavorable publicity , or government investigations. Decision problems such as the one just discussed may be formulated mathematically by means of a statistical decision model. The ingredients of the model are as follows. 1 . N, the set of states o.,f nature. 24 1

242


2. A random variable (or random vector) R, the observable, whose distribution function F8 depends on the particular fJ E N. We may imagine that "nature" chooses the parameter fJ E N (without revealing the result to us) ; we then observe the value of a random variable R with distribution function F8• In the above example , N is the set of positive real numbers , and F8 is the distribution function of a Poisson random variable with parameter fJ. 3. A , the set of possible actions. In the above example , since we are trying to determine the value of fJ , A = N = (0, oo) . 4. A loss function (or cost function) L(fJ , a), fJ E N , a E A ; L(fJ , a) repre sents our loss when the true state of nature is fJ and we take action a. The process by which we arrive at a decision may be described by means of a decision function, defined as follows. Let E be the range of the observable R (e.g. , E1 if R is a random variable, E n if R is an n-dimensional random vector). A nonrandomized decision function is a function cp from E to A. Thus, if R takes the value x, we take action cp(x). qJ is to be chosen so as to minimize the loss , in some sense. Nonrandomized decision functions are not adequate to describe all aspects of the decision-making process. For example , under certain conditions we may flip a coin or use some other chance device to determine the appropriate action. (If you are a statistician employed by a company , it is best to do this out of sight of the customer.) The general concept of a decision function is that of a mapping assigning to each x E E a probability measure Px on an appropriate sigma field of subsets of A. Thus Px (B) is the probability of taking an action in the set B when R = x is observed. A nonrandomized decision function may be regarded as a decision function with each Px concentrated on a single point ; that is, for each x we have Px{a} = 1 for some a (= cp(x)) in A . We shall concentrate on the two most important special cases of the statistical decision problem, hypothesis testing and estimation. A typical physical situation in which decisions of this type occur is the problem of signal detection. The input to a radar receiver at a particular instant of time may be regarded as a random variable R with density f8 , where fJ is related to the signal strength. In the simplest model, R = fJ + R' , where R' (the noise) is a random variable with a specified density, and (} is a fixed but unknown constant determined by the strength of the signal. We may be interested in distinguishing between two conditions : the absence of a target ((} = 00) versus its presence (0 = 0 1) ; this is an example of a hypothesis testing problem . Alternatively , we may know that a signal is present and wish to estimate its strength. Thus, after observing R, we record a number that we hope is close to the true value of (} ; this is an example of a problem in estimation.

8.2

HYPO THESIS TESTING

243

As another example , suppose that fJ is the (unknown) percentage of defective components produced on an assembly line. We inspect n components (i.e. , we observe R1 , • • • , R n , where Ri = 1 if component i is defective, Ri = 0 if component i is accceptable) and then try to say something about fJ. We may be trying to distinguish between the two conditions fJ < 00 and fJ > 00 (hypothesis testing) , or we may be trying to come as close as possible to the true value of () (estimation). In working the specific examples in the chapter, the table of common density and probability functions and their properties given at the end of the book may be helpful. 8.2

H YP O T H E S I S T E S T I N G

Consider again the statistical decision model of the preceding section. Sup pose that H0 and H1 are disjoint nonempty subsets of N whose union is N, and our objective is to determine whether the true state of nature fJ belongs to H0 or to H1 • (In the example on the telephone exchange , H0 might corre spond to () < 00, and H1 to () > 00.) Thus our ultimate decision must be either " () E H0" o r " () E H1 ,'' so that the action space A contains only two points , labeled 0 and 1 for convenience. The above decision problem is called a hypothesis-testing problem ; H0 is called the null hypothesis, and H1 the alternative. H0 is said to be simple iff it contains only one element ; otherwise H0 is said to be composite, and simi larly for H1 • To take action 1 is to reject the null hypothesis H0 ; to take action 0 is to accept H0• We first consider t� e case of simple hypothesis versus simple alternative. Here H0 and H1 each contain one element, say 00 and 0 1 • For the sake of definiteness , we assume that under H0 , R is absolutely continuous with density f0, and under H1 , R is absolutely continuous with density Jr. (The results of this section will also apply to the discrete case upon replacing integrals by sums.) Thus the problem essentially comes down to deciding, after observing R , whether R has density /0 or Jr. A decision function may be specified by giving a (Borel measurable) func tion cp from E t o [0, 1 ] , with cp(x) interpreted as the ptobability of rejecting H0 when x is observed. Thus, if cp(x) = 1 , we reject H0 ; if cp(x) = 0, we accept H0 ; and if cp(x) = a, 0 < a < 1 , we toss a coin with probability a of heads : if the coin comes up heads , we reject H0 ; if tails , we accept H0• The set {x : cp(x) = 1 } is called the rejection region or the critical region ; the function cp is called a test. The decision we arrive at may be in error in two possible ways. A type 1 error occurs if we reject H0 when it is in fact true, and a type 2 error occurs if H0 is accepted when it is false, that is, when H1 is

244


true. Now if H0 is true and we observe R = x, an error will be n1ade if H0 is rejected ; this happens with pro bability cp(x). Thus the pro bability of a type 1 error is rx =

J:

cp(x) f0(x) dx

(8 . 2 . 1)

Similarly , the probability of a type 2 error i s f3

=

J:

( 1 - cp ( x)) f1(x) dx

(8.2 .2)

Note that ex is the expectation of ·cp(R) under H0, sometimes written E0 cp ; similarly , {3 = 1 - E0 cp. It would be desirable to choose cp so that both ex and {3 wi�l be small, but, as we shall see , a decrease in one of the two error probabilities usually results in an increase in the other. For example , if we ignore the observed data and always accept H0 , then ex = 0 but {3 = 1. There is no unique an swer to the question of what is a good test ; we shall consider s·everal possibilities First, suppo se that there is a nonnegative cost ci associated with a type i error, i = 1 , 2. (For simplicity , assume that the cost of a correct decision is 0.) Suppose also that we know the probability p that the null hypothesis will be true. (p is called the a priori probability of H0• In many situations it will be difficult to estimate ; for example , in a radar reception problem , H0 might correspond to no signal being present.) Let cp be a test with error probabilities ex ( cp) and {3( cp) The over-all average cost associated with cp is 0

( 8 .2. 3 )

B( cp) is called the Bayes risk associated with cp ; a test that minimizes B( cp) is called a Bayes test corresponding to the given p , c1 , c 2 , f0 , and f1 •

The Bayes solution can be computed in a straightforward way. from (8.2. 1-8.2.3) , B ( cp)

have ,

( 1 - p) c 2( 1 - cp(x)) f1(x) ] d x

=

J: [pc 1 cp ( x) f0( x)

=

J: cp ( x) [pctf0( x) - ( 1 - p)c 2 .{1( x)] dx

+

We

+

( 1 - p) c 2 ( 8 . 2 . 4)

Now if we wish to minimize J s cp(x)g(x) dx and g(x) < 0 on S, we can do no better than to take cp(x) = 1 for all x in S ; if g (x) > 0 on S, we should take cp(x) = 0 for all x in S ; if g(x) = 0 on S, cp(x) may be chosen a rbitrarily. In this case g (x) = pc1f0(x) - ( 1 - p)c2[t (x) , and the Bayes solution may

8.2

HYPO THESIS TES TING

therefore be given as follows.

245

Let L(x) = f1 (x ) lf0(x) .

If L(x) > pc1/ ( 1 - p)c2 , take q;(x) = 1 ; that is , reject H0•

If L(x) < pc1/ ( 1 - p)c2 , take q;(x) = 0 ; that is , accept H0• If L(x) = pc1/( 1 - p)c 2 , take q;(x) = anything.

L is called the likelihood ratio , and a test q; such that for some constant it , 0 < A < oo , q;(x) = 1 when L(x) > it and q;(x) = 0 when L(x) < it , is

called a likelihood ratio test, abbreviated LRT. To avoid ambiguity , if .fr(x) > 0 and /Q (x) = 0, we take L(x) = oo. The set on which f1(x) = f0(x) = 0 may be ignored , since it will have probability 0 under both H0 and H1. Also , if we observe an x for which f1 (x) > 0 and f0(x) = 0, it must be associated with H1 , so that we should take q;(x) = 1 . It will be convenient to build this requirement into the definition of a likelihood ratio test : if L(x) = oo };Ve assume that q;(x) = 1 . In fact, likelihood ratio tests are completely adequate to describe the problem of testing a simple hypothesis versus a simple alternative. This assertion will be justified by the sequence of theorems to follow. From now on , the notation P0(B) will indicate the probability that the value of R will belong to the set B when the true state of nature is (} . Theorem 1. For any a , 0 < probability of type 1 error is a .

a

0. Now G (y) = P0 {x : L(x) < y} , - oo < y < oo, is a distribution function [of the random variable L(R) ; notice that L(R) > 0, and L(R) cannot be infinite under H0 ] . Thus either we can find it , 0 < it < oo , such that G( it) = 1 - a , or else G jumps through 1 - a ; that is, for some it we have G(it-) < 1 - a < G(it) (see Figure 8.2 . 1 ) . Define PROOF.

0

q;(x) = 1

= 0

= a

if L(x) > it

if L(x) < it

if L(x) = it

where a = [ G(/l.) - ( 1 - a)] / [ G(it) - G(it-)] if G(it) > G (it-) , a = an arbitrary number in [0 , 1 ] if G(it) = G(it-). Then the probability of a type 1 error Is .

P0 {x : L(x) > 0

as desired.

it } + aP0 {x : L(x) 0

= it } = 1 - G(it)

+ a [G (it)

- G(it-) ] = a

246

INTRODUCTION TO STATISTICS G(y)

1 - a

FIGURE 8.2.1

A test is said to be at level cx0 jf its probability ex of type 1 error is < cx0• oc itself is· called the size of the test, and 1 - {3, the probability of rejecting the null hypothesis when it is false, is called the power of the test. The following result, known as the Neyman-Pearson lemma, is the funda mental theorem of hypothesis testing. 2.

Let cp;. be ex LRT with parameter it and error probabilities ex;. and {3;. . Let cp be an arbitrary test 1-v ith error probabilities ex and {3 ; if ex < oc;. then {3 > {3;.. In other words, the LRT has maximum polver among all tests at level ex;.. Theorem

We give two proofs Consider the Bayes problem with costs c1 = c2 = 1 , and set it = pc1 /(1 - p)c2 = p/(1 - p). Assuming first that it < oo, we have p = it/(1 + it). Thus cp;. is the Bayes solution when the a priori probability is p = it/(1 + it). If {3 < {3;., we compute the Bayes risk [see (8.2.3)] for p = it/(1 + it) , using the test cp. FIRST PROOF.

B(cp) = pet + (1 - p)f3

But ex < ex;. by hypothesis, while {3 < {3;. and p < 1 by assumption. Thus B( cp) < B( cp;.) , contradicting the fact that cp;. is the Bayes solution. It remains to consider the case it = oo. Then we must have cp;.(x) = 1 if L(x) = oo, cp;.(x) = 0 if L(x) < oo. Then ex;. = 0, since L(R) is never infinite under H0 ; consequently ex = 0, so that, by ( 8.2. 1 ) , cp(x)f0(x) = 0 [strictly speaking, cp( x)f0(x) = 0 except possibly on a set of Lebesgue measure 0]. By (8.2.2) , f3 =

rJ {x:L(x)

0 ; hence cp(x) = 0. Thus , in order to minimize {3, we must take cp(x) = 1 when L(x) = oo . But this says that {J > {J;., com pleting the proof. First assume A < oo. We claim that [cp;.(x) - cp(x)] x [{1(x) - Af0(x)] > 0 for all x. For if.ft (x) > Aj0(x), then cp;.(x) = 1 > cp(x), and if _h (x) < Aj0(x), then cp;.(x) = 0 < cp(x). Thus SECOND PROOF.

J: [«p. 0

By (8.2.1) and (8.2.2) ,

1 - {J;. - ( 1 - {J) - Act;. + Act > 0 or The case A

=

oo

is handled just as in the first proof.

If we wish to construct a test that is best at level ex in the sense of maximum power, we find , by Theorem 1 , a LRT of size ex. By Theorem 2 , the test has maximum power among all tests at level ex. We shall illustrate the procedure with examples and problems later in the section. Finally, we show that no matter what criterion the statistician adopts in defining a good test, he can restrict himself to the class of likelihood ratio tests. A test cp with error probabilities ex and {J is said to be inadmissible iff there is a test cp' with error probabilities ex ' and {J' , with ex ' < ex , {J' < {J, and either ' ex < ex or {J' < {J. (In this case we say that cp' is better than cp.) Of course, cp is admissible iff it is not inadmissible. Theorem 3.

Every LRT is admissible.

Let C{J;. be a LRT with parameter A and error probabilities ex;. and {3;., and cp an arbitrary test with error probabilities ex and {J. We have seen that if ex < ex;. , then {J > {J;.. But the Neyman-Pearso n lemma is symmetric in H0 an d H1• In other words , if we relabel H1 as the null hypothesis and H0 as the alternative, Theorem 2 states that if {J < {J ;_ , then ex > ex;. ; the result follows . PROOF.

Thus no test can be better than a LRT. In fact, if cp is any test, then there is a LRT cp;. that is as good as cp ; that is � ex;. < ex and {J;. < {J. F or by Theorem 1 there is a LRT cp;. with ex;. = ex , and by Theorem 2 {J;. < {J. This argument establishes the following result, essentially a converse to Theorem 3.

248


If cp is an admissible test, there is a LRT with exactly the same error probabilities. Theorem 4.

PROOF. As above, we find a LRT cp;. with ex;. = admissible , we must have fJ;. = {J.

LJ..

•

�

.

and fJ ; < fJ ; since cp is

1.

Suppose that under H0 , R is uniformly distributed between 0 and 1 , and under H1 , R has density 3x2 , 0 < x < 1 . For short we write Example

O<x< 1 H0 : f0(x) = 1 , 0<x< 1 H1 : ft (x) 3 x 2 , We are going to find the risk set S, that is, the set of points ( ex ( cp) , fJ( cp)) where cp ranges over l all possible tests. [The individual points (ex( cp) , fJ( cp)) ' are called risk points.] We are also going to find the set SA of admissible risk points, that is, the set of risk points corresponding to admissible tests. By Theorems 3 and 4 , SA is the set of risk points corresponding to LRTs. F irst we notice two general properties of S. 1 . S is convex ; that is, if Q1 and Q 2 belong to S, so do all points on the line segment joining Q1 to Q 2• In other words, (1 - a)Q1 + aQ 2 E S for all a E [0 , 1 ]. For if Q1 = (ex(cp1) , fJ(cp1)) , Q 2 = ( a ( cp2) , {J( cp2)) and 0 < a < 1 , let cp = (1 - a)cp 1 + a cp 2 Then cp is a test, and by (8.2. 1) and (8.2.2), a(cp) = (1 - a) ex ( cp1) + aex ( cp2) , {J( cp) = (1 - a){3 ( + a{J( cp2 ). If Q = (ex( cp) , {3( cp)) , then Q E S, since cp is a test, and Q = (1 - a)Q1 + aQ2• 2. S is symmetric about ( 1/2, 1/2) ; that is, if l s i , l c5 1 < 1 /2 and (1/2 - s, 1 /2 c5 ) E S, then ( 1 /2 + s, 1/2 + �) E S. Equivalently, (ex, {3) E S implies ( 1 - ex, 1 - {J) E S. For if (ex(cp), fJ(cp)) E S, let cp' = 1 cp ; then cp' is a test and ex(cp') = 1 - ex(cp) , fJ(cp') = 1 - fJ(cp). To return to the present example, we have L(x) = 3x2 , 0 < x < 1 . Thus the error probabilities for a LRT with parameter A < 3 are 2 rx. = P 0 { x : L x) > = 1( A } = P0 . x : x > 0 _

•

cp1)

-

-

{ (1)1 12} Gf {J = P01 { x : L(x) < A } = Po , { x : x < Gf 2 } r(,\/ 3 )1/2 2 ( A)3 /2 = = (1 - a) 3 3x d x = Jo

3

(If A > 3 , then ex = 0, fJ = 1 .) Thus SA = { (ex, (1 - ex) 3) : 0 < ex < 1 }. Since no test can be better than a LRT, SA is the lower boundary of the

8.2

{j =

HYPOTHESIS TESTING

· 249

1 - a3

1

FIGURE 8.2.2

set S ; hence, by symmetry, {(1 - ex , 1 - (1 - ex) 3), 0 < ex < 1 } = { ( ex , 1 - cx.3) : 0 < ex. < 1 } is the upper boundary of S. Thus S must be {( ex , p) : 0 < ex < 1 , (1 - ex)3 < p < 1 - ex3 } (see Figure 8.2.2). Various tests may now be computed without difficulty. We give some typical illustrations. (a) Find a most powerful test at level . 1 5 . Set ex. = . 1 5 = 1 - (A/3) 11 2• Since L(x) > A iff x > (A/3)11 2 , the test is given by cp(x) = 1

if X > .85

=0

if X < .85 if X = .85

= anything

We have p = ( 1 - ex) 3 = (.85 ) 3 = .6 1 4. (b) Find a Bayes test corresponding to c1 = 3/2, c2 = 3 , is a LRT with A = pc1/(1 - p) c2 = 3/2 ; that is , cp( x) = 1

=

0

( A)1 / 2 ..j2 if X > = =. 3

2

p

= 3/4. This

0 7 7

..j2 if x < 2

= a nything

..j2 if x = 2

Thus ex = 1 - (A/3)11 2 = .293 , p = (1 - oc) 3 , and the Bayes risk may be computed using (8.2.3).

250

I NTRODUCT/ON TO STATISTICS

a 9

a + :;; 13 = c 3

-�----� a 0 ��--1 a = 1 - -Y2{!

FIGURE 8.2.3

Geometric Interpretation of Bayes Solution.

The Bayes solution may be interpreted geometrically as follows. We are trying to find a test that minimizes the Bayes risk pc1 oc + (1 - p)c2{3 = (9/8)oc + (3/4){3. If we vary c until the line (9/8)oc + (3/4){3 = c intersects S.A. , we find the desired test (see Figure 8 . 2.3). Notice also that to find the Bayes solution we may differentiate (9/8)a + (3/4)( 1 - a) 3 and set the result equal to zero to oLtain a = 1 - .J2/2 , as before. (c) Find a minimax test, that is, a test that minimizes max ( a , {3) . It is immediate from the definition of admissibility that an admissible test with constant risk (i.e. , a = {3) is minimax. Thus we set a = f3 = ( 1 - a)3 , which yields a = .3 1 8 (approximately) . Therefore (A/3) 112 = 1 - a = .682 , and so we reject H0 if x > .682 and accept H0 if x < . 682. -4111 � Example 2.

Let R be a discrete random variable taking on only the values 0 , 1 , 2 , 3. Let the probability function of R under Hi be pi, i = 0 , 1 , where the P i are as follows. X

Po (x) Pt(x)

0

1

2

3

.1

.2

.3

.4

.2

.1

.4

.3

The appropriate likelihood ratio here is L(x) = p1(x) fp0(x). Arrangin g the values of L (x) in increasing order, we have the following table. X

L(x)

1

3

2

2

4

4 3

1.

3

0

2

8.2

HYPOTHESIS TESTING

We may therefore describe the LRT with parameter A as follows. ex Acceptance Region LRT Rejection Region 1 O it is equivalent to In L(x) > In it ; that is ,

n L 2( 0 1 - fJ0)xk + n ( fJ02 - 0 12) > 2a 2 ln it k= 1 This is of the form !;: 1 xk > c . Thus a LRT must be of the form n if L xk > c k= 1 n =0 if L xk < c k= 1

( 8 . 2 .5)

= anything Now R1 + · · · + R n is normal with mean nO and variance na2 , so that the error probabilities are ex

{

i

= P 80 ( xb . . . , x n) : xk > c k= 1

}

= P 80 {R1 + · · · + Rn > c} · · · + Rn - nfJ0 c - n0 0 = pOo R 1 + > ..J n a ..J n a

{

= 1

p

- F* ( c Jn naOo) _

{

= Pe, ( xt ,

. . . ' xn ) :

where F * is the normal (0, 1) distribution function

�lxk < c}

P81 { R1 + · · · + Rn < c} c n t = F* =

( fo : )

}

8.2

HYPOTHESIS TESTING

253

1

1

0

FIGURE 8.2.5

Admissible Risk Points

When R is Normal .

Thus we have parametric equations for a. and fJ with c as parameter, - oo < c < oo . The admissible risk points are sketched in Figure 8.2. 5. Suppose that we want a LRT of size a. . If Na is the number such that 1 - F*(Na) = a. , then (c - n00)/ .Jn a = Na, so that c = n60 + .Jn aNa . We now apply the results to a problem in testing a simple hypothesis versus a composite alternative. Again let R be normal (fJ , a2) , and take H0 : fJ = 00, H1 : fJ > 00 • If we choose any particular 01 > 00 and test fJ = 00 against fJ = 01, the test described above is most powerful at level a.. However, the test is com pletely specified by c, and c does not depend on 01 • Thus, for any 01 > 00, the test has the highest power of any test at level a. of f) = 00 versus f) = 01• Such a test is called a uniformly most powerful (UMP) level a. test of f) = 00 versus fJ > 00 • We expect intuitively that the larger the separation between 00 and 01, the better the performance of the test in distinguishing between the two possibilities. This may be verified by considering the power function Q, defined by Q (fJ)

=

Eocp

=

the probability of rejecting H0 when the true state of nature is f)

Thus Q(O) increases with fJ. Now if H0 : f) = 00, H1 : f)

=

01 , where 01 < 00, the same technique as

254


above shows that a size ex LRT is of the form cp (

xl , .

. .

'

x n) = 1 =

0

n

if 2 xlc = c

= anything

( c;nn:0) {3 = 1 - F * ( c - n fJ ) �n �

where

�

k= l

= F*

l

_

c = nfJ0 + ,Jn aN1 a Again , the test is UMP at level ex for fJ = 00 versus fJ < 00, with power _

function

(

)

c - nO Q ' (fJ) = F * -Jn_ a which increases as fJ decreases (see Figure 8.2.6) . The above discussion suggests that there can be no UMP level ex test of fJ = 00 versus fJ � 00 • For any such test cp must have power function Q(fJ) for fJ > 00, and Q' (fJ) for fJ < 80 . But the power function of cp is given by

Eo (/) =

1: · · · 1: q;(x1, • • • ,

xn) f8( xl > . . .

, xn) d x1 • • • dxn

where .fo is the joint density of n independent normal random variables with mean fJ and variance a2 • It can be shown that this is differentiable for all () (the derivative can be taken under the integral sign). But a function that is Q(fJ) for f) > 00 and Q' (fJ) for fJ < 00 cannot be differentiable at 00•

Q(8)

FIGURE 8 .2.6

Power Functions .

8.2

HYPOTHESIS TESTING

255

In fact, the test cp with power function Q(O) is UMP at level ex for the composite hypothesis H0 : 0 < 00 versus the composite alternative H1 : 0 > 00• Let us explain what this means. cp is said to be at level ex for H0 versus H1 iff E8cp < ex for all 0 < 00 ; cp is UMP at level ex if for any test cp ' at level ex for H0 versus H1 we have E8cp ' < E8cp for all 0 > 00• In the present case E8cp = Q(O) < ex for 0 < 00 by monotonicity of Q(O) , and E8cp ' < E8cp for 0 > 00, since cp is UMP at level ex for 0 = 00 versus 0 > () 0 · The underlying reason for the existence of uniformly most powerful tests is the following. If 0 < 0' , the likelihood_ ratio Lf8,(x)Lf8(x) can be expressed as a nondecreasing function of t(x) [where, in this case, t(x) = x1 + · · · + xn ; see (8.2.5)]. Whenever this happens , the family of densities f8 is said to have the monotone likelihood ratio (MLR) property. Suppose that the f8 have the MLR property. Consider the following test of 0 = 00 versus 0 = 0 1 , 0 1 > 00• cp(x) = 1 =0 =a

if t (x ) > c if t(x) < c if t(x) = c

where P8 0 {x : t(x) > c} + aP80 {x : t(x) = c } = ex (n otice that c does not depend on 0 1). Let A be the value of the likelihood ratio when t(x) ·= c ; then L (x) > A implies t (x) > c ; hence cp(x) = 1 . Also L(x) < A implies t (x) < c , so that cp(x) = 0. Thus cp is a LRT and hence is most powerful at level ex. We may make the following observations. 1 . cp is UMP at level ex for 0 = 00 versus 0 > 00• This is immediate from the Neyman-Pearson lemma and the fact that c does not depend on the particular 0 > 0 0• 2. If 0 1 < 0 2 , cp is the most powerful test at level ex1 = E81 cp for 0 = 0 1 versus 0 = 0 2 • Since cp is a LRT , the Neyrnan-Pearson lemma yields this result immedi ately. 3. If 0 1 < 0 2 , then E81 cp < E8 2 cp ; that is, cp has a monotone non decreasing power function. It follows , as in the earlier discussion, that cp is UMP at level ex for 0 < 00 versus 0 > 00• By property 2 , cp is most powerful at level ex1 = E8 1 cp for 0 = 0 1 versus ' ' () = 0 2 • But the test cp (x) = ex1 is also at level ex1 ; hence E82cp < E8 2 cp, that is , ex1 = E81 cp < E 92 cp. Since the Neyman-Pearson lemma is symmetric in H0 and Hb if 0 1 < 02, then for all tests cp ' with {3( cp ' ) < {3( cp), we have E81 cp < E81 cp ' .

REMARK .

256


We might say that q; is uniformly least powerful for fJ < 00 among all tests whose type 2 error is 00• �

P RO B L E M S

1.

2.

Let H0 :fo(x) = e-x , x > 0 ; H1 :f1(x) = 2e-2x , x > 0. (a) Find the risk set and the admissible risk points. (b) Find a most powerful test at level .05. (c) Find a minimax test. Show that the following families have the MLR property, and thus UMP tests may be constructed as in the discussion of Example 3. (a) p 0 = the joint probability function of n independent random variables, ea ch Poisson with parameter (). (b) p 0 = the joint probability function of n independent random variables Ri , where Ri is Bernoulli with parameter 0 ; that is, P{R i = 1} 0, P{Ri = + Rn has the binomial 0} = 1 0, 0 < () < 1 ; notice that R1 + distribution with parameters n and 0. (c) Suppose that of N objects, () are defective. If n objects are drawn without replacement, the probability that exactly x defective objects will be found in the sample is =

-

Po

·

(O )(lV- 0 )

(x) = x n- x ' (�n)

X

= 0' 1 ' . .

.

'

·

·

() ( () = 0 ' 1 ' . . . ' N)

This is the hypergeometric probability function ; see Problem 7 , Section 1 .5. (d) f0 = the joint density of n independent normally distributed random variables with mean 0 and variance () > 0. 3. It is desired to test the null hypothesis that a die is unbiased versus the alter native that the die is loaded, with faces 1 and 2 having probability 1 /4 and faces 3 , 4, 5 , and 6 having probability 1 /8. (a) Sketch the set of admissible risk points. (b) Find a most powerful test at level . 1 . (c) Find a Bayes solution if the cost of a type 1 error is c1 , the cost of a type 2 error is 2c1 , and the null hypothesis has probability 3/4. 4. It is desired to test the null hypothesis that R is normal with mean 00 and known variance a2 versus the alternative that R is normal with mean 0 1 = 00 + /a and variance a2 , on the basis of n independent observations of R. Find the minimum value of n such that < .05 and f3 < . 03 . 5. Consider the problem of testing the null hypothesis that R is normal (0, 00) versus the alternative that R is normal (0, 01) 01 > 00 (notice that in this case a UMP test of () < 00 versus () > 00 exists ; see Problem 2d) . Describe a most rY..

,

8.2

HYPOTHESIS TESTING

257

a

6.

powerful test at level and indicate how to find the minimum number of independent observations of R necessary to reduce the probability of a type 2 error below a given figure. Let R , . . . , Rn be independent random variables, each uniformly distributed between 0 and (), () > 0. Show that the following test is UMP at level oc for () = versus () '#:

1

H0 :

H1 :

00

= 7. 8.

9.

*11.

12.

xi < 00a11n

or if max l and find the power function of the test. Show that every admissible test is a Bayes test for some choice of costs c1 and c2 and a priori probability p. Conversely , show that every Bayes test with > 0, 0 0, Bayes test with c > 0, > 0. If is most powerful at level and {J( > 0, show that is actually of size Give a counterexample to the assertion if {J( = 0. Let be a most powerful test at level Show that for some constant A we have if > A; < A, except possibly for in a set of = = 0 if Lebesgue measure 0. there is a A class C of tests is said to be essentially complete iff for any test test Show that the following classes are is as good as E C such that essentially complete. (a) The likelihood ratio tests. (b) The admissible tests. (c) The Bayes tests (i.e. , considering all possible and p). such that the statements is as good as Give an example of tests and and are both false. is as good as Let be independent random variables , each with density and let () = () = (a) If is a test based on n observations that minimizes the sum of the error > if probabilities, show that = ITf = = 0 if < Thus

H0;

R3

R1,1. 2 c2

1

10.

00•

q;

1

a .1

c2

0, 1/4 ,

1/4

1

q;)

a0

a0• q; q;(x) 1 x

q;(x)

q;2

q;2

1; H1:

x

q;)

a.

q;

x q;1

q;1•

c1 , c 2 ,

13.

14.

q;1 q;2 "q;1 q;2" "q;2 q;1" R2 , • • • R1, h8, H0 : 00 , H1 : 01• Pn (x) 1 gn (x) h81(xi)/h80(xi)] 1 , 1 [ n P Pn(x) gn(x) 1. <Xn + Pn Po.{x :gn (x) > 1} + Pol {x : gn�x) > 1 } (b) Let t (xi ) [h81(xi)/h80(xi)]112• Show that n n t(R 1) E t(R ) E 8 P80{x:gn(x) > 1} < IT ] i 0 [ 80 i=1 =

=

=

258


(c) Show that < 1 ; hence an 00 and interchanged shows that observations are taken , both error small.

E80t(R1) fJ1

8.3

0 as n oo . A similar argument with f1n 0 as n oo , so that if enough

---+

---+

---+

---+

probabilities can be made arbitrarily

E S T I MA T I O N

C o nsider the statistical decision model of Section 8. 1 . Suppose that y i s a real-valued function on the set N of states of nature , and we wish to estimate y(O) . If we observe R = x we must produce a number 1p(x) that we hope will be close to y ( O ). Thus the action space A is the set of reals and a decision function may be specified by giving a (Borel measurable) function 1p fr om the such a 1p is called an estimate, and the above decision range of R to problem is called a problem of point estimation of a real parameter. Although the estimate 1p appears intrinsically nonrandomized , it is possible to introduce randomization without an essential change in the model. If R1 is the observable, we let R 2 be a random variable independent of R1 and 0 , with an arbitrary distribution function F. Formally , assume 8 {R1 E = E E where R E E is determined by the distribution function F and is unaffected by 0 . If R1 = x and R 2 = y, Wf; estimate y ( O) by a number 1p(x , y) . Thus we introduce randomization by enlarging the observable. There is no unique way of specifying a good estimate ; we shall discu ss several classes of estimates that have desirable properties. We first consider maximum likelihood estimates. Let be the density (or probability) function corresponding to the state o f nature 0 , and assume for simplicity that y ( O) = 0 . If R = x , the maximum likelihood estimate of 0 is given by y(x) = B = the value of 0 that maximizes f (x). Thus (at least in the discrete case) the estimate is the state of nature that makes the particular observation most likely. In many cases the max'imum likelihood estimate is easily computable .

£1 ,

£1 ;

2 B2} P8{R1 B1}P{R2 B2},

P

P{R2 B2}

B1,

f8

8

..,.. Example 1.

Let R have the binomial distribution with parameters n and 0 , 0 < 0 < 1 , so that p 8 (x) = C :) Ox (1 - O ) n , x = 0, 1 , . . . , n. To find B we may set x

-

to obtain

x

-

f)

-

n-x 1 - f)

=

0

or 0

=

x

n Notice that R may be regarded as a sum of independent random variables

8.3 ESTIMATION

R1 ,

259

, Rn, where Ri is 1 with probability fJ and 0 with probability 1 - 0. In terms of the Ri we have B(R) = (R1 + · · · + Rn) fn , which converges in probability to E(Ri) = fJ by the weak law of large numbers. Convergence in probability of the maximum likelihood estimate to the true parameter can be established under rather general conditions. _.... .,._

•

•

•

Example 2.

Let Rb . . . , R n be independent, normally distributed random variables with mean fl and variance a2• Find the maximum likelihood esti mate of fJ = (p , a2). (Here f) is a point in £2 rather than a real number, but the maximum likelihood estimate is defined as before.) We have so that

n

ln f8(x) = - - I n 27T - n I n a - -- L (xi - p) 2 1

n

2a 2 i= 1

2

Thus

n

n a 1 L (xi - p) = 2 C x - p) ln f8(x) = 2 a i= 1 a op

where

n 1 x = - iL xi n =1

and

(

a n 1� 1 � - ln f8( x) = - -n + £. ( xi - p) 2 = - - a 2 + . - £. ( xi - fl ) 2 i aa

a

a3 i= 1

a3

n =1

)

Setting the partial derivatives equal to zero , we obtain where

0

= ( x, s 2)

-)? s 2 = -1 � £- ( xi - x .... n i =1 (A standard calculus argument shows that this is actually a maximum.) In terms of the Ri , we have

where R is the sample mean (R1 + + Rn)fn and V2 is the sample variance ( 1/n) LF 1 (R i - R)2 • If the problem is changed so that fJ = fl (i.e. , a2 is known) , we obtain tJ = R as above. However, if fJ = a2, then we find (} = ( 1 /n) LF 1 (xi - p)2, since the equation a lnfo(x)fop = 0 is no longer present. ..... ·

·

·

260


We now discuss Bayes estimates. For the sake of definiteness we consider the absolutely continuous case. Assume N = £ 1 , and letf8 be the density of R when the state of nature is fJ. Assume that there is an a priori density g for fJ ; that is , the probability that the state of nature will lie in the set B is given by JB g(fJ) dfJ. Finally, assume that we are given a (nonnegative) loss function L(y(fJ) , a) , fJ E N, a E A ; L(y(fJ) , a) is the cost when our estimate of y(fJ) turns out to be a. If 1p is an estimate, the over-all average cost associated with 1p is B(1p)

=

1:1:g(())j8(x)L(y(()), 1p(x)) d() dx

B( 1p) is called the Bayes risk of 1p, and an estimate that minimizes B( 1p) is called a Bayes estimate. If we write B( 1p)

=

1: [ i: g(())j8(x)L(y(()), 1p(x)) d()J dx

( 8.3. 1 )

it follows that in order to minimize B(1p) it is sufficient to minimize the ex pression in brackets for each x. Often this is computationally feasible. In particular , let L(y(fJ) , a) = (y(fJ) - a) 2 • Thus we are trying to minimize

This is of the form A1p2(x) - 2B1p(x) + 1p(x) = BfA ; that is, 1jJ( X)

=

C, which is a mtntmum when

J:g(())j8(x)y(()) 1: g(()) j8(x) d()

d()

( 8 . 3. 2)

------

But the 'conditional density of fJ given R = x is g(O)f8 (x)/J 0000 g(O)f8 (x) dfJ , so that 1p(x) is simply the conditional expectation of y(fJ) given R = x. To summarize : To find a Bayes estimate with quadratic loss function , set 1p(x) = the conditional expectation of the parameter to be estimated , given that the observable takes the value x . ...,. Example 3.

Let R have the binomial distribution with parameters n and (} , 0 < fJ < 1 , and let y(fJ) = 0. Take g as the beta density with parameters r and s ; that is ,

g( () )

=

or- 1( 1

_ fJ)S-1

{J(r, s )

'

0 < fJ < 1 , r,

s

>0

26 1

8.3 ESTIMATION

where fJ(r, s) is the beta function (see Section 2 of Chapter 4). Fi rst we find a Bayes estimate of fJ with quadratic loss function. The discussion leading to (8.3.2) applies , with f8(x) replaced by p 8 (x) = ( �)fJx ( t - fJ) n-x , x = 0, 1 , . . . , n. Thus

f (:) f ( :)

er-1+>* 1(1 - fJ)s-l+n- x d fJ

1jJ(X)

=

--------

er-l+ x( l - fJ)s-l+n-x d fJ

fJ( r + x + 1 , n - x + s) fJ(r + x, n - x + s)

r ( r + X + l)r(n - X + s) r(r + S + n) r ( r + x) r ( n - X + s) r ( r + S + n + 1) -

+x r + s + n r

Now, for a given fJ , the average loss plp ( fJ) , using 1p , may be computed as follows. p lp(O)

=

Eo

[(

)]

r

+R -02 r + s + n 1

--

(r

= V

Since E8 [(R - nfJ) 2 ] p'I' (O)

=

(r

(r

+ s + n )2

1

ar8 R

+ s + n) 1

=

[ n0(1 2

+ s + n )2

l(( r

E8 [( R - nfJ + n fJ ( l - fJ)

- fJ)

r - rfJ - sfJ)2 ]

and E0R

=

nfJ, we have

+ (r - rfJ - sfJ) 2 ]

+ s) 2 - n) fJ 2 + ( n - 2r( r + s))fJ + r2 ]

pip is called the risk function of 1p ; notice that

(8 .3.3) It is possible to choose r and s so that pip will be constant for all fJ . Fo r this to happen , n = (r + s) 2 = 2r(r + s)

262


which is satisfied if r

=

s = ..Jn/2. We then have

x + Jn/2

1/2 ..Jn -x - = - + 1jJ(X) = n + ..Jn Jn n 1 + ..Jn 1+ Pip (fJ) =

n /4

1 = B( 1fJ) 2 4( 1 + ..j n)

(n + ..j n) 2 Thus in this case 1p is a Bayes estimate with constant risk ; we claim that 1p must be minimax, that is , 1p minimizes max8 p'P(fJ) . For if 1p ' had a maximum risk smaller than that of 1p , ( 8 . 3 . 3 ) shows that B(1p') < B(1p) , contradicting the fact that 1p is Bayes. Notice that if B(x) = xfn is the maximum likelihood estimate, then 1p (x) = an B (x) + bn, where an -+ 1 , b n -+ 0 as n -+ oo . _.... We have not yet discussed randomized estimates ; in fact, in a wide variety of situations, including the case of quadratic loss functions, randomization can be ignored. In order to justify this, we first consider a basic theorem concerning convex functions. A functionffrom the reals to the reals is said to be convex ifff [(1 - a) x + ay] < (1 - a)f(x) + af(y) for all real x, y and all a E [0, 1 ]. A sufficient condition for f to be convex is that it have a nonnegative second derivative ("concave upward" is the phrase used in calculus books) . The geometric interpretation is that flies on or above any o f its tangents. Theorem 1 (Jensen 's Inequality).

If R is a random variable, f is a convex function, and E(R) is .finite, then E[f (R)] > f [E(R)]. (For example, E[R2 n ] > [E(R) ] 2n , n = 1 , 2 , . . . . ) PROOF .

Consid�r a tangent to f at the point E( R) (see Figure 8. 3 . 1) ; let the equation of the tangent be y = ax + b. Since fis convex , f(x) > ax + b fo r all x ; hencef(R) > aR + b. Thus E[f(R)] > aE(R) + b = f(E(R)) . f(x) ax + b

-------4--�--� x

E(R)

FIGURE 8. 3 . 1

Proof of Jensen's Inequality.

8.3 ESTIMATION

263

We may now prove the theorem that allows us to ignore randomized estimates . Theorem

2

(Rao-Blackwell). Let R1 be an observable, and let R 2 be

independent of R1 and fJ, as indicated in the discussion of randomized estimates at the beginning of this section. Let 1p = 1p(x, y) be any estimate of y(fJ) based on observation of R1 and R2• Assume that the loss function L(y(fJ) , a) is a convex function of a for each fJ ( this includes the case of quadratic loss). D efine 1p * (x) = E8 [1p( R1, R 2) I R1 = x] = E[1p(x, R2) ] (E81p (R1, R2) is assumed finite.) Let p'P be the risk function of 1p, defined by p'P(fJ) = E8 [L(y ( O) , 1p (R1, R2)) ] = the average loss, using 1p, when the state of nature is fJ. Similarly, let ptp* (fJ) = E8 [L (y ( fJ) , 1p *( R1))]. Then p'P*(fJ) < p'P(fJ) for all fJ ; hence the nonrandomized estimate 1p* is at least as good as the randomized estimate "P · PROOF.

L(y (fJ) , E8 [1p( R1, R2) I R 1 = x]) < E8 [L( y ( O) , 1p( R1, R2)) I R 1 = x] by the argument of Jensen ' s inequality applied to conditional expectations. Therefore L(y (fJ) , 1p * (R1)) < E8 [L(y (fJ) , 1p(R 1, R2)) I R1] Take expectations on both sides to obtain p'P*(fJ) < E8 [L( y (fJ), 1p(R1 , R2)) ]

as desired.

=

p'P ( fJ)

P RO B L E M S

1.

Let R1 , . . . , Rn be independent random variables, all having the same density h8 ; thus f8 (x1 , • • • , xn) = llf 1 h8 (xi) In each case find the maximum likelihood estimate of 0. (a) h8 (x) = OxB-1 , 0 < x < 1, 0 > 0 .

(b) h8 (x)

= 0 e-xlo ,

(c) h8 (x)

=0 ,

1

1

x > 0, 0 > 0

0 < X < 0, 0

>

0

264 2.


Let R have the Cauchy density with parameter 0 ; that is ,

/8 (x) =

0 1r

(x2

+

0 > 0

0 2) '

Find the maximum likelihood estimate of 0. 3. Let R have the negative binomial distri bution ; that is , (see Problem 6, Section 6.4),

p8 (x) = P{R = x} =

4.

S. 6.

7.

x = r, r + 1 , .

(;_}) Or(1 - O) x-r ,

. . ,0 < 0 < 1

Find the maximum likelihood estimate of 0. Find the risk function in Example 3 , using the maximum likelihood estimate 0 = xfn. In Example 3, find the Bayes estimate if 0 is uniformly distributed between 0 and 1 . In Example 3 , change the loss function to L(O, a) = (0 - a)2/0(1 - 0) , and let 0 I?e uniformly distributed between 0 and 1 . Find the Bayes estimate and show that it has constant risk and is therefore minimax. Let R have the Poisson distribution with parameter 0 > 0. Find the Bayes estimate 'lJl of 0 with quadratic loss function if the a priori density is g(O) = e-8 • Compute the risk function and the Bayes risk using 'IJl , and compare with the results using the maximum likelihood estim�te.

8.4

S UFFI C I E N T S T A T I S T I C S

In many situations the statistician is concerned with reduction of data. For example, if a sequence of observations results in numbers x1 , . . . , xn , it is easier to store the single number x1 + + xn than to record the entire set of observations. Under certain conditions no essential information is lost in reducing the data ; let us illustrate this by an example. Let R1 , . . . , R n be independent, Bernoulli random variables with param eter 0 ; that is, P{Ri = 1 } = 0 , P{Ri = 0 } = 1 - 0 , 0 < f) < 1 . Let T = t(R1, , R n) = R1 + + R n , which has the binomial distribution with parameters n and 0 . We claim that P8{R1 = x1 , . . . , R n = xn I T = y} actually does not depend on 0 . We compute, for xi = 0 o r 1 , i = 1 , . . . , n, ·

•

•

P8 { R 1 - X1,

·

•

•

•

·

This is 0 unless y

·

·

. . . ' R n = Xn , P8 { R = , R - xn I T - Y } - ____;;,_ l xb P8 { T = y } n

•

·

_

1+

·

·

·

x

x

=

=

G)

y}

= __,;;..

___________

+ n , in which case we obtain 1 P8 { Rl xl , . . . ' R n = n } - f)Y( l - f) ) n- y y} P8 { T OY( l - f) ) n- y = x

T

8.4

265

SUFFICIENT STATISTICS

The significance of this result is that for the purpose of making a �tatistical decision based on observation of R 1 , . . . , Rn , we may ignore the individual + Rn. To justify this , consider R i and base the decision entirely on R 1 + two statisticians , A and B. Statistician A observes R1 , , Rn and then makes his decision. Statistician B, on the other hand, is only given T = R1 + + Rn. He then constructs random variables R�, . . . , R� as follows. If T = y, let R�, . . . , R � be chosen according to the conditional probability function of R1 , . . . , R n given T = y. Explicitly , ·

·

·

.

.

•

·

P{ R � =

X1 ,

•

•

•

, R � = X n I T = y} = .

·

·

( ;) l

where xi = 0 or 1 , i = 1 , . . . , n , x1 + + xn = y. B then follows A's decision procedure, using R�, . . . , R�. Note that since the conditional prob ability function of R1 , . . . , R n given T = y does not depend on the un known parameter fJ, B's procedur� is sensible. Now if x1 + + xn = y, ·

·

·

·

=

( :)

euo - et-y

x1 , . . . , R�

( :)

= P0{T = y}P0{R � =

·

·

=

xn

I T = y}

= fJ Y( l fJ) n-y = p0 { R 1 = xl , . . . ' R n = X n } _

Thus (R�, . . . , R�) has exactly the same probability function as (R1 , , Rn) , so that the procedures of A and B are equivalent. In other words , anything A can do , B can do at least as well, even though B starts with less informa tion. We now give the formal definitions. F o r simplicity, we restrict ourselves to the discrete case. However, the definition of sufficiency in the absolutely continuous ca se is the same, with pr o bability functions replaced by densities. Also , the basic factorization theorem , to be proved below, holds in the absolutely continuous case (admittedly with a more difficult proof). Let R be a discrete random variable (or random vector) whose probability function under the state of nature fJ is p0• Let T be a statistic for R, that is , a function of R that is also a random variable. T is said to be sufficient for R (or for the family p 0 , fJ E N) iff the conditional probability function of R given T does not depend on fJ . The definition is often unwieldy, and the following criterion for sufficiency is useful. .

.

•

266


Let T = t(R) be a statistic for R. T is sufficient for R if and only if the probability function p 0 can be factored in the form p 0 (x) = g(fJ , t(x))h(x). Theorem 1 (Factorization Theorem).

PROOF.

Assume a factorization of this form. Then

o { R = x, T = y } P y} I x = = = T { po R P0{T = y} This is 0 unless t(x) = y, in which case we obtain P0{ R = x } g( fJ , t(x)) h (x) L g(fJ, t(z)) h (z) P0{ T = y} ·

{z : t ( z ) =y}

g( fJ, y) h( ) L g( fJ, y)h(z) X

{z : t ( z )=y}

h (x) ' z h ( ) L

which is free of 0

{z : t ( z )=·y }

Conversely, if T is sufficient, then

p 0 (x) = P0{R = x} = P0 {R = x, T = t(x)} = P0 {T = t(x)}P0 {R = x I T = t(x)} by definition of sufficiency = g(fJ , t(x))h(x) � Example 1.

Let Rb . . . , Rn be independent, each Bernoulli with param + R n i s sufficient for (R1 , . . . , Rn) . eter fJ. Show that R 1 + We have done this in the introductory discussion , using the definition of sufficiency. Let us check the result using the factorization theorem. If X = X1 + + Xn ' X · = 0 ' 1 ' and t(x) = X1 + + X n ' then ·

·

·

·

·

t

·

·

·

·

n Po (x1 , . . . , xn) = f) t ( x> ( 1 - fJ) - t < x>

which is of the form specified in the factorization theorem [with h(x) = 1 ] . � Example 2.

...._

Let R1 , . . . , Rn be independent, each Poisson with param eter fJ . Again R1 + + R n is sufficient for (R 1 , . . . , Rn) (Notice that + Rn is Poisson with parameter nO. ) R1 + For ·

·

·

·

·

·

·

n

= IT P0{ Ri = xi } i 1 =

e-

n{} f)Xl +

X

1f ·

• • •

• · •

+xn

Xn

•f

8.4 SUFFICIENT STATISTICS

The factorization theorem applies , with g(fJ, t (x) ) 1 / x1 ! n ! , t (X) = x1 + · · · + n . ...._ �

...X

Example

3.

X

=

267

n e- o()t, h (x)

=

Let R1, . . . , R n be independent, each normally distributed and variance a2 • Find a sufficient statistic for (R1, , R n)

with mean fl . assumtng (a) fl and a2 both unknown ; that is , fJ = (p, a2). (b) a2 known ; that is, fJ = fl · (c) fl known ; that is, fJ = a2 • [Of course (R1 , . . . , R n) is always sufficient for itself, but we hope to reduce the data a bi t more.] We compute •

f8 ( x)

Let

=

n (2 7Ta2)- 1 2 e xp

-x = 1 .2n xi , i 1 -

Since xi -

x

=

xi

- fl

s2

(x; - p) 2] i [ -A 2a i= 1

=

•

•

(8.4. 1 )

1 � (xi - x-) 2 i�1

-

n= n= - (x - p) , we have

s2

1 � ( xi

= - �

n i= 1

Thus

2 - fl )

- ( x-

2 - fl)

(8 .4.2) By (8.4.2) , if fl and a2 are unknown, then [take h(x) = 1 ] (R , V2) is sufficient, where R is the sample mean ( 1 /n) _2 ?1 1 Ri and V2 is the sample variance 2 2 ( 1 /n) .2t 1 (Ri R) 2 • If a2 is known, then the term (2?Ta2)- n f2e-n s 1 2a can be taken as h(x) in the factorization theorem ; hence R is sufficient. If !-' i s known , then , by (8.4. 1 ) , .2r 1 ( Ri - p) 2 is sufficient. ...._

-

t

-

PRO B LE M S

1.

2.

Let R1, . . . , Rn be independent, each uniformly distributed on the interval (01, 02]. Find a sufficient statistic for (R1, • • • , Rn), assuming (a) 01, 0 2 both unknown (b) 01 known (c) 0 2 known Repeat Problem 1 if each Ri has the gamma density with parameters 0 1 and 02 , that is,

j((x)

==

x

8I-1e-xlo 2

I, ( 01)02 8 1 '

268 3.


Repeat Problem 1 if each Ri has the beta density with parameters fJ 1 and 02 , that is,

4. Let R 1 and R 2 be independent, with R1 normal (fJ, a2), R2 normal (fJ, T2), where S.

a2 and T2 are known. Show that R1/ a2 + R2/ T2 is sufficient for (R1 , R2). An exponential family of densities is a family of the form f6 (x) = a ( O)b (x) exp

[ i�

ci(O) ti(x)

J,

x real, fJ E N

(a) Verify that the following density (or probability) functions can be put into the above form. x = 0, 1 , . . . , n, 0 < fJ < 1 (i) Binomial (n, fJ) : p8 (x) = (�)fJx(1 - fJ) n -x , e-o ox X = 0, 1, . . . , fJ > 0 (ii) Poisson (fJ) : p8 (x) = , , X. x0 1

-le-xlo

2

(iv) Gamma (fJl , fJ2) : fe (x) = r( fJl)fJ 28 1 x)B 2-1 xO 1 ( 1 , (v) Beta (0 1 , fJ2) : f0(x) = (fJ , 0 ) {J 1 2 I-

_

'

X

>

0, (} = (fJ l , 0 2), f)b f) 2 > 0

0

0 x = r, r + 1 , . . . , (vi) Negative binomial (r, fJ) : p8 (x) = fn < k > ] is a UMVUE of y(O) sample mean Tfn is a UMVUE of 0 . ...._

=

-

"'2;: 0 ak0k ; in particular the

Let Rb . . . , R n be independent, each Poisson with parameter 0. By Example 2, Section 8.4 , T = R1 + · · · + Rn is sufficient for (R1, , Rn) ; T is also complete. For T is Poisson with parameter nO ; hence

..- Example 2.

•

(

•

•

If E8g T) = 0 for all 0 > 0, then

f [ g(k) nk] (jk

k=O

k!

=

0

for all 0 > 0

Since this is a power series in 0, we must have g = 0. If 1p (x1 , , x n) = g(t(x1 , , xn)) is an unbiased estimate of y(O), then •

•

.

•

•

•

Thus y(O) must be expressible as a power series in 0. If y (O) then

n

k i where ck = "'2 . a k- i i=O l ! -

But

=

"'2: 0 ak0 k,

8.5

UNBIASED ESTIMATES BASED ON COMPLETE SUFFICIENT STATISTIC

27 1

hence

=

We conclude that

T -· ! T ' aT-� nT-t i=O i !

For example, if y(O) T '· --

1

r) ! n r

(T -

-

=

=

k'

k ! -· ak-� n k-z i=O i ! 00

y (O)

is a UMVUE of

or,

r< r > nr

-

r

(

=

=

=

! ako k

k=O

1 , 2, . . . , ·the UMVUE is T

n

-

=

the sample mean when r

=

1

)

[In this particular case the above computation could have been avoided, since we know that E8 (T) (nO)" (Problem 8, Section 3.2). Since r fnr is an unbiased estimate of or based on T, it is a UMVUE.] As another example, a UMVUE of 1 /(1 0) !: 0 0 k , 0 < 0 < 1 , is =

f T ! nT-t 1

.

i=O

i!

-

=

0. By Problem 1 , Section 8.4 , T max Ri is sufficient for (R1, . . . , Rn). (a) Show that T is com plete. (b) Find a UMVUE of y(O), assuming that y extends to a function with a continuous derivative on [0, oo ), and on y(O) 0 as () 0. [In part (a), use without proof the fact that if sg h(y) dy 0 for all () > 0, then h(y) 0 except on a set of Lebesgue measure 0. Notice that if it is known that h is continuous, then h 0 by the fundamental theorem of calculus.] , Rn be independent, each normal with mean () and known variance 3. Let R1 , a2 . (a) Show that the sample mean R is a UMVUE of 0 . (b) Show that (R)2 - (a2fn) is a UMVUE of 02 • [Use without proof the fact that ·if J oo h(y)e8Y dy 0 for all () > 0, then h(y) 0 except on a set of Lebesgue measure 0.] =

---+

---+

=

=

=

.

.

•

00

=

=

272 4.

S.

=


=

=

Let R1 , . . . , Rn be independent, with P{Ri k} = 1/N, k = 1 , . . . , N; take 0 N, N 1 , 2 , . . . . (a) Show that max1 � i �n Ri is a complete sufficient statistic. (b) Find a UMVUE of y(N). Let R have the negative binomial distribution :

k = r, r + 1 , . . . , 0

1

< p :::;;:

Take 0 = 1 - p, 0 :::;;: 0 < 1 . Show that y(O) has a UMVUE if and only if it is expressible as a power series in 0 ; find the form of the UMVUE.

6. Let

P{R = k} =

7.

-

Ok

e-8

1

e-8 k !

k = 1 ' 2, . . . ' 0

'

>

0

(This is the conditional probability function of a Poisson random variable R', given that R' > 1 .) R is clearly sufficient for itself, and is complete by an argu ment similar to that of Example 2. (a) Find a UMVUE of e-8• (b) Show that (assuming quadratic loss function) the estimate 'lJl found in part (a) is inadmissible ; that is, there is another estimate '1Jl 1 such that P �p '(O) < p 'P (O) for all 0, and P �p '(O) < p 'P (O) for some 0. This shows that unbiased estimates, while often easy to find, are not necessarily desirable. The following is another method for obtaining a UMVUE. Let R1, . . . , Rn be independent, each Bernoulli with parameter 0, 0 < 0 < 1 , as in Example 1 . If j = 1 , . . . , n, then

E

[ IT �J = P{R1 = = Rj = 1 } = ...

t= 1

Oi

Thus R1R2 R3 is an unbiased estimate of Oi. But then 'IJl (k) = E[R1 R3 1 If 1 Ri = k} is an unbiased estimate of Oi based on the complete sufficient statistic 2f 1 Ri , so that 'lJl is a UMVUE. Compute 'lJl directly and show that the result agrees with Example 1 . 8. Let R1 , . . . , Rn be independent, each Poisson with parameter 0 > 0 . Show, using the analysis in Problem 7, that •

9.

10.

•

·

•

(

n

)

E R1R2 I 2 Ri =k = t= . 1

k (k

-2 1) '

n

•

•

k = 0, 1 ' . . .

Let R1 , . . . , Rn be independent, each uniformly distributed between 0 and 0 ; if T = max Ri, then [ (n + 1 )/n ] T is a UMVUE of 0 (see Problem 2). Compare the risk function E8[((1 + 1/n) T 0)2] using [ (n + 1 )/n] T with the risk function E8[( (2/n) 2f 1 Ri - 0)2 ] using the unbiased estimate (2/n) 2� 1 Ri . Let R1 , . . . , Rn be independent, each Bernoulli with parameter 0 E [0, 1 ] . Show that (assuming quadratic loss function) there is no best estimate of 0 based on R1 , . . . , Rn ; that is, there is no estimate 'lJl such that p 'P (O) :::;;: Ptp ' (O) for all e and all estimates '1Jl 1 of 0. •

-

8.5

11.

12.

UNBIASED ESTIMATES BASED ON COMPLETE SUFFICIENT STATISTIC

273

If E0'1Jl1 (R) = E0'1Jl 2 (R) = y(O) and '1Jl1, '1Jl 2 both minimize p 'P (O) = E0 [('IJl (R) y( 0))2], 0 fixed, show that P0{'lflt (R) = '1Jl2 (R) } = 1 . Consequently, if '1Jl1 and "P 2 are UMVUEs of y(O), then, for each 0, '1Jl1 (R) = '1Jl 2 (R) with probability 1 . Let [0, 0 E N = an open interval of reals, be a family of densities. Assume that of0 (x)fo0 exists and is continuous everywhere, and that J oo 00 f0 (x) dx can be differentiated under the integral sign with respect to 0. (a) If R has density [0 when the state of nature is 0, show that

E0 [

° 0 0 In f0(R)

J

=

0

(b) If E0'1Jl (R) = y(O) and J 0000 '1Jl (x)f0(x) dx can be differentiated under the integral sign with respect to 0, show that

� y(O)

=

{ :o Inf0(R)J

E 1p (R)

(c) Under the assumptions of part (b), show that

= f8 (x1,

if the denominator is >0. In particular, if f8 (x) IIi 1 h 8 (xi), then

E0

o

= =

n Var - In h 8 (R ·) 9

ao a

•

•

,

xn)

=

'l

where R = (R1 , , Rn). The above result is called the Cramer-Rao inequality (an analogous theorem may be proved with densities replaced by probability functions). If 'lJl is an estimate that satisfies the Cramer-Rao lower bound with equality for all 0, then ·lJJ is a UMVUE of y(O). This idea may be used to give an alternativ� proof that the sample mean is a UMVUE of the true mean in the Bernoulli, Poisson, and normal cases (see Examples 1 and 2 and Problem 3 of this section). If R1 , , Rn are independent, each with mean f..l and variance a2 , and V2 is the sample variance, show that V2 is a biased estimate of a2 ; specifically, •

13.

(:o lnf (R)J

° Var11 0 In fo (R) 0

•

•

•

•

•

•

E( V2)

=

(n - 1) 2 a n

274 8.6

INTRODUCTION TO STATISTICS S A MP L I N G F R O M A N O R M A L P O P U L A T I O N I

If R1 , . . . , Rn are independent, each normally distributed with mean fl and variance a2 , we have seen that (R, P) , where R is the sample mean + R n) and V2 = (1 /n) L� 1 (Ri - R) 2 is the sample vari (1 /n) (R1 + ance , is a sufficient statistic for {R1 , , R n). R and V2 have some special properties that are often useful. First, R is a sum of the independent normal random variables Ri/n , each of which has mean pfn and variance a2/n 2 ; hence R is normal with mean fl and variance 0'2/n. We now prove that R and V2 are independent. ·

·

·

•

Theorem 1.

If R1 ,

•

•

•

•

•

, R n are independent, each normal (p, a2), the

associated sample mean and variance are independent random variables. *PROOF. Define random variables Wl , . . . ' wn by

where the c ii are chosen so as to make the transformation orthogonal . [This may be accomplished by extending the vector (1 /)n, . . . , 1 /.Jn) to an orthonormal basis for En . ] The Jacobian J of the transformation is the determinant of the orthogonal matrix A = [cii ] (with cH = 1 f)n , j = 1 , . . . , n) , namely, ± 1 . Thus (see Problem 12, Section 2.8) the density of ( W1 , . . . , Wn) is given by _ f(xl , xn) ! * ( Yt ,

·

·

·

, Y n) -

·

·

IJI

·

,

n = (2 1T a2 )- f 2 exp

where

[- 2a_12 i=Ii ( xi - p)2] Yt

= A-t Yn

8.6 SAMPLING FROM A NORMAL POPULATION

f *( y1 ,

• • •

, Yn ) = (2 7Ta2)-n1 2 exp

=

1 _

;)2 17" a

ex p [ -

[ - � (i 2a

z=l

n

�

]

n C Y 1 - .Jn p) 2 Ti 1 i = 2 .J27T a 2a 2

-2= R)

� (Ri i =l

n

�

_

n

- �

2

= L Ri2 i=l

=

• • • �

� R i - 2 R � Ri + i=l i=l

n

i

i =l

)J 2) ex p ( - --

2 2 Yi - 2p.Jn Y1 + np

It follows that W1 , • • • , Wn are independent, with W2 , (0 , a2) and W1 normal (�n p , a2) . But

n V2 =

275

-

Yi 2 a2

Wn each normal

n(R-) 2

n ( R) 2

l¥; 2 -

n

( i � )2 i=t .J n

= L � 2 - w1 2 i=l n = L �2 i= 2

Since .Jn R = W1 , it follows that R and V2 are independent, completing the proof. The above argument also gives us the distribution of the sample variance. For

( )

n V2 n � 2 = .L a2 z= 2 a

-

where the Wifa are independent, each normal {0, 1). Thus n V2ja 2 has the chi-square distribution with n 1 degrees offreedom ; that is, the density of n V2/a2 is 1 2 < n-l >J 2r ((n - 1) /2)

x< n- 3 ) f 2 e-xf 2'

x>O

(see Problem 3 , Section 5.2). Now since R is normal (p , a2J n) , .Jn (R - p)/a is normal {0 , 1) ; hence

{

P -b

< .J n (R

� #) < b } = F*(b) - F*(- b) =

2F* (b ) -

1

276


where F* is the normal {0, 1) distribution function. If b is chosen so that 2F* (b) - 1 = 1 - oc , that is, F* (b) = 1 - oc/2 (b = Nrz12 in the terminology of Example 3 , Section 8.2), or

P - Na. /2

{ {

P R-

Thus, with probability 1

0 fmn /n) is said to have the F distribution with m and n degrees offreedom, (R1/m)f(R 2 abbreviated F(m , n). Calculate the mean and variance _of the chi-square, t, and F distributions. (a) If has the t distribution with n degrees of freedom, show that has the F(1 , n) distribution. If R has the F(m, n) distribution, show that 1/R has the F(n, m) distribution. (c) If R1 is chi-square (m) and R2 is chi-square (n), show that R1 + R 2 is chi square (m + n). Discuss the problem of obtaining confidence intervals for the variance a2 of a =

2. 3.

T2

T

(b)

4.

normally distributed random variable, assuming that (a) The mean p is known (b) p is unknown be inde (A two-sample problem) Let pendent, with the normal p , and the normal (p1 , p , p 2 and unknown). Thus we are taking independent samples from two different normal populations. Show that if and i = 1 , 2, are the sample mean and variance of the two samples, and

R11 ,aR2 12 , • • • , R1n1 , R21, R22 , • • • , R2n2 a2 R2 i R1 i ( 2 a2) ( 1 ) Ri Vi2 , 1 2 / n n + 2 2 1 [ l k (nl vl2 + n2 v2 ) /2 nln2(nl + n2 - 2)J then [R1 - R2 - kta. 1 2 , n 1+n 2_ 2 , R1 - R2 + kta. 1 2 , n1+n 2_ 2 ] is a confidence interval for p 1 - p 2 with confidence coefficient In Problem 5, assume that the samples have different variances a12 and a22 • Discuss the problem of obtaining confidence intervals for the ratio a12/a22 • 7. (a) Suppose that C(R) is a confidence set for y(O) with confidence coefficient >1 that is, for all E N P9{y(O) E C(R)} > 1 Consider the hypothesis-testing problem H0 : y (O) k H1 : y (O) k and the following test. q;k(x) 1 if k tj: C(x) if k C(x) 0 [Thus C(x) is the accep tance region of q;k.] Show that q;k is a test at level

S.

=

1 - (X.

6.

- (X ;

-

(X

0

=

"#:

=

=

E

(X .

8.7

THE MULTIDIMENSIONAL GAUSSIAN DISTRIBUTION

279

(b) Suppose that for all k in the range of y there is a nonrandomized test q;k (i.e. , q;k (x) = O or 1 for all x] at level (X for H0 : y(fJ) = k versus H1 : y(fJ) '#: k. Let C(x) be the set {k : q;k (x) = 0}. Show that C(R) is a confidence set for y(fJ) with confidence coefficient > 1 - (X. This result allows the confidence interval examples in this section to be translated into the language of hypothesis testing.

*8 . 7

T H E MULTI D I M E N S I O NAL G A U S S I A N D I S T R I BU T I O N

I f R�, . . . , R � are independent, normally 4istributed random variables and we , Rn by Ri = � j 1 aiiR ; + bi , i = 1 , . . . , n, define random variables R1 , the R i have a distribution of considerable importance in many aspects of probability and statistics. In this section we examine the properties of this distribution and make an application to the problem of prediction. Let R = (R 1 , . . . , Rn) be a random vector. The characteristic function of R (or the joint characteristic function of R 1 , . . . , Rn) is defined by •

M(u 1 ,

•

•

, un)

•

=

=

•

•

E [ i( u 1R 1 +

+ u nR n) ] ,

i: 1: exp ( i}: u x ) ·

·

·

·

·

·

1

k

k

u1,

d F ( x1 ,

•

•

•

.

•

.

,

,

u n real

xn)

where F is the distribution function of R. It will be convenient to use a vector n matrix notation. If u = (ur, . . . , un) E E , u will denote the column vector with components u1, . . . , un. Similarly we write x for col (x1, . . . , xn) and A superscript t will indicate the transpose of a R for col (R1 , , R n) matrix Just as in one dimension , it can be shown that the characteri�tic function determines the distribution function uniquely. .

.

.

·

The random vector R = (R 1 , . . . , Rn) is said to be Gaussian (or R1 , . . . , Rn are said to be jointly Gaussian) iff the characteristic function of R is

DEFINITION .

( 8.7. 1)

where b1 , . . . , bn are arbitrary real numbers and K is an arbitrary real symmetric nonnegative definite n by n matrix. (Nonnegative definite means that ��.s=l arKrsas is real and >O for all real numbers ar, . . . , a n. )

280

IN TRODUCTION TO STATISTICS

We must show that there is a random vector with this characteristic function. We shall do this in the proof of the next theorem. Theorem 1. Let R be a random n-vector. R is Gaussian iff R can be n expressed as WR' + b , where b = (b 1 , . , b n) E E , W is an n by n matrix, •

•

and R�, . . . , R� are independent normal random variables with 0 mean. The matrix K of (8. 7. 1) is given by WD Wt , where D = diag (it1 , , An) is a diagonal matrix with entries A.i = Var R;, j = 1 , . . . , n . (To avoid having to treat the case it3 = 0 separately, we agree that normal with expecta tion m and variance 0 will mean degenerate at m.) Furthermore, the matrix W can be taken as orthogonal. •

PROOF . If R

=

WR'

+

E[exp (iutR)]

E [ex p (ivtR'))

= =

Set v

=

Wtu

E

•

then

b,

But

•

=

exp [iutb] E[exp (iut WR')]

[ il exp ( ivkR�)J

fr E [e xp (ivk R�)] = ex p [

k=l

-

!

2

iAkv,.2]

k=l

to obtain E[exp (iut R)]

=

exp [iutb - � u tKu]

where K = WD W t . K is clearly symmetric, and is also nonnegative definite, since u tKu = v t Dv = L�- 1 'Akvk > 0, where v = Wtu. Thus R is Gaussian. (Notice also that if K is symmetric and nonnegative definite, there is an orthogonal matrix W such that W tKW = D , where D is the diagonal matrix of eigenvalues of K. Thus K = WD Wt, so that it is always possible to con struct a Gaussian random vector corresponding to a prescribed K and b.) Conversely, let R have characteristic function exp [iutb - (1/2) (u tKu)] , where K is symmetric and nonnegative definite . Let W be an orthogonal matrix such that WtKW = D = diag (it1 , . . . , An), where the iti are the eigenvalues of K. Let R' = Wt (R - b) . Then

2

where

v

=

Wu

-8 .7

THE M ULTJDIMENSIONAL GAUSSIAN DISTRIBUTION

28 1

It follows that R�, . . . , R� are independent, with R; normal (0, .A1) . Since W is orthogonal , Wt = W-1 ; hence R = WR' + b . The matrix K has probabilistic significance, as follows. In Theorem 1 we have E(R) = h, that is, E(R3) = b;, j = 1 , . . . , n, and K is the covariance matrix of the Ri, that is, Krs = Cov (Rr, R8) , r, s = 1 , . . . , n. Theorem

2.

PROOF. Since the R; have finite second moments, so do the R1 • E(R) = b follows immediately by linearity of the expectation. Now the covariance matrix of the Ri is where E(A) for a matrix A means the matrix [E(Ars)] . Thus the covariance matrix is E[ WR' ( WR')t] = WE(R'R't) Wt = WD Wt = K since D is the covariance matrix of the R ; . The representation of Theorem 1 yields many useful properties of Gaussian vectors. Let R be Gaussian with representation R = WR' + h , W orthogonal, as in Theorem 1 . 1 . If K is nonsingular, then the random variables Rj = Ri - bi are linearly independent; that is, if ! i= 1 aiR;� = 0 with probability 1 , then all a; = 0. In this case R has a density given by Theorem

3.

f(x)

=

(2TT) -

n

f 2 (det K)-1 12 exp

[ - !(x - b)tK-1 (x - b) ]

2. If K is singular, the R: are linearly dependent. If, say, {Ri , . . . , R:} is a maxima/ linearly independent subset of { Ri , . . . , R�}, then ( R1 , , Rr) •

•

.

has a density of the above form , with K replaced by Kr = the first r rows and columns of K. R; +I , . . . , R� can be expressed (with probability 1 ) as linear . . ttons comb tna 0if R* 1 ' . . . ' R*r •

PROOF. 1 . If K is nonsingular, all A.i are f'( y)

=

=

>

0 ; hence R' has density

(2 7T)- nf 2(A.1 . . . An)-1 1 2 exp

[ - 12 ki=1 YkA. ]

1 ( 2 7T)- n f 2( det K )- / 2 exp [ - !Yt v - l y]

2

k

282


The Jacobian of the transformation has density

x

= Wy + b is det W = ± 1 ; hence R

f(x) = f' ( Wt(x - b)) 12 = (27T)- f 2 (d et K) - 1 exp [ - � (x - b)t WD-1 Wt(x - b) ] n

Since K = WD Wt, we have K-1 = WD-1 Wt , which yields the desired ex pression for the density . Now if L i 1 aiRj = 0 with probability 1 , 0=

E[ ±

a1 R f 1 ,

] � =

=

arE(Rr*R.*) a 8 r. 1 n

L arKrs as r,s=1

Since K is nonsingular, it is positive rather than merely nonnegative definite, and thus all ar = 0. 2. If K is singular, then L�, s = 1 a K C!s will be 0 for some a1, . . . , a n , not all 0. (This follows since utKu = L�-1 Akvk2 , where v = Wtu ; if K is singular, then some Ai is 0.) But by the analysis of case 1 , E[I L i 1 a1Rjl 2 ] = 0 ; hence L7 1 aiRj = 0 with probability 1 , proving linear dependence . The re maining statements of 2 follow from 1 . I I

r

rs

REMARK . The result that K is singular iff the R1 are linearly dependent is true for arbitrary random variables with finite second moments, as the above argument shows. �

Example

1.

Let (R1, R 2) be Gaussian. Then K=

where a12 = Var R1 , a 22 = Var R 2 , a1 2 = Cov (R1, R 2). Also , det K = a12a 22 ( 1 - p1 22) , where p1 2 is the correlation coefficient between R1 and Rz. Thus K is singular iff l p1 2 l = 1 . In the nonsingular case we have

where a

=

8.7

exp

283

E(R2) . The characteristic function of (R1 , R2) is [i(au1 + bu2) ] exp [ - � ( 1 2 a (

1

_

a

z

2

...._

If R1 is a Gaussian n-vectqr and Rs by n matrix, then R2 is a Gaussian m-vector. Theorem 4.

=

AR1, where A is an

m

PROOF. Let R1 = WR' + b as in Theorem 1 . Then R2 = A WR' + Ab , and hence R 2 is Gaussian by Theorem 1 . CoROLLARY. (a) If R1 , . . . , R n are jointly Gaussian , so are R1 , . . . , R m , m < n. (b) If R1 , . . . , R n are jointly Gaussian, then a1R1 + · · · + a nR n is a Gaussian random variable . PROOF. For (a) take A = [I 0], where For (b) take A = [a1a2 a n]. •

•

I

is an m by m identity matrix.

•

Thus we see that i f R1 , . . . , R n are jointly Gaussian, then the Ri are (individually) Gaussian. The converse is not true, however. It is possible to find Gaussian random variables R1 , R2 such that (R1 , R2) is not Gaussian, and in addition R1 + R2 is not Gaussian. For example, let R1 be normal (0, 1) and define R2 as follows. Let R3 be independent of R1 , with P{R3 = 0} = P{R3 = 1 } = 1 /2 . If R3 = 0, let R2 = R1 ; if R3 = 1 , let R2 = - R1. Then P{R2 < y} = (1 /2)P{R1 < y} + ( 1 /2) P{ - R1 < y} = P{R1 < y} , so that R2 is normal (0, 1). But if R3 = 0, then R1 + R 2 = 2R1 , and if R3 = 1 , then R1 + R2 = 0 . Therefore P{R1 ·+ R2 = 0} = 1 /2 ; hence R1 + R 2 is not Gaussian. By corollary (b) to Theorem 4, (R1 , R2) is not Gaussian. Notice that if R1 , . . . , R n are independent and each R i is Gaussian, then the R i are jointly Gaussian (with K = the diagonal matrix of variances of the Ri).

if Kii

=

5.

If Rb . . . , Rn are jointly Gaussian and uncorrelated, that is, 0 for i � j, they are independent .

Theorem

284


PROOF. Let ai 2 = Var R i. We may assume all ai 2 > 0 ; if ai 2 = 0, then R1 is constant with probability 1 and may be deleted. Now K = diag (a12 , • • • , a n2) ; hence K-1 = diag ( 1 / a1 2 , • • . , l fan2) , and so, by Theorem 3 , R1, . . . , R n have a joint density given by ! ( X1 ,

•••

'

Xn)

-

_

[

2 1 bi (xi ) 1 � 2 / . . . )- n ( (]1 (Jn) exp - - i£,., 2 =1 a1 2

(2 '7T

]

We now consider the following. predictiOJ;J problem. Let R1 , . . . , R n+ 1 be jointly Gaussian. We observe R1 = x1, . . . , R n = x n and then try to predict the value of R n+1• If the predicted value is 1p(x1 , . . . , xn) and the actual value is xn+ 1 , we assume a quadratic loss (xn+ 1 - 1p (x1 , . . . , xn)) 2 • In other words , we are trying to minimize the mean square difference between the true value and the predicted value of R n+ 1• This is simply a problem of Bayes estimation with quadratic loss function, as considered in Section 8.3 ; in this case R n+ 1 plays the role of the state of nature and (R1 , . . . , R n) the observable. It follows that the best estimate is We now show that in the jointly Gaussian case 1Jl is a linear function of x1, . . . , xn . Thus the optimum predictor assumes a particularly simple form . Say {R1 , . . . , Rr} is a maximal linearly independent subset of {R1, . . . , R n} · If R1 , . . , Rn R n+ 1 are linearly dependent, there is nothing to prove ; if R 1, . . . , Rr , R n+ 1 are linearly independent, we may replace R1 , . . . , R n by R 1 , . . . , Rr in the problem . Thus we may as well assume R1 , . . . , R n+ 1 linearly independent. Then (R1 , . . . , Rn+1) has a density, and the con ditional density of R n+ 1 given R 1 = x1 , . . . , R n = xn is .

where K is the covariance matri x of R1 ,

.

•

•

, R n+1 and Q

=

[qrs l

=

K-1 .

8.7

THE MULTIDIMENSIONAL GAUSSIAN DISTRIBUTION

285

Thus

where

n q n+t .rxr D = D ( xb . . . , x n) = ! r=l Therefore the conditional density can be expressed as

� exp [��] exp [ -c ( xn+l + �JJ

Thus , given R1 = x1 , . . . , R n = x n , R n+l is normal with mean variance 1 /2C = 1 /q n+ t n+t · Hence '

-

D/2C and

Tab l es

Common Density Functions and Their Properties

Uniform on [a, b]

Parameters

Density

Type

1 b -a'

a, b real, a p

Normal

Gamma Beta Exponential ( = gamma with a = 1 , fJ = 1 /.A)

r( a){Ja '

real,

r, s

.Ae-;.x ,

.A > O

x>O

x >O

>0

>0

n = 1 , 2, . . .

t

1 r[ (n + 1 )/2 ] 1 2 2 n 'Tf r (n/2) (1 + X /n) < n+ ) 1

n = 1 , 2, . . .

F

x (m/2) -1 (m/n)m/ 2 x>O {J(m/2 , n/2) (1 + (mfn)x) <m+n) /2 '

nz ,

Cauchy 286

v

8

b

a, fJ > 0

x>O

xr-1 ( 1 - x) s-1 {J(r, s)

Chi-square ( = gamma with a = n/2 , fJ = 2)

a

0

Com�on Density Functions (continued)

Type

Mean

a +b 2

Normal

p

Variance (b - a)2 12 0"2

Gamma

rx{J

rxf12

Uniform on [a, b]

r

Beta

-

n

F

Cauchy

"

A Re s > - A s+A' (2s + 1)-n / 2 , Re s > - 1 /2

2n

not exist if n = 1 n if n > 2 ; n-2 oo if n = 1 or 2 Does not exist

)

(

1

0 if n > 1 ; does

t

a

-).2

;.

Chi-square

1 e-s - e-sb ' all s b -a s all s e-sf.les2a2 /2 , 1 /P Re s > - 1 / {1 , {1 s + 1I

rs (r + s)2 (r +. s + 1 )

r +s 1

Exponential

Generalized Characteristic Function (If Easily Computable)

n if n > 2 ; n -2 oo if n = 2 2n2 (m + n - 2) if m(n - 2)2 (n - 4) n > 4 ; oo if n = 3 or 4 Does not exist

e-Blul ' s = iu ' u real

Common Probability Functions and Their Properties

Type Discrete uniform Bernoulli Binomial

-N1 '

Parameters

Probability p (k)

k = 1 , 2, . . . , N

p (1) p p (O) = q (�)p kqn- k ,

N = 1 , 2, . . .

=

.

O
O Poisson e-;. Akjk ! , k = 0, 1 , . . . Geometric 0
d1. b f2(b L� 1f1 [h b)] l h; b)l, Section 1. l 7/ 1 2 l + l - l C!) l + F ( 1 .5 - F (.5 FR(3) - FR(.5 � - /2 ! F

where

=

) =

)

=

and = and = = = (say) and = ) =

)

where to obtain

n

j= l

and and Thus

. . . .

if E ifj E

if

n

x

)

is interpreted as

) =

=

n

if Differentiate with respect to as desired.

2.5

(a) l (b) (c) (d)

R

)

R

) =

) =

=

is increasing at x1 is decreasing at xi

=

SOLUTIONS TO PROBLEMS

Section 2.6

4. (a) � � ( b) 32 32 (c)

(d) � (e) !

5 + In 4

(f) �

8

(g) 1

In each case, the probability is 1 (shaded area) . x+y=f

x -y =f

(a)

(b)

y = lx

(d)

(e)

(f) PROBLEM 2.6.4

Section 2. 7

1 . (a) .f1 (x) ( b)

/2 (Y)

3/ 8

= =

6x2

-

4x3(0 < x < 1),

2y(O < y < 1)

(g)

297

298


2. /1 (x) = 2e-x - 2e-2x , x > 0 f2 (y) = 2e-2Y, Y > 0 3. f3(z)

1

2

1

1

2

8. (a) P{R 12 + R 2

1 , R 2

4600 n

]

d n

(b) 1 - se-2

1

z

r

< L,

and 0 if (n - 1) d

> 0 ; foo (8) = 217T '

0

< f)

2 7T .

L

300


3 . (a) No. (b). 9

k

4. P{Rl + R2 = k} = L P{Rl = i, R2 = k - i} i =O

[We may select k positions out of + m in (n trn) ways . The number of selections in which exactly i posi ti ons are chosen from the first is ( f) (kmi ) . Sum over i to obtain (a).] Thus P{R1 + R 2 = k} = ( n t m)p Tcqn+m-k , k = 0 , 1 , + m. (Intuitively , R1 + R 2 is the number of successes in n + m Bernoulli trials , with probability of success p on a given trial . ) Now P{R1 = j, R1 + R2 = k } =

n

n

... ,n

P{R1 = j, R 2 = k - j} = (j)p iqn- i (kmi)pk-jqm -k+ i = (j ) (kmi)pkqn+m-k , j = +m 1 , . . . , n , k = j, j + 1 ,

...,n

0,

7�;,�)) ,

the hypergeometric probability Thus P{R1 = j I R1 + R 2 = k} = function (see Problem 7 of Section 1 .5). Intuitively , given that k successes have occurred in + m trials , the positions for the successes may be chosen in ( n t m) ways . The number of such selections in which j successes occur in the first .) . trials is (�) (km 3 -J

n

n

CHAPTER

3

Section 3.2 = =

n+ _ 4 d d e-"' x e-"' 2e-"'d;c i � "' dx + Ia x L + J:2 + 3 + e-1 - e-2 + (e-2 - e-3) + 3 (e-3 - e-4) + +

2

·

·

·

·

·

·

·

·

·


30 1

n+ 1 e-x dx = e-n - e-(n+l) (b) P {R 2 = n} = P {n < R1 < n + 1 } = n E(R2) = 2 nP{R 2 = n} = 2 n [e-n - e-< n+1 > ] n=O n=O co

J

co

e-1 = as above . 1 - e-1

3. (a) 1

4.

(b) 0

(c) 1

1 /3

5. 2 + 30e-3

8. E[R(R - 1)

·

·

·

(R

- r

�

+ 1)] = 2

k=O

k(k - 1 )

·

·

I

1

00

= _Ar 2 k=r (k -

r) !

(k - r k! ·

+ 1)

e-;.;J� �

e-)._Ak-r = _Are-). e). =

xr

= 1 to obtain E(R) = .A ; set r = 2 to obtain E(R 2 - R) = .A2 , hence E(R2) = .A + .A2 • It follows that Var R = .A.

Set

r

Section

3.3

1 . This is immediate from Theorem 2 of Section 2.7.

2. E[(R - n1) n ] = 0, n odd Section

= an (n - 1 ) (n

3.4

- 3)

·

·

·

(5) (3) (1), n even

1 . Let a(R 1 - ER1) + h(R 2 - ER 2) = 0 (with probability 1). If, say, h "#: 0 then we may write R 2 - ER 2 = c(R1 - ER1). Thus a22 = c2 a12 and Cov (R1 , R 2) = ca12 2 ca1 • Therefore p(R1 , R?) = , hence I PI = 1 . I c I a1 2 4 . In (a), let R take the values 1 , 2, . . . , n, each with probability 1 /n, and set R1 = g(R) , R 2 = h(R) , where g(i) = ai , h(i) = hi . Then

� 1 h i2 E(R 22) = £i =1 n -

In (b) let R be uniformly distributed between R 2 = h (R). Then E(R1R 2) =

i

a

and h, and set R1

b g(x)h(x)

a

d x, h -a

In each case the result follows from Theorem 2 of Section 3 . 4 . 5. This follows from the argument of property 7 , Section 3 . 3 .

= g (R),


302

3.5

Section n! (l) i+k (2 )n-j-k k • P R1 - R 2 j ! k ! (n j k) ! 2.9) . 1, j, k 0 , 1 , . . . , n , j + k n 4. (n - 1)p(1 - p). n-1 1 5. A i n - 1 R0 i�= 1 IAi + i + 2 , IA · R02 i = 1IA 2 + 2i""< i I4 .IA ·· IA .IA · 0, IA · E(IAiiA ) P(A i )P(A1) (pq)2 . . n-3 n- 1 E R02) (n - 1)pq + 2 � � (pq) 2 i= 1 i=i+ 2 n-3 (n - 1)pq + 2(pq)2 i�=1 - 2 (n - 1)pq + 2(pq)22 [1 + 2 + + (n - 3)] (n - 1)pq + (pq) (n - 3)(n - 2) . E R02) - [E R0)] 2 (n - 1)pq + (n - 2) (n - 3)(pq)2 R0 n 2). (n - 1)2(pq100)2, q 1 - p 6. 50 (49/50) Section 1. . 532 (b) - 2 . 84 Section 3. 7 1 . P{I R - ml a2 1 . m J.r xe-x dx 1 , E R2) f� x2e-x dx 2, JA-k e-x dx + f� k e-x dx ka P{I R - 11 k . 0 k 1, 1 - 1+k - 1-1 k - d 3

{

-

}

], ·

_

=

J

=

and

J

are

Thus

=

=

(

(n

=

=

- i)

·

·

·

=

Therefore , Var

=

( (assum ing

(

=

>

=

-

3.6

(a)

> = = hence = ( > } If < = } = this is < e < k> + e < > . When k > 1 , it becomes f�r-k e x x = e < + >. Notice that the Chebyshev bound is vacuous when k < 1 , and for k > 1 , e- < 1+ k > approaches zero much more rapidly than 1 /k2• =

=

CHAPTER

Section 1 . T (r)

4.2

Thus =

Now

4

=

f� tr-1e- t d

t

=

(with

t

=

x2)

2 f� x2r-1e-x2 dx.

T(r)T(s) 4 f�f� x2r-1y2s-1e-<x4Y2> dx dy fJ)2 s-1e-P 2 p2 r+2 s-1 dp . 4 f012 J� u p2)� f� ur+s-1e-u du f� p2r+2 s-1e-P 2 d p � T(r + s) T (r)r (s) irr/2 fJ)2r-1 (sin fJ)2s-1 dfJ 2T (r + s) =

dfJ

(in polar coordinates)

Therefore

=

=

0

(set

(cos

=

(cos fJ)2r-1 (sin

=

303


Let z = cos2 () so that 1 - z = sin2 0 , dz = - 2 cos () sin () dO , dz d() = - 1 12 2z (1 - z)1 /2

Thus

3. 5.

T(r)T(s) = 2T(r + s) ---

_

!.

2

lo 1

z r-1 (1 - z) s-1 dz = � {3 (r , s)

3 3 P 1 (e- - e-5) + P 2 (e-4 - e-8) + p3(e- - e-9) + P 4 (1 - e-8) + p 5 (1 - e-5)

0 1

Section 4.3

1 . h(y I x) = ex-y , 0 < x < y, and 0 elsewhere ; P{R 2 < y I R1 = x} = 1 y > x, and 0 elsewhere.

2. J1 (x) = = /2 (y) = =

l l f io

x

-

1

x

-1

-

ero-11 ,

kx dy = kx(x + 1), 0 < x < 1 -kx dy = - kx (x + 1), - 1 < x < 0

kx dx = � k (1 - y2), 0 < y < 1 - kx dx +

1

1

kx dx = �k( l + y2), - 1 < y < 0

Since f�t .f1 (x) dx = f�1 f2 (y) dy = k, we must have k = 1 . The conditional density of R2 given R1 is (x, y) 1 f = , -1 < x < 1, -1 < y < x h2 (y i x) = x + x ) ( 1 t f

The conditional density of R1 given R 2 is ht (x I y) =

J(x, y) 2x = , 0 < y < 1, y < x < 1 2 ) 1 y MY _

2x , - 1 < y < 0, 0 < X < 1 1 + Y2 2x - 1 < y < 0, y < - 1 + y2 '

X

0

>

0, x

>

0,

304 5.


P{g (R1 , RJ < z I R1 = x} = P{g(x , R2) < z I R1 = x} =

Section 4.4

J

{y: g ( x, y) < z}

h (y I x) dy

[ (x, y) 8 xy 2y 0 x + e} + P{Rn < x, R < x + e} e} + P{R < x + e} since Rn < x, R > x + e implies IRn - Rl > e and Rn < x, R < x + e implies R < x + e . Similarly P{R < x - e} = P{R < x - e , Rn > x} + P{R < x - e , Rn � x} < P{Rn - R I > e} + P{Rn < x} Thus F(x - e) - P{IRn - Rl > e} < Fn (x) < P{IRn - Rl > e} + F(x + e) (b) Given c5 > 0, choose e > 0 so small that c5 c5 F(x + e) < F(x) + 2 , F(x - e) > F(x) - "2 . (This is p ossible since F is continuous at x.) For large enough n, P{IRn - Rl > c} < c5/2 since Rn P ) R. By (a), F(x) - c5 < Fn (x) < F(x) + c5 for large enough n. Thus Fn (x) -+ F(x). 5. (a) n > 1 ,690,000 (b) n > 9604 005n IR - .l.nl 7. P{IR - �nl > .005n} = P 1 ! > · 1 1 P{IR*I > .01 v' n} 2 vn 2 -v n oo 2 1 2 = 2P{R* > .01 Vn } = e-t2 1 2 dt < e- . 0001n / 2 V 27T .01 vn V27T . o1 V� 200 .2 e-< 1 1 2 > 1o-4n . For example, if n = 106, this is 1 e-50 v 271' v21rn 8 . . 91

{

_

�

J

}

,_,

_

�


31 1

CHAPTER 6 Section 6.2 1 . f1(x) = .A1 x andf2(x) = .A 2x [orf1(x) = _Ax, f2 (x) = x .Ax in the repeated root case] are linearly independent solutions. For if c1f1 + c 2f2 0 then cl _Alx + c 2_A 2x C 1 _Ax 1 +l + C 2 _Ax2+l If c1 and c 2 are not both 0, then

=

= =

0

0

_Alx _A 2x. =0 _Ax1 +l _Ax2+l

hence

1 1

=

0, a contradiction

.Al .A 2 If f is any solution then for some constants A and C, f(O) /1 (0) c /2 (0) = A + 1) 1) /( ,{2 (1 ) !t (

[ ] [ ] [ ]

since three vectors in a two-dimensional space are linearly dependent. But then /(2)

=

1 d1 f(O) + d2 f( 1), d1 = - , d2 p

d1 [A/1(0) + �{2 (0)] = A/1 (2) + C/2(2) =

+

=

d2 [A/1( 1)

-q p +

C/2( 1)]

Recursively, f(x) = Af1(x) + Cf2(x). Thus all solutions are of the form Af1(x) + Cf2 (x). But we have shown in the text that A and C are uniquely determined by the boundary conditions at 0 and b ; the result follows. 2. By the theorem of total expectation, if R is the duration of the game and A = {win on trial 1} then E(R) = P(A)E(R t A)

Thus D(x)

= p [1 +

D(x

+

1)]

+

q [1

+

+

P (AC)E(R I AC)

D(x - 1)], x

=

1 , 2, . . . , b - 1

since if we win on trial 1 , the game has already lasted for one trial, and the average number of trials remaining after the first is D(x + 1). [Notice that this argument, just as the one leading to (6.2. 1), is intuitive rather than formal.] 3. In standard form, p D (x + 2) - D(x + 1) + qD (x) = -p - q = - 1 , D(O) = D (b) = 0.

312


CASE 1 . p =/= q. The homogeneous equation is the same as (6.2. 1), with solution A + C(q/pY�. To find a particular solution, notice that the "forcing function" - 1 alr�ady satisfies the homogeneous equation, so try D(x) = kx. Then k [p (x + 2) - (x + 1) + qx] = (2p - 1)k = (p - q)k = - 1 . Thus

D(x) = A + C(q/p}c +

X

q -p

.

Set D(O) = D(b) = 0 to solve for A and C. CASE 2. p = q = 1/2. The homogeneous solution is A + Cx. Since polynomials of degree 0 and 1 already satisfy the homogeneous equation, try as a particular solution D(x) = kx2 • Then k [ � (x + 2)2 - (x + 1)2 + � x2] = k = - 1 Thus D(x) = A + Cx - x2 D(O) = A = 0, D(b) = b(C - b) = 0 so that C = b.

Therefore D(x) = x(b - x). If we let b

-+- oo

we obtain if p

oo

D(x) =

> q ; D(x) =

X

if q -p p

0, . . . , S2n- 1 > 0, S2 n 0} is the number of paths from (0, 0) to (2n, 0) lying on or above the axis, times (pq)n . These paths are in one-to-one corre spondence with the paths from ( - 1 , - 1) to (2n, 0) lying above - 1 [connect ( - 1 , - 1) to (0, 0) to establish the correspondence]. Thus the number of paths is the same as the number from (0, 0) to (2n + 1 , 1) lying above 0, namely

=

(2n 1 ) a - b where +

2n + 1

a

a

+ b = 2n + 1 , a - b = 1 , that is, a = n + 1 , b = n

Thus the desired probability is

(2n 1 ) + n +1

(2n) ! u2 n (pq)n (pq)n 2n + 1 n ! (n + 1) ! n+1 1

1

(2n) (pq)11 = (2n) ! (pq)n 5. u2n = n

_

__

n! n!

�

_

_

(4pq)11 1 (2n)2n v'21T2n n ) = = ( pq (nn v'21rn)2 v'n1r v'n1r _

if

-q P -

.1 2


By Problem 2,

1 1 _ ""' h2 n 2n 2n v (n - 1)7T _

U2n- 2

""'

313

1

2 V n3 1 2 ____

7T

Let T be the time required to return to 0. Then P { T = 2n}

=

E( T)

h 2n , n

=

=

00

1 , 2, . . . where 2 h2n ·

00

n=1

2 2nP{ T = 2n}

n =1

=

=

1

00

2 2nh2n

n =1

But 2nh2n ""' K/ vn and � 1/ vn = oo , hence E(T) = oo . 6. The probability that both players will have k heads is [(�)( � )n ] 2 ; sum from k = 0 to n to obtain the desired result. 7. The probability that both players will receive the same number of heads = the probability that the number of heads obtained by player 1 = the number of tails obtained by player 2 (since p = q = 1/2), and this is the probability of being at 0 after 2n steps of a simple random walk, namely (�n)( � )2n . Comparing this expression with the result of Problem 6, we obtain the desired conclusion. (Alternatively, we may use the formula of Section 2.9, Problem 4, with m = k = n.) Section 6.4 1 . (1

_

4pqz2)1 / 2 =

00

2 (�2)( -4pqz2)n

n=O

=

Thus

00

2 ( �2)( -4pq)nz2n

n=O

H(z) =

1

_

(1

_

4pqz2)1 / 2 =

Thus But ( - 1 )n+1

( 1 /2 ) n

=

( - 1 )n+1

00

_

2 (�2)( -4pq)nz2n

n=1

(1/2)( - 1/2)( - 3/2) n!

· · ·

· · ·

[(2n - 3)/2]

(2n - 2) ! (2n - 3) - n 2 n ! 2 4 · · · (2n - 2) 2nn ! 2 2n - 2 1 2n (2n - 2) ! = n = 2 n(n - 1) ! 2n-1 (n - 1) ! n n - 1 2 1 3 5 ·

·

------

( ·

Therefore h211

=

(2/n)(2,;: 12)(pq)n , in agreement with (6.3. 5).

)( )


314

2. If 00

1 4 A(z) = I anzn, then - [A(z) - a0] - 3A(z) = 1 -z z n=O

or

A (z) (z-1 - 3) =

4 1

z

_

Thus

+

a0

�

a0 4z A(z) = + (1 - z)(1 - 3z) 1 - 3z 00

00

2 2 +� = = -2 I zn + (2 + ao) I 3nzn + 1 - z 1 - 3z n=O n=O Thus an = (2 + a0)3n - 2. Notice that (2 + a0)3n is the homogeneous solution, -2 the particular solution. 6. (a)

= k} = P{the first k - 1 trials result in exactly r - 1 successes, and P{N r 1 trial k results in a success} = (� _l)p r- qk-rp , k = r, r + 1 , . . . . Now (�_l) = ('�-i) = (k - 1)(k - 2) (r + 1)rf(k - r) ! = ( -1) k-r ( -r)( -r - 1) ( -r - 2) [ -r - (k - 2 - r)] [ -r - (k - 1 - r)] / (k - r) ! = ( -1)k-r(k �), and the result follows. Note that if j = k - r, this computation shows that ( - 1) i ( -:/) = (i ) r - 1 2 -0 1 (b) We show that T1 and T2 are independent. The argument for T1 , , Tr is similar, but the notation becomes cumbersome. P {T1 = j, T2 = k} = P{R1 = = R1_ 1 = 0, Ri = 1 , Ri+1 = . . . Ri+k-1 = 0, Ri+k = 1 } = p2n1+k-2 k = 1 2 Now P{ T1 = j} = qi-Ip by Problem 5, and P {T = k} = 1�P {T1 = j, T2 = k} = (�/-1) p2qc-1 1 k 2 k p q - 1 - = pq q Hence P {T1 = j, T2 = k} = P { T1 = j}P { T2 = k} and the result follows. · · ·

·

· ·

x

+r - 1 r-1

'

-

'

' · · ·

'

1·

-

'

' · · · ·

•

· · ·

=

'

7

2

1

. ]

'

'

'

.

•

.

•

•


315

(c) E(Nr) = rfp, Var Nr = r [(1 - p)/p2] since Nr = T1 + + Tr and the Ti are independent. The generalized characteristic function of Nr is pe-s r l qe-s l < 1 , 1 - qe Set s = iu to obtain the characteristic function, z = e-s to obtain the gener ating function. 7. P{R = k} = p kq + q1p , k 1 , 2, . . . ; E(R) = pq-1 + qp-1 8 . 1/ v'2

(

_8

·

)

·

·

=

Section 6.5 2. P{an even number of customers arrives in (t, t

+

r] } = I: 0 • • 4• • • • e-;.r ().,-)kfk ! 2 = � I e-;.r ().,-)kfk ! + � I e-;.r ( - A. r)kfk ! k=O k=O = � e-;.r(e;.r + e-;.r) = � (1 + e-2;.r) P{an odd number of customers arrives in (t, t + r] = I: 1 • 3 • 5 • . • • e-;.r ( ).,-)kfk ! = l I e-;.r ().,-)kfk ! - � I e-;. r ( - A. r)kfk ! k=O k=O = � e-;.r (e·h - e-;.r) = � (1 - e-2;.r) ( A. r)k [Alternatively, we may note that I : 0 , 2 , 4 = cosh ). ,- and k! ( A. r) k . oo Ik= 1 . 3 • 5 . . . . Ar. ] h Sin = k! 00

00

00

00

• . . •

3. (a) P{Rt = 1 , Rt+r = 1} = P{Rt = - 1 , Rt+r = - 1} = f(1 + e-2;.r) P{R t = 1 , Rt+ r = - 1 } = P {Rt = - 1 , Rt+r = 1 } f(1 - e-2;.r) (b) K(t, r) = e-2 ;.r =

Section 6.6 4. (a) is immediate from Theorem 1 . (b) If w E A n for infinitely many n, then w E A, which in turn implies that w E A n eventually. Thus lim sup A n A lim inf A n lim sup A n , so all these sets are equal. (c) Let A n = [1 - 1/n, 2 - 1/n] ; lim sup A n = lim inf An = [1 , 2) (Another example. If the A n are disjoint, lim sup A n = lim inf A n = 0 .) (d) n r A k A n U f A k , and Bn = n r A k expands to limn inf An = A , Cn = U r A k contracts to limn sup An A. Thus P(A n) is boxed between P(Bn) and P(Cn), each of which approaches P(A). (e) (limn inf A n)C = c u � 1 n � n A k)C = n � 1 u � nA kC = limn sup AnC by the DeMorgan laws ; (limn sup An)c = limn inf A nc similarly. 6. lim inf A n = {x, y) : x2 + y 2 < 1 } , lim sup An = { (x, y) : x2 + y2 < 1} - {(0, 1 ) , (0, - 1)}. c

n

c

n

c

n

c

c

n

=

3 16 8.

SOLUTIONS TO PROBLEMS =

An)

P(lillln sup

lim n� oo P( U � A1c) by definition of lim sup, hence n

An)

P(limn sup

Now

lim n -+ oo limm -+ oo P( U f n A k)

=

m

-P } -P 1 e since P(A kc) P(A e k k= n m ( k) A P k n = e since I P(An) 0 as n The result follows . If I : 1 P{I Rn - c l } by Theorem 5. Thus for every c R 0, n 0. Then by the second assume that I: 1 P{I Rn - cl } for some Borel-Cantelli lemma, P{ I Rn - c l for infinitely many n} 1. But many·n implies that Rn c, hence P{Rn c} 1, Rn -is. ,cjP for c}infinitely Ithat 0. {Rn 12. Let Sn (R1 Rn)/n; then E(Sn/n)2 (1/n2) Var Sn 1 In Var Rk In -1 n k=1 k n k=1

IT

>

=

e

e

· ·

·

� 2

+

� 3

Hence

>

e

=

>

oo

e

=

e

e

>

=

B

B

M < 2

2

n� - )r1 � n n (notice �k � if k 1 2 (s n E n ) n2 (1 n n) , so that 2 S S n n ) ( h O fl By T eorem Jl E � /;:1

+ . . . +

n

for all n = 1 , 2, . . . . For G(1) f (�> = p(�>, and if G(r) = ptr) for r = 1 , 2, . . . , n, then H

U

=

P{T1

=

1}

=

H

G(n + 1 )

00

=

� P{T1 +

· · ·

n+1 � � P{T1

=

k =1 oo

=

k=1 l=1

+ Tk = n + 1 }

l}P{T2 +

· · ·

+ Tk = n + 1 - /}

n+1 = � � ��> G(n + 1 - 1), if we define G(O) l=1

=

1

n+ 1 = � .f ��>p �£+1 - n by induction hypothesis l=1

= p�£+ 1 >

by the First Entrance Theorem

Now state i is recurrent since �� 1 P{T1 = n} = 1 , and has period d by hy pothesis. Thus by Theorem 2(c), lim n _. G(nd) = d/p. Since a renewal can only take place at times nd, n = 1 , 2, . . . , G(nd) is the probability that a renewal takes place in the interval [nd, (n + 1) d) . If the average length of time between renewals is p , for large n it is reasonable that one renewal should take place every p seconds, hence there should be, on the average, d/ p renewals in a time interval of length d. Thus we expect intuitively that G(nd) � d/ p. 00

4. (a) Let the initial state be i. Then Vu = �: 0 I{Rn = i}' and the result follows. (b) By (a), N = �: 0 Qn so that QN = �: 1 Qn = N - I. (In particular, QN is finite.) (c) By (b), (I - Q)N = I. But N = L: 0 Qn so that QN = NQ, hence N(I - Q) =

I.

Section 7.5

2. (a) j0 is recurrent since pJj > > � for all n > N (see Problem 2, Section 7 .1), hence 0

If i is any recurrent state then since pfj> > � > 0, i leads to j0• By Theorem 0 5 of Section 7.3 , j0 leads to i, so that there can only be one recurrent class . Since p:j� > � > 0 for all n > N, the class is aperiodic, so that liffinp );J0 = 1/ Pio . But then 1/ Pio > � > 0, hence f-lJo. < oo and the class is positive.

320


(Note also that if i is any state and C is the equivalence class of j0, then, for n > N, P{Rn tt C I R0 = i} ::;: 1 - c5, hence P{Rkn tt C I R0 = i} ::;: (1 - c5)k � 0 as k � oo . Thus Ji10 1 , and it follows that a steady state distribution exists.) =

(b) If fiN has a positive column then by (a) there is exactly one recurrent class, which is (positive and) aperiodic, and therefore a steady state distribution exists. Conversely, let {v 1} be a steady state distribution. Pick j0 so that v 10 [ = limn _. p:j�] > 0. Since the chain is finite, p: j: > 0 for all i if n is sufficiently large, say n > N. But then II.L'V has a positive column. 00

3. 1 . If p "#: q, the chain is transient , so p�j> � 0 for all i, j. If p = q the chain is recurrent. We have observed (Problem 5 , Section 6.3) that the mean recurrence time is infinite, hence the chain is recurrent null, and thus p �3� 7! > � 0. In either case there is no stationary distribution, hence no steady state distribution. The period is 2. 2. There is one positive recurrent class, namely {0} ; the remaining states form a transient class. Thus there is a unique stationary distribution, given by v0 = 1 , v1 = 0, j > 1 . Now starting from i > 1 , the probability of eventually reaching 0 is limn -. p � 0 > , since the events { Rn = 0} expand to {0 is reached eventually}. By (6.2.6), if p > q lim P � o > = (q/p)i n __. oo if p < q =1 oo

(Also p��> 1 , p�j > � 0, j > 1). If p > q the limit is not independent of i so there is no steady state distribution. =

3. There are two positive recurrent classes {0} and {b }. {1 , 2, . . . , b - 1} is a transient class. Thus, there are uneountably many stationary distributions, given by v0 = p1, v0 = p 2 , vi = 0, 1 0, p1 + p 2 = 1 . There is no steady state distribution. By (6.2.3) and (6.2.4), (q/p)i - (q/p)b n nr�Pi( o ) - 1 (q/p)b _

=1

l

b

if

if

p

"#:

q

p =q

lim p� � > = 1 - lim P �o > n __. oo n __. oo 1 <j = 0, n __. oo

4. The chain is aperiodic. If p > q then .fi1 = (q/p) i-l < 1 , i > 1 , hence the chain is transient. Therefore p � ; > � 0 as n � oo for all i, j, and there is no stationary


32 1

or steady state distribution. Now if p < q then fi1 = 1 for i > 1 , hence /11 = q + P.fil = 1 , and the chain is recurrent. The equations VII = V become v1q + v 2q v1p + vaq v 2p + v4q

'�

v1 = v2 = va

=

This may be reduced to vi = (p/q)v:i_1 , j = 2, 3 , . . . . If p = q then all vi are equal, hence v :i 0 and there is no stationary or steady state distribution. Thus the chain is recurrent null. If p < q, the condition �; 1 vi = 1 yields the unique solution (q - p) p ;-1 v3 = .i = 1 ' 2 , . . . ' q q =

()

.

Thus there is a unique stationary distribution, so that the chain is recurrent positive; {v i} is also the steady state distribution. 5 . The chain forms a recurrent positive aperiodic class, hence p�j> � vi where the vi form the unique stationary distribution and the steady state distribution. The equations VII V, �:i vi 1 yield (p/q) j-1 V = :j l � (p/q) j-1 i=1 2 6. The chain forms a recurrent positive aperiodic class. Since II has identical rows (p 2 , pq, qp, q2) = V, there is a steady state distribution ( = the unique stationary distribution), namely V. 7. The chain forms a recurrent positive aperiodic class, hence p�j > � v :i where the v :i form the unique stationary distribution and the steady state distribution. The equations VII = V, �1 v1 = 1 yield =

=

V

2 - 31 '

V3 - 24 1 ,

-

V4 - 3 l

-

8 . There is a single positive recurrent class {2, 3 } , which is aperiodic, hence

p�j > � vi where the vi form the unique stationary distribution and the steady state distribution. We find that v 1 = 0, v 2 = 3/ 7 , v3 = 4/7 . 9. We may take Pu = P{ Rn = j} for all i ,j (with initial distribution p 1 = P{Rn = j} also). The chain forms a recurrent class since from any initial state, P{ Rn never = j } = IT: 1 P{Rn = j } = IT: 1 p1 = 0. The class is aperiodic. Clearly v :i = p :i is a stationary distribution, so that the chain is recurrent positive and the stationary distribution is unique and coincides with the steady state distribution.

322 1 0.


The chain forms a positive recurrent class of period 3 (see Sec tion 7.3 , Example 2). Thus there is a unique stationary distribu tion given by v1 -

1 9'

v2 -

2 "'§" ,

-

v3

1 9'

-

v4

2 9'

v5

1 12,

-

v6

1 36,

v7

2 - 9 -

Now the cyclically moving subclasses are C0 = {1 , 2}, c 1 = {3 , 4} , c2 = {5 , 6, 7 }. By Theorem 2(c) of Section 7 . 4 , if i E C r ,J E Cr+a then p�� n+a> � 3vi . Thus

1 2 3 4 5 1 -31 23 0 0 0 1. . _2_

0

3 0 0

1

3

2

nan

rran+1

3

6

7

0

0

0

0

0

0

2

0

0

0

0

0

0

5 0 0 0 0 41

1 12

3

�4

0

3

3

1 2 3

3

0

4

1

2

2

1 12

6

0

0

0

0

7

0

0

0

0

4

1 12

2

1

2

3

4 5

6

7

0

0

o-

0

0

0

3 0 0 0 0 14

1 12

"3

1 2 3 1 2

1

-o

0

3

2

0

0

3

�4

0

5 31

1

3

2

4 T2 32

1

1

0

0

3

0

0

0

0

0

0

0

0

0

0

0

0

2

2 3 0 0 0

1 2 3 4 5 1 -o 0 0 0 1 0

3

0

6 31 32 7 1 2

2

3

0

0

0

4

5 0 0

1 2 3 1 2 3

6

0

0

3

7

0

0

3

1

3

2 3

1 12

4 T2

1

2 3 0 0 0 rran+ 2 � 4 31 32 0 0 0 3 31

6 7

1

23 2 3

0

0

0

0

0

0

0

0

0

0

0

0

0


323

CHAPTER 8 Section 8.2 1.

(a) L(x)

=

= 2 e-x ,

2e-2z/e-z

ex

=

p =

so L(x)

>

A iff x < c

i"e-re dx 1 flO2e-2re dx =

=

�

=

- In A/2. Thus

e-c

e-2c

=

(1 - cx)2

Hence as in Example 1 of the text, SA = {( a , (1 - a)2), 0 < a < 1} and S = { ( a , {J) : 0 < a < 1 , (1 - a)2 < fJ < 1 - a2 }. (b) e-c = 1 - a = .95 , so that c = .05 1 . Thus we reject H0 if x < .05 1 , accept H0 if x > .05 1 . We have fJ = (1 - a)2 = .9025 , which indicates that tests based on a single observation are not very promising here. (c) Set a = fJ = (1 a)2 ; thus a = (3 - V 5)/2 = .38 = 1 - e-c, so that c = .477. 3. (a) X 1 2 3 4 5 6 Po (x) 6l 61 61 6] 61 6l P1 (x) 41 41 � 1.8 81 1.8 L (x) 23 23 ! 43 43 43 -

LRT O A (this never occurs) =

=

0 if L(x) < A

=

1 if L (x)

=

0 if L(x)

=

=

A-and t (x) < fJ0 a1 1n A aqd t(x) > fJ0a1 1n

Since t (x) cannot be > fJ0 in this case, q; ' q;. Again, q; is UMP for fJ fJ0 versus fJ < 00, and the result follows. The power function is (see diagram) 1 1n a < fJ ) 0 fJ0 if q; 1 < (fJ E 8 Q ( Oo rxl /n/ fJ)n a( Oof fJ)n ' Oo rxl /n < f) < f)o 1 - P8 {x : fJ0 a1 1n < t (x) < 00} =

=

=

=

=

=

=

=

=

1 - [( OofO)n - (fJo al ln f o)n ] 1 - (1 - et.)(fJ0/fJ)n , fJ > 00

Q(O) �

a

PROBLEM 8 .2. 6 7. The risk set is { ( rx, (3) : 0 < rx < 1 , (1 - rx)2-n < {3 < 1 - rx2-n}, and the set

of admissible risk points is { ( rx, (1 - rx)2-n) : 0 < a < 1 }. 10. If ( q;) rx < rx0, let q; ' 1 and q;t (1 - t) q; + tq; ' , 0 < t < 1 . Then a.

=

=

rx( q;t)

{3 ( q; t)

=

=

=

(1 - t)rx( q;) + trx( q; ' ) ( 1 - t) {3 ( q;) + t fJ ( q; ' )

326


Since ex ( q;) < ex0 , ex ( q;t) will be < ex0 for some t E (0, 1). But {J ( q;') = 0 and {J( q;) > 0 hence {J ( q;t) < {J ( q;), contradicting the assumption that q; is most powerful at level ex0 • For the counterexample, let R be uniformly distributed between a and b, and let H0 : a = 0, b = 1 , H1 : a = 2, b = 3. Let q;t(x) = 1 if x > t, q;t(x) = 0 otherwise, where 0 < t < 1 . Then {J ( q;t) = 0, ex ( q; t) = 1 - t. For t < 1 , q;t is most powerful at level ex0 = 1 , but is of size < 1 . 1 4. (a) q;n is Bayes with c1 = c2 = 1 , p = 1/2 (hence A = 1), and L(x) = [0 1 (x)/[00 (x) where f0(x) = IJ�t 1 h0(xi) ; the result follows. (b) P9 0{x : gn (x) > 1} = P0 0{gn (R) > 1 } < E00 [gn(R) 1 1 2 ] by Chebyshev ' s inequality n = IJ E0 0 t(Ri) = [E00 t(R1)]n

i=l

_

Section 8.3 (b) 0 = x 1 . (a) 0 = -n / -:£� 1 ln xi 2. 0 = lxl 3. 0 = rjx 4. p ( fJ) = fJ(1 - 0)/n 5. By (8.3.2) with g(fJ) = 1 , 0 < fJ < 1 , we have X ) 'I!'(

=

rJ(ol (xn) x 1 (1 - fJ)n-x dfJ f3(x - {3(x n e (:) 0"' (1 - O) -z dO J o

0

+

,

+ 2, n - x + 1) x + 1 -n +2 + 1 , n - X + 1)

6 . For each �1; , we wish to minimize s� (�)fJX(1 - fJ)n-x [( O - "P(x))2/0(1 - fJ) ] dfJ [see (8.3.1)]. In the same way that we derived (8.3.2), "'·e find that ( 1 0"' (1 - O)n-z-1 dO {3(X + 1 , n - X) X y (x ) = J O -1 {3(x n x) n ( oz-1 (1 - O)n-z 1 dO ' 1

·

The risk function is [(R/n) - fJ ]2 Ptp ( fJ) = E0 fJ ( l fJ)

[

-

oJ

_

]

=

1 n2 fJ( 1

_

_

_

r R fJ) Va 0

=

1 n

=

constan t


327

Section 8.4

Thus in (a), (min Ri , max Ri) is sufficient ; in (b), max Ri is sufficient ; in (c) min Ri is sufficient 2.

hence if fJ1 , fJ 2 are both unknown, err� 1 Ri , L� 1 Ri) is sufficient ; if 01 is known, L� 1 Ri is sufficient ; if fJ 2 is known, rrr 1 Ri is sufficient.

3. [IT� 1 Ri , IT� 1 ( 1 - Ri) ] is sufficient if fJ 1 and fJ2 are unknown ; if fJ1 is known, rrr 1 (1 - Ri) is sufficient, and if fJ2 is known, IT� 1 Ri is sufficient. Section 8.5 1 . (1 - 1/n)T.

2. (a) T has density fp(y) = nyn-1/fJn , 0 < y < fJ (Example 3 , Section 2.8), so E8g(T) = J� g(y)[T (Y) dy = (n/fJn) J� yn-1g(y) dy. If E8g ( T) = 0 for all fJ > 0 then yn-tg(y) = 0, hence g(y) = 0, for all y (except on a set of Lebesgue measure 0). Thus P8 {g(T)

=

0}

=

f

{y: g (y) =O}

[T(y) dy

=

1

(b) If g(T) is an unbiased estimate of y(fJ) then

Eog(T)

=

n on

loayn-lg(y) dy

Assuming g continuous we have fJn y(fJ) 1 !_ n o - -g (fJ) = dfJ n

[

J

or

g(fJ)

=

y( 0)

=

y(fJ) +

f) ' y (fJ) n -

Conversely, if g satisfies this equation then n on-tg(fJ) = d/dfJ [fJny(fJ)] hence n Jg yn-tg(y) dy = fJn y(fJ), assuming on y(fJ) 0 as fJ � 0. Thus a UMVUE of y(fJ) is given by g ( T) = y(T) + (Tfn) y ' (T) . For example, if y(fJ) = fJ then g (T) = T + Tfn = [(n + 1)/n] T; if y(fJ) = 1/fJ then g (T) = 1/ T + (Tfn)( - 1/T2) = (1/T)[1 - (1/n)], assuming that n > 1 . ->-

328


4. We have

P..v. { R1 hence T

=

=

x1 ,

•

•

•

,

Rn

=

xn }

=

1 n IT /{1,2 Nn i=l

. . .

} (xi)Ib,2,

. . .

,N} (max

xi)

max Ri is sufficient. Now PN{ T < k }

=

therefore

(�f

k

1,

=

k

2 , . . . , N;

=

1 , 2, . . . , N

Thus

If E1vg(T) 0 for all N 1 , 2 , . . . , take N 1 to conclude thatg(1) If g(k) 0 for k 1 , . . . , N - 1 , then ENg(T) 0 implies that =

=

=

=

=

=

0.

=

[ Nn - (N - l )n ] g(N) -0 n _

N

henceg(N) 0. By induction, g = 0 and T is complete. To find a UMVUE of y(N) , we must solve the equation =

1\T

I g(k) [kn - (k - l )n ]

k=l

or

=

Nn y(N),

N

=

1 , 2, . . .

Thus N

=

1 , 2, . . .

5. R is clearly sufficient for itself, and Eog(R)

=

1-r g(k) (k -- 11 ) (1 - (j)r ()k-r 00

r

If E8g(R) 0 for all () E [0, 1 ) then I� r g(k)(;�) ()k-r 0, so that g = 0. Thus R is complete. The above expression for E8g(R) shows that for a UMVUE to exist, y( O) must be expandable in a power series. Conversely, let y( O) L i 0 ai () i, 0 < () < 1 . We must find g such that =

=

=

00

( )

k -1 ()k-r g(k) I r - 1 k=r

=

( 1 - O)-r 'y (O)

=

00

I bi () i

i=O

=

00

Ir bi-r() i-r

i=

SOLUTIONS TO PROBLEMS r

Therefore

( - 1) '

i = g( )

For exan1ple, if y( fJ) ( 1 - O)- r y(fJ)

=

fJ k

=

� oo

J =O

i

h1·· - r

=

0 for i

< k'

fJTc then

( ) fJj '

_

( - 1) j

J

and ht·

i = -i g( )

k'

i=O

(

(

j+r - 1

r

_

1

) fJk+i (Problem 6a,

i +r - 1 -k ! r-1 i =k >k (i + rr -11 - k) for i oo

)

Section 6.4)

f)t

.

i = r + k, r + k + 1 ,

. . .

0 otherwise

=

In particular, if k

=

oo

!

=

.

( -1- ) r-1 ( --1)- , r-1 i

. . .

r- 1

=

Thus ht·

i = r, r + 1 ,

329

=

1 then

i = g( )

( 2) r-1 ( - 1) i i

=

z

l

.

-r - 1

'

i >r

+

1

r-1

Thus a UMVUE of fJ = 1 - p is (R - r)/(R - 1) ; a UMVUE ofp = 1 - 0 is 1 - (R - r)/(R - 1) = (r - 1)/(R - 1) (The maximum likelihood estimate of p is rfR, which is biased ; see Problem 3, Section 8.3.)

6. (a) Eo 1fJ(R) Thus

=

e-8

fJk ! k 1 - e-o k l 1fJ ( ) k ! oo

f)k

00

! 1Jl(k) '. k k 1

=

=

e-

1 -

o

e-

8

00

=

! ( - 1)Tc-1 _ k! k=1

The UMVUE is given by 1p (k) = ( - 1) k-1 = - 1 if k is even =

f) k

+ 1 if k is odd

330


(b) Since e-8 is always > 0, the estimate found in part (a) looks rather silly. If 1Jl1 (k) 1 then E8 {[1Jl ' (R) - e-8 ] 2 } < E8 {[1Jl(R) - e-8 ]2 } for all 0, hence 1Jl is inadmissible. =

9.

) 2] ( (n + 1) (n + 1 )2 (n + 1) [ [ E8 Var8 T T - 0 = Var8 T = 2 n n J n fa n y dy E8 T = on - n + 1 () (see Problem 2b), 8 2 n0 + -Eo T2 = -n J yn l dy = o n+ 2 2 n n n 2 [ 2 Varo T = () n + 2 (n + 1)2] - (_n + 1 ):(n + 2) n

and

o

n

Thus

Therefore if n > 1

n

=

0

2 () 2 (n + 1) ] [ E8 T- 0 = n n(n + 2)

() 2 < 3n

1 1 . In the inequality [(a + b)/2] 2 < (a2 + b2) /2, set a = 1p 1 (R) - y(O), b = 1p2 (R) - y(O), to obtain 1Jlt(R) - y( O) 1Jl2 (R) - y (O) 2 .l < 2 [ p'Pt (0) + p'P 2 (0)] = p'Pl (0) + Eo 2 2

[

J

By minimality, we actually have equality. But the left side is

l [ p'P1 (0) + p'P 2 (0) + 2E{ [1Jl1 (R) - y(O)] [tp2 (R) - y(O)]} = � p'P1 (0) + � Cov8 [1Jl1 (R), 1p2 (R)] Thus

Cov8 [1Jl1 (R), 1p2 (R)] = p'P1 (0) = [ p'P1 (0) p'P 2 (0)]1 1 2 = [Var8 VJ1 (R) Var8 1p 2 (R)]1 f 2

We therefore have equality in the Schwarz inequality, and it follows that (with probability 1) one of the two random variables 1p1 (R) - y(O), 1p2 (R) - y(O) is a multiple of the other (Problem 3 , Section 3.4). The multiple is + 1 or - 1 since 1p1 (R) and 1p2 (R) have the same variance. If the multiple is + 1 , we are finished, and if it is - 1 , then ( 1p1 (R) + 1J-' 2 (R))/2 = y(O). The minimum variance is therefore 0, hence 1Jl1:(R) = 1p2 (R) y(O), as desired. =


{ : } Jooo izyo fdx , y) dx dy

Section 8.6 R 1. p R < z =

oo J

33 1

l

1 y ( n/2 ) -le-YI2 zyx<m/2 ) -le-x /2 dx dy = 2 <m+n>f2 r(m/2)r(n/2) 0 0 But s �y x <m l2 >-le-xl2 dx = (with X = uy) s � (uy) <m l2 >-le-uy f2y du. Thus

where

1 x C m/2 ) -1 f00y( m-t-n )J 2-1e-(y/2 ) (l+x ) dy h (x) = < m+n l2 > r(m/2)r(n/2) 2 J0 r [(m + n)/2 ]x <m /2 ) -1 2 <m+n ) /2 - 2 < +n > f2 r(m/2)r(n/2) (1 + x) <m+n > f2 m x <m/2 ) -1 1 X >0 <m+n> l2 x) 2) (1 + {J (m/2, n/

---- ----

---:-

If

'

then = [mn (x) , as desired

2. If R is chi-square with n degrees of freedom then R = R12 + · · · + Rn2 where the R i are independent and normal (0, 1). Thus E(R) = n, and Var R = n Var Ri2 = n [E(Ri4) - [E(Ri2)] 2 ] = n(3 1) = 2n. If R has the t distribution with n degrees of freedom, then E(R) = 0 by symmetry, unless n = 1 , in which case R has a Cauchy density and E(R) does not exist. Now in the integral x2 1 y x = let d 1 + (x2fn) [ 1 + (x2fn) ] 2

If n = 2, the same calculation gives Var R = oo . A similar calculation shows that if R has the F(m, n) distribution then E(R) = nf(n - 2) if n > 2, E(R) = oo if n = 1 or. 2, Var R = [2n2 (m + n - 2)]/ [m(n - 2)2 (n - 4)] if n >' 4 , Var R = oo if n = 3 or 4 .

I ndex

Absolutely continuous random variable, 5 3 Absolutely continuous random vector, 7 2 Actions, set of, 242 Admissible and inadmissible tests, 24 7 Admissible risk points, 248 Alternative, simple and composite, 243 Average value, see Expectation Bayes estimate, 260 with constant risk, 262 with quadratic loss function, 260 Bayes risk, 244, 260 Bayes test, 244 Bayes' theorem, 36, 1 5 0 Bernoulli distribution, see Distribution Bernoulli trials, 28, 3 8, 58, 1 28, 1 5 1 , 1 75 , 177, 1 87, 1 90, 195 , 207, 215 generalized, 29 see also Distribution, binomial Beta distribution, see Distribution Beta function, 1 3 3, 26 1 Binomial distribution, see Distribution Boolean algebra, 3ff Borel-Cantelli lemma, 205 second, 209 Borel measurable function, 8 3 Borel sets, 4 7, 5 0 Bose-Einstein assumption; 2 1 Cauchy distribution, see Distribution Central limit theorem, 1 69ff, 1 7 1 Characteristic function(s) , 15 4ff correspondence, theorem for, 156 properties of, 166ff of a random vector, 2 79 Chebyshev's inequality, 1 26 , 1 27 , 1 29, 206 , 208

Coin tossing, see Bernoulli trials; Distribution, binomial Combinatorial problems, 15ff fallacies in, 39ff multiple counting in, 22 Complement of an event, 4 Conditional density, 1 36, 1 48 Conditional distribution function , 1 3 9 , 1 40, 1 48 Conditional expectation, 1 40ff Conditional probability , 3 3ff, 1 3 0ff Conditional probability function, 98, 142 Confidence coefficient, 27 6 Confidence interval, 276 Confidence set, 278 Continuous random variable, 69 Convergence, almost surely (almost everywhere) , 204-206, 208 , 2 1 0 in distribution, 1 70, 171 , 1 75 , 1 76 in probability, 1 7 1 , 1 75 , 1 76, 205 , 208, 210 Convex function, 262 Convexity of the risk set, 248 Convolution theorem, 1 64 Correlation, 1 1 9ff Correlation coefficient, 120 Covariance, 1 1 9 Covariance function , 20 3 Covariance matrix, 281 Cylinder, 1 80 measurable , 1 80 . Decision function, 24 2, 24 3 nonrandomized, 242 Decision scheme, 1 5 1 DeMorgan laws, 7 , 9 , 1 1 Density function(s) , 5 3 333

INDEX

334

conditional, 1 3 6, 1 48 joint, 70ff, 1 8 1 marginal, 78 Difference equation, 24, 39, 1 82, 1 86, 1 95 characteristic equation of, 1 8 3 Discrete probability space, 1 5 Discrete random variables, 5 1 , 95ff Disjoint events, 5 Distribution, 95 Bernoulli, 25 6, 264, 266 , 269, 272 beta, 260, 268 binomial, 29 , 3 2, 95 , 97-99, 1 1 3 , 1 22, 1 4 1 , 176, 25 6, 25 8, 260, 264, 268 Cauchy, l 6 1 , 1 66 , 264 chi-square, 165 , 27 5 , 276, 278 exponential, 56, 65 , 9 3 , 1 1 0, 1 1 1 , 1 1 3 , 1 29, 1 39, 150- 1 5 2 , 166, 1 68, 196, 200202, 25 6, 26 3, 264 F, 278 gamma, 166, 267 , 268 geometric, 195 , 1 96 hypergeometric, 3 3 , 25 6 multidimensional Gaussian Uoint Gaussian) , 279ff negative binomial, 196, 264, 26 8, 272 normal (Gaussian) , 87, 88, 92, 94, 1 08, 1 1 3, 1 1 8, 1 24- 126, 1 62, 165 , 1 66, 1 7 1 , 1 7 3 - 1 76, 25 2, 256, 267 , 268, 27 1 , 274-276, 278 Poisson, 96-99, 1 1 4, 1 5 2 , 1 6 3 , 169, 1 97 , 1 98, 200, 202, 25 6, 264, 266 , 268, 270, 272 t, 27 7 ' 278 uniform, 54, 7 3 , 76, 84, 92, 9 3 , 1 1 3 , 1 1 8 , 1 4 1 , 1 49, 1 5 0- 1 5 2, 1 65 , 208 , 25 7 , 263, 264, 267 , 27 1 , 272 Distribution function(s) , 5 2 conditional, 1 39, 1 40, 1 48 joint, 72 properties of, 6 6ff Dominated convergence theorem, 23 1 Essentially constant random variable, 85 , 1 15 Estimate, 25 8 Bayes, 260 with constant risk, 262 inadmissible, 272 maximum likelihood, 25 8 minimax, 262 randomized, 25 8, 263

risk function of, 261 unbiased, 268 uniformly minimum variance unbiased (UMVUE) , 269 Estimation, 1 5 2, 242, 243, 25 8ff Event(s) , 2, 1 1 algebra of, 3ff complement of, 4 contracting sequence of, 67 exhaustive, 35 expanding sequence of, 66 impossible, 3, 55 independent, 26, 27 intersection of, 4 mutually exclusive (disjoint) , 5 union of, 4 upper and lower limits of sequence of, 204, 209 sure (certain) , 3 Expectation, lOOff conditional, 1 40ff general def"mition of, 1 0 3 properties of, 1 1 4ff Exponential distribution, see Distribu tion F distribution, see Distribution Factorization theorem, 266 Fatou's lemma, 230 Fermi-Dirac assumption, 20 Fourier series, 1 67 Fourier transform, 15 5

Gambler's ruin problem, 1 82ff, 235 Gamma distribution, see Distribution Gamma function, 1 09 , 1 3 3 Gaussian distribution, see Distribution, normal Generating function, 1 69 , 1 9 1ff moments obtained from, 192 Geometric distribution, see Distribution Hypergeometric distribution, see Distribution Hypothesis, 243ff a priori probability of, 244 composite, 24 3 null, 243 simple, 243 Hypothesis testing, 1 5 1 , 242, 243ff fundamental theorem of, 246 see also Test Independence, 25ff

335

INDEX

Independence of sample mean and variance in normal sampling, 274 lndepdndent events, 26, 27 Independent random variables, 80 Indicators, 1 22ff Intersection of events, 4 Jensen's inequality, 26 2 Joint characteristic function, 279 Joint density function, 70ff, 1 8 1 Joint distribution function, 7 2 Joint probability function, 7 6 , 9 6 , 1 80, 1 8 1 Kolmogorov extension theorem, 1 80 Laplace transform, 1 5 5 properties of; 1 5 6 , 1 5 7 Lattice distribution, 1 69 Law of large numbers, strong, 1 29 , 203, 206 , 207 weak, l 28, 169, 1 7 1 , 207 Lebesgue integral, 1 1 4 Level of a test, 246 Liapounov condition, 1 7 5 Likelihood ratio, 245 test (LRT) , 245 Limit inferior (lower limit) , 204, 209 Limit superior (upper limit) , 204, 209 Linearly dependent random variables, 1 2 1 , 28 1 Loss function (cost function) , 242 quadratic, 260 Marginal densities, 7 8 Markov chain(s) , 21 1ff closed sets of, 224 cyclically moving subclasses of, 227 definition of, 2 1 4 equivalence classes of states of, 2 2 3 first entrance theorem for, 220 initial distribution of, 21 3 limiting probabilities of, 230ff state distribution of, 2 1 4 state space of, 21 3 states of, 220ff aperiodic, periodic, 229 essential, 229 mean recurrence time of, 230 period of, 226- 229 recurrent (persistent) , 221ff recurrent null, 2 3 3

recurrent positive, 2 3 3 transient, 2 2 1 ff stationary distribution for, 236 steady state distribution for, 2 1 5 , 237 stopping time for, 217 strong Markov property of, 219 transition matrix of, 214 n-step, 2 1 4 transition probabilities of, 2 1 4 Maximum likelihood estimate, 25 8 Maxwell-Boltzmann assumption, 20 Median, 1 1 2 Mean, see Expectation Minimax estimate, 262 Minimax test, 25 0 Moment-generating property of character istic functions, 1 67 , 1 68 Moments, 1 07 centr�, 108 joint, 1 1 9 obtained from generating functions, 192 Multinomial probability function, 30, 98 Mutually exclusive events, 5 Negative binomial distribution, see Distribution Negative part of a random variable, 1 04 Neyman-Pearson lemma, 246 Normal distribution, see Distribution Observable, 242 Order statistics, 9 1 Partial fraction expansion, 1 5 9 Poisson distribution, see Distribution Poisson random process, 1 96ff Poker, 1 9 , 2 3 , 40 Positive part of a random variable, 1 04 Power function of a test, 25 3 Power of a test, 246 Probability, !Off a posteriori, 36 classical definition of, 1 , 1 3 , 16 conditional, 3 3ff frequency definition of, 2, 1 3 Probability function, 5 1 conditional, 98, 1 42 joint, 76, 96, 1 80, 1 8 1 Probability measure(s) , 1 2 consistent, 1 80 Probability space, 1 2

INDEX

336

discrete , 1 5

Sample variance , 25 9 , 274 Samples, 1 6ff

Queueing, 2 1 6

ordered, with replacement , 1 6 without replacement , 1 6

Random process, 1 96 Random telegraph signal, 20 3 Random variable(s) , 46ff

unordered, with replacement , 1 8 without replacement , 1 7 Sampling from a normal population, 274

absolutely continuous, 5 3

S chwarz inequality , 1 1 9 , 1 2 1 , 207

central moments of, 1 08

Sigma field, 1 1

characteristic function of, 1 5 4ff

Simple random variable , 1 0 1

classification of, 5 1ff

Size of a test, 246

continuous, 69

Standard deviation , 1 0 8

definition of, 48, 50

States of nature, 24 1

degenerate (essentially constant) , 85 , 1 1 5

Statistic, for a random variable , 265

density function of, 5 3

complete, 269

discrete, 5 1 , 95ff

sufficient , 265

functions of, 5 8ff, 84 , 85ff, 94

Statistical deci�ion model, 24 1

generating function of, 1 9 2ff

Statistics , 24 1ff

independent, 80

Stirling's formula, 4 3 , 1 9 1

infinite sequences of, 1 7 8ff

Stochastic matrix , 2 1 2

linearly dependent , 1 2 1 , 28 1

Stochastic process, 1 96

moments of, 1 07

Stopping times, 2 1 7ff

positive and negative parts of, 1 04

Strong law of large numbers, 1 29 , 20 3 , 206 ,

probability function of, 5 1 simple, 1 0 1

207 Strong Markov property, 2 1 9

Random vector, 7 2 absolutely continuous, 72 Random walk, 1 84 ff

t Distribution ,

see

Distribution

Test, 24 3

combinatorial approach to , 1 86ff

acceptance region of, 278

simple, 1 84

admissible and inadmissible, 24 7

with absorbing barriers, 1 84, 1 85 , 2 1 5 , 228, 240 with no barriers, 1 85 , 1 86 - 1 9 1 , 1 93 1 95 , 2 1 5 , 228, 240 average length of time required to return to 0 in, 1 86 , 1 9 1 , 1 95

Bayes, 244 level of, 246 likelihood ratio (LRT) , 245 minimax , 25 0 power of, 246 power function of, 25 3

distribution of first return to 0 in, 1 8 9

rejection region (critical region) for, 24 3

first passage times in , 1 90

risk set of, 248

probability of eventual return to 0 in,

size of, 246

1 85 with reflecting barriers, 229 , 240 Rao-Blackwell theorem , 26 3 Recurrent (persistent) states of a Markov chain , 221 Reflection principle , 1 8 8 Renewal theorem, 2 3 5

type 1 and type 2 errors of, 24 3 uniformly most powerful (UMP) , 25 3 Total expectation , theorem of, 1 44 , 1 4 9 , 1 5 2 , 153 Total probability , theorem of, 35 , 9 0 , 1 30 , 1 3 2 , 1 34, 1 5 0 , 1 8 2 , 2 1 4 Transient states of a Markov chain , 221

Risk function , 26 1 Risk set, 248

Uniform distribution, see Distribution Uniformly most powerful (UMP) test , 25 3

Sample mean , 2 5 9 , 274

Union of events, 4

Sample space, 2

Unit step function, 1 5 7

337

INDEX

Variance, 108, 1 5 5 - 1 1 8 Venn diagrams, 4

Weak law of large numbers, 1 28 , 1 69 , 1 7 1 ,

207

Basic Probability Theory

Basic probability theory

Basic probability theory with applications