Information-Theoretic Methods for Information-Theoretic Estimating Complicated Probability Distributions
This is volume 207 in MATHEMATICS IN SCIENCE AND ENGINEERING Edited by C.K. Chui, Stanford University A list of recent titles in this series appears at the end of this volume.
Information-Theoretic Methods Information-Theoretic for Estimating Complicated Probability Distributions Zhi Zong DEPARTMENT OF DEPARTMENT OF NAVAL NAVAL ARCHITECTURE ARCHITECTURE DALIAN UNIVERSITY OF TECHNOLOGY CHINA
ELSEVIER Amsterdam –- Boston -– Heidelberg –- London –- New York -– Oxford Paris –- San Diego -– San Francisco –- Singapore -– Sydney -– Tokyo
Elsevier Radarweg 29, PO Box 211, 211, 1000 AE Amsterdam, The Netherlands The Boulevard, Langford Lane, Kidlington, Oxford OX5 0X5 1GB, UK
First edition 2006 Copyright © 2006 Elsevier B.V. All rights reserved
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, mechanical, photocopying, recording or otherwise without the prior written permission of the publisher Permissions may be sought directly from Elsevier’s Elsevier's Science & Technology Rights Department in Oxford, UK: phone (+44) (0) 1865 843830; fax (+44) (0) 1865 853333; email:
[email protected]. Alternatively you can submit your request online by visiting the Elsevier web site at http://elsevier.com/locate/permissions, http://elsevier.com/locate/permissions, and selecting Obtaining permission to use Elsevier material Obtaining permission to use Elsevier material Notice No responsibility is assumed by the publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. Because of rapid advances in the medical sciences, in particular, independent verification of diagnoses and drug dosages should be made verification Library of Congress Cataloging-in-Publication Data A catalog record for this book is available from the Library of Congress
British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library ISBN-13: 978-0-444-52796-7 ISBN-10: 0-444-52796-6 0-444-52796-6 ISSN: 0076-5392 For information on all Elsevier publications visit our website at books.elsevier.com Printed and bound in The Netherlands
10 10 10 9 8 7 6 5 4 3 2 1 06 07 08 09 10
7b my parents To L. Zhou, Xuezhou andXueting
This page intentionally left blank
Preface Mixing up various disciplines frequently produces something that is profound and far-reaching. Cybernetics is such an often-quoted example. Mix of information theory, statistics and computing technology proves to be very useful, which leads to the recent development of information-theory based methods for estimating complicated probability distributions. Estimation of the probability distribution of a random variable is the fundamental task for quite some fields besides statistics, such as reliability, probabilistic risk analysis (PSA), machine learning, pattern reeognization, image processing, neural networks and quality control. Simple distribution forms such as Gaussian, exponential or Weibull distributions are often employed to represent the distributions of the random variables under consideration, as we are taught in universities. In engineering, physical and social science applications, however, the distributions of many random variables or random vectors are so complicated that they do not fit the simple distribution forms at al. Exact estimation of the probability distribution of a random variable is very important. Take stock market prediction for example. Gaussian distribution is often used to model the fluctuations of stock prices. If such fluctuations are not normally distributed, and we use the normal distribution to represent them, how could we expect our prediction of stock market is correct? Another case well exemplifying the necessity of exact estimation of probability distributions is reliability engineering. Failure of exact estimation of the probability distributions under consideration may lead to disastrous designs. There have been constant efforts to find appropriate methods to determine complicated distributions based on random samples, but this topic has not been fully discussed in detail. The present book is intended to fill the gap and documents the latest research in this subject. Determining a complicated distribution is not simply a multiple of the workload we use to determine a simple distribution, but it turns out to be a much harder task. Two important mathematical tools, function approximation and information tiieory, that are beyond traditional mathematical statistics, are often used. Several methods constructed based on the two mathematical tools for distribution estimation are detailed in this book. These methods have been applied by the author for several years to many cases. They are superior in the following senses: (1) No prior information of the distribution form to be determined is necessary. It can be determined automatically from the sample; (2) The sample size may be large or small;
vi
Preface
(3) They are particularly suitable for computers. It is the rapid development of computing technology that makes it possible for fast estimation of complicated distributions. The methods provided herein well demonstrate the significant cross influences between information theory and statistics, and showcase the fallacies of traditional statistics that, however, can be overcome by information theory. The book arose from author's years of research experiences in applying statistical tools to engineering problems. It has been designed in such a manner that it should prove to be useful for several purposes, namely, (1) as a reference book for researchers and other users of statistical methods, whether they are affiliated with government, industry, research institutes or universities, (2) as a reference book for graduate students who are interested in statistics applications in engineering and (3) as a reference book for those who are interested in the interplay of information theory, computing technology and statistics. This text is organized as follows. Chapters 1 to Chapter 3 are introductions to probability and statistics. Chapter 4 briefs approximation theory, emphasizing capabilities and properties of B-spline functions. For those who are familiar with these may skip these four chapters. In Chapter 5, information theory is introduced and entropy estimators are constructed. This chapter is the fundamental theory of the book. From Chapters 6 to 9, four basic methods for estimating distributions are infroduced in detail, aided with numerical examples. The membership functions are key concepts in fuzzy set theory. They are mathematically equivalent to probability density functions. Therefore in Chapter 10, the methods previously introduced are extended to estimation of the membership functions in fuzzy set theory. In Chapter 11, Maximum Entropy Method (MEM) is discussed because MEM cannot be neglected if information theory is mentioned. The influence of MEM in information theory cannot be underestimated. It is nice in this chapter that we will see how B-splines and entropy estimation theory developed in previous chapters have brought new elements into the well-known MEM, making it suited to more complicated problems. In Chapter 12, seven FORTRAN codes for the methods introduced in the preceding chapters are given, and their softcopies are separately attached at the back cover of this book. The organizations and persons the author want to express his gratitude is long. The Natural Science Foundation of China is acknowledged for her financial support under the grant 50579004. Part of the work had been finished during the author's stay in Japan under the sponsorship of Japanese Ministry of Education through collaborating with Professor Y. Fujimoto, Hiroshima University. Thanks go to my colleagues, friends and students for their endurance, helps and supports. Among them are Guo Hai, Dao Lin, Hong Wu, Shuang Hu, L. Sun, Jing Bei, Liu Lei, L. He, Hua Qiang, Zong Yu, Xiao Fang, to name a few. Z. Zong
Acknowledgement 1.
Figure 4.2 is reprinted from Journal of offshore mechanics and arctic engineering, 121, Z. Zong, K. Y. Lam & G. R. Liu, "Probabilistic risk prediction of submarine pipelines subjected to underwater shoek", Copyright (1999) with permission from ASME
2.
Chapter 6 and the figures are reprinted from Structural Safety, 20 (4), Z. Zong & K, Y, Lam, "Estimation of complicated distributions using Bspline functions", Copyright (1998) with permission from Elsevier
3.
Chapter 7 and the figures are reprinted from Structural Safety, 20 (4), Z. Zong & K. Y, Lam, "Estimation of complicated distributions using B-spline functions", Copyright (1998) with permission from Elsevier
4.
Chapter 8 and the figures are reprinted from Structural Safety, 22 (1), Z. Zong & K. Y, Lam, "Bayesian estimation of complicated distributions", Copyright (2000) with permission from Elsevier
5.
Chapter 9 and the figures are reprinted from Structural Safety, 23 (2), Z. Zong & K. Y. Lam, "Bayesian Estimation of 2-dimensional complicated distributions", Copyright (2001) with permission from Elsevier
Contents Preface
v
Acknowledgment
vii
Contents
viii
List of Tables
xiv
List of
figures
xv
1 Randomness and probability
1
1.1 Randomness , 1.1.1 Random phenomena 1.1.2 Sample space and random events 1.2 Probability 1.2.1 Probability defined on events 1.2.2 Conditional probability 1.2.3 Independence 1.3 Random variable 1.3.1 Random variable and distributions 1.3.2 Vector random variables and joint distribution 1.3.3 Conditional distribution , 1.3.4 Expectations 1.3.5 Typical distribution 1.4 Concluding remarks 2 Inference and statistics 2.1 Sampling 2.1.1 Sampling distributions for small samples 2.1.2 Sampling distributions for large samples
vm
.,
2 2 2 4 4 6 8 9 9 12 14 16 19 22 25 26 28 30
Contents
ix
2.1.3 Chebyshev's inequality 2.1.4 The law of large numbers 2.1.5 The central limit theorem 2.2 Estimation 2.2.1 Estimation 2.2.2 Sampling error 2.2.3 Properties of estimators » 2.3 Maximum Likelihood Estimator , 2.3.1 The Maximum Likelihood Method (M-L Mehtod) 2.3.2 The Asymptotic Distribution of the M-L Estimator 2.4 Hypothesis testing 2.4.1 Definitions 2.4.2 Testing procedures... , 2.5 Concluding remarks
31 32 33 34 34 35 37 39 39 40 45 45 47 48
3 Random numbers and their applications..................... 3.1 Simulating random numbers from a uniform distribution 3.2 Quality of random number generators 3.2.1 Randomness test 3.2.2 Uniformity test 3.2.3 Independence test 3.2.4 Visual testing 3.3 Simulating random numbers from specific distributions 3.4 Simulating random numbers for general CDF 3.5 Simulating vector random numbers 3.6 Concluding remarks
49 50 53 54 56 58 58 59 61 64 66
4 Approximation and B-spline functions
67
4.1 Approximation and best approximation 4.2 Polynomial basis 4.3 B-splines 4.3.1 Definitions 4.3.2 B-spline basis sets 4.3.3 Linear independence of B-spline functions 4.3.4 properties of B-splines
,
69 72 77 77 81 82 82
x
Contents 4.4 Two-dimensional B-splines 4.5 Concluding remarks „
,
5 Disorder, entropy and entropy estimation... 5.1 Disorder and entropy 5.1.1 Entropy of finite schemes 5.1.2 Axioms of entropy 5.2 Kullback information and model uncertainty 5.3 Estimation of entropy based on large samples 5.3.1 Asymptotically unbiased estimators of four basic entropies 5.3.2 Asymptotically unbiased estimator of TSE and AIC 5.4 Entropy estimation based on small sample 5.5 Model selection 5.5.1 Model selection based on large samples 5.5.2 Model selection based on small samples 5.6 Concluding remarks
87 87 89 89 92 94 97 105 107 114 118 119 120 126 128
6 Estimation of 1-D complicated distributions based on large samples............................................... 6.1 General problems about pdf approximation 6.2 B-spline approximation of a continuous pdf 6.3 Estimation 6.3.1 Estimation from sample data 6.3.2 Estimation from a histogram 6.4 Model selction 6.5 Numerical examples 6.6 Concluding Remarks Appendix: Non-linear programming problem and the uniqueness of the solution ,
129 130 132 135 135 137 140 144 156 159
7 Estimation of 2-D complicated distributions based on large samples ., 7.1 B-Spline Approximation of &2-Dpdf
163 164
7.2 Estimation 7.2.1 Estimation from sample data
167 167
Contents 7.2.2 Computation acceleration 7.2.3 Estimation from a histogram 7.3 Model selection 7.4 Numerical examples 7.5 Concluding
m
, remarks
169 170 173 174 186
8 Estimation of 1-D complicated distribution based on small samples...
..,..,.,..
.,.,...,.,
8.1 Statistical influence of small sample on estimation 8.2 Construction of smooth Bayesian priors 8.2.1 Analysis of statistical fluctuations 8.2.2 Smooth prior distribution of combination coefficients 8.3 Bayesian estimation of complicated pdf 8.3.1 Bayesian point estimate 8.3.2 Determination of parameter co2 8.3.3 Calculating b and determinant of |FTF| 8.4 Numerical examples 8.5 Application to discrete random distributions 8.6 Concluding remarks 8.6.1 Characterization of the method 8.6.2 Comparison with the methods presented in Chapter 6 8.6.3 Comments on Bayesian approach
, 189 190 192 192 194 198 198 200 203 204 209 210 210 211 211
9 Estimation of 2-D complicated distribution based on small samples 9.1 Statistical influence of small samples on estimation 9.2 Construction of smooth 2-d Bayesian priors 9.2.1 Analysis of statistical fluctuations 9.2.2 Smooth prior distribution of combination coefficients 9.3 Formulation of Bayesian estimation of complicated pdf. 9.3.1. Bayesian point estimate 9.3.2 Determination of parameter a»2 9.4 Householder Transform 9.5 Numerical examples
213 213 216 216 217 , 219 219 221 223 225
xii
Contents 9.6 Application to discrete random distributions 228 9.7 Concluding remarks 229 Appendix: Householder transform , 230 A.I Tridiagonalization of a real symmetric matrix 230 A.2 Finding eigenvalues of a tridiagonal matrix by bisection method 234 A.3 Determing determinant of a matrix by its eigenvalues 235
10 Estimation of the membership function,,...,. 10.1 Introduction.. 10.2 Fuzzy experiment and fuzzy sample 10.2.1 How large is large? 10.2.2 Fuzzy data in physical sciences 10.2.3 B-spline Approximation of the membership functions 10.3 ME analysis 10.4 Numerical Examples 10.5 Concluding Remarks Appendix: Proof of uniqueness of the optimum solution
237 237 242 242 242 244 247 248 253 255
11 Estimation of distribution by use of the maximum entropy method 259 11.1 Maximum entropy , 260 11.2 Formulation of the maximum entropy method 265 11.3 B-spline representation of $ (x)
268
11.4 Optimization solvers
270
11.5 Asymptotically unbiased estimate of At
271
11.6 Model selection 11.7 Numerical Examples 11.8 Concluding Remarks
272 273 279
12 Code specifications 12.1 Plotting B-splines of order 3 12.1.1 Files in directory B-spline 12.1.2 Specification 12.2 Random number generation by ARM
281 281 281 281 282
Contents
12.3
12.4
12.5
12.6
12.7
12.2.1 Files in the directory of random 12.2.2 Specifications Estimating 1-D distribution using B-splines 12.3.1 Files in the directory shhl 12.3.2 Specifications Estimation of 2-D distribution: large sample 12.4.1 Files in the directory shd2 12.4.2 Specifications Estimation ofl-D distribution from a histogram 12.5.1 files in the directory shhl 12.5.2 Specifications Estimation of 2-D distribution from a histogram 12.6.1 Files in the directory shhl 12.6.2 Specifications Estimation of 2-D distribution using RBF , 12.7.1 Files in the directory shr2 12.7.2 Specifications
xiii
..,.,....
282 282 283 283 283 284 284 284 285 285 285 286 286 286 287 287 287
Bibliography
289
Index
295
XIV
List of Tables Table 1.1 The Chi-square distribution Table 2.1 Relationship between Type I and Type II errors Table 3.1 Dependence of multiplier a on sand K
23 45 ,
Table 3.2 Randomness Test for Example 3.1 Table 3.3 Uniformity Testing for Example 3.1 , Table 3.4 Critical values for Chi squared distributions Table 3.5 Comparison of Tables 3.2 and 3.3 Table 3.6 Sampling formulas for some random variables Table 4.1 Six equidistantly spaced points in [-1,1] Table 4.2 Six unequidistantly spaced points in [-1,1] Table 5.1 Observed data pair (JC J 3 J ( )...... Table 5.2 Variance, ME and AIC for the regression polynomials of various degree Table 6.1 The Exponential distribution... Table 6.2 The Normal distribution Table 6.3 The Compound distribution Table 6.4 Estimate by histogram data («,=5Q0) Table 7.1 ME and AIC values Table 7.2 Part of calculated results in the case of RBF used Table 7.3 Wave height and wave period data in the Pacific (winter) Table 7.4 ME and AIC values for wave height-period joint distribution (winter, n s = 19839000) Table Table Table Table
2
8.1 Dependence of MEB on m (ns = 40) 10.1 Results for three fuzzy sets. 10.2 Results for five fuzzy sets..... 10.3 Sample size influences
52 55 57 57 57 61 75 75 122 126 146 148 151 154 176 182 184 185 206 249 251 252
XV
List of figures Figure 1.1 Standard deviation Figure 1.2 Skewness Figure 2.1 Graphical illustration of the acceptance and rejection regions. Figure 3.1 Random number distribution on the plane using single precision calculations.The left figure is generated from 200 random numbers and the right figure generated from 2000 random numbers Figure 3.2 Random number distribution on the plane using double precision calculations.The left figure is generated from 200 random numbers and the right figure generated from 2000 random numbers Figure 3.3 Random numbers generated by LCG viewed in three-dimensional space Figure 3,4 Data distribution projected on X-Z plane (left) and on X-Y plane (right).Note the regular pattern in the right figure Figure 3.5 schematic ARM. Figure 3.6 Flow chart of ARM Figure 3.7 Histograms of the exponential distribution Figure 3.8 Histograms for the normal distribution Figure 3.9 The assumed p.d.f. and generated sample Figure 4.1 An example showing the complexity of pdf...., Figure 4.2 5th order polynomial vs. exact function Figure 4.3 5th order polynomial on Chebyshev points Figure 4.4 The first three order B-splines Figure 4.5 B-splines of order 3 (degree 2) , Figure 4.6 Comparison of various interpolation methods for 10 points. In the figure,_unifarm refers to equally spaced points and nonuniform refers to unequally spaced points Figure 4.7 Comaprison of various interpolation methods for 20 points. The B-spline_interpolating curve is almost coincident
19 19 47
with the Runge function.. Figure 5.1 Arrangement of particles in different states of matters
86 90
52
53 58 59 63 63 63 64 65 68 76 76 77 80
86
xvi
List of figures Figure 5.2 Information missing in estimation process Figure 5.3 Naive estimate of entropy HifJ) bys use of equation (5.58) Figure 5,4 Numerical example demonstrating the aymptotically unbiased estimators_and the distance to their theoretical values Figure 5.5 Numerical comparison of TSE and ME
105 106
Figure 5.6 Observed paired data (x^y,)
122
Figure 6.1 Schematic figure of a histogram... Figure 6,2 Fallacy of Likelihood function as a tool for model selection... Figure 6.3 The exponential distribution Figure 6.4 The normal distribution Figure 6.5 Convergence of linear combination coefficients Figure 6,6 Influence of sample size on estimation accuracy Figure 6.7 Estimate from histogram data. Figure 6.8 Entropy estimation Figure 6.9 Strategy of Fisher's statistical inference Figure 7.1 B-spline approximation of 2-D pdf. Figure 7.2 Influence of parameter h, on the form of radial B-splines
138 140 147 149 150 152 154 155 151 164 166
Figure 7.3 2-D histogram composed of MH *NH cells..
171
Figure 7.4 The given distribution Figure 7.5 The generated random sample («S=5GQ). Figure 7.6 ME-based best estimate and AlC-based estimate ns =200 Figure 7.7 The ME-based best estimated pdft for sample «,=400 Figure 7.8 The ME-based best estimated pdfs for sample w/=500 Figure 7.9 Influence of number of B-splines Figure 7.10 Estimation using RBF for two samples Figure 7,11 Estimated joint distribution of wave-height and wave-period Figure 8.1 Influence of statistical fluctuations on the estimation Figure 8.2 Deviations between true and predicted coefficients Figure 8.3 Errors resulting from derivative approximation
175 175 177 178 179 180 183 186 192 193 195
Figure 8.4 Correlation of© 2 and the behavior of vector b Figure 8.5 Bayesian estimation based on 40 sample points Figure 8.6 Relationship between MEB and ) is a realvalued function defined on X = X(w) satisfying {w | X(m) <x}e fj(O) for any real number x.
(1.23)
where p(Q) denotes the power set of Q . Thus, a random variable implies two things: The value of random variable Jf is dependent on the occurrence of one of
10
/. Randomness and probability
the elementary outcome m. Because the latter is random, so is the value of the random variable. But the probability for a certain value of the random variable is deterministic due to the fact that the event" X{to) =x" is an element in Q, and each event in O is pre-assigned a probability P, In summary, a random variable is characterized by "randomness in value but determinism in probability". It should be noted that it is possible to define two or more random variables on one single random space. In the above example, another random variable Y may be defined as "six spots on the two dice". Random variable is the most fundamental concept in probability theory. And the distribution function of a random variable fully describes its behavior. Distribution and Probability density function (pdf in short) are the basic definitions in this book. Suppose X is a random variable on the sample space Q . For any real numbers, the probability of the event X < x is a function ofx Fix) = Pr(jr <x) = Pr(© | X(m) <x).
(1.24)
Then the function F(x) is defined as the distribution of the random variable. Each random variable has a distribution, and different random variables may as well have the same distribution function. The phrase "random variable X has a distribution F(x) " is equivalent to "X is distributed as F(x) " or "X obeys distribution F(x)", denoted by Jf ~ F(x). A distribution function has the following three properties (1) monotonic function: if x, < x,, then F(x,) < F(x 2 ), (2) F(-oo) = lim Fix) = 0, F{«) = lim F(x) = 1,
(1.25a) (1.25b)
(3) Right continuous: F(x + 0) = lim F(x + Ax) = F(x).
(1.25c)
On the other hand, any real-valued function having the above-defined properties is the distribution function of a random variable. If the derivative of F(x) exists, and the differential form dF(x) = f(x)dx
(1.26)
is given, then f(x) is called probability density function (pdf). Because of property (1.25a), we always have
1.3, Random variable f(x)>0, -ca<x<m.
11 (1.27)
In cases where confusions may arise, a subscript is used to denote pdf in the form of fx (x). If two random variables are considered, for example, symbols like fx (x) and fr (y) would be fine to show their difference. The differential dF(x) represents the event {x < X 0, J
J
1
J
(1.36) *
«
. A
H
Take two-dimensional random variable (X, F) for instance, the corresponding distribution function is x,Y^y)
(1.38)
which has the following properties (1) (2)
(3)
F(x,y) is anon-decreasing function of either x or y, For any x and y,
(l-39a)
H-™,y) ~ Km F(x,y) = 0,
(1.39b)
F(x,-oo) = Mm F(x, y) = 0,
(1.39c)
F(-HM, +oo) =
(1.39d)
lim F(x, j ) = 1,
F(x, y) is a right-continuous function either for x or for y. F(x+0,y) = F(*. j;), F(*,y+0) = F(*,y),
(4)
(1.40)
For any two points (x,, yt) and (x2, y2) on the plane, we have
.yJZQ.
(1.41)
Any 2-dimensional function having the above-mentioned properties can be used as a joint distribution function, If F(x,y) is the joint distribution, then the distribution function of X can be obtained from F(x, y) through Fx (x) = Pr(Jf ,y).
(1.43)
Both Fx{x) and Fr(x) are called marginal distribution functions of the joint distribution function or random variables X and Y. In the case of n-dimensional random variable, various marginal distribution functions ean be obtained from (1)
^ jri (x 1 ) = F(* 1 ,oo | ...,«),
(1.44a)
(2)
F j r i (x i ) = F K * i , . - , o o ) ,
(1.44b)
(3)
FXiX2(xl,x1) = F(xi,x1,m,.-,<xi).
(1.44c)
An «-dimensional joint distribution function has n one-dimensional distribution functions, «/2 two-dimensional marginal distribution functions,...., n f«-/J-dimensional marginal distribution functions. 1.3.3 Conditional distribution If X is a random variable, and A is an event with positive probability, then the conditional probability (1.45)
is called conditional distribution Junction of random variable^ about event A, or conditional joint distribution in short. Conditional joint distribution describes the statistical law of the values a random variable assumes given that event A occurs. Here A might be an event on sample space Q . Different A may correspond to different conditional joint distribution. Thus, conditional distribution function is a function of not only x but also event A. If we take A — {a < X £ b}, then the above conditional distribution function may be rewritten as
0 Ft(x\a<X
(1.58)
if the density function is expressed as xl), Example 1.9 The Exponential distribution Consider the exponential distribution defined by
0 From the definition:
x0
r(")
(1.72) x 0, we have (2.9) Proof. Let's prove Chebyshev's inequality for the case of a continuous random variable. If X has density function,/!*)* then
32
2, Inference and statistics
(2.10)
Dividing both sides of this inequality by e2 gives the desired result.
•
A practical application of Chebyshev's inequality bounds the probability that X is within a few standard deviations of it's expected value p . We have Pr(| X—fi j< k) =1 — Prfl x — ^r |> A), where we may think of k as a small integer. Setting a = k in Chebyshev's inequality yields (2.11) For example (with k - 2), at least 3/4 of the density of X is within two standard deviations of p . 2,1.4 The law of large numbers We now return to a random sample Xt,X2,'--,Xn
. Recall that this is a
sequence of random variables that are independent and are identically distributed with mean E(X)-^i and variance var(Jf) = iT2 .Exploiting the fact that — —a1 E(X) - (A and D(X) = — , we can apply Chebyshev's inequality to the sample n
mean, (2.12) Now think about the sample size ns, getting larger and larger. No matter how small e (as long as e > 0 ) the right-most term above approaches zero as n, -* OD . Thus we have proven the law of large numbers. Theorem 2.2 (The law of large numbers) If X is the average of ns independent random variables with expectation p. and variance a*, then for any s
2.1. Sampling
33 (2.13)
Or, in words, one can force all the probability density of X to be arbitrarily close to {j by choosing a large enough sample size nt . Or, to use a mathematical term, the law of large numbers says the sample mean converges to the population mean as the sample size gets larger and larger. Proof. See Baht (1997).
a
2.1.S The central limit theorem Thus far in this chapter we have established several results about the distribution of the sample mean, X, for a random sample of size «, drawn from the distribution of X, We have seen that the expected value of X is fi, the expected value of X itself. And we have seen that the variance of X is a2 tnt, the variance of X divided by the sample size. We have also seen how to compute the entire distribution of X , although this computation becomes overwhelmingly burdensome for all but the smallest sample sizes. We have also seen that, while computing the entire distribution of X for a moderate or large n, is not feasible, in the special case where X has a normal distribution, however, the distribution of if is (exceedingly) easy to compute; we have X ~ N(/J,(T2 / ns). While it is convenient to be able to simply compute the distribution of the sample mean of X when X is normally distributed, this result is of limited usefulness in real-world applications. Real world phenomena rarely correspond to normally distributed random variables. And on those rare occasions where they do, it is because the variable has been artificially constructed. A leading example is standardized test scores. Such scores are normally distributed by construction, as opposed to naturally. Distributions of raw test scores are not normal. But raw test scores are normalized to ensure the standardized test scores are normally distributed. In contrast, we will now study the central limit theorem, a result which is incredibly useful in real-world applications. Indeed the central limit theorem is the most useful theorem in the field of statistics. For a random variable Jf with any probability distribution, consider drawing a random sample of size n, from this distribution; denote this sample as Xl,Xi,---,XK
. The central limit theorem says that, for largen t , the distribution
of the sample mean, — ^ X ( ,
is approximately normal. The larger the sample
34
2. Inference and statistics
size ns, the better the approximation. We have already seen that E(X) = fi and -
_ » . And has a smaller variance than D(X) , since
( X )
n. +1
\
-\ D(X)(
,
(3.6a)
t^-fX
,
(3.6b)
, M
2
4
After simple calculations, we obtain 1 2
D&-WL.JL 72.
1
£(tn=T
.
.
(3-7a)
,
(3.7b) ( 3 - 7c )
3.2, Quality of random number generators
55
a, 4 D(U2) =
(3.7d)
45M.
(3.7e) D(S22), =
1
(3.70
180«,
From the cenfral limit theorem, the statistics (3.8a) (3.8b) (3.8c) are all asymptotically normally distributed as i¥(0,1) . Their estimates are obtained by substituting equation (3.5) into equations (3.6)-~(3J). Take significance level a = 0,05. If the absolute values of the estimates obtained from equation (3.8) are smaller than 1.96, then the different is significant and U is not a uniform random variable on interval [0,1].
Table 3.2 Randomness test for Example 3.1 Single precision
I/,
Double Precision
U2
U3
£/,
I/j
n, =300
1.1201
1.5161
1.7265
0.6509
0.5260
0.4170
n, = 3000
0.4718
0.4365
0.0812
0.1559
0.0234
0.6975
Example 3.2 Randomness Testfor Example 3.! Return to Example 3.1. The estimates of the three parameters defined in equation (3.8) are given in Table 3.2. Two observations from the table are interesting and worthy of cautions. First, results obtained from double-precision calculations are better than those obtained from single precision calculations. Second, Sample size has a positive influence on these estimates as expected.
56
J. Random numbers and their applications
Surprisingly, both single- and double-precision calculations have passed the testing, meaning that one single testing is not enough for filtering out low quality PRNG 3.2.2 Uniformity test Another name for uniformity testing is frequency testing. Its purpose is to test if the frequency curve (frequency histogram) obtained from random numbers (3.5) is significantly different from theoretical frequency curve (theoretical histogram). Divide the interval [0,1] into K equidistance cells. Rearrange ut in the ascending order from small to large. Suppose w^ random numbers fall into thejth cells, that is, »j random numbers in equation (3.5) satisfy
K
(3.9)
K
Let U be uniform random variable on the interval [0,1] .Then the probability for ut to fall in each cell is pt = — , and the quantity
is asymptotically distributed as ^ z distribution of freedom K - 1 . Generally, it is required that sample size «, > 3 0 , and — > 5 . If permitted, ns should be as large K as possible. Given a string of random numbers in the form of equation (3.5), we may obtain %f value from equation (3.10). This value is compared with the so-called critical values available in a chi-square distribution table given in the appendix to Chapter 1. If %f is smaller than %*m or larger than %f%, the string of random numbers should be rejected for being "under-uniform" if %? < Xm& and for being "over-uniform" if X\ > Xi% • The string of random
numbers
or x!% min
(4,7)
4.2 Polynomial basis Choosing q>t = x', we have polynomials as approximants. Weierstrass theorem guarantees this is at least theoretically feasible Theorem 4.3 (Weierstrass approximation theorem) Let f(x) be a continuous function on the interval [a,b]. Then for any £>0, there exists an integer n and a polynomial pn such that (4-8) In fact, if [a,&]=[0,l], the Bernstein polynomial
converges to f(x) as n -» QO , Weierstrass theorems (and in feet their original proofs) postulate existence of some sequence of polynomials converging to a prescribed continuous function uniformly on a bounded closed intervals. When V = L^a; b] for some interval [a; b], and 1 < p < 1 and the norm is the
4.2, Polynomial basis
73
usual /Miorm, it is known that best approximante are unique. This follows from the fact that these norms are strictly convex, i.e. Jxj < r,||y| < r implies |e+y\\ < 2r unless x = y. A detailed treatment of this matter, including proof that these norms are strictly convex can be extracted from the material in PJ. Davis, Interpolation and Approximation, chapter 7. However, polynomial approximants are not efficient in some sense. Take Lagrange interpolation for instance. If xv,x1,---,xn are « distinct numbers at which the values of the function/are given, men the interpolating polynomial p is found from
)£*«
(4-10)
where Lnk(x) is
~
x
i
The error in the approximation is given by -x,).
where £(x) is in the smallest interval containing Introducing the Lebesque function
(4.12)
x,xl,xn.
and a norm ||/|| = max|/(x)t. Then
WNIkl
(4.14)
74
4. Approximation and B-spline function
This estimate is known to be sharp, that is, there exists a function for which the equality holds. Equally spaced points may have bad consequences because it can be shown that
IKI&Ce"'2.
(4.16)
As n increases, function value gets larger and larger, and entirely fails to approximate the function / This situation ean be removed if we have freedom free to choose the interpolation points for the interval [a,b]. Chebyshev points are known to be a good choice. * t = - \ a + b + (a~b)cos^~^-
* 2L
.
(4.17)
n-l J
The maximum value for the associated Lebesque function is it logK+4. |r'i|< — II
It
(4.18)
ft-
Using Chebyshev points, we therefore obtain the following error bounds for polynomial interpolation
4
X1K denotes the linear space of all polynomials of degree n on [a,b]. We may further show that
Thus by using the best interpolation scheme, we can still only hope to reduce the error from interpolation using Chebyshev points by less than a factor -logn+5.
(4.21)
4.2. Polynomial basis
75
Example 4.3 Runge example This is a well-known numerical example studied by Runge when he interpolated data based on a simple fimction of )
(4.22)
on an interval of [-1,1]. For example, take six equidistantly spaced points in [-1, 1 ] and find y at these points as given in Table 4.1. Now through these six points, we can pass a fifth order polynomial /j(x) = 0.56731-1.7308x'+1.2019x4,
-l:Sx»,)tQ (xM,yM) is
x-x,
(4.27)
The algebraic equation for straight line (xM,yi+l)
-x X
i+2
to {xi+1,yM}
's
x-xi+
(4.28)
X
i+\
Comparing the above two equations we define X- -X, X
Bf(x) =
M
X
-xf -X
M
(4.29)
,-X.j
otherwise
Then the whole broken line from (xa,y0) to (xn,yn) can be defined as (4-30) If equation (4.25) is employed, equation (4.29) can be rewritten in the form of (4.31)
4,3. B-splines
79
This relationship can be generalized to higher order B-splines. Replacing order 2 by order k and order 1 by k-\, we obtain (4.32) This is the well-known recursion relationship for B-splines. We may use it to define a B-spline function of higher orders. So an order 3 B-spline is defined by
~X
(4.33)
Substituting equations (4.25) and (4.31) into above equation yields, after lengthy manipulations,
Bf (x) = (x, - x , ) f (X°+> ~X{''HiX°+> ~X)H(x-x,)
(4.34)
where H(x) is Heaviside function defined by
[
if
(4.35,
The index s-i-3, and the function ws(x) is a product of the form
w.Crt = Suppose the knot sequence is equidistance and there are n+\ equidistance knots, c = x0 < xx < • • • < xx = d, which divide the internal [c,d] into n subintervals. For conveniences in later mathematical expressions, three more knots at each end are defined: x_ 3 , x_2, x_t, xH+l, xn+1 and xn+i, It is customary to set x_3 = x_j =x_l=x0 = c and xH - JCB+, = xatt = xn+3 = d. It is clear that n+l knots define N splines. Therefore, N=n+\. With such definitions, we can plotS/(x) in Figure 4.5. Example 4.4 B-splines of order 3 Let A - {0,1,2,3}. Substituting it into the expression above yields
4. Approximation and B-spline Junction
80
0.5
K
n+1
Figure 4.5 B-splines of order 3 (degree 2)
= ~(3-xf - | { 2 - x ) 2
(4.37a) (4.37b)
2 < x < 3 , Bl(*) = -(3-*)*.
(4.37c)
but note (4J7d)
(4.37e) S,3(x)is continuous at x = \ and x = 2 . Its first derivative is continuous at x = 1 and x = 2 .The second derivative is discontinuous at x = 1 and x = 2. In cases without confiision, we also simply use B, (x) by omitting the symbol for order k. From equation (4.32) we may defme order 4 B-splines. The results turn out to be similar to order 3 B-splines in form. In particular,
4.3. B-splines
81
(X H
: ~*W-».) )
(4.38a)
where the index s=/-4, and the function w/x) is a product of the form -*™).
(4.38b)
In most applications, B-splines of order 3 and 4 are employed. For B-splines of higher orders, we do not give their explicit expressions here. 4,3,2 B-spIine basis sets Can we use B-splines for meaningful calculations? Do B-splines form a basis set for V? The Curry-Sehoenberg theorem says yes! Theorem 4,4 Curry-Schoenberg theorem For a given strictly increasing sequence Z = {iit'"t^K*i}> negative
1=1
integer
sequence
v = {vj,-",vA,}
with
am
all
a
^
S^ven
w !
" "
, v,£k,
set
let A = {JC 1 ,---,X B+4 } be any
non-
i»2
decreasing sequence so that (1)
xlixj£---^xk
* ( x ) = l.
83 (4-44)
This can be seen from recursion relations. (3)
(Nannegativity) Bf(x)>0.
(4.45)
Again follows from recursion relation. (4) (Differentiation) (4.46) X
t
X
i*k
X
M
Proof. See der Hart (2000).
•
Note that the same knot set is used for splines of order k and k - 1 .In the (convenient) case that the first and last knot point appear k times in the knot set ,the first and last spline in the set 5,w , are identically zero. (5) (Integration) "t
k +\
(4.47) X
i
(6) Function bounds if xt: £ x < xM and / = ^ a, 5. ,then in{aw_A>--.,a,} S/(*)Smax{o,. +w ,•••,«,}.
(4.48)
(7) B-splines are a relatively well-conditioned basis set. There exists a constant Dk .which depends on k but not on the knot set A,so that for all i (4.49)
84
4. Approximation and B-spiine function In general, Dk » 2*" 3/s , Cancellation effects in the B-spline summation are limited. (8) (Least-square property) If f(x)eC2 [a,b], then there exists a unique set of real constant flj ( i = -k, •••,«-1) solving the minimization problem
(4.50) These coefficients can be obtained from the linear system of equation CA = b where the matrix C is C=
ffi*(*)B*(x)*
(4.51)
.
(4.52)
The right hand side vector b is
Blt(x)(k\ .
(4.53)
And the unknown vector A is A = {a. 4 ,- s a B .,f.
(4.54)
Example 4.5 Comparison of various approximation methods The effectiveness of the above-mentioned approximation methods can be numerically demonstrated using Runge example in Example 4.3. Consider three interpolation methods: (1) Lagrange interpolation on equally spaced points (denoted as uniform polynomial in Figures 4.6 and 4.7); (2) Lagrange interpolation on non-equally spaced points (denoted as nonuniform polynomial in Figures 4.6 and 4.7); and (3) B-splines. They are obtained as follows.
4.3. B-splines
85
Consider Lagrange interpolation given in equations (4,10) and (4,11) of the form If the fonction values f(xk) at n points % are given, then the function value evaluated by the Lagrange interpolant at any point x is
= £/(**)**(*),
(4.55)
where Lk(x) is
»=n
x-x,
(4.56)
Consider B-spline interpolation. At the given n points xk, the interpolant must satisfy (4.57)
4(*») = /(**>.
where af are unknown coefficients to be determined by solving the system of linear equations using Gauss Elimination Method, to say
5,00
'« N
(4.58)
Figure 4.6 shows the results obtained from the three methods based on 10 points. Uniform polynomial behaves very well around the central portion, even better than B-splines. But it yields bad results around the two ends, validating the conclusions obtained by Runge long ago. Nonuinform polynomial behaves better than uniform polynomial around the two ends, but worst among the three around the central portion. This is due to the feet that nonuniform polynomial uses less points in the central portion. Around the central portion, the performance of Bsplines is almost as same as the uniform polynomial, while around the two ends, it is comparable with that of the nonuniform polynomial. Among the three, it is clear that B-splines are the best interpolants. As the number of points is increases to 20, the difference between B-splines and nonuniform polynomial is not significant, but it remains distinguishable. The uniform polynomial behaves badly around the two ends, as shown in Figure 4.7.
86
4. Approximation and B-splim function
In the figure, the curve of B-splines is not identified due to its closeness to the true Runge function and the difficulty to differentiate them. The effectiveness of B-splines can be well demonsttated by this example.
Runge iunction
B-splines
Nonuniform polynomial
0.5
i Uniform polynoniial -0.5
-1
0.5
0
-0.5
1
Figure 4.6 Comparison of various interpolation methods for 10 points. In the figure, uniform refers to equally spaced points and nonuniform refers to unequally spaced points.
V\ \ V
/
Nonuniform polynomial
Uniform / polynomial/
0.5
-1
-0.5
0
0.5
1
Figure 4.7 Comparison of various interpolation methods for 20 points. The B-spline interpolating curve is almost coincident with the Runge function.
4.4. Two-dimentional B-splines
87
4.4 Two-dimensional B-splines Two-dimensional B-splines may refer to B-splines defined on an arbitrary surface in the 3-D space, or on a plane. It would take a lot of space here if we went deep into the details of 2-D B-splines defined on an arbitrary surface. Considering what will be employed in the subsequent chapters, we focus our discussions on 2-D B-splines defined on a plane, the simplest case in 2-D space. This simplifies the presentation a lot. The reader who is interested in more general theory of 2-D B-splines is referred to de Boor (1978). In the following throughout the book, 2-D B-splines refer solely to those defined on a plane. The simplest 2-D B-splines are obtained by direct product of two 1-D Bsplines in the form of BtJ(x,y) = Bi(x)BJ(y).
(4.59)
It is bell-shaped in a 2-D space. A 2-D B-spline can also be obtained by replacing the argument in a 1 Bspline by the radial distance r = ^{x-xjf+(y-yjf centre of i-th B-spline. In notation,
3
2
-(2-Sf, 6
0,
, where (*,,;>,) is the
1<SS2
(4.60)
S>2
where S - r/A f , a, = IS/ltth2. B-splines defined in this way are called Radial B-spline Function, RBF in short. In the equation, h, defines the radius of the circle centered at (xf,y,) inside which Bt(r,hf) does not vanish. 4.5 Concluding remarks Approximation theory has a long history, but it remains active due to the fact feat it is not easy to find a flexible enough yet universal approximation tool for so many cases encountered in real-world applications. Polynomials had been an effective tool for theoretical analysis. It is not very suited to computational purpose due to its over-sensitivity to local changes as the order is high. The most effective way to reduce approximation errors is to decrease the
88
4, Approximation and B-spline function
interval [a,b]. This leads to the imroduetion of B-splines, The B-spline is nearly optimal choice for approximation. B-splines have nice properties suited to approximating complicated functions. The best properties of B-splines are that they are flexible enough to yield satisfactory approximation to a given fimction while maintaining stability. Only fundamental things about B-splines are introduced. One of important developments made in recent years, the so-called nonuniform rational B-splines Surface (NURBS), is not mentioned in this chapter for the apparent reasons. The interested reader is referred to Piegl & Tiller (1997),
Chapter 5
Disorder, entropy and entropy estimation
Entropy is one of most elegant concepts in science. Accompanying each progress in the conceptual development of entropy is a big forward step in science, Entropy was first introduced into science as a thermodynamic concept in 1865 for solving the problem of irreversible process. Defining entropy as a measure of the unavailability of a system's thermal energy for conversion into mechanical work, Clausius phrased the second thermodynamic law by claiming that tiie entropy of an isolated system would never decrease. In 1877, Boltanan gave interpretation of entropy in the framework of statistics. Entropy as a mathematical concept appeared first in Shannon's paper (1948) on information theory. This is a quantum jump, having great impact on modem communication theory. Another important progress for mathematical entropy was made by Kullback (1957) in 1950s. Entropy is thus an elegant tool widely used by both mathematicians and physicists. Entropy is yet one of the most difficult concepts in science. Confusions often arise about its definition and applicability due to its abstract trait This results from the phenomena, known as disorder or uncertainty, described by entropy. In fact, entropy is a measure of disorder. In this chapter, entropy as a mathematical concept will be first elucidated, followed by the discussions on how to construct unbiased estimators of entropies S.1 Disorder and entropy Entropy describes a broad class of phenomena around us, disorder. Its difficult mathematical definition does not prevent us from gaining an intuitive understanding of it. Example 5.1 Matter The unique properties of the three-states of matters (solid, gas, and liquid)
89
90
5. Disorder, entropy and entropy estimation
result from differences in the arrangements and of the particles making up of them. The particles of a solid are strongly attracted to each other, and are arranged in a regularly repeating pattern, or in order, to maintain a GAS solid shape. Gas expands in every direction as there are few bonds sublimation among its particles. Gas is a state of matter without order. Liquid flows because its particles are not held rigidly, but the attraction between the particles is sufficient to give a definite volume. So liquid is a state in between ordered solid and disordered gas.See Figure 5.1, Take water for instance. As freeze temperature drops below 0°C, ail SOLID L1DQID particles suddenly switch to an ordered state called crystal. And they Figure 5.1 Arrangement of particles vibrate around their equilibrium in different states of matters positions with average amplitude As. The ordered state is broken as temperature increases higher above 100°C, water becomes vaporized gas. Gas particles do not have equilibrium positions and they go around without restriction. Their mean free path Ag is much larger than the other two, that is, we have the following inequality temperature increases above 0°C, and particles travel around with larger average free distance A,
kt « At « A .
(5.1)
Disorder does matter in determining matter states. If the amount of disorder is low enough, matter will be in the state of solid; and if the amount of disorder is high enough, matter will be in the state of gas. Example 5.2: Digit disorder Take number as another example. Each number can be expressed as a sequence of digit combination using 0 to 9. For example, - = 0.499999..., - = 0.285714285714285714..., V2 = 1.1415926.....
(5.2)
J. /. Disorder and entropy
91
Except the first two digits, the string for 1/2 exhibits simple pattern by straightforwardly repeating digit 9. The string representing 1/7 is more complicated than that representing 1/2, but it remains ordered in the sense that it is a repetition of the six digits 285714. The third string representing -Jl does not show any order, all digits placed without order. We thus say that the first series is the best ordered and the last worst ordered. In other words, the amount of disorder of the first series is least while the third the largest. From the two examples above, it is plausible that disorder is a phenomenon existing both in nature and in mathematics. More examples on disorder can be cited, some of which are given in the following Example 5.3: Airport disorder An airport is in order with all flights arriving and departing on time. Someday and sometime, a storm may destroy the order, resulting in a state that massy passengers wait in halls due to delayed or even cancelled flights. The passengers might get more and more excited, creating a disordered pattern. Disorder is so abstract that we hardly notice it if not for physicists and mathematicians. Disorder is important for it determines the state of a system. It is desirable to introduce a quantity able to describe the amount of disorder. We will see that we are able to define a quantity, called entropy, for quantitatively defining the amount of disorder. And //(in fact, it is the capital Greek letter for E, the first letter of entropy), is frequently used to denote entropy To find the definition of entropy, we return to Example 5.1. Consider a onedimensional bar of length L. If the bar is filled with ice, there will be approximately Nt=L/A, particles in the bar. If the bar is filled with water, there will be approximately N(= Lf At particles in the bar. And if the bar is filled with gas, there will be approximately Ng = L/Ag particles. Because of equation (5.1), we have N,»Nt»Ng.
(5.3)
Therefore, the number of particles in the bar should be relevant to the amount of disorder in the system, or entropy. In other words, entropy H is a function of the number of particles in the bar, H = H{N). And H{N) should be an increasing function of N. Suppose now that the length of the bar is doubled. The number of particles in the new bar will be doubled, too. And how about entropy? The change in entropy cannot be the number itself iV, because otherwise the change of entropy for ice will be N,, for water JV, and for gas Ng, The length of the bar is doubled, but the increase of entropy is not same for the three cases. This is not acceptable
92
J. Disorder, entropy and entropy estimation
if we hope that entropy is a general concept. An alternative method to view the problem is to define entropy in such a way that if the bar is doubled in length or the number of particles is doubled, the entropy increment is one. If the number of particles is four-fold increased, the entropy increment is 2. Then we have
,..,
(5.4)
Note that H(l) = 0 because there does not exist any disorder for one particle. Solving the above equation, we are led to say that entropy is given This is a heuristic introduction to entropy. In the following sections, we will ignore the particular examples in the above, and tarn to abstract yet rigorous definition of entropy. 5.1.1 Entropy of finite schemes To generalize the entropy definition introduced in the above, we must notice that the number JV used in the above is just an average value, hi more general cases, N is a random variable and should be replaced by probability. Consider a bar filled with AT particles. They are not necessarily arranged in an equidistance way. Suppose, the free distance of the first particle is A,, the free distance of the second particle is Aj,..., and the free distance of the N-th particle is AN. If the bar is filled with particles, all of which have free distance A,, then the entropy would be H{ = log{£/!,) based on the above definition. If the bar is filled with particles, all of which have distance Aj, the entropy would be H2 = logfi/lj) And if the bar is filled with particles all of which have distance A^, the entropy would be Hn = log(£ / AN ) . Suppose now the bar is filled with particles of various free distances. We will have to use the averaged quantity to represent the amount of disorder of the system, that is, ff = ~ ( l o g « 1 + l o g ^ + " - + logn w } n
(5.5)
where n = w, + n2 + • • • + nN is the total number of particles. If 1 /«, is replaced by probability, we obtain the following generalized entropy concept based on probability theory.
5.1, Disorder and entropy
93
A complete system of events Ai,A2,--',An in probability theory means a set of events such that one and only one of them must occur at each trial (e.g., the appearance of 1,2,3,4,5 or 6 points in throwing a die). In the case N-2 we have a simple alternative or pair of mutually exclusive events (e. g. the appearance of heads or tails in tossing a coin). If we are given the events Ah A3, .... An of a complete system, together with their probabilities JJ, , j % ,-••,/?„ {pi 2; 0, ^ p, = 1), then we say that we have a finite scheme (5.6) ft •••
P»)
In the case of a "true" die, designating the appearance of / points by A, (1 s i £ 6 ), we have the finite scheme
P\
Pi
Pi
P«
Pi
P«
From the finite scheme (5.6) we can generate a sequence of the form AjA1A]A%Aft.., .The sequence is an ordered one if Ai,At,---tAll appear in a predictable way; otherwise disordered. Therefore, every finite scheme describes a state of disorder. In the two simple alternatives
0.5 0.5)
^0.99 0.01
the first is much more disordered than the second. If a random experiment is made following the probability distribution of the first, we may obtain a sequence which might look like AlAiA2AlA2AlAzA[.,., It is hard for us to know which will be the next. The second will be different, and the sequence generated from it might look like AlAlAlAiAlA)AfAx... Ms are almost sure that the next letter is 4 with small probability to make mistake. We say that the first has more amount of disorder than the second. We sometimes use uncertainty instead of disorder by saying that the first is much more uncertain than the second. The correspondence of the two words uncertainty and disorder can be demonstrated by Equation (5.6). Disorder is more suitable for describing the state of the sequences generated from finite scheme (5.6) while uncertainty is more suitable for describing the finite scheme
94
J, Disorder, entropy and entropy estimation
itself. Large uncertainty implies that all or some of assigned values of probabilities are close. In the extreme case, all probabilities are mutually equal, being 1/n , A sequence generated from such a scheme would be highly disordered because each event has equal probability of occurrence in the sequence. On the other extreme, if fee probability for one of the events is much higher than the rest, the sequence produced from such scheme will look quite ordered. So the finite scheme is of low uncertainty. Thus, disorder and uncertainty are two words defining the same state of a finite scheme. The scheme
4
(5-9)
"'}
0.3 0.7 J represents an amount of uncertainty intermediate between the previous two. The above examples show that although all finite schemes are random, their amount of uncertainty is in fact not same. It is thus desirable to infroduce a quantity which in a reasonable way measures the amount of uncertainty associated with a given finite scheme. The quantity
can serve as a very suitable measure of the uncertainty of the finite scheme (5.6). The logarithms are taken to an arbitrary but fixed base, and we always take pk logj% = 0 if pk = 0 . The quantity H(pl,p2,---,pH)i8 called the entropy of the finite scheme (5.6), pursuing a physical analogy with Maxwell entropy in thermodynamics. We now convince ourselves that this function actually has a number of properties which we might expect of a reasonable measure of uncertainty of a finite scheme. 5.1.2 Axioms of entropy Aided by the above arguments, entropy can be rigorously introduced through the following theorem. Theorem 5.1 Let H(p1,p1,---,plt)be
ajunction defined for any integer n and n
for all values- P 1 ,/ 7 2»'"»A
suc
^
tnat
ft £0,(& = l,2,---,«), ^pk
= 1 . If for
any n this fimction is continuous with respect to all its arguments, and if it has the following properties (1), (2), and (3),
5. /. Disorder and entropy
95
n
(1) For given n and for ^pk
= 1, the function H (p,, p2, - • •, pH ) takes its
largest value far pk = 1 / n, {k = 1,2, • • •, n), (2)
H(AB)
^H(A)+HA(B),
(3)
H(p,,pi,—,pll,Q) = ff(pt,p1}---,pj. (Adding the impossible event or any number of impossible events to a scheme does not change its entropy.) then we have ^
(5,11)
where c is a positive constant and the quantity Hd{B) = ^ipkHk{E)
is the
k
mathematical expectation of the amount of additional information given by realization of the scheme B after realization of scheme A and reception of the corresponding information. This theorem shows that the expression for the entropy of a finite scheme which we have chosen is the only one possible if we want it to have certain general properties which seem necessary in view of the actual meaning of the concept of entropy (as a measure of uncertainty or as an amount of information). The proof can be found in Khinchin (1957). Consider a continuous random variable distributed as f(x) on an interval [a,b]. Divide the interval into n equidistance subintervals using knot sequence 4\' ii >'"'»C+i • The probability for a point to be in the &-th subinterval is
(5.12)
where Ax = £i+1 -f t is the subinterval length. Substituting it in equation (5.11) yields
(5.13)
96
5. Disorder, entropy and entropy estimation
The second term on the right hand side of the above equation is a constant if we n
n
n
note iSx^jf{^k)=^iMf{§k)=^ipk
=1 and log Ax is a constant. So only the
first term on the right hand side is of interest. As division number becomes large, H -> so, the first term on the right hand side is just the integral H(f,f) = -c\f{x)\ogf{x)dx.
(5.14)
where two arguments are in the expression H(f,f). In the subsequent sections, we will encounter expression H(f,g) indicating that the function after the logarithm symbol in equation (5.14) is g. Equation (5.14) is the definition of entropy for a continuous random variable. The constant c = 1 is often assumed. A question that is often asked is: since the probability distribution already describes the probability characteristics of a random variable, why do we need entropy? Yes, probability distribution describes the probability characteristics of a random variable. But it does not tell which one is more random if two probability distributions are given. Entropy is used for comparing two or more probability distributions, but a probability distribution describes the randomness of one random variable. Suppose that the entropy of random variable X is 0.2 and that of random variable Y is 0.9. Then we know that the second random variable is more random or uncertain than the first. In this sense, entropy assign an uncertainty scale to each random variable. Entropy is indeed a derived quantity from probability distribution, but it has value of its own. This is quite similar to the mean or variance of a random variable. In fact, entropy is the mathematical expectation of -log/(jc), a quantity defined by some authors as information. Example 5.4 Entropy of a random variable Suppose a random variable is normally distributed as f(x)=
.—- exp V2
From definition (5.14) we have
= -jf(x)loBf(x)dx 00
- J : &*"*{_
exp
-.* 7J log5 -7==-exp - v n 7 \\ttc 2«x J * " W 2 ^ f f ^ | 2cr
(5.15)
5.2. Kullhack information and model uncertainty
97
I
(5.16)
The entropy is a monotonic ftinction of variance independent of the mean. Larger variance means larger entropy and viee versa. Therefore, entropy is a generalization of the concept variance, measuring data scatters around the mean. This is reasonable because widely scattered data are more uncertain than narrowly scattered data. 5.2 Kullback information and model uncertainty In reality show The wheel of Fortune a puzzle with a slight hint showing its category is given to three contestants. The puzzle may be a phrase, a famous person's name, an idiom, etc. After spinning the wheel, the first contestant has a chance to guess which letter is in the puzzle. If he/she succeeds, he/she has the second chance to spin the wheel and guess again which letter is in the puzzle. If he/she fails, the second contestant will spin the wheeltocontinue the game, and so on untilfeepuzzle is unraveled finally. We simplify the example a little. The process for solving the puzzle is in feet a process for reducing uncertainty, that is, entropy. At the very beginning, which letter will appear in the puzzle is quite uncertain. The guessing process is one that each contestant assigns a probability distribution to the 26 letters. As the guessing process continues, more and more information has been obtained. And the probability assigned to letters by each contestant gets closer and closer to the answer. Suppose at an intermediate step the probability distribution given by a contestant is
(fU7)
The question is to solve the puzzle, how much more information is needed? In other words, how far away is the contestant from the true answer? Each contestant speaks loud a letter which he/she thinks should be in the puzzle. Whether the letter is in the puzzle or not, we obtain information about the puzzle. And the letters given by the contestants form a sample, the occurrence probability of which is (5.18) «,!«,!.••«„
Its entropy is
98
J. Disorder, entropy and entropy estimation log p(B) = ~y]—lag qk
(5.19)
where the constant term is neglected. From the large number theorem, we conclude that as the sample size n, becomes large, the above entropy becomes 1
n
- lim — log p{B) = - V pk log qk .
(5.20)
Denoting the term on the right hand of the above equation by
We conclude that H(p,q) is a new entropy concept interpreted as follows. Suppose the true probability distribution is given by equation (5.6). We take a sample from the population, and obtain a probability distribution given by equation (5.17). The entropy estimated by equation (5.21) is entropy H(p,q). Therefore, H(p,q)represents the entropy of the scheme p measured by model if. More precisely, the entropy of a random variable is model-dependent. If a model other than the true one is used to evaluate the entropy, the value is given by H(p,q). We note that tog A is the entropy of the finite scheme under consideration. The difference between H(p,q) and H(p,p) represents the amount of information needed for solving the puzzle, that is, l(p,q) = H(p,q)-H(p,p) = f > t log^-.
(5.23)
I(p,q) is defined as Kullhack information. It may also be interpreted as the amount of uncertainty introduced by using model q to evaluate the entropy of p .
5.2, Kullback information and model uncertainty
99
Theorem 5.2 Kullback l(p,q)has the following properties: (1) (2) J(p,q) = Q if and only if
pk=qk.
Proof; Let x>0 and define function / ( x ) = logx—x+l . f{x) takes its maximum value 0 at point x = l , Thus, it holds that / ( J C ) ^ O . That is, log x < x - 1 . The equality is valid only when x = 1. Setting * = qk I p k , we have
Pk
Pk
and
w *
Pk «
\Pk
j
w
w
Multiplying minus one on both sides of the above equation, we obtain
Jftlog-^->0. The equality holds true only when pk =qk.
(5.26) •
The above concepts can be generalized to continuous random variable. Suppose X be a continuous random variable with pdf f(x). The entropy measured by g(x)is
H(f, g) = ~ J / t o log g(x)dx.
(5.2?)
The difference between the true entropy and the entropy measured by g(x) is the Kullback information l(f, g) = H{f, g) - H(f, / ) = f/(*) I o g 4 ^ •
C 5 - 28 )
J. Disorder, entropy and entropy estimation
100
Besides the above interpretation, I(f,g) may also be interpreted in the following way. Suppose a sample is taken from X, which entropy is H(/, / ) . Because the sample is a subset of the population, it cannot contain all information of the population. Some of information of the population must be missing. The amount of information missing is I(f,g) if g(x) represents the pdf fully describing sample distribution. In this sense, I(f,g) represents the gap between the population and a sample. Theorem 5.3 Kullback information / { / , g) for continuous distribution satisfies (1)
(2) I(f, g) = 0 if and only if f = g.
(5.29)
Kullback information is interpreted as the amount of information missing. The first Property indicate that the missing amount of information is always positive, a reasonable conclusion. The second Property imply that an arbitrarily given distribution cannot fully carry the information contained by another distribution unless they are same. Note that Kullback information is not symmetric, that is,
/(/.ir)
(5.30)
Example 5.5 Entropies of two normal random variables Suppose two normal random variables are distributed as 1
/(*) =
-exp
and g(x) =
1
2a
I -exp •jhi;
From definitions (5.27) and (5.28) we have
# H(f,f).
(5.85)
based on the large number theorem. The second term, as mentioned above, will asymptotically approach zero because it is a sum of normal random variables.
116
J. Disorder, entropy and entropy estimation
The third term will approach 2J
dafici
This is in fact Ts in equation (5.57c). It is asymptotically a ehi-square random variable of freedom nf, and its expectation with respect to sample Jf is £A=~-
(5-87)
We have shown in the previous section that the asymptotically unbiased estimator of H(f, / ) is (5.88) Furthermore, the first term on the right hand side of the above equation can be estimated by use of the following estimator H(x | a) = - — ] > > § / ( * , 1 a ) . n, w
(5.89)
Substituting equations (5J7) and (5J9) into (5.84) and taking expectation on both sides, we obtain asymptotically unbiased estimator for log-likelihood function L given by equation (5.82). n A special name is given to this estimator, Akaike Information Criterion, or AIC in short. Historically, it was first obtained by Akaike (Sakamoto, 1993). Here AIC is not the original form, different by a factor of 2 In,. It is beneficial to make a comparison between AIC and ME, If sample is large enough, the log-likelihood function in equation (5.81) is asymptotically K(\a)-»H(f,g)
(5.90)
If g{x) = f(x I a) is used. Referring to Figure 5.2, AIC is an estimator of the entropy contained in the
5.3. Estimation of entropy based on large samples
117
sample X. From definition, ME estimate the uncertainty associated the total statistical process. AIC predicts the entropy only present in the estimation process without considering if the model can recover the true model. Example 5.10 ME and AIC The true and estimated models (pdfs) are, respectively
/(*) =
1
I xexp - :
(5.91) ex
(5.92)
P| ~lh
Theoretical value for TSE and its estimate are respectively (5,93a) (5.93b)
2
Comparison of these two quantities is plotted in Figure 5.5. The two quantities are pleasantly close to eaeh other, numerically validating equation (5.80).
200 Figure 5.5 Numerical comparison of TSE and ME
118
J. Disorder, entropy and entropy estimation
5.4 Entropy estimation based on small sample All estimators obtained in the previous section ha¥e been obtained based on the assumption that the sample size is large. In the case of small samples, we will employ different techniques as outlined in the following. In section 2.3.2, we have seen that sample mean is a random variable, too. If sample size nt is very small, say nt = I, sample mean and random variable are identically distributed. On the other hand, if sample size ns is large, sample mean is asymptotically distributed as a normal random variable. If sample size is not so large to guarantee sample mean be close to the normal distribution, nor so small to enable us compute sample distribution by using the method presented in section 2.2.2,, then we have to develop new methods for estimation. The above mentioned two cases, very small samples and large samples, share one thing in common, that is, sample mean is treated as a random variable. For the case in between the two extremes, it is natural to assume sample mean is a random variable, too. Therefore, in the most general cases, the unknown parameter 9 in f(x 19) is treated as a random variable. In doing so, we change ftom traditional statistics practice determining parameter 0 through sample X into determining distribution of parameter 0 through sample X. In notation, X-+$=>X-+P(0\X)
(5.94)
where P(01X) is the pdf of the parameter 0 to be estimated from the given sample X. This is the basic assumption of Bayesian statistics. In the framework of Bayesian statistics, P{0 \ X) is written in the form of P(0\X) = / W W ) ]P(X 10)P(0)d0
(5J5)
where in Bayesian language: P(X 10) = Y\fx (xt I ^ ) : *s m e sample occurrence probability given & P(0):
is the Prior distribution of 0,
P(X) = JP(X 10)P{0)d0; is the sample occurrence probability, ormarginal probability, P(0\X):
is the posterior distribution of 0.
(5.96)
5. J, Model selection
119
Now consider the problem of how to measure the uncertainty of parameter 0. Such uncertainty comes from two sources, the sample Itself and the uncertainty of 0 after the realization of the sample. The uncertainty associated with this sample is H(X) = - JP(X) log P(X)dX = -Ex log P(X).
(5.97)
In the framework of Bayesian statistics, P(X) is given by equation (5.96). Consider two events: A = 8 and B = 0 J X. Then the uncertainty associated with parameter 0 is described by H(0) = H(X) + Hx(0)
(5.98)
if equation property (2) of theorem 5.1 is employed. The uncertainty associated with event B is defined by Hx(0) = \P{X)H{01 X)dX = ExH(01X)
(5.99)
where H(0 \X) = - JP(01X) log P(0 \X)d& = -Em log P(0 \X).
(5.100)
Equation (5.99) shows that -logP(X) is an unbiased estimator of H(X), and equation (5.100) shows that ~logP(0 \ X) is an unbiased estimator of H{6\ X), Therefore, we obtain the unbiased estimator of entropy of parameter 0, that is, H{0)^-\o%P{X)-\ogP(0\X).
(5.101)
S.5 Model selection Model selection will be specially mentioned in Chapters 6 to 10. We will, however, present general theories about model selection here. Traditional statistics has focused on parameter estimation, implicitly assuming that the statistical model, or the pdf of a random variable, is known. This is in fact not the Case. In real-world applications and in most cases, the model is unknown. Referring to Figure 4.1, we have shown three possible models (lognormal, normal and Weibull) to approximate the unknown pdf under consideration. Purely from the graph, it is hard to make a decision on which is the best fit of the observed data. Therefore, we do not have any convincible
120
5, Disorder, entropy and entropy estimation
support to assume that the statistical model under consideration is known. It thus signifies a shift from traditional assumption that the statistical model under consideration is known with unknown parameters to contemporary belief that the statistical model under consideration is also unknown with unknown parameters. Such a slight change makes a big difference because this problem has not been discussed in traditional statistics. In summary, in modem statistics, bom model and parameters are unknown. How to determine a statistical model is thus the biggest challenge we are faced with if we get rid of the assumption that the model under consideration is known. It is at this point that information theory comes into play. The prevailing solution to the problem is that suppose we are given a group of possible statistical models. We are satisfied if there exist some criterion able to tell us which model is the best among all possible models. So we call this procedure as model selection. By model selection, we change the problem from determining statistical models to selecting models. To be able to select the best model, we thus need to do two things. The first is that the group of possible models should be so flexible and universal that the model to be after is included in the group of possible models. This problem has been touched in Chapter 4, being a procedure for function approximation. The second is that we have a criterion at hand to help us to select the best model from the group of possible models. This problem is the main focus of this section. 5.5.1 Model selection based on large samples Suppose we are given a group of possible models {,/)(* |6[-)} (i-h"^), each of which contains unknown parameter 0t, Draw a sample X from the population, and use the sample to estimate unknown parameters 8, in each m.
model by some method (M-L method, to say), yielding estimate Bt, Suppose me true model is among the group{f,(x| &,)}. Then the true model must minimize TSE. And Theorem 5.10 gives the asymptotically unbiased estimate of TSE, denoted by ME. Combining the two theorems together, we obtain the criterion for model selection. Theorem 5.12 Among all possible models, the true model minimizes ME = H(f,f)
+ J(f,g) = H{X | a ) + J ^ - » m i n (5.102) 2« where §t is the M-L estimate of the unknown parameter 0, based on large samples.
5.5. Model selection
121
An alternative criterion for model selection is AIC. Theorem 5,13 Among all possible models, the true model minimizes AIC AIC = -—V
lag f(x(\&)+^--»min.
n,ti
(5,103)
n,
It should be pointed out there are possibly other criteria for model selection. But to the best knowledge of the author of the book, applying ME and AIC to model selection has been studied most extensively up to now, and thus only these two criteria are introduced here. Theorems 5.12 and 5,13 solve our problem about model selection. They are very useful tools for large sample problems. Example S.11 Model Selection: polynomial regression (Sakamoto et al, 1993) In Table 5.1 is given a group of paired data (xl,yj), which are plotted in figure 5.6. We want to know the relationship between x and y. The general relationship between the pair as shown in Figure 5.6 can be approximated by use of a polynomial (polynomial regression) + amxm
(5.104)
In this example, the polynomial regression takes particular form of yt = a0 + aixl + - + amx™ + *,
(5-105)
where st are independent and standard normal random variables, m is the degree of the regression polynomial. This model is the sum of a polynomial of deterministic variable x( and a random error s,, resulting in a random variable yt. This regression polynomial of degree m is a normal variable y, with ae + a, x, + • • •+a m xf as the mean and an unknown er1 as the variance, that is, its pdf is given by
f(yt
y;-aB~a,x,
amx°')
2a
1
(5.106)
In the following, this regression polynomial of degree m is written as SO MODEL(O) is such a regression polynomial that it is
MODEL(OT).
5. Disorder, entropy and entropy estimation
122
independent of variable x, distributed as N(ag,<x2) . This model has two unknown parameters, att and cr2, MODEL(l) is such a regression polynomial that it is distributed as N(a0 + alxi,ai),
with three unknown parameters a 0 , «,
2
and a . Similarly, MODEL(2) is a parabolic curve. If observed data {xstyt) are given, we may perform regression analysis to find the unknown parameters involved in the model. Detailed procedure is given in the following for easy understanding.
Table 5.1 Observed data pair
i
1 2 3 4 5 6 7 8 9 10 11
y> 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
0.012 0.121 -0.097 -0.061 -0.080 0.037 0.196 0.077 0.343 0.448 0.434
0.6 x Figure 5.6 Observed paired data (x,, yt) 0
0.2
0.4
(1) Likelihood analysis Suppose n sets of data (jCp^), ..., ( J C , , , ^ ) are given. They are to be fitted using MODEL(wi) : yj =
+ e,
The pdf of the model is given in equation (5.103). Then the likelihood function of these n sets of data is
5.5. Model selection
123
(5J07)
The corresponding log-likelihood is then given by
(5-108) (2) Likelihood maximization If the maximum likelihood method is employed, the unknown parameters can be found by maximizing the log-likelihood junction in equation (5.113). This is equivalent to minimizing
M
This is in fact the least-square method. Summarizing the above procedure, we conclude feat the problem of polynomial regression is reduced to the least-square method. To minimize S in equation (5.106), aB,...,am must satisfy the conditions
From these equations, we may find the system of linear equations the M-L estimates a^,..., am should satisfy
124
5. Disorder, entropy and entropy estimation ft
1.x (5-111)
Solving this system of linear equations gives the M-L estimates. One more parameter, a1, remains undetermined. By differentiating the left hand side of equation (5.105) with respect to a3, we may find the equation the M-L estimate «x2 should satisfy 0/
(5.112)
la1 ' la21 Solving this equation yields the condition the variance must meet
m
m
rr
(5.113) «^»i) denotes the variance a1 corresponding to MODEL(»i). For example, MODEL(-1) a simplified notion representing the case mat y, is independent of x, so that y, is normal random variable with zero mean and variance er a , that is, yt ~ #(0,cr 2 ). Using these symbols, the maximum log-likelihood is Ky{,--,yAat,---,am,a1) = ~\og{2n)-^\Ogd{m)~.
(5.114)
(3) Model selection In MODEL(/»), there are m+2 parameters. They are regression coefficients
5.5. Model selection
125
ao,...,am and the variance cr2. Based on equation (5.93b) for calculating ME in the case of the normal distribution, we have after some simple manipulations ME = -lQg(2x)+-lQgd{m)+- + - ! ^ l 2 2 2 2n,
(5.115a)
and AIC is after simple manipulations AIC = - log(2s-}+- log d(m)+~+^ii 2 2 2 re
(5
where the number of free parameters n} =m+2 and sample size n = 11. With the above preparations, we reconsider the data given at the beginning of this example. Straightforward calculations yield |>,=5.5,
2>f=3.85,
| > ? =3.025,
| > ; =2.5333
(=1
i=I
(=1
f=I
2>, =1.430,
J>,fl = 1.244, f/Xfyi
i=I
»=1
= 1.11298, £ j £ = 0.586738
(=1
(=1
ME and AIC values for different models are calculated for different models. For example, the results for MODEL(-1) (zero mean normal distribution) are
M E = i log (2ff)+-+- log 0.0533+- x — = 0.0894 2 M ; 2 2 B 2 II ^/C = -log(2s-) + -+ilog0.0533+—= 0.0439 2 ax f 2 2 11 Continue similar calculations to determine the regression polynomials of various degrees and find the variances corresponding to each regression polynomial. Based on such calculations, both ME and AIC values can be easily evaluated. In Table 5.2, such calculation results are summarized up to degree 5. As the number of free parameters increases, the variance decreases fast if the number of free parameters is smaller man 3. As the number of free parameters is above 4, the variance does not change much. Both ME and AIC are minimum as the second degree regression polynomial is employed.
126
J. Disorder, entropy and entropy estimation Table 5,2 Variance, ME and AIC for the regression polynomials of various degree Degree
Free parameter
Variance
ME
AIC
-1 0 1 2 3 4 5
1 2 3 4 5 6 7
0.05326 0.03635 0.01329 0.00592 0.00514 0.00439 0.00423
0.089 0.035 -0.333 .0.600 -0.535 -0.477 -0.360
0.044 -0.056 -0.469 -0.782 -0.762 -0.750 -0.678
Based on the results given in Table 5.2, MODEL(2) y, = 0.03582-0.49218^ +O.97237xf +e, minimizes both ME and AIC, thus being assessed as the best regression polynomial for the data given in Table 5.1. In feet, the data given in Table 5.1 were generated from the following parabolic equation yt = 0.05 - 0.4x, + OMxf + e-, where e ( are normal random variable with zero mean and variance 0.01. It is somewhat remarkable that both ME and AIC do pick the best model from a group of candidate models. The coefficients in these two equations do not differ much. 5.5.2 Model selection based on small samples Bayesian statistics, which is characterized by treating unknown parameters as random variables, has undergone rapid developments since World War II. It is already one of the most important branch in mathematical statistics. But a longstanding argument about Bayesian statistics is about the choice of prior distribution in Bayesian statistics. Initially, prior distribution used to be selected based on user's preference and experience, more subjective than objective. Note that for the same statistical problems, we may have different estimates of posterior distribution due to different choices of prior distribution. Prior distribution is a priori determined. This is somewhat annoying because such
5.5. Model selection
127
ambiguity in final results prevents applied statisticians, engineers and those interested in applying Bayesian statistics to various fields from accepting the methodology. Situations have changed in recent years since information theory was combined with Bayesian methodology. The basic solution strategy is to change the problem into one for model selection. Although there are some rules helping us to choose prior distribution in Bayesian method, the choice of prior distribution is in general not unique. Suppose we have a group of possible prior distributions selected by different users or by different methods. By using some criterion similar to those for large samples, we are thus able to find the best prior distribution from the possible candidates. Referring to equation (5.84), we note that true model for 6 minimizes H(0) according to theorem 5.4 or equation (5.28). Thus, the best prior should minimize H(0) = - log P(JT)- log P(61X) -> min.
(5.116)
Note that the above equation is equivalent to two minimization procedures as follows -logiJ(Z)-»min,
(5.117a)
- tog P(01 ,¥)-»• min.
(5.117b)
Because H(0) is a linear function of -logP(X)
and -logP((?| J f ) . Minimum
H(ff) is attained if and only if the two terms at the right hand side of equation (5.90) are minimized. We use a special name, Bayesian Measured Entropy (MEB), to denote the -logP(Jf). Then based on equation (5.91a) we have Theorem S.14 7%e best prior distribution must satisfy MEB = -2 log P(X) -> min .
(5.118)
Here the constant 2 in front of the logarithm of the marginal probability appears due to historical reason. It is also called Akaike Bayesian Information Criterion (ABIC) (Akaike, 1989;Akaike, 1978). Equation (5.92) is in fact an integral equation for the unknown prior distribution P(8). Theoretically, the solution exists. So equation (5,92) selects
128
5. Disorder, entropy and entropy estimation
the best priors from candidate models. In equation (5.95), Q is unknown. Using Bayesian rule (5J1), equation (5.95) states that
P(X)
(5.119)
In the equation, only parameter & is unknown. Therefore, the above equation yields point estimate of 8. It plays the role as the M-L method because it degenerates to M-L method because it degenerates to M-L estimation if prior distribution is a uniform one. 5.6 Concluding remarks Entropy and entropy estimation are the basic ideas in this Chapter. Based on them, we have constructed quite some unbiased estimators for a certain of entropies. The most important three estimators are ME and AIC for large samples and MEB for small samples. Model selection had not been touched in traditional statistics. It is one of the focuses in modem statistics, and thus deserves special attentions. Entropy-based methods have been introduced in this Chapter to attack this problem. They will be widely applied in the subsequent chapters.
Chapter 6
Estimation of 1-D complicated distributions based on large samples
A random phenomenon can be fully described by a sample space and random events defined on the sample space. This is, however, not convenient for more complicated cases. Introduction of random variables and their distributions provides a powerful and complete mathematical tool for describing random phenomena. With the aid of random variables and their distributions, sample space and random events are no longer needed. Thus, we focus on random variables and tiieir distributions. The most frequently used distribution form is the normal distribution. Although the normal distribution is well studied, it should be noted that the normal distribution is a rare distribution in real-world applications as we mentioned in the previous chapters. Whenever we apply the normal distribution for a real-world problem, we must a priori introduce a significant assumption. Our concern is thus about how to avoid such a priori assumption and place the distribution form under consideration on a more objective and sound base. Attempts have been made to solve the problem by introducing more distribution forms, Weibull, Gamma etc. We have about handful of such distribution forms. They are specially referred to as special distributions. Special distributions usually have good properties for analysis, but not enough capabilities to describe the characters of the random phenomena encountered in practical applications. Such examples are numerous. Ocean waves are of primary concerns for shipping. Wave height and weave period are random in nature. People began to study the joint distribution of wave height and period at least over one hundred years ago. Millions of sample points have been collected, but their distributions are still under extensive study today in part due to the lack of powerful distribution forms for describing random properties of ocean waves. Complicated distributions are not rare in real-world applications, but special distributions are not enough to describe them.
129
130
6. 1-D estimation based on large samples
Another serious problem is that a priori assumption is arbitrarily used for any practical applications and analyses, People involved in stock market analysis often complained about the wide use of me normal distribution for analyzing price fluctuations. Whenever the normal distribution or any other special distribution is used, it implies that we enforce the phenomenon under consider to vary in accordance with the a priori assumed law. If price fluctuations are not distributed as special distributions we assume, we put ourselves in fact at the hand of God. Any prediction based on such assumption is prone to be misleading rather than profitable investment. In summary, at least two concerns must be addressed before we adopt a specific distribution form for the random variable under consideration. The first is about the distribution form, which capability should be powerful enough to describe the most (we cannot say all) random variables, either simple or complicated. The second concern is about objective determination of distributions. The strategy for solving the first concern is to introduce a family ~X)H(X-X,).
(6.6a)
where H(x) is Heaviside function defined by
»M-{? ' " I The index s=i-3, and the function ws(x) is a product of the form
(6.6b)
134
6, !-D estimation based on large samples
f] »=0
Equidistant B-splines are assumed in the above. Suppose there are equidistance points, c = xa <xl ( e,=l
(6.12b)
136 O/ fe0
6. l-D estimation based on large samples , (i' = l,2..",tf) .
(6.12c)
Equations (6.12a)~(6.12c) define a nonlinear programming problem (NLP). Being a linear function of a, f(x | a) is a continuous function in the space defined by equations (6.5) and (6.12). So is the log-likelihood function L. Theorem 6.2 The problem defined by equation (6.12) has a unique solution. Proof. From Weierstrass" theorem, which states that a continuous function defined on a compact interval must have an extreme, we conclude that the solution to equation (6,12) exists. It is provable, as given in the appendix to this Chapter, that there exists only one extreme point over the entire feasible domain for this nonlinear programming problem, see the appendix. • It is known that the most difficult thing in optimization is that the objective function has multiple extremes. A search scheme is often trapped at local extremes and fails to find the global optimum. In terms of this, the property that the problem defined by equation (6.12) has only one extreme point in the entire feasible domain is really a remarkable property. This makes numerical treatments much easier and no special cautions are needed. Therefore, if a local maximum solution is found to equation (6.12), it must be the global optimum solution because the solution is unique. Very often it is difficult to find a solution to a nonlinear programming problem. Even the problem defined by equation (6.12) has only one extreme point, a code based on a general-purpose method may turn out be computationally inefficient. In most applications of optimization research, the number of unknowns is restricted within several parameters, say 2 to 5. For the problem defined by equation (6.12), however, the number of unknowns is of the range of 10~50. In some cases, the number of unknowns may be over 100. For such optimization problems of large number of unknowns, general-purpose optimization methods are usually not applicable. This is particularly true for 2-D cases as will be discussed in Chapter 7. It is desirable to develop a particular method to find the solution in an efficient way. So in the appendix, an iterative formula is derived, which reads (6-B)
We have q £,(£)/ f(x | a) 51 because the nominator is a term in the nonnegative denominator, we conclude OS a, S l / c , . This is in agreement with equation
6.3. Estimation
137
(6.12b). The iteration foimuk remains valid even if f{x, j a) = 0 . To see this, note that every term in f(x{ j a) is nonnegative. So f(xt | a) = 0 implies that each term in / ( x , |a) must be zero. That is, alBi(x,) = Q . Because a.Bt(x)/f(x
| a) < 1 , a< '
l
must be finite even if it is of the type - .
The suggested initial values are 1
(6,14)
A small number, say 10"4, is prefixed. The iteration starts from the initial value given in (6.14). The iteration continues until the difference between the previous and present values of the combination coefficients is smaller than the prefixed small number. Numerical tests have shown that it takes several to several tens of iterations to reach the optimum. This iteration formula is shown to be very computationally efficient, making the methods presented here feasible as a statistical tool on a PC or a laptop. Equations (6,5) and (6.13) give complete solution to finding the continuous pdf based on a large sample if the number N of B-splines is given. A code based on the method is given in the floppy attached to this book and a brief description of the code is given in Chapter 12. The inputs of the code are the number N of Bsplines, the distribution interval [c,d] and the observed data (sample) ^ ( 1 = 1,2,...,^). The model assumes that the random variable under consideration must be disfributed in a finite interval [e,d\. If a random variable is distributed on an infinitely large interval, the model introduced here is used in the sense of approximation. The method requires input of raw data without treatment. If observed data are treated by some approaches, variants of the above method may be used, as demonstrated in the following section. 6,3,2 Estimation from a histogram More often than not, a pdf is expressed in the form of a histogram. Suppose a histogram is composed of K cells as shown in Figure 6.1. The histogram is formed from n, sample points and there are k, sample points in k-th cell (k=l,2,.,.,K), respectively. The nodes of each cell are denoted by §k and Ijk+i to differentiate from knots of B-splines.
6. I-D estimation based on large samples
138
If the sample points have a distribution defined by f{x), the probability for the event that n* points fall in &-th cell is given by the following multinomial distribution n.!
(6.15)
where qk is the partial probability of f(x), It is assumed again that f(x) is approximated by a linear combination given by equation (6.5). The partial probability qk relates to the combination coefficients a through B,(x)dx
(6.16)
If ,.c,. = l
(6.21b)
af^0,
(6.21c)
(f = l,2,-,JV).
The solution to the above problem must satisfy a. = _L x y5^£*. n
A
,- = 1,2,...,^
(6.22)
*-i ft (a)
This iterative formula can be obtained in the similar way as equation (6.13) and its proof is neglected here. Again we may use this equation as an iterative formula to find the coefficients
6. 1-D estimation based on large samples
140
6.4 Model selection In the previous discussions, N is always assumed fixed. How to specify N, however, remains a problem. If, for example, two different N's are used to approximate the same pdf, we would obtain two models. The question immediately arises: which model is better? Before answering the question, let's consider the following example. Example 6.1 Model selection Assume a true distribution is given by
fexp +0.2x|-7Lexp{~2(;c-7)2}l
xe[0,10].
(6.23)
from which 200 random numbers are generated as the given sample. The following two models are used to approximate g(x): (6.24a) (6.24b)
Given N=7
—
N=50
—
-L(N=7)=359 -L(N=50)=330
8
10
Figure 6.2 Fallacy of Likelihood function as a tool for model selection where order 3 B-splines are used. Based on the procedures introduced in section
6.4. Moeiel selection
141
6.3.1, the unknown parameter a is estimated. The estimated pdfs using the two models are plotted in Figure 6.2. The values of the log-likelihood functions for both cases are, respectively L(N = 7) = -359 and L(N = 50) = -330.
(6.25)
It is clear from Figure 6.2 that the model f,(x) is closer to the true distribution, but_^a(x) is not. Based on the values of the log-likelihood function, however, fm(x) is better than / 7 (x) because the former has larger likelihood value. We have two conclusions from this example: 1) Model selection must be properly handled. It has significant influence on the estimation accuracy. It is not true that the more parameters the better the model is. It seems there exists an optimum number of B-splines; 2) Likelihood function fails as a quantitative evaluation tool for model selection. We need a new tool to serve as a quantitative measure of model selection. Fallacy of likelihood function as a quantitative evaluation tool for model selection results from the fact that M-L estimator is biased. There exist several criteria for model selection, among which are Akaike Information Criterion (AIC) and Measured Entropy (ME) introduced in Chapter 5. Whatever is random is uncertain. The amount of uncertainty is measured in the information theory by entropy. Uncertainty comes from two sources: the uncertainty of the random variable itself and the uncertainty of the statistical model resulting from approximation. The uncertainty of the random variable itself is measured by the entropy of the true model of the form
//(/,/) =-f/log/&.
(6.26)
The uncertainty resulting from model approximation is measured by the divergence between the true and the candidate models
fix)
(6-27)
The best model should minimize the sum of the total uncertainty: Hif, f) + J(f, g) -*• min The asymptotically unbiased estimator of H(/, / ) is
(6.28)
142 H{f,f)
6. I-D estimation based on large samples = ~\f{x \ a)log/(x | &)dx+^
(6.29)
where ns is the number of sample points, «/ is the number of free parameters in the model equaling to N-l in light of the equality constraint in equation (6.10), and a is the maximum likelihood estimate of a. The asymptotically unbiased estimator of«/(/, g) is (6-30) And thus, the asymptotically unbiased estimator of equation (6.28) is Measured Entropy = ME = - J / ( x | a ) l o g / ( x | a ) & + - ^ 2«s
(6.31)
In chapter 5, as an asymptotical approximant to likelihood function, Akaike Information Criterion (AIC) is estimated by
i Note that the coefficients in front of the last terms in the two equations above are different because they are obtained on different bases. Aided with above-mentioned criteria, the best estimate of pdf (the optimum N) can be found through the following procedures: Suppose a is the maximum likelihood estimate of A for given N. Find N so that ME(N) = - f/(* | a) tog f{x | k)dx +3 x t # " ^ -> mia
(6.33a)
Or if AIC is used, Suppose a is the maximum likelihood estimate of a for given N. Find N so that AIC = - — X l o S fix, | a ) + — -> min
(6.33b)
6.4. Model selection
143
The above procedures are also an optimization process. Given JV,, find the maximum estimate of a through equation (6.13), and compute corresponding ME (Nt). Give another Nt > Nt and find corresponding ME( JVj). Continue the process until a prefixed large N. And then find the N which minimizes ME or AIC. If a is estimated from a histogram, the above formula for ME has no change in the form: Find N so that ME(N) = - | / ( x | a) log f{x | &)dx+—
1 _» min
(6.34a)
But the formula for AIC is slightly changed Find N so that AIC(N) = - V pk log qk (a)+ *-i
-> min
(6.34b)
«,
To obtain equation (6.34b), divide the interval [c,d] into K subintervals \,§k >&+/]• Denote the length of the subinterval by Ak and the number of points falling into &-th cell by nk. Then the first term on the right hand side of equation (6.33b) is rewritten in the form of
(6.35)
Neglecting the terms are constants on the right hand side of equation (6.35) results in equation (6.34b). The integral in equation (6.34a) can only be evaluated through numerical methods, say Gauss quadrature. For one-dimensional problem, computer time is not a big issue and thus choice of a numerical quadrature scheme does not exhibit significant impact on the numerical accuracy and computational efficiency.
144
6, 1-D estimation based on large samples
6.5 Numerical examples In the following examples, we assume the true pdf f(x) is given, and generate ns random numbers from this distribution by employing the method presented in Chapter 3. Using these random data we estimate the coefficients a and N, based on the above analyses. Example 6.2 The exponential distribution Suppose an exponential distribution of the form is given f(x) = exp(-x)
(6.36)
From this distribution, generate a sample of size 20 by use of PRNG introduced in Chapter 3. The generated data are .67Q55220E-02 .19119940E+01 .90983310E+00 .32424040E+00 .45777790E-01
.88418260E+00 .17564670E+01 .1Q453880E+01 35282420E+00 J3858910E+00
.32364390E+00 J181957GE+01 J9749570E-01 J88Q6860E+00 .13271010E+01
.64127740E+00 .10687590E+01 .22005460E+01 .24852700E+00 .10658780E+01
If MODEL(N) denotes the model for approximating the pdf, that is, MODEL(N): / ( x ] a ) = £ a, £,.{*)
(6.37)
we obtain estimates for the following models using equation (6.13). (a) M0DEL(3) representing 3 B-spline functions are used to approximate the pdf a, = 0.4998, fl2 = 7.43 x 10"8, a, = 5.37 x 10~M . Wife parameters in the above substituting the parameters in MODEL(N), the log-likelihood function, ME and AIC can be calculated from the following equation | a)
(6.38a)
6.5. Numerical examples ME(N) = - [f{x | a)log f{x | a ) * + ^ ^ — ^ J 2«, = —-Ylog/(x t |a}+—•—-
145 (6.38b) (6.38c)
that is, £ = 20.13, ME = 1.51, AIC = l.U This model is in feet approximated by the first B-spline and the rest two Bsplines have coefficients nearly equaling zero. (b) MODEL(4) having 4 B-splines a, =0.999, a 2 =2.65>dtr\ ^ = 0 , « 4 = 0 I = 15.05 , ME = 0.89, AIC = 0.90 (b) MODEL(5) having 5 B-splines a, = 0.953,a2 =0.273, a, = 3,82x 10"\ a 4 = 0 , 5 s = 0 £ = 15.05, ME = 1.13, Among these three models, MODEL(4) minimizes both ME and AIC, thus being the best model for the data presented at the beginning of this example. We now use two larger samples to perform the estimation. Suppose two random samples are generated from (6.36). The size of the first sample is 100 and the second 200. Some of the calculated results are given in Table 6.1. N starts from 3 and stops at 11. Values of Log-likelihood function L, ME and AIC are given in the table. From the Table, the log-likelihood function —L is nearly a monotonic function of N. Again we see that it cannot specify the most suitable model. For the case of «s=lQ0, ME and AIC take their minimums at N=4, both yielding the same estimates. The estimated pdfs for iV=3,4,10 are shown in Figure 6.3 (a). For the case of J¥=3, the curve does not fully represent the characteristics of the given curve because the number of B-splines is not enough. In the case of iV==lG, the curve ejdubits more humps than expected. The curve corresponding to N=4 is competitively closer to the original curve than the rest two. Our visual observations and quantitative assessment by use of ME and AIC yield consistent conclusions. In general, the estimate is acceptable. The estimate is improved a lot if the sample size is increased to 200 as shown
146
6. I -D estimation based on large samples
in Figure 6.3 (b). However, ME and AIC are different in this case. ME attains minimum at N=5 and AIC attains minimum at i¥=6. They do not differ too much if viewed from the figure. On the entire interval, the curves for iV=5 and N=6 are almost coincident except on the interval [2.3, 5], where the curve for N=6 is visibly lower than the original curve and the curve for N—5. Table 6,1 The exponential distribution
N 3 4 5 6 7 8 9 10 11
N 3 4 5 6 7 S 9 10 11
-L 106.5 91.6 92.1 91.5 91.1 90.7 90.3 90,4 89.9
(a) «s=100 ME 1.39 0.980 0.985 1.000 0.984 1.019 1.027 1.039 1.052
AIC 1.085 0.946 0.961 0.965 0.971 0.977 0.983 0.994 0.999
-L 213.3 182.8 183.2 181.7 181.9 180J 180.0 180.5 189.9
(b) «,=200 ME 1.376 0.960 0JS2 0.953 0.959 0.954 0.957 0.974 0.981
AIC 1.077 0.939 0.936 0.934 0.939 0.939 0.940 0.948 0.949
6.5. Numerical examples
147
1.2 1 OJ
1
0.6
ens:
o
t»
0.4
•a
0.2 Si
o
0
Figure 6.3 The exponential distribution Hence, comparison with the original curve has revealed that the curve for N=5 (ME-based best) is slightly better than the curve for N=6 (AlC-based best). This is not surprising if we recall the assumptions and derivations in Chapter 5. AIC is an asymptotically approximant to likelihood function, but ME accounts for model uncertainty. It is interesting to note that section 6.1 gives satisfactory expression of continuous pdfe based on approximation theory, section 6.2 estimates the unknown parameters based on statistical theory while section 6.3 solves the problem of model selection based on information theory. None of the above-
148
6. I-D estimation based on large samples
mentioned three theories eould solve the problem of pdf estimation in such a broad sense if they were not combined together. Thus, this chapter and the present example well demonstrate the power of interdisciplinary studies. Example 6.3 The normal distribution The next example considered is that X is normally distributed as
(639)
Again two samples («s=100 and «/=200) were generated from the distribution. The estimated results are shown in Table 6.2, with N starting from 3 and stopping at 13. For the first case («,,=100), ME and AIC predict different models. ME indicates N=7 is the most suitable model while AIC predicts the best model is given by N=E. The difference is solved by plotting me estimated pdfe in Figure 6.4(a), In the figure, the curve for i¥=5 is also plotted. It shows poor correlation to the original curve.
Table 6.2 The normal distribution (a)
N 3 4 S 6 7 8 9 10 11 12 13
-L 173.5 173.2 147.1 151.7 136.8 135.3 134.9 135.1 134.6 134.1 134.0
(b) nx=200
n,,=100
ME(N) 1.984 1.993 1.760 1J20 1.431 1.483 1.483 1.500 1.498 1.498 1.527
AIC 1.755 1.762 1.511 1.567 1.428 1.423 1.429 1.441 1.446 1.451 1.457
N 3 4 5 6 7 8 9 10 11 12 13
-L 348.5 348.5 299.1 308.1 283.2 282.7 282.1 281.1 281.4 281.2 281.2
ME(N) 1.969 1.976 1.730 1.788 1.458 1.480 1.471 1.477 1.488 1.490 1.495
AIC 1.753 1.757 1.516 1.565 1.446 1.449 1.450 1.451 1.457 1.461 1.466
ft 5. Numerical examples
149
0.5 +3
u 01
J
0.2
13
•s 1
.-.
'
If/ \ \ N=7{ME) — /t--~.A\N=8{AIC) --'
0.3 a
Given
(a) «s=100
0.4
0.1 0
V 4 X
10 §
(b) «s=200
0.4 4.
0.3 0.2
1 !
0.1 0 0
1
Figure 6.4 The normal distribution The model iV=7 (ME-based best) is closer to the original curve in terms of shape. It keeps the symmetry of the original curve while the curve for i¥=§ (AICbased best) loses such symmetry. However, both show some quantitative deviations from the original curve. Generally speaking, the model N=7 is slightly better than JV=8. In Figure 6.4 (b) are shown the results obtained from the sample n/=200 and in Table 6.2 the values for likelihood function, ME and AIC are given. In this case, ME and AIC yield same prediction that N=7 is the best. This is in agreement with visual observation from Figure 6.4 (b). Among the three
150
6. I-D estimation based on large samples
curves plotted in the figure, N=l is closest to the original. To see the efficiency of the iterative formula (6.12), the iterative process for three combination coefficients a2,a^ and a6 for the case N = B are plotted in Figure 6.5. After ten iterations, the results for the three coefficients are already very close to the final solutions. The convergence rate is thus remarkable. This is not special case. Numerical experiments yield the same conclusions. In general, the convergence rate of the iterative formula (6.12) is quite satisfactory, giving convergent results after about ten or several tens of iterations. U.4 , . - • • "
"
0.3 0.2 0.1 0
1
0
Iteration number 10
20
30
40
Figure 6.5 Convergence of linear combination coefficients Example 6AA Compound distribution Consider a more complicated example, in which X has the following mixed distribution: (6.40a)
jg(x)dx where g(x) is a function defined on [0,10]
(6.40b) + 0.2x
xe[0,10]
151
6.S. Numerical examples
The definition domain is [0,10]. Three random samples of size «,= 30 and nx = 50 and «s=1000 were generated, respectively. Estimations were separately perfonned for these three samples using the above procedures. The results for likelihood function, ME and AIC are given in Table 6.3. N starts from 6 and stops at 20. Table 6.3 The compound distribution
"N 6
7 8 9 11 12 13
-L 173.1 166.3 171.6 160.7 158.3 159.3 157.2
ME(N) 1.750 1.813 1.881 1.818 1.743 1.811 1.738
(a}«,-100 AIC(N) N 1.781 14 1.723 15 1.781 16 1.687 17 1.683 18 1.703 19 1.692 20
-L 157.6 157.5 156.5 157.0 156.4 156.3 155.9
ME(N) 1.785 1.800 1.778 1.817 1.837 1.819 1.857
AIC(N) .706 .715 .715 .730 .734 .743 .749
ME(N) 1.679 1.671 1.670 1.674 1.679 1.676 1.682
AIC(N) 1.653 1.653 1.654 1.655 1.658 1.659 1.661
ME{N) 1.663 1.654 1.650 1.656 1.657 1.655 1.659
AIC(N) 1.644 1.643 1.643 1.644 1.645 1.646 1.646
(b)«s=500
N
-L
6
862.9 835.5 847.0 819.0 812.3 817.8 812.2
7 8 9 11 12 13
N
-L
6
1727 1674 1697 1642 1629 1640 1629
7 8 9 11 12 13
ME(N) 1.703 1.739 1.727 1.706 1.667 1.689 1.660
AIC(N) 1.736 1.683 1.708 1.654 1.645 1.658 1.648
ME(N) 1.695 1.732 1.722 1.695 1.656 1.677 1.646
(c) «/=1000 AIC(N) N 1.732 14 1.680 15 1.704 16 1.650 17 1.639 18 1.651 19 1.641 20
N
-L
14 15 16 17 18 19 20
813.5 812.4 811.9 811.7 812.1 811.4 811.3
-L 1631 1629 1628 1628 1628 1628 1628
6. I-D estimation based on large samples
152
U.4
yx / \
0.3 / 0.2
•8
1
0.1
(c)ns=1000 \
/
N=ll N=13
—
N=20 Given
\
_
/ VA
•
.
_
•
X
0
2
4
6
8
10
Figure 6.6 Influence of sample size on estimation accuracy
6.5. Numerical examples
153
In general, for all three easesj the maximum likelihood function is a decreasing function of the number of B-splines, N. But as N is large enough, say N is larger than 13, the likelihood function is almost a constant, varying little as N increases further. This is particularly true as the sample size is large, see Table 6.3(c). Comparison of Table 6,3 (a) and (c) shows the influence of the second term (the number of free parameter over the sample size) in AIC. As sample size is large, see Table 6.3 (c), the influence of the second term is not significant, and thus AIC is nearly a constant as the likelihood. As fee sample size is small, see Table 6.3 (a), the second term in AIC becomes more important. For all the three cases, ME-based best models are given by i¥=13 while AICbased best model are given by N = l l . From the table, ME value is favorably larger than AIC value for each N. This is reasonable because ME accounts for one more term of uncertainty, l(g,f), than AIC, Some of the results are plotted in Figure 6.6. In the figure, the curve for N-2Q is plotted for the sake of comparison. If sample size is small, the difference between the estimated pdf and the given distribution is significant, see Figure 6.6 (a). As sample size increases, the difference becomes smaller and smaller, indicating a convergence trend, see Figure 6.3 (b) and (c). Figures 6,6(b) and (c) do not show apparent difference from the viewpoint of statistics. Example 6.5 Estimation from histogram data This example demonstrates the estimation procedure presented in section 6.3.2, Take the sample n/=500 in Example 6,4 for instance. A K==15 histogram is formed on the interval 10,10] as shown in Figure 6,7. With the estimation procedure neglected, the results for the likelihood function, ME and AIC are tabulated in Table 6.4 for each N. The ME-based best model is N=\S, while AIC does not change at all as the sample size is above 17, failing to find the best model. In figure 6.7 is plotted the estimated pdf and the original curve. The estimated pdf (N=15) is very close to the true distribution in comparison of the curve for JV=13. ME performs well for this example. The computer times for all these calculations were within one minute on a Pentium 4 PC, thanks to the introduction of the iterative formulas. The methods are computer-oriented, making estimation automatically performed once the histogram is given.
154
6. 1-D estimation based on large samples Table 6.4 Estimate by histogram data (B/=500)
N 7 8 9 10 11 12 13 14 15 16 17 18 19 20
0
-L
ME
AIC
2.141 2.129 2.102 2.096 2.078 2.084 2.068 2.075 2.065 2.069 2.065 2.065 2.065 2.065
2.235 2.168 2.200 2.136 2.151 2.148 2.124 2.136 2.107 2.126 2.108 2.112 2.109 2.108
2.169 2.157 2.130 2.124 2.107 2.112 2.096 2.104 2.094 2.098 2.093 2.093 2.093 2.093
4
6
8
10
X Figure 6.7 Estimate from histogmm date
Example 6.6 Revisit of entropy estimation Entropy estimation has been theoretically studied in Chapter 5. In this example, Example 6.4 is used to demonstrate how entropy estimates vary with unknown parameters. Suppose sample size n, = 500. Four quantities defined by
155
6.5. Numerical examples 3 AT-1 2 »
(6.41a) (6.41b)
JV-1 2 n.
(6.41c)
JV-1
(6.4 Id)
The values of these quantities are plotted in Figure 6.8, in which the vertical ordinate shows the values of the above-defined quantities and the horizontal ordinate shows the number of parameters. 1.9
1.8
I
1.7
1.6
1.5 10
20
30
Figure 6.S Entropy estimation As sample size increases, the quantity £ 3 tends to be a constant. Because E% is in fact the estimate of the entropy of the random variable under consideration, constancy of £ 3 as sample size is large is just the right estimate of the entropy. The rest three quantities decrease as sample size increases from small to medium size. As sample size increases further, they begin to increase. Basically, Ei and AIC are quite close, because the latter is the asymptotic estimate of the
156
6. 1-D estimation based on large samples
former. This verifies the theoretical analysis presented in Chapter 5. Over quite large range, E\ is largest among the three. This is due to the fact that Ej measures the uncertainty in the whole process. The domain marked by a circle in the figure is the area of interest because the three quantities take minima in this domain. The critical numbers of parameters around which the four quantities take their respective minima are close. Within this domain, the curves for the four quantities defining four types of entropies show large fluctuations. Outside this domain, fluctuations are not significant, all curves varying in a relatively smooth way. It is within this fluctuation domain that the four entropies take their minima. The fluctuations are not meaningless. Take the curve for £ 2 for instance. It is minimized at iV=13. Its value at N=\2 is, however, much larger than the entropy value at JV==12. Such big difference at these two points enables us to convincingly find the number of B-splines minimizing entropy values. 6,6 Concluding Remarks If the methods presented in this chapter are placed at a larger background, a word should be mentioned about estimation methodologies in general. The task of estimating a probability density from data is a fundamental one in reliability engineering, quality control, stock market, machine learning, upon which subsequent inference, learning, and decision-making procedures are based. Density estimation has thus been heavily studied, under three primary umbrellas: parametric, semi-parametric, and nonparametric. Parametric methods are useful when the underlying distribution is known in advance or is simple enough to be well-modeled by a special distribution. Semi-parametric models (such as mixtures of simpler distributions) are more flexible and more forgiving of the user's lack of the true model, but usually require significant computation in order to fit the resulting nonlinear models. Nanparametric methods like the method in this Chapter assume the least structure of the three, and take the strongest stance of letting the data speak for themselves (Silverman,1986). They are useful in the setting of arbitrary-shape distributions coming from complex real-world data sources. They are generally the method of choice in exploratory data analysis for this reason, and can be used, as the other types of models, for the entire range of statistical settings, from machine learning to pattern recognition to stock market prediction (Gray & Moore, 2003). Nonparamefric methods make minimal or no distribution assumptions and can be shown to achieve asymptotic estimation optimality for ANY input distribution under them. For example using the methods in this chapter, with no assumptions at all on the true underlying distribution, As more data are observed, the estimate converges to the true density (Devroye & Gyorfi, 1985). This is clearly a property that no particular parameterization can achieve. For this reason nonparametric estimators are the focus of a considerable body of advanced
statistical theory (Rao, 1983; Devroye & Lugosi, 2001). We will not spend more space on nonparametric estimation here. The interested reader is referred to these books for more exposure to nonparametric estimation. Nonparametric estimation apparently often comes at the heaviest computational cost of the three types of models. This has, to date, been the fundamental limitation of nonparametric methods for density estimation. It prevents practitioners from applying them to the increasingly large datasets that appear in modern real-world problems, and even for small problems, their use as a repeatedly-called basic subroutine is limited. This restriction has been removed by the method presented in this chapter. As a nonparametric estimation method, the advantages of the proposed method are summarized as follows:
(1) In the proposed methods the pdf is estimated only from the given sample. No prior information about the distribution form is necessary;
(2) The pdf can be estimated by a simple iterative formula, which makes the methods computationally effective. The methods are computer-oriented;
(3) The methods provided in this chapter are capable of approximating probability distributions with satisfactory accuracy;
(4) As sample size increases, the estimated pdf gets closer to the true distribution;
(5) ME and AIC analysis are able to pick out the most suitable model from a group of candidate models. ME is more delicate than AIC, but in most cases they yield the same estimates. Note, however, that ME and AIC are not valid if the number of free parameters n_f is too large, because the central limit theorem then fails. The cases where the number of free parameters is larger than the sample size are treated in Chapters 8 and 9.
Without giving more examples, it should be pointed out that fourth-order B-splines yield the same satisfactory results as third-order B-splines. The reader may verify this using the codes given in Chapter 12. The method presented in this chapter, a nonparametric one, also exhibits a shift of estimation strategy from traditional statistics. The typical procedures for determining a probability distribution, as shown in Figure 6.9, are (1) SPECIFICATION: Assume a distribution type f(x | θ), where θ is an unknown parameter, from the family

⋯ ||Db||² → min
(9.20)
Setting the derivative of (9.20) with respect to bᵢⱼ to zero, i = 1, 2, ⋯, M; j = 1, 2, ⋯, N,
(9.21)
we obtain the solution

b = (I + ω²H)⁻¹ a.
(9.22)
The pdf calculated from the following equation, which is obtained from Equation (9.1) with a replaced by b, is thus an estimate based on the Bayesian method:
f(x, y | b) = Σᵢ Σⱼ bᵢⱼ Bᵢ(x) Bⱼ(y).
(9.23)
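Equation (9.23) is a tensor-product B-spline expansion, and evaluating it is straightforward. The sketch below builds cubic B-spline bases on [0, 1] with scipy and evaluates Σᵢⱼ bᵢⱼ Bᵢ(x)Bⱼ(y) for random placeholder coefficients; the knot layout, the interval, M = N = 8 and the coefficients are all illustrative assumptions (in the text, b comes from (9.22)).

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(n_funcs, lo, hi, degree=3):
    """n_funcs clamped B-spline basis functions of the given degree on [lo, hi]."""
    inner = np.linspace(lo, hi, n_funcs - degree + 1)
    knots = np.concatenate([[lo] * degree, inner, [hi] * degree])
    return [BSpline(knots, np.eye(n_funcs)[i], degree, extrapolate=False)
            for i in range(n_funcs)]

M = N = 8
Bx = bspline_basis(M, 0.0, 1.0)
By = bspline_basis(N, 0.0, 1.0)
b = np.random.default_rng(0).random((M, N))        # placeholder coefficients b_ij

def f_hat(x, y):
    """Evaluate f(x, y | b) = sum_ij b_ij B_i(x) B_j(y), cf. (9.23)."""
    vx = np.nan_to_num([float(B(x)) for B in Bx])   # B_i(x)
    vy = np.nan_to_num([float(B(y)) for B in By])   # B_j(y)
    return vx @ b @ vy

print(f_hat(0.3, 0.7))
```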
Once ω² is given, we can obtain the Bayesian point estimation b from equation (9.22). Given different ω², we may obtain different estimates b. Which estimate is the most suitable remains a problem. In the next section, an entropy analysis is used to find the most suitable ω².

9.3.2 Determination of parameter ω²

From equation (9.22), we note that b = a if ω² = 0, and the Bayesian estimation degenerates to the preliminary prediction. From Equation (9.20), we note that b is a plane if ω² → ∞. Thus, ω² is a very important factor in our analysis. We hope to determine ω² in an "objective" way. Note that the marginal probability

P(a | ω²) = ∫ P(a | b) P(b) db
(9.24)
describes the averaged behavior of a, and gives the true distribution of the occurrence of a. Because it is independent of the parameter vector b, we can obtain the estimates ω² and σ² by maximizing the marginal probability. Or, find ω² such that

P(a | ω²) = ∫ P(a | b) P(b) db → max.
(9.25)
Rewritten in the following logarithmic form, the marginal probability is

MEB(ω²) = −2 log P(a | ω²) = −2 log ∫ P(a | b) P(b) db.
(9.26)
Substituting P(a | b) and P(b) into equation (9.26) and denoting by x and F the corresponding stacked vector and matrix, with

|FᵀF| = determinant of FᵀF,

(9.27)
we have

(9.28)

Furthermore, using

||x − Fb||² = ||x − Fb̂||² + ||F(b̂ − b)||² = ℓ² + ||F(b̂ − b)||²,

(9.29)
we may write MEB(ω²) in the form of

MEB(ω²) = −2 log ∫ P(a | b) P(b) db.

(9.30)
The last integral in the equation above is in fact a multidimensional normal distribution with the following normalizing factor:

1 / [ (2πσ²)^(MN/2) |FᵀF|^(1/2) ]

(9.31)
where the parallel signs | · | denote the determinant of the matrix inside them. Thus we obtain

MEB(ω²) = −(MN − 4) log ω² + (MN − 4) log σ² + log|FᵀF| + constant.

(9.32)
The best model is obtained by minimizing MEB. Differentiating MEB in the equation above with respect to σ² and setting the derivative to zero, we obtain

(9.33)
Substituting (9.33) into (9.32), the most suitable ω² is found from

−(MN − 4) log ω² + (MN − 4) log ℓ² + log|FᵀF| → min.

(9.34)
In summary, the Bayesian estimation is composed of the following steps:
(1) Preliminary prediction from Equation (9.2);
(2) Smooth Bayesian point estimation from Equation (9.22) for a given ω²;
(3) Estimation of MEB(ω²) from Equation (9.32);
(4) Repeat steps (2) and (3) for different ω² and choose the ω² which minimizes MEB.
It should be pointed out that special numerical treatments are needed to find the determinant of the matrix |FᵀF|, because the size of this matrix is very big. For example, if M = N = 40, the size of this matrix is of the order of 3200 × 1600, which is hardly manageable with simple numerical methods. This matrix is, however, a sparse one. We may use the Householder reduction method to handle it.

9.4 Householder Transform

Recall from Chapter 8 that the determinant of FᵀF is found through the Gauss elimination method. It is not, however, feasible here to use that method for finding the determinant of FᵀF in equation (9.34), for the size of the matrix FᵀF is so large that the computing time becomes unbearable. An alternative method must be used instead. The proper method for this case is the so-called Householder reduction method, which is suitable for large-scale sparse matrices. The Householder method is composed of three steps. First, transform the real symmetric matrix A into a tridiagonal matrix C. Then the eigenvalues of matrix C are calculated by use of a root-finding method and the corresponding eigenvectors are found. Finally, the determinant is obtained from the eigenvalues by use of the following theorem from linear algebra.

Theorem 9.1 If λ₁, λ₂, ⋯, λₙ are the eigenvalues of the matrix A, then the determinant of A is

det A = λ₁ λ₂ ⋯ λₙ.
Based on this theorem, we focus on finding the eigenvalues of a matrix using the Householder transform. Details are given in the appendix to this chapter.
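To make Theorem 9.1 concrete, the minimal NumPy sketch below forms a small symmetric positive-definite matrix of the type FᵀF, computes its eigenvalues, and recovers log|FᵀF| as the sum of the logarithms of the eigenvalues. The matrix, its size and the use of numpy.linalg.eigvalsh are assumptions for illustration, not the book's routine, which works on the tridiagonalized form described in the appendix.

```python
import numpy as np

# Theorem 9.1: det(A) = product of eigenvalues, so for a positive-definite
# matrix log det(A) = sum of log eigenvalues.
rng = np.random.default_rng(0)
F = rng.standard_normal((60, 30))           # stand-in for the tall matrix F
A = F.T @ F                                 # symmetric positive-definite, like F^T F

eigvals = np.linalg.eigvalsh(A)             # eigenvalues of a symmetric matrix
logdet_from_eigs = np.sum(np.log(eigvals))  # Theorem 9.1 in logarithmic form

sign, logdet_direct = np.linalg.slogdet(A)  # cross-check with a direct routine
print(sign, np.isclose(logdet_from_eigs, logdet_direct))
```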
Figure 9.3 Bayesian estimation based on 200 sample points.
Figure 9.4 Relationship between MEB and ω² and the search for the optimum point (minimum MEB)
9.5 Numerical examples

In the first example, we assume a true pdf f(x, y) and generate n_s random points from this distribution. Using these random data we estimate the coefficient vector b based on the above analysis. In the second example, the present method is applied to a practical problem.

Example 9.1 Normally correlated 2-dimensional pdf

Suppose the true distribution is given by Equation (9.3). It is further assumed that M = N = 40 (in total 40 × 40 = 1600 B-splines are used). Then we generate n_s = 200 random points. The shape of f(x, y) is shown in Figure 9.2 and the random points are shown in Figure 3.9(b). By following the steps given in section 9.3, the optimum Bayesian estimation for this case is found as shown in Figure 9.3. Compared with the preliminary prediction, the Bayesian estimation is much improved and is close to the true distribution shown in Figure 9.2. If we notice the noise-like irregularity in the preliminary prediction and the closeness of the Bayesian estimation to the true distribution, the usefulness of the analysis employed in this chapter is strongly supported. The searching process for the optimum MEB is shown in Figure 9.4. From the figure, we see that after the optimum point, MEB does not change much with ω². The functional relationship between ω² and MEB is quite simple. Thus, it is a rule of thumb (because it is just our observation, without mathematical justification) that there exists only one optimum solution for MEB(ω²) and that MEB(ω²) is a simple and quite smooth function of ω².

To see the influence of sample size on the estimation, three samples of pseudo-random points are generated from the given distribution. The sample sizes are 100, 200 and 300, respectively. The optimum estimated pdfs for the three samples are plotted in Figure 9.5. What is impressive is that the estimations based on 100, 200 and 300 sample points are quite close to each other. Figure 9.6 shows the estimations for three specific ω² values. The estimated pdfs for ω² = 0.01, ω² = 8 and ω² = 200 are plotted in Figure 9.6(a)-(c). If ω² is very small (say, ω² = 0.01), or the variance τ² of b is very large, the Bayesian estimation is close to the preliminary prediction, and the smoothness information is ignored in the estimation. On the other hand, if ω² is very large (say, ω² = 200), or the variance τ² of b is very small, the estimated pdf tends to a flat plane, and the sample information is ignored in the estimation. Thus there are two extremes. At one extreme, the smoothness information is ignored, and at the other extreme the sample information is ignored. With the aid of the Bayesian approach, we successfully combine the two sources of information and obtain a greatly improved estimation.
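The behavior at the two extremes can be reproduced with a small numerical sketch. The smoothing operator below is an assumed second-difference penalty H = DᵀD acting on a 1-D coefficient vector (the book's 2-D operator is not reproduced here), and b = (I + ω²H)⁻¹a is used only to illustrate how the estimate moves from the preliminary prediction a toward a flat solution as ω² grows.

```python
import numpy as np

def smooth_estimate(a, omega2):
    """Smoothed estimate b = (I + omega2 * H)^(-1) a with an assumed
    second-difference roughness penalty H = D^T D (1-D illustration only)."""
    n = len(a)
    D = np.diff(np.eye(n), n=2, axis=0)      # (n-2) x n second-difference operator
    H = D.T @ D
    return np.linalg.solve(np.eye(n) + omega2 * H, a)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
a = np.exp(-0.5 * ((x - 0.5) / 0.15) ** 2) + 0.3 * rng.standard_normal(50)  # noisy preliminary prediction

for omega2 in (0.01, 8.0, 200.0):
    b = smooth_estimate(a, omega2)
    print(f"omega2={omega2:7.2f}  ||b - a|| = {np.linalg.norm(b - a):6.3f}  "
          f"roughness = {np.sum(np.diff(b, 2) ** 2):8.5f}")
```

With ω² = 0.01 the output b essentially reproduces a, while with ω² = 200 the second differences of b are driven toward zero, mirroring the two extremes described above.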
Figure 9.5 Estimation based on three different samples:
(a) Estimation based on 100 sample points (ω² = 10, MEB = 8830)
(b) Estimation based on 200 sample points (ω² = 8, MEB = 8335)
(c) Estimation based on 300 sample points (ω² = 9, MEB = 8027)
Figure 9.6 Estimated pdf for ω² = 0.01, ω² = 8 and ω² = 200 (panels (a)-(c); vertical axes: probability density)
But it should be mentioned that around the optimum point, MEB varies very slowly. For example, the MEB difference between ω² = 10 and ω² = 2 in this example is less than 1%.

Example 9.2 Joint distribution of wave-height and wave-period
Figure 9.7 The Bayesian estimation of the joint distribution of wave-height and wave-period (M = N = 30). H is wave height and T is wave period.

This problem has been studied in Chapter 8 as an example for large samples. Here we use the method developed in this chapter to solve the problem again. The data of wave height and wave period are taken from the records measured by ships during winters in the Pacific (Zong, 2000). The wave periods range from 0 seconds to 16 seconds and the wave heights range from 0 meters to 16 meters. We use 900 B-spline functions to approximate the distribution (M = 30 and N = 30). The optimum is ω² = 0.01. The estimated pdf is shown in Figure 9.7.

9.6 Application to discrete random distributions

The methodology presented in sections 9.2-9.4 has been applied to the logistic model. In this section, we apply it to a discrete random variable to show its capability. Consider a bivariate discrete random vector of the form
P = [ pᵢⱼ ],   i = 1, ⋯, M;  j = 1, ⋯, N.

(9.35)
The M-L estimate of pᵢⱼ is

p̂ᵢⱼ = nᵢⱼ / n_s

(9.36)
where nᵢⱼ is the number of occurrences of event Aᵢⱼ. If the sample size is small, large fluctuations are expected in the estimate p̂ᵢⱼ. To remove the irregularities present in p̂ᵢⱼ, we assume that
Again we obtain (9.6). From here, the formulas presented in sections 9.2-9.4 are applicable.

9.7 Concluding remarks

We are often faced with cases where observed samples show complex distributions and it is difficult to approximate the samples with well-known simple pdfs. In such situations, we have to estimate the pdf directly from the samples. Estimation based on small samples, which is especially influenced by statistical fluctuations, is more difficult still. In this chapter, a method that can directly identify an appropriate pdf for a 2-dimensional random vector from a given small sample is presented. Three models are established in this chapter. One is the likelihood function, which pools the information in the sample data; one is the smooth prior distribution, which defines the smoothness of the unknown parameters; and the last is the MEB, which helps us to find the most suitable ω² (and the prior distribution) in an "objective" way. The usefulness of the method is examined with numerical simulations. It has been found that the estimated pdfs based on the present analysis are stable and yield satisfactory results even for small samples.
Appendix: Householder transform

A.1 Tridiagonalization of a real symmetric matrix

The special case of a matrix that is tridiagonal, that is, has nonzero elements only on the diagonal plus or minus one column, is one that occurs frequently. For tridiagonal sets, the procedures of LU decomposition, forward substitution and back substitution each take only O(N) operations, and the whole solution can be encoded very concisely. Naturally, one does not reserve storage for the full N × N matrix, but only for the nonzero components, stored as three vectors. The purpose here is to find the eigenvalues and eigenvectors of a square matrix A. The optimum strategy is, first, to reduce the matrix to a simple form, and only then to begin an iterative procedure. For symmetric matrices, the preferred simple form is tridiagonal. Instead of trying to reduce the matrix all the way to diagonal form, we are content to stop when the matrix is tridiagonal. This allows the procedure to be carried out in a finite number of steps, unlike the Jacobi method, which requires iteration to convergence. The Householder algorithm reduces an n × n symmetric matrix A to tridiagonal form by n − 2 orthogonal transformations. Each transformation annihilates the required part of a whole column and the whole corresponding row. The basic ingredient is a Householder matrix P, which has the form

P = I − 2wwᵀ
(9.A.1)
where w is a real vector with |w|² = 1. (In the present notation, the outer or matrix product of two vectors a and b is written abᵀ, while the inner or scalar product of the vectors is written aᵀb.) The matrix P is orthogonal, because

P² = (I − 2wwᵀ)·(I − 2wwᵀ) = I − 4wwᵀ + 4w(wᵀw)wᵀ = I
(9.A.2)
Therefore P = P⁻¹. But Pᵀ = P, and so Pᵀ = P⁻¹, proving orthogonality. Rewrite P as

P = I − uuᵀ / H

(9.A.3)
where the scalar H is
H = ½ |u|²

(9.A.4)
and u can now be any vector. Suppose x is the vector composed of the first column of A. Choose

u = x ∓ |x| e₁

(9.A.5)
where e₁ is the unit vector [1, 0, ⋯, 0]ᵀ, and the choice of sign will be made later. Then
P·x = ∓ |x| e₁.

This shows that the Householder matrix P acts on a given vector x to zero all its elements except the first one. To reduce a symmetric matrix A to tridiagonal form, we choose the vector x for the first Householder matrix to be the lower n − 1 elements of the first column. Then the lower n − 2 elements will be zeroed:
P₁·A = [ 1, 0ᵀ ; 0, ⁽ⁿ⁻¹⁾P₁ ]·A = [ a₁₁, a₁₂, ⋯, a₁ₙ ; k, irrelevant ; 0, irrelevant ; ⋮ ; 0, irrelevant ]

(9.A.6)
Here we have written the matrices in partitioned form, with ⁽ⁿ⁻¹⁾P denoting a Householder matrix with dimensions (n − 1) × (n − 1). The quantity k is simply plus or minus the magnitude of the vector [a₂₁, ⋯, aₙ₁]ᵀ. The complete orthogonal transformation is now
A′ = P₁·A·P₁ = [ a₁₁, k, 0, ⋯, 0 ; k, irrelevant ; 0, irrelevant ; ⋮ ; 0, irrelevant ]

(9.A.7)

We have used the fact that Pᵀ = P. Now choose the vector x for the second Householder matrix to be the bottom n − 2 elements of the second column, and from it construct
P₂ = [ 1, 0, 0ᵀ ; 0, 1, 0ᵀ ; 0, 0, ⁽ⁿ⁻²⁾P₂ ]

(9.A.8)
The identity block in the upper left corner ensures that the tridiagonalization achieved in the first step will not be spoiled by this one, while the (n − 2)-dimensional Householder matrix ⁽ⁿ⁻²⁾P₂ creates one additional row and column of the tridiagonal output. Clearly, a sequence of n − 2 such transformations will reduce the matrix A to tridiagonal form. Instead of actually carrying out the matrix multiplications in P·A·P, we compute a vector
p = A·u / H

(9.A.11)
Then

A·P = A − p·uᵀ

(9.A.12a)

A′ = P·A·P = A − p·uᵀ − u·pᵀ + 2K u·uᵀ

(9.A.12b)
where the scalar K is defined by

K = uᵀ·p / (2H)

(9.A.13)
If we write

q ≡ p − K·u

(9.A.14)
then we have

A′ = A − q·uᵀ − u·qᵀ
(9.A.15)
This is the computationally useful formula. Most routines for Householder reduction actually start in the n-th column of A, not the first as in the explanation above. In detail, the equations are as follows. At stage m (m = 1, 2, ⋯, n − 2) the vector u has the form

uᵀ = [ aᵢ₁, aᵢ₂, ⋯, a_{i,i−2}, a_{i,i−1} ± √σ, 0, ⋯, 0 ].

(9.A.16)
Here i = n − m + 1 = n, n − 1, ⋯, 3
(9.A.17)
and the quantity σ (|x|² in our earlier notation) is

σ = (aᵢ₁)² + (aᵢ₂)² + ⋯ + (a_{i,i−1})²

(9.A.18)
We choose the sign of σ in (9.A.18) to be the same as the sign of a_{i,i−1} to lessen round-off error. Variables are thus computed in the following order: σ, H, u, p, K, q, A′. At any stage m, A is tridiagonal in its last m − 1 rows and columns. If the eigenvectors of the final tridiagonal matrix are found (for example, by the routine in the next section), then the eigenvectors of A can be obtained by applying the accumulated transformation

Q = P₁ P₂ ⋯ P_{n−2}
(9.A.19)
to those eigenvectors. We therefore form Q by recursion after all the P's have been determined:

Q_{n−2} = P_{n−2},   Qⱼ = Pⱼ·Qⱼ₊₁,   j = n − 3, ⋯, 1,   so that Q = Q₁.

(9.A.20)
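For reference, the reduction can be sketched compactly in NumPy. The sketch below applies the similarity transform P·A·P column by column, in the spirit of (9.A.1)-(9.A.8), rather than reproducing the u, p, K, q bookkeeping of (9.A.11)-(9.A.20); the function name, the test matrix and the helper details are illustrative assumptions, not the book's routine.

```python
import numpy as np

def householder_tridiagonalize(A):
    """Reduce a real symmetric matrix to tridiagonal form by n-2 Householder
    reflections P = I - 2 v v^T (cf. (9.A.1)); the eigenvalues are preserved."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(n - 2):
        x = A[k + 1:, k].copy()                 # lower part of column k
        sigma = np.linalg.norm(x)               # |x|, cf. sigma in (9.A.18)
        if sigma == 0.0:
            continue                            # column is already in tridiagonal form
        alpha = -np.copysign(sigma, x[0])       # sign chosen to lessen round-off error
        v = x.copy()
        v[0] -= alpha                           # v proportional to u = x -/+ |x| e1
        v /= np.linalg.norm(v)
        # Similarity transform of the trailing block: S <- (I - 2vv^T) S (I - 2vv^T).
        S = A[k + 1:, k + 1:]
        w = S @ v
        c = v @ w
        A[k + 1:, k + 1:] = (S - 2.0 * np.outer(v, w) - 2.0 * np.outer(w, v)
                               + 4.0 * c * np.outer(v, v))
        # The reflected column/row k keeps a single nonzero off-diagonal entry alpha.
        A[k + 1, k] = alpha
        A[k + 2:, k] = 0.0
        A[k, k + 1] = alpha
        A[k, k + 2:] = 0.0
    return A

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
A = B @ B.T                                     # small symmetric test matrix
T = householder_tridiagonalize(A)
print(np.allclose(np.linalg.eigvalsh(T), np.linalg.eigvalsh(A)))  # spectra agree
```

Because each reflection is orthogonal, the tridiagonal output has the same eigenvalues as A, which is exactly what is needed for the determinant calculation of Theorem 9.1.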
A.2 Finding eigenvalues of a tridiagonal matrix by the bisection method

Tridiagonalization leads to the following tridiagonal matrix:

C =
[ c₁  b₂            ]
[ b₂  c₂  b₃        ]
[      ⋱   ⋱   ⋱   ]
[           bₙ  cₙ  ]
(9.A.21)
Once our original, real, symmetric matrix has been reduced to tridiagonal form, one possible way to determine its eigenvalues is to find the roots of the characteristic polynomial pₙ(λ) directly. The characteristic polynomial of a tridiagonal matrix can be evaluated for any trial value of λ by an efficient recursion relation.

Theorem A.1 Suppose bᵢ ≠ 0 (i = 2, 3, ⋯, n). For any λ, the characteristic polynomials form a Sturmian sequence {pᵢ(λ)}, i = 0, ⋯, n, satisfying

p₀(λ) = 1,   p₁(λ) = c₁ − λ,   pᵢ(λ) = (cᵢ − λ) pᵢ₋₁(λ) − bᵢ² pᵢ₋₂(λ),   i = 2, ⋯, n.

(9.A.22)
If a(λ) denotes the number of sign changes between neighboring members of this sequence, then the number of eigenvalues of A smaller than λ is a(λ). The polynomials of lower degree produced during the recurrence form a Sturmian sequence that can be used to localize the eigenvalues to intervals on the real axis. A root-finding method such as bisection or Newton's method can then be employed to refine the intervals. Suppose all eigenvalues of A satisfy λ₁ < λ₂ < ⋯ < λₙ.
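The sign-count a(λ) of Theorem A.1 turns eigenvalue location into a counting problem, and bisection on a(λ) then pins down each eigenvalue. A minimal sketch is given below; the function names, the Gershgorin starting interval and the zero-handling convention are illustrative assumptions rather than the book's routine.

```python
import numpy as np

def sign_changes(c, e, lam):
    """a(lam): sign changes in the Sturm sequence p_0(lam), ..., p_n(lam) of
    (9.A.22), which equals the number of eigenvalues smaller than lam.
    c: diagonal of the tridiagonal matrix (length n); e: off-diagonal (length n-1)."""
    n = len(c)
    p_prev, p_cur = 1.0, c[0] - lam                  # p_0 and p_1
    count = 1 if p_cur < 0 else 0
    for k in range(2, n + 1):
        p_next = (c[k - 1] - lam) * p_cur - e[k - 2] ** 2 * p_prev
        if p_next == 0.0:                            # a zero takes the opposite sign
            p_next = -np.finfo(float).tiny if p_cur > 0 else np.finfo(float).tiny
        if (p_next < 0) != (p_cur < 0):
            count += 1
        p_prev, p_cur = p_cur, p_next
    return count

def kth_eigenvalue(c, e, k, tol=1e-12):
    """Locate the k-th smallest eigenvalue (k = 1, ..., n) by bisection on a(lam)."""
    r = np.zeros(len(c))                             # Gershgorin radii bracket the spectrum
    r[:-1] += np.abs(e)
    r[1:] += np.abs(e)
    lo, hi = float(np.min(c - r)), float(np.max(c + r))
    while hi - lo > tol * max(1.0, abs(lo), abs(hi)):
        mid = 0.5 * (lo + hi)
        if sign_changes(c, e, mid) >= k:             # at least k eigenvalues below mid
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(1)
n = 8
c, e = rng.standard_normal(n), rng.standard_normal(n - 1)
T = np.diag(c) + np.diag(e, 1) + np.diag(e, -1)
approx = [kth_eigenvalue(c, e, k) for k in range(1, n + 1)]
print(np.allclose(np.linalg.eigvalsh(T), approx, atol=1e-8))
```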
Now we are to determine the parameters aᵢⱼ. From the process of classification, the membership function μᵢ(xₜ) is regarded as the same as the probability that a sample point xₜ is classified into fuzzy set Aᵢ, that is,

Pr[xₜ ∈ Aᵢ] = μᵢ(xₜ).
(10.11)
Therefore, we employ the likelihood analysis for the determination of the membership functions. The probability of the classification event for all the sample points xₜ (t = 1, 2, ⋯, n_s) is expressed by the following likelihood function:

L = ∏_{i=1}^{M} ∏_{x_t ∈ A_i} μᵢ(xₜ).

The log-likelihood function is
log L = Σ_{i=1}^{M} Σ_{x_t ∈ A_i} log μᵢ(xₜ) = Σ_{i=1}^{M} Σ_{x_t ∈ A_i} log ( Σ_{j=1}^{N} aᵢⱼ Bⱼ(xₜ) ),

where Σ_{i=1}^{M} μᵢ(x) = Σ_{j=1}^{N} ( Σ_{i=1}^{M} aᵢⱼ ) Bⱼ(x) = 1. According to the B-spline function properties we have

Σ_{j=1}^{N} Bⱼ(x) = 1.
(10.14)
From the above two equations the following relationship is obtained:

Σ_{i=1}^{M} aᵢⱼ = 1,   j = 1, ⋯, N.
(10.15)
A membership function is always greater than or equal to zero. To guarantee this we simply impose the restriction that all parameters aᵢⱼ are greater than or equal to zero, that is,

aᵢⱼ ≥ 0,   i = 1, ⋯, M;  j = 1, ⋯, N.
(10.16)
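To see how the constraints (10.15) and (10.16) enter a fit, the sketch below maximizes the classification likelihood for two fuzzy sets, with piecewise-linear hat functions standing in for the B-splines (they also satisfy Σⱼ Bⱼ(x) = 1) and scipy's SLSQP solver standing in for the optimization method of this chapter; the sample, its labels and all sizes are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
M, N, ns = 2, 6, 200
x = rng.uniform(0.0, 1.0, ns)
labels = (x + 0.15 * rng.standard_normal(ns) > 0.5).astype(int)  # noisy classification

nodes = np.linspace(0.0, 1.0, N)                   # hat-function centres
def basis(xv):
    """(ns, N) matrix of hat-function values; each row sums to one."""
    return np.clip(1.0 - np.abs(xv[:, None] - nodes[None, :]) * (N - 1), 0.0, None)

Bx = basis(x)

def negloglik(a_flat):
    a = a_flat.reshape(M, N)
    mu = Bx @ a.T                                   # mu_i(x_t) = sum_j a_ij B_j(x_t)
    return -np.sum(np.log(np.clip(mu[np.arange(ns), labels], 1e-12, None)))

cons = [{"type": "eq", "fun": lambda a, j=j: a.reshape(M, N)[:, j].sum() - 1.0}
        for j in range(N)]                          # (10.15): each column sums to one
res = minimize(negloglik, np.full(M * N, 1.0 / M),
               bounds=[(0.0, 1.0)] * (M * N),       # (10.16): a_ij >= 0
               constraints=cons, method="SLSQP")
a_hat = res.x.reshape(M, N)
print(np.round(a_hat, 2))                           # fitted membership coefficients a_ij
```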
Usually, we hope

From equations (10.A.9) and (10.A.12), the following relationship is obtained,

(10.A.13)

The above equation indicates that if a⁰ is a local optimum solution, it must be the global solution.
Chapter 11
Estimation of distributions by use of the maximum entropy method
The Maximum Entropy Method (MEM) cannot be ignored when information theory is applied to finding the pdf of a random variable. Although MEM was explicitly formulated first in 1957 by Jaynes (1957a,b), it was implicitly used by Einstein in the early 20th century for solving problems in quantum statistical mechanics (Eisberg & Resnick, 1985). However, neither Shannon (Shannon & Weaver, 1949; Khinchin, 1957) nor Kullback (1957) touched the topic of MEM in their original works. The central idea in MEM is the Maximum Entropy (Maxent) Principle. It was proposed for solving problems short of information. When formulating the Maxent Principle, Jaynes (1957) argued that in cases where only partial information is available about the problem under consideration, we should use the probability distribution maximizing entropy subject to the known constraints. Any other probability assignment implies that unprovable assumptions or constraints are introduced into the inference, so that the inference is biased. MEM is a fascinating tool. Formulated properly, all the special distributions (normal, exponential, Cauchy, etc.) can be determined by use of MEM. In other words, these special distributions solve the governing equations of MEM. Expressed in Jaynes' language, each known special distribution represents an unbiased probability distribution when some piece of information is not available. Besides academic research on MEM, MEM has been applied to a variety of fields for solving problems present in communication (Usher, 1984), economics (Golan et al., 1996; Fraser, 2000; Shen & Perloff, 2001), agriculture (Preckel, 2001) and image processing (Baribaud, 1990). In this chapter, the application of MEM for estimating distributions based on samples is introduced.
11.1 Maximum entropy

The story of a detective cracking a case illustrates the MEM. Initially, only very little information is available, and as many suspects as possible should be investigated without favoring any of them. As the investigation proceeds, some suspects are eliminated, while others receive more extensive and intensive investigation. As more and more evidence is collected, the true murderer is found. At each stage of the investigation, the principle behind the investigation is to keep all suspects under investigation. No suspect is eliminated without strong evidence to support doing so. Mathematically speaking, the probability for suspect i to have committed the crime is pᵢ, i = 1, ⋯, M. Initially, all suspects are equally suspected, and thus

pᵢ = 1/M.
(11.1)
As more information is collected, some suspects are eliminated from the investigation due to alibis. If M₁ suspects are excluded, the probability for each of the remaining suspects is

pᵢ = 1/(M − M₁).
(11.2)
Finally, when M₁ = M − 1, only one suspect is identified. Suppose there are ten suspects initially. Then the detecting process can be written in the following form (rows are successive stages, columns are the suspects A₁, ⋯, A₁₀):

Stage 1:  0.1 for each of the ten suspects
Stage 2:  0.2 for each of the five remaining suspects, 0 for the others
Stage 3:  0.5 for each of the two remaining suspects, 0 for the others
Stage 4:  1 for the identified suspect, 0 for the others
(11.3)
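A quick computation shows how the entropy of the assignment in (11.3) shrinks as suspects are eliminated; the stage sizes 10, 5, 2 and 1 are read directly off the table above.

```python
import numpy as np

for m in (10, 5, 2, 1):                 # remaining suspects at each stage of (11.3)
    p = np.full(m, 1.0 / m)
    print(f"{m:2d} suspects remaining: entropy = {-np.sum(p * np.log(p)):.4f}")
```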
Two observations from the process are of direct relevance to MEM. First of all, at each stage, probability is always assigned to each remaining suspect equally, without favoring any of the suspects. Why do so? Without further alibis, it is dangerous to eliminate a suspect from the investigation too early. This is in fact the rationale behind the Maxent Principle. Because the uncertainty is expressed by entropy, the maxent principle states

− Σᵢ pᵢ log pᵢ → max.
(11.4)
The second observation is that as more and more information is collected, the probability of finding the true criminal becomes larger and larger. Initially, all suspects have equal probability of having committed the crime. This piece of information, expressed as a constraint, is

Σᵢ pᵢ = 1.
(11.5)
Equations (11.4) and (11.5) outline the essence of MEM. In words, if we are seeking a probability density function subject to certain constraints (e.g., a given mean or variance), use the density satisfying those constraints which makes the entropy as large as possible. Jaynes (1957a,b) formulated the principle of maximum entropy as a method of statistical inference. His idea is that this principle leads to the selection of a probability density function which is consistent with our knowledge and introduces no unwarranted information. Any probability density function satisfying the constraints which has smaller entropy will contain more information (less uncertainty), and thus says something stronger than what we are assuming. The probability density function with maximum entropy, satisfying whatever constraints we impose, is the one which should be least surprising in terms of the predictions it makes. It is important to clear up an easy misconception: the principle of maximum entropy does not give us something for nothing. For example, a coin is not fair just because we don't know anything about it. In fact, to the contrary, the principle of maximum entropy guides us to the best probability distribution which reflects our current knowledge, and it tells us what to do if experimental data do not agree with predictions coming from our chosen distribution: understand why the phenomenon being studied behaves in an unexpected way (find a previously unseen constraint) and maximize entropy over the distributions which satisfy all the constraints we are now aware of, including the new one. A proper appreciation of the principle of maximum entropy goes hand in hand with a certain attitude about the interpretation of probability distributions. A probability distribution can be viewed as: (1) a predictor of frequencies of outcomes over repeated trials, or (2) a numerical measure of plausibility that some individual situation develops in certain ways. Sometimes the first (frequency) viewpoint is meaningless, and only the second (subjective) interpretation of probability makes sense. For instance, we can ask about the probability that civilization will be wiped out by an asteroid in the next 10,000 years, or the probability that the Red Sox will win the World Series again. We illustrate the principle of maximum entropy in the following three theorems (Conrad, 2005).
Theorem 11.1 For a probability density function p on a finite set {x₁, ⋯, xₙ},

H(p₁, p₂, ⋯, pₙ) ≤ log n = H(1/n, 1/n, ⋯, 1/n)

(11.6)
with equality if and only if p is uniform, i.e., p(xᵢ) = 1/n for all i. Proof: see equation (5.13) in Chapter 5. Concretely, if p₁, p₂, ⋯, pₙ are nonnegative numbers with Σ pᵢ = 1, then Theorem 11.1 says −Σ pᵢ log pᵢ ≤ log n, with equality if and only if every pᵢ is 1/n.

Theorem 11.2 For a continuous probability density function f(x) with variance σ²,

H(f) ≤ ½ [ 1 + log(2πσ²) ]

(11.7)

with equality if and only if f(x) is Gaussian with variance σ², i.e., for some μ we have

f(x) = (1/√(2πσ²)) exp( −(x − μ)² / (2σ²) ).

(11.8)
Note that the right hand side of equation (11.7) is the entropy of a Gaussian. This describes a conceptual role for Gaussians which is simpler than the Central Limit Theorem.
Proof. Let f(x) be a probability density function with variance σ². Let μ be its mean. (The mean exists by definition of the variance.) Letting g(x) be the Gaussian with mean μ and variance σ²,

− ∫ f(x) log g(x) dx = ∫ f(x) [ ½ log(2πσ²) + (x − μ)²/(2σ²) ] dx.

(11.9)

Splitting up the integral into two integrals, the first is ½ log(2πσ²) since
∫ f(x) dx = 1, and the second is 1/2 since ∫ f(x)(x − μ)² dx = σ² by definition. Thus the total integral is ½[1 + log(2πσ²)], which is the entropy of g(x). Based on equation (5.45), we conclude that

H(f) ≤ − ∫ f(x) log g(x) dx = ½[1 + log(2πσ²)].

(11.10)
Theorem 11.3 For any continuous probability density function f on (0, ∞) with mean λ,

H(f) ≤ 1 + log λ
(11.11)
with equality if and only if f is exponential with mean λ, i.e.,

f(x) = (1/λ) e^(−x/λ).
Proof. Let f(x) be a probability density function on (0, ∞) with mean λ. Letting g(x) = (1/λ) e^(−x/λ),

− ∫₀^∞ f(x) log g(x) dx = ∫₀^∞ f(x) [ log λ + x/λ ] dx.

Since f has mean λ, this integral is log λ + 1, which is the entropy of g.
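Theorem 11.2 can also be checked numerically: among distributions scaled to unit variance, the Gaussian attains the bound in (11.7) and the others fall below it. The particular comparison distributions and the use of scipy are illustrative choices, not part of the text.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def diff_entropy(pdf, lo, hi):
    """Numerical differential entropy -∫ f log f dx."""
    integrand = lambda t: -pdf(t) * np.log(pdf(t)) if pdf(t) > 0 else 0.0
    val, _ = quad(integrand, lo, hi)
    return val

unit_var = {
    "Gaussian": (stats.norm(0.0, 1.0).pdf, -20.0, 20.0),
    "Laplace": (stats.laplace(0.0, 1.0 / np.sqrt(2.0)).pdf, -20.0, 20.0),
    "Uniform": (stats.uniform(-np.sqrt(3.0), 2.0 * np.sqrt(3.0)).pdf,
                -np.sqrt(3.0), np.sqrt(3.0)),
}
for name, (pdf, lo, hi) in unit_var.items():
    print(f"{name:8s} H = {diff_entropy(pdf, lo, hi):.4f}")
print("Bound (11.7):", 0.5 * (1.0 + np.log(2.0 * np.pi)))
```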
-'"1,-)f(x)lagg(x)dx = ["/(*)Jogfloga+yU .Since/has mean A, this integral is log A + 1, which is the entropy of g. Theorem 11.3 suggests that for an experiment with positive outcomes whose mean value is known, the most conservative probabilistic model consistent with that mean value is an exponential distribution. In each of Theorems 11.1, 11.2, and 11.3, entropy is maximized over distributions on a fixed domain satisfying certain constraints. The following Table summarizes these extra constraints, which in each case amounts to fixing the value of some integral. Distribution Uniform Normal with mean ju Exponential
Domain Finite (-so, oo) (0.OD)
Fixed Value None
l(x-/i)'ftx)dx
[xfixyt
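The same "fix an integral, maximize entropy" recipe can be carried out numerically. The sketch below maximizes the discrete entropy over a finite support subject to a fixed mean (the support points, the target mean and the use of scipy's SLSQP solver are illustrative assumptions); the maximizer comes out with the exponential (Gibbs) shape pᵢ ∝ e^(−βxᵢ), in line with Theorem 11.3.

```python
import numpy as np
from scipy.optimize import minimize

# Discrete analogue of the table above: maximize -sum p log p over a finite
# support, subject to sum p = 1 and a fixed mean (the "fixed value" column).
x = np.arange(1.0, 21.0)                 # assumed support points
target_mean = 4.0                        # assumed fixed mean

def neg_entropy(p):
    p = np.clip(p, 1e-12, None)
    return float(np.sum(p * np.log(p)))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},
    {"type": "eq", "fun": lambda p: np.dot(p, x) - target_mean},
]
p0 = np.full(x.size, 1.0 / x.size)       # start from the uniform distribution
res = minimize(neg_entropy, p0, bounds=[(0.0, 1.0)] * x.size,
               constraints=constraints, method="SLSQP")
p = res.x

print("mean of maximizer:", np.dot(p, x))
print("successive ratios p[i+1]/p[i] (roughly constant => exponential shape):")
print(np.round(p[1:6] / p[:5], 3))
```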
How does one discover these extra constraints? They come from asking, for a given distribution g(x) (which we aim to characterize via maximum entropy),
what extra information about distributions f(x) on the same domain is needed. For instance, in the setting of Theorem 11.3, we want to realize an exponential distribution g(x) = (1/λ)e^(−x/λ) on (0, ∞) as a maximum entropy distribution. For any distribution f(x) on (0, ∞),
− ∫₀^∞ f(x) log g(x) dx = ∫₀^∞ f(x) [ log λ + x/λ ] dx = (log λ) ∫₀^∞ f(x) dx + (1/λ) ∫₀^∞ x f(x) dx.
(11.12)
To complete this calculation, we need to know the mean of f(x). This is why, in Theorem 11.3, the exponential distribution is the one on (0, ∞) with maximum entropy having a given mean. The reader should consider Theorems 11.1 and 11.2, as well as later characterizations of distributions in terms of maximum entropy, in this light. We turn to n-dimensional distributions, generalizing Theorem 11.2. Entropy is defined in terms of integrals over Rⁿ.

Theorem 11.4 For a continuous probability density function f on Rⁿ with fixed covariances σᵢⱼ,

H(f) ≤ ½ [ n + log( (2π)ⁿ |Σ| ) ]

(11.13)

where Σ = (σᵢⱼ) is the covariance matrix for f(x). There is equality if and only if f(x) is an n-dimensional Gaussian density with covariances σᵢⱼ. We recall the definition of the covariances σᵢⱼ. For an n-dimensional probability density function f(x), its means are μᵢ = ∫ xᵢ f(x) dx and its covariances are

σᵢⱼ = ∫ (xᵢ − μᵢ)(xⱼ − μⱼ) f(x) dx.
(11.14)
In particular, σᵢᵢ ≥ 0. When n = 1, σ₁₁ = σ² in the usual notation. The symmetric
matrix Σ = (σᵢⱼ) is positive-definite, since the matrix ((vᵢ, vⱼ)) is positive-definite for any finite set of linearly independent vᵢ in a real inner product space.
Proof. The Gaussian densities on Rⁿ are those probability density functions of the form

G(x) =

→ min
(11.39)
→ min

(11.40a)

Or equivalently,

∫ Bᵢ(x) exp( −1 + Σⱼ λⱼ Bⱼ(x) ) dx − βᵢ → min

subject to

∫ exp( −1 + Σⱼ λⱼ Bⱼ(x) ) dx = 1.

(11.40b)
Equations (11.25) and (11.26), (11.40) are all equivalent. We are already familiar with optimization problems, which appeared in Chapters 6-10. Three optimization methods have been used: the iterative formulas used in Chapters 6 and 7, and the Flexible Tolerance Method (FTM) used in Chapter 10. It seems hard to obtain a nice iterative formula like the one obtained in Chapter 6, so we have to resort to FTM and GA. When solving equations (11.39) or (11.40), the search process is stopped if L < ε or the solution step number exceeds Ne. Here ε is a prefixed small number, say 10⁻³, and Ne is a prescribed large number, say 500.

11.5 Asymptotically unbiased estimate of λᵢ

We prove that solving for λᵢ by equations (11.39) or (11.40) yields an asymptotically unbiased estimate of λᵢ. To see this, note that in equation (11.31) the estimates β̂ᵢ are asymptotically normal random variables whose errors have zero mean, based on the law of large numbers. Denote the true value of λᵢ by λᵢ⁰ and expand the left hand side of equation (11.26) with respect to the true value λᵢ⁰. Moreover, denoting the pdf by f(x | λ), we obtain
(11.41)
The estimates λ̂ᵢ depend on the sample, and are thus functions of the sample. Taking the average of both sides of equation (11.26) over the sample X yields
As mentioned above, the β̂ᵢ are asymptotically normal variables with mean βᵢ⁰. Thus, as the sample size becomes large, E[β̂ᵢ] = βᵢ⁰, where the superscript 0 represents the true value of βᵢ. Therefore, the second term on the left hand side of equation (11.42) must be zero, because the first term on the left hand side and the term on the right hand side of equation (11.42) are equal. In other words,
E_X[λ̂ᵢ] = λᵢ⁰.
(11.43)
Therefore, equation (11.39) or equations (11.40) yield asymptotically unbiased estimates of λᵢ.