Statistical Methods for Geography

Author: Peter A. Rogerson

STATISTICAL METHODS FOR GEOGRAPHY

PETER A. ROGERSON

SAGE Publications
London · Thousand Oaks · New Delhi

© Peter A. Rogerson 2001

First published 2001

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act, 1988, this publication may be reproduced, stored or transmitted in any form, or by any means, only with the prior permission in writing of the publishers, or in the case of reprographic reproduction, in accordance with the terms of licences issued by the Copyright Licensing Agency. Inquiries concerning reproduction outside those terms should be sent to the publishers.

SAGE Publications Ltd
6 Bonhill Street
London EC2A 4PU

SAGE Publications Inc.
2455 Teller Road
Thousand Oaks, California 91320

SAGE Publications India Pvt Ltd
32, M-Block Market
Greater Kailash - I
New Delhi 110 048

British Library Cataloguing in Publication data
A catalogue record for this book is available from the British Library

ISBN 0 7619 6287 5
ISBN 0 7619 6288 3 (pbk)

Library of Congress catalog record available

Typeset by Keyword Publishing Services Limited, UK
Printed in Great Britain by The Cromwell Press Ltd, Trowbridge, Wiltshire

Contents

Preface

1 Introduction to Statistical Analysis in Geography
  1.1 Introduction
  1.2 The scientific method
  1.3 Exploratory and confirmatory approaches in geography
  1.4 Descriptive and inferential methods
    1.4.1 Overview of descriptive analysis
    1.4.2 Overview of inferential analysis
  1.5 The nature of statistical thinking
  1.6 Some special considerations with spatial data
    1.6.1 Modifiable areal unit problem
    1.6.2 Boundary problems
    1.6.3 Spatial sampling procedures
    1.6.4 Spatial autocorrelation
  1.7 Descriptive statistics in SPSS for Windows 9.0
    1.7.1 Data input
    1.7.2 Descriptive analysis
  Exercises

2 Probability and Probability Models
  2.1 Mathematical conventions and notation
    2.1.1 Mathematical conventions
    2.1.2 Mathematical notation
    2.1.3 Examples
  2.2 Sample spaces, random variables, and probabilities
  2.3 The binomial distribution
  2.4 The normal distribution
  2.5 Confidence intervals for the mean
  2.6 Probability models
    2.6.1 The intervening opportunities model
    2.6.2 A model of migration
    2.6.3 The future of the human population
  Exercises

3 Hypothesis Testing and Sampling
  3.1 Hypothesis testing and one-sample z-tests of the mean
  3.2 One-sample t-tests
    3.2.1 Illustration
  3.3 One-sample tests for proportions
    3.3.1 Illustration
  3.4 Two-sample tests
    3.4.1 Two-sample t-tests for the mean
    3.4.2 Two-sample tests for proportions
  3.5 Distributions of the variable and the test statistic
  3.6 Spatial data and the implications of nonindependence
  3.7 Sampling
    3.7.1 Spatial sampling
  3.8 Two-sample t-tests in SPSS for Windows 9.0
    3.8.1 Data entry
    3.8.2 Running the t-test
  Exercises

4 Analysis of Variance
  4.1 Introduction
    4.1.1 A note on the use of F-tables
    4.1.2 More on sums of squares
  4.2 Illustrations
    4.2.1 Hypothetical swimming frequency data
    4.2.2 Diurnal variation in precipitation
  4.3 Analysis of variance with two categories
  4.4 Testing the assumptions
  4.5 The nonparametric Kruskal-Wallis test
    4.5.1 Illustration: diurnal variation in precipitation
    4.5.2 More on the Kruskal-Wallis test
  4.6 Contrasts
    4.6.1 A priori contrasts
  4.7 Spatial dependence
  4.8 One-way ANOVA in SPSS for Windows 9.0
    4.8.1 Data entry
    4.8.2 Data analysis and interpretation
    4.8.3 Levene's test for equality of variances
    4.8.4 Tests of normality: the Shapiro-Wilk test
  Exercises

5 Correlation
  5.1 Introduction and examples of correlation
  5.2 More illustrations
    5.2.1 Mobility and cohort size
    5.2.2 Statewide infant mortality rates and income
  5.3 A significance test for r
    5.3.1 Illustration
  5.4 The correlation coefficient and sample size
  5.5 Spearman's rank correlation coefficient
  5.6 Additional topics
    5.6.1 Confidence intervals for correlation coefficients
    5.6.2 Differences in correlation coefficients
    5.6.3 The effect of spatial dependence on significance tests for correlation coefficients
    5.6.4 Modifiable areal unit problem and spatial aggregation
  5.7 Correlation in SPSS for Windows 9.0
    5.7.1 Illustration
  Exercises

6 Introduction to Regression Analysis
  6.1 Introduction
  6.2 Fitting a regression line to a set of bivariate data
  6.3 Regression in terms of explained and unexplained sums of squares
  6.4 Assumptions of regression
  6.5 Standard error of the estimate
  6.6 Tests for beta
  6.7 Confidence intervals
  6.8 Illustration: income levels and consumer expenditures
  6.9 Illustration: state aid to secondary schools
  6.10 Linear versus nonlinear models
  6.11 Regression in SPSS for Windows 9.0
    6.11.1 Data input
    6.11.2 Analysis
    6.11.3 Options
    6.11.4 Output
  Exercises

7 More on Regression
  7.1 Multiple regression
    7.1.1 Multicollinearity
    7.1.2 Interpretation of coefficients in multiple regression
  7.2 Misspecification error
  7.3 Dummy variables
    7.3.1 Dummy variable regression in a recreation planning example
  7.4 Multiple regression illustration: species in the Galapagos Islands
    7.4.1 Model 1: the kitchen-sink approach
    7.4.2 Missing values
    7.4.3 Outliers and multicollinearity
    7.4.4 Model 2
    7.4.5 Model 3
    7.4.6 Model 4
  7.5 Variable selection
  7.6 Categorical dependent variable
    7.6.1 Binary response
  7.7 A summary of some problems that can arise in regression analysis
  7.8 Multiple and logistic regression in SPSS for Windows 9.0
    7.8.1 Multiple regression
    7.8.2 Logistic regression
  Exercises

8 Spatial Patterns
  8.1 Introduction
  8.2 The analysis of point patterns
    8.2.1 Quadrat analysis
    8.2.2 Nearest neighbor analysis
  8.3 Geographic patterns in areal data
    8.3.1 An example using a chi-square test
    8.3.2 The join-count statistic
    8.3.3 Moran's I
  8.4 Local statistics
    8.4.1 Introduction
    8.4.2 Local Moran statistic
    8.4.3 Getis's Gi statistic
  8.5 Finding Moran's I using SPSS for Windows 9.0
  Exercises

9 Some Spatial Aspects of Regression Analysis
  9.1 Introduction
  9.2 Added-variable plots
  9.3 Spatial regression
  9.4 Spatially varying parameters
    9.4.1 The expansion method
    9.4.2 Geographically weighted regression
  9.5 Illustration
    9.5.1 Ordinary least-squares regression
    9.5.2 Added-variable plots
    9.5.3 Spatial regression
    9.5.4 Expansion method
    9.5.5 Geographically weighted regression
  Exercises

10 Data Reduction: Factor Analysis and Cluster Analysis
  10.1 Factor analysis and principal components analysis
    10.1.1 Illustration: 1990 census data for Buffalo, New York
    10.1.2 Regression analysis on component scores
  10.2 Cluster analysis
    10.2.1 More on agglomerative methods
    10.2.2 Illustration: 1990 census data for Erie County, New York
  10.3 Data reduction methods in SPSS for Windows 9.0
    10.3.1 Factor analysis
    10.3.2 Cluster analysis
  Exercises

Epilogue

Selected publications

Appendix A: Statistical tables
  Table A.1 Random digits
  Table A.2 Normal distribution
  Table A.3 Student's t distribution
  Table A.4 Cumulative distribution of Student's t distribution
  Table A.5 F distribution
  Table A.6 χ² distribution
  Table A.7 Coefficients for the Shapiro-Wilk W test
  Table A.8 Critical values for the Shapiro-Wilk W test

Appendix B: Review and extension of some probability theory
  Expected values
  Variance of a random variable
  Covariance of random variables

Bibliography

Index

Preface

The development of geographic information systems (GIS), an increasing availability of spatial data, and recent advances in methodological techniques have all combined to make this an exciting time to study geographic problems. During the late 1970s and throughout the 1980s there had been, among many, an increasing disappointment in, and questioning of, the methods developed during the quantitative revolution of the 1950s and 1960s. Perhaps this reflected expectations that were initially too high; many had thought that sheer computing power coupled with sophisticated modeling would "solve" many of the social problems faced by urban and rural regions. But the poor performance of spatial analysis that was perceived by many was at least partly attributable to a limited capability to access, display, and analyze geographic data. During the last decade, geographic information systems have been instrumental not only in providing us with the capability to store and display information, but also in encouraging the provision of spatial datasets and the development of appropriate methods of quantitative analysis. Indeed, the GIS revolution has served to make us aware of the critical importance of spatial analysis. Geographic information systems do not realize their full potential without the ability to carry out methods of statistical and spatial analysis, and an appreciation of this dependence has helped to bring about a renaissance in the field.

Significant advances in quantitative geography have been made during the past decade, and geographers now have both the tools and the methods to make valuable contributions to fields as diverse as medicine, criminal justice, and the environment. These capabilities have been recognized by those in other fields, and geographers are now routinely called upon as members of interdisciplinary teams studying complex problems. Improvements in computer technology and computation have led quantitative geography in new directions. For example, the new field of geocomputation (see, e.g., Longley et al. 1998) lies at the intersection of computer science, geography, information science, mathematics, and statistics. The recent book by Fotheringham et al. (2000) also summarizes many of the new research frontiers in quantitative geography.

The purpose of this book is to provide undergraduate and beginning graduate students with the background and foundation that are necessary to be prepared for spatial analysis in this new era. I have deliberately adopted a fairly traditional approach to statistical analysis, along with several notable differences. First, I have attempted to condense much of the material found in the beginning of introductory texts on the subject. This has been done so that there is an opportunity to progress further in important areas such as regression analysis and the analysis of geographic patterns in one semester's time. Regression is by far the most common method used in geographic analysis, and it is unfortunate that it is often left to be covered hurriedly in the last week or two of a "Statistics in Geography" course.

The level of the material is aimed at upper-level undergraduate and beginning graduate students. I have attempted to structure the book so that it may be used as either a first-semester or a second-semester text. It may be used for a second-semester course by those students who already possess some background in introductory statistical concepts. The introductory material here would then serve as a review. However, the book is also meant to be fairly self-contained, and thus it should also be appropriate for those students learning about statistics in geography for the first time. First-semester students, after completing the introductory material in the first few chapters, will still be able to learn about the methods used most often by geographers by the end of a one-semester course; this is often not possible with many first-semester texts.

In writing this text, I had several goals. The first was to provide the basic material associated with the statistical methods most often used by geographers. Since a very large number of textbooks provide this basic information, I also sought to distinguish it in several ways. I have attempted to provide plenty of exercises. Some of these are to be done by hand (in the belief that it is always a good learning experience to carry out a few exercises by hand, despite what may sometimes be seen as drudgery!), and some require a computer. Although teaching the reader how to use computer software for statistical analysis is not one of the specific aims of this book, some guidance on the use of SPSS for Windows 9.0 is provided. It is important that students become familiar with some software that is capable of statistical analysis. An important skill is the ability to sift through output and pick out what is important from what is not. Different software will produce output in different forms, and it is also important to be able to pick out relevant information whatever the arrangement of output.

In addition, I have tried to give students some appreciation of the special issues and problems raised by the use of geographic data. Straightforward application of the standard methods ignores the special nature of spatial data, and can lead to misleading results. Topics such as spatial autocorrelation and the modifiable areal unit problem are introduced to provide a good awareness of these issues, their consequences, and potential solutions. Because a full treatment of these topics would require a higher level of mathematical sophistication, they are not covered fully, but pointers to other, more advanced work and to examples are provided.

Another objective has been to provide some examples of statistical analysis that appear in the recent literature in geography. This should help to make clear the relevance and timeliness of the methods. Finally, I have attempted to point out some of the limitations of a confirmatory statistical perspective, and have directed the student to some of the newer literature on exploratory spatial data analysis. Despite the popularity and importance of exploratory methods, inferential statistical methods remain absolutely essential in the assessment of hypotheses. This text aims to provide a background in these statistical methods and to illustrate the special nature of geographic data.

A Guggenheim Fellowship afforded me the opportunity to finish the manuscript during a sabbatical leave in England. I would like to thank Paul Longley for his careful reading of an earlier draft of the book. His excellent suggestions for revision have led to a better final result. Yifei Sun and Ge Lin also provided comments that were very helpful in revising earlier drafts. Art Getis, Stewart Fotheringham, Chris Brunsdon, Martin Charlton, and Ikuho Yamada suggested changes in particular sections, and I am grateful for their assistance. Emil Boasson and my daughter, Bethany Rogerson, assisted with the production of the figures. I am thankful for the thorough job carried out by Richard Cook of Keyword in editing the manuscript. Finally, I would like to thank Robert Rojek at Sage Publications for his encouragement and guidance.

1 Introduction to Statistical Analysis in Geography

1.1 Introduction

The study of geographic phenomena often requires the application of statistical methods to produce new insight. The following questions serve to illustrate the broad variety of areas in which statistical analysis has recently been applied to geographic problems:

(1) How do blood lead levels in children vary over space? Are the levels randomly scattered throughout the city, or are there discernible geographic patterns? How are any patterns related to the characteristics of both housing and occupants? (Griffith et al. 1998).

(2) Can the geographic diffusion of democracy that has occurred during the post-World War II era be described as a steady process over time, or has it occurred in waves, or have there been "bursts" of diffusion that have taken place during short time periods? (O'Loughlin et al. 1998).

(3) What are the effects of global warming on the geographic distribution of species? For example, how will the type and spatial distribution of tree species change in particular areas? (MacDonald et al. 1998).

(4) What are the effects of different marketing strategies on product performance? For example, are mass-marketing strategies effective, despite the more distant location of their markets? (Cornish 1997).

These studies all make use of statistical analysis to arrive at their conclusions. Methods of statistical analysis play a central role in the study of geographic problems: in a survey of articles that had a geographic focus, Slocum (1990) found that 53% made use of at least one mainstream quantitative method. The role of statistical analysis in geography may be placed within a broader context through its connection to the "scientific method," which provides a more general framework for the study of geographic problems.

1.2 The Scientific Method

Social scientists as well as physical scientists often make use of the scientific method in their attempts to learn about the world. Figure 1.1 illustrates this method, from the initial attempts to organize ideas about a subject to the building of a theory.

[Figure 1.1 The scientific method. Diagram labels: Concepts, Description, Hypothesis, Model, Laws, Theory; organize, surprise, formalize, validate]

Suppose that we are interested in describing and explaining the spatial pattern of cancer cases in a metropolitan area. We might begin by plotting recent incidences on a map. Such descriptive exercises often lead to an unexpected result; in Figure 1.2, we perceive two fairly distinct clusters of cases.

[Figure 1.2 Distribution of cancer cases]

The surprising results generated through the process of description naturally lead us to the next step on the route to explanation by forcing us to generate hypotheses about the underlying process. A "rigorous" definition of the term hypothesis is a proposition whose truth or falsity is capable of being tested. Though in the social sciences we do not always expect to come to firm conclusions in the form of "laws," we can also think of hypotheses as potential answers to our initial surprise. For example, one hypothesis in the present example is that the pattern of cancer cases is related to the distance from local power plants. To test the hypothesis, we need a model, which is a device for simplifying reality so that the relationship between variables may be more clearly studied.

Whereas a hypothesis might suggest a relationship between two variables, a model is more detailed, in the sense that it suggests the nature of the relationship between the variables. In our example, we might speculate that the likelihood of cancer declines as the distance from a power plant increases. To test this model, we could plot the cancer rate for each subarea against the distance from the subarea's centroid to a power plant. If we observe a downward sloping curve, we have gathered some support for our hypothesis (see Figure 1.3).

[Figure 1.3 Cancer rates versus distance from power plant. Vertical axis: cancer rate in subarea; horizontal axis: distance from power plant]

Models are validated by comparing observed data with what is expected. If the model is a good representation of reality, there will be a close match between the two. If observations and expectations are far apart, we need to "go back to the drawing board" and come up with a new hypothesis. It might be the case, for example, that the pattern in Figure 1.2 is due simply to the fact that the population itself is clustered. If this new hypothesis is true, or if there is evidence in favor of it, the spatial pattern of cancer then becomes understandable; a similar rate throughout the population generates apparent cancer clusters because of the spatial distribution of the population.

Though a model is often used to learn about a particular situation, more often one also wishes to learn about the underlying process that led to it. We would like to be able to generalize from one study to statements about other situations. One reason for studying the spatial pattern of cancer cases is to determine whether there is a relationship between cancer rates and the distance to specific power plants; a more general objective is to learn about the relationship between cancer rates and the distance to any power plant. One way of making such generalizations is to accumulate a lot of evidence. If we were to repeat our analysis in many locations throughout a country, and if our findings were similar in all cases, we would have uncovered an empirical generalization. In a strict sense, laws are sometimes defined as universal statements of unrestricted range. In our example, our generalization would not have unrestricted range, and we might want, for example, to confine our generalization or empirical law to power plants and cancer cases in a particular country.

Einstein called theories "free creations of the human mind." In the context of our diagram, we may think of theories as collections of generalizations or laws. The whole collection is greater than the sum of its parts in the sense that it gives greater insight than that produced by the generalizations or laws alone. If, for example, we generate other empirical laws that relate cancer rates to other factors, such as diet, we begin to build a theory of the spatial variation in cancer rates.

Statistical methods occupy a central role in the scientific method, as portrayed in Figure 1.1, because they allow us to suggest and test hypotheses using models. In the following section, we will review some of the important types of statistical approaches in geography.

1.3 Exploratory and Confirmatory Approaches in Geography

The scientific method provides us with a structured approach to answering questions of interest. At the core of the method is the desire to form and test hypotheses. As we have seen, hypotheses may be thought of loosely as potential answers to questions. For instance, a map of snowfall may suggest the hypothesis that the distance away from a nearby lake may play an important role in the distribution of snowfall amounts.

Geographers use spatial analysis within the context of the scientific method in at least two distinct ways. Exploratory methods of analysis are used to suggest hypotheses; confirmatory methods are, as the name suggests, used to help confirm hypotheses. A method of visualization or description that led to the discovery of clusters in Figure 1.2 would be an exploratory method, whereas a statistical method that confirmed that such an arrangement of points would have been unlikely to occur by chance would be a confirmatory method. In this book we will focus primarily upon confirmatory methods.

We should note here two important points. First, confirmatory methods do not always confirm or refute hypotheses: the world is too complicated a place, and the methods often have important limitations that prevent such confirmation and refutation. Nevertheless, they are important in structuring our thinking and in taking a rigorous and scientific approach to answering questions. Second, the use of exploratory methods over the past few years has been increasing rapidly. This has come about through a combination of the availability of large databases and sophisticated software (including GIS), and a recognition that confirmatory statistical methods are appropriate in some situations and not others. Throughout the book we will keep the reader aware of these points by pointing out some of the limitations of confirmatory analysis.


1.4 Descriptive and Inferential Methods

A key characteristic of geographic data that brings about the need for statistical analysis is that they may often be regarded as a sample from a larger population. Descriptive statistical analysis refers to the methods used to describe and summarize the characteristics of the sample, whereas inferential statistical analysis refers to the methods used to infer something about the population from the sample. Descriptive methods fall within the class of exploratory techniques; inferential statistics lie within the class of confirmatory methods.

1.4.1 Overview of Descriptive Analysis

Suppose that we wish to learn something about the commuting behavior of residents in a community. Perhaps we are on a committee that is investigating the potential implementation of a public transit alternative, and we need to know how many minutes, on average, it takes people to get to work by car. We do not have the resources to ask everyone, and so we decide to take a sample of automobile commuters. Let's say we survey n = 30 residents, asking them to record the average time it takes them to get to work. We receive the responses shown in panel (a) of Table 1.1. We begin our descriptive analysis by summarizing the information. The sample mean commuting time is simply the average of our observations; it is found by adding all of the individual responses and dividing by thirty.

Table 1.1 Commuting data

(a) Data on individuals

Individual no.   Commuting time (min.)   Individual no.   Commuting time (min.)
      1                    5                   16                   42
      2                   12                   17                   31
      3                   14                   18                   31
      4                   21                   19                   26
      5                   22                   20                   24
      6                   36                   21                   11
      7                   21                   22                   19
      8                    6                   23                    9
      9                   77                   24                   44
     10                   12                   25                   21
     11                   21                   26                   17
     12                   16                   27                   26
     13                   10                   28                   21
     14                    5                   29                   24
     15                   11                   30                   23

(b) Ranked commuting times

5, 5, 6, 9, 10, 11, 11, 12, 12, 14, 16, 17, 19, 21, 21, 21, 21, 21, 22, 23, 24, 24, 26, 26, 31, 31, 36, 42, 44, 77


The sample mean is traditionally denoted by x̄; in our example we have x̄ = 21.93 minutes. In practice, this could sensibly be rounded to 22 minutes. The median time is defined as the time that splits the ranked list of commuting times in half: half of all respondents have commutes that are longer than the median, and half have commutes that are shorter. When the number of observations is odd, the median is simply equal to the middle value on a list of the observations, ranked from shortest commute to longest commute. When the number of observations is even, as it is here, we take the median to be the average of the two values in the middle of the ranked list. When the responses are ranked as in panel (b) of Table 1.1, the two in the middle are 21 and 21. The median in this case is equal to 21 minutes. The mode is defined as the most frequently occurring value; here the mode is also 21 minutes, since it occurs more frequently (five times) than any other outcome.

We may also summarize the data by characterizing its variability. The data range from a low of five minutes to a high of 77 minutes. The range is the difference between the two values; here it is equal to 77 - 5 = 72 minutes. The interquartile range is the difference between the 25th and 75th percentiles. With n observations, the 25th percentile is represented by observation (n+1)/4, when the data have been ranked from lowest to highest. The 75th percentile is represented by observation 3(n+1)/4. These will often not be integers, and interpolation is used, just as it is for the median when there is an even number of observations. For the commuting data, the 25th percentile is represented by observation (30+1)/4 = 7.75. Interpolation between the 7th and 8th lowest observations requires that we go 3/4 of the way from the 7th lowest observation (which is 11) to the 8th lowest observation (which is 12). This implies that the 25th percentile is 11.75. Similarly, the 75th percentile is represented by observation 3(30+1)/4 = 23.25. Since both the 23rd and 24th observations are equal to 26, the 75th percentile is equal to 26. The interquartile range is the difference between these two percentiles, or 26 - 11.75 = 14.25.

The sample variance of the data (denoted s²) may be thought of as the average squared deviation of the observations from the mean. To ensure that the sample variance gives an unbiased estimate of the true, unknown variance of the population from which the sample was drawn (denoted σ²), s² is computed by taking the sum of the squared deviations, and then dividing by n - 1, instead of by n. Here the term unbiased implies that if we were to repeat this sampling many times, we would find that the average or mean of our many sample variances would be equal to the true variance. Thus the sample variance is found from

$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \qquad (1.1) $$

where the Greek letter Σ means that we are to sum the squared deviations of the observations from the mean (notation is discussed in more detail in Chapter 2).
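The summary measures described so far can be checked with a short script. The following is a minimal sketch in Python (not the SPSS procedure the book uses); the helper `percentile_by_rank` is a hypothetical name introduced here to mirror the (n+1)/4 and 3(n+1)/4 ranking rule with linear interpolation, and the data are the 30 commuting times from Table 1.1.

```python
from collections import Counter
from statistics import mean, median, stdev, variance

# Commuting times (minutes) for the 30 respondents in Table 1.1
times = [5, 12, 14, 21, 22, 36, 21, 6, 77, 12, 21, 16, 10, 5, 11,
         42, 31, 31, 26, 24, 11, 19, 9, 44, 21, 17, 26, 21, 24, 23]


def percentile_by_rank(values, p):
    """Percentile located at rank (n + 1) * p, with linear interpolation.

    Hypothetical helper written for this illustration; other software may
    use a different percentile convention and give slightly different values.
    """
    ranked = sorted(values)
    position = (len(ranked) + 1) * p   # 1-based, possibly fractional, rank
    lower = int(position)              # rank of the observation just below
    fraction = position - lower
    if lower < 1:
        return ranked[0]
    if lower >= len(ranked):
        return ranked[-1]
    return ranked[lower - 1] + fraction * (ranked[lower] - ranked[lower - 1])


sample_mean = mean(times)
sample_median = median(times)
mode_value, mode_count = Counter(times).most_common(1)[0]
data_range = max(times) - min(times)
q1 = percentile_by_rank(times, 0.25)
q3 = percentile_by_rank(times, 0.75)
sample_variance = variance(times)  # divides by n - 1, as in Equation (1.1)
sample_sd = stdev(times)           # square root of the sample variance

print(f"mean = {sample_mean:.2f}, median = {sample_median}, "
      f"mode = {mode_value} (occurs {mode_count} times)")
print(f"range = {data_range}, 25th pct = {q1}, 75th pct = {q3}, IQR = {q3 - q1}")
print(f"variance = {sample_variance:.2f}, standard deviation = {sample_sd:.2f}")
```

Using the standard library's `variance` and `stdev` keeps the n - 1 divisor implicit; `statistics.pvariance` would divide by n instead and give a slightly smaller value.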


In our example, s² = 208.13. The sample standard deviation is equal to the square root of the sample variance; here we have s = √208.13 = 14.43. Since the sample variance characterizes the average squared deviation from the mean, by taking the square root and using the standard deviation, we are putting the measure of variability back on a scale closer to that used for the mean and the original data. It is not quite correct to say that the standard deviation is the average absolute deviation of an observation from the mean, but it is close to being correct.

Since data come from distributions with different means and different degrees of variability, it is common to standardize observations. One way to do this is to transform each observation into a z-score by first subtracting the mean of all observations and then dividing the result by the standard deviation:

$$ z = \frac{x - \bar{x}}{s} \qquad (1.2) $$

z-scores may be interpreted as the number of standard deviations an observation is away from the mean. For example, the z-score for individual 1 is (5 - 21.93)/14.43 = -1.17. This individual has a commuting time that is 1.17 standard deviations below the mean.

We may also summarize our data by constructing histograms, which are vertical bar graphs. To construct a histogram, the data are first grouped into categories. The histogram contains one vertical bar for each category. The height of the bar represents the number of observations in the category (i.e., the frequency), and it is common to note the midpoint of the category on the horizontal axis. Figure 1.4 is a histogram for the commuting data, produced by SPSS for Windows 9.0.

[Figure 1.4 Histogram for commuting data]

Skewness measures the degree of asymmetry exhibited by the data. Figure 1.4 reveals that there are more observations below the mean than above it; this is known as positive skewness. Positive skewness can also be detected by comparing the mean and median. When the mean is greater than the median, as it is here, the distribution is positively skewed. In contrast, when there are a small number of low observations and a large number of high ones, the data exhibit negative skewness. Skewness is computed by first adding together the cubed deviations from the mean and then dividing by the product of the cubed standard deviation and the number of observations:

$$ \text{skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3} \qquad (1.3) $$

The 30 commuting times have a positive skewness of 2.06. If skewness equals zero, the histogram is symmetric about the mean. Kurtosis measures how peaked the histogram is. Its definition is similar to that for skewness, with the exception that the fourth power is used instead of the third:

$$ \text{kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n s^4} \qquad (1.4) $$
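Because Equations (1.3) and (1.4) are simply averages of the cubed and fourth-power z-scores of Equation (1.2), all three can be computed together. The sketch below is an illustration in Python, not the SPSS procedure used in the book. Evaluating the equations as printed on the Table 1.1 data appears to give roughly 1.86 and 7.68, whereas the values quoted in this section (2.06 and 6.43) seem to match the small-sample adjusted estimators that statistical packages commonly report, with kurtosis expressed as an excess over 3. That reading is an inference from the numbers rather than something the text states, and the adjusted formulas in the code are included on that assumption.

```python
from statistics import mean, stdev

# Commuting times (minutes) for the 30 respondents in Table 1.1
times = [5, 12, 14, 21, 22, 36, 21, 6, 77, 12, 21, 16, 10, 5, 11,
         42, 31, 31, 26, 24, 11, 19, 9, 44, 21, 17, 26, 21, 24, 23]

n = len(times)
x_bar = mean(times)
s = stdev(times)  # sample standard deviation (n - 1 divisor)

# Equation (1.2): standardize every observation
z_scores = [(x - x_bar) / s for x in times]

sum_z3 = sum(z ** 3 for z in z_scores)
sum_z4 = sum(z ** 4 for z in z_scores)

# Equations (1.3) and (1.4) exactly as printed
skewness = sum_z3 / n
kurtosis = sum_z4 / n

# ASSUMPTION: small-sample adjusted estimators of the kind reported by
# statistical packages; not given in the text, shown only for comparison.
adj_skewness = n / ((n - 1) * (n - 2)) * sum_z3
adj_excess_kurtosis = (
    n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum_z4
    - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
)

print(f"z-score for individual 1: {z_scores[0]:.2f}")
print(f"skewness (Eq. 1.3): {skewness:.2f}, adjusted: {adj_skewness:.2f}")
print(f"kurtosis (Eq. 1.4): {kurtosis:.2f}, adjusted excess: {adj_excess_kurtosis:.2f}")
```

Either convention leads to the same qualitative conclusion for these data: the distribution is positively skewed and strongly peaked, driven largely by the single 77-minute commute.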

Data with a high degree of peakedness are said to be leptokurtic, and have values of kurtosis over 3.0. Flat histograms are platykurtic, and have kurtosis values less than 3.0. The kurtosis of the commuting times is equal to 6.43, and hence the distribution is relatively peaked. Data may also be summarized via box plots. Figure 1.5 depicts a box plot for the commuting data. The horizontal line running through the rectangle denotes

Figure 1.5 Boxplot for commuting data


the median (21), and the lower and upper ends of the rectangle (sometimes called the "hinges") represent the 25th and 75th percentiles, respectively. Velleman and Hoaglin (1981) note that there are two common ways to draw the "whiskers" which extend upward and downward from the hinges. One way is to send the whiskers out to the minimum and maximum values. In this case, the boxplot represents a graphical summary of what is sometimes called a "five-number summary" of the distribution (the minimum, maximum, 25th and 75th percentiles, and the median). There are often extreme outliers in the data that are far from the mean, and in this case it is not preferable to send whiskers out to these extreme values. Instead, whiskers are sent out to the outermost observations that are still within 1.5 times the interquartile range of the hinge. All other observations beyond this are considered outliers, and are shown individually. In the commuting data, 1.5 times the interquartile range is equal to 1.5(14.25) = 21.375. The whisker extending downward from the lower hinge extends to the minimum value of 5, since this is greater than the lower hinge (11.75) minus 21.375. The whisker extending upward from the upper hinge stops at 44, which is the highest observation less than 47.375 (which in turn is equal to the upper hinge (26) plus 21.375). Note that there is a single outlier (observation 9), which has a value of 77 minutes.

A stem-and-leaf plot is an alternative way to show how common observations are. It is similar to a histogram tilted onto its side, with the actual digits of each observation's value used in place of bars. The leading digits constitute the "stem," and the trailing digits make up the "leaf." Each stem has one or more leaves, with each leaf corresponding to an observation. The visual depiction of the frequency of leaves conveys to the reader an impression of the frequency of observations that fall within given ranges. John Tukey, the designer of the stem-and-leaf plot, has said "If we are going to make a mark, it may as well be a meaningful one. The simplest – and most useful – meaningful mark is a digit." (Tukey 1972, p. 269). For the commuting data, which have at most two-digit values, the first digit is the "stem," and the second is the "leaf" (see Figure 1.6).
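The whisker rule and the stem-and-leaf construction described above can be sketched as follows. The function names and the list `times` are hypothetical; only the hinges (11.75 and 26) and the values 5, 44, and 77 are taken from the text, and the remaining values are made up for illustration.

```python
from collections import defaultdict


def whiskers_and_outliers(values, lower_hinge, upper_hinge):
    """Apply the 1.5 x interquartile-range rule described in the text."""
    step = 1.5 * (upper_hinge - lower_hinge)
    low_fence, high_fence = lower_hinge - step, upper_hinge + step
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    # Whiskers stop at the outermost observations still inside the fences.
    return min(inside), max(inside), outliers


def stem_and_leaf(values):
    """Map each tens digit (stem) to the sorted units digits (leaves)."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)
    return dict(plot)


# Hypothetical commuting times; not the full data set of Table 1.1.
times = [5, 10, 12, 18, 21, 23, 26, 31, 44, 77]

low, high, outliers = whiskers_and_outliers(times, 11.75, 26)
print(low, high, outliers)  # 5 44 [77]

for stem, leaves in stem_and_leaf(times).items():
    print(stem, "|", "".join(str(d) for d in leaves))
```

With the hinges quoted in the text, the fences fall at 11.75 - 21.375 and 26 + 21.375 = 47.375, so 77 is flagged as the only outlier.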

1.4.2 Overview of Inferential Analysis

Since we did not interview everyone, we do not know the true mean commuting time (which we denote $\mu$) that characterizes the entire community. (Note that we use regular, Roman letters to indicate sample means and variances, and that we use Greek letters to represent the corresponding, unknown population values. This is a common notational convention that we will use throughout.) We have an estimate of the true mean from our sample mean, but it is also desirable to make some sort of inferential statement about $\mu$ that quantifies our uncertainty regarding the true mean. Clearly we would be less uncertain about the true mean if we had taken a larger


Figure 1.6 Stem-and-leaf plot for commuting data

sample, and we would also be less uncertain about the true mean if we knew there was less variability in the population values (that is, if $\sigma^2$ were lower). Although we don't know the "true" variance of commuting times ($\sigma^2$), we do have an estimate of it ($s^2$). In the next chapter, we will learn how to make inferences about the population mean from the sample mean. In particular we will learn how to test hypotheses regarding the mean (e.g., could the "true" commuting time in our population be equal to 30 minutes?), and we will also learn how to place confidence limits around the mean to make statements such as "we are 95% confident that the true mean lies within 3.5 minutes of the observed mean."

To illustrate some common inferential questions using another example, suppose you are handed a coin, and you are asked to determine whether it is a "fair" one (that is, the likelihood of a "head" is the same as the likelihood of a "tail"). One natural way to gather some information would be to flip the coin a number of times. Suppose you flip the coin ten times, and you observe heads eight times. An example of a descriptive statistic is the observed proportion of heads, in this case 8/10 = 0.8. We enter the realm of inferential statistics when we attempt to pass judgement on whether the coin is "fair". We plan to do this by inferring whether the coin is fair, on the basis of our sample results. Eight heads is more than the four, five, or six that might have made us more comfortable in a declaration that the coin is fair, but is eight heads really enough to say that the coin is not a fair one?

There are at least two ways to go about answering the question of whether the coin is a fair one. One is to ask what would happen if the coin were fair, and to simulate a series of experiments identical to the one just carried out. That is, if we could repeatedly flip a known fair coin ten times, each time recording the number of heads, we would learn just how unusual a total of eight heads actually was. If eight heads comes up quite frequently with the fair coin, we will judge our original coin to be fair. On the other hand, if eight heads is an


extremely rare event for a fair coin, we will conclude that our original coin is not fair. To pursue this idea, suppose you arrange to carry out such an experiment 100 times. For example, one might have 100 students in a large class each flip a coin that is known to be fair ten times. Upon pooling together the results, suppose you find the results shown in Table 1.2. We see that eight heads occurred 8% of the time. We still need a guideline to tell us whether our observed outcome of eight heads should lead us to the conclusion that the coin is (or is not) fair. The usual guideline is to ask how likely a result equal to or more extreme than the observed one is, if our initial, baseline hypothesis that we possess a fair coin (called the null hypothesis) is true. A common practice is to accept the null hypothesis if the likelihood of a result at least as extreme as the one we observed is more than 5%. Hence we would accept the null hypothesis of a fair coin if our experiment showed that eight or more heads was not uncommon and in fact tended to occur more than 5% of the time. Alternatively, we wish to reject the null hypothesis that our original coin is a fair one if the results of our experiment indicate that eight or more heads out of ten is an uncommon event for fair coins. If fair coins give rise to eight or more heads less than 5% of the time, we decide to reject the null hypothesis and conclude that our coin is not fair. In the example above, eight or more heads occurred 12 times out of 100, when a fair coin was flipped ten times. The fact that events as extreme as, or more extreme than the one we observed will happen 12% of the time with a fair coin leads us to accept the inference that our original coin is a fair one. Had we observed nine heads with our original coin, we would have judged it to be unfair, since events as rare or more rare than this (namely where the number of heads is equal to 9 or 10) occurred only four times in the one hundred trials of a fair coin.
Note, too, that our observed result does not prove that the coin is unbiased. It still could be unfair; there is, however, insufficient evidence to support the allegation.

Table 1.2 Hypothetical outcome of 100 experiments of ten coin tosses each

No. of heads    Frequency of occurrence
0               0
1               1
2               4
3               8
4               15
5               22
6               30
7               8
8               8
9               3
10              1


The approach just described is an example of the Monte Carlo method, and several examples of its use are given in Chapter 8. A second way to answer the inferential problem is to make use of the fact that this is a binomial experiment; in Chapter 2 we will learn how to use this approach.
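Both routes to the answer can be sketched in Python. The block below tallies the hypothetical frequencies of Table 1.2, repeats the ten-flip experiment by simulation, and computes the corresponding binomial tail probability. The function names, the seed, and the choice of 20,000 simulated experiments are assumptions made for illustration, not part of the text.

```python
import random
from math import comb


def simulate_head_counts(n_experiments, n_flips=10, seed=0):
    """Number of heads in each of n_experiments runs of n_flips fair tosses."""
    rng = random.Random(seed)
    return [sum(rng.random() < 0.5 for _ in range(n_flips))
            for _ in range(n_experiments)]


def share_at_least(head_counts, threshold):
    """Share of experiments with at least `threshold` heads."""
    return sum(c >= threshold for c in head_counts) / len(head_counts)


def exact_tail(threshold, n_flips=10, p=0.5):
    """Binomial probability of at least `threshold` heads in n_flips tosses."""
    return sum(comb(n_flips, k) * p ** k * (1 - p) ** (n_flips - k)
               for k in range(threshold, n_flips + 1))


# Frequencies from Table 1.2; the list index is the number of heads.
table_1_2 = [0, 1, 4, 8, 15, 22, 30, 8, 8, 3, 1]
print(sum(table_1_2[8:]) / sum(table_1_2))  # 0.12, as in the text

# Monte Carlo estimate from many simulated experiments, and the exact value.
print(share_at_least(simulate_head_counts(20000), 8))
print(exact_tail(8))  # 56/1024 = 0.0546875
```

The simulated share varies with the seed and the number of experiments; with only 100 experiments, as in Table 1.2, considerable sampling fluctuation around the exact value is to be expected.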

1.5 The Nature of Statistical Thinking

The American Statistical Association (1993, cited in Mallows 1998) notes that statistical thinking is (a) the appreciation of uncertainty and data variability, and their impact on decision making; and (b) the use of the scientific method in approaching issues and problems. Mallows (1998), in his Presidential Address to the American Statistical Association, argues that statistical thinking is not simply common sense, nor is it simply the scientific method. Rather, he suggests that statisticians give more attention to questions that arise in the beginning of the study of a problem or issue. In particular, Mallows argues that statisticians should (a) consider what data are relevant to the problem, (b) consider how relevant data can be obtained, (c) explain the basis for all assumptions, (d) lay out the arguments on all sides of the issue, and only then (e) formulate questions that can be addressed by statistical methods. He feels that too often statisticians rely too heavily on (e), as well as on the actual use of the methods that follow. His ideas serve to remind us that statistical analysis is a comprehensive exercise; it does not consist of simply "plugging numbers into a formula" and reporting a result. Instead, it requires a comprehensive assessment of questions, alternative perspectives, data, assumptions, analysis, and interpretation. Mallows defines statistical thinking as that which "concerns the relation of quantitative data to a real-world problem, often in the presence of uncertainty and variability. It attempts to make precise and explicit what the data has to say about the problem of interest."

Throughout the remainder of this book, we will learn how various methods are used and implemented, but we will also learn how to interpret the results and understand their limitations. Too often students working on geographic problems have only a sense that they "need statistics," and their response is to seek out an expert on statistics for advice on how to get started. The statistician's first reply should be in the form of questions: (1) What is the problem? (2) What data do you have, and what are its limitations? (3) Is statistical analysis relevant, or is some other method of analysis more appropriate? It is important for the student to think first about these questions. Perhaps simple description will suffice to achieve the objective. Perhaps some sophisticated inferential analysis will be necessary. But the subsequent course of events


should be driven by the substantive problems and questions of interest, as constrained by data availability and quality. It should not be driven by a feeling that one needs to use statistical analysis simply for the sake of doing so.

1.6 Some Special Considerations with Spatial Data

Fotheringham and Rogerson (1993) categorize and discuss a number of general issues and characteristics associated with problems in spatial analysis. It is essential that those working with spatial data have an awareness of these issues. Although all of their categories are relevant to spatial statistical analysis, among those that are most pertinent are:

(a) the modifiable areal unit problem;
(b) boundary problems;
(c) spatial sampling procedures;
(d) spatial autocorrelation.

1.6.1 Modifiable Areal Unit Problem

The modifiable areal unit problem refers to the fact that results of statistical analyses are sensitive to the zoning system used to report aggregated data. Many spatial datasets are aggregated into zones, and the nature of the zonal configuration can influence interpretation quite strongly. Panel (a) of Figure 1.7 shows one zoning system and panel (b) another. The arrows represent migration flows. In panel (a) no interzonal migration is reported, whereas an interpretation of panel (b) would lead to the conclusion that there was a strong southward movement. More generally, many of the statistical tools described in the following chapters would produce different results had different zoning systems been in effect.

The modifiable areal unit problem has two different aspects that should be appreciated. The first is related to the placement of zonal boundaries, for zones or subregions of a given size. If we were measuring mobility rates, we could overlay a grid of square cells on the study area. There are many different ways that the grid could be placed, rotated, and oriented on the study area. The second aspect has to do with geographic scale. If we were to replace the grid with another grid of larger square cells, the results of the analysis would be different. Migrants, for example, are less likely to cross cells in the larger grid than they are in the smaller grid. As Fotheringham and Rogerson (1993) note, GIS technology now facilitates the analysis of data using alternative zoning systems, and it should become more routine to examine the sensitivity of results to modifiable areal units.


Figure 1.7 Two alternative zoning systems, (a) and (b), for migration data. Arrows show origins and destinations of migrants
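The dependence of reported flows on the zoning system can be illustrated with a small sketch. Everything here is hypothetical: the unit-square study area, the two zoning functions, and the four southward moves are made up to mimic the situation described for Figure 1.7, not read from it.

```python
def zone_by_column(point):
    """Zoning system A (hypothetical): west and east halves, split at x = 0.5."""
    x, _ = point
    return "west" if x < 0.5 else "east"


def zone_by_row(point):
    """Zoning system B (hypothetical): north and south halves, split at y = 0.5."""
    _, y = point
    return "north" if y >= 0.5 else "south"


def interzonal_moves(moves, zone_of):
    """Count moves whose origin and destination fall in different zones."""
    return sum(zone_of(origin) != zone_of(destination)
               for origin, destination in moves)


# Hypothetical short southward moves, each given as (origin, destination).
moves = [
    ((0.2, 0.60), (0.2, 0.40)),
    ((0.3, 0.55), (0.3, 0.45)),
    ((0.7, 0.60), (0.7, 0.40)),
    ((0.8, 0.52), (0.8, 0.48)),
]

print(interzonal_moves(moves, zone_by_column))  # 0: no interzonal migration
print(interzonal_moves(moves, zone_by_row))     # 4: every move crosses a boundary
```

The same four moves are reported as no interzonal migration under one zoning system and as four southward interzonal moves under the other.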

1.6.2 Boundary Problems

Study areas are bounded, and it is important to recognize that events just outside the study area can affect those inside it. If we are investigating the market areas of shopping malls in a county, it would be a mistake to neglect the influence of a large mall located just outside the county boundary. One solution is to create a buffer zone around the area of study to include features that affect analysis within the primary area of interest. An example of the use of buffer zones in point pattern analysis is given in Chapter 8.

Both the size and shape of areas can affect measurement and interpretation. There are many migrants leaving Rhode Island each year, but this is partially due to the state's small size: almost any move will be a move out of the state! Similarly, Tennessee experiences more out-migration than other states with the same land area, in part because of its narrow rectangular shape. This is because individuals in Tennessee live, on average, closer to the border than do individuals in other states with the same area. A move of given length in some random direction is therefore more likely to take the Tennessean outside of the state.
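The effect of size and shape on out-migration can be explored with a small simulation. This is a sketch under strong assumptions that are not in the text: each "state" is a plain rectangle, residents are spread uniformly within it, and every move has the same length in a uniformly random direction. The function name, dimensions, seed, and number of moves are all hypothetical.

```python
import math
import random


def exit_share(width, height, move_length, n_moves=20000, seed=0):
    """Share of random fixed-length moves that end outside a width x height rectangle."""
    rng = random.Random(seed)
    exits = 0
    for _ in range(n_moves):
        # Random origin inside the rectangle and a random direction of travel.
        x, y = rng.uniform(0, width), rng.uniform(0, height)
        angle = rng.uniform(0, 2 * math.pi)
        x_new = x + move_length * math.cos(angle)
        y_new = y + move_length * math.sin(angle)
        if not (0 <= x_new <= width and 0 <= y_new <= height):
            exits += 1
    return exits / n_moves


# Same area (1.0), different shape: a square versus a narrow rectangle.
square = exit_share(1.0, 1.0, move_length=0.2)
narrow = exit_share(4.0, 0.25, move_length=0.2)

# Same shape, smaller area: a square with one quarter of the area.
small = exit_share(0.5, 0.5, move_length=0.2)

print(square, narrow, small)
```

Under these assumptions the narrow rectangle and the small square both lose a larger share of movers than the unit square, which is the pattern the text describes for Tennessee and Rhode Island.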

1.6.3 Spatial Sampling Procedures

Statistical analysis is based upon sample data. Usually one assumes that sample observations are taken randomly from some larger population of


interest. If we are interested in sampling point locations to collect data on vegetation or soil, for example, there are many ways to do this. One could choose x- and y-coordinates randomly; this is known as a simple random sample. Another alternative would be to choose a stratified spatial sample, making sure that we chose a predetermined number of observations from each of several subregions, with simple random sampling within subregions. Alternative methods of sampling are discussed in more detail in Section 3.7.
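The two sampling designs just described can be sketched as follows. The rectangular study area, the grid of equal-sized subregions, the function names, and the seed are assumptions made for illustration; real study areas and strata are rarely this regular.

```python
import random


def simple_random_sample(n_points, width, height, rng):
    """Draw x- and y-coordinates independently and uniformly over the study area."""
    return [(rng.uniform(0, width), rng.uniform(0, height))
            for _ in range(n_points)]


def stratified_sample(per_cell, n_cols, n_rows, width, height, rng):
    """Simple random sampling of `per_cell` points within each grid-cell subregion."""
    cell_w, cell_h = width / n_cols, height / n_rows
    points = []
    for col in range(n_cols):
        for row in range(n_rows):
            for _ in range(per_cell):
                points.append((col * cell_w + rng.uniform(0, cell_w),
                               row * cell_h + rng.uniform(0, cell_h)))
    return points


rng = random.Random(42)

# Twelve points in a hypothetical 100 x 100 study area under each design.
srs = simple_random_sample(12, 100.0, 100.0, rng)
strat = stratified_sample(3, 2, 2, 100.0, 100.0, rng)

print(len(srs), len(strat))  # 12 12
```

The stratified design guarantees three points in each of the four subregions, whereas the simple random sample may by chance leave a subregion with few or no points.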

1.6.4 Spatial Autocorrelation

Spatial autocorrelation refers to the fact that the value of a variable at one point in space is related to the value of that same variable in a nearby location. The travel behavior of residents in a household is likely to be related to the travel behavior of residents in nearby households, because both households have similar accessibility to other locations. Hence observations of the two households are not likely to be independent, despite the requirement of statistical independence for standard statistical analysis. Spatial autocorrelation can therefore have serious effects on statistical analyses, and hence lead to misinterpretation. It is treated in more detail in Chapter 8.

1.7 Descriptive Statistics in SPSS for Windows 9.0

1.7.1 Data Input

After starting SPSS, data are input for the variable or variables of interest. Each column represents a variable. For the commuting example set out in Table 1.1, the thirty observations were entered into the first column of the spreadsheet. Alternatively, respondent ID could have been entered into the first column (i.e., the sequence of integers, from 1 to 30), and the commuting times would then have been entered in the second column. The order in which the data are entered into a column is unimportant.

1.7.2 Descriptive Analysis

Simple descriptive statistics. Once the data are entered, click on Analyze (or Statistics, in older versions of SPSS for Windows). Then click on Descriptive Statistics. Then click on Explore. A split box will appear on the screen; move the variable or variables of interest from the left box to the box on the right that is headed "Dependent List" by highlighting the variable(s) and clicking on the arrow. Then click on OK.


Table 1.3 SPSS output for data of Table 1.1

Other options. Options for producing other related statistics and graphs are available. To produce a histogram for instance, before clicking OK above, click on Plots, and you can then check a box to produce a histogram. Then click on Continue and OK. Results. Table 1.3 displays results of the output. In addition to this table, boxplots (Figure 1.5), stem and leaf displays (Figure 1.6) and, optionally, histograms (Figure 1.4) are also produced.

Exercises

1. The 236 values that appear below are the 1990 median household incomes (in dollars) for the 236 census tracts of Buffalo, New York.
(a) For the first 19 tracts, find the mean, median, range, interquartile range, standard deviation, variance, skewness, and kurtosis using only a calculator (though you may want to check your results using a statistical software package). In addition, construct a stem-and-leaf plot, a box plot, and a histogram for these 19 observations.
(b) Use a statistical software package to repeat part (a), this time using all 236 observations.
(c) Comment on your results. In particular, what does it mean to find the mean of a set of medians? How do the observations that have a value of 0 affect the results? Should they be included? How might the results differ if a different geographic scale were chosen?

22342, 19919, 8187, 15875, 17994, 30765, 31347, 27282, 29310, 23720, 22033, 11706, 15625, 6173, 15694, 7924, 10433, 13274, 17803, 20583, 21897, 14531, 19048, 19850, 19734, 18205, 13984, 8738, 10299, 10678, 8685, 13455,


14821, 23231, 26931, 15608, 17096, 10232, 36229, 18374, 43229, 27292, 25612, 29063, 39083, 41957, 30179, 34186, 34281, 16250, 37500, 22083,

23722, 8740, 12325, 10717, 21447, 11250, 16016, 11509, 11395, 19721, 21293, 24375, 19510, 14926, 22490, 21383, 25060, 22664, 8671, 31566, 0, 24965, 34656, 24493, 21764, 25843, 32708, 22188, 19909, 33675, 15857, 18649, 21880, 17250, 16569, 14991, 0, 8643, 22801, 39708, 20647, 30712, 19304, 24116, 17500, 19106, 17517, 12525, 13936, 7495, 6891, 16888, 42274, 43033, 43500, 22257, 22931, 31918, 29072, 31948, 33860, 32586, 32606, 31453, 32939, 30072, 32185, 35664, 27578, 23861, 26563, 30726, 33614, 30373, 28347, 37786, 48987, 56318, 49641, 85742, 53116, 44335, 30184, 36744, 39698, 0, 21987, 66358, 46587, 26934, 31558, 36944, 43750, 49408, 37354, 31010, 35709, 32913, 25594, 28980, 28800, 28634, 18958, 26515, 24779, 21667, 24660, 29375, 30996, 45645, 39312, 34287, 35533, 27647, 24342, 22402, 28967, 28649, 23881, 31071, 27412, 27943, 34500, 19792, 41447, 35833, 14333, 12778, 20000, 19656, 22302, 33475, 26580, 0, 24588, 31496, 33694, 36193, 41921, 35819, 39304, 38844, 37443, 47873, 41410, 36798, 38508, 38382, 37029, 48472, 38837, 40548, 35165, 39404, 24615, 34904, 21964, 42617, 58682, 41875, 40370, 24511, 31008, 29600, 38205, 35536, 35386, 36250, 31341, 33790, 31987, 42113, 33841, 37877, 35650, 28556, 27048, 27736, 30269, 32699, 28988, 27446, 76306, 19333

2. Ten migration distances corresponding to the distances moved by recent migrants are observed (in miles): 43, 6, 7, 11, 122, 41, 21, 17, 1, 3. Find the mean and standard deviation, and then convert all observations into z-scores.

3. The probability of commuting by train in a community is 0.1. A survey of residents in a particular neighborhood finds that four out of ten commute by train. We wish to conclude either that (a) the "true" commuting rate in the neighborhood is 0.1, and we have just witnessed four out of ten as a result of sampling fluctuation, or (b) the "true" commuting rate in the neighborhood is greater than 0.1, and it is very unlikely that we would have observed four out of ten train commuters if the true rate was 0.1. Decide which choice is best via the following steps, using the random number table in Table A.1 of Appendix A:
(1) Take a series of ten random digits, and then count and record the number of "0"s; these will represent the number of train commuters in a sample of ten, where the "true" commuting probability is 0.1.
(2) Repeat step 1 twenty times.
(3) Arrive at either conclusion (a) or (b). You should arrive at conclusion (b) if you had four or more commuters either once, or not at all, in the twenty repetitions (since one out of twenty is equal to 0.05, or 5%).

2

Probability and Probability Models

LEARNING OBJECTIVES

• Review of mathematical notation and ordering of mathematical operations
• Introduction to probability concepts, including (a) sample spaces as potential outcomes of experiments, (b) assignment of probabilities to individual outcomes
• Binomial and normal distributions
• Confidence intervals for the sample mean
• Examples of applications based upon simple probability models

In Chapter 1, we had our first glimpse into some of the concepts that are used both to describe sample data and to make inferences. In this chapter, we will build upon these concepts. After reviewing mathematical conventions and notation in the beginning of the chapter, we will explore some of the basic concepts of probability, which form the basis for statistical inference.

2.1 Mathematical Conventions and Notation

The amount of mathematical notation used in this book is actually quite small, but, nevertheless, it is useful to review some basic notation and mathematical conventions.

2.1.1 Mathematical Conventions

By the term "mathematical conventions" we are not referring here to the gatherings of mathematicians at conferences, but rather to the standards that are used in the writing and use of mathematical material. The primary conventions we are concerned with are those regarding parentheses and the ordering of mathematical operations. In a mathematical expression, one performs operations in the following order, arranged from operations performed first to those performed last:

(1) Factorials (the factorial of an integer m is the product of the integers from 1 to m, and is further defined below).
(2) Powers and roots.
(3) Multiplication and division.
(4) Addition and subtraction.

Thus the expression

$$3 + 10/5^2 \tag{2.1}$$

is evaluated by first squaring 5, then finding 10/25 = 0.4, and then adding 3 to find the result of 3.4. One does not simply go from left to right; if you did, you would incorrectly add 10 to 3, then divide by 5 to get 2.6, and then square 2.6 for a final (incorrect) answer of 6.76. If there is more than one operation in any of the four categories above, one carries out those particular operations from left to right. Thus, to evaluate

$$3 + 10/5 + 6 \times 7 \tag{2.2}$$

one would do the division first and the multiplication second, yielding

$$3 + 2 + 42 = 47 \tag{2.3}$$

Although it would be unusual to see it written this way,

$$6/3/3 \tag{2.4}$$

is equal to 2/3, since 6/3 would be carried out first. Operations within parentheses are always performed before those that are not within parentheses, and those within nested parentheses are dealt with by performing the operations within the innermost set of parentheses first. So, for example,

$$3\left((5 + 3)^2/2\right) + 4 = 3\left(8^2/2\right) + 4 = 3(32) + 4 = 100 \tag{2.5}$$
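The examples above can be checked in any language that follows the standard order of operations; Python is one. The sketch below also includes a small helper, `left_to_right`, that deliberately ignores precedence in order to reproduce the incorrect answer of 6.76. The helper and its token-list input format are hypothetical and exist only for this comparison.

```python
import operator

# Binary operations available to the naive evaluator below.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
    "**": operator.pow,
}


def left_to_right(tokens):
    """Evaluate [number, op, number, op, ...] strictly left to right, ignoring precedence."""
    result = tokens[0]
    for i in range(1, len(tokens), 2):
        result = OPS[tokens[i]](result, tokens[i + 1])
    return result


# Python applies powers first, then multiplication and division, then addition.
print(3 + 10 / 5 ** 2)             # expression (2.1): approximately 3.4
print(3 + 10 / 5 + 6 * 7)          # expression (2.2): 47.0
print(6 / 3 / 3)                   # expression (2.4): 2/3, evaluated left to right
print(3 * ((5 + 3) ** 2 / 2) + 4)  # expression (2.5): 100.0

# Ignoring precedence gives the incorrect answer described in the text.
print(left_to_right([3, "+", 10, "/", 5, "**", 2]))  # approximately 6.76
```

Floating-point arithmetic may print values such as 3.4 or 6.76 with a tiny rounding error in the last digit.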

Although these basic principles are taught before the high-school years, it is not uncommon to need a little review! It is important to realize too that it is not just students of statistics that need brushing up; software developers and decision-makers sometimes do not abide by these conventions. For example, new variables that are created within the geographic information system (GIS) ArcView 3.1 are created by simply carrying out operations from left to right. Although parentheses are recognized, the fundamental order of operations, as outlined above, is not! This leads to visions of planners and others all over the world making decisions based upon inaccurate information! Suppose we have data on the proportion of people commuting by train (variable 1), the number of people who commute by bus (variable 2), and the total number of commuters (variable 3) for a number of census tracts in our database. Thinking that ArcView will surely use the standard order of mathematical operations, we compute a new variable reflecting the proportion of people who commute by bus or train (variable 4) via

$$\text{Var. 4} = \text{Var. 1} + \text{Var. 2}/\text{Var. 3} \tag{2.6}$$

ArcView will provide us with a column of answers where

$$\text{Var. 4} = (\text{Var. 1} + \text{Var. 2})/\text{Var. 3} \tag{2.7}$$

when in fact what we wanted was

$$\text{Var. 4} = \text{Var. 1} + (\text{Var. 2}/\text{Var. 3}) \tag{2.8}$$

One way of ensuring that problems like this do not arise is to use extra sets of parentheses, as in the last equation (and, in fact, to obtain the desired variable within ArcView, they must be used).

2.1.2 Mathematical Notation

The mathematical notation used most often in this book is the summation notation. The Greek letter $\Sigma$ is used as a shorthand way of indicating that a sum is to be taken. For example,

$$\sum_{i=1}^{i=n} x_i \tag{2.9}$$

denotes that the sum of n observations is to be taken; the expression is equivalent to

$$x_1 + x_2 + \cdots + x_n \tag{2.10}$$

The "i = 1" under the symbol refers to where the sum of terms begins, and the "i = n" refers to where it terminates. Thus

$$\sum_{i=3}^{i=5} x_i = x_3 + x_4 + x_5 \tag{2.11}$$

implies that we are to sum only the third, fourth, and fifth observations. There are a number of rules that govern the use of this notation. These may be summarized as follows, where a is a constant, n is the number of observations, and x and y are variables:

$$\left.\begin{aligned}
\sum_{i=1}^{i=n} a &= na \\
\sum_{i=1}^{i=n} a x_i &= a \sum_{i=1}^{i=n} x_i \\
\sum_{i=1}^{i=n} (x_i + y_i) &= \sum_{i=1}^{i=n} x_i + \sum_{i=1}^{i=n} y_i
\end{aligned}\right\} \tag{2.12}$$

The first states that summing a constant n times yields a result of an. Thus

$$\sum_{i=1}^{i=3} 4 = 4 + 4 + 4 = 4(3) = 12 \tag{2.13}$$

The second rule in (2.12) indicates that constants may be taken outside of the summation sign. So, for example,

$$\sum_{i=1}^{i=3} 3x_i = 3 \sum_{i=1}^{i=3} x_i = 3(x_1 + x_2 + x_3) \tag{2.14}$$

The third rule implies that the order of addition does not matter when sums of sums are being taken. Other conventions include

$$\left.\begin{aligned}
\sum_{i=1}^{i=n} x_i y_i &= x_1 y_1 + x_2 y_2 + \cdots + x_n y_n \\
\sum_{i=1}^{i=n} x_i^2 &= x_1^2 + x_2^2 + \cdots + x_n^2 \\
\left(\sum_{i=1}^{i=n} x_i\right)^2 &= (x_1 + x_2 + \cdots + x_n)^2
\end{aligned}\right\} \tag{2.15}$$
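The summation rules and conventions above can be checked numerically. The following sketch uses small illustrative values for a, x, and y; the variable names are hypothetical, and a numerical check of this kind illustrates the rules rather than proving them.

```python
a = 3
x = [4, 5, 6]
y = [7, 8, 9]
n = len(x)

# Rules in (2.12).
assert sum(a for _ in range(n)) == n * a                # summing a constant n times
assert sum(a * xi for xi in x) == a * sum(x)            # constants move outside the sum
assert sum(xi + yi for xi, yi in zip(x, y)) == sum(x) + sum(y)

# Conventions in (2.15): three different quantities.
sum_of_products = sum(xi * yi for xi, yi in zip(x, y))  # x1*y1 + x2*y2 + x3*y3
sum_of_squares = sum(xi ** 2 for xi in x)               # x1^2 + x2^2 + x3^2
square_of_sum = sum(x) ** 2                             # (x1 + x2 + x3)^2

print(sum_of_products, sum_of_squares, square_of_sum)   # 122 77 225
```

Note in particular that the sum of squares (77) differs from the square of the sum (225), which is why the placement of the exponent relative to the summation sign matters.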

Shorthand versions of the summation notation leave out the upper limit of the summation, and sometimes the lower limit as well. This is done in those situations where all of the terms, and not just some subset of them, are to be summed. The following are all equivalent:

$$\sum_{i=1}^{i=n} x_i = \sum_{i}^{n} x_i = \sum_{i} x_i = \sum x_i \tag{2.16}$$

It should also be recognized that the letter "i" is used in this notation simply as an indicator (to indicate which observations or terms to sum); we could just as easily use any other letter:

$$\sum_{i=1}^{i=n} x_i = \sum_{k=1}^{k=n} x_k \tag{2.17}$$

In each case we find the sum by adding up all of the n observations. In fact, we often have use for more than one summation indicator. Double summations are required when we want to denote the sum of all of the observations in a table. A table of commuting flows, such as the one in Table 2.1, indicates the origins and destinations of individuals. The value of any cell is denoted $x_{ij}$ and this refers to the number of commuters from origin i who commute to destination j. The number of commuters going to destination j from all origins is $\sum_{i=1}^{i=n} x_{ij}$ (where there are n transportation zones), and the number of commuters leaving origin i for all destinations is $\sum_{j=1}^{j=n} x_{ij}$. The total number of commuters is designated by the double summation, $\sum_{i=1}^{n} \sum_{j=1}^{n} x_{ij}$. Using the data in Table 2.1, for example, we find that $\sum_i x_{i2} = 160$, $\sum_j x_{1j} = 220$, and $\sum_i \sum_j x_{ij} = 500$.

Table 2.1 Hypothetical commuting data

              Destination
Origin      1      2      3
1         130     40     50
2          20    100     10
3          30     20    100

Whereas the summation notation refers to the addition of terms, the product notation applies to the multiplication of terms. It is denoted by the capital Greek letter $\Pi$, and is used in the same way as the summation notation. For example,

$$\prod_{i=1}^{n} x_i y_i = (x_1 y_1)(x_2 y_2) \cdots (x_n y_n) \tag{2.18}$$
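The double summation and the product notation translate directly into code. The sketch below stores Table 2.1 as a list of rows (origins) and columns (destinations); the helper names and the values of x and y used for the product are hypothetical.

```python
from math import prod

# Rows are origins 1-3 and columns are destinations 1-3 (Table 2.1).
flows = [
    [130, 40, 50],
    [20, 100, 10],
    [30, 20, 100],
]


def to_destination(table, j):
    """Sum over all origins i of x_ij (j is 1-based, as in the text)."""
    return sum(row[j - 1] for row in table)


def from_origin(table, i):
    """Sum over all destinations j of x_ij (i is 1-based, as in the text)."""
    return sum(table[i - 1])


def total(table):
    """Double summation over all origins and destinations."""
    return sum(sum(row) for row in table)


print(to_destination(flows, 2), from_origin(flows, 1), total(flows))  # 160 220 500

# Product notation (2.18) with hypothetical x and y values.
x, y = [1, 2, 3], [4, 5, 6]
print(prod(xi * yi for xi, yi in zip(x, y)))  # (1*4)(2*5)(3*6) = 720
```

The three printed totals reproduce the values 160, 220, and 500 quoted in the text for Table 2.1.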

The factorial of a positive integer, n, is equal to the product of the first n integers. Surprisingly perhaps, factorials are denoted by an exclamation point. Thus

$$5! = 5 \times 4 \times 3 \times 2 \times 1 = 120 \tag{2.19}$$

Note that we could express factorials in terms of the product notation:

$$n! = \prod_{i=1}^{i=n} i \tag{2.20}$$

There is also a convention that 0! = 1; factorials are not defined for negative integers or for nonintegers. Factorials arise in the calculation of combinations. Combinations refer to the number of possible outcomes that particular probability experiments may have (see Section 2.2). Specifically, the number of ways that r items may be chosen from a group of n items is denoted by $\binom{n}{r}$, and is equal to

$$\binom{n}{r} = \frac{n!}{r!(n - r)!} \tag{2.21}$$

For example,

$$\binom{5}{2} = \frac{5!}{2!\,3!} = \frac{120}{2 \times 6} = 10 \tag{2.22}$$

What does this mean? If, for example, we group income into n = 5 categories, then there are ten ways to choose two of them. If we label the five categories (a) through (e), then the ten possible combinations of two income categories are ab, ac, ad, ae, bc, bd, be, cd, ce, and de.

2.1.3 Examples

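$$6! = 6 \times 5 \times 4 \times 3 \times 2 \times 1 = 720 \tag{2.23}$$

$$\prod_{i=1}^{i=4} i^2 = (1^2)(2^2)(3^2)(4^2) = 576 \tag{2.24}$$

$$34 + 26/13 \times 12 = 58 \tag{2.25}$$

Now let $a = 3$, and let the values of a set ($n = 3$) of x and y values be $x_1 = 4$, $x_2 = 5$, $x_3 = 6$, $y_1 = 7$, $y_2 = 8$, and $y_3 = 9$. Then

$$\left.\begin{aligned}
\sum a x_i &= 3(4 + 5 + 6) = 45 \\
\sum_{i=1}^{i=2} x_i y_i &= 4(7) + 5(8) = 68 \\
\sum x_i^3 &= 4^3 + 5^3 + 6^3 = 405 \\
\bar{x} &= \frac{\sum x_i}{n} = \frac{4 + 5 + 6}{3} = 5 \\
s_y^2 &= \frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{(7 - 8)^2 + (8 - 8)^2 + (9 - 8)^2}{3 - 1} = 1
\end{aligned}\right\} \tag{2.26}$$

Note that the sum of products does not necessarily equal the product of sums:

$$\begin{aligned}
\sum x_i y_i &= 4(7) + 5(8) + 6(9) = 122 \\
&\ne \sum x_i \sum y_i = (4 + 5 + 6)(7 + 8 + 9) = 360
\end{aligned} \tag{2.27}$$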
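The worked examples in this section, together with the combination count in (2.22), can be verified with the Python standard library. The dictionary keys below are only labels tying each value to its equation number, and the variable names are hypothetical.

```python
from itertools import combinations
from math import comb, factorial, prod
from statistics import mean, variance

a, x, y = 3, [4, 5, 6], [7, 8, 9]

results = {
    "2.22 ways to choose 2 of 5": comb(5, 2),
    "2.23 six factorial": factorial(6),
    "2.24 product of squares": prod(i ** 2 for i in range(1, 5)),
    "2.26 sum of a*x": sum(a * xi for xi in x),
    "2.26 first two products": sum(xi * yi for xi, yi in zip(x[:2], y[:2])),
    "2.26 sum of cubes": sum(xi ** 3 for xi in x),
    "2.26 mean of x": mean(x),
    "2.26 variance of y": variance(y),  # sample variance, divisor n - 1
    "2.27 sum of products": sum(xi * yi for xi, yi in zip(x, y)),
    "2.27 product of sums": sum(x) * sum(y),
}

for label, value in results.items():
    print(label, "=", value)

# The ten combinations of two income categories out of five.
pairs = ["".join(pair) for pair in combinations("abcde", 2)]
print(pairs)
```

The list `pairs` reproduces the ten combinations ab, ac, ad, ae, bc, bd, be, cd, ce, and de, and its length agrees with the value of 10 from (2.22).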

2.2 Sample Spaces, Random Variables, and Probabilities

Suppose we are interested in the likelihood that current residents of a suburban street are new to the neighborhood during the past year. To keep the example manageable, we shall assume that just four households are asked about their duration of residence. There are several possible questions that may be of interest. We may wish to use the sample to estimate the probability that


residents of the street moved to the street during the past year. Or we may want to know whether the likelihood of moving onto that street during the past year is any different than it is for the entire city. This problem is typical of statistical problems in the sense that it is characterized by the uncertainty associated with the possible outcomes of the household survey. We may think of the survey as an experiment of sorts. The experiment has associated with it a sample space, which is the set of all possible outcomes. Representing a recent move with a "1" and representing longer-term residents with a "0", the sample space is enumerated in Table 2.2. These sixteen outcomes represent all of the possible results from our survey. The individual outcomes are sometimes referred to as simple events or sample points.

Random variables are functions defined on a sample space. This is a rather formal way of saying that associated with each possible outcome is a quantity of interest to us. In our example, we are unlikely to be interested in the individual responses, but rather the total number of households that are newcomers to the street. Portrayed in Table 2.3 is the sample space with the variable of interest, the number of new households, given in parentheses. In this instance, the random variable is said to be discrete, since it can take on only a finite number of values (namely, the non-negative integers 0 to 4). Other random variables are continuous; they can take on an infinite number of values. Elevation, for example, is a continuous variable.

Associated with each possible outcome in a sample space is a probability. Each of the probabilities is greater than or equal to zero, and less than or equal to one. Probabilities may be thought of as a measure of the likelihood or relative frequency of each possible outcome. The sum of the probabilities over the sample space is equal to one. There are numerous ways to assign probabilities to the elements of sample spaces.
One way is to assign them on the basis of relative frequencies. Given a description of the current weather pattern, a meteorologist may note that in 65 out of the last 100 times that such a pattern prevailed there was measurable

Table 2.2 The sixteen possible outcomes on a sample of four residents 0000 0001 0010 0011

0100 0101 0110 0111

1000 1001 1010 1011

1100 1101 1110 1111

Table 2.3 Possible outcomes, with the number of new households in parentheses 0000 0001 0010 0011

(0) (1) (1) (2)

0100 0101 0110 0111

(1) (2) (2) (3)

1000 1001 1010 1011

(1) (2) (2) (3)

1100 1101 1110 1111

(2) (3) (3) (4)


precipitation the next day. The possible outcomes ± rain or no rain tomorrow ± are assigned probabilities of 0.65 and 0.35, respectively, on the basis of their relative frequencies. Another way to assign probabilities is on the basis of subjective beliefs. The description of current weather patterns is a simpli®cation of reality, and may be based upon only a small number of variables such as temperature, wind speed and direction, barometric pressure, etc. The forecaster may, partly on the basis of other experience, assess the likelihoods of precipitation and no precipitation as 0.6 and 0.4, respectively. Yet another possibility for the assignment of probabilities is to assign each of the n possible outcomes a probability of 1/n. This approach assumes that each sample point is equally likely, and it is an appropriate way to assign probabilities to the outcomes in special kinds of experiments. If, for example, we ¯ipped four coins, and let ``0'' represent ``heads'' and ``1'' represent ``tails,'' there would be sixteen possible outcomes (identical to the sixteen outcomes associated with our survey of the four residents above). If the probability of heads is 1/2, and if the outcomes of the four tosses are assumed independent from one another, the probability of any particular sequence of four tosses is given by the product 1/21/21/21/2 1/16. Similarly, if the probability that an individual resident is new to the neighborhood is 1/2, we would assign a probability of 1/16 to each of the sixteen outcomes in Table 2.2. Note that if the probability of heads diers from 1/2, the sixteen outcomes will not be equally likely. If the probability of heads or the probability that a resident is a newcomer is denoted by p, the probability of tails and the probability the resident is not a newcomer is equal to (1 p). In this case, the probability of a particular sequence is again given by the product of the likelihoods of the individual tosses. 
Thus the likelihood of ``1001'' (or ``HTTH'' using H for heads and T for tails) is equal to p(1 p)(1 p) p p2(1 p)2.
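The enumeration of the sample space and the product rule for a particular sequence can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `sequence_probability` and the value p = 0.2 are assumptions chosen for the example, and a "1" is taken to mark a newcomer (or a head).

```python
from itertools import product

def sequence_probability(outcome, p):
    """Probability of one particular outcome string such as '1001',
    assuming independent trials with P('1') = p (hypothetical helper)."""
    k = outcome.count("1")                      # number of newcomers (or heads)
    return p**k * (1 - p)**(len(outcome) - k)

# All 2**4 = 16 outcomes for four households, as in Table 2.2
sample_space = ["".join(bits) for bits in product("01", repeat=4)]

# The random variable of Table 2.3: number of new households per outcome
new_households = {outcome: outcome.count("1") for outcome in sample_space}

p = 0.2  # illustrative value only
total = sum(sequence_probability(o, p) for o in sample_space)

print(len(sample_space))                  # number of sample points
print(new_households["1001"])             # value of the random variable for 1001
print(sequence_probability("1001", p))    # p^2 (1 - p)^2
print(round(total, 10))                   # probabilities sum to one
```

With p = 1/2 every outcome receives probability 1/16; for any other p the outcomes are no longer equally likely, but their probabilities still sum to one.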

2.3 The Binomial Distribution

Returning to the example of whether the four surveyed households are newcomers, we are more interested in the random variable defined as the number of new households than in particular sample points. If we want to know the likelihood of receiving two "successes," or two new households out of a survey of four, we must add up all of the probabilities associated with the relevant sample points. In Table 2.4 we use an "*" to designate those outcomes where two households among the four surveyed are new ones. If the probability that a surveyed household is a new one is equal to $p$, the likelihood of any particular event with an "*" is $p^2(1-p)^2$. Since there are six such possibilities, the desired probability is $6p^2(1-p)^2$.


Table 2.4 Asterisked outcomes, indicating outcomes of interest

0000     0100     1000     1100*
0001     0101*    1001*    1101
0010     0110*    1010*    1110
0011*    0111     1011     1111

Note that we have assumed that the probability $p$ is constant across households, and also that households behave independently. These assumptions may or may not be realistic. Different types of household might have different values of $p$; for example, those who live in bigger houses may be more (or less) likely to be newcomers. The responses received from nearby houses may also not be independent. If one respondent is a newcomer, it might be more likely that a nearby respondent is also a newcomer (if, for example, a new row of houses has just been constructed).

When these assumptions do hold, the number of households who are newcomers is a binomial variable, and the probability that it takes on a particular value is given by the binomial distribution. We can find the probability that the random variable, designated $X$, is equal to 2, using the binomial formula

$$p(X = 2) = \binom{4}{2} p^2 (1-p)^2 = 6p^2(1-p)^2 \tag{2.28}$$

The binomial coefficient provides a means of counting the number of relevant outcomes in the sample space:

$$\binom{4}{2} = \frac{4!}{2!\,2!} = \frac{24}{2 \cdot 2} = 6 \tag{2.29}$$

The binomial distribution is used whenever (a) the process of interest consists of a number ($n$) of independent trials (in our example, the independent trials were the independent responses of the $n = 4$ residents), (b) each trial results in one of two possible outcomes (e.g., a newcomer, or not a newcomer), and (c) the probability of each outcome is known, and is the same for each trial; these probabilities are designated $p$ and $1 - p$. Often the outcomes of trials are labelled "success" with probability $p$ and "failure" with probability $1 - p$. Then the probability of $x$ successes is given by the binomial distribution

$$p(X = x) = \binom{n}{x} p^x (1-p)^{n-x} \tag{2.30}$$

You should recognize that, for given values of $n$ and $p$, we can generate a histogram by using this formula to generate the expected frequencies associated with different values of $x$. The histogram is also known as the binomial probability distribution, and it reveals how likely particular outcomes are. For example, suppose that the probability that a surveyed resident is a newcomer to the neighborhood is $p = 0.2$. Then the probability that our survey of four residents will result in a given number of newcomers is

$$
\begin{aligned}
p(X = 0) &= \binom{4}{0} (0.2)^0 (0.8)^4 = 0.4096 \\
p(X = 1) &= \binom{4}{1} (0.2)^1 (0.8)^3 = 0.4096 \\
p(X = 2) &= \binom{4}{2} (0.2)^2 (0.8)^2 = 0.1536 \\
p(X = 3) &= \binom{4}{3} (0.2)^3 (0.8)^1 = 0.0256 \\
p(X = 4) &= \binom{4}{4} (0.2)^4 (0.8)^0 = 0.0016
\end{aligned}
\tag{2.31}
$$

[Figure 2.1 Binomial distribution with n = 4, p = 0.2. Horizontal axis: number of successes (0–4); vertical axis: relative frequency.]
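The probabilities for n = 4 and p = 0.2 can be reproduced directly from the binomial formula. The following is a minimal sketch; the function name `binomial_pmf` is an assumption made for illustration, and `math.comb` (available in Python 3.8 and later) supplies the binomial coefficient.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials,
    each with success probability p (hypothetical helper name)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 4, 0.2
distribution = [binomial_pmf(x, n, p) for x in range(n + 1)]

for x, prob in enumerate(distribution):
    print(x, round(prob, 4))

# The probabilities over all possible values of x sum to one
print(round(sum(distribution), 10))
```

Changing `n` and `p` gives the heights of the histogram bars for any other binomial distribution.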

The probabilities may be thought of as relative frequencies. If we took repeated surveys of four residents, 40.96% of the surveys would yield no newcomers, 40.96% would reveal one newcomer, 15.36% would reveal two newcomers, 2.56% would yield three newcomers, and 0.16% would result in four newcomers. Note that the probabilities or relative frequencies sum to one. The binomial distribution depicted in Figure 2.1 portrays these results graphically. If we multiplied the vertical scale by the number of surveys taken, the histogram would represent the absolute frequencies expected in each category.

2.4 The Normal Distribution

The most commonly used probability distribution is the normal distribution. Its familiar symmetric, bell-shaped appearance is shown in Figure 2.2. The normal distribution is a continuous one: instead of a histogram with a finite number of vertical bars, the relative frequency distribution is continuous. You can think of it as a histogram with a very large number of very narrow vertical bars. The vertical axis is related to the likelihood of obtaining particular $x$ values. As with all frequency distributions, the area under the curve between any two $x$ values corresponds to the probability of obtaining an $x$ value in that range. The total area under the curve is equal to one.

[Figure 2.2 The normal distribution]

The normal distribution arises in a variety of contexts and is related to a variety of underlying processes. One way in which the normal distribution arises is through an approximation to binomial processes. Suppose that instead of interviewing four residents, we interviewed 40. We could still use the binomial distribution to evaluate the probability that eleven or fewer households were new to the neighborhood, but that would entail a long, tedious calculation involving large factorials:

$$p(X \le 11) = p(X = 0) + p(X = 1) + \cdots + p(X = 11) = \binom{40}{0} (0.2)^0 (0.8)^{40} + \cdots + \binom{40}{11} (0.2)^{11} (0.8)^{29} \tag{2.32}$$

When the sample size is large, the binomial distribution is approximately the same as a normal distribution which has a mean of $np$ and a variance of $np(1-p)$. In our example, we would expect a mean of $np = (40)(0.2) = 8$ residents to indicate that they were newcomers. The variance, $np(1-p) = 40(0.2)(0.8) = 6.4$, represents the variability we would expect among the counts of newcomers if many people each went out and surveyed 40 households. The probability that eleven or fewer residents are newcomers, $p(X \le 11)$, may be determined by the shaded area under the normal curve shown in Figure 2.3.

[Figure 2.3 Probability of X ≤ 11. Normal curve with mean µ = 8; the area to the left of x = 11 is shaded. Vertical axis: relative frequency.]

The areas under normal curves are given in tables such as that found in Table A.2 in Appendix A. Since variables with normal distributions may have an infinite number of possible means and standard deviations, normal tables are standardized, and they display the areas under normal distributions that have a mean of zero and a standard deviation of one. Before using a normal table, we must transform our data so that they have a mean of zero and a standard deviation of one. This is achieved by converting the data into z-scores. For our example, we convert $x = 11$ into a z-score by first subtracting the mean and then dividing the result by the standard deviation:

$$z = \frac{11 - 8}{\sqrt{6.4}} = 1.19 \tag{2.33}$$
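With a computer, the "long, tedious calculation" is no longer an obstacle, so the exact binomial probability can be placed alongside the normal approximation. The sketch below is illustrative only: the helper names are assumptions, the standard normal area is obtained from the error function rather than from Table A.2, and no continuity correction is applied, so the two results are expected to be close but not identical.

```python
from math import comb, erf, sqrt

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for a binomial variable (hypothetical helper)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k + 1))

def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

n, p, x = 40, 0.2, 11
mean = n * p                      # np = 8
variance = n * p * (1 - p)        # np(1 - p) = 6.4
z = (x - mean) / sqrt(variance)   # z-score for x = 11

exact = binomial_cdf(x, n, p)
approx = standard_normal_cdf(z)

print(round(z, 2))                # z-score
print(round(approx, 3))           # normal approximation to P(X <= 11)
print(round(exact, 3))            # exact binomial P(X <= 11)
```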

We now find the probability that z […]

[…] B   (3.27)

We collect the following sample data:

$$p_1 = 0.3,\ n_1 = 39; \qquad p_2 = 0.2,\ n_2 = 50 \tag{3.28}$$

The pooled estimate of the proportion is

$$p = \frac{0.3(39) + 0.2(50)}{39 + 50} = 0.244 \tag{3.29}$$

The z-statistic is

$$z = \frac{0.3 - 0.2}{\sqrt{0.244/39 + 0.244/50}} = 0.95 \tag{3.30}$$


With $\alpha = 0.05$, $z_{crit}$ for this one-sided test is 1.645, and so we accept the null hypothesis and conclude that there is no difference between the two communities. The p-value for this example is 0.17, since this corresponds to the area under the standard normal curve that is more extreme than the observed z-value. Note that the fact that the p-value is greater than 0.05 is consistent with accepting the null hypothesis.

3.5 Distributions of the Variable and the Test Statistic

A key distinction exists between the distribution of the variable of interest and the sampling distribution of the test statistic. This distinction is often not fully appreciated. Suppose that the distribution of distances traveled by park-goers from their residences to the park is governed by the "friction of distance" effect, irrespective of the weather. This effect is widely observed in many types of spatial interaction, where the distribution of trip lengths is characterized by many short trips and relatively fewer longer ones (see Figure 3.6). If we want to test the null hypothesis that the mean trip distance is the same on rainy days as it is on sunny days, we would take two samples. We might expect that both the sunny-day trip length distribution and the rainy-day trip length distribution would have shapes that are similar to the exponential distribution.

[Figure 3.6 Distribution of trip lengths. Horizontal axis: trip length; vertical axis: frequency.]

To test the null hypothesis, we must compare the observed difference in mean distances with the sampling distribution of differences, derived by assuming H0 to be true. The latter may be thought of as the histogram of differences when many samples are used to calculate many differences in means, when H0 is true. We know from the central limit theorem that sample means are approximately normally distributed (given a large enough sample size), even when the underlying variables themselves are not normally distributed. We also know that the difference of two normal variables is normally distributed. Hence the two-sample test makes use of the fact that the sampling distribution of differences in means is normal (Figure 3.7). The important point is that there are two distributions to keep in mind: the distribution of the underlying variable (in this case exponential), and the distribution of the test statistic (in this case normal).

[Figure 3.7 Distribution of differences in mean trip length. Horizontal axis: difference in mean trip length, centered on 0; vertical axis: relative frequency.]
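The two distributions can be told apart with a small simulation. The sketch below is an illustration under stated assumptions, not part of the original analysis: trip lengths are drawn from an exponential distribution with a hypothetical mean of 5 (in arbitrary distance units), each sample contains 50 trips, and H0 is true because the sunny-day and rainy-day samples share the same distribution.

```python
import random
import statistics

def mean_difference(rng, n, mean_trip=5.0):
    """Difference between two sample means when both samples come from the
    same exponential trip-length distribution (that is, H0 is true)."""
    sunny = [rng.expovariate(1.0 / mean_trip) for _ in range(n)]
    rainy = [rng.expovariate(1.0 / mean_trip) for _ in range(n)]
    return statistics.mean(sunny) - statistics.mean(rainy)

rng = random.Random(42)           # fixed seed so the run is repeatable
differences = [mean_difference(rng, n=50) for _ in range(2000)]

centre = statistics.mean(differences)
spread = statistics.stdev(differences)

# For a normal distribution roughly 95% of values lie within two standard
# deviations of the centre, which is a crude check on the bell shape.
within_2sd = sum(abs(d - centre) <= 2 * spread for d in differences) / len(differences)

print(round(centre, 2), round(spread, 2), round(within_2sd, 3))
```

Under these assumptions the theoretical standard deviation of the difference is $\sqrt{2(5^2)/50} = 1$, so the simulated spread should be close to one and the simulated differences should be centered near zero, even though the individual trip lengths are strongly skewed.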

3.6 Spatial Data and the Implications of Nonindependence

One of the assumptions of the one- and two-sample t-tests is that observations are independent. This means that the observed value of one observation is not affected by the value of another observation. At first glance, this assumption sounds innocent enough, and it is tempting to simply ignore it and hope that it is satisfied. However, spatial data are often not independent; the value of one observation is very likely to be influenced by the value of another observation. In the swimming example, two individuals chosen at random in the central city are more likely to have similar responses than two individuals chosen at random from the suburbs. This could be because the accessibility of swimming pools is similar for them; the closer together the two chosen individuals live, the more similar is their distance to pools, and this would tend to make their swimming frequencies similar to one another. The closer together two individuals live, the more similar their incomes and lifestyles tend to be. This too would tend to cause similar swimming frequencies.

What are the consequences of a lack of independence among the observations? Because observations that are located near one another in space often exhibit similar values on variables, the effect is to reduce the effective sample size. Instead of n observations, the sample effectively contains information on fewer than n individuals. To take an extreme case, suppose that two individuals lived next door to one another, and thirty miles from the nearest pool. If we survey both of them, they are both likely to indicate that their swimming frequency was either zero or some very small number. The information contained in these two responses is essentially equivalent to the information contained in one response.

The implication of this is that when we carry out a two-sample t-test on observations that do not exhibit independence, we should really be using a critical value of t that is based on a smaller number of degrees of freedom than n. This in turn means that the critical value of t should be larger than the one that we use when we assume independence. A larger critical value of t means that it would be more difficult to reject the null hypothesis, and also that we are rejecting too many null hypotheses if we incorrectly assume independence. Thus there is a tendency to find significant results when in fact there are no significant differences in the underlying means of the two populations. The "apparent" differences between the two samples can, instead, be attributed to the fact that each sample contains observations that are similar to each other because of spatial dependence. Cliff and Ord (1975) give some examples of this, and supply the correct critical values of t that one should use, given a specified level of dependence.

When data are independent and the variance is $\sigma^2$, we have seen that a 95% confidence interval for the mean, $\mu$, is $(\bar{x} - 1.96\sigma/\sqrt{n},\ \bar{x} + 1.96\sigma/\sqrt{n})$. Following Cressie (1993), suppose that we collected $n = 10$ observations. For example, we might collect air quality data systematically along a transect (Figure 3.8). Let us choose $x_1$ from a normal distribution with mean $\mu$ and variance $\sigma^2$. Then, instead of choosing $x_2$ from a normal distribution with mean $\mu$ and variance $\sigma^2$, choose $x_2$ as

$$x_2 = \rho x_1 + \varepsilon \tag{3.31}$$

[Figure 3.8 Systematic collection of data along a transect]


where $\varepsilon$ comes from a normal distribution with mean 0 and variance $\sigma^2(1 - \rho^2)$, and $\rho$ is a constant between 0 and 1 indicating the amount of dependence (with 0 implying independence and 1 implying perfect dependence, so that $x_2 = x_1$). Cressie indicates that when successive points are chosen as in (3.31), the variance of the mean is equal to $\sigma^2/n$ only when data are independent ($\rho = 0$); more generally it is equal to

$$\sigma_{\bar{x}}^2 = \frac{\sigma^2}{n}\left[1 + 2\left(\frac{\rho}{1-\rho}\right)\left(1 - \frac{1}{n}\right) - 2\left(\frac{\rho}{1-\rho}\right)^2 \frac{1 - \rho^{\,n-1}}{n}\right] \tag{3.32}$$

Cressie gives an example for $n = 10$ and $\rho = 0.26$; in this case $\sigma_{\bar{x}}^2 = (\sigma^2/10)(1.608)$, implying that a 95% two-sided confidence interval for $\mu$ is $(\bar{x} - 2.485\sigma/\sqrt{n},\ \bar{x} + 2.485\sigma/\sqrt{n})$, since $1.96\sqrt{1.608} \approx 2.485$. It is important to realize that this is wider than the confidence interval that results from assuming independence. If we write the variance of the mean as $\sigma_{\bar{x}}^2 = (\sigma^2/n) f$, where $f$ is the inflation factor induced by the lack of independence, we can also write

$$\sigma_{\bar{x}}^2 = \frac{\sigma^2}{n} f = \frac{\sigma^2}{n'} \tag{3.33}$$

where $n' = n/f$ is the effective number of independent observations. With $n = 10$ and $\rho = 0.26$, $f = 1.608$ and $n' = 10/1.608 = 6.2$; this means that our 10 dependent observations are equivalent to a situation where we have $n' = 6.2$ independent observations.
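The inflation factor and the effective number of observations are easy to evaluate for other values of n and of the dependence parameter. The sketch below assumes the form of the variance of the mean attributed to Cressie (1993) in the text, namely $f = 1 + 2r(1 - 1/n) - 2r^2(1 - \rho^{n-1})/n$ with $r = \rho/(1 - \rho)$; the function name `inflation_factor` is a hypothetical label used only for this illustration.

```python
def inflation_factor(n, rho):
    """Factor f by which dependence inflates the variance of the mean,
    assuming the expression attributed to Cressie (1993) in the text."""
    r = rho / (1.0 - rho)
    return 1.0 + 2.0 * r * (1.0 - 1.0 / n) - 2.0 * r**2 * (1.0 - rho**(n - 1)) / n

n, rho = 10, 0.26
f = inflation_factor(n, rho)
n_effective = n / f               # effective number of independent observations

print(round(f, 3))                # 1.608
print(round(n_effective, 1))      # 6.2

# With no dependence the factor reduces to one and n' equals n
print(inflation_factor(n, 0.0))   # 1.0
```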

3.7 Sampling

The statistical methods discussed throughout this book rely upon sampling from some larger population. The population may be thought of as the collection of all elements or individuals that are the object of our interest. The list of all elements in the population is referred to as the sampling frame. Sampling frames may consist of spatial elements, for example all of the census tracts in a city. We may be interested in the commuting times of all individuals in a community, or in the migration distances of all people who have moved during the past year. It is important to have a clear definition of this population, since this is the group about which we are making inferences. The inferences are made using information collected from a sample.

There are many ways to sample from a population. Perhaps the simplest sampling method is random sampling, where each of the elements has an equal probability of being selected from the population into the sample. For example, suppose we wish to take a random sample of size $n = 4$ from a population of size $N = 20$. (A common convention is to use upper case "N" to denote population size and lower case "n" to denote sample size.) Choose a random number from 1 to 20. Then select another random number from 1 to 20. If it is the same as the previous random number, discard it and choose another. Repeat this until four distinct random numbers, representing elements of the sampling frame, have been chosen. To illustrate, we will use the first two digits of the five-digit random numbers from Table A.1 in Appendix A. Beginning at the upper-left of the table and proceeding down the column, the first two-digit number in the range 01–20 is 17. To complete our sample of $n = 4$, we proceed down the column and choose the next three numbers in this range; they are 04, 03, and 07.

Choosing a systematic sample of size n begins by selecting an observation at random from among the first [N/n] elements, where the square brackets indicate that the integer part of N/n is to be taken. Thus if N/n is not an integer, one just uses the integer part of N/n. Call the label of this randomly chosen element k. The elements of the sampling frame that are in the sample are $k + i[N/n]$, $i = 0, 1, \ldots, n - 1$. With $N = 20$ and $n = 4$, $[N/n] = 5$. Suppose, from among the first five elements, we choose element $k = 2$ at random. The elements in the sample are 2, 2 + 5 = 7, 2 + 10 = 12, and 2 + 15 = 17. Note that it was necessary to choose only one random number.

When it is known beforehand that there is likely to be variation across certain subgroups of the population, the sampling frame may be stratified before sampling. For example, suppose that our $N = 20$ individuals can be divided into two groups: $N_m = 15$ men and $N_w = 5$ women. A proportional, stratified sample of individuals is achieved by making the sampling proportions in each stratum equal. Thus we could choose $n_m = 3$ men randomly from among the group of $N_m = 15$, and $n_w = 1$ woman randomly from the group of $N_w = 5$ women. For both men and women, the sampling proportion is 1/5.

When the size of a stratum is small, it may be advantageous to obtain a disproportional random sample, where the small group is oversampled. In the case above, using $n_m = 2$ and $n_w = 2$ would result in unequal sampling proportions, since $n_m/N_m = 2/15$ for men and $n_w/N_w = 2/5$ for women.
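The three sampling schemes just described can be sketched as follows. This is an illustrative sketch only: a pseudo-random number generator stands in for Table A.1, the function names are assumptions, and the labelling of elements 1–15 as men and 16–20 as women is hypothetical.

```python
import random

def simple_random_sample(N, n, rng):
    """Draw labels from 1..N, discarding repeats, until n distinct labels are chosen."""
    chosen = []
    while len(chosen) < n:
        candidate = rng.randint(1, N)
        if candidate not in chosen:
            chosen.append(candidate)
    return chosen

def systematic_sample(N, n, k):
    """Elements k + i[N/n] for i = 0, ..., n - 1, where k is the random start."""
    step = N // n                     # integer part of N/n
    return [k + i * step for i in range(n)]

def stratified_sample(strata, sizes, rng):
    """Random sample of the requested size from within each stratum."""
    return {name: rng.sample(members, sizes[name]) for name, members in strata.items()}

rng = random.Random(1)                # fixed seed for repeatability

print(simple_random_sample(20, 4, rng))
print(systematic_sample(20, 4, k=2))  # [2, 7, 12, 17]

# Hypothetical labelling: elements 1-15 are men, 16-20 are women
strata = {"men": list(range(1, 16)), "women": list(range(16, 21))}
print(stratified_sample(strata, {"men": 3, "women": 1}, rng))   # proportional
print(stratified_sample(strata, {"men": 2, "women": 2}, rng))   # disproportional
```

In practice the starting element k of the systematic sample would itself be drawn at random from 1 to [N/n]; it is fixed at 2 here to match the example in the text.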

3.7.1 Spatial Sampling

When the sampling frame consists of all of the points located in a geographical region of interest, there are again several alternative sampling methods. A random spatial sample consists of locations obtained by choosing x-coordinates and y-coordinates at random. If the region is a non-rectangular shape, x- and y-coordinates may be chosen by selecting them at random from the ranges (xmin, xmax) and (ymin, ymax). If the pair of coordinates happens to correspond to a location outside of the study region, the point is simply discarded.


[Figure 3.9 Examples of spatial sampling: (a) study region stratified into subregions; (b) stratified spatial sampling; (c) systematic spatial sampling]

To ensure adequate coverage of the study area, the study region may be broken into a number of mutually exclusive and collectively exhaustive strata. Figure 3.9(a) divides a study region into a set of $s = mn$ strata. A stratified spatial sample of size $mnp$ is obtained by taking a random sample of size $p$ within each of the $mn$ strata (Figure 3.9b). A systematic spatial sample of size $mnp$ is obtained by (i) taking a random sample of size $p$ within any individual stratum, and then (ii) using the same spatial configuration of those $p$ points within each of the other strata (Figure 3.9c).

The question of which sampling scheme is "best" depends upon the spatial characteristics of variability in the data. In particular, because values of variables at one location tend to be strongly associated with values at nearby locations, random spatial samples can provide redundant information when sample locations are close to one another. Consequently, stratified and systematic random sampling tend to provide better estimates of the variable's mean value. Thus if one were to repeat the sampling many times, the variability associated with the means calculated using systematic or stratified sampling would be less than that found with random spatial sampling. Haining (1990a) discusses this in more detail, and gives references that suggest that systematic random sampling is often slightly better than stratified random sampling.
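The three spatial sampling schemes can be sketched as below. The sketch rests on several assumptions made purely for illustration: the strata are unit-square cells arranged in m rows and n columns, the non-rectangular study region used for random spatial sampling is a circle, and all function names are hypothetical.

```python
import random

def random_spatial_sample(n_points, bounds, inside, rng):
    """Choose x and y at random within the bounding ranges and discard
    any point that falls outside the study region."""
    xmin, xmax, ymin, ymax = bounds
    points = []
    while len(points) < n_points:
        x, y = rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)
        if inside(x, y):
            points.append((x, y))
    return points

def stratified_spatial_sample(m, n, p, rng):
    """p independent random points within each of the m x n unit cells."""
    return [(col + rng.random(), row + rng.random())
            for row in range(m) for col in range(n) for _ in range(p)]

def systematic_spatial_sample(m, n, p, rng):
    """p random points in one cell, with the same configuration repeated
    in every other cell."""
    offsets = [(rng.random(), rng.random()) for _ in range(p)]
    return [(col + dx, row + dy)
            for row in range(m) for col in range(n) for dx, dy in offsets]

rng = random.Random(0)

# Hypothetical non-rectangular region: a circle of radius 1 centred on (1, 1)
in_circle = lambda x, y: (x - 1.0) ** 2 + (y - 1.0) ** 2 <= 1.0
random_pts = random_spatial_sample(18, (0.0, 2.0, 0.0, 2.0), in_circle, rng)

stratified_pts = stratified_spatial_sample(3, 3, 2, rng)   # mnp = 18 points
systematic_pts = systematic_spatial_sample(3, 3, 2, rng)   # mnp = 18 points

print(len(random_pts), len(stratified_pts), len(systematic_pts))
```

Repeating each scheme many times over a spatially autocorrelated surface and comparing the variability of the resulting sample means would be one way to explore the claim that stratified and systematic designs tend to give better estimates.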

3.8 Two-Sample t-Tests in SPSS for Windows 9.0

3.8.1 Data Entry

Suppose we wish to enter the data from Table 3.1 into SPSS and conduct a two-sample t-test of the null hypothesis that the mean annual swimming frequency among residents of the central city is equal to the mean annual swimming frequency among residents of the suburbs.

We begin by entering the data. In SPSS, this entails entering all of the swimming frequencies into one column. Another column contains a numeric value indicating which region the corresponding swimming frequency belongs to. For our two-region example, we would have

Swim    Location
38      1
42      1
50      1
57      1
80      1
70      1
32      1
20      1
58      2
66      2
80      2
62      2
73      2
39      2
73      2
58      2

The variable names "Swim" and "Location" are defined by right-clicking at the head of each column on the heading "var" that appears in the SPSS data editor. Then, under Define Variable, variable names may be assigned. Note that here the first eight rows correspond to the data from the central city; location 1 refers to the central city. Similarly, the last eight rows contain the value "2" in the second column, and these correspond to the observations from the suburbs. In general, if there are $n_1$ observations in one group and $n_2$ observations in the other, then there will be $n_1 + n_2$ rows and 2 columns once data have been entered into SPSS.

3.8.2 Running the t-Test

To run the analysis within SPSS, click on Analyze (Statistics in earlier versions of SPSS for Windows), then on Compare Means, and then on Independent Samples t-test. A box will open, and the variable Swim should then be highlighted and moved to the Test Variable box via the arrow tab. The variable Location is moved to the box headed Grouping Variable (since we are testing the variable Swim for differences by Location). Under the Grouping Variable box, click on Define Groups, and enter 1 for Group 1 and 2 for Group 2; these are the numeric values that SPSS will use from the second column of data to distinguish between groups. Then click Continue. Under Options, the percentage associated with the confidence interval may be assigned if desired (the default is 95%). Finally, click OK. An example of the output from a two-sample t-test is shown in Table 3.2, which depicts the results of the test of equality of swimming frequencies in central city and suburbs using SPSS 9.0 for Windows.

Table 3.2 Results of two sample t-test


First, the swimming frequencies in each region are summarized; location 1 (central city) has a mean response of 48.625 days and a standard deviation of 19.8778, while those in the suburbs apparently swim more often: the responses there have a mean of 63.625 and a standard deviation of 12.6597.

Below this are the results of the analysis. First, note that there is a test of the assumption that the variances of the two groups are indeed equal. This test, Levene's test, is based upon an F-statistic. The key piece of output is the column headed "Sig.", since this tells us whether to accept or reject the null hypothesis that the two variances are equal. Since this value (which is also known as a p-value) is greater than 0.05, we can accept the null hypothesis, and conclude that the variances may be assumed equal.

The results of the t-test are given for both instances: one where the variances are assumed equal, and one where they are not. In both cases, the t-statistic is 1.8 in absolute value, and in both cases we accept the null hypothesis since the "Sig." column indicates a value higher than 0.05. Note that, when equal variances are assumed, we come slightly closer to rejecting the null hypothesis (the p-value in that case is 0.093, compared with 0.097 when the variances are not assumed equal). The p-values differ despite identical t-statistics because the degrees of freedom differ. Finally, note that the 95% confidence intervals include zero, indicating that the true difference between city and suburbs could be zero.
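The summary statistics and t-statistics reported by SPSS can be checked by hand from the sixteen swimming frequencies. The sketch below uses only the Python standard library, so it stops short of the p-values (which require the t-distribution). It assumes that the "equal variances not assumed" row corresponds to the usual Welch-Satterthwaite approximation for the degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance

city = [38, 42, 50, 57, 80, 70, 32, 20]      # location 1
suburbs = [58, 66, 80, 62, 73, 39, 73, 58]   # location 2

n1, n2 = len(city), len(suburbs)
v1, v2 = variance(city), variance(suburbs)   # sample variances (n - 1 divisor)
diff = mean(city) - mean(suburbs)

# Equal variances assumed: pooled variance with n1 + n2 - 2 degrees of freedom
pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t_pooled = diff / sqrt(pooled * (1 / n1 + 1 / n2))
df_pooled = n1 + n2 - 2

# Equal variances not assumed (Welch-Satterthwaite degrees of freedom; an
# assumption about what the second row of the SPSS output reports)
se2 = v1 / n1 + v2 / n2
t_welch = diff / sqrt(se2)
df_welch = se2**2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))

print(mean(city), round(sqrt(v1), 4))        # 48.625 and its standard deviation
print(mean(suburbs), round(sqrt(v2), 4))     # 63.625 and its standard deviation
print(round(t_pooled, 2), df_pooled)
print(round(t_welch, 2), round(df_welch, 2))
```

Because the two samples are the same size, the two t-statistics coincide; only the degrees of freedom differ, which is why the two p-values in the output are slightly different.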

Exercises

1. A political geographer is interested in the spatial voting pattern during a recent presidential election involving two candidates, A and B. She suspects that university professors were more likely than the general population to vote for candidate A. She takes a random sample of 45 professors in the state, and finds that 20 voted for candidate A. Is there sufficient evidence to support her hypothesis? The statewide proportion of the population voting for candidate A was 0.38. What is the p-value?

2. A survey of the white and nonwhite population in a local area reveals the following annual trip frequencies to the nearest state park:

   $\bar{x}_1 = 4.1$;  $s_1^2 = 14.3$;  $n_1 = 20$
   $\bar{x}_2 = 3.1$;  $s_2^2 = 12.0$;  $n_2 = 16$

   (a) Assume that the variances are equal, and test the null hypothesis that there is no difference between the park-going frequencies of whites and nonwhites.
   (b) Repeat the exercise, assuming that the variances are unequal.
   (c) Find the p-value associated with the tests in parts (a) and (b).
   (d) Find a 95% confidence interval for the difference in means.
   (e) Repeat parts (a)–(d), assuming sample sizes of $n_1 = 24$ and $n_2 = 12$.

3. Test the hypothesis that two communities have equal support for a political candidate using the following data:

   Community A:  $p = 0.33$;  $n_A = 54$
   Community B:  $p = 0.18$;  $n_B = 38$

   In addition to testing the hypothesis, find the p-value.

4. A researcher suspects that the level of a particular stream's pollutant is higher than the allowable limit of 4.2 mg/l. A sample of $n = 17$ reveals a mean pollutant level of $\bar{x} = 6.4$ mg/l, with a standard deviation of 4.4. Is there sufficient evidence that the stream's pollutant level exceeds the allowable limit? What is the p-value?

5. Information is collected by a researcher from 14 individuals on their use of rapid transit. Seven individuals were from suburb A and seven were from suburb B. The following data are the number of times per year the individual used rapid transit:

   Individual    Suburb A    Suburb B
   1             5           67
   2             12          56
   3             14          44
   4             54          22
   5             34          16
   6             14          61
   7             23          37
   Mean          22.29       43.29
   Std. dev.     16.67       19.47

   Pooled data: mean 32.79, std. dev. 20.57

   Do the suburbs differ with respect to the mean number of rapid transit trips taken by individuals? Use the two-sample t-test, assuming the variances are equal. Give the critical value of t, recalling that the number of degrees of freedom is equal to $n_1 + n_2 - 2$. What is the p-value associated with this test?

6. The contour lines of the map below represent elevation.

   [Contour map of elevation not reproduced; the surviving labels are 100, 30, 40, 70, 60, 50, 0, 50, 100.]

   (a) Take a random spatial sample of $n = 18$ points and estimate the mean elevation of the study area.
   (b) Divide the study region into a set of 3 × 3 = 9 strata of equal size. Take a stratified sample of size 18 by randomly choosing two points from within each stratum. Estimate the mean.
   (c) Using the same 3 × 3 = 9 strata from (b), choose a systematic random sample by first randomly selecting two points from within any individual stratum. Then use the configuration of points within that stratum to select points within the other strata (see Figure 3.9c). Estimate the mean elevation from the resulting 18 points.

   Note: If answers from the entire class are pooled together, it will usually (but not always!) be the case that the means found in part (a) will display greater variability than those found in parts (b) and (c).

7. (a) A two-tailed test of a one-sample hypothesis of a mean yields a test statistic of $z = 1.47$. What is the p-value?
   (b) A one-tailed test of a two-sample hypothesis involving the difference of sample means yields $t = 1.85$, with 12 degrees of freedom. What is the p-value?

4 Analysis of Variance

LEARNING OBJECTIVES

• Comparison of means in three or more samples
• Analysis of underlying assumptions
• Introduction of alternative tests for cases where assumptions are not reasonable

4.1 Introduction

The two-sample difference of means test may be generalized to the case of more than two samples. In this case, we wish to test the null hypothesis that the population means from a set of $k > 2$ populations are all equal:

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_k \tag{4.1}$$

Such hypotheses may concern variation in means over time or space. For example, we may wish to know whether traffic counts vary by month, or whether the number of weekly shopping trips made by households varies among the central city, suburban, and rural portions of a county. Data for such problems are typically given in a table such as Table 4.1, with the categories constituting the columns. Note that the notation in the table is a bit different: "pluses" are used to indicate means, so that $\bar{X}_{+1}$ designates the mean of column one, and $\bar{X}_{++}$ denotes the mean of all observations, summed over rows and columns.

Table 4.1 Arrangement of data for analysis of variance

                      Category 1    Category 2    ...    Category k
Obs. 1                X11           X12           ...    X1k
Obs. 2                X21           X22           ...    X2k
Obs. 3                X31           X32           ...    X3k
...
Obs. i                Xi1           Xi2           ...    Xik
No. of obs.           n1            n2            ...    nk
Mean                  X̄+1           X̄+2           ...    X̄+k
Standard deviation    s1            s2            ...    sk

Overall mean: X̄++

Analysis of variance (ANOVA) represents an extension of the two-sample t-test for differences of means. It involves the introduction of some new ideas, though the underlying assumptions of the test are similar to those used in the two-sample t-test. The assumptions of analysis of variance may be stated as follows:

(1) Observations between and within samples are random and independent.
(2) The observations in each category are normally distributed.
(3) The population variances are assumed equal:

$$\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2 = \sigma^2 \tag{4.2}$$

The assumed equality of variances is referred to as the assumption of homoscedasticity (sometimes written as homoskedasticity). Though the analysis of variance test is one that tests for the equality of group means, the test itself is carried out using two independent estimates of the common variance $\sigma^2$. One estimate of the variance is a pooled estimate of the within-group variances. The other estimate of the variance is a between-group variance. The idea behind the test is to compare the variation within columns to the variation between column means. If the variation between group means is much greater than the variation within columns, we will be inclined to reject the null hypothesis. If, however, the variation between group means is not very large relative to the variation within columns, this suggests that any differences in group means may be due to sampling fluctuation, and hence we are more inclined to accept H0.

For example, in Table 4.1, there is variability within columns; different individuals within each subregion have differing levels of participation. There is also variability between columns; the sample means in each region are different. If the between-column variability is high relative to the within-column variability, we will reject the null hypothesis and conclude that the true column means are not equal.

To be more specific, we may define the total sum of squares as the sum of the squared deviations of all observations from the overall mean. This total sum of squares (TSS) may be partitioned into a "between sum of squares" (BSS) and a "within sum of squares" (WSS). The comparison of between-column variation to within-column variation leads to an F-statistic. The partitioning of the sum of squares is as follows:

$$
\begin{aligned}
\mathrm{TSS} &= \sum_i \sum_j \left(X_{ij} - \bar{X}_{++}\right)^2 \\
\mathrm{BSS} &= \sum_j n_j \left(\bar{X}_{+j} - \bar{X}_{++}\right)^2 \\
\mathrm{WSS} &= \sum_i \sum_j \left(X_{ij} - \bar{X}_{+j}\right)^2 = \sum_j (n_j - 1)\, s_j^2
\end{aligned}
\tag{4.3}
$$

The F-statistic is F

BSS=k 1 WSS=N k

4:4


When the null hypothesis is true, this statistic has an F-distribution, with $k - 1$ and $N - k$ degrees of freedom associated with the numerator and denominator, respectively.

4.1.1 A Note on the Use of F-Tables

F-statistics are based on ratios. There are degrees of freedom associated with both the numerator and the denominator. F-tables are typically arranged so that the columns correspond to particular degrees of freedom associated with the numerator, and rows correspond to particular degrees of freedom associated with the denominator. Entries in the table give the critical F-values, and the entire table is associated with a given significance level $\alpha$. Many texts give separate tables for $\alpha$ = 0.01, 0.05, and 0.10; these are provided, for example, in Table A.4 in Appendix A. Because F-tables are displayed in this way, it is often difficult to state the p-value associated with the test. Stating the p-value would require a very complete set of F-tables for many values of $\alpha$. Software for statistical analysis is often useful in this regard, since p-values are usually provided.

4.1.2 More on Sums of Squares

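To see why the between sum of squares plus the within sum of squares add to the total sum of squares, recognize that the total sum of squares may be written as

\[
\sum_i \sum_j (X_{ij} - \bar{X}_{++})^2 = \sum_i \sum_j (X_{ij} - \bar{X}_{+j} + \bar{X}_{+j} - \bar{X}_{++})^2
\tag{4.5}
\]

Expanding the square yields

\[
\sum_i \sum_j (X_{ij} - \bar{X}_{++})^2 = \sum_i \sum_j (X_{ij} - \bar{X}_{+j})^2 + 2 \sum_j \sum_i (X_{ij} - \bar{X}_{+j})(\bar{X}_{+j} - \bar{X}_{++}) + \sum_i \sum_j (\bar{X}_{+j} - \bar{X}_{++})^2
\tag{4.6}
\]

The middle term on the right-hand side is equal to zero, since the sum of deviations from a mean is equal to zero:

\[
\sum_i (X_{ij} - \bar{X}_{+j}) = 0
\tag{4.7}
\]

The first term on the right-hand side is equal to the within sum of squares:

\[
\mathrm{WSS} = \sum_i \sum_j (X_{ij} - \bar{X}_{+j})^2
\tag{4.8}
\]

Similarly, the last term on the right-hand side is equal to the between sum of squares:

\[
\mathrm{BSS} = \sum_i \sum_j (\bar{X}_{+j} - \bar{X}_{++})^2 = \sum_j n_j (\bar{X}_{+j} - \bar{X}_{++})^2
\tag{4.9}
\]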

This demonstrates that the total sum of squares is equal to the between sum of squares plus the within sum of squares.
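The partition can also be checked numerically. The following is a minimal sketch, not part of the original text, that applies Equations 4.3 and 4.4 to the swimming frequencies of Table 4.2; the function name `anova_sums` and the dictionary layout are assumptions made for illustration.

```python
# Swimming frequencies from Table 4.2 (central city, suburbs, rural).
groups = {
    "central_city": [38, 42, 50, 57, 80, 70, 32, 20],
    "suburbs": [58, 66, 80, 62, 73, 39, 73, 58],
    "rural": [80, 70, 60, 55, 72, 73, 81, 50],
}


def anova_sums(groups):
    """Return (TSS, BSS, WSS, F) for a one-way layout (Equations 4.3 and 4.4)."""
    all_obs = [x for obs in groups.values() for x in obs]
    n_total = len(all_obs)
    k = len(groups)
    grand_mean = sum(all_obs) / n_total
    # Total sum of squares: deviations of every observation from the overall mean.
    tss = sum((x - grand_mean) ** 2 for x in all_obs)
    bss = 0.0
    wss = 0.0
    for obs in groups.values():
        mean_j = sum(obs) / len(obs)
        # Between: group size times squared deviation of the group mean.
        bss += len(obs) * (mean_j - grand_mean) ** 2
        # Within: deviations of observations from their own group mean.
        wss += sum((x - mean_j) ** 2 for x in obs)
    f_stat = (bss / (k - 1)) / (wss / (n_total - k))
    return tss, bss, wss, f_stat


tss, bss, wss, f_stat = anova_sums(groups)
print(f"TSS={tss:.2f}  BSS={bss:.2f}  WSS={wss:.2f}  F={f_stat:.2f}")
```

Because this sketch works from the raw observations rather than from rounded standard deviations, its sums of squares can differ slightly in the last digits from hand calculations based on rounded summary statistics, while BSS + WSS still reproduces TSS.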

4.2 Illustrations

4.2.1 Hypothetical Swimming Frequency Data

Suppose we extend the previous example to include residents of the outlying rural region, using the data in Table 4.2. We formulate the null hypothesis of no difference in the mean annual swimming frequency between the three regions:

\[
H_0: \mu_{SUB} = \mu_{CC} = \mu_{R}
\tag{4.10}
\]

With $\alpha = 0.05$, the critical value of F is equal to $F_{.05,2,21} = 3.47$. The observed F-statistic along with its components are given below:

\[
\begin{aligned}
\text{Total sum of squares:} &\quad 6406.79 \\
\text{Between sum of squares:} &\quad 1603.85 \\
\text{Within sum of squares:} &\quad 4802.94 \\
F &= \frac{1605.33/(3 - 1)}{4802.94/(24 - 3)} = 3.51
\end{aligned}
\tag{4.11}
\]

Since the observed F value exceeds the critical value of 3.47, the null hypothesis is rejected.

Table 4.2 Annual swimming frequencies for three regions

                     Central city   Suburbs   Rural
                     38             58        80
                     42             66        70
                     50             80        60
                     57             62        55
                     80             73        72
                     70             39        73
                     32             73        81
                     20             58        50
Mean                 48.63          63.63     67.63
Standard deviation   19.88          12.66     11.43
Overall mean: X++ = 59.96; overall standard deviation: s = 16.69

How are the sums of squares most easily calculated? One way is to recognize that the total sum of squares is equal to the overall variance multiplied by $N - 1$. Thus $6406.79 = (16.69)^2 (23)$. The within sum of squares is equal to the sum of the products of the group variances and $(n_j - 1)$. Thus $4802.94 = (7 \times 19.88^2) + (7 \times 12.66^2) + (7 \times 11.43^2) = 7 \times (19.88^2 + 12.66^2 + 11.43^2)$. The between sum of squares is then derived as the difference between the total and within sum of squares: BSS = 6406.79 − 4802.94 = 1603.85.

4.2.2 Diurnal Variation in Precipitation

The effects of urban areas on temperature are well known: temperatures are generally higher in cities than in the surrounding countryside (this is known as the urban "heat island" effect). But what about the effects of urban areas on precipitation? One possibility is that the particulate matter ejected by urban factories forms the condensation nuclei necessary for precipitation. If this is correct, one might expect to see diurnal variation in precipitation, since factories are typically idle on the weekends. If there is no lag, precipitation would be lightest on the weekends, and heaviest during the week. I collected the data in Table 4.3 while I was an undergraduate, in conjunction with an assignment in my geography statistics class! For each day of the week, the data are lumped into six-month categories. One consequence of this lumping is to make the assumption of normality more plausible (since sums of variables from any type of distribution tend to be normally distributed). A look at the data reveals that the largest amount of precipitation occurs on Fridays and Sundays, and the least on Tuesdays and Wednesdays. Perhaps there is a roughly two-day lag between the buildup of particulate matter during the week and the precipitation events.

Table 4.3 Precipitation data for LaGuardia airport

Precipitation at LaGuardia airport (inches)

Year          SAT    SUN    MON    TUE    WED    THUR   FRI
1971   II     2.30   6.84   4.47   3.40   0.94   1.71   8.30
1972   I      5.56   6.81   1.97   2.26   3.03   4.42   5.08
1972   II     5.31   1.50   1.74   3.00   5.89   4.16   2.88
1973   I      2.15   4.39   3.96   1.17   4.35   4.78   7.09
1973   II     1.71   4.12   2.87   0.79   3.90   3.11   5.68
1974   I      2.60   2.50   1.68   1.36   0.45   4.03   5.27
Mean          3.27   4.36   2.78   2.00   3.09   3.70   5.72
Std. dev.     1.70   2.18   1.20   1.06   2.08   1.12   1.86
Overall mean: 3.56; Overall std. dev.: 1.90

           Sum of squares   d.f.   Variance
Between:   51.97            6      8.663
Within:    96.34            35     2.753
F = 3.15
F0.05,6,35 = 2.37; F0.01,6,35 = 3.37
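The entries in the lower part of Table 4.3 can be reproduced from the daily columns. The sketch below is an illustration added here, not the author's own calculation; the variable names are assumptions, and the two critical values are simply those quoted beneath the table.

```python
# Six-month precipitation totals (inches) by day of week, from Table 4.3.
precip = {
    "SAT": [2.30, 5.56, 5.31, 2.15, 1.71, 2.60],
    "SUN": [6.84, 6.81, 1.50, 4.39, 4.12, 2.50],
    "MON": [4.47, 1.97, 1.74, 3.96, 2.87, 1.68],
    "TUE": [3.40, 2.26, 3.00, 1.17, 0.79, 1.36],
    "WED": [0.94, 3.03, 5.89, 4.35, 3.90, 0.45],
    "THUR": [1.71, 4.42, 4.16, 4.78, 3.11, 4.03],
    "FRI": [8.30, 5.08, 2.88, 7.09, 5.68, 5.27],
}

all_obs = [x for obs in precip.values() for x in obs]
n_total, k = len(all_obs), len(precip)
grand_mean = sum(all_obs) / n_total

# Between and within sums of squares (Equation 4.3).
bss = sum(len(obs) * (sum(obs) / len(obs) - grand_mean) ** 2
          for obs in precip.values())
wss = sum((x - sum(obs) / len(obs)) ** 2
          for obs in precip.values() for x in obs)

# Mean squares and the F-statistic (Equation 4.4).
ms_between = bss / (k - 1)
ms_within = wss / (n_total - k)
f_stat = ms_between / ms_within

# Critical values as quoted in Table 4.3.
F_CRIT_05, F_CRIT_01 = 2.37, 3.37
print(f"BSS={bss:.2f} WSS={wss:.2f} F={f_stat:.2f}")
print("reject at 0.05:", f_stat > F_CRIT_05)
print("reject at 0.01:", f_stat > F_CRIT_01)
```

Under these inputs the observed F falls between the two critical values, so the decision depends on which significance level is chosen.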


The null hypothesis is that mean precipitation in each six-month period does not vary with day of the week. The results of the analysis of variance, shown in the table, reveal that the null hypothesis is rejected at $\alpha = 0.05$, since the observed F of 3.15 exceeds the critical value of 2.37 (it would not be rejected at $\alpha = 0.01$, where the critical value is 3.37). In the exercises at the end of the chapter, you will be asked to repeat this analysis for Boston and Pittsburgh.

4.3 Analysis of Variance with Two Categories

The analysis of variance with two categories gives the same results as the two-sample t-test. To illustrate, consider once again the first two columns (central city and suburbs) of the swimming frequency data. Analysis of variance yields the following:

\[
\begin{aligned}
\mathrm{BSS} &= 900 \\
\mathrm{WSS} &= 3888 \\
F &= \frac{900/1}{3888/14} = 3.24 \\
F_{.05,1,14} &= 4.60 \\
F_{0.10,1,14} &= 3.10
\end{aligned}
\tag{4.12}
\]

The null hypothesis of no difference is therefore rejected using $\alpha = 0.10$, and accepted using $\alpha = 0.05$. The p-value must be close to, but less than, 0.10. The result is in fact the same as that found in the last chapter using the t-test, under the assumption of equal variances. When there are two categories, either the F-test or the t-test may be used.

4.4 Testing the Assumptions

Since the analysis of variance depends upon a number of assumptions, it is important to know whether these assumptions are satisfied. One way of testing the assumption of homoscedasticity is to use Levene's test. There are a number of ways to test normality. Two common methods are the Kolmogorov–Smirnov test and the Shapiro–Wilk test. Although a detailed discussion of all of these tests is beyond the scope of this text, most statistical software packages do provide these tests to allow researchers to test the underlying assumptions. Some additional details on each are given at the end of this chapter, in Section 4.8.

4.5 The Nonparametric Kruskal–Wallis Test

What do we do if the assumptions are not satisfied? We have at least two options. One is to proceed with the analysis of variance anyway, and "hope" that we get a valid answer. Fortunately, that is often not a bad way to proceed. The F-test is said to be relatively "robust" with respect to deviations from the assumptions of normality and homoscedasticity. This means that the results of the F-test may still be used effectively if the assumptions are at least "reasonably close" to being satisfied. If either (a) the assumptions are close to being satisfied, or (b) the F-statistic yields a "clear" conclusion (for example, a p-value much less than 0.01, or greater than 0.20), the conclusion will generally be acceptable. If the data deviate drastically from the assumptions, or if the p-value is close to the significance level, then an alternative test that does not rely on the assumptions might be considered. Tests that do not make assumptions regarding how the underlying data are distributed are called nonparametric tests. The nonparametric test for two or more categories is the Kruskal–Wallis test.

There is another set of circumstances in which the Kruskal–Wallis test is useful for testing hypotheses about a set of means, namely when only ranked (i.e., ordinal) data are available. In such situations, there is insufficient information to use ANOVA, which requires interval or ratio level data. (With interval and ratio data, the magnitude of the difference between the observations is meaningful.)

The application of the Kruskal–Wallis test begins by ranking the entire pooled set of N observations from lowest to highest. That is, the lowest observation is assigned a rank of 1, and the highest observation is assigned a rank of N. The idea behind the test is that, if the null hypothesis is true, then the sum of the ranks in each column should be about the same. Again, no assumptions about normality and homoscedasticity are required. The test statistic is

\[
H = \frac{12}{N(N + 1)} \left( \sum_{i=1}^{k} \frac{R_i^2}{n_i} \right) - 3(N + 1)
\tag{4.13}
\]

where $R_i$ is the sum of the ranks in category i, and $n_i$ is the number of observations in category i. Under the null hypothesis of no difference in category means, the statistic H has a chi-square distribution with $k - 1$ degrees of freedom. Table A.5 contains critical values for the chi-square distribution.

4.5.1 Illustration: Diurnal Variation in Precipitation

The LaGuardia airport precipitation data are ranked and displayed in Table 4.4. Also shown is the sum of the ranks for each column. Employing Equation 4.13 yields a value of H = 13.17. This is just slightly higher than the critical value of 12.59, and so the null hypothesis of no variation in precipitation by day of the week is rejected at the 0.05 significance level. Note that the hypothesis would have been accepted using $\alpha = 0.01$. The p-value associated with the test is approximately 0.04, meaning that if the null hypothesis were true, a test statistic this high would be observed only 4% of the time.


Table 4.4 Ranked precipitation data for LaGuardia airport

Ranks of observations (1 = lowest; 42 = highest)

Year          SAT    SUN    MON    TUE    WED    THUR   FRI
1971   II     14     40     31     22     3      8      42
1972   I      36     39     11     13     20     30     33
1972   II     35     6      10     19     38     27     18
1973   I      12     29     24     4      28     32     41
1973   II     9      26     17     2      23     21     37
1974   I      16     15     7      5      1      25     34
SUM           122    155    100    65     113    143    205
Kruskal–Wallis statistic: H = 13.17
Critical value: $\chi^2_{0.05,6}$ = 12.59; $\chi^2_{0.01,6}$ = 16.81
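As a check on the arithmetic, Equation 4.13 can be evaluated directly from the column rank sums in Table 4.4. This is a minimal sketch; the function name `kruskal_wallis_h` is an assumption made for illustration and is not taken from any statistical package.

```python
def kruskal_wallis_h(rank_sums, group_sizes):
    """Kruskal-Wallis H from rank sums R_i and group sizes n_i (Equation 4.13)."""
    n_total = sum(group_sizes)
    weighted = sum(r ** 2 / n for r, n in zip(rank_sums, group_sizes))
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# Rank sums for SAT, SUN, MON, TUE, WED, THUR, FRI from Table 4.4.
rank_sums = [122, 155, 100, 65, 113, 143, 205]
group_sizes = [6] * 7

# Sanity check: the ranks 1..42 must sum to N(N + 1)/2 = 903.
assert sum(rank_sums) == 42 * 43 // 2

h = kruskal_wallis_h(rank_sums, group_sizes)
CHI2_CRIT_05 = 12.59  # chi-square critical value with 6 d.f., as quoted in Table 4.4
print(f"H = {h:.2f}; reject at 0.05: {h > CHI2_CRIT_05}")
```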

The reader should compare this with the p-value associated with the ANOVA results. The ANOVA results yielded a p-value just slightly higher than 0.01 (we know this since the observed F-value is just slightly less than the critical F value of 3.37, using $\alpha = 0.01$). This result is a typical one: the Kruskal–Wallis test, though not relying on as many assumptions as the analysis of variance, is not as powerful. That is, it is harder to reject false hypotheses. Thus we would have rejected $H_0$ with ANOVA using, say, $\alpha = 0.02$ or above, whereas we would only have rejected $H_0$ using the Kruskal–Wallis test had we chosen $\alpha = 0.04$ or above.

4.5.2 More on the Kruskal–Wallis Test

If there are values for which the ranks are tied, an adjustment is made to the value of H. Suppose that we have N = 10 original observations, ranked from lowest to highest: 3.2, 4.1, 4.1, 4.6, 5.1, 5.2, 5.2, 5.2, 6.1, and 7.0. There are two sets of tied observations. When the data are assigned ranks, the tied values are each assigned the average rank. Thus the ranks of these ten observations are: 1, 2.5, 2.5, 4, 5, 7, 7, 7, 9, 10. In instances where tied ranks exist, the usual value of H is divided by the quantity

\[
1 - \frac{\sum_i (t_i^3 - t_i)}{N^3 - N}
\tag{4.14}
\]

where $t_i$ is the number of observations tied at a given rank, and the sum is over all sets of tied ranks. In our example containing ten observations, the adjustment is

\[
1 - \frac{(2^3 - 2) + (3^3 - 3)}{10^3 - 10} = 1 - \frac{30}{990} = \frac{32}{33}
\tag{4.15}
\]

The effect of this adjustment is to make H bigger, and therefore to give the Kruskal–Wallis test slightly higher power, since it is then easier to reject false hypotheses.
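The average-rank rule and the tie adjustment of Equation 4.14 can be sketched in a few lines. The helper names `average_ranks` and `tie_adjustment` are assumptions made for this illustration; exact fractions are used so that the result can be compared directly with 32/33.

```python
from collections import Counter
from fractions import Fraction


def average_ranks(values):
    """Rank values from 1..N, giving each set of tied values its average rank."""
    ordered = sorted(values)
    counts = Counter(ordered)
    first_position = {}
    for position, value in enumerate(ordered, start=1):
        first_position.setdefault(value, position)
    # A value occupying positions p .. p + t - 1 receives rank p + (t - 1) / 2.
    return [first_position[v] + (counts[v] - 1) / 2 for v in values]


def tie_adjustment(values):
    """Divisor applied to H when ties are present (Equation 4.14)."""
    n = len(values)
    tie_sum = sum(t ** 3 - t for t in Counter(values).values() if t > 1)
    return 1 - Fraction(tie_sum, n ** 3 - n)


obs = [3.2, 4.1, 4.1, 4.6, 5.1, 5.2, 5.2, 5.2, 6.1, 7.0]
ranks = average_ranks(obs)
adjustment = tie_adjustment(obs)
print(ranks)
print(adjustment)
```

Since the divisor is slightly less than one, dividing H by it always increases H, which is the point made in the paragraph above.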


The formula for the Kruskal–Wallis test appears rather mysterious, and the reader may wish to have a little more understanding of it. A glimpse of insight may be obtained by asking what the value of H would be if all of the observations were equal, and therefore all observations had the same rank. We will assume that there is an equal number of observations ($n = N/k$) per category. With N observations, the sum of the ranks is equal to the sum of the integers, from 1 to N:

\[
S = \sum_{i=1}^{N} i = \frac{N(N + 1)}{2}
\tag{4.16}
\]

(An historical aside that I was told when I was a boy: Gauss, at the age of 7, was punished by his schoolteacher. His punishment consisted of having to find the sum of all of the integers, from 1 to 100. Within a few minutes, he had figured out the formula above, and used it to find the answer (5050), rather than have to carry out the tedious task of actually summing all 100 numbers.)

Now, if all of the observations are tied, the average rank assigned to all N observations is S/N. Furthermore the sum of the ranks in each category i will be

\[
R_i = nS/N = S/k = \frac{N(N + 1)}{2k}
\tag{4.17}
\]

Using this in the first term on the right-hand side of Equation 4.13 yields

\[
\frac{12}{N(N + 1)} \sum_{i=1}^{k} \left( \frac{N(N + 1)}{2k} \right)^2 \frac{k}{N} = \frac{12}{N(N + 1)} \sum_{i=1}^{k} \frac{N(N + 1)^2}{4k} = 3(N + 1)
\tag{4.18}
\]

We have just shown that H is therefore equal to zero when all of the ranks are tied.

4.6 Contrasts

The analysis of variance, as a test for the equality of means, can sometimes leave the analyst with a sense of unfulfillment. In particular, if the null hypothesis is rejected, what have we learned? We've learned that there is significant evidence to conclude that the means are not equal, but we do not know which means differ from one another. We might look at the data and get a feel for which means seem high and which seem low, but it would be nice to have a way of testing to see whether particular combinations of categories had significantly different means. We may, for example, want to know, in an example involving five categories, whether the difference between categories 2 and 5 (that is, $\mu_2 - \mu_5$) differs significantly from zero. Differences that are of interest may involve more than two separate means. For instance, with the precipitation data, we may wish to contrast weekends with weekdays. In that case, we would want to contrast the mean of the first two categories (Saturday and Sunday) with the mean of the last five, weekday categories. This could be represented as

\[
\frac{\mu_{SAT} + \mu_{SUN}}{2} - \frac{\mu_{MON} + \cdots + \mu_{FRI}}{5}
\tag{4.19}
\]

Scheffé (1959) described a formal procedure for contrasting sets of means with one another. A contrast, $\psi$, is defined as a combination of the means. Usually, one defines linear combinations, so

\[
\psi = \sum_{i=1}^{k} c_i \mu_i
\tag{4.20}
\]

where the values of $c_i$ are specified by the analyst in a manner that is consistent with the contrast of interest. In our first example, categories 2 and 5 would be contrasted with each other using the values $c_1 = 0$, $c_2 = 1$, $c_3 = 0$, $c_4 = 0$, and $c_5 = -1$. This choice of coefficients arises from writing

\[
\mu_2 - \mu_5 = (0)\mu_1 + (1)\mu_2 + (0)\mu_3 + (0)\mu_4 + (-1)\mu_5
\tag{4.21}
\]

In the precipitation example, the coefficients for the weekend days would each be equal to 1/2, and the coefficients for the weekdays would each be −1/5. Why? This combination arises from realizing that Equation 4.19 can be written as

\[
\frac{\mu_{SAT} + \mu_{SUN}}{2} - \frac{\mu_{MON} + \cdots + \mu_{FRI}}{5} = (1/2)\mu_{SAT} + (1/2)\mu_{SUN} - (1/5)\mu_{MON} - \cdots - (1/5)\mu_{FRI}
\tag{4.22}
\]

Scheffé's contribution was to show that simultaneous confidence intervals at the $1 - \alpha$ level for all possible contrasts are given by

\[
\hat{\psi} - R\,\hat{\sigma}_{\psi} \le \psi \le \hat{\psi} + R\,\hat{\sigma}_{\psi}
\tag{4.23}
\]

where

\[
R = \sqrt{(k - 1) F_{k-1,\,N-k,\,\alpha}}
\tag{4.24}
\]

and

\[
\hat{\sigma}_{\psi}^2 = \frac{\mathrm{WSS}}{N - k} \sum_{j=1}^{k} \frac{c_j^2}{n_j}
\tag{4.25}
\]


If the original null hypothesis of no difference among the k means is rejected, then there is at least one contrast that is significantly different from zero. If we have one or more contrasts that we wish to test, we may do so using the inequality in (4.23). The null hypothesis that any given contrast is equal to zero is rejected if the confidence interval (4.23) does not contain zero. If the original null hypothesis of no difference among the k means is not rejected, then there are no contrasts that will be found significant.
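To make the procedure concrete, the sketch below evaluates the weekend-versus-weekday contrast for the LaGuardia data, using the rounded means, the within sum of squares, and the critical value $F_{0.05,6,35} = 2.37$ reported in Table 4.3. This calculation does not appear in the original text; the variable names are assumptions, and the interval should be read as an illustration of the Scheffé formulas rather than as a reported result.

```python
import math

# Rounded column means from Table 4.3.
means = {"SAT": 3.27, "SUN": 4.36, "MON": 2.78, "TUE": 2.00,
         "WED": 3.09, "THUR": 3.70, "FRI": 5.72}
# Contrast coefficients: 1/2 for weekend days, -1/5 for weekdays.
coef = {"SAT": 0.5, "SUN": 0.5, "MON": -0.2, "TUE": -0.2,
        "WED": -0.2, "THUR": -0.2, "FRI": -0.2}

n_per_group = 6          # six half-year totals per day
n_total, k = 42, 7
wss = 96.34              # within sum of squares from Table 4.3
f_crit = 2.37            # F(0.05; 6, 35) as quoted with Table 4.3

# Estimated contrast: weighted combination of the sample means.
psi_hat = sum(coef[day] * means[day] for day in means)

# Estimated variance of the contrast.
var_psi = wss / (n_total - k) * sum(c ** 2 / n_per_group for c in coef.values())

# Scheffe multiplier and interval half-width.
r_mult = math.sqrt((k - 1) * f_crit)
half_width = r_mult * math.sqrt(var_psi)
lower, upper = psi_hat - half_width, psi_hat + half_width

print(f"contrast = {psi_hat:.3f}, interval = ({lower:.3f}, {upper:.3f})")
print("significant:", not (lower <= 0 <= upper))
```

The coefficients sum to zero, as they must for a contrast between two sets of means.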

4.6.1 A Priori Contrasts

The contrasts just described are a posteriori or post hoc contrasts, since they occur after the analysis of variance test. But sometimes the analyst may be interested in a particular contrast or set of contrasts before the analysis of variance is carried out, or instead of the analysis of variance altogether. For example, with the swimming data, we may only be interested in whether swimming frequencies among rural residents differ from all the others. With the precipitation data, we may only wish to know whether weekend and weekday magnitudes differ. When contrasts are specified prior to the analysis of variance, confidence intervals are narrower than when they are determined using Equation 4.23, after the fact. For a given contrast, the a priori confidence interval at the $1 - \alpha$ level is

\[
\hat{\psi} - t_{N-k}\,\hat{\sigma}_{\psi} \le \psi \le \hat{\psi} + t_{N-k}\,\hat{\sigma}_{\psi}
\tag{4.26}
\]

where $\hat{\sigma}_{\psi}$ is defined as before, and the critical value of t comes from a t-table with $N - k$ degrees of freedom, using $\alpha/2$ in each tail. If we are interested in more than one contrast, the value of $\alpha$ has to be apportioned among the contrasts of interest. For example, if we were interested in looking at five a priori contrasts, we could use $\alpha = 0.01$ for each of the five contrasts, giving a simultaneous confidence level of 0.95. An example illustrating the use of contrasts is given in Section 4.8.
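A short sketch of the a priori interval follows, applied to the swimming data of Table 4.2 and contrasting the central city with the average of the other two regions. The critical value 2.080 is the standard two-tailed t-table entry for 21 degrees of freedom at $\alpha = 0.05$ and is hard-coded here as an assumption rather than computed; the variable names are likewise illustrative.

```python
import math

groups = {
    "central_city": [38, 42, 50, 57, 80, 70, 32, 20],
    "suburbs": [58, 66, 80, 62, 73, 39, 73, 58],
    "rural": [80, 70, 60, 55, 72, 73, 81, 50],
}
# Contrast: average of suburbs and rural minus central city.
coef = {"central_city": -1.0, "suburbs": 0.5, "rural": 0.5}

n_total = sum(len(obs) for obs in groups.values())
k = len(groups)
means = {name: sum(obs) / len(obs) for name, obs in groups.items()}

# Within sum of squares, pooled over the three regions.
wss = sum((x - means[name]) ** 2 for name, obs in groups.items() for x in obs)

psi_hat = sum(coef[name] * means[name] for name in groups)
var_psi = wss / (n_total - k) * sum(coef[name] ** 2 / len(obs)
                                    for name, obs in groups.items())

T_CRIT = 2.080  # t-table value, 21 d.f., alpha/2 = 0.025 in each tail (assumed)
half_width = T_CRIT * math.sqrt(var_psi)
lower, upper = psi_hat - half_width, psi_hat + half_width

print(f"contrast = {psi_hat:.2f}, interval = ({lower:.2f}, {upper:.2f})")
```

Under these assumptions the interval excludes zero, which agrees with the significant a priori contrast described in Section 4.8.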

4.7 Spatial Dependence

One of the assumptions in ANOVA is that the observations within each category are independent. With spatial data, observations are often dependent, and some adjustment to the analysis should be made. The general effect of spatial dependence will be to render the effective number of observations smaller than the actual number of observations. With an effectively smaller number of observations, results are not as significant as they appear in the F-tests outlined in this chapter. With spatial data, therefore, it is possible that significant findings are due to the spatial dependence among the observations, and not to any real underlying differences in the means of the categories. Griffith (1978) has proposed a spatially adjusted ANOVA model. The details of his model are beyond the scope of this text. Griffith's paper may also be of interest since it contains citations to other studies in geography that use analysis of variance.

4.8 One-Way ANOVA in SPSS for Windows 9.0

4.8.1 Data Entry

As before, there is a separate row for each observation. Using the data in Table 4.2, we will have 24 rows and 2 columns. Again, the second column designates the group number, and now we have added a third value to correspond with the rural region. The data are entered into the Data Editor of SPSS as follows:

38   1
42   1
50   1
57   1
80   1
70   1
32   1
20   1
58   2
66   2
80   2
62   2
73   2
39   2
73   2
58   2
80   3
70   3
60   3
55   3
72   3
73   3
81   3
50   3


4.8.2 Data Analysis and Interpretation

The analysis of variance proceeds in SPSS 9.0 for Windows by clicking on Analyze, then on Compare Means, and then on One-Way ANOVA. Swim (or whatever name is given to the variable in the first column) is then highlighted and moved over into the dialog box entitled Dependent List, and Location is highlighted and moved over into the dialog box entitled Factor. At this point, one can simply click OK to proceed with the analysis, but here we will also click on Options, and then check the boxes entitled Descriptive and Homogeneity of Variance. Also, post hoc contrasts can be made by simply clicking on Post Hoc and then clicking on the box labeled Scheffé. A priori contrasts are chosen by clicking on the box labeled Contrasts. Suppose we wish to contrast swimming frequency in the central city with the average swimming frequency in the other two regions. After choosing Contrasts, click on Polynomial, and leave Linear as the selected polynomial. We then need to specify the contrast coefficients (the c's in Equation 4.20). Here we could use either $c_1 = 1$, $c_2 = -0.5$, and $c_3 = -0.5$, or $c_1 = -1$, $c_2 = 0.5$, and $c_3 = 0.5$. Enter the coefficients one at a time, clicking on Add after each entry. Finally, choose Continue, and then OK. The output that results is shown below.


The first box again provides descriptive information on the variable in each region. Note that the mean frequency among respondents in the rural region is higher (67.625) than that in other regions, and its standard deviation is lower (11.426). The second box gives us the results of a test of the assumption of homoscedasticity. This Levene's test supports the null hypothesis that the variances of the three regions' responses could be equal (since the column headed "Sig." has an entry greater than 0.05) and that we have merely observed sampling variation. Had the p-value associated with this test been less than 0.05, we would have had to take the results of the analysis of variance more cautiously, since one of the underlying assumptions would have been violated.

The next box displays the results of the analysis of variance. The table gives the sums of squares, the mean squares, the degrees of freedom, and the F-statistic. Note that these match the results discussed in Section 4.2.1, with small differences due to rounding error (as they should!). Importantly, the output also includes the p-value associated with the test under the column labelled "Sig.". Since this value is less than 0.05, we reject the null hypothesis, and conclude that there are significant differences in swimming frequencies among the residents of these three regions and that these differences cannot be attributed to sampling variation alone (unless we just happened to get a fairly unusual sample).

The results of the a priori contrasts (not shown) indicate that there is indeed a significant difference between the swimming frequencies in the central city and in other areas. The value of the contrast is 17, which is the mean difference in swimming frequencies (65.63 − 48.63). The significance or p-value is indicated in the last column, and this is less than 0.05 when variances are assumed equal (and equal to .051 when variances are not assumed equal).

Results of the post hoc contrasts indicate that there is one paired difference that is close to being significant: that between central city and rural regions. This is indicated by the p-value (in the column headed "Sig.") of .063. Confidence intervals for the difference in swimming frequencies associated with each paired comparison are also given.

4.8.3 Levene's Test for Equality of Variances

Levene's test of the assumption that the variances of each column of data are equal is actually similar to an analysis of variance test, except that the test is carried out on the absolute value of the data after the column means have been subtracted. Let

\[
z_{ij} = |x_{ij} - \bar{x}_j|
\]

Then Levene's statistic is

\[
L = \frac{\sum_{j=1}^{k} n_j (\bar{z}_j - \bar{z})^2 / (k - 1)}{\sum_{j=1}^{k} \sum_{i=1}^{n_j} (z_{ij} - \bar{z}_j)^2 / \sum_{j=1}^{k} (n_j - 1)}
\]

where there are k categories, $\bar{z}$ is the overall mean of the z's, and $\bar{z}_j$ is the mean of the z's in category j. When the null hypothesis of equal column variances is true, Levene's statistic has an F-distribution, with $(k - 1)$ and $\sum_{i=1}^{k} (n_i - 1)$ degrees of freedom.

Illustration. For the data in Table 4.2 on swimming frequencies, the first step is to subtract the column mean from each observation. Then take the absolute values of the results; these are the z-values. Then the required quantities are

\[
k = 3; \quad \sum_{j=1}^{3} (n_j - 1) = 21
\]

\[
\bar{z} = 11.489; \quad \bar{z}_1 = 15.625; \quad \bar{z}_2 = 9.375; \quad \bar{z}_3 = 9.4675
\]

\[
\sum_{i=1}^{n_1} (z_{i1} - \bar{z}_1)^2 = 812.75; \quad \sum_{i=1}^{n_2} (z_{i2} - \bar{z}_2)^2 = 418.75; \quad \sum_{i=1}^{n_3} (z_{i3} - \bar{z}_3)^2 = 196.81
\]

\[
L = \frac{\left[ 8(15.625 - 11.489)^2 + 8(9.375 - 11.489)^2 + 8(9.4675 - 11.489)^2 \right] / (3 - 1)}{(812.75 + 418.75 + 196.81)/21} = 1.509
\]

This value of L is the same as that shown in the output, and it is not significant since it is less than the critical value of $F_{2,21} = 3.47$. Hence the assumption of homoscedasticity is satisfied.

4.8.4 Tests of Normality: the Shapiro–Wilk Test

One of the assumptions that we should test is whether the data come from a normal distribution. The Shapiro–Wilk test is a particularly good test of normality to use when sample sizes are small. Within SPSS for Windows 9.0, it can be run by using Analyze/Descriptive Statistics/Explore. Before clicking on OK, click on Plots, and check the box that reads "Normality plots with tests". The results will include a p-value for the Shapiro–Wilk (W) statistic. The W statistic is found as

\[
W = \frac{b^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{b^2}{(n - 1)s^2}
\]

where

\[
b = \sum_{i=1}^{k} a_{n-i+1} \left\{ x_{n-i+1} - x_i \right\}
\]

Here $k = n/2$ when n is even, and the x-values have been ordered so that $x_1 < x_2 < \cdots < x_n$. The coefficients $a$ come from a table, and one is provided in Table A.6 in Appendix A. If W is less than its critical value (also taken from a table; see Table A.7), the null hypothesis of normality is rejected. For further details, see Shapiro and Wilk (1965).

Illustration. To determine whether the eight swimming frequencies observed in the rural area could have come from a normal distribution, we use

\[
x_1 = 50; \quad x_2 = 55; \quad x_3 = 60; \quad x_4 = 70; \quad x_5 = 72; \quad x_6 = 73; \quad x_7 = 80; \quad x_8 = 81
\]

From the table, for $n = 8$ and $k = 4$, $a_8 = .6052$, $a_7 = .3164$, $a_6 = .1743$, and $a_5 = .0561$. Thus

\[
b = .6052(81 - 50) + .3164(80 - 55) + .1743(73 - 60) + .0561(72 - 70) = 29.04
\]

and

\[
W = \frac{29.04^2}{(n - 1)s^2} = \frac{843.86}{913.875} = .923
\]

Since this is greater than the critical value of $W_{.05} = .818$, we accept the null hypothesis and conclude that there is not enough evidence to reject the assumption of normality.
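The illustration can be reproduced with a few lines of code. The sketch below takes the four coefficients and the critical value exactly as quoted in the text (they are table look-ups, not computed here); the variable names are assumptions made for this example.

```python
# Rural swimming frequencies from Table 4.2, sorted in ascending order.
x = sorted([80, 70, 60, 55, 72, 73, 81, 50])
n = len(x)

# Coefficients a8, a7, a6, a5 for n = 8, as quoted from Table A.6.
coeffs = [0.6052, 0.3164, 0.1743, 0.0561]

# b pairs the largest with the smallest value, the second largest with the
# second smallest, and so on, weighting each difference by its coefficient.
b = sum(a * (x[n - 1 - i] - x[i]) for i, a in enumerate(coeffs))

# Denominator: sum of squared deviations from the mean, i.e. (n - 1) s^2.
mean_x = sum(x) / n
sum_sq_dev = sum((xi - mean_x) ** 2 for xi in x)

w_stat = b ** 2 / sum_sq_dev
W_CRIT_05 = 0.818  # critical value quoted in the text (Table A.7)

print(f"b = {b:.2f}, W = {w_stat:.3f}")
print("reject normality:", w_stat < W_CRIT_05)
```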

Exercises

1. Using the following data:

Precipitation at Boston Airport (inches)

Year          SAT    SUN    MON    TUE    WED    THUR   FRI
1971   II     0.83   3.14   4.20   1.28   1.16   4.25   2.08
1972   I      4.66   4.15   3.40   1.74   3.91   5.15   5.06
1972   II     3.03   5.80   2.29   3.17   3.50   3.40   3.04
1973   I      3.69   3.72   4.29   2.06   3.04   2.30   4.26
1973   II     2.35   3.62   3.56   2.27   4.46   2.52   3.36
1974   I      3.18   3.28   1.82   3.75   2.07   3.54   2.27

(a) Find the mean and standard deviation for each day of the week.
(b) Use Levene's test to determine whether the assumption of homoscedasticity is justified.
(c) Perform an analysis of variance to test the null hypothesis that precipitation does not vary by day of the week. Show the between and within sum of squares, the observed F-statistic, and the critical F-value.
(d) Repeat the analysis using data for Pittsburgh:

Precipitation at Pittsburgh Airport (inches)

Year          SAT    SUN    MON    TUE    WED    THUR   FRI
1971   II     1.64   5.55   3.19   2.45   1.44   1.07   1.66
1972   I      2.20   3.37   0.78   2.63   2.32   5.57   2.80
1972   II     2.75   1.72   2.34   3.40   3.68   3.48   2.50
1973   I      2.23   4.31   2.02   1.83   4.35   4.07   2.66
1973   II     3.65   2.66   3.95   2.31   1.85   2.63   1.11
1974   I      4.96   3.00   2.61   1.75   2.70   2.45   4.06

2. Assume that an analysis of variance is conducted for a study where there are N = 50 observations and k = 5 categories. Fill in the blanks in the following ANOVA table:

          Sums of squares   Degrees of freedom   Mean square   F
Between   ___               ___                  116.3         ___
Within    2000              ___                  ___
Total     ___               ___

If the critical value of F is 2.42, what is your conclusion regarding the null hypothesis that the means of the categories are equal?

3. What are the assumptions of analysis of variance? What does it mean to say that analysis of variance is relatively robust with respect to deviations from the assumptions? What does it mean to say that the Kruskal–Wallis test is not as powerful as ANOVA?

4. Fill in the blanks in the following analysis of variance table. Then compare the F value with the critical value, using $\alpha = 0.05$.

             Sums of squares   df    Mean square   F
Between SS   34.23             2     ___           ___
Within SS    ___               ___   ___
Total SS     217.34            35

5. Using

\[
\mathrm{WSS} = \sum_{i=1}^{k} (n_i - 1) s_i^2
\]

(a) Find the within sum of squares for the following data:

Toxin levels in shellfish (mg)

Observation   Long Island Sound   Great South Bay   Shinnecock Bay
1             32                  54                15
2             23                  27                18
3             14                  18                19
4             42                  11                21
5             13                  10                28
6             22                  34                9
Mean          24.33               25.67             18.33
Std. dev.     11.08               16.69             6.31
Overall mean 22.78. Overall std. dev. 11.85

(b) Find the value of the test statistic F and compare it with the critical value.
(c) Rank the data (1 = lowest), using the average of the ranks for any set of tied observations. Then find the Kruskal–Wallis statistic

\[
H = \frac{12}{N(N + 1)} \left( \sum_{i=1}^{k} \frac{R_i^2}{n_i} \right) - 3(N + 1)
\]

Then adjust the value of H by dividing it by

\[
1 - \frac{\sum_i (t_i^3 - t_i)}{N^3 - N}
\]

where $t_i$ is the number of observations that are tied for a given set of ranks. Compare this test statistic with the critical value of chi-square, which has $k - 1$ degrees of freedom, to decide whether to accept or reject the null hypothesis.


6. Twelve low-income and twelve high-income individuals are asked about the distance of their last residential move. The following data represent the distances moved, in kilometers:

            Low income   High income
            5            25
            7            24
            9            8
            11           2
            13           11
            8            10
            10           10
            34           66
            17           113
            50           1
            17           3
            25           5
Mean        17.17        23.17
Std. dev.   13.25        33.45

Test the null hypothesis of homogeneity of variances by forming the ratio $s_1^2 / s_2^2$, which has an F-ratio with $n_1 - 1$ and $n_2 - 1$ degrees of freedom. Then use the appropriate F-test. Set up the null and alternative hypotheses, choose a value of alpha and a test statistic, and test the null hypothesis. What assumption of the test is likely not satisfied?

7. Are the confidence intervals associated with a priori contrasts in ANOVA narrower or wider than a posteriori contrasts? Why? Which would be more powerful in rejecting the null hypothesis that the contrast was equal to zero?

8. A study groups 72 observations into nine groups, with eight observations in each group. The study finds that the variance among the 72 observations is 803. Complete the following ANOVA table:

Sums of squares   df    Mean square   F
6000              ___   ___           ___
___               ___   ___
___

If the critical value of F is 2.8, what do you conclude about the hypothesis that the means of all groups are equal? What can you conclude about the p-value?

9. A sample is taken of incomes in three neighborhoods, yielding the following data:

Neighborhood   A      B      C      Overall (combined sample)
n              12     10     8      30
mean           43.2   34.3   27.2   35.97
std. dev.      36.2   20.3   21.4   29.2


Use analysis of variance to test the null hypothesis that the means are equal.

10. Use the Kruskal–Wallis test to determine whether you should accept the null hypothesis that the means of the following four columns of data are equal:

Col 1   Col 2   Col 3   Col 4
23.1    43.1    56.5    10002.3
13.3    10.2    32.1    54.4
15.6    16.2    43.3    8.7
1.2     0.2     24.4    54.4

11. A researcher is interested in differences in travel behavior for residents living in four different regions. From a sample of size 48 (12 in each region), she finds that the mean commuting distance is 5.2 miles, and that the standard deviation is 3.2 miles. What is the total sum of squares? Suppose that the standard deviations for each of the four regions are 2.8, 2.9, 3.3, and 3.4. What is the within sum of squares? Fill in the table:

Sum of squares   df    Mean square   F
___              ___   ___           ___
___              ___   ___
___


Suppose the critical value of F is 2.7. Do you accept or reject the null hypothesis?

12. A researcher wishes to know whether distance travelled to work varies by income. Eleven individuals in each of three income groups are surveyed. The resulting data are as follows (in commuting miles, one-way):

               Income
Observations   Low   Medium   High
1              5     10       8
2              4     10       11
3              1     8        15
4              2     6        19
5              3     5        21
6              10    3        7
7              6     16       7
8              6     20       4
9              4     7        3
10             12    3        17
11             11    2        18

Use analysis of variance to test the hypothesis that commuting distances do not vary by income. Also evaluate (using, e.g., the Levene test) the assumption of homoscedasticity. Finally, lump all of the data together and produce a histogram, and comment on whether the assumption of normality appears to be satisfied.

13. Data are collected on automobile ownership by surveying residents in central cities, suburbs, and rural areas. The results are:

                      Central cities   Suburbs   Rural areas
No. of observations   10               15        15
Mean                  1.5              2.6       1.2
Std. dev.             1.0              1.1       1.2
Overall mean = 1.725. Overall std. dev. = 1.2.

Test the null hypothesis that the means are equal in all three areas.

5 Correlation

LEARNING OBJECTIVES

• Understanding the nature of the relationship between two variables
• Understanding the effects of sample size on tests of significance
• Alternative tests of correlation when assumptions are not reasonable

5.1 Introduction and Examples of Correlation

One of the most common objectives of researchers is to determine whether two variables are associated with one another. Does patronage of a public facility vary with income? Does interaction vary with distance? Do housing prices vary with accessibility to major highways? Researchers are interested in how variables co-vary. The concept of covariance is a straightforward extension of the concept of variance. Whereas the variance is the expected or average value of the squared deviation of observations on a single variable from their mean, the covariance is the expected or average value of the product of the two variables' (say X and Y) respective deviations from their means:

$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] \tag{5.1}$$

The reader may note that, using an argument outlined in Appendix B, Equation 5.1 may be rewritten

$$E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X \mu_Y \tag{5.2}$$
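For readers without Appendix B to hand, the step from Equation 5.1 to Equation 5.2 can be sketched as follows. The sketch assumes only the linearity of expectation and the definitions $\mu_X = E[X]$ and $\mu_Y = E[Y]$; it is offered as an aid and is not a reproduction of the appendix.

```latex
\begin{align*}
E[(X - \mu_X)(Y - \mu_Y)]
  &= E[XY - \mu_Y X - \mu_X Y + \mu_X \mu_Y] \\
  &= E[XY] - \mu_Y E[X] - \mu_X E[Y] + \mu_X \mu_Y \\
  &= E[XY] - \mu_X \mu_Y .
\end{align*}
```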

Alternatively, based upon Equation 5.2, the sample covariance may be found using

$$\mathrm{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y} \tag{5.3}$$

Note that the covariance of a variable X with itself, Cov(X, X), is equal to the variance of X. In practice, the covariance may be found by taking the average of the products of the deviations from the means (although using $n - 1$ instead of $n$ in the denominator, as is the case with the variance):

$$\mathrm{Cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \tag{5.4}$$
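To make the difference between Equations 5.3 and 5.4 concrete, the following Python sketch computes both forms for a small set of made-up values. The data and the function names are illustrative assumptions and do not come from the text.

```python
def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)


def covariance_n(x, y):
    """Equation 5.3: average of the products minus the product of the means."""
    n = len(x)
    return sum(xi * yi for xi, yi in zip(x, y)) / n - mean(x) * mean(y)


def covariance_n_minus_1(x, y):
    """Equation 5.4: summed products of deviations, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


# Hypothetical illustrative data (not from the text).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_n = covariance_n(x, y)                   # divides by n
cov_n_minus_1 = covariance_n_minus_1(x, y)   # divides by n - 1
var_x = covariance_n_minus_1(x, x)           # Cov(X, X) is the variance of X

print(cov_n, cov_n_minus_1, var_x)
```

The two covariances differ only by the factor n/(n − 1), which becomes negligible as the sample size grows.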

The covariance of X and Y may be negative or positive. The covariance will be positive if most of the points (x, y) lie along a line with positive slope when they are plotted. The covariance will be negative when the plotted points lie along a line with negative slope. Figure 5.1 depicts points in the (x, y) plane, where the axes are centered at $(\bar{x}, \bar{y})$. It demonstrates that points lying in quadrants I and III will contribute positively to the covariance, and points in quadrants II and IV will contribute negatively to the covariance. The magnitude of the covariance will depend upon the units of measurement. The covariance may be standardized, so that its values lie in the range from −1 to +1, by dividing by the product of the standard deviations. This standardized covariance is known as Pearson's correlation coefficient. The correlation coefficient provides a standardized measure of the linear association between two variables. Its theoretical value is

$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} \tag{5.5}$$

where $\sigma_X$ and $\sigma_Y$ refer to the standard deviations of variables X and Y in the population. The sample correlation coefficient, r, may be found from

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y} \tag{5.6}$$

where $s_x$ and $s_y$ are the sample standard deviations of variables x and y, respectively. This is known as Pearson's correlation coefficient. Note that this is equal to

$$r = \frac{\sum_{i=1}^{n} z_x z_y}{n - 1} \tag{5.7}$$

where $z_x$ and $z_y$ are the z-scores associated with x and y, respectively. It is important to note that the correlation coefficient is a measure of the strength of the linear association between variables. As Figure 5.2 demonstrates, it is possible to have a strong, nonlinear association between two variables and yet have a correlation coefficient close to zero. One implication of this is that it is important to plot data (the term scatterplot is often used to refer to graphs such as Figures 5.1 and 5.2, where each observation is represented by a point in the plane, and where the two axes represent the levels of the two variables), since potential associations between the variables might be revealed in those cases where the value of r is low.
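The equivalence of Equations 5.6 and 5.7, and the point that r measures only linear association, can be checked with a short Python sketch. Both data sets below are hypothetical: the first is roughly linear, and the second follows an exact quadratic relationship, yet yields a correlation of zero.

```python
import math


def sample_std(values):
    """Sample standard deviation, using n - 1 in the denominator."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def pearson_r(x, y):
    """Equation 5.6: summed cross-products over (n - 1) * s_x * s_y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross / ((n - 1) * sample_std(x) * sample_std(y))


def pearson_r_from_z(x, y):
    """Equation 5.7: summed products of z-scores, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_x, s_y = sample_std(x), sample_std(y)
    z_x = [(xi - x_bar) / s_x for xi in x]
    z_y = [(yi - y_bar) / s_y for yi in y]
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)


# Hypothetical data with a roughly linear, positive association.
x_linear = [1, 2, 3, 4, 5]
y_linear = [2, 4, 5, 4, 5]

# Hypothetical data with an exact but nonlinear (quadratic) association.
x_curved = [-3, -2, -1, 0, 1, 2, 3]
y_curved = [xi ** 2 for xi in x_curved]

r_linear = pearson_r(x_linear, y_linear)
r_linear_z = pearson_r_from_z(x_linear, y_linear)
r_curved = pearson_r(x_curved, y_curved)

print(round(r_linear, 3), round(r_linear_z, 3), abs(round(r_curved, 3)))
```

In the quadratic case y is completely determined by x, but the positive and negative cross-products cancel, which is why a scatterplot is a useful companion to the coefficient.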


[Figure: two scatterplots, (a) and (b), each with axes x and y and quadrants labeled I to IV]

Figure 5.1 Scatterplots illustrating (a) positive and (b) negative correlation

[Figure: scatterplot with axes x and y, annotated r ≈ 0]

Figure 5.2 Nonlinear relationship with r approximately 0

It is also important to realize that the existence of a strong linear association does not necessarily imply that there is a causal connection between the two variables. A strong correlation was once found between British coal production and the death rate of penguins in the Antarctic, but it would be a stretch of the imagination to connect the two in any direct way! Changes in both British coal production and the death rate of penguins happened to go in the same direction over a period of time, but this does not necessarily imply a causal connection between the two. Another article once pointed out the strong connection between the annual number of tornados and the volume of automobile traffic in the United States. The claim, presumably in jest, was that both the number of tornados and the volume of automobile traffic had steadily increased in the United States throughout the twentieth century. If years were used as observations on the x-axis, with the number of tornados on the y-axis, a very strong positive correlation would be observed; years with many tornados would coincide with years with a high volume of traffic. Strong linear relationships often prompt deep thought about possible explanations, and in this case an explanation was offered. The correlation was deemed to be due to the fact that Americans drive on the right-hand side of the road! As cars pass one another, counterclockwise movements of air are generated, and we all know that counterclockwise movements of air are associated with low pressure systems. Some of these low pressure systems spawn tornados. Increasing traffic, then, would understandably lead to more tornados. Furthermore, since the British drive on the left, it should come as no surprise that there are not many tornados there! (Though I doubt one could claim that they have the great weather one would expect from the high pressure systems created by the clockwise movement of traffic-generated air currents!) A better explanation of the relationship is that the two have increased over time for very different reasons. The increase in the number of tornados is likely due to the simple fact that the weather observation network is better than it used to be. The search for a causal relationship is an important one, but the effort may sometimes be carried too far!

5.2 More Illustrations

5.2.1 Mobility and Cohort Size

Easterlin (1980) has suggested that young adults who are members of a large cohort (like the baby boom) will face a more difficult time in labor and housing markets. For these cohorts, the supply of people relative to the number of job and housing opportunities is relatively high. Consequently, there will be a tendency for mortgage rates and unemployment to be higher when large cohorts pass through their young adult years. Similarly, mortgage and unemployment rates will tend to be lower when small cohorts reach their twenties and thirties. Rogerson (1987) has extended this argument to hypothesize that large cohorts of young adults will exhibit lower mobility rates, since the cohort's opportunities for changing residence will be limited by the relatively inferior state of the labor and housing markets. The mobility rate is measured as the percentage of individuals changing residence during a one-year period, and the size of the young adult cohort is measured by the fraction of the total population in a specified young adult age group. Data on these variables for the period 1948–84 are presented in Table 5.1. For 20–24 year olds, $n = 28$, and the correlation coefficient between mobility rate and cohort size is equal to $-0.747$; for 25–29 year olds, $r = -0.805$. For the 20–24 year olds, the cross-product $\sum (x_i - \bar{x})(y_i - \bar{y})$ (i.e., the numerator of the covariance) is $-0.694$. Dividing by $n - 1 = 27$ yields a covariance of $-0.0257$. The standard deviations for x and y are 3.371 and 0.0102, respectively. Dividing the covariance by the product of these standard deviations, as in Equation 5.6, yields the correlation coefficient of $-0.747$. The data for 20–24 year olds are graphed in Figure 5.3, where the negative relationship between the variables is apparent.

Table 5.1 U.S. mobility data, 1948–1984

           Mobility rate       Fraction of total population
Year       20–24     25–29     20–24     25–29
1948–49    35.0      *         .0804     .0829
1949–50    34.0      *         .0784     .0821
1950–51    37.7      33.6      .0767     .0812
1951–52    37.8      31.6      .0746     .0794
1952–53    40.5      33.4      .0720     .0774
1953–54    38.1      30.5      .0757     .0762
1954–55    41.8      31.3      .0735     .0758
1955–56    44.5      32.3      .0713     .0750
1956–57    41.2      32.0      .0694     .0734
1957–58    42.6      34.6      .0671     .0715
1958–59    42.5      33.2      .0645     .0701
1959–60    41.2      32.1      .0617     .0682
1960–61    43.6      34.4      .0616     .0605
1961–62    43.2      33.0      .0625     .0592
1962–63    42.0      34.6      .0641     .0582
1963–64    43.4      35.2      .0672     .0580
1964–65    45.0      35.8      .0692     .0528
1965–66    42.4      35.5      .0708     .0584
1966–67    41.0      33.0      .0715     .0593
1967–68    41.5      33.2      .0767     .0609
1968–69    42.5      32.6      .0787     .0638
1969–70    41.8      32.6      .0813     .0658
1970–71    41.2      32.4      .0839     .0669
1975–76    38.0      32.6      .0897     .0790
1980–81    36.8      30.1      .0943     .0859
1981–82    35.5      30.0      .0949     .0868
1982–83    33.7      29.8      .0939     .0892
1983–84    34.1      30.1      .0928     .0904

Note: Data unavailable for missing years. Source: Rogerson (1987).

[Figure: scatterplot of mobility rate (per thousand) against % population aged 20–24]

Figure 5.3 Correlation of mobility with cohort size. Source: Plane and Rogerson (1991)

5.2.2 Statewide Infant Mortality Rates and Income

As part of an assignment back in my graduate school days, I decided to investigate geographic variation in infant mortality rates in the United States. The data I collected were at the state level. I was interested in understanding whether infant mortality rates varied with factors such as educational attainment, income, access to health care, etc. As part of my analysis, I graphed the relationship between infant mortality rates and personal income for the white population. The graph is shown in Figure 5.4. Most of the states fall close to a line with negative slope, ranging from states such as Mississippi and Kentucky (with low values of personal income and high infant mortality rates) to states such as Connecticut, where the statewide personal income was high and mortality rates were low. Pearson's correlation coefficient for the 50 states is equal to −0.28. Notice, though, the presence of six states that have infant mortality levels above the level expected, given the personal income in the state (TX, CO, AZ, WY, NM, and NV). Cases such as this that do not fit the general trend are known as outliers. It is interesting to note that these states form a compact geographic cluster. Additional inspection of the data had raised the possibility that these six states had a relatively low number of physicians per 100 000 statewide residents. A two-tailed t-test confirmed that indeed these six states had a significantly lower number of physicians per 100 000, relative to the other states.


[Figure: scatterplot of the states, labeled by postal abbreviation, with white infant mortality rates (17 to 33) on the vertical axis and median personal income (2400 to 5200) on the horizontal axis]

Figure 5.4 White infant mortality rates as a function of median personal income

The treatment of outliers depends upon the circumstances. A good underlying understanding of why particular points are outliers provides some rationale for removing those points from the analysis. In this case, we have a reasonably good explanation for the outliers, and we are justified in asking what the correlation would be without the outliers (it is $r = -0.64$, a much stronger negative relationship than was found with the original 50 observations). Of course we would not want to get in the habit of plotting variables and arbitrarily eliminating those points that do not fall close to the line, just so that we can report a high value of r. But it is good practice to plot the data and think carefully about the reasons for any outliers.

5.3 A Significance Test for r

To test the null hypothesis that the true correlation coefficient, $\rho$, is equal to zero, the data for each variable are assumed to come from normal distributions. If this assumption is satisfied, the test may be carried out by forming the t-statistic

$$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} \tag{5.8}$$

If the null hypothesis is true, this statistic has a t-distribution with $n - 2$ degrees of freedom.

5.3.1 Illustration

The data in Table 5.1 for the $n = 28$ observations for 20–24 year olds yield a correlation coefficient of $r = -0.747$. For the null hypothesis $H_0: \rho = 0$ and a two-tailed test with $\alpha = 0.05$, the t-statistic is

$$t = \frac{-0.747 \sqrt{28 - 2}}{\sqrt{1 - (-0.747)^2}} = -5.73 \tag{5.9}$$

A t-table reveals that the critical values of t, using $\alpha = 0.05$ in a two-tailed test with 26 degrees of freedom, are $\pm 2.056$. Since $-5.73$ lies below $-2.056$, the null hypothesis that the correlation coefficient is zero is rejected.
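The illustration can be reproduced from the raw figures. The sketch below is a minimal Python version that recomputes r from the 20–24 columns of Table 5.1 and then applies Equation 5.8. The function names are hypothetical, and the critical value is simply copied from the text's t-table lookup, not computed.

```python
import math

# Table 5.1, 20-24 year olds: mobility rate (%) and fraction of total population.
mobility_rate = [
    35.0, 34.0, 37.7, 37.8, 40.5, 38.1, 41.8, 44.5, 41.2, 42.6,
    42.5, 41.2, 43.6, 43.2, 42.0, 43.4, 45.0, 42.4, 41.0, 41.5,
    42.5, 41.8, 41.2, 38.0, 36.8, 35.5, 33.7, 34.1,
]
cohort_fraction = [
    .0804, .0784, .0767, .0746, .0720, .0757, .0735, .0713, .0694, .0671,
    .0645, .0617, .0616, .0625, .0641, .0672, .0692, .0708, .0715, .0767,
    .0787, .0813, .0839, .0897, .0943, .0949, .0939, .0928,
]


def pearson_r(x, y):
    """Equation 5.6, written with sums of squares in place of s_x and s_y."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    return cross / math.sqrt(ss_x * ss_y)


def t_statistic(r, n):
    """Equation 5.8: t = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


n = len(mobility_rate)
r = pearson_r(mobility_rate, cohort_fraction)
t = t_statistic(r, n)

# Two-tailed critical value for alpha = 0.05 and 26 df, taken from the text.
T_CRITICAL = 2.056
reject_null = abs(t) > T_CRITICAL

print(n, round(r, 3), round(t, 2), reject_null)
```

Writing the denominator as the square root of the product of the two sums of squares is algebraically the same as $(n - 1) s_x s_y$ and avoids computing the standard deviations separately.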

5.4 The Correlation Coefficient and Sample Size

An extremely important point is that the significance of a correlation coefficient is influenced by sample size. It is far easier to reject the null hypothesis that $\rho = 0$ with a large sample size than it is with a small sample size. To see this, compare the situation where $r = 0.4$ with a sample size of $n = 11$, and the situation where $r = 0.4$ with a sample size of $n = 38$. In the former case, the observed t-statistic is 1.31, which is less than the critical value $t_{0.05,9} = 2.262$, and the null hypothesis is accepted. In the latter case, when $n = 38$, the t-statistic is 2.62, and the null hypothesis is rejected since the t-statistic is greater than the critical value, $t_{0.05,36} = 2.03$. One of the implications of this is that there really should be no popular rules of thumb that are invoked to decide whether r is sufficiently high to make the researcher happy about the level of correlation. Such rules of thumb do seem to exist: for instance, an r value of 0.7 or 0.8 may be taken as important or significant. But, as we have just seen, whether a correlation coefficient is truly significant depends upon the sample size. Thus, when the researcher is working with large data sets, a relatively low value of r should not be as disappointing as that same value of r when the sample size is smaller. A value of $r = 0.4$ could be quite meaningful if $n = 1000$, and the researcher should not necessarily throw the results out the window just because the r-value is noticeably less than 1 and perhaps less than some arbitrary, rule-of-thumb value such as $r = 0.8$. Table 5.2 gives, for various values of n, the minimum absolute value of r to achieve significance. For example, with a sample size of $n = 50$, any value of r greater than 0.279 or less than −0.279 would be found significant using the t-test described above. The reader will note how even quite small values of r are significant when the sample size is only modestly large. For values of $n > 30$, the quantity $2/\sqrt{n}$ is equal to the approximate absolute value of r that is needed for significance using $\alpha = 0.05$. For example, if $n = 49$, a correlation coefficient with absolute value greater than $2/\sqrt{49} = 0.286$ would be significant.

Table 5.2 Minimum values of r required for significance

Sample size, n    Minimum absolute value of r needed to attain significance (using α = 0.05)
15                .514
20                .444
30                .361
50                .279
100               .197
250               .124

For large n, $r_{crit}$ is approximately $2/\sqrt{n}$.

While we have just argued that we should not too hastily discard the results of an analysis because of a seemingly low correlation (since with a large sample size that correlation may be significantly different from zero), there is also some concern about attaching too much importance to the results of a significance test. Meehl (1990) has noted that with many data sets there is a strong tendency to find that "everything correlates to some extent with everything else" (p. 204). This is sometimes referred to as the "crud factor." There is no particular reason for believing that the correlation between any two variables chosen from most data sets should be exactly zero, and therefore, if the sample size is large enough, we will be able to reject the null hypothesis that they are unrelated. For example, Standing et al. (1991) find that in a data set containing 135 variables related to the educational and personal attributes of 2058 individuals, the typical variable exhibited a significant correlation with 41% of the other variables. The extreme case was the variable measuring Grade 5 mathematics scores: it was significantly correlated with 76% of the other variables, leading the authors to conclude that "the number of statistically significant possible 'causes' of mathematics achievement available to the unbridled theorizer will almost be as large" (p. 125).
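The dependence of significance on sample size discussed in this section can be explored numerically. The following Python sketch evaluates Equation 5.8 for a fixed r at two sample sizes and compares the large-sample approximation $2/\sqrt{n}$ with the entries of Table 5.2. The function names are hypothetical, and the tabled values are copied from the text, not recomputed from a t-distribution.

```python
import math


def t_statistic(r, n):
    """Equation 5.8: t = r * sqrt(n - 2) / sqrt(1 - r^2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def approx_critical_r(n):
    """Large-sample approximation from the text: |r| must exceed about 2 / sqrt(n)."""
    return 2 / math.sqrt(n)


# Table 5.2: minimum |r| needed for significance at alpha = 0.05.
TABLE_5_2 = {15: 0.514, 20: 0.444, 30: 0.361, 50: 0.279, 100: 0.197, 250: 0.124}

# The same correlation gives very different t-statistics at different n.
t_small = t_statistic(0.4, 11)
t_large = t_statistic(0.4, 38)
print(round(t_small, 2), round(t_large, 2))

# Compare the approximation with the tabled critical values.
gaps = {}
for n, tabled in TABLE_5_2.items():
    approx = approx_critical_r(n)
    gaps[n] = approx - tabled
    print(n, tabled, round(approx, 3))
```

The approximation is slightly generous at every tabled sample size, so it is best treated as a quick screen, not a substitute for the t-test.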

5.5 Spearman's Rank Correlation Coefficient

In situations where only ranked data are available, or where the assumption of normality required for the test of $H_0: \rho = 0$ is not satisfied, it is appropriate to use Spearman's rank correlation coefficient. As the name implies, this measure of correlation is based only upon the ranks of the data. Two separate sets of ranks are developed, one for each variable. A rank of 1 is assigned to the lowest value and a rank of n to the highest observation in each column. Spearman's rank correlation coefficient, $r_S$, is

$$r_S = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n^3 - n} \tag{5.10}$$

where $d_i^2$ is the squared difference between the ranks for observation i, and n is the sample size. The mobility data for 20–24 year olds for the period 1950–1984 are repeated in Table 5.3, with ranks adjacent to the mobility and cohort size variables. Note that tied ranks are treated by replacing them with the average of the tied ranks. The differences in ranks, $d_i$, are given in the last column. For these data, $r_S = 1 - 6(5045.5)/(26^3 - 26) = -0.725$. To test hypotheses, we may use the fact that the quantity $r_S \sqrt{n - 1}$ has a t-distribution with $n - 1$ degrees of freedom. In our example, the observed value of t is therefore $-0.725 \sqrt{25} = -3.63$. This is less than the critical value $t_{0.05,25} = -2.06$, and so the null hypothesis of no association is rejected. (Technically, when there are tied ranks, Equation 5.10 should not be used; instead one calculates Spearman's correlation coefficient by calculating Pearson's correlation coefficient, using the ranks as observations.)

Table 5.3 U.S. mobility data, 1948–84, with ranks

           Mobility rate       Fraction of total population
Year       20–24     Rank      20–24     Rank      d_i
1950–51    37.7      5         .0767     17.5      −12.5
1951–52    37.8      6         .0746     15        −9
1952–53    40.5      9         .0720     13        −4
1953–54    38.1      8         .0757     16        −8
1954–55    41.8      15.5      .0735     14        1.5
1955–56    44.5      25        .0713     11        14
1956–57    41.2      12        .0694     9         3
1957–58    42.6      21        .0671     6         15
1958–59    42.5      19.5      .0645     5         14.5
1959–60    41.2      12        .0617     2         10
1960–61    43.6      24        .0616     1         23
1961–62    43.2      22        .0625     3         19
1962–63    42.0      17        .0641     4         13
1963–64    43.4      23        .0672     7         16
1964–65    45.0      26        .0692     8         18
1965–66    42.4      18        .0708     10        8
1966–67    41.0      10        .0715     12        −2
1967–68    41.5      14        .0767     17.5      −3.5
1968–69    42.5      19.5      .0787     19        0.5
1969–70    41.8      15.5      .0813     20        −4.5
1970–71    41.2      12        .0839     21        −9
1975–76    38.0      7         .0897     22        −15
1980–81    36.8      4         .0943     25        −21
1981–82    35.5      3         .0949     26        −23
1982–83    33.7      1         .0939     24        −23
1983–84    34.1      2         .0928     23        −21

Source: Plane and Rogerson (1991).
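The ranking step and Equation 5.10 can be sketched in Python as follows, using the 26 observations of Table 5.3. The helper names are hypothetical. Tied values receive the average of the ranks they span, as described above, and the code applies Equation 5.10 directly instead of the Pearson-on-ranks refinement mentioned in the closing parenthesis.

```python
def average_ranks(values):
    """Rank from 1 (lowest) to n (highest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend `end` to the last position in a run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rs(x, y):
    """Equation 5.10: r_S = 1 - 6 * sum(d_i^2) / (n^3 - n)."""
    n = len(x)
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n ** 3 - n), sum_d2


# Table 5.3: mobility rate and cohort fraction for 20-24 year olds, 1950-51 onward.
mobility_rate = [
    37.7, 37.8, 40.5, 38.1, 41.8, 44.5, 41.2, 42.6, 42.5, 41.2, 43.6, 43.2, 42.0,
    43.4, 45.0, 42.4, 41.0, 41.5, 42.5, 41.8, 41.2, 38.0, 36.8, 35.5, 33.7, 34.1,
]
cohort_fraction = [
    .0767, .0746, .0720, .0757, .0735, .0713, .0694, .0671, .0645, .0617, .0616,
    .0625, .0641, .0672, .0692, .0708, .0715, .0767, .0787, .0813, .0839, .0897,
    .0943, .0949, .0939, .0928,
]

r_s, sum_d2 = spearman_rs(mobility_rate, cohort_fraction)
print(sum_d2, round(r_s, 3))
```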


5.6 Additional Topics

5.6.1 Confidence Intervals for Correlation Coefficients

We saw above that a t-statistic may be used to test the hypothesis that $\rho = 0$. One might suppose that the fact that this statistic has a t-distribution could be used to create a confidence interval for $\rho$ in much the same way that confidence intervals are created for means or differences between means. Thus one might contemplate using the observed value of Pearson's r as follows:

$$r - t_{\alpha, n-2}\, \hat{\sigma}_r \le \rho \le r + t_{\alpha, n-2}\, \hat{\sigma}_r \tag{5.11}$$

where

$$\hat{\sigma}_r = \frac{\sqrt{1 - r^2}}{\sqrt{n - 2}} \tag{5.12}$$

The problem with this idea is that, in general, the sampling distribution of r is not symmetric, and therefore the confidence intervals will not be accurate. A more accurate confidence interval may be constructed by first transforming the value of r into a variable that has an approximately normal distribution. The quantity y is derived as follows:

$$y = \frac{1}{2} \ln \frac{1 + r}{1 - r} = 1.151 \log_{10} \frac{1 + r}{1 - r} \tag{5.13}$$

This variable has a normal distribution with standard deviation equal to

$$\sigma_y = \frac{1}{\sqrt{n - 3}} \tag{5.14}$$

and so a 95% confidence interval for the quantity y is

$$y \pm z_{0.05}\, \sigma_y = y \pm \frac{1.96}{\sqrt{n - 3}} \tag{5.15}$$

The endpoints of this confidence interval can then be transformed back into values of r using Equation 5.13. More specifically, Equation 5.13 can be solved for r:

$$r = \frac{e^{2y} - 1}{e^{2y} + 1} \tag{5.16}$$

To illustrate, let us place a confidence interval around the observed correlation of −0.747 between mobility rates and cohort size (where $n = 28$). We have

$$y = \frac{1}{2} \ln \frac{1 - 0.747}{1 + 0.747} = -0.966 \tag{5.17}$$

The lower and upper limits of y are therefore

$$-0.966 - \frac{1.96}{\sqrt{28 - 3}} = -1.358 \quad \text{and} \quad -0.966 + \frac{1.96}{\sqrt{28 - 3}} = -0.574 \tag{5.18}$$

Equation 5.16 is then solved twice for the limits of the confidence interval for $\rho$, using each of these two values of y. We have

$$-0.876 \le \rho \le -0.518 \tag{5.19}$$

Note that these limits are not symmetric around the observed value of −0.747.

5.6.2 Differences in Correlation Coefficients

The transformation of r into the quantity y is also used when comparing correlation coefficients with one another. Suppose we wish to know whether the correlation between two variables observed in one year is significantly different from the correlation observed in another year. The null hypothesis of no difference in the correlations, $H_0: \rho_1 = \rho_2$, could be tested by converting the observed correlations $r_1$ and $r_2$ to y values, and then using the z-statistic

$$z = \frac{y_1 - y_2}{\sigma_{y_1 - y_2}} \tag{5.20}$$

where

$$\sigma_{y_1 - y_2} = \sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}} \tag{5.21}$$

Is there a significant difference between the mobility/cohort size correlation coefficients for 20–24 year olds and 25–29 year olds? The transformed values, y, corresponding to the r-values of −0.747 and −0.805 are −0.966 and −1.113, respectively. The z-statistic is then

$$z = \frac{-0.966 - (-1.113)}{\sqrt{1/(28 - 3) + 1/(26 - 3)}} = 0.51 \tag{5.22}$$

Comparing this with the critical value of $z_{0.05} = 1.96$ used in a two-sided test implies that we accept the null hypothesis of no difference and conclude that the two correlation coefficients are not significantly different.

5.6.3 The Effect of Spatial Dependence on Significance Tests for Correlation Coefficients

The tests of significance outlined for both Pearson's r and Spearman's $r_S$ assume that the observations of x are independent and that the values of y are also independent. When the x and y variables come from spatial locations, this assumption of independence may not be satisfied. Indeed, one of the most important points in this book is that spatial data often exhibit dependence: the value of x in one location is often related to the value of x in nearby locations. In turn, spatial dependence affects the outcome of statistical tests, and this point should always be borne in mind when interpreting statistical results. When spatial dependence is present and not accounted for, the variance of the correlation coefficient under the null hypothesis of no correlation is underestimated. When repeated samples are taken from spatially dependent data, and the x and y values follow the null hypothesis of no correlation between x and y, the frequency distribution of r values will look like the dotted line in Figure 5.5. The dotted line has a wider frequency distribution than the solid line. The solid line corresponds to the variability in r that is assumed to be present when standard significance tests such as Equation 5.8 are applied. The critical values associated with the standard statistical tests (b and c) are lower in absolute value than those which should be used (a and d). When sample values of r fall in the shaded region, the standard statistical tests will incorrectly imply that the correlation coefficient is significantly different from zero. Correlation coefficients falling in the shaded region are likely not significant, and may be the result of the underlying spatial dependence exhibited by the x and y variables. Haining (1990a) states this as follows:

    The important issue here is not to use conventional procedures to test for the significance of the correlation coefficient, and to recognize that a large r (or $r_S$) value may be due to spatial correlation [i.e. dependence] effects . . . The risks of inferring association between variables that is nothing other than the products of the spatial characteristics of the system are real and call for caution on the part of the user (p. 321).

[Figure: two frequency curves for r centered on 0, with relative frequency on the vertical axis and the points a, b, 0, c, and d marked on the horizontal axis]

Figure 5.5 Assumed (solid line) and actual (dotted line) variability in r when spatial dependence is present but not accounted for

To see the effects of spatial dependence on correlation tests, consider the following model for the value of a variable a, at location $(x_1, y_1)$, applied to the interior cells of a region that has been subdivided into a grid of square cells:

$$a_{x_1, y_1} = \mu + 4\rho(\bar{a} - \mu) + \varepsilon_{x_1, y_1} \tag{5.23}$$

where $\mu$ is the overall mean, $\bar{a}$ is the mean value of the variable in the four cells that share a side with $(x_1, y_1)$, and $\varepsilon_{x_1, y_1}$ is a normally distributed error term with mean zero. If $\rho = 0$, the values of the variable $a_{x_1, y_1}$ are independent of the values at other sites. In this case the values of a are simply equal to the overall mean plus a normally distributed error term. When $\rho > 0$, the value in a particular interior cell depends upon the values of the four surrounding cells. The value of $\rho$ measures the amount of dependence, and it can range up to 0.25. Note that, when $\rho = 0.25$, the value of a at a location is precisely equal to the average of the values in surrounding locations, plus an error term. Clifford and Richardson (1985) use Equation 5.23 to simulate two spatial variables that are not correlated with one another. Next, they find r and use Equation 5.8 and $\alpha = 0.05$ to see whether r is significant. One would expect to find significant values 5% of the time (since $\alpha = 0.05$ is the Type I error probability). Table 5.4, as reported by Haining (1990a), displays the findings. $\rho_1$ and $\rho_2$ represent the amount of spatial dependence used in generating the two variables. When $\rho_1$ and $\rho_2$ are zero, the Type I error probability is near its expected value of 0.05. Note that when one variable has no spatial dependence ($\rho_1 = 0$), the other variable can exhibit strong spatial dependence (e.g., $\rho_2 = 0.24$), and there is still no effect on the test for correlation since the Type I error probability is still near 0.05. But when both variables exhibit strong spatial dependence, the Type I error probabilities (where one incorrectly finds significant correlation coefficients) rise dramatically. With strong spatial dependence and no corrective action, one will too often reject true null hypotheses.

Table 5.4 Type I error probabilities with spatial dependence

ρ1       ρ2       Type I error probability
0        0        0.0566
0        0.2      0.0500
0        0.24     0.0400
0.1      0.1      0.0700
0.1      0.225    0.1000
0.15     0.15     0.1000
0.15     0.225    0.1500
0.2      0.2      0.1900
0.2      0.24     0.3100
0.225    0.225    0.3366
0.24     0.24     0.5000

5.6.4 Modifiable Areal Unit Problem and Spatial Aggregation

Gehlke and Biehl (1934) noted that correlation coefficients tend to increase with the level of geographic aggregation when census data are analyzed. A smaller number of large geographic units tends to give a larger correlation coefficient than does an analysis with a larger number of small geographic units. In a classic study, Robinson (1950) noted that the correlation between race and illiteracy rose with the level of geographic aggregation. It is important to keep in mind the fact that the size and configuration of spatial units may affect the analysis. What is significant at one spatial scale may not be significant at another.
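Returning briefly to Sections 5.6.1 and 5.6.2, the confidence interval and the test for a difference between two correlations lend themselves to a short computational check. The sketch below is a minimal Python illustration that assumes the standard form of Fisher's transformation, $y = \frac{1}{2} \ln[(1 + r)/(1 - r)]$ (the inverse hyperbolic tangent of r), with standard deviation $1/\sqrt{n - 3}$. The function names are hypothetical, and the inputs are the correlations and sample sizes reported for the two age groups of Table 5.1.

```python
import math


def fisher_y(r):
    """Fisher's transformation of r; identical to math.atanh(r)."""
    return 0.5 * math.log((1 + r) / (1 - r))


def confidence_interval(r, n, z_crit=1.96):
    """Approximate 95% interval for rho, built on the transformed scale."""
    y = fisher_y(r)
    half_width = z_crit / math.sqrt(n - 3)
    # tanh inverts the transformation: (e^{2y} - 1) / (e^{2y} + 1).
    return math.tanh(y - half_width), math.tanh(y + half_width)


def difference_z(r1, n1, r2, n2):
    """z-statistic for H0: rho_1 = rho_2, from two independent samples."""
    std_dev = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (fisher_y(r1) - fisher_y(r2)) / std_dev


# 20-24 year olds: r = -0.747, n = 28; 25-29 year olds: r = -0.805, n = 26.
lower, upper = confidence_interval(-0.747, 28)
z = difference_z(-0.747, 28, -0.805, 26)

print(round(lower, 3), round(upper, 3), round(z, 2))
```

Because the interval is symmetric on the transformed scale and not on the scale of r, the back-transformed limits sit unevenly around the observed correlation.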

5.7 Correlation in SPSS for Windows 9.0

Each variable should be represented by a column of data. Then click on Analyze and Correlate. Next, click on Bivariate, and move the variables you wish to correlate from the list on the left to the box on the right. You may move more than two variables into the box if you wish to see a table of correlations among a number of variables. If desired, check the box to have Spearman's correlation coefficient calculated (Pearson's correlation is calculated by default).

Table 5.5 1990 Census data for a random sample of census tracts in Erie County, New York

AREANAME                TOTPOP90    MEDHSINC    MEDAGE    SAGE     MAGE    PCTOWN
Tract 0010 BG 4         999         20862       49.24     19.08    60      .510
Tract 0016 BG 2         477         17804       50.70     17.49    60      .354
Tract 0026 BG 1         647         10545       51.24     16.92    58      .5
Tract 0028 BG 2         856         14602       50.66     18.29    60      .479
Tract 0045 BG 5         994         33603       45.12     15.36    60      .683
Tract 0057 BG 4         1083        24440       52.31     18.05    60      .516
Tract 0060 BG 1         879         15964       39.43     15.83    60      .416
Tract 0068 BG 2         374         33750       45.07     16.92    60      .202
Tract 0068 BG 4         806         14597       42.93     18.30    60      .346
Tract 0073.02 BG 9      2194        39779       52.49     16.58    39      .906
Tract 0076 BG 4         1150        29250       54.65     17.97    46      .884
Tract 0079.01 BG 3      1720        44205       53.63     13.83    41      .978
Tract 0079.02 BG 5      540         34625       60.55     14.06    44      .967
Tract 0079.02 BG 8      1128        32439       58.49     14.18    44      .899
Tract 0085 BG 1         434         39375       54.07     17.89    54      .967
Tract 0087 BG 3         1415        29513       51.42     17.77    60      .706
Tract 0097.01 BG 2      1639        39104       53.35     12.48    33      .992
Tract 0100.02 BG 1      3072        26174       54.41     16.36    30      .886
Tract 0100.02 BG 5      1715        28477       49.28     15.38    42      .638
Tract 0101.01 BG 5      755         35000       56.58     15.48    54      1
Tract 0101.02 BG 2      731         17647       43.49     16.66    45      .239
Tract 0111 BG 4         544         28438       57.78     17.04    47      .796
Tract 0115 BG 1         885         33214       51.68     19.54    47      .862
Tract 0117 BG 2         633         36346       48.22     16.09    47      .856
Tract 0120.02 BG 1      851         28500       59.56     17.39    42      .876
Tract 0142.05 BG 1      681         38125       46.44     18.23    19      .885
Tract 0150.03 BG 2      1270        25515       51.64     16.78    60      .58
Tract 0152.02 BG 9      1334        29554       49.15     17.89    37      .74
Tract 0153.02 BG 1      634         47083       52.38     15.21    46      .86

Table 5.6 Bivariate correlations among four neighborhood variables

5.7.1 Illustration

The data in Table 5.5 are a random sample of 29 census block groups (areas of about 1000 people) from Erie County, New York, in 1990. For each block group, there are data on population (totpop90), median household income (medhsinc), the median age of heads of households (medage), the standard deviation of the householder's age (sage), which is a measure of age mixing in the block group, the median age of housing (mage), and the percentage of housing that is owner-occupied (pctown). Rogerson and Plane (1998) discuss the age structure of householders in residential neighborhoods, and develop a model showing how age structure is related to variables such as age of the housing in the neighborhood, mobility, and homeownership. Table 5.6 shows the bivariate correlation coefficients (Pearson and Spearman) among four of the variables. The median age of householders in block groups has a significant correlation with homeownership; as one might expect, the association is positive, and median age is higher in areas of higher homeownership. The variability of ages in a neighborhood (sage) is negatively related to income (high income neighborhoods are more homogeneous with respect to age) and negatively related to homeownership (where ownership is high, the ages of the householders are more homogeneous). Finally, note the similarity between Pearson's and Spearman's coefficients.

Exercises

1. (a) Find the correlation coefficient, r, for the following sample data on income and education:

Observation    Income ($1000)    Education (years)
1              30                12
2              28                12
3              52                18
4              40                16
5              35                16

(b) Test the null hypothesis $\rho = 0$.
(c) Find Spearman's rank correlation coefficient for these data.
(d) Test whether the observed value of $r_S$ from part (c) is significantly different from zero.

2. (a) Draw a graph depicting a situation where the correlation coefficient is close to zero, but there is a clear relationship between two variables.
(b) Draw a graph depicting a situation where there is a strong positive relationship between two variables, but where the presence of a small number of outliers makes the strength of the relationship less strong.

3. The t-statistic for testing the significance of a correlation coefficient is

$$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$$

with $n - 2$ degrees of freedom. If the sample size is 36 and $\alpha = 0.05$, what is the smallest absolute value a correlation coefficient must have to be significant? What if the sample size is 80?

4. Find the correlation coefficient for the following data:

Obs.    X    Y
1       2    6
2       8    6
3       9    10
4       7    4

5. (a) Why is a "rule of thumb" for the significance of a correlation coefficient (e.g., $r^2$ above 0.7 is significant) not a good idea?
(b) Why is a very large sample a "problem" in the interpretation of significance tests for the correlation coefficient?

6. Find the correlation coefficient between median annual income in the United States and the number of horse races won by the leading jockey, for the period 1984–1994:

Year    Median income    Number of races won by leading jockey
1984    35 165           399
1985    35 778           469
1986    37 027           429
1987    37 256           450
1988    37 512           474
1989    37 997           598
1990    37 343           364
1991    36 054           430
1992    35 593           433
1993    35 241           410
1994    35 486           317

Test the hypothesis that the true correlation coefficient is equal to zero. Interpret your results.

6

Introduction to Regression Analysis

LEARNING OBJECTIVES

- Modeling one variable as a linear function of another
- Fitting a straight line through a set of points plotted in two dimensions
- Assumptions of linear regression
- Relationship of regression to analysis of variance
- Testing the significance of the regression slope

6.1 Introduction

Whereas correlation is used to measure the strength of the linear association between variables, regression analysis refers to the more complete process of studying the causal relationship between a dependent variable and a set of independent, explanatory variables. Linear regression analysis begins by assuming that a linear relationship exists between the dependent variable (y) and the independent variables (x), proceeds by fitting a straight line to the set of observed data, and is then concerned with the interpretation and analysis of the effects of the x variables on y, and with the nature of the fit. An important outcome of regression analysis is an equation that allows us to predict values of y from values of x.

Regression analysis is used to specify and test a functional relationship between variables. As discussed in Chapter 1, the process of description often leads one to suspect that two or more variables are related. Once we specify how the variables are related, we have a model, which may be thought of as a simplification of reality. Regression analysis provides us with (a) a simplified view of the relationship between variables, (b) a way of fitting the model with our data, and (c) a means for evaluating the importance of the variables and the correctness of the model.

For example, we may wish to know whether the distance that adults live away from their parents is dependent upon education, whether snowfall is dependent upon elevation, whether infant mortality is related to income, or whether park attendance is related to the income of the population living within a certain distance of the park. In each case a good place to begin is to plot the data on a graph, and to measure the correlation between variables. Linear regression analysis takes this a step further by finding the best-fitting straight line through the set of points.


When there is just one independent, explanatory variable, as is the case in the examples above, we wish to fit a straight line through the set of data points; the equation of this line is

\[ \hat{y} = a + bx \tag{6.1} \]

where \(\hat{y}\) is the predicted value of the dependent variable, x is the observed value of the independent variable, a is the intercept (or point where the line intersects the vertical axis), and b is the slope of the line. The quantities a and b represent parameters describing the line, and these will be estimated from the data. This case, with one independent variable, is known as simple regression or bivariate regression, and it is depicted in Figure 6.1. We shall in this chapter confine our attention to this special case. More generally, multiple regression (treated in the next chapter) refers to the case where there is more than one independent variable.

Figure 6.1 Regression line through a set of points

The slope of the line, b, may be interpreted as the change in the dependent variable expected from a unit change in the independent variable. For example, suppose a regression of housing sale prices on the square footage of houses yielded the following equation:

\[ \hat{p} = 30\,000 + 70s \tag{6.2} \]

where \(\hat{p}\) is the predicted housing sales price, and s represents square footage. The slope in this equation is 70, and this means that an increase of one square foot leads, on average, to a $70 increase in sales price. The price predicted for a house with 2000 square feet is 30 000 + 70(2000) = $170 000. The intercept is the predicted value of the dependent variable when the independent variable is set equal to zero. In this example, a house with 0 square feet would sell at a predicted price of $30 000! This intercept of $30 000 could be interpreted as the value of the land on which the house is built. More generally, the intercept does not always have a realistic interpretation, since a zero value for the independent variable may lie well outside the range of observed values.

In studying the linear relationship between variables, each observation of the dependent variable, y, may be expressed as the sum of a predicted value and a residual term:

\[ y = a + bx + e = \hat{y} + e \tag{6.3} \]

where \(\hat{y} = a + bx\) is the predicted value, and e is termed the residual. The value \(\hat{y}\) represents the value of the dependent variable predicted by the regression line. Note that the residual is equal to the difference between observed and predicted values:

\[ e = y - \hat{y} \tag{6.4} \]
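As a small illustration of Equations 6.2 and 6.4, the following Python sketch evaluates the housing example. The function name `predict_price` and the observed sale price of $180 000 are hypothetical, introduced only to show how a residual would be computed.

```python
# Minimal sketch of Equations 6.2 and 6.4 for the housing example.
INTERCEPT = 30_000  # a: predicted price at zero square feet (dollars)
SLOPE = 70          # b: change in predicted price per square foot (dollars)


def predict_price(square_feet):
    """Predicted sales price from the fitted line p-hat = a + b*s."""
    return INTERCEPT + SLOPE * square_feet


predicted = predict_price(2000)

# Hypothetical observed price (an assumption, not from the text),
# used only to illustrate the residual e = y - y-hat.
observed = 180_000
residual = observed - predicted

print(predicted, residual)
```

A positive residual means that the observation lies above the regression line; a negative residual means that it lies below.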

In keeping with the distinction between sample and population, note that a and b are estimates of the intercept and slope of some "true", unknown regression line. The slope and intercept of this true regression line could, in theory, be determined by taking a complete, 100% sample of the population. As usual, we use Greek letters to denote the population values of the parameters:

\[ y = \alpha + \beta x + \varepsilon \tag{6.5} \]

where \(\alpha\) and \(\beta\) are the intercept and slope of the true regression line, respectively. Each observation y may be viewed as the sum of a component that predicts the value of y on the basis of the value of x (using the true coefficients \(\alpha\) and \(\beta\)) and some random error (\(\varepsilon\)). The error term reflects the fact that we do not expect the model to work "perfectly"; inevitably there will be other variables that also influence y, though we hope their influence is relatively minor. Observations on the dependent variable may be expressed as the sum of the predicted value and a "true" population error, \(\varepsilon\), where

\[ \varepsilon = y - \tilde{y} \tag{6.6} \]

is the difference between the observed value (y) and that predicted by the true regression line (\(\tilde{y}\)). The latter quantity is

\[ \tilde{y} = \alpha + \beta x \tag{6.7} \]

In Figure 6.2 both the true regression line (Equation 6.7) and the best-fitting line based on the sample of points (Equation 6.1) are shown. It is important to keep in mind that, had a different sample been collected, the regression line based on the sample would be different but the true regression line would remain the same.

Figure 6.2 True and sample regression lines
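The distinction between the true line and the sample line can be sketched with a small simulation. The population values (`ALPHA`, `BETA`, `SIGMA`), the sample size, and the random seed below are all hypothetical choices made for illustration; the fitting step uses the least-squares formulas given in Section 6.2.

```python
import random

# Hypothetical "true" population line y = ALPHA + BETA*x + error.
ALPHA, BETA, SIGMA = 10.0, 2.0, 1.0


def fit_line(xs, ys):
    """Least-squares intercept a and slope b for a bivariate sample."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def draw_sample(rng, n=50):
    """Draw n observations scattered about the true line."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [ALPHA + BETA * x + rng.gauss(0, SIGMA) for x in xs]
    return xs, ys


rng = random.Random(1)  # fixed seed so the run is repeatable
fits = [fit_line(*draw_sample(rng)) for _ in range(3)]

for a, b in fits:
    print(f"sample line: y-hat = {a:.2f} + {b:.3f}x   (true slope = {BETA})")
```

Each sample yields a slightly different pair (a, b), while `ALPHA` and `BETA` never change.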

6.2 Fitting a Regression Line to a Set of Bivariate Data

Figure 6.3 shows a straight line fit through a set of points plotted in a two-dimensional, x-y space. In regression analysis, the objective is to find the slope and intercept of a best-fitting line that runs through the observed set of data points. But what is meant by best-fitting? There are certainly many ways to fit a line through a set of points. One way would be to fit the line so that the sum of the shortest distances from the observations to the line was a minimum. In this case, the distances are represented in Figure 6.3 geometrically as (dotted) lines that run from the observations to the regression line and which are perpendicular to the regression line. In linear regression analysis, the sum of squared vertical distances from the observed points to the line (i.e., the solid lines in Figure 6.3) is minimized. The fact that vertical distances are used is consistent with the idea that the dependent variable, which is always portrayed on the vertical axis, is being predicted from the independent variable (portrayed on the horizontal axis). In fact, the vertical distance is identical to the value of the residual, which, as we have indicated, is the difference between the observed and predicted values of the dependent variable. Thus regression analysis minimizes the sum of squared residuals. The sum of squared residuals is used primarily for reasons of mathematical convenience: expressions for finding the values of a and b from the data are much easier to derive and express. Thus the objective is to find values of a and b that minimize the sum of squared residuals:

\[ \min_{a,b} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \min_{a,b} \sum_{i=1}^{n} (y_i - a - bx_i)^2 \tag{6.8} \]

Geometrically, this problem corresponds to finding the minimum of a three-dimensional parabolic cone, where a and b are coordinates in the two-dimensional plane and the sum of squared residuals is the vertical axis (see Figure 6.4). Viewing the figure, we can imagine trying different values of a and b; some will work relatively well in the sense that the sum of squared residuals will be quite small, and other combinations will be poor since the sum of squared residuals will be large. The values of a and b at the bottom of the parabolic cone may be determined using the data as follows (see inset for more detail):

\[ b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad a = \bar{y} - b\bar{x} \tag{6.9} \]

Figure 6.3 Alternative measures of distance from points to regression line

Figure 6.4 Minimization of sum of squared residuals


INSET: Finding the slope and intercept via the solution to a calculus problem

The solution of Equation 6.8 to find the values of the slope and intercept is the solution of a calculus problem. Solving the calculus problem amounts to finding the a, b combination that leads to the smallest sum of squared residuals at the bottom of the parabola. For those with a calculus background, we proceed by taking the derivatives of Equation 6.8 with respect to a and b, setting them each to zero, and solving for the two unknowns a and b. The result is Equation 6.9.

The reader should note that the numerator of the expression for the slope b is identical to that for the correlation coefficient r. In fact, for bivariate regression, the slope may be written in terms of the correlation coefficient:

\[ b = r\,\frac{s_y}{s_x} \tag{6.10} \]

Once the values of a and b have been determined, plotting the line is straightforward. As indicated in the equation used above to determine a, the regression line goes through the mean (\(\bar{x}\), \(\bar{y}\)). Another point on the line is (0, a) (the intercept). After plotting these two points, the regression line may be drawn by connecting the two points together with a straight line.
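Equations 6.9 and 6.10 can be checked against each other numerically. The sketch below uses a small hypothetical data set (five points chosen only for easy arithmetic) and plain Python; the variable names are illustrative.

```python
import math

# Hypothetical data, chosen only so the arithmetic is easy to follow.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sums of cross-products and of squared deviations.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

# Equation 6.9: slope and intercept.
b = sxy / sxx
a = y_bar - b * x_bar

# Sample standard deviations and the correlation coefficient.
s_x = math.sqrt(sxx / (n - 1))
s_y = math.sqrt(syy / (n - 1))
r = sxy / ((n - 1) * s_x * s_y)

# Equation 6.10: the same slope written in terms of r.
b_from_r = r * s_y / s_x

print(a, b, b_from_r)
# The fitted line passes through (x_bar, y_bar) and (0, a).
print(a + b * x_bar, y_bar)
```

The two slope values agree up to floating-point rounding, and evaluating the line at the mean of x returns the mean of y.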

6.3 Regression in Terms of Explained and Unexplained Sums of Squares

Another way to understand regression is to recognize that it provides a way to partition the variation in the observed values of a dependent (y) variable. In particular, the variability one observes in y may be decomposed into (a) a part that is explained by the regression line (via the assumption that y is linearly related to one or more independent variables) and (b) a part that remains unexplained. More specifically, the variability in y may be measured by the sum of squared deviations of the y values from their mean. Some of this variability is explained by the regression line: the values of y vary partly because of the assumed relationship with the x variables. The partitioning of the total sum of squares into explained and unexplained components is analogous to the analysis of variance, where the sum of squares is divided into between- and within-column components. Here we have

\[ \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{6.11} \]

The left-hand side is the total sum of squares; the deviation between observed value and mean is represented geometrically by the distance A in Figure 6.5. In the figure, the points clearly vary about the horizontal line representing the mean value of y. Some of this variability may be attributed to the regression line; we expect points on the right-hand side of the diagram to be above the mean, and points on the left-hand side of the diagram to be below the mean.

Figure 6.5 Partitioning the variability in y

The first term on the right-hand side is the regression (or explained) sum of squares; it is the sum of squared differences between predicted values and the mean value of y. These differences are the deviations of the regression line from the mean and constitute the "explained" portion of the variability in y. The distance between the predicted value and the mean is represented by C in Figure 6.5. Finally, the second term on the right-hand side is the unexplained or residual sum of squares. This is the sum of squared differences between the observed and predicted values. It is this quantity that is minimized when the coefficients are estimated. The distance between the observation and the regression line (i.e., the residual) is represented by B in the figure. The proportion of the total variability in y explained by the regression is sometimes called the coefficient of determination, and it is equal to the square of the correlation coefficient:

\[ r^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{\sum_{i=1}^{n} e_i^2}{(n-1)s_y^2} \tag{6.12} \]

where e is the value of the residual. Note that \(r^2\) is equal to the regression sum of squares divided by the total sum of squares. It is also equal to one minus the ratio of the residual sum of squares to the total sum of squares. The value of \(r^2\) varies from 0 to 1; a value of zero would indicate that no variability has been explained, whereas a value of one would imply that all of the residuals are zero and the regression line fits perfectly through all of the observed points.

One way to determine whether the regression has been successful at explaining a significant portion of the variation in y is to perform an F-test, analogous to the F-test used in the analysis of variance. Specifically, for simple regression, the null hypothesis that \(\rho^2 = 0\) is tested with the F-statistic

\[ F = \frac{r^2 (n-2)}{1 - r^2} \tag{6.13} \]

which has an F-distribution with 1 and \(n - 2\) degrees of freedom when the null hypothesis is true. By looking back to the previous chapter on correlation, you will notice that this F-statistic is the square of the t-statistic used to test the hypothesis that the correlation coefficient is equal to zero. The tests are identical in the sense that they will always yield identical conclusions and p-values. The origins of this F-test lie in the partitioning of the sums of squares just described. We can create an analysis of variance table, for a hypothetical example with 12 observations, as follows:

                          Sums of squares   df            Mean square   F
Regression (explained)    578               1             578           13.7
Residual (unexplained)    422               n - 2 = 10    42.2
Total                     1000              n - 1 = 11

The total sum of squares has \(n - 1\) degrees of freedom associated with it. In simple regression, where there is one independent variable, there is always 1 degree of freedom associated with the regression sum of squares, leaving \(n - 2\) degrees of freedom associated with the residual sum of squares. The F-ratio is therefore equal to

\[ F = \frac{\text{explained SS}/1}{\text{residual SS}/(n-2)} = \frac{578/1}{422/10} = 13.7 \tag{6.14} \]

Recalling the definition of \(r^2\), this ANOVA-like expression for F can be seen to be equivalent to Equation 6.13, since

\[ F = \frac{\text{explained SS}/1}{\text{residual SS}/(n-2)} = \frac{r^2 (n-2)}{1 - r^2} \tag{6.15} \]
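The equivalence of Equations 6.13 and 6.14 can be verified with the numbers in the hypothetical ANOVA table above. The sketch below is illustrative only; the variable names are not from the text.

```python
# Values from the hypothetical ANOVA table with 12 observations.
explained_ss = 578.0
residual_ss = 422.0
n = 12

total_ss = explained_ss + residual_ss

# Equation 6.14: ratio of the two mean squares.
f_from_ss = (explained_ss / 1) / (residual_ss / (n - 2))

# Equation 6.13: the same statistic written in terms of r-squared.
r2 = explained_ss / total_ss
f_from_r2 = r2 * (n - 2) / (1 - r2)

print(round(f_from_ss, 1), round(f_from_r2, 1))
```

Dividing the numerator and denominator of Equation 6.14 by the total sum of squares is all that is needed to pass from one form to the other.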

The value of \(r^2\) may be thought of as the maximal correlation between a weighted combination of the independent variables and the dependent variable. (The weights happen to be the regression coefficients if the dependent and independent variables are put in their standardized, z-score form.) The calculated value of \(r^2\) overestimates the true value, \(R^2\). Note that if the number of observations is equal to the number of variables, \(r^2\) will always equal one, even if the variables are unrelated (i.e., \(R^2 = 0\)). If \(R^2 = 0\), then the expected value of \(r^2\) is \((p - 1)/(n - 1)\), where p is the number of variables and n is the number of observations. For example, if \(p = 11\) and \(n = 21\), then the expected value of \(r^2\) is 10/20 = 0.5, even when the individual variables are not truly correlated! This serves to emphasize the importance of having a large number of observations relative to the number of variables. The adjusted \(r^2\) represents a downward adjustment of \(r^2\) that takes this difference between the sample and population values into account. We will see an example of this adjustment in Section 6.11.
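The following sketch evaluates the expected value of \(r^2\) quoted above and applies the usual adjustment formula, adjusted \(r^2 = 1 - (1 - r^2)(n - 1)/(n - p)\). That formula is not stated in the text; it is the standard textbook form, written here on the assumption that p counts all variables, including the dependent one. The function names are hypothetical.

```python
def expected_r2_under_null(p, n):
    """Expected sample r-squared when the true R-squared is zero.

    p is the number of variables (dependent variable included) and
    n is the number of observations, following the text's convention.
    """
    return (p - 1) / (n - 1)


def adjusted_r2(r2, n, p):
    """Standard adjusted r-squared (assumed form; not given in the text)."""
    return 1 - (1 - r2) * (n - 1) / (n - p)


# The example in the text: 11 variables and 21 observations.
print(expected_r2_under_null(11, 21))

# Adjusting that chance-level value brings it back down to zero.
print(adjusted_r2(0.5, 21, 11))

# Simple regression (p = 2) with n = 10 and r-squared = 0.697,
# the values reported for the supermarket example in Section 6.11.
print(round(adjusted_r2(0.697, 10, 2), 3))
```

Under this convention the last line reproduces the adjusted value of 0.659 reported with the SPSS output in Section 6.11.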

6.4 Assumptions of Regression

The assumptions of regression analysis for simple regression are:

(1) The relationship between y and x is linear; that is, there is an equation \(y = \alpha + \beta x + \varepsilon\) that constitutes the population model.
(2) The errors have mean zero and constant variance; that is, \(E[\varepsilon] = 0\) and \(V[\varepsilon] = \sigma^2\). The variance of the errors does not vary with x; that is, \(V[\varepsilon \mid x] = \sigma^2\).
(3) The errors are independent; the value of one error is not affected by the value of another error.
(4) For each value of x, the errors have a normal distribution about the regression line. This normal distribution is centered on the regression line. This assumption may be written \(\varepsilon \sim N(0, \sigma^2)\).

Multiple regression, treated in the next chapter, adds another assumption, namely that the x variables have no multicollinearity (that is, the independent variables are not significantly correlated with one another).

6.5 Standard Error of the Estimate

The standard error of the estimate is another expression for the standard deviation of the residuals; for the case of simple regression, it is estimated by

\[ s_e = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}} \tag{6.16} \]

6.6 Tests for Beta

We are often interested in testing the null hypothesis that the true value of the slope is equal to zero, i.e., \(H_0: \beta = 0\). We are, of course, usually interested in the prospect of rejecting this null hypothesis, thereby accumulating some evidence that the variable x is important in understanding y. This can be done via a t-test

\[ t = \frac{b - \beta}{s_b} = \frac{b}{s_b} \tag{6.17} \]

with \(n - 2\) degrees of freedom, where \(s_b\) is the standard deviation of the slope:

\[ s_b = \sqrt{\frac{s_e^2}{(n-1)s_x^2}} \tag{6.18} \]

6.7 Con®dence Intervals

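A confidence interval for the regression line is found from

\[ V[\hat{y} \mid x] = \frac{s_e^2}{n}\left(1 + \frac{(x - \bar{x})^2}{s_x^2}\right) \tag{6.19} \]

A confidence interval for individual predictions is found from

\[ V[\hat{y} \mid x] = \frac{s_e^2}{n}\left(n + 1 + \frac{(x - \bar{x})^2}{s_x^2}\right) \tag{6.20} \]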
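To show how the two kinds of interval differ, the sketch below computes the standard error of the regression line and of an individual prediction at a chosen value of x. It uses the standard textbook forms, with the sum of squared deviations of x in the denominator; this is an assumption about notation and may differ slightly from Equations 6.19 and 6.20 in how \(s_x^2\) is scaled. The data are hypothetical, and the function names are illustrative.

```python
import math

# Hypothetical data, used only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b = sxy / sxx
a = y_bar - b * x_bar

# Variance of the residuals (square of Equation 6.16).
residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s_e2 = residual_ss / (n - 2)


def se_regression_line(x0):
    """Standard error of the predicted mean of y at x0."""
    return math.sqrt(s_e2 * (1 / n + (x0 - x_bar) ** 2 / sxx))


def se_individual(x0):
    """Standard error of an individual predicted y at x0."""
    return math.sqrt(s_e2 * (1 + 1 / n + (x0 - x_bar) ** 2 / sxx))


for x0 in (3.0, 5.0):
    print(x0, se_regression_line(x0), se_individual(x0))
```

Both standard errors are smallest at the mean of x and grow as x moves away from it, and the one for an individual prediction is always the larger of the two. An interval would be formed by multiplying the standard error by the appropriate t-value with \(n - 2\) degrees of freedom.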

6.8 Illustration: Income Levels and Consumer Expenditures

A supermarket is interested in how income levels (x) may affect the amount of money spent per week by its customers (y). The null hypothesis is that income levels do not affect the amount of money spent per week by customers, and the alternative hypothesis is that higher incomes are associated with greater spending. Table 6.1 depicts the data collected from ten survey respondents.

Table 6.1

Amount spent/week (y)   Income (000) (x)
$120                    65
$68                     35
$35                     30
$60                     44
$100                    80
$91                     77
$44                     32
$71                     39
$89                     44
$113                    77


To fit the regression line, we can first compute the following quantities:

\[ \bar{x} = 52.3, \quad s_x = 20.20; \qquad \bar{y} = 79.1, \quad s_y = 28.34; \qquad \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = 4301.7; \qquad r = 0.835 \tag{6.21} \]

Income x is given in thousands of dollars, and weekly supermarket spending is given in dollars. From these, we may find the slope from either of the following:

\[ b = r\,\frac{s_y}{s_x} = 0.835\left(\frac{28.34}{20.2}\right) = 1.171, \qquad b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{4301.7}{9(20.2^2)} = 1.171 \tag{6.22} \]

The intercept is

\[ a = \bar{y} - b\bar{x} = 79.1 - 1.171(52.3) = 17.8 \tag{6.23} \]

Our regression line may be expressed as

\[ \hat{y} = 17.8 + 1.171x \tag{6.24} \]

implying that every increase of $1000 in income is associated with an increase of $1.17 spent at the supermarket each week. For each observation, we can compute a predicted value and a residual, using the regression line, along with the values of x. For example, for the first observation, the predicted value of the dependent variable is

\[ \hat{y} = 17.8 + 1.171(65) = 93.9 \tag{6.25} \]

The residual for this observation is the observed value of y minus the predicted value:

\[ e = 120 - 93.9 = 26.1 \tag{6.26} \]

Table 6.2 depicts the results, including residuals and predicted values for all observations. The sum of the residuals (subject to a bit of rounding error) is equal to zero: the amount by which the positive residuals lie above the regression line is equal to the amount by which the negative residuals lie below the regression line.


Table 6.2

Amount spent/week (y)   Income (000) (x)   Predicted \(\hat{y}\)   Residual e
$120                    65                 94.0                    26.0
$68                     35                 58.8                    9.2
$35                     30                 53.0                    -18.0
$60                     44                 69.4                    -9.4
$100                    80                 111.5                   -11.5
$91                     77                 108.0                   -17.0
$44                     32                 55.3                    -11.3
$71                     39                 63.5                    7.5
$89                     44                 69.3                    19.7
$113                    77                 108.0                   5.0
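The calculations behind Equations 6.21 to 6.26 and Table 6.2 can be reproduced directly from the data in Table 6.1. The sketch below is one way to do so in plain Python; the variable names are illustrative, and small differences from the printed values arise because the text rounds the coefficients before predicting.

```python
import math

# Data from Table 6.1: weekly spending (dollars) and income (thousands).
spent = [120, 68, 35, 60, 100, 91, 44, 71, 89, 113]
income = [65, 35, 30, 44, 80, 77, 32, 39, 44, 77]

n = len(income)
x_bar = sum(income) / n
y_bar = sum(spent) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(income, spent))
sxx = sum((x - x_bar) ** 2 for x in income)
syy = sum((y - y_bar) ** 2 for y in spent)

b = sxy / sxx                  # slope (Equation 6.22)
a = y_bar - b * x_bar          # intercept (Equation 6.23)
r = sxy / math.sqrt(sxx * syy) # correlation coefficient

predicted = [a + b * x for x in income]
residuals = [y - y_hat for y, y_hat in zip(spent, predicted)]

print(f"a = {a:.1f}, b = {b:.3f}, r = {r:.3f}")
for y, x, y_hat, e in zip(spent, income, predicted, residuals):
    print(f"{y:5d} {x:4d} {y_hat:7.1f} {e:7.1f}")
print(f"sum of residuals = {sum(residuals):.6f}")
```

With unrounded coefficients the residuals sum to zero apart from floating-point error, which is the property noted in the text.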

The analysis of variance table associated with this regression is as follows:

                          Sums of squares   df           Mean square   F
Regression (explained)    5039.2            1            5039.2        18.4
Residual (unexplained)    2189.7            n - 2 = 8    273.7
Total                     7228.9            n - 1 = 9

This table can be constructed by recalling that, for bivariate regression, the number of degrees of freedom associated with the explained sum of squares is equal to one, and that associated with the total sum of squares is equal to \(n - 1\). Also recall that the mean square is simply the sum of squares divided by the number of degrees of freedom, and the F-statistic is the ratio of the mean squares. The first column may be completed by recognizing that the total sum of squares is equal to the variance of y multiplied by \(n - 1\):

\[ \text{Total sum of squares} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = (n-1)s_y^2 \tag{6.27} \]

Since \(r^2 = 0.835^2\) is the proportion of the total sum of squares explained by the regression, the regression sum of squares is equal to \(7228.9(0.835^2) = 5040.2\) (or 5039.2, as shown in the table, when r is carried to more decimal places). The residual sum of squares is simply the difference between the total and regression sums of squares. The observed F-ratio of 18.4 in the table may be compared with the critical value of 5.32 found in the F-table, using \(\alpha = 0.05\) and 1 and 8 degrees of freedom, respectively, for numerator and denominator. Since the observed value of F exceeds the critical value of 5.32, we reject the null hypothesis that \(\rho^2\) is equal to zero, and conclude that income explains a significant amount of the variability in supermarket spending.

We can also test the hypothesis that the true regression coefficient is equal to zero by using Equations 6.16 to 6.18. This first requires finding the variance of the residuals (\(s_e^2\)) and the standard error of the estimate (\(s_e\)):

\[ s_e^2 = \frac{\sum_{i=1}^{n} e_i^2}{n - 2} = \frac{2189.7}{8} = 273.7; \qquad s_e = \sqrt{273.7} = 16.54 \tag{6.28} \]

Next, the standard deviation of the estimate of the slope is given by

\[ s_b = \sqrt{\frac{s_e^2}{(n-1)s_x^2}} = \sqrt{\frac{273.7}{9(20.2^2)}} = \sqrt{0.0745} = 0.27 \tag{6.29} \]

The test of the null hypothesis \(H_0: \beta = 0\) is then carried out with the t-statistic

\[ t = \frac{b}{s_b} = \frac{1.171}{0.27} = 4.34 \tag{6.30} \]

Since the observed value of t exceeds the critical value of 2.306 found from the t-table using a two-tailed test with \(\alpha = 0.05\) and \(n - 2 = 8\) degrees of freedom, we reject the null hypothesis that the true regression coefficient is equal to zero. In bivariate regression, the t-test and the F-test will always give consistent results.
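The analysis of variance table and the slope test for this illustration can be rebuilt from the raw data, which also shows that the square of the t-statistic equals the F-statistic. This is a sketch in plain Python with illustrative variable names; critical values are not computed here and would still be taken from tables.

```python
import math

# Data from Table 6.1: weekly spending (dollars) and income (thousands).
spent = [120, 68, 35, 60, 100, 91, 44, 71, 89, 113]
income = [65, 35, 30, 44, 80, 77, 32, 39, 44, 77]

n = len(income)
x_bar = sum(income) / n
y_bar = sum(spent) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(income, spent))
sxx = sum((x - x_bar) ** 2 for x in income)

b = sxy / sxx
a = y_bar - b * x_bar

# Partition of the total sum of squares (Equation 6.11).
total_ss = sum((y - y_bar) ** 2 for y in spent)
residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(income, spent))
regression_ss = total_ss - residual_ss

# Mean squares and the F-ratio (1 and n - 2 degrees of freedom).
s_e2 = residual_ss / (n - 2)
f_stat = (regression_ss / 1) / s_e2

# Standard error of the estimate, of the slope, and the t-statistic.
s_e = math.sqrt(s_e2)
s_b = math.sqrt(s_e2 / sxx)  # sxx equals (n - 1) * s_x**2
t_stat = b / s_b

print(f"regression SS = {regression_ss:.1f}, residual SS = {residual_ss:.1f}")
print(f"F = {f_stat:.2f}, s_e = {s_e:.2f}, s_b = {s_b:.3f}, t = {t_stat:.2f}")
print(f"t squared = {t_stat ** 2:.2f}")
```

Carried at full precision, t is about 4.29 rather than 4.34; the value in Equation 6.30 reflects rounding \(s_b\) to 0.27 before dividing.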

6.9 Illustration: State Aid to Secondary Schools

In New York State, aid is given to schools by the state government. The aid formula is based upon a number of factors, but one principle is that districts with residents that have relatively higher household incomes should receive relatively less in state aid. A graph of state aid per pupil versus average household income for 27 public school districts in western New York reveals that there is in fact a downward trend in aid with increasing income (see Figure 6.6). Regression analysis of the 27 pairs of points confirms this:

\[ \hat{y} = 6106.93 - 0.0818x \tag{6.31} \]

where \(\hat{y}\) is the predicted amount of school aid per pupil, in dollars, and x is the average household income. The slope implies that every increase of $1 in household income brings with it a decline in state aid of $0.08 per pupil. Equivalently, every increase of $1000 in household income leads to a decline of $81.80 in state aid per pupil (although this estimate is also capturing the effects of other variables in the state aid formula that have been omitted here; see Sections 7.1.2 and 7.2). The value of \(r^2\) is 0.257, and the slope is significantly negative (a t-test yields \(t = -2.94\), which is more extreme than the critical value and implies a p-value of 0.007).
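The interpretation of Equation 6.31 can be sketched as follows. The income of $40 000 is a hypothetical value chosen only to show a prediction, and the magnitude of the t-statistic is recovered from \(r^2\) and n by taking the square root of Equation 6.13, which is valid for simple regression.

```python
import math

# Coefficients from Equation 6.31.
INTERCEPT = 6106.93
SLOPE = -0.0818


def predicted_aid(avg_household_income):
    """Predicted state aid per pupil (dollars) for a given average income."""
    return INTERCEPT + SLOPE * avg_household_income


# Change in predicted aid for a $1000 increase in income.
change_per_1000 = SLOPE * 1000

# Hypothetical district with an average household income of $40 000.
example_aid = predicted_aid(40_000)

# Magnitude of t implied by r-squared = 0.257 with n = 27 districts.
r2, n = 0.257, 27
t_abs = math.sqrt(r2 * (n - 2) / (1 - r2))

print(round(change_per_1000, 2), round(example_aid, 2), round(t_abs, 2))
```

The sign of t follows the sign of the slope, so the statistic here is negative.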


Figure 6.6 New York State school aid per pupil vs average household income

Of particular interest to me is the darkened circle, which represents the Amherst school district where my children attend school! Note that the state aid per pupil is significantly lower than the expected value, given the average household income in the district. A partial explanation of this has to do with the way in which the state collects information on income. All taxpayers must note their school district on their tax return, using a three-digit code taken from a list at the back of the tax form instruction booklet. The problem is that the Amherst school district lies in the town of Amherst, but so too do a number of other districts, including the Williamsville and Sweet Home districts. Residents of the Williamsville and Sweet Home districts dutifully go to the back of the instruction booklet, begin to scan the list, and quickly come across Amherst, since it is near the beginning of the alphabet. Since they live in the town of Amherst, they (incorrectly) copy the Amherst code onto their tax form. State officials then tally up the income in each district, and find (incorrectly) that Amherst residents make a lot of money and consequently should receive less in school aid! The average household income data used in Figure 6.6 represent the more accurate household income data taken from the US Census. Even though it may be out-of-date, since it was collected in 1990 and the analysis was done in 1997, the census data give a truer picture than do the tax return data of how deserving of aid each district is. If the aid received by the Amherst school district from the state was in line with expectations, the black dot would be raised vertically from its present location to the regression line. This would represent an increase of $1100 per pupil, which is almost a 70% increase over the present figure of about $1600 per pupil. Since the district contains over 3000 students, this would represent an increase of over three million dollars.
Unfortunately for the residents of the Amherst District, it is difficult to correct the imbalance, and this is true despite the potential of geographic information systems to attribute each resident's address to the correct local school district. There is a legal clause that limits the degree to which the problem can be corrected, since any change that gave more to the Amherst District would give less to surrounding districts. So, despite the fact that residents in the Williamsville and Sweet Home districts have less income attributed to their district than they should (and consequently have higher state aid), the imbalance has been only partially corrected. One solution I suggested at a meeting of the Amherst School Board was that we change the name of our district! Since the problem is caused by the fact that the school district has the same name as the town, perhaps we could simply change the name to something else. This creative suggestion met with difficulty too, since it turns out that it is difficult, if not impossible, to change legally the school district's name. Another, less savory, approach would be to launch a campaign to encourage residents of the Amherst School district to put the school codes of Sweet Home or Williamsville on their tax forms.

In any event, this example serves to illustrate how regression analysis can be used both to estimate the magnitude of effects of one variable on another (the term "effect" in the current example refers to the effect of income on state aid) and to interpret unusual observations. The example also illustrates that because of the quirky nature of data, potential pitfalls abound when one attempts to establish relationships between one variable (income) and another (state aid).

6.10 Linear versus Nonlinear Models

It should be understood that the word "linear" refers to the fact that in linear regression analysis the relationship is one that is linear in the parameters. The parameters are the intercept and slope coefficients, a and b. Thus linear regression could be used to study the effects of the square of some variable x on the value of y. Similarly, the equations

\[ y = a + b\sqrt{x}, \qquad y = a + b(\ln x)^2 \tag{6.32} \]

may also be studied using the methods of linear regression, since the parameters a and b appear linearly (that is, they are raised to the power 1). One example of an equation that is not linear in its parameters is

\[ y = a + b^2 x \tag{6.33} \]

since the parameter b is raised to the power 2.

In many instances, a nonlinear relationship may still be analyzed using linear regression, since the nonlinear curve may be transformed into a linear one. For example, a common finding in geographic research is that there is a distance-decay effect for many kinds of interaction. Furthermore, this effect is often well modeled with a negative exponential curve. Attendance at a local swimming pool, for instance, may appear to decline exponentially with the distance that individuals reside from the pool (see Figure 6.7). Negative exponential decay in this example could be modeled with an equation of the form

\[ p_d = p_0 e^{-bd} \tag{6.34} \]

where \(p_d\) is the pool attendance rate among residents residing a distance d from the pool and \(p_0\) is the attendance rate among those living as close as possible to the pool. Furthermore, e is the constant 2.718..., and b is the rate at which attendance declines with distance (that is, it is a measure of the steepness of the exponential decline; higher values of b imply a greater effect of distance on attendance). In this example, we can transform the curvilinear relationship seen in Figure 6.7 into a linear one. One reason for wanting to do this is to be able to make use of the well-developed methods of linear regression analysis. The transformation is brought about by taking the logarithms of both sides of Equation 6.34:

\[ \ln p_d = \ln p_0 - bd \tag{6.35} \]

This is the equation of a straight line; if the negative exponential model is appropriate, then when \(\ln p_d\) is plotted against distance d, the result will be a straight line with slope equal to \(-b\) and intercept equal to \(\ln p_0\) (see Figure 6.8). Note also that models with negative exponential decline have to be initially written with a multiplicative error term, since that allows them to be linearized:

\[ p_d = p_0 e^{-bd}\,\varepsilon \;\Rightarrow\; \ln p_d = \ln p_0 - bd + \ln \varepsilon \tag{6.36} \]

Figure 6.7 Negative exponential decline in pool attendance rate with distance

Figure 6.8 Linear decline in log of pool attendance rate with distance

If a model with negative exponential decline is written with an additive error term, as follows:

\[ p_d = p_0 e^{-bd} + \varepsilon \tag{6.37} \]

it is said to be intrinsically nonlinear, since there is not a transformation that can convert it into the equation of a straight line.

In the next chapter, we will focus on multiple regression, where the regression model includes more than one explanatory variable. This leads to additional issues, and these are also discussed in Chapter 7.
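The linearization of the negative exponential model described in this section can be sketched as follows. The attendance rates are generated from hypothetical parameter values (`P0_TRUE`, `B_TRUE`) with no error term, so that the log-linear regression should recover those values; real data would include error, and the estimates would then be approximate.

```python
import math

# Hypothetical parameters, used only to generate illustrative data.
P0_TRUE = 0.5   # attendance rate at distance zero
B_TRUE = 0.3    # rate of decline with distance

distances = [1.0, 2.0, 3.0, 4.0, 5.0]
rates = [P0_TRUE * math.exp(-B_TRUE * d) for d in distances]

# Take logarithms so that ln(p_d) = ln(p_0) - b*d is a straight line.
log_rates = [math.log(p) for p in rates]

# Ordinary least squares of ln(p_d) on d.
n = len(distances)
d_bar = sum(distances) / n
l_bar = sum(log_rates) / n
slope = (sum((d - d_bar) * (l - l_bar) for d, l in zip(distances, log_rates))
         / sum((d - d_bar) ** 2 for d in distances))
intercept = l_bar - slope * d_bar

# Convert back to the parameters of the exponential model.
b_hat = -slope                 # the fitted slope is -b
p0_hat = math.exp(intercept)   # the fitted intercept is ln(p_0)

print(b_hat, p0_hat)
```

The fitted slope is the negative of the decay parameter, and exponentiating the fitted intercept returns the attendance rate at distance zero.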

6.11 Regression in SPSS for Windows 9.0

6.11.1 Data Input

Each observation is placed into a row of the data table. Each column of the data table corresponds to a variable. It is often convenient, but not necessary, to place the dependent variable in the first column.

6.11.2 Analysis

To carry out a regression analysis, first click on Analyze, then Regression, and then Linear; this will open a box. Within the box, select the dependent variable from the list of variables on the left, and use the arrow key to move it into the box titled "Dependent." Likewise, move the independent variables from the left to the box on the right labeled "Independent(s)." Clicking on "OK" will then carry out a regression analysis. This produces information on r, \(s_e\), an ANOVA-type summary table, and information on the coefficients and their significance. Options for additional output are discussed below.

6.11.3 Options

Clicking on alternatives in the categories Statistics, Plots, and Save can produce additional output. It is common, for example, to want to save additional information. Under Save, one can click on boxes to save, among other items, predicted values, residuals, and confidence intervals associated with both the mean predicted value of y given x (i.e., the regression line; see Equation 6.19) and individual values of y (Equation 6.20). New columns containing the desired information are attached to the right-hand side of the data table.

Table 6.3 Regression of amount spent per week vs income


6.11.4 Output

An example of output is shown in Table 6.3. This output corresponds to the results of the regression associated with the data in Table 6.1. Note that the value of r² is 0.697, and the adjusted r² value is 0.659.

Exercises

1. A regression of weekly shopping trip frequency on annual income (data entered in thousands of dollars) is performed on data collected from 24 respondents. The results are summarized below:

Intercept 0.46
Slope 0.19

|            | Sum of squares | df | Mean square | F |
|------------|----------------|----|-------------|---|
| Regression |                |    |             |   |
| Residual   |                |    |             |   |
| Total      |                |    |             |   |

Sum of squares entries supplied: 1.7, 2.3

(a) Fill in the blanks in the ANOVA table.
(b) What is the predicted number of weekly shopping trips for someone making $50 000/year?
(c) In words, what is the meaning of the coefficient 0.19?
(d) Is the regression coefficient significantly different from zero? How do you know?
(e) What is the value of the correlation coefficient?

2. Name four assumptions of simple linear regression.

3. The correlation coefficient and the slope are as follows:

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{(n-1)s_x s_y}; \qquad b = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{(n-1)s_x^2}$$

Find an equation for b in terms of r.

4. A regression of infant mortality rates (annual deaths per hundred births) on median annual household income (data entered in thousands of dollars) is performed on data collected from 34 counties. The results are summarized below:

Intercept 18.46
Slope 0.14

|            | Sum of squares | df | Mean square | F |
|------------|----------------|----|-------------|---|
| Regression |                |    |             |   |
| Residual   |                |    |             |   |
| Total      |                |    |             |   |

Sum of squares entries supplied: 1.8, 3.4

(a) Fill in the blanks in the ANOVA table.
(b) What is the predicted infant mortality rate in a county where the median annual household income is $40 000?
(c) In words, what is the meaning of the coefficient 0.14? Do NOT simply say that this is the slope or the regression coefficient; indicate what it means and how it can be interpreted.
(d) What is the standard error of the residuals?
(e) Are predictions of mortality rates more accurate near the mean of the median incomes or away from the mean of median incomes?
(f) What is the value of the correlation coefficient?

5. A simple regression of Y vs X reveals, for n = 22 observations, that r² = 0.73. The standard deviation of x is 2.3. The regression sum of squares is 1324. What is the value of the standard deviation of y? What is the value of the slope b?

6. In linear regression, which is wider, the confidence interval for a single predicted value of y, given x, or the confidence interval for the regression line? Give reasons for your answer.

7. Given a simple regression with slope b = 3, s_y = 8, and s_x = 2, find the standard error of the estimate (i.e., the standard deviation of the residuals).

8. The following data are collected in an effort to determine whether snowfall is dependent upon elevation:

| Snowfall (inches) | Elevation (feet) |
|-------------------|------------------|
| 36                | 400              |
| 78                | 800              |
| 11                | 200              |
| 45                | 675              |

Without the aid of a computer, show your work on problems (a) through (g).
(a) Find the regression coefficients (the intercept and the slope coefficient).
(b) Estimate the standard error of the residuals about the regression line.
(c) Test the hypothesis that the regression coefficient associated with the independent variable is equal to zero. Also place a 95% confidence interval on the regression coefficient.
(d) Find the value of R².
(e) Make a table of the observed values, predicted values, and residuals.
(f) Prepare an analysis of variance table portraying the regression results.
(g) Graph the data and the regression line.

7 More on Regression

LEARNING OBJECTIVES

• Regression with more than one independent explanatory variable
• Regression with categorical explanatory variables
• Regression with categorical dependent variables
• Interpreting multiple regression coefficients
• Choosing explanatory variables
• Consequences of poorly satisfied assumptions

7.1 Multiple Regression

It is most often the case that there is more than one variable that is thought to affect the dependent variable. For example, housing prices are affected by many characteristics of both the house and the neighborhood. The number of shopping trips generated by a residential neighborhood is affected by the income of its residents, the number of automobiles its residents own, accessibility to shopping alternatives, and so on. With p independent explanatory variables, the regression equation is

$$\hat{y} = a + b_1x_1 + b_2x_2 + \cdots + b_px_p \tag{7.1}$$

where ŷ is the predicted value of the dependent variable. With a given set of observations on the dependent (y) and independent (x) variables, the problem is to find the values of the parameters a and b1, b2, . . . , bp. The solution is found by minimizing the sum of the squared residuals:

$$\min_{\{a,\,b_1,\ldots,\,b_p\}} \sum \left(y - a - b_1x_1 - \cdots - b_px_p\right)^2 \tag{7.2}$$
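As a worked illustration of the minimization in Equation 7.2, the case p = 2 can be written out explicitly. This is the standard least-squares derivation rather than something stated in the text; the observation index i and the symbols S_jk and S_jy are notation introduced here for convenience. Setting the partial derivatives of the sum of squared residuals with respect to a, b1, and b2 equal to zero gives

$$
\begin{aligned}
a &= \bar{y} - b_1\bar{x}_1 - b_2\bar{x}_2,\\
b_1 S_{11} + b_2 S_{12} &= S_{1y},\\
b_1 S_{12} + b_2 S_{22} &= S_{2y},
\end{aligned}
\qquad\text{where}\quad
S_{jk} = \sum_i (x_{ij}-\bar{x}_j)(x_{ik}-\bar{x}_k),\quad
S_{jy} = \sum_i (x_{ij}-\bar{x}_j)(y_i-\bar{y}).
$$

Solving the last two equations for the slopes:

$$
b_1 = \frac{S_{22}S_{1y} - S_{12}S_{2y}}{S_{11}S_{22} - S_{12}^2},
\qquad
b_2 = \frac{S_{11}S_{2y} - S_{12}S_{1y}}{S_{11}S_{22} - S_{12}^2}.
$$

The common denominator is zero exactly when x1 and x2 are perfectly correlated, in which case no unique pair of slopes exists.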

The problem and solution are identical in concept to that of bivariate regression discussed in the previous chapter, except that there are now more parameters to estimate and the geometric interpretation is carried out in a higher-dimensional space. If p = 2, we wish to find a, b1, and b2 by fitting a plane through the set of points plotted in a three-dimensional space where the axes are represented by the y variable and the two x-variables (see Figure 7.1). The intercept a is the point of the plane on the y-axis when x1 = x2 = 0. The value of b1 describes how much the value of y changes in the plane when x1 increases by one unit along any line where x2 is constant. Similarly, the value of b2 describes the change in y when x2 changes by one unit while x1 is held constant. Although it is difficult (if not impossible!) to visualize, we wish to find the minimum of a four-dimensional parabolic cone, where the sum of squared residuals is represented along the vertical axis and the values of a, b1, and b2 are represented along the other dimensions. More generally, for p independent variables, we fit a p-dimensional hyperplane through the set of points that are plotted in a p+1 dimensional space (one dimension for the y-variable and p additional dimensions, one for each of the independent variables). The coefficients a and b1, . . . , bp are found at the base of a p+2 dimensional parabola. Though it is of course not possible to actually picture this for high-dimensional spaces, this geometric description serves to reinforce what is actually being carried out in regression analysis.

Figure 7.1 Fitting a plane through a set of points in three dimensions (axes x1, x2, and y; the plane meets the y-axis at (x1, x2, y) = (0, 0, a))

Multiple regression carried out on spatial data raises special issues. One particularly difficult issue is the modifiable areal unit problem, first discussed in Chapter 5. A regression of a dependent variable on a set of independent variables may yield substantially different conclusions when carried out on spatial units of differing sizes. Fotheringham and Wong (1991) note that with multiple and logistic regression (to be discussed in Section 7.6) the magnitude and significance of regression coefficients can be very sensitive to the size and configuration of areal units. If feasible, the sensitivity of one's results to changes in the size and/or shape of the spatial units should be explored.

7.1.1 Multicollinearity

In addition to the assumptions given for bivariate regression in the previous chapter, multiple regression analysis makes use of one additional assumption. In particular, it is assumed that there is no multicollinearity among the independent variables. This means that the correlation among the explanatory x-variables should not be high. In the extreme case where two variables are perfectly correlated, it is not possible to estimate the coefficients (and computer software will not provide results for this situation). In the more common case where multicollinearity is high, but not perfect, the estimates of the regression coefficients become very sensitive to individual observations; the addition or deletion of a few observations can change the coefficient estimates dramatically. Also, the variance of the coefficient estimates becomes inflated. Because the coefficients are more variable, it is not uncommon for insignificant independent, explanatory variables to appear significant.

7.1.2 Interpretation of Coefficients in Multiple Regression

Suppose that a regression of house prices in dollars (p) on lot size in square feet (x1) and the number of bedrooms (x2) results in the following equation:

$$p = 4000 + 20x_1 + 10\,000x_2 \tag{7.3}$$

The coefficient on lot size means that every increase of one square foot adds an average of $20 to the house price, holding constant the number of bedrooms. Similarly, the coefficient on the number of bedrooms implies that an added bedroom will increase the value of the house by an estimated $10 000, for houses with identical lot sizes. As with simple regression, the coefficients tell us the effect on the dependent variable of an increase of one unit in the independent variable. In addition, they control for the effects of other variables in the equation. That is, to understand the effect of a particular explanatory variable on the dependent variable, it is not sufficient to simply include it in the right-hand side of a regression equation. Since other variables may also affect the dependent variable, they also have to be included so that the separate effects of each contributing variable may be estimated. If not all relevant variables are included, this may lead to misspecification error.
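The "holding constant" interpretation can be checked numerically. The sketch below simply evaluates Equation 7.3; the function name `predicted_price` and the example house (6000 square feet, 3 bedrooms) are illustrative assumptions, not values from the text.

```python
def predicted_price(lot_size_sqft, bedrooms):
    """Evaluate Equation 7.3: predicted house price in dollars."""
    return 4000 + 20 * lot_size_sqft + 10_000 * bedrooms


# Hypothetical reference house used only for illustration.
base = predicted_price(6000, 3)

# Change one variable by one unit while holding the other constant.
one_more_square_foot = predicted_price(6001, 3) - base
one_more_bedroom = predicted_price(6000, 4) - base

print(base)                  # 154000
print(one_more_square_foot)  # 20
print(one_more_bedroom)      # 10000
```

Each difference equals the corresponding coefficient regardless of the reference house chosen, which is what "holding the other variable constant" means in a linear equation.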

7.2 Misspecification Error

Suppose that Equation 7.3 characterizes the "true" relationship between house prices, lot size, and the number of bedrooms. We will examine the effects of a misspecified regression equation that has an omitted variable by first making up some sample data from the equation

$$p = 4000 + 20x_1 + 10\,000x_2 + \varepsilon \tag{7.4}$$

where ε is a normal random variable with mean 0 and variance equal to 3000² (this implies that one can, with 95% confidence, predict house prices within about two standard deviations, or $6000). Table 7.1 displays the data associated with ten observations, where the data on lot size and the number of bedrooms are simply hypothetical, and then Equation 7.4 is used to generate housing prices.

Table 7.1

| House price | Lot size | Bedrooms |
|-------------|----------|----------|
| 132 767     | 5000     | 3        |
| 134 689     | 5500     | 2        |
| 159 718     | 6000     | 4        |
| 164 937     | 6500     | 3        |
| 132 489     | 5200     | 2        |
| 125 766     | 5400     | 1        |
| 146 568     | 5700     | 3        |
| 168 932     | 6100     | 4        |
| 171 180     | 6300     | 4        |
| 187 921     | 6400     | 5        |

Now suppose that we incorrectly assume that house prices are a function of lot size only. A regression of house prices on lot size yields

$$\hat{p} = \underset{(-1.7)}{-57\,809.7} + \underset{(6.21)}{36.2}\,x_1 \tag{7.5}$$

where the t-values associated with each coefficient are given below the equation in parentheses. We see that the coefficient of lot size is significant (since its t-value is greater than the one-sided critical value of t = 1.86 with n − 2 = 8 degrees of freedom) and is in the "right" direction (i.e., larger lot sizes lead to higher housing prices, as we would expect), but it is larger than the "true" value of 20. We have overestimated the effect of lot size by omitting the number of bedrooms, which also affects housing price. Similarly, if we had incorrectly assumed that house prices were a function of the number of bedrooms only, we would find that the regression equation, based on the observed data, is

$$\hat{p} = \underset{(12.05)}{103\,361} + \underset{(6.1)}{15\,850}\,x_2 \tag{7.6}$$

The effect of number of bedrooms on housing price is significant and in the expected direction, but we have again overestimated somewhat the "true" effect of an added bedroom, which we know to be $10 000. Finally, if we use the data to estimate a regression equation with both independent variables, we find

$$\hat{p} = \underset{(-0.13)}{-1993} + \underset{(7.12)}{21.6}\,x_1 + \underset{(7.24)}{9333}\,x_2 \tag{7.7}$$
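The three fits discussed here can be reproduced from the ten observations in Table 7.1. The following is a minimal sketch using only the Python standard library; the helper names `solve` and `ols` are introduced for illustration, and the approach (centering the predictors and solving the normal equations by Gaussian elimination) is one common way to obtain least-squares estimates, not necessarily how the original results were computed.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - known) / M[r][r]
    return x


def ols(predictors, y):
    """Return [a, b1, ..., bp] minimizing the sum of squared residuals.

    Predictors are centred on their means before the normal equations
    are solved; the intercept is recovered afterwards.
    """
    n, p = len(y), len(predictors)
    means = [sum(col) / n for col in predictors]
    y_mean = sum(y) / n
    centred = [[v - m for v in col] for col, m in zip(predictors, means)]
    S = [[sum(u * v for u, v in zip(centred[j], centred[k])) for k in range(p)]
         for j in range(p)]
    Sy = [sum(u * (yi - y_mean) for u, yi in zip(centred[j], y)) for j in range(p)]
    slopes = solve(S, Sy)
    intercept = y_mean - sum(b * m for b, m in zip(slopes, means))
    return [intercept] + slopes


# Data from Table 7.1.
price = [132767, 134689, 159718, 164937, 132489,
         125766, 146568, 168932, 171180, 187921]
lot_size = [5000, 5500, 6000, 6500, 5200, 5400, 5700, 6100, 6300, 6400]
bedrooms = [3, 2, 4, 3, 2, 1, 3, 4, 4, 5]

lot_only = ols([lot_size], price)            # misspecified: bedrooms omitted
bedrooms_only = ols([bedrooms], price)       # misspecified: lot size omitted
both = ols([lot_size, bedrooms], price)      # correctly specified

for label, coefs in [("lot size only", lot_only),
                     ("bedrooms only", bedrooms_only),
                     ("both variables", both)]:
    print(label, [round(c, 1) for c in coefs])
```

Comparing the printed slopes shows the pattern described in the text: each single-variable fit overstates its slope relative to the true values of 20 and 10 000, while the two-variable fit comes much closer.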

Both variables have significant effects on housing prices, and, more importantly, we have estimated their effects on housing prices quite accurately, since the coefficient of 21.6 is close to its true value of 20 and the coefficient of 9333 is close to its true value of 10 000. The intercept is not too close to its true value of 4000, but we note from Table 7.2 that all of the true coefficients are within two standard deviations of their estimated values, and that all true values of the coefficients lie within their estimated confidence intervals.

Table 7.2

|           | Coefficient | Standard deviation | Confidence interval |
|-----------|-------------|--------------------|---------------------|
| Intercept | −1993       | 14 881             | (−37 181, 33 195)   |
| X1        | 21.6        | 2.98               | (14.6, 28.7)        |
| X2        | 9333        | 1310               | (6234, 12 432)      |

7.3 Dummy Variables

It is sometimes necessary to include explanatory independent variables that are categorical. For instance, income is often not reported exactly but is rather classified into a category. Locations may be classified into, for example, central city, suburb, or rural categories. To handle independent variables that have, say, k categories in regression analysis, we create k − 1 variables. One category is arbitrarily omitted; often the first category (e.g., lowest income) or last category (e.g., highest income) is omitted. Each of these new variables is a binary, 0–1 dummy variable. An observation is assigned a value of zero on one of these variables if it is not in the category, and is assigned a value of one if it is in the category. Consider the example in Table 7.3, where individuals either report or are assigned one of three locations: central city, suburb, or rural. We will arbitrarily choose the rural region as the omitted category. We define k − 1 = 2 dummy variables, the first associated with the central city and the second associated with the suburb. The first two individuals live in the city; they are each assigned x1 = 1 since they live in the city, and x2 = 0 since they do not live in the suburb. Individuals 3 and 5 live in the suburb, so they are assigned x1 = 0 since they do not live in the city, and x2 = 1 since they live in the suburb. Note that individual 4 lives in the rural region, and is assigned a value of 0 on both x1 and x2, since the person does not live in either the city or the suburb.

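Table 7.3

| Individual | Location     | x1 | x2 |
|------------|--------------|----|----|
| 1          | Central city | 1  | 0  |
| 2          | Central city | 1  | 0  |
| 3          | Suburb       | 0  | 1  |
| 4          | Rural        | 0  | 0  |
| 5          | Suburb       | 0  | 1  |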

The reason that a category is always omitted when dummy variables are employed has to do with multicollinearity. If all categories were included, there would be perfect multicollinearity, and this violates an assumption of multiple regression. Perfect multicollinearity would occur if we defined k dummy variables, since the sum of the k columns would always equal one. In our example above, we have included only two of the three categories; there is no reason to include a separate column for the third category, since it simply supplies us with redundant information (for example, we know that if individual 4 does not live in the central city or the suburb, he or she must live in the rural area). Dummy variables are coded as 0 or 1 only, and not, for example, 1 or 2. The 0/1 coding is a result of the fact that the dummy variable is a nominal, categorical variable, and 0/1 coding corresponds to absence/presence. Once dummy variables are defined, regression analysis proceeds in the usual way. Suppose that individuals one through five are observed to make 3, 4, 7, 1, and 5 weekly shopping trips, respectively. The resulting, best-fitting regression equation is found from a computer program to be

$$\hat{y} = 1 + 2.5x_1 + 5x_2 \tag{7.8}$$

Table 7.4 displays the observed and predicted values. The regression coefficients may be interpreted as follows. Being located in the rural region implies that x1 and x2 are zero, and so the predicted value of y is simply the intercept. Thus the intercept in dummy variable regression is the predicted value of the dependent variable for the omitted category. Being located in the central city is "worth" an extra 2.5 shopping trips, relative to the omitted category. Therefore, we predict that someone located in the central city will shop an average of 1 + 2.5 = 3.5 times per week. Being in the suburb is "worth" an extra 5 shopping trips per week, again, relative to the omitted (rural) category. Therefore, individuals residing in the suburb are predicted to shop 1 + 5 = 6 times per week.

Table 7.4 Weekly shopping trip frequency

| Individual | Observed | Predicted |
|------------|----------|-----------|
| 1          | 3        | 3.5       |
| 2          | 4        | 3.5       |
| 3          | 7        | 6         |
| 4          | 1        | 1         |
| 5          | 5        | 6         |

You may be wondering at this point what would have happened if we had omitted a category other than the rural region. Suppose the central city had been chosen as the omitted category. Then our data would look like those in Table 7.5. Here we define x1 = 1 if the individual lives in the suburb, and x2 = 1 if the individual lives in the rural region. Using these data in a multiple regression analysis yields

$$\hat{y} = 3.5 + 2.5x_1 - 2.5x_2 \tag{7.9}$$

Table 7.5

| Individual | Location     | x1 | x2 | Weekly shopping trips |
|------------|--------------|----|----|-----------------------|
| 1          | Central city | 0  | 0  | 3                     |
| 2          | Central city | 0  | 0  | 4                     |
| 3          | Suburb       | 1  | 0  | 7                     |
| 4          | Rural        | 0  | 1  | 1                     |
| 5          | Suburb       | 1  | 0  | 5                     |

The coefficients are different, but when we interpret them in light of the new variable definitions, we come to the same conclusions as before (as we should!). For instance, the intercept of 3.5 is the predicted number of weekly shopping trips made by those in the central city (the omitted category). The coefficient of +2.5 is the extra number of shopping trips made by suburban residents relative to the omitted category. Thus suburban residents are predicted here to make 3.5 + 2.5 = 6 shopping trips per week, the same as in the previous example. Similarly, the coefficient of −2.5 means that rural residents make 2.5 fewer shopping trips than central city residents do each week (3.5 − 2.5 = 1.0).
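That the choice of omitted category does not change the predictions can be seen directly. When a regression contains only the dummies of a single categorical variable, the least-squares predictions are the category means, so the coefficients can be written down without a general regression routine. The sketch below relies on that fact; the names `dummy_regression` and `predict` are introduced here for illustration.

```python
from statistics import mean


def dummy_regression(groups, y, omitted):
    """Coefficients for y regressed on the dummies of ONE categorical variable.

    The intercept is the mean of the omitted category; each remaining
    coefficient is that category's mean minus the intercept.
    """
    by_group = {}
    for group, value in zip(groups, y):
        by_group.setdefault(group, []).append(value)
    intercept = mean(by_group[omitted])
    effects = {g: mean(v) - intercept for g, v in by_group.items() if g != omitted}
    return intercept, effects


def predict(intercept, effects, group):
    """Predicted value for a category (the omitted category adds zero)."""
    return intercept + effects.get(group, 0)


locations = ["Central city", "Central city", "Suburb", "Rural", "Suburb"]
trips = [3, 4, 7, 1, 5]

rural_base = dummy_regression(locations, trips, omitted="Rural")
city_base = dummy_regression(locations, trips, omitted="Central city")

for place in ["Central city", "Suburb", "Rural"]:
    print(place, predict(*rural_base, place), predict(*city_base, place))
```

Both codings give 3.5, 6, and 1 trips per week for the central city, suburb, and rural categories; only the way those predictions are split between intercept and coefficients differs.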

7.3.1 Dummy Variable Regression in a Recreation Planning Example

Part of the statewide recreation planning process is to generate estimates and forecasts of recreation activity. Annual participation in a specific recreation activity is taken to be a function of variables such as age, income, and population density. In New York State, a survey of approximately 7500 people was undertaken; individuals were asked about their participation frequencies in various activities, and their age, income, and location were recorded. The independent explanatory variables were recorded as dummy variables. The dependent variable is the number of times the individual participated in the activity at organized public or private facilities, over the course of a year. A multiple regression analysis was run for each recreation activity. Table 7.6 displays the results. For each independent variable, the highest category is omitted (i.e., high income, elderly, and urban locations). Recall that the coefficients are to be interpreted relative to these omitted categories. Thus a person in the low income category swims, on average, 5.55 fewer times per year than a person in the high income category. By using these coefficients, it is easy to estimate the participation frequency for any activity and any set of categories. For example, how often does a middle income, young adult, living in the rural regions of the state, participate in court games at organized facilities each year? The answer is 1.34 (the intercept, called the "base participation rate" in the table) plus 0.43 (the coefficient associated with those in the middle income category) plus 5.76 (for being a young adult) minus 3.65 (for those living in the rural area).

Table 7.6 Recreation participation coefficients

| Activity | Base participation rate(a) | Income: Low | Income: Low–middle | Income: Middle | Income: High–middle | Age: Youth | Age: Young adult | Age: Adult | Age: Middle aged | Density: Rural | Density: Exurban | Density: Suburban |
|----------|------|-------|------|------|------|-------|------|------|------|-------|------|------|
| Swimming | 2.45 | −5.55 | 4.37 | 0.73 | 3.22 | 22.83 | 9.83 | 8.94 | 2.91 | 8.51 | 5.98 | 4.78 |
| Biking | 0.06 | 0.94 | 0.11 | 0.92 | 0.07 | 21.63 | 6.21 | 3.82 | 1.17 | 1.64 | 2.95 | 1.31 |
| Court games | 1.34 | 0.55 | 0.63 | 0.43 | 1.31 | 16.41 | 5.76 | 1.65 | 0.12 | −3.65 | 2.11 | 1.54 |
| Camping | 0.27 | 0.13 | 0.01 | 0.34 | 0.39 | 1.93 | 0.44 | 0.01 | 0.19 | 1.25 | 0.41 | 0.80 |
| Tennis | 0.74 | 2.30 | 2.42 | 2.45 | 0.46 | 7.76 | 4.28 | 3.31 | 0.61 | 1.48 | 1.78 | 2.14 |
| Picnicking | 0.66 | 0.15 | 0.31 | 0.48 | 0.67 | 1.87 | 2.23 | 1.84 | 0.67 | 2.30 | 0.97 | 1.10 |
| Golf | 0.93 | 1.44 | 1.38 | 0.71 | 0.58 | 0.30 | 0.30 | 0.28 | 0.70 | 1.02 | 1.34 | 0.42 |
| Fishing | 0.21 | 1.35 | 0.17 | 0.57 | 0.83 | 3.36 | 1.26 | 2.01 | 0.72 | 1.77 | 1.41 | 1.17 |
| Hiking | 1.31 | 0.22 | 0.20 | 0.04 | 0.77 | 2.49 | 0.65 | 0.47 | 0.18 | 0.53 | 0.54 | 0.30 |
| Boating | 0.24 | 1.06 | 0.77 | 0.09 | 0.45 | 3.58 | 1.42 | 1.28 | 0.50 | 2.36 | 1.68 | 1.86 |
| Field games | 0.17 | 0.39 | 0.19 | 1.53 | 0.86 | 8.22 | 2.73 | 0.94 | 0.13 | 1.48 | 0.83 | 0.56 |
| Skiing | 0.70 | 0.98 | 1.10 | 0.86 | 0.36 | 0.85 | 0.37 | 0.63 | 0.03 | 0.19 | 0.78 | 0.31 |
| Snowmobiling | 0.07 | 0.50 | 0.48 | 0.08 | 0.37 | 0.33 | 0.85 | 0.20 | 0.14 | 1.52 | 0.97 | 0.44 |
| Local winter | 0.66 | 1.49 | 1.18 | 0.80 | 0.15 | 5.13 | 1.34 | 0.82 | 0.06 | 1.84 | 0.32 | 1.05 |

(a) The control group selected was the highest income, age, and density group. This column gives the estimated base annual participation rate for this group. The other columns give the amounts to be added to this amount to obtain participation rates for any other income, age, and density group. Negative participation rates are possible and should be interpreted as zero.

Source: New York State Office of Parks and Recreation (1978).
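The arithmetic behind such estimates is a sum of the base rate and the coefficients of the non-omitted categories. The sketch below applies it to the two cases worked in the text (court games for a middle income, rural young adult, and swimming for high–middle income elderly urban residents); the function name `participation_rate` is an illustrative assumption, and the floor at zero follows the note to Table 7.6.

```python
def participation_rate(base, *coefficients):
    """Base participation rate plus the coefficients that apply.

    Omitted categories contribute 0. Negative totals are read as
    zero, following the note to Table 7.6.
    """
    return max(0.0, base + sum(coefficients))


# Court games: middle income (+0.43), young adult (+5.76), rural (-3.65).
court_games = participation_rate(1.34, 0.43, 5.76, -3.65)

# Swimming: high-middle income (+3.22), elderly (omitted), urban (omitted).
swimming = participation_rate(2.45, 3.22, 0.0, 0.0)

print(round(court_games, 2))  # 3.88
print(round(swimming, 2))     # 5.67
```

Multiplying such rates by the projected number of people in each income, age, and density category would give a rough demand forecast of the kind described in the text.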


Our estimate is therefore equal to 3.88 times per year. It should of course be kept in mind that this is an average participation rate across all individuals in that category. If an individual happens to be in an omitted category, the implied coefficient is equal to zero. Thus high–middle income elderly residents of urban areas swim on average 2.45 (base participation rate, or intercept) plus 3.22 (income coefficient) plus 0 (since they are in the omitted, elderly category) plus 0 (since they are also in the omitted, urban category), which is equal to 5.67 times per year.

Recreation planners use these coefficients to plan for future use. Demographic projections provide forecasts of the number of people in each age/income/population density category. If we can rely on the coefficients in the table as estimates of individual recreation participation, we can use the coefficients together with the demographic forecasts to project how much demand there will be for each recreation activity. The state can then prioritize recreation projects in a way that will best meet the anticipated demand.

7.4 Multiple Regression Illustration: Species in the Galapagos Islands

The data in Table 7.7 contain two possible dependent variables related to the number of species found on 30 islands (total species, and number of native species), as well as five potential independent variables that may help to understand why different islands have different numbers of species. In our example, we will use the total number of species as the dependent variable. We will now explore some of the choices and questions that one is faced with in arriving at a suitable regression model. All of the output shown in the tables is from SPSS for Windows 9.0.

7.4.1 Model 1: The Kitchen-Sink Approach

One idea would be to simply put all five independent variables on the right-hand side and see what happens, i.e., everything is put into the equation except the kitchen sink! This approach is not recommended, and is shown here for illustrative purposes only. The output in Table 7.8 shows the value of r² to be 0.877. There are two significant variables: elevation has a positive effect on species number, and the area of the adjacent island has a negative effect. Note that the sign of the area variable is negative, which is counter to one's intuition that more species would be found on larger islands. The standard error of the estimate is 66, which is approximately equivalent to the average absolute value of a residual (in this case the actual average absolute value of a residual is 44). This is pretty high, since half of the islands have fewer than 44 species! Finally, note that the ANOVA table is similar to that in the univariate case, with the regression degrees of freedom equal to the number of independent variables.


It is often tempting to use the kitchen-sink approach because when an independent variable is added to the regression equation, the R² value never decreases and almost always increases. It is important to realize that a high R² value is not the primary goal of regression analysis; if it were, we could simply keep adding explanatory variables until we achieved our desired value of R²! A more reasonable strategy often involves either (a) deleting variables that do not reduce the value of R² very much, and/or (b) adding variables only when they increase R² appreciably. Variable selection is discussed in more detail in Section 7.5.

Table 7.7 Galapagos Islands: species and geography

| No. | Island | Observed species: Number | Observed species: Native | Area (km²) | Elevation (m) | Distance from nearest island (km) | Distance from Santa Cruz (km) | Area of adjacent island (km²) |
|-----|--------|------|------|---------|------|------|-------|---------|
| 1 | Baltra | 58 | 23 | 25.09 | – | 0.6 | 0.6 | 1.84 |
| 2 | Bartolomé | 31 | 21 | 1.24 | 109 | 0.6 | 26.3 | 572.33 |
| 3 | Caldwell | 3 | 3 | 0.21 | 114 | 2.8 | 58.7 | 0.78 |
| 4 | Champion | 25 | 9 | 0.10 | 46 | 1.9 | 47.4 | 0.18 |
| 5 | Coamaño | 2 | 1 | 0.05 | – | 1.9 | 1.9 | 903.82 |
| 6 | Daphne Major | 18 | 11 | 0.34 | 119 | 8.0 | 8.0 | 1.84 |
| 7 | Daphne Minor | 24 | – | 0.08 | 93 | 6.0 | 12.0 | 0.34 |
| 8 | Darwin | 10 | 7 | 2.33 | 168 | 34.1 | 290.2 | 2.85 |
| 9 | Eden | 8 | 4 | 0.03 | – | 0.4 | 0.4 | 17.95 |
| 10 | Enderby | 2 | 2 | 0.18 | 112 | 2.6 | 50.2 | 0.10 |
| 11 | Espanola | 97 | 26 | 58.27 | 198 | 1.1 | 88.3 | 0.57 |
| 12 | Fernandina | 93 | 35 | 634.49 | 1494 | 4.3 | 95.3 | 4669.32 |
| 13 | Gardner* | 58 | 17 | 0.57 | 49 | 1.1 | 93.1 | 58.27 |
| 14 | Gardner† | 5 | 4 | 0.78 | 227 | 4.6 | 62.2 | 0.21 |
| 15 | Genovesa | 40 | 19 | 17.35 | 76 | 47.4 | 92.2 | 129.49 |
| 16 | Isabela | 347 | 89 | 4669.32 | 1707 | 0.7 | 28.1 | 634.49 |
| 17 | Marchena | 51 | 23 | 129.49 | 343 | 29.1 | 85.9 | 59.56 |
| 18 | Onslow | 2 | 2 | 0.01 | 25 | 3.3 | 45.9 | 0.10 |
| 19 | Pinta | 104 | 37 | 59.56 | 777 | 29.1 | 119.6 | 129.49 |
| 20 | Pinzón | 108 | 33 | 17.95 | 458 | 10.7 | 10.7 | 0.03 |
| 21 | Las Plazas | 12 | 9 | 0.23 | – | 0.5 | 0.6 | 25.09 |
| 22 | Rabida | 70 | 30 | 4.89 | 367 | 4.4 | 24.4 | 572.33 |
| 23 | San Cristóbal | 280 | 65 | 551.62 | 716 | 45.2 | 66.6 | 0.57 |
| 24 | San Salvador | 237 | 81 | 572.33 | 906 | 0.2 | 19.8 | 4.89 |
| 25 | Santa Cruz | 444 | 95 | 903.82 | 864 | 0.6 | 0.0 | 0.52 |
| 26 | Santa Fé | 62 | 28 | 24.08 | 259 | 16.5 | 16.5 | 0.52 |
| 27 | Santa Maria | 285 | 73 | 170.92 | 640 | 2.6 | 49.2 | 0.10 |
| 28 | Seymour | 44 | 16 | 1.84 | – | 0.6 | 9.6 | 25.09 |
| 29 | Tortuga | 16 | 8 | 1.24 | 186 | 6.8 | 50.9 | 17.95 |
| 30 | Wolf | 21 | 12 | 2.85 | 253 | 34.1 | 254.7 | 2.33 |

*Near Espanola. †Near Santa Maria. The values marked – are not known.
Source: Andrews and Herzberg (1985).


7.4.2 Missing values

Table 7.8 The kitchen-sink model

Delving right into the analysis without a consideration of certain questions can lead to misinterpretation. Before we start, we should decide how we are going to treat missing values. In Table 7.7 there are five missing values of elevation. All of the missing values, with the exception of the first observation, are for islands that have extremely small areas (and, with the exception of Seymour, small numbers of species). There are various ways we can proceed, including:

(1) Delete all observations with missing values. This is the default option used by many software packages.
(2) Replace missing values with the mean. Most statistical software packages have an option that allows missing values to be replaced with the mean of the remaining values.
(3) Use the other independent variables, or some subset of them, to predict the missing value. We could perform an initial regression of elevation on the other four independent variables for the nonmissing cases. Then we can use the results to predict the value of elevation, based upon the values of the other independent variables, for the missing observations.

Which option, or combination of options, should we choose? Option two is not a reasonable one here. The mean elevation is 412 m, and it would be foolish to suppose that the unknown elevations are this great on the very small islands for which we have no data. For the very small islands (observations 5, 9, and 21), one can justifiably exclude them from the analysis. Although we might also exclude Baltra and Seymour, here we will estimate their elevations from a regression of elevation on area. The resulting regression equation is

$$\text{Elevation} = 300 + 0.358\,\text{Area} \tag{7.10}$$

We estimate Baltra's elevation as 300 + 25.09(0.358) = 309 m, and Seymour's as 300 + 1.84(0.358) = 301 m.
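The treatment chosen here combines deletion (for the three very small islands) with regression-based prediction (for Baltra and Seymour). A small sketch of that rule follows; the function name `estimated_elevation` and the dictionary layout are illustrative, while the areas come from Table 7.7 and the coefficients from Equation 7.10.

```python
def estimated_elevation(area_km2):
    """Equation 7.10: elevation in metres predicted from island area in km2."""
    return 300 + 0.358 * area_km2


# Islands with unknown elevation, with their areas (km2) from Table 7.7.
missing_elevation = {
    "Baltra": 25.09,
    "Coamaño": 0.05,
    "Eden": 0.03,
    "Las Plazas": 0.23,
    "Seymour": 1.84,
}

# Observations 5, 9, and 21 are dropped; the other two are imputed.
excluded = {"Coamaño", "Eden", "Las Plazas"}
imputed = {
    island: round(estimated_elevation(area))
    for island, area in missing_elevation.items()
    if island not in excluded
}

print(imputed)  # {'Baltra': 309, 'Seymour': 301}
```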


7.4.3 Outliers and Multicollinearity

A cursory examination of the data reveals that there is a small number of large islands. This is not an unusual feature of many studies, and it is important that we know whether these observations are exerting a significant effect on the results. Leverage values are designed to indicate how influential particular observations are in regression analysis. If the leverage value exceeds 2p/n, where p is the number of independent variables, the observation should be considered as an outlier.

We also want to make sure that multicollinearity is not exerting an undue influence on the results. An examination of the correlations among the independent variables will reveal those where the high correlations exist. The tolerance is equal to the amount of variance in an independent variable that is not explained by the other independent variables. It is equal to 1 − r², where the r² is associated with the regression of the independent variable on all other independent variables. A low tolerance indicates problems with multicollinearity, since the variable in question has a high correlation with the other independent variables. The reciprocal of the tolerance is the variance inflation factor (VIF); if it is greater than about 5, this indicates potential multicollinearity problems.
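These diagnostics are simple to compute once the relevant r² is known. The sketch below uses the two predictors from Table 7.1 (lot size and bedrooms) purely as a convenient example; with only two independent variables, the r² from regressing one on the other is just their squared correlation. The function names are illustrative, and the leverage cutoff is the 2p/n rule of thumb quoted above.

```python
from math import sqrt


def correlation(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def tolerance_and_vif(r_squared):
    """Tolerance is 1 - r^2; the VIF is its reciprocal."""
    tolerance = 1 - r_squared
    return tolerance, 1 / tolerance


def leverage_cutoff(p, n):
    """Rule-of-thumb leverage threshold 2p/n for p predictors, n observations."""
    return 2 * p / n


# Predictors from Table 7.1, used here only as example data.
lot_size = [5000, 5500, 6000, 6500, 5200, 5400, 5700, 6100, 6300, 6400]
bedrooms = [3, 2, 4, 3, 2, 1, 3, 4, 4, 5]

r = correlation(lot_size, bedrooms)
tolerance, vif = tolerance_and_vif(r ** 2)
print(round(r, 3), round(tolerance, 3), round(vif, 3))

# Five predictors and 27 observations, as in the Galapagos example.
print(round(leverage_cutoff(5, 27), 2))  # 0.37
```

For these two predictors the VIF comes out well under the rule-of-thumb value of 5.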

7.4.4 Model 2

In this second model, we have accounted for the missing elevation data and have output information on outliers and multicollinearity. From the output (Table 7.9), we see that the inclusion of Baltra and Seymour has not changed the results very much. The r² value is 0.869, the standard error of the residuals is about 65, and elevation and the area of the adjacent island are still the only significant independent variables. We learn, however, that the variance inflation factor is slightly high (though not greater than the rule-of-thumb value of 5) for elevation and area. This is not surprising when we also inspect the correlation matrix, which reveals a very high correlation between the two variables.

One option for treating multicollinearity is to exclude from the analysis one or more variables. Here area and elevation are correlated, and we might decide to drop one of the two since they are close to being redundant (and since the sign of one of them is not correct). Which one should we drop? The choice should come primarily from a consideration of the underlying process and, secondarily, from the magnitude of the variance inflation factors. Both area and elevation should affect species number, but we will choose to exclude area because (a) elevation is important in terms of species diversity, and (b) the VIF is slightly higher for area than it is for elevation. A note of caution is in order here. Dropping variables from the analysis should only occur after a well-reasoned consideration of the underlying process. It does little to advance understanding of the process when important variables are dropped from the regression equation solely because they don't perform well.

Table 7.9 Regression estimation with outliers removed

The leverage values also reveal that several outliers have an important impact upon the results. Leverage values are over the rule-of-thumb value of 2p/n = 10/27 = 0.37 for observations 8, 12, 15, and 16. Fernandina (observation 12) and Isabela (observation 16) have by far the two highest elevations among the thirty islands. There seems to be less justification for deleting the other two observations. Darwin (observation 8) and Genovesa (15) are geographic outliers, but there are also other geographic outliers with low leverage values.

7.4.5 Model 3

In this third model, we have deleted area as an independent variable and have deleted the observations for Fernandina and Isabela. The output (Table 7.10) shows that only elevation remains significant. The value of r² remains high at 0.858, and the standard error of the estimate is slightly lower, at about 62. Multicollinearity is not an issue, since all of the VIFs are less than 5. Three observations (Bartolomé, Darwin, and Rabida) are still outliers. This is likely due to their extreme values on some of the independent variables. Bartolomé and Rabida are both adjacent to large islands, and Darwin is a long way from Santa Cruz.

Table 7.10 Regression with missing data removed or estimated, and outliers and area variable removed

7.4.6 Model 4

To end up with a parsimonious model, we can remove the variables that are not significant. Thus we regress species number on elevation. The result is the equation

$$\text{Species} = -24.27 + 0.35\,\text{Elevation} \tag{7.11}$$

From Table 7.11, the value of r² is still high at 0.846 (it is necessarily lower than before since we have removed variables, but it has not declined very much). The standard error of the estimate is about 60. Furthermore, a check of the leverage values reveals no outliers.

Table 7.11 Final regression equation for species data
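The final equation can be checked against Table 7.7 with a few lines of code. The sketch below assumes that Model 4 uses the 25 islands that remain after dropping observations 5, 9, and 21 (missing elevation, very small) and Fernandina and Isabela (high leverage), with the elevations of Baltra and Seymour set to the estimates of 309 m and 301 m. That reading follows from the text, but the exact data behind Table 7.11 are not shown, so it should be treated as an assumption. The function name `simple_regression` is illustrative.

```python
def simple_regression(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# (island, elevation in m, total number of species), from Table 7.7.
# Baltra and Seymour use the elevations estimated from Equation 7.10.
islands = [
    ("Baltra", 309, 58), ("Bartolomé", 109, 31), ("Caldwell", 114, 3),
    ("Champion", 46, 25), ("Daphne Major", 119, 18), ("Daphne Minor", 93, 24),
    ("Darwin", 168, 10), ("Enderby", 112, 2), ("Espanola", 198, 97),
    ("Gardner (near Espanola)", 49, 58), ("Gardner (near Santa Maria)", 227, 5),
    ("Genovesa", 76, 40), ("Marchena", 343, 51), ("Onslow", 25, 2),
    ("Pinta", 777, 104), ("Pinzón", 458, 108), ("Rabida", 367, 70),
    ("San Cristóbal", 716, 280), ("San Salvador", 906, 237),
    ("Santa Cruz", 864, 444), ("Santa Fé", 259, 62), ("Santa Maria", 640, 285),
    ("Seymour", 301, 44), ("Tortuga", 186, 16), ("Wolf", 253, 21),
]

elevation = [row[1] for row in islands]
species = [row[2] for row in islands]

intercept, slope = simple_regression(elevation, species)
print(round(intercept, 2), round(slope, 2))
```

Under these assumptions the fitted line agrees with Equation 7.11 to the precision printed there.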


7.5 Variable Selection

As seen in the example, a common issue in regression analysis is the selection of variables that will appear as explanatory variables on the right-hand side of the regression equation. A brute-force approach to this question would be to try all possible combinations. With $p$ potential independent variables this would mean that we would try $p$ separate regressions that have just one independent variable, all $\binom{p}{2} = p(p-1)/2$ equations that have two variables, all $\binom{p}{3} = p(p-1)(p-2)/6$ equations that have three variables, and so on. If $p$ is large, this is a lot of equations. Even with $p = 5$, one would have to try 31 regression equations. But perhaps the real drawback is that this is even more of a "kitchen-sink" approach than to include all of the variables on the right-hand side. It amounts to an admission that we don't know what we are doing, and that our strategy is just to go with what looks best. It is always desirable to start from hypotheses and underlying processes first, in keeping with the principles of the scientific method described in the first chapter. In a more exploratory spirit, however, there may be some cases where we really have little in the way of a priori hypotheses; in this case, the all-possible-regressions approach might be viewed as a potential way to generate new hypotheses.

An alternative way to select variables for inclusion in a regression equation is the forward selection approach. The variable that is most highly correlated with the dependent variable is entered first. Then, given that that variable is already in the equation, a search is made to see whether there are other variables that would be significant if added. If so, the one with the greatest significance is added. In this way, a regression equation is built up. The procedure terminates when there are no variables in the set of potential variables that would be significant if entered into the equation.
Backward selection starts with the kitchen-sink equation, where all of the possible independent variables are in the equation. Then the one that contributes least to the $r^2$ value is removed if the reduction in $r^2$ is not significant. The process of removing variables continues until the removal of any variable in the equation would constitute a significant reduction in $r^2$.

Stepwise regression is a combination of the forward and backward procedures. Variables are added in the manner of forward selection. However, as each variable is added, variables entered on earlier steps are re-checked to see if they are still significant. If they are not still significant, they are removed.
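To make the size of the all-possible-regressions search concrete, the short sketch below counts the candidate equations for a few values of p. The function name is an assumption introduced for illustration; the count is simply the number of non-empty subsets of p variables.

```python
from math import comb

def count_all_possible_regressions(p: int) -> int:
    """Number of regressions using 1, 2, ..., p of the p candidate variables."""
    return sum(comb(p, k) for k in range(1, p + 1))

# The total equals 2**p - 1, so it roughly doubles with each added variable.
for p in (3, 5, 10, 20):
    print(p, count_all_possible_regressions(p))
```

With p = 5 the count is the 31 equations mentioned in the text; the rapid growth for larger p is one practical reason the stepwise procedures are used instead.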

7.6 Categorical Dependent Variable

There are many situations where the dependent variable will be a categorical variable. For example, we may wish to model whether individuals patronize a park as a function of the distance to the park, or whether individuals commute by train as a function of automobile travel time. We may want to estimate the probability that a customer patronizes any of, say, four supermarkets as a function of characteristics of the stores and characteristics of the customers. In each of these examples, the dependent variable is categorical in the sense that possible outcomes may be placed into categories. An individual either goes to the park or does not go. An individual either commutes by train or does not. If there are only four choices of supermarkets in an area, the consumers may be classified according to which one they patronize.

When the dependent variable is categorical, special consideration must be given to how regression analyses are carried out. In this section, we will examine why this is the case, and we will find out how logistic regression may be used in such situations.

7.6.1 Binary Response

In the simplest case, there are two possible responses. For example, we may assign the dependent variable a value of $y = 1$ if the individual takes the train to work, and $y = 0$ otherwise. Suppose, for instance, we had the data in Table 7.12 for $n = 12$ respondents. A cursory examination of the table reveals that there seems to be a tendency to take the train when the travel time by automobile would be high. Where auto travel time is not as high, there is more of a tendency for the $y$ variable to be zero, indicating that the individual drives to work. We could begin by running an ordinary least-squares analysis; we would find

$$\hat{y} = -0.396 + 0.0153x \tag{7.12}$$
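As a check on Equation 7.12, the least-squares line can be recomputed directly from the twelve observations in Table 7.12. The sketch below is illustrative only (the function and variable names are assumptions); it also reports the travel times at which the fitted line crosses 0 and 1, beyond which the fitted value can no longer be read as a probability.

```python
# Data from Table 7.12: y = 1 if the commuter takes the train,
# x = automobile travel time in minutes.
y = [0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0]
x = [32, 89, 50, 49, 80, 56, 40, 70, 72, 76, 32, 58]

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

a, b = ols_fit(x, y)
x_at_zero = -a / b        # fitted value is negative below this travel time
x_at_one = (1 - a) / b    # fitted value exceeds one above this travel time

print(f"intercept = {a:.3f}, slope = {b:.4f}")
print(f"fitted value < 0 for x < {x_at_zero:.1f}, > 1 for x > {x_at_one:.1f}")
```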

The value of $\hat{y}$, which is a continuous variable, may be interpreted as the predicted probability of taking the train, given an automobile travel time equal to $x$. There are several problems with this approach. One is that the assumption of homoscedasticity is not met; the estimated variance about the regression line is equal to $\hat{y}(1 - \hat{y})$ and therefore is not constant. Perhaps more troubling is that the predicted probabilities ($\hat{y}$) do not have to stay on the (0, 1) interval; the reader might confirm that values of $x$ less than around 25 will yield negative probabilities, and values of $x$ greater than about 91 will yield probabilities greater than one! The problem is as shown in Figure 7.2.

Table 7.12

| y | x: Auto travel time (min) |
|---|---|
| 0 | 32 |
| 1 | 89 |
| 0 | 50 |
| 1 | 49 |
| 0 | 80 |
| 1 | 56 |
| 0 | 40 |
| 1 | 70 |
| 1 | 72 |
| 1 | 76 |
| 0 | 32 |
| 0 | 58 |

How then should we proceed? One idea is to make the probabilities relate to $x$ in a nonlinear way. The logistic curve in Figure 7.3 has the following equation:

$$\hat{y}_i = \frac{e^{\alpha + \beta x_i}}{1 + e^{\alpha + \beta x_i}} \tag{7.13}$$

Note that when $\alpha + \beta x$ is a large negative number the predicted probability is near zero, whereas if $\alpha + \beta x$ is a large positive number the predicted probability is near one. In these cases, the predicted probabilities approach their asymptotes of 0 or 1 but never actually reach them. Thus it is no longer possible to predict probabilities that are either negative or greater than one.

Figure 7.2 Predicted probabilities outside the (0,1) interval

Figure 7.3 The logistic curve

While we've solved one problem by keeping the dependent variable on the (0, 1) interval, we've created another. How can we estimate the parameters? We can't use linear regression, since the equation is clearly not linear. One approach is to use nonlinear least squares. Specifically, we want to find $\alpha$ and $\beta$ to minimize the sum of squared deviations between observed and predicted values

$$\min_{\alpha,\,\beta} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{7.14}$$
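The objective in Equation 7.14 is easy to evaluate for any trial pair of parameters. The sketch below, with function names that are assumptions for illustration, computes the sum of squared residuals for the Table 7.12 data at the starting values α = 0, β = 0 and at the estimates reported in the text (α = -4.501, β = 0.0802). It only evaluates the objective; it does not carry out the minimization itself.

```python
import math

# Data from Table 7.12
y = [0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0]
x = [32, 89, 50, 49, 80, 56, 40, 70, 72, 76, 32, 58]

def logistic(alpha, beta, xi):
    """Predicted probability from Equation 7.13, written as 1 / (1 + e^(-z))."""
    z = alpha + beta * xi
    return 1.0 / (1.0 + math.exp(-z))

def sum_squared_residuals(alpha, beta):
    """Objective of Equation 7.14 for the Table 7.12 data."""
    return sum((yi - logistic(alpha, beta, xi)) ** 2 for xi, yi in zip(x, y))

sse_start = sum_squared_residuals(0.0, 0.0)         # every prediction is 0.5
sse_text = sum_squared_residuals(-4.501, 0.0802)    # estimates quoted in the text

print(f"SSE at (0, 0):            {sse_start:.3f}")
print(f"SSE at (-4.501, 0.0802):  {sse_text:.3f}")
```

A nonlinear least-squares routine searches over α and β for the pair that makes this quantity as small as possible.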

where the predicted value, $\hat{y}_i$, is given by Equation 7.13 above. The answer, when minimizing the sum of squared residuals for the data in Table 7.12, is $\alpha = -4.501$ and $\beta = 0.0802$.

Although the predicted probabilities are not linearly related to $x$, we can transform the predicted probabilities into a new variable, $z$, which is linearly related to $x$. The transformation is called the logistic transform, and it is carried out by first finding $\hat{y}/(1 - \hat{y})$ and then taking the logarithm of the result. The quantity $\hat{y}/(1 - \hat{y})$ is known as the odds, and so the new variable is known as the "log-odds." Thus we have

$$z = \ln\left(\frac{\hat{y}}{1 - \hat{y}}\right) = \alpha + \beta x \tag{7.15}$$

The logistic regression model is therefore one that assumes that the log-odds increases (or decreases) linearly as $x$ increases (see Figure 7.4).

Odds are often stated in place of probabilities for events such as horse racing. If the probability that a horse wins a race is 0.2, the probability that it loses is 0.8. The odds against it winning are stated as 4 to 1 (= 0.8/0.2). Suppose that five people each bet $1; one bets that the horse will win, and the other four that the horse will lose. If the horse wins, the person betting for the horse will collect $5, equal to the total amount bet (of course, in reality the winner will have to give a share of his winnings to the race track, and to the government in payment of taxes!). If the probability that the horse wins rises to 0.333, the probability that it loses declines to 0.667, and the odds against it decline to 2 to 1 (= 0.667/0.333). Though it is less common to do so, we could also state the odds in the other direction. When the horse has a winning probability of 0.2, the odds that the horse wins are 0.2/0.8 = 0.25 to 1. When the probability of winning rises to 0.33, the odds in favor of the horse rise to 0.33/0.67 ≈ 0.5 to 1.

Returning to our example, the log-odds that an individual takes the train is given by

$$z = \ln\left(\frac{\hat{y}}{1 - \hat{y}}\right) = -4.501 + 0.0802x \tag{7.16}$$


Figure 7.4 Linear relationship between log-odds and x

We see that the slope coefficient $\beta$ tells us how much the log-odds will change when $x$ changes by one unit. In the present example, when $x = 32$ minutes, the predicted probability of taking the train is 0.1262, using $x = 32$ and the estimated values of $\alpha$ and $\beta$. The odds of taking the train are then given as 0.1444 (= 0.1262/(1 − 0.1262)) to 1. Stated another way, the odds against taking the train are 6.92 (= 1/0.1444) to 1. Should the individual experience an increase in auto travel time of one minute, the log of the odds of choosing the train would go up by 0.0802.

What does it mean to say that the log of the odds has gone up by 0.0802? We can "undo" the log by exponentiating; if $a = \ln(b)$, then $b = e^a$. This means that if the log of the odds has increased by 0.0802, the odds of choosing the train have increased by a factor of $e^{0.0802} = 1.084$. The new odds of taking the train are now equal to 0.1565 (= 0.1444 × 1.084) to 1. Equivalently, the odds against the train have declined by a factor of 1.084, and are now 6.39 (= 6.92/1.084) to 1.

When the automobile travel time increases by one minute, from $x = 32$ to $x = 33$, the probability of taking the train increases from 0.1262 to 0.1353. This may be verified by using Equation 7.13 with the estimated values of $\alpha$ and $\beta$, and $x = 33$:

$$\hat{y}_i = \frac{e^{\alpha + \beta x_i}}{1 + e^{\alpha + \beta x_i}} = \frac{0.1565}{1 + 0.1565} = 0.1353 \tag{7.17}$$

Finally, note that in logistic regression, when $\alpha + \beta x = 0$, the predicted probability is equal to 1/2 (since $e^0 = 1$). This is equivalent to $x = -\alpha/\beta$. In our example, $-\alpha/\beta = 4.501/0.0802$, which is approximately equal to 56. When automobile travel time is about 56 minutes, the probability that an individual takes the train is about 0.5 (or "50-50").
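The arithmetic of this worked example can be retraced with a few lines of code. The sketch below assumes the nonlinear least-squares estimates quoted in the text (α = -4.501, β = 0.0802); the function names are introduced here for illustration only.

```python
import math

ALPHA, BETA = -4.501, 0.0802   # estimates quoted in the text

def probability(x):
    """Predicted probability of taking the train (logistic curve)."""
    z = ALPHA + BETA * x
    return math.exp(z) / (1.0 + math.exp(z))

def odds(x):
    """Odds in favor of taking the train: p / (1 - p)."""
    p = probability(x)
    return p / (1.0 - p)

for minutes in (32, 33):
    print(f"x = {minutes}: probability = {probability(minutes):.4f}, "
          f"odds = {odds(minutes):.4f} to 1")

# A one-minute increase multiplies the odds by e**BETA, whatever the starting x.
odds_ratio = odds(33) / odds(32)
print(f"odds ratio for one extra minute = {odds_ratio:.4f}")

# Travel time at which the predicted probability is one-half.
midpoint = -ALPHA / BETA
print(f"probability is 0.5 at x = {midpoint:.1f} minutes")
```

Small differences in the last decimal place relative to the text arise because the text rounds intermediate results.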

7.7 A Summary of Some Problems That Can Arise in Regression Analysis

Table 7.13, adapted from Haining (1990a), summarizes some of the problems that can plague regression analyses. The table describes the consequences of the problem and, in addition, describes how the problem may be diagnosed and corrected. Section numbers refer to other sections of this text that provide relevant discussion of the problem.

7.8 Multiple and Logistic Regression in SPSS for Windows 9.0

7.8.1 Multiple Regression

Data input is similar to that for simple linear regression (Chapter 6). Each observation is represented by a row in the data table, and each variable is represented by a column. Click on Analyze/Regression/Linear. Then move the dependent variable into the box labeled Dependent on the right, and move the desired independent variables into the box labeled Independents. Then click on OK.

There are a number of common options that you may wish to choose before clicking on OK. Under save, it is common to check the boxes to save predicted values, residuals, leverage values to detect outliers, and confidence intervals for either the mean (i.e., the regression line) or individual predictions. All of these saved quantities will be attached as new columns in the dataset. Under statistics, it is desirable to check Collinearity diagnostics, to check for multicollinearity. Under the box where variables that are in the regression are indicated, one may choose the method by which independent variables are entered onto the right-hand side. The default is "enter", which means that all independent variables will be entered. A common alternative is to choose stepwise, which enters and removes variables one at a time, depending upon their significance.

7.8.2 Logistic Regression

There are two ways that logistic regression may be carried out using SPSS for Windows. The first approach is to use nonlinear least squares. This is easiest to understand, since, as the previous section indicates, we are simply looking for the values of $\alpha$ and $\beta$ that will make the sum of squared residuals as small as possible.


Table 7.13 Some problems that can arise in regression analysis

| Problem | Consequences | Diagnostic | Corrective action |
|---|---|---|---|
| Residuals: Nonnormal | Inferential tests may be invalid | Shapiro–Wilk test (4.82) | Transform y values |
| Residuals: Heteroscedastic | Biased estimation of error variance, leading to invalid inference | Plot residuals against y and xs | Transform y values |
| Residuals: Not independent | Underestimate variance of regression coefficients. Inflated $R^2$ | Moran's I (8.3.3) | Spatial regression (9.3) |
| Nonlinear relationship | Poor fit and nonindependent residuals | Scatterplots of y against xs. Added variable plots (9.2) | Transform y and/or x variables |
| Multicollinearity (7.1.1) | Variance of regression estimates is inflated | Variance inflation factor (7.4) | Delete variable(s) |
| Incorrect set of explanatory variables (7.2) | Difficulties in performing efficient analysis, and poor regression estimates | Added variable plots (9.2) | Stepwise regression (7.5) |
| Outliers (7.4) | May severely affect model estimates and fit | Plots. Leverage values (7.4) | Delete observations (7.4) |
| Categorical response variable | Linear regression model inappropriate | | Logistic regression (7.6) |
| Spatially varying parameters | Invalid estimation and inference | Moran's I (8.3.3) | Expansion method; geographically weighted regression (9.4) |
| Missing data at random | Could waste other case information if deleted | | Estimate missing values (7.4) |
| Missing data (nonrandom) | Possibly invalid estimation and inference | | Delete observation (7.4) |

Note: Relevant sections of text are given in parentheses. Source: Adapted from Haining (1990a), pp. 332–33.

Data input. In both cases, the approach to data input is the same. As in linear regression, the dependent variables and independent variables are arranged in columns. Each row represents an observation. It is common, but not necessary, to have the dependent variable in the first column. Make sure that the column containing the dependent variable consists of a column of "0"s and "1"s, consistent with its binary response nature.


Using SPSS for Windows 9.0 and nonlinear least squares

1. Choose Analyze, Regression, Nonlinear.
2. Under parameters, define $\alpha$ and $\beta$, and give estimated values. Choosing good estimated values is sometimes important, and not always easy to do. It may require a bit of trial and error. Using $\alpha = 0$ and $\beta = 0$ is often not a bad way to start.
3. Next, select the dependent variable.
4. Next, set up the model; this refers to the equation for the predicted values of the dependent variable. For logistic regression you should define the model as in Equation 7.13.
5. Choose OK to run the nonlinear least-squares analysis.

The nonlinear least-squares approach, however, is a bit more awkward to implement in SPSS than its alternative, known as the maximum likelihood approach to finding $\alpha$ and $\beta$. In addition, maximum likelihood is, to a statistician, generally a preferable alternative since it produces estimates that are unbiased and that have relatively smaller sampling variances, at least when the sample sizes are large. Many statistical packages, including SPSS for Windows 9.0, make use of maximum likelihood estimation. The likelihood of observing $y = 1$ is

$$\frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$

Similarly, the likelihood of observing $y = 0$ is

$$1 - \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$

Because $x$ differs from one observation to the next, each observation contributes its own factor, and the likelihood of the sample is therefore

$$L = \prod_{i=1}^{n} \left(\frac{e^{\alpha + \beta x_i}}{1 + e^{\alpha + \beta x_i}}\right)^{y_i} \left(1 - \frac{e^{\alpha + \beta x_i}}{1 + e^{\alpha + \beta x_i}}\right)^{1 - y_i}$$

Many programs, such as SPSS for Windows, choose $\alpha$ and $\beta$ to maximize this likelihood of obtaining the observed sample.

Using logistic regression in SPSS for Windows

1. Choose Analyze, Regression, Logistic.
2. Choose the dependent variable and the covariates (independent variables).
3. Choose OK.


Table 7.14 Logistic regression output


Using the SPSS for Windows logistic regression routine, we find

$$z = -4.5362 + 0.0773x$$

Note that these values of −4.5362 and 0.0773 are similar to those found via nonlinear least squares.

Interpreting output from logistic regression. Tables 7.14 and 7.15 display the output from the logistic regression analysis of the commuting behavior data in Table 7.12. We are, of course, interested in the slope and intercept, and these are displayed in the same part of the output where we found them in linear regression. They are given along with their standard deviations (also known as standard errors; see the column headed "S.E."). If the coefficients are more than twice their corresponding standard errors (approximately), they may be regarded as significantly different from zero. In this example, the coefficients are not significantly different from zero; this is also reflected in the column headed "Sig," where we find that the p-value associated with each coefficient is greater than 0.05. Note that the output also contains a column headed Exp(B); this is the exponentiated slope referred to in the text, and it tells us by how much the odds will change when the x variable is increased by one unit. In this example, an increase of one minute in the commuting time leads to the odds of taking the train increasing by a factor of 1.0804.

Table 7.15 Summary of results

| Y | X: Auto travel time (min) | Predicted probability: Linear OLS | Predicted probability: Logistic |
|---|---|---|---|
| 0 | 32 | .093 | .113 |
| 1 | 89 | .963 | .913 |
| 0 | 50 | .368 | .339 |
| 1 | 49 | .352 | .321 |
| 0 | 80 | .826 | .839 |
| 1 | 56 | .459 | .449 |
| 0 | 40 | .215 | .191 |
| 1 | 70 | .673 | .706 |
| 1 | 72 | .704 | .737 |
| 1 | 76 | .765 | .793 |
| 0 | 32 | .093 | .113 |
| 0 | 58 | .490 | .487 |

Another interesting part of the output is the two-by-two classification table. It shows us that there were six observations where $y = 0$ (the individual did not take the train). Of these, five were predicted correctly by the logistic regression equation, and one was predicted incorrectly. (A prediction is classified as "correct" if the model predicts that the actual outcome has a likelihood of greater than 0.5.) Note that the fifth individual has an observed auto travel time of $x = 80$ minutes. The model predicted that there would be a 0.839 probability that the individual would take the train (Table 7.15), yet we observed that he/she did not ($y = 0$). Of the six individuals who did take the train, the model predicted four correctly, and there were two cases where the model predicted that the individual would not take the train when in fact they did. Individuals 4 and 6 both took the train, yet the model predicted probabilities of less than 0.5 that they would do so. This table summarizes how successful the model is in predicting actual outcomes.
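For readers without SPSS, the maximum likelihood fit and the classification counts can be approximated with a short script. The sketch below uses Newton-Raphson iteration on the log-likelihood, which is one common way to maximize it; this is an assumption for illustration and not necessarily the algorithm SPSS uses. All function names are likewise introduced here.

```python
import math

# Data from Table 7.12
y = [0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0]
x = [32, 89, 50, 49, 80, 56, 40, 70, 72, 76, 32, 58]

def predict(alpha, beta, xi):
    """Logistic probability that y = 1 (Equation 7.13)."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * xi)))

def fit_logistic_ml(x, y, iterations=25):
    """Maximize the log-likelihood by Newton-Raphson, starting at alpha = beta = 0."""
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        p = [predict(alpha, beta, xi) for xi in x]
        # Gradient of the log-likelihood with respect to alpha and beta
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative Hessian), built from weights p(1 - p)
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system (information matrix) * step = gradient
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta

alpha, beta = fit_logistic_ml(x, y)
print(f"alpha = {alpha:.4f}, beta = {beta:.4f}, Exp(B) = {math.exp(beta):.4f}")

# Classification: predict y = 1 when the fitted probability exceeds 0.5.
predicted = [1 if predict(alpha, beta, xi) > 0.5 else 0 for xi in x]
correct_0 = sum(1 for yi, pi in zip(y, predicted) if yi == 0 and pi == 0)
correct_1 = sum(1 for yi, pi in zip(y, predicted) if yi == 1 and pi == 1)
print(f"correct among y = 0: {correct_0} of {y.count(0)}")
print(f"correct among y = 1: {correct_1} of {y.count(1)}")
```

The estimates should agree closely with the SPSS values reported above, and the two counts correspond to the diagonal of the two-by-two classification table.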

Exercises

1. The following data are collected in a study of park attendance:

Park visit? (1 = yes; 0 = no): 0 0 1 0 1 0 0 1 1 1 1 1 1 1 0 0 0 0 1 1 0 0
Distance from park (km): 8 6 1 4 3 2 6 5 7 2 1 3 5 7 8 9 8 6 4 4 7 9

Use logistic regression to determine how the likelihood of visiting the park varies with the distance that an individual resides from the park.

2. In American football, the likelihood of a successful field goal declines with increasing distance. The following data were collected one week from games played by teams in the National Football League.

Made? (1 = yes; 0 = no): 0 1 0 1 0 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1
Yards: 34 20 51 32 51 29 19 37 43 47 24 31 41 22 26 34 41 24 39 43

(a) Use logistic regression to determine how the odds of making a field goal change as distance increases.
(b) Use the results to draw a graph depicting how the predicted probability of making a field goal changes with distance.
(c) In the waning seconds of Super Bowl XXV, Scott Norwood missed a 47-yard field goal that would have carried the Buffalo Bills to victory over the New York Giants. Use your model to predict the likelihood that a kicker is successful at a 47-yard field goal attempt. Did Norwood really deserve the criticism he received for missing the attempt?

3. What is multicollinearity? How can it be detected? Why is it a potential problem in regression analysis? How might its effects be ameliorated?

4. The number of times per year a person uses rapid transit is a linear function of income:

$$Y = 1.2 + 2.4X_1 + 8.4X_2 + 15.6X_3$$

where $X_1$, $X_2$, and $X_3$ are dummy variables for medium, high, and very high incomes, respectively (the low income category has been omitted). What is the predicted number of transit trips per year for each of the four income categories?

5. Given the following data,

| Y | X |
|---|---|
| 0 | 8 |
| 1 | 6 |
| 0 | 9 |
| 1 | 4 |
| 1 | 3 |

is $\beta$ (the "slope" of the logistic curve) positive or negative? How do you know?

6. Suppose, for a given set of data, we find that a logistic regression yields $\beta = 0.43$. What is the change in odds for a unit change in $x$?

7. The following results were obtained from a regression of $n = 14$ housing prices (in dollars) on median family income, size of house, and size of lot:

| | Sum of squares | df | Mean square | F |
|---|---|---|---|---|
| Regression SS: | 4234 | 3 | ___ | ___ |
| Residual SS: | 3487 | ___ | ___ | |
| Total SS: | ___ | ___ | | |

| | Coefficient (b) | Standard error ($s_b$) | VIF |
|---|---|---|---|
| Median family income | 1.57 | 0.34 | 1.3 |
| Size of house (sq. ft.) | 23.4 | 11.2 | 2.9 |
| Size of lot (sq. ft.) | 9.5 | 7.1 | 11.3 |
| Constant | 40000 | 1000 | |

(a) Fill in the blanks.
(b) What is the value of $R^2$?
(c) What is the standard error of the estimate?
(d) Test the null hypothesis that $R^2 = 0$ by comparing the F-statistic from the table with its critical value.
(e) Are the coefficients in the direction you would hypothesize? If not, which coefficients are opposite in sign from what you would expect?
(f) Find the t-statistics associated with each coefficient, and test the null hypotheses that the coefficients are equal to zero. Use $\alpha = 0.05$, and be sure to give the critical value of t.
(g) What do you conclude from the variance inflation factors (VIFs)? What modifications would you recommend in light of the VIFs?
(h) What is the predicted sales price of a house that is 1500 square feet, on a lot 60 ft × 100 ft, and in a neighborhood where the median family income is $40 000?

8. Choose a dependent variable and two or three independent variables. The variables chosen should be defined spatially. There should be at least 15 to 20, and preferably about 30 observations.

(a) State any null hypotheses you may have, as well as the alternative hypotheses.
(b) Graph the dependent variable (y) vs each independent (x) variable. Describe any obvious outliers.
(c) Graph the independent variables against each other, and comment on any obvious multicollinearity.


(d) Regress y on each of the independent variables separately. Also regress y on the combined set of all independent variables. If you have three independent variables, you may wish to regress y on pairs of independent variables. Comment on the results.
(e) For the regression including only the single most significant independent variable, (i) find and graph the 95% confidence interval for the regression line; (ii) find and graph the 95% confidence interval for predictions of the y values.

9. Use the data in Table 7.7 to study how the number of native species on islands varies with the size of the island, the maximal elevation of the island, and the distance to nearby islands. There are many choices you will need to make; there is no single "correct" answer to this question. Some considerations you should think about include:

(a) What should be done about the missing elevation values that occur in some cases?
(b) Are there outliers? If so, how can they be identified?
(c) What about multicollinearity? Should variable(s) be eliminated from the analysis?

One goal you should have is to come up with a "best" equation, in the sense that variables in the equation are both significant and meaningful.

8 Spatial Patterns

LEARNING OBJECTIVES

• Finding geographic patterns in point and areal data
• Introduction to local statistics
• Application of Monte Carlo simulation tests to the statistical analysis of geographic clustering

8.1 Introduction

One assumption of regression analysis as applied to spatial data is that the residuals are not spatially autocorrelated; that is, there is no spatial pattern to the errors. Residuals that are not independent can affect estimates of the variances of the coefficients and hence make it difficult to judge their significance. We have also seen in previous chapters that lack of independence among observations can affect the outcome of t-tests, ANOVA, and correlation, often leading one to find significant results where none in fact exist.

In addition to a desire to remove the complicating effects of spatially dependent observations, spatial analysts also seek to learn whether geographic phenomena cluster in space. Here they have a direct interest in the phenomenon and/or process itself, not an indirect one; the latter is the case when one wishes to correct a statistical analysis based upon spatial data. For example, crime analysts wish to know if clusters of criminal activity exist. Health officials seek to learn about disease clusters and their determinants.

In this chapter we will investigate statistical methods aimed at detecting spatial patterns and assessing their significance. The structure of the chapter follows from the fact that data are typically in the form of either point locations (where exact locations of, e.g., disease or crime are available) or in the form of aggregated areal information (where, e.g., information is available only on regional rates).

8.2 The Analysis of Point Patterns

Carry out the following experiment: Draw a rectangle that is six inches by five inches on a sheet of paper. Locate 30 dots at random within the rectangle. This means that each dot should be located independently of the other dots. Also, for each point you locate, every subregion of a given size should have an equal likelihood of receiving the dot. Then draw a six-by-five grid of 30 square cells on top of your rectangle. You can do this by making little tick marks at one-inch intervals along the sides of your rectangle. Connecting the tick marks will divide your original rectangle into 30 squares, each having a side of length one inch.

Give your results a score, as follows. Each cell containing no dots receives 1 point. Each cell containing one dot receives 0 points. Each cell containing two dots receives 1 point. Cells containing three dots receive 4 points, cells containing four dots receive 9 points, cells containing 5 dots receive 16 points, cells containing 6 dots receive 25 points, and cells containing 7 dots receive 36 points. Find your total score by adding up the points you have received in all thirty cells.

DO NOT READ ON UNTIL YOU HAVE COMPLETED THE INSTRUCTIONS ABOVE!

Classify your pattern as follows: If your score is 16 or less, your pattern is significantly more uniform or regular than random. If your score is between 17 and 45, your pattern is characterized as random. If your score is greater than 45, your pattern exhibits significant clustering. On average, a set of 30 randomly placed points will receive a score of 29. 95% of the time, a set of randomly placed points will receive a score between 17 and 45.

The majority of people who try this experiment produce patterns that are more uniform or regular than random, and hence their scores are less than 29. Their point patterns are more spread out than a truly random pattern. When individuals see an empty space on their diagram, there is an almost overwhelming urge to fill it in by placing a dot there! Consequently, the locations of dots placed on a map by individuals are not independent of the locations of previous dots, and hence an assumption of randomness is violated.
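The scoring rule in the experiment amounts to giving a cell with k dots a score of (k − 1)². A minimal sketch of the scoring and classification follows; the function names and the example tally are hypothetical, and the cutoffs of 17 and 45 are the ones quoted in the text for 30 dots on a 6 × 5 grid.

```python
def quadrat_score(counts):
    """Total score: a cell holding k dots earns (k - 1)**2 points."""
    return sum((k - 1) ** 2 for k in counts)

def classify(score, lower=17, upper=45):
    """Apply the cutoffs quoted in the text (30 dots, 30 cells)."""
    if score < lower:
        return "more uniform than random"
    if score > upper:
        return "clustered"
    return "random"

# Hypothetical tally: 10 empty cells, 12 cells with one dot,
# 6 cells with two dots, and 2 cells with three dots (30 dots in 30 cells).
example_counts = [0] * 10 + [1] * 12 + [2] * 6 + [3] * 2
example_score = quadrat_score(example_counts)
print(example_score, classify(example_score))
```

A perfectly even pattern with one dot in every cell scores 0, the most uniform result possible.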
Consider next Figures 8.1 and 8.2, and suppose you are a crime analyst looking at the spatial distribution of recent crimes. Make a photocopy of the page, and indicate in pencil where you think the clusters of crime are. Do this by simply encircling the clusters (you may define more than one on each diagram).

DO NOT READ THE NEXT PARAGRAPH UNTIL YOU HAVE COMPLETED THIS EXERCISE!

Figure 8.1 Spatial pattern of crime

Figure 8.2 Spatial pattern of crime

How many clusters did you find? It turns out that both diagrams were generated by locating points at random within the square! In addition to having trouble drawing random patterns, individuals also have a tendency to "see" clusters where none exist. This results from the mind's strong desire to organize spatial information.

Both of these exercises point to the need for objective, quantitative measures of spatial pattern; it is simply not sufficient to rely on one's visual interpretation of a map. Crime analysts cannot necessarily pick out true clusters of crime just by looking at a map, nor can health officials always pick out significant clusters of disease from map inspection.

8.2.1 Quadrat Analysis

The experiment involving the scoring of points falling within the 6 × 5 rectangle is an example of quadrat analysis, developed primarily by ecologists in the first half of the twentieth century. In quadrat analysis, a grid of square cells of equal size is used as an overlay on top of a map of incidents. One then counts the number of incidents in each cell. In a random pattern, the mean number of points per cell will be roughly equal to the variance of the number of points per cell. If there is a large amount of variability in the number of points from cell to cell (some cells have many points, some have none, etc.), this implies a tendency toward clustering. If there is very little variability in the number of points from cell to cell, this implies a tendency toward a systematic pattern (where the number of points per cell would be the same). The statistical test makes use of a chi-square statistic involving the variance-mean ratio:

$$\chi^2 = \frac{(m - 1)s^2}{\bar{x}} \tag{8.1}$$

where $m$ is the number of quadrats, and $\bar{x}$ and $s^2$ are the mean and variance of the number of points per quadrat, respectively. This value is then compared with a critical value from a chi-square table with $m - 1$ degrees of freedom.

Quadrat analysis is easy to employ, and it has been a mainstay in the spatial analyst's toolkit of pattern detectors over several decades. One important issue is the size of the quadrat; if the cell size is too small, there will be many empty cells, and if clustering exists on all but the smallest spatial scales it will be missed. If the cell size is too large, one may miss patterns that occur within cells. One may find patterns on some spatial scales and not at others, and thus the choice of quadrat size can seriously influence the results. Curtis and McIntosh (1950) suggest an "optimal" quadrat size of two points per quadrat. Bailey and Gatrell (1995) suggest that the mean number of points per quadrat should be about 1.6.

Summary of the quadrat method

(1) Divide a study region into $m$ cells of equal size.
(2) Find the mean number of points per cell ($\bar{x}$). This is equal to the total number of points divided by the number of cells ($m$).
(3) Find the variance of the number of points per cell, $s^2$, as follows:

$$s^2 = \frac{\sum_{i=1}^{m} (x_i - \bar{x})^2}{m - 1} \tag{8.2}$$

where $x_i$ is the number of points in cell $i$.
(4) Calculate the variance-mean ratio (VMR):

$$\text{VMR} = \frac{s^2}{\bar{x}} \tag{8.3}$$

(5) Interpret the results as follows. If $s^2/\bar{x} < 1$, the variance of the number of points is less than the mean. In the extreme case where the ratio approaches zero, there is very little variation in the number of points from cell to cell. This characterizes situations where the distribution of points is spread out, or uniform, across the study area. If $s^2/\bar{x} > 1$, there is a good deal of variation in the number of points per cell: some cells have substantially more points than expected (i.e., $x_i > \bar{x}$ for some cells $i$), and some cells have substantially fewer than expected (i.e., $x_i < \bar{x}$). This characterizes situations where the point pattern is more clustered than random. A value of $s^2/\bar{x}$ near one indicates that the points are close to being randomly distributed across the study area.

Hypothesis Testing. How can we be more precise in testing the null hypothesis that there is no spatial pattern? Suppose we were to simulate the null hypothesis by placing points at random in a study area, and that we then carried out the procedure described above for finding the variance-mean ratio. Furthermore, suppose we were to repeat this many times (say 1000), and then draw a histogram of the results. We would find that the mean of our 1000 VMR values would be near one, and that the histogram would be asymmetric, displaying a positive skew (see Figure 8.3). Values of VMR in the tails of the histogram (also known as the sampling distribution of VMR) indicate values that are relatively rare when the underlying null hypothesis of no pattern is true. For an actual set of observed data, we decide to accept the null hypothesis that the points are randomly distributed in space if the VMR for the observed data does not differ too much from one; otherwise, we reject the null hypothesis. More specifically, if the VMR for an observed pattern is greater than $\text{VMR}_H$ (shown in Figure 8.3), the null hypothesis is rejected, and the pattern is taken to be more clustered than random. Similarly, if the observed VMR is less than $\text{VMR}_L$, the null hypothesis is rejected, and the pattern is taken to be more uniform than random.

Figure 8.3 Sampling distribution of VMR when H0 is true
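The simulation experiment described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names (`vmr`, `simulate_vmr`), the choice of 100 points on 100 equal-sized cells, and the random seed are ours, and placing a point at random in the study area is represented by choosing one of the equal-sized cells with equal probability. The 25th and 975th ordered values are taken as the lower and upper cutoffs for a test with a 5% chance of a Type I error.

```python
import random
import statistics


def vmr(counts):
    """Variance-mean ratio of quadrat counts (sample variance, divisor m - 1)."""
    return statistics.variance(counts) / statistics.mean(counts)


def simulate_vmr(n_points, n_cells, n_sims, seed=0):
    """Drop n_points at random into n_cells equal-sized cells, n_sims times.

    Returns the simulated VMR values sorted from lowest to highest.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_sims):
        counts = [0] * n_cells
        for _ in range(n_points):
            counts[rng.randrange(n_cells)] += 1  # every cell is equally likely
        values.append(vmr(counts))
    return sorted(values)


values = simulate_vmr(n_points=100, n_cells=100, n_sims=1000)
mean_vmr = statistics.mean(values)
vmr_low = values[24]    # 25th ordered value: lower critical value
vmr_high = values[974]  # 975th ordered value: upper critical value
print(f"mean VMR = {mean_vmr:.3f}, VMR_L = {vmr_low:.3f}, VMR_H = {vmr_high:.3f}")
```

The exact cutoffs change slightly with the seed and the number of simulations, which is one reason the chi-square and normal approximations are convenient in practice.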


If we were to actually observe an extreme value of VMR in our data (either greater than $\text{VMR}_H$ or less than $\text{VMR}_L$), we would reject the null hypothesis that the pattern is random. In this case, either (a) the null hypothesis is actually true (in which case we have incorrectly rejected it, and committed a Type I error), or (b) the null hypothesis is not true, and we have made a correct decision. To establish the critical cutoff values, $\text{VMR}_L$ and $\text{VMR}_H$, we first have to decide how great a likelihood of a Type I error we are willing to tolerate. If we use $\alpha = 0.05$, then the 50 most extreme values out of the total of 1000 in our experiment are used to obtain the critical values (since 50/1000 = 0.05). If we rank the 1000 VMR values from lowest to highest, the 25th VMR on our list would be chosen as $\text{VMR}_L$; 25 out of 1000 times we will observe a lower VMR than this when $H_0$ is true. Similarly, the 975th VMR on our ordered list would be chosen as $\text{VMR}_H$; 25 out of 1000 times we can expect to observe a VMR higher than this when $H_0$ is true. Thus 50 out of 1000 times, or 5% of the time, we will incorrectly reject a true hypothesis when we use these critical values. In those 50 instances, we would make a Type I error, since we would reject $H_0$ when in fact it was true, and we had simply observed an unusual value of VMR in the tail of the sampling distribution.

Figure 8.4 A spatial point pattern

Example. We wish to know whether the pattern observed in Figure 8.4 is consistent with the null hypothesis that the points were placed at random. We first calculate the VMR. There are 100 points on the 10 × 10 grid, implying a mean of one point per cell. There are 6 cells with 3 points, 20 cells with 2 points, 42 cells with one point, and 32 cells with no points. The variance is

$$\left\{ 6(3-1)^2 + 20(2-1)^2 + 42(1-1)^2 + 32(0-1)^2 \right\} / 99 = 76/99 = 0.77 \tag{8.4}$$


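and, since the mean is equal to one, this is also our observed VMR. Since VMR < 1, there is a tendency toward a uniform pattern. To assess significance, we use the fact that, when $H_0$ is true, the quantity $\chi^2 = (m-1)\text{VMR}$ has a chi-square distribution with $m-1$ degrees of freedom, where $m$ is the number of cells. The null hypothesis is rejected if $\chi^2 < \chi^2_L$ or $\chi^2 > \chi^2_H$. In our example, $\chi^2 = 99(0.77) = 76.2$, which lies between the critical values for 99 degrees of freedom (approximately 73.4 and 128.4 for $\alpha = 0.05$), so the null hypothesis is not rejected.

When the number of degrees of freedom (df) is large, the sampling distribution of $\chi^2 = (m-1)\text{VMR}$ begins to approach the shape of a normal distribution. In particular, when df > 30 or so, $(m-1)\text{VMR}$ will, when $H_0$ is true, have approximately a normal distribution with mean $m-1$ and variance equal to $2(m-1)$. This means that we can treat the quantity

$$z = \frac{(m-1)\text{VMR} - (m-1)}{\sqrt{2(m-1)}} = \sqrt{(m-1)/2}\,(\text{VMR} - 1) \tag{8.5}$$

as a normal random variable with mean 0 and variance 1. With $\alpha = 0.05$, the critical values are $z_L = -1.96$ and $z_H = +1.96$. The null hypothesis of no pattern is rejected if $z < z_L$ (implying uniformity) or $z > z_H$ (implying clustering). In our example, we have

$$z = \frac{99(0.77) - 99}{\sqrt{2(99)}} = \sqrt{99/2}\,(0.77 - 1) = -1.618 \tag{8.6}$$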

This also falls within the critical values of $z$, and hence we do not have strong enough evidence to reject the null hypothesis.

If cells of a different size had been used, the results, and possibly the conclusions, would have been different. By aggregating the cells in Figure 8.4 to a 5 × 5 grid of 25 cells, the VMR declines to 0.6875 (based on a variance of $1.658^2 = 2.75$ and a mean of 4 points per cell). The $\chi^2$ value is 24(0.6875) = 16.5. Since the number of degrees of freedom is less than 30, we will use the chi-square table (Table A.5) to assess significance. With 24 degrees of freedom, using interpolation to find the critical values at $p = 0.025$ and $p = 0.975$ yields $\chi^2_L = 12.0$ and $\chi^2_H = 40.5$. Since our observed value falls between these limits, we again fail to reject the hypothesis of randomness.

To summarize, after finding VMR (steps 1-4 above), calculate $\chi^2 = (m-1)\text{VMR}$, and compare it with the critical values found in a chi-square table using df = $m-1$. If $m-1$ is greater than about 30, you can use the fact that $z = \sqrt{(m-1)/2}\,(\text{VMR} - 1)$ has a normal distribution with mean 0 and variance 1, implying that, for $\alpha = 0.05$, one can compare $z$ with the critical values $z_L = -1.96$ and $z_H = +1.96$.

It is interesting to note that the quantity $\chi^2 = (m-1)\text{VMR}$ may be written as

$$\chi^2 = (m-1)\text{VMR} = \frac{(m-1)s^2}{\bar{x}} = \frac{(m-1)\sum_i (x_i - \bar{x})^2}{\bar{x}(m-1)} = \frac{\sum_i (x_i - \bar{x})^2}{\bar{x}} \tag{8.7}$$
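As a check on the arithmetic for the Figure 8.4 example, the following short Python sketch recomputes the VMR, the chi-square value of Equation 8.7, and the z-value of Equation 8.5 from the cell frequencies quoted in the example. The variable names are ours. Because the code does not round the VMR to 0.77 before continuing, it gives $z \approx -1.63$ rather than the $-1.618$ shown in Equation 8.6; the conclusion is unchanged.

```python
import math

# Frequencies from the Figure 8.4 example: {points per cell: number of cells}
freq = {3: 6, 2: 20, 1: 42, 0: 32}

m = sum(freq.values())                          # number of cells (100)
n = sum(k * c for k, c in freq.items())         # number of points (100)
mean = n / m                                    # mean points per cell
var = sum(c * (k - mean) ** 2 for k, c in freq.items()) / (m - 1)
vmr = var / mean                                # variance-mean ratio
chi2 = (m - 1) * vmr                            # Equation 8.7
z = math.sqrt((m - 1) / 2) * (vmr - 1)          # Equation 8.5

print(f"VMR = {vmr:.4f}, chi-square = {chi2:.1f}, z = {z:.3f}")
```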

The quantity $\sum_i (x_i - \bar{x})^2 / \bar{x}$ is the sum across cells of the squared deviations of the observed numbers from the expected numbers of points in a cell, divided by the expected number of points in a cell. This is commonly known as the chi-square goodness-of-fit test.

8.2.2 Nearest Neighbor Analysis

Clark and Evans (1954) developed nearest neighbor analysis to analyze the spatial distribution of plant species. They developed a method for comparing the observed average distance between points and their nearest neighbors with the distance that would be expected between nearest neighbors in a random pattern. The nearest neighbor statistic, $R$, is defined as the ratio between the observed and expected values:

$$R = \frac{R_0}{R_e} = \frac{\bar{x}}{1/(2\sqrt{\lambda})} \tag{8.8}$$

where $\bar{x}$ is the mean of the distances of points from their nearest neighbors and $\lambda$ is the number of points per unit area. $R$ varies from 0 (a value obtained when all points are in one location, and the distance from each point to its nearest neighbor is zero) to a theoretical maximum of about 2.14 (for a perfectly uniform or systematic pattern of points spread out on an infinitely large two-dimensional plane). A value of $R = 1$ indicates a random pattern, since the observed mean distance between neighbors is equal to that expected in a random pattern. It is also known that if we examined many random patterns, we would find that the variance of the mean distances between nearest neighbors is

$$V[R_e] = \frac{4 - \pi}{4\pi\lambda n} \tag{8.9}$$

where $n$ is the number of points. Thus we can form a z-test to test the null hypothesis that the pattern is random:

$$z = \frac{R_0 - R_e}{\sqrt{V[R_e]}} = \frac{R_0 - R_e}{\sqrt{(4-\pi)/(4\pi\lambda n)}} = 3.826\,(R_0 - R_e)\sqrt{\lambda n} \tag{8.10}$$

The quantity $z$ has a normal distribution with mean 0 and variance 1, and hence tables of the standard normal distribution may be used to assess significance. A value of $z > 1.96$ implies that the pattern has significant uniformity, and a value of $z < -1.96$ implies that there is a significant tendency toward clustering. The strength of this approach lies in its ease of calculation and comprehension.

Several cautions should be noted in the interpretation of the nearest neighbor statistic. The statistic, and its associated test of significance, may be affected by the shape of the region. Long, narrow, rectangular shapes may have relatively low values of $R$ simply because of the constraints imposed by the region's shape. Points in long, narrow rectangles are necessarily close to one another. Boundaries can also make a difference in the analysis. One solution to the boundary problem is to place a buffer area around the study area. Points inside the study area may have nearest neighbors that fall within the buffer area, and these distances (rather than distances to those points that are nearest within the study area) should be used in the analysis.

Another potential difficulty with the statistic is that, since only nearest neighbor distances are used, clustering is only detected on a relatively small spatial scale. To overcome this it is possible to extend the approach to second- and higher-order nearest neighbors.

It is often of interest to ask not only whether clustering exists, but whether clustering exists over and above some background factor (such as population). Nearest neighbor methods are not particularly useful in these situations because they only relate to spatial location and not to other attributes. The approaches to the study of pattern that are described in the next section do not have this limitation.

Illustration. For the point pattern in Figure 8.5, distances are given along the lines connecting the points. The mean distance between nearest neighbors is $R_0 = (1+2+3+1+3+3)/6 = 13/6 = 2.167$. The expected mean distance between nearest neighbors in a pattern of six points placed randomly in a study region with area 7 × 6 = 42 is

$$R_e = \frac{1}{2\sqrt{\lambda}} = \frac{1}{2\sqrt{6/42}} = 1.323 \tag{8.11}$$

Figure 8.5 Nearest neighbor distances

The nearest neighbor statistic is $R = 2.167/1.323 = 1.638$, which means that the pattern displays a tendency toward uniformity. To assess significance, we can calculate the z-statistic from (8.10) as $3.826\,(2.167 - 1.323)\sqrt{(6/42)(6)} = 2.99$, which is much greater than the critical value of 1.96; this implies rejection of the null hypothesis of a random pattern. However, we have neglected boundary effects, and these have a significant effect.

As an alternative way to test the null hypothesis, we can randomly choose 6 points by choosing random x-coordinates in the range (0, 7) and random y-coordinates in the range (0, 6). Then we compute the mean distance to nearest neighbor, and repeat the whole process many times. Simulating the random placement of 6 points in the 7 × 6 study region 10 000 times led to a mean distance between nearest neighbors of 1.62. This is greater than the expected distance of $R_e = 1.323$ noted above. This greater-than-expected distance can be attributed directly to the fact that points near the border of the study region are relatively farther from other points in the study region than they presumably would have been to points just outside of the study region. Ordering the 10 000 mean distances to nearest neighbors reveals that the 9500th highest one is 2.29. Only 5% of the time would we expect a mean distance greater than 2.29. Our observed distance of 2.167 is less than 2.29, and so, having accounted for boundary effects through our Monte Carlo simulation, we accept the null hypothesis.
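Both parts of this illustration can be reproduced with a short Python sketch. It is a sketch only: the helper names (`nn_z`, `mean_nn_distance`) and the random seed are our own choices, and the six nearest neighbor distances are those quoted in the illustration rather than distances measured from coordinates. The first part applies Equations 8.8 and 8.10; the second part repeats the Monte Carlo experiment, so the simulated mean and 95th percentile should land near, but not exactly on, the values of 1.62 and 2.29 reported above.

```python
import math
import random


def nn_z(r0, n, area):
    """Nearest neighbor statistic R and z-value (Equations 8.8 and 8.10)."""
    lam = n / area                       # points per unit area
    re = 1 / (2 * math.sqrt(lam))        # expected mean distance, random pattern
    z = 3.826 * (r0 - re) * math.sqrt(lam * n)
    return r0 / re, z


def mean_nn_distance(points):
    """Mean distance from each point to its nearest neighbor."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)


distances = [1, 2, 3, 1, 3, 3]           # nearest neighbor distances, Figure 8.5
r0 = sum(distances) / len(distances)
R, z = nn_z(r0, n=6, area=7 * 6)

# Monte Carlo: 6 random points in the 7 x 6 region, repeated 10 000 times.
rng = random.Random(1)
sims = sorted(
    mean_nn_distance([(rng.uniform(0, 7), rng.uniform(0, 6)) for _ in range(6)])
    for _ in range(10_000)
)
sim_mean = sum(sims) / len(sims)
upper_5pct = sims[9499]                  # 9500th ordered value

print(f"R = {R:.3f}, z = {z:.2f}")
print(f"simulated mean = {sim_mean:.2f}, 95th percentile = {upper_5pct:.2f}")
```

Because the simulated points are confined to the study region, the simulation builds the boundary effect into the reference distribution instead of correcting for it analytically.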

8.3 Geographic Patterns in Areal Data

8.3.1 An Example Using a Chi-Square Test

In a regression of housing prices on housing characteristics, suppose that we have 51 observations that have been categorized into three spatial locations (neighborhoods). How might we tell whether there is a tendency for positive or negative residuals to cluster in one or more neighborhoods? One idea is to note whether each residual is positive or negative, and then to tabulate the residuals by neighborhood (see the hypothetical data in Table 8.1). We can use a chi-square test to determine whether there is any neighborhood-specific tendency for residuals to be positive or negative. Under the null hypothesis of no pattern, the expected values are equal to the product of the row and column totals, divided by the overall total. These expected values are given in parentheses in Table 8.2. The chi-square statistic is

$$\chi^2 = \sum_{i=1}^{n} \frac{(O - E)^2}{E} \tag{8.12}$$

where $O$ is the observed frequency and $E$ is the expected frequency. In this example, the value of chi-square is 4.40 (see inset), which is less than the critical value of 5.99, using $\alpha = 0.05$ and 2 degrees of freedom (the number of degrees of freedom is equal to the number of rows minus one, times the number of columns minus one). Therefore the null hypothesis of no pattern is not rejected.

Table 8.1 Hypothetical residuals

                  Neighborhood
             1       2       3      Total
  +         10       6       7        23
  −          6      15       7        28
  Total     16      21      14        51

Table 8.2 Observed and expected frequencies of residuals

                        Neighborhood
             1             2             3          Total
  +       10 (7.22)      6 (9.47)      7 (6.31)       23
  −        6 (8.78)     15 (11.53)     7 (7.69)       28
  Total   16            21            14              51

INSET: The observed chi-square statistic for the data in Table 8.2:

$$\chi^2 = \frac{(10-7.22)^2}{7.22} + \frac{(6-9.47)^2}{9.47} + \frac{(7-6.31)^2}{6.31} + \frac{(6-8.78)^2}{8.78} + \frac{(15-11.53)^2}{11.53} + \frac{(7-7.69)^2}{7.69} = 4.40$$
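The expected frequencies and the chi-square statistic for the neighborhood table can be verified with the short Python sketch below. The variable names are ours, and the observed counts are those of Table 8.1. Because the code keeps the expected values unrounded, it returns approximately 4.41 rather than the 4.40 obtained from the two-decimal expected values in the inset.

```python
# Observed counts: rows are residual signs, columns are neighborhoods 1-3.
observed = [
    [10, 6, 7],    # positive residuals
    [6, 15, 7],    # negative residuals
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count = (row total x column total) / overall total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print([[round(e, 2) for e in row] for row in expected])
print(f"chi-square = {chi2:.2f} with {df} degrees of freedom")
```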

When spatial autocorrelation is detected, what can be done about it? One idea is to include a new, location-specific dummy variable. This will serve to capture the importance of an observation's location in a particular neighborhood. In our housing price example, we could add two variables, one for each of two of the three neighborhoods (following the usual practice of omitting one category). You should also note that if there are $k$ neighborhoods, it is not necessary to have $k-1$ dummy variables; rather, you might choose to have only one or two dummy variables for those neighborhoods having large deviations between the observed and predicted values.

8.3.2 The Join-Count Statistic

Whether areas of positive or negative residuals cluster on the map can be determined by first asking how many total ``joins'' there are (i.e., the total number of cases where two subareas share a common boundary). Each join is then classified as a ``++'', ``−−'', or ``+−'', depending upon the signs of the residuals in the two areas. For example, the five-zone system in Figure 8.6 has three zones with negative residuals and two with positive residuals. There is a total of seven joins (i.e., pairs of regions that share a common boundary). One of these joins is a ``++'' join, one is a ``−−'' join, and the remaining five joins are ``+−'' joins.

Figure 8.6 Positive and negative residuals in a five-region system


The join count statistic compares the observed number of ``+−'' joins with the number of ``+−'' joins that would be expected if no spatial autocorrelation were present. The expected number of ``+−'' joins is

$$E = \frac{2JPM}{N(N-1)} \tag{8.13}$$

where $J$ is the total number of joins, $P$ is the number of positive residuals, $M$ is the number of negative (``minus'') residuals, and $N$ is the total number of areas ($N = P + M$). For the system in Figure 8.6,

$$E = \frac{2(7)(2)(3)}{5(4)} = 4.2 \tag{8.14}$$

The variance of the number of ``+−'' joins is equal to the complex expression

$$V = E - E^2 + \frac{\sum_i L_i(L_i-1)\,PM}{N(N-1)} + \frac{4\left[J(J-1) - \sum_i L_i(L_i-1)\right]P(P-1)M(M-1)}{N(N-1)(N-2)(N-3)} \tag{8.15}$$

where $L_i$ is the number of links (joins) from region $i$ to other regions. For Figure 8.6, the values of $L_i$ are 2, 3, 4, 2, and 3, for $i$ = 1, 2, 3, 4, and 5, respectively. Using Equation 8.15, for the zonal system in Figure 8.6 we have

$$V = 4.2 - 4.2^2 + \frac{28(2)(3)}{5(4)} + \frac{4[7(6) - 28](2)(1)(3)(2)}{5(4)(3)(2)} = 0.56 \tag{8.16}$$

The z-statistic

$$z = \frac{\text{Obs. ``+−''} - E}{\sqrt{V}} \tag{8.17}$$

has a normal distribution with mean zero and variance one. Thus tables of the normal distribution can be used to test the null hypothesis that the spatial pattern is random. For our example, we have slightly more ``+−'' joins than expected. Clearly there is no clustering of positive or negative residuals on the map, but might the ``checkerboard'' pattern characterized by ``+''s being next to ``−''s be significant? The z-statistic is

$$z = \frac{5 - 4.2}{\sqrt{0.56}} = 1.07 \tag{8.18}$$

which is less than the critical value, indicating that the null hypothesis of randomly placed residuals cannot be rejected.
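Equations 8.13, 8.15, and 8.17 are easy to mistype by hand, so the following Python sketch may be useful as a check. The function name `join_count_z` and its arguments are hypothetical; the inputs are the values quoted for Figure 8.6 (five observed ``+−'' joins, link counts of 2, 3, 4, 2, and 3, two positive and three negative residuals). The total number of joins is recovered as half the sum of the link counts, since each join is counted once from each of its two regions.

```python
import math


def join_count_z(observed_pm, links, p, m):
    """Expected value, variance, and z for the number of '+-' joins."""
    n = p + m                              # total number of areas
    j = sum(links) // 2                    # each join appears in two link counts
    expected = 2 * j * p * m / (n * (n - 1))                    # Equation 8.13
    sum_ll = sum(l * (l - 1) for l in links)
    variance = (                                                # Equation 8.15
        expected
        - expected ** 2
        + sum_ll * p * m / (n * (n - 1))
        + 4 * (j * (j - 1) - sum_ll) * p * (p - 1) * m * (m - 1)
        / (n * (n - 1) * (n - 2) * (n - 3))
    )
    z = (observed_pm - expected) / math.sqrt(variance)          # Equation 8.17
    return expected, variance, z


expected, variance, z = join_count_z(observed_pm=5, links=[2, 3, 4, 2, 3], p=2, m=3)
print(f"E = {expected:.1f}, V = {variance:.2f}, z = {z:.2f}")
```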


8.3.3 Moran's I

The join count statistic, if used to evaluate the presence or absence of spatial autocorrelation, has the drawback of not using all of the available information; that is, it makes use only of the signs of the residuals and not their magnitude. Moran's I statistic is an alternative measure of spatial autocorrelation. Moran's I statistic (1948, 1950) is one of the classic (as well as one of the most common) ways of measuring the degree of spatial autocorrelation in areal data. Moran's I is calculated as follows:

$$I = \frac{n \sum_i^n \sum_j^n w_{ij}(y_i - \bar{y})(y_j - \bar{y})}{\left(\sum_i^n \sum_j^n w_{ij}\right) \sum_i^n (y_i - \bar{y})^2} \tag{8.19}$$

where there are $n$ regions and $w_{ij}$ is a measure of the spatial proximity between regions $i$ and $j$. It is interpreted much like a correlation coefficient. Values near +1 indicate a strong spatial pattern (high values tend to be located near one another, and low values tend to be located near one another). Values near −1 indicate strong negative spatial autocorrelation; high values tend to be located near low values. (Spatial patterns with negative autocorrelation are either extremely rare or nonexistent!) Finally, values near 0 indicate an absence of spatial pattern.

Though the formula is perhaps daunting at first glance, it is helpful to realize that if the variable of interest is first transformed into a z-score $\{z = (x - \bar{x})/s\}$, a much simpler expression for $I$ results:

$$I = \frac{n \sum_i \sum_j w_{ij} z_i z_j}{(n-1) \sum_i \sum_j w_{ij}} \tag{8.20}$$

The conceptually important part of the formula is the numerator, which sums the products of z-scores in nearby regions. Pairs of regions where both regions exhibit above-average scores (or below-average scores) will contribute positive terms to the numerator, and these pairs will therefore contribute toward positive spatial autocorrelation. Pairs where one region is above average and the other is below average will contribute negatively to the numerator, and hence to negative spatial autocorrelation.

The weights $\{w_{ij}\}$ can be defined in a number of ways. Perhaps the most common definition is one of binary connectivity: $w_{ij} = 1$ if regions $i$ and $j$ are contiguous, and $w_{ij} = 0$ otherwise. Sometimes the $w_{ij}$ defined in this way are standardized to define new $w^*_{ij}$ by dividing by the number of regions $i$ is connected to; i.e., $w^*_{ij} = w_{ij} / \sum_j w_{ij}$. In this case all regions $i$ are characterized by a set of weights linking $i$ to other regions that sum to one; i.e., $\sum_j w^*_{ij} = 1$. Alternatively, $\{w_{ij}\}$ may be defined as a function of the distance between $i$ and $j$ (e.g., $w_{ij} = d_{ij}^{-\beta}$ or $w_{ij} = \exp(-\beta d_{ij})$), where the distance between $i$ and $j$ could be measured along the line connecting the centroids of the two regions. It is conventional to use $w_{ii} = 0$. It is also common, though not necessary, to use symmetric weights, so that $w_{ij} = w_{ji}$.

It is important to recognize that the value of $I$ is very dependent upon the definition of the $\{w_{ij}\}$. Using a simple binary connectivity definition for the map in Figure 8.6 gives us

$$W = \{w_{ij}\} = \begin{pmatrix} 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 \\ 1 & 1 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 & 0 \end{pmatrix}$$

In this instance, the definition of $\{w_{ij}\}$ causes the neighborhood around region 1 to be much smaller than the neighborhood around region 2 or 3. This is not necessarily ``wrong'', but suppose that we were interested in the spatial autocorrelation of a disease that was characterized by rates that were strongly associated over small spatial scales but not correlated over large spatial scales. If we expect disease rates in regions 1 and 2 to be highly correlated while we expect those in regions 4 and 5 to be uncorrelated due to their large spatial separation, our observed value of $I$ will be a combined measure of strong association between close pairs and weak association between distant pairs. For this example, it might be more appropriate to use a distance-based definition of $\{w_{ij}\}$.

Illustration. Consider the six-region system in Figure 8.7. Using a binary connectivity definition of the weights leads to:

$$W = \begin{pmatrix} 0 & 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & 1 & 0 \\ 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 0 \end{pmatrix} \tag{8.21}$$

where an entry in row $i$ and column $j$ is denoted by $w_{ij}$. The double summation in the numerator of $I$ (see Equation 8.19) is found by taking the product of the deviations from the mean for all pairs of adjacent regions:

$$\begin{aligned}
&(32-21)(26-21) + (32-21)(19-21) \\
&+ (26-21)(32-21) + (26-21)(19-21) + (26-21)(18-21) + (26-21)(17-21) \\
&+ (19-21)(32-21) + (19-21)(26-21) + (19-21)(17-21) + (19-21)(14-21) \\
&+ (18-21)(26-21) + (18-21)(17-21) \\
&+ (17-21)(26-21) + (17-21)(19-21) + (17-21)(18-21) + (17-21)(14-21) \\
&+ (14-21)(19-21) + (14-21)(17-21) = 100
\end{aligned} \tag{8.22}$$

Figure 8.7 Hypothetical six-region system (regional values: region 1 = 32, region 2 = 26, region 3 = 19, region 4 = 18, region 5 = 17, region 6 = 14)

Since the sum of the weights in (8.21) is 18, and since the variance of the regional values is 224/5, Moran's I is equal to

$$I = \frac{6(100)}{18(224)} = 0.1488 \tag{8.23}$$
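The calculation in Equations 8.19 to 8.23 can be checked with the following Python sketch. The function name `morans_i` is ours; the weight matrix is the binary connectivity matrix of Equation 8.21, and the regional values (32, 26, 19, 18, 17, 14 for regions 1 to 6) are those implied by the terms of Equation 8.22.

```python
def morans_i(y, w):
    """Moran's I (Equation 8.19) for values y and spatial weights matrix w."""
    n = len(y)
    ybar = sum(y) / n
    dev = [v - ybar for v in y]                     # deviations from the mean
    # Double sum of weighted cross-products of deviations (numerator).
    cross = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    s0 = sum(sum(row) for row in w)                 # sum of all weights
    return n * cross / (s0 * sum(d * d for d in dev))


y = [32, 26, 19, 18, 17, 14]
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 1, 0],
    [1, 1, 0, 0, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 1],
    [0, 0, 1, 0, 1, 0],
]

I = morans_i(y, W)
print(f"Moran's I = {I:.4f}")
```

Replacing `W` with a row-standardized or distance-based matrix shows directly how sensitive the value of I is to the definition of the weights.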

In addition to this descriptive interpretation, there is a statistical framework that allows one to decide whether any given pattern deviates significantly from a random pattern. If the number of regions is large, the sampling distribution of $I$, under the hypothesis of no spatial pattern, approaches a normal distribution, and the mean and variance of $I$ can be used to create a Z-statistic in the usual way:

$$Z = \frac{I - E[I]}{\sqrt{V[I]}} \tag{8.24}$$

The value is then compared with the critical value found in the normal table (e.g., $\alpha = 0.05$ would imply critical values of −1.96 and +1.96). The mean and variance are equal to

$$E[I] = \frac{-1}{n-1}, \qquad V[I] = \frac{n^2(n-1)S_1 - n(n-1)S_2 + 2(n-2)S_0^2}{(n+1)(n-1)^2 S_0^2} \tag{8.25}$$

where

$$S_0 = \sum_i^n \sum_{j \ne i}^n w_{ij}, \qquad S_1 = 0.5 \sum_i^n \sum_{j \ne i}^n (w_{ij} + w_{ji})^2, \qquad S_2 = \sum_k^n \left( \sum_j^n w_{kj} + \sum_i^n w_{ik} \right)^2 \tag{8.26}$$

Computation is not complicated, but it is tedious enough for one to not want to do it by hand! Unfortunately, few software packages that calculate the coefficient and its significance are available. Exceptions include Anselin's (1992) SpaceStat and the CrimeStat package downloadable from www.icpsr.umich.edu/NACJD/crimestat.html

Fortunately, there are also simplifications and approximations that facilitate the use of Moran's I. An alternative way of finding Moran's I is to simply take the ratio of two regression slope coefficients (see Griffith 1996). The numerator of $I$ is equal to the regression slope obtained when the quantity $a_i = \sum_{j=1}^n w_{ij} z_j$ is regressed on $z_i$, and the denominator of $I$ is equal to the regression slope obtained when the quantity $b_i = \sum_{j=1}^n w_{ij}$ is regressed on $c_i = 1$. The $z$s represent the z-scores of the original variables, and the slope coefficients are found using no-intercept regression (i.e., constraining the result of the regression so that the intercept is equal to zero).

In addition, Griffith gives $2 / \sum_i \sum_j w_{ij}$ as an approximation for the variance of the Moran coefficient. This expression, though it works best only when the number of regions is sufficiently large (about 20 or more), is clearly easier to compute than the alternative given in Equations 8.25 and 8.26! If observational units are on a square grid and connectivity is indicated by the four adjacent cells, the variance may be approximated by $1/(2n)$, where $n$ is the number of cells. Based on either a grid of hexagonal cells or a map displaying ``average'' connectivity with other regions, the variance may be approximated by $1/(3n)$. An example is given in Section 8.5.

The use of the normal distribution to test the null hypothesis of randomness relies upon one of two assumptions:

(1) Normality. It can be assumed that regional values are generated from identically distributed normal random variables (i.e., the variables in each region arise from normal distributions that have the same mean and same variance in each region).

(2) Randomization. It can be assumed that all possible permutations (i.e., regional rearrangements) of the regional values are equally likely.

The formulae given above (Equations 8.25 and 8.26) for the variance assume that the normality assumption holds. The variance formula for the randomization assumption is algebraically more complex, and gives values that are only slightly different from those given in Equations 8.25 and 8.26 (see, e.g., Griffith 1987).

If either of the two assumptions above holds, the sampling distribution of $I$ will have a normal distribution if the null hypothesis of no pattern is true. One of the two assumptions must hold to generate the sampling distribution of $I$ so that critical values of the test statistic may be established. For example, if the first assumption were used to generate regional values, $I$ could be computed; this could then be repeated many times, and a histogram of the results could be produced. The histogram would have the shape of a normal distribution, a mean of $E[I]$, and a variance of $V[I]$. Similarly, the observed regional values on a map could be randomly rearranged many times, and the value of $I$ computed each time. Again, a histogram could be produced; it would again have the shape of a normal distribution with mean $E[I]$ and a variance slightly different from $V[I]$. If we can rely on one of these two assumptions we do not need to perform these experiments to generate histograms, since we know beforehand that they will produce normal distributions with known mean and variance.

Unfortunately, there are many circumstances in geographical applications that lead the analyst to question the validity of either assumption. For example, maps of counties by township are often characterized by high population densities in the townships corresponding to or adjacent to the central city, and by low population densities in outlying townships. Rates of crime or disease, though they may have equal means across townships, are unlikely to have equal variances. This is because the outlying towns are characterized by greater uncertainty: they are more likely to experience atypically high or low rates simply because of the chance fluctuations associated with a relatively smaller population base. Thus assumption 1 is not satisfied, since all regional values do not come from identical distributions; some regional values, namely those of the outlying regions, are characterized by higher variances. Likewise, not all permutations of regional values are equally likely; permutations with atypically high or low values out in the periphery are more likely than permutations with atypically high or low values near the center.

How can we test the null hypothesis of no spatial pattern in this instance? One approach is to use Monte Carlo simulation. Suppose we have data on the number of diseased individuals ($n_i$) and the population ($p_i$) in each region. Since the Z-test described above is no longer valid, we need an alternative way to come up with critical values. The null hypothesis of no spatial pattern in disease rates can be assessed via simulation. Assign the disease to each individual with probability $\sum_i n_i / \sum_i p_i$, the overall disease rate. Then calculate Moran's I. This is repeated many times, and the resulting values of Moran's I may be used to create a histogram depicting the relative frequencies of $I$ when the null hypothesis is true. Furthermore, the values can be arranged from lowest to highest, and this list can be used to find critical values of $I$. For example, if the simulations are carried out 1000 times, and critical values are desired for a test using $\alpha = 0.05$, they can be found from the ordered list of $I$ values. The lower critical value would be the 25th item on the list, and the upper critical value would be the 975th item on the list.

Illustration of the Monte Carlo method. Dominik Hasek, the goalie for the gold-medal Czech ice hockey team in the 1998 Olympics, saves 92.4% of all shots he faces when he plays professionally for the Buffalo Sabres of the National Hockey League (NHL). The average save percentage of other goalies in the NHL is 90%. Hasek tends to face about 31 shots per game, while the Sabres manage just 25 shots per game on the opposing goalie. To evaluate how much Hasek means to the Sabres, compare the outcomes of 1000 games using Hasek's statistics with the outcomes of 1000 games assuming the Sabres had an ``average'' goalie who stops 90% of the shots against him.

Solution. Take 31 random numbers from a uniform distribution between 0 and 1. Count those greater than 0.924 as goals against the Sabres with Hasek. Take 25 numbers from a uniform distribution between 0 and 1, and count those greater than 0.9 as goals for the Sabres. Record the outcome (win, loss, or tie). Repeat this 1000 times (preferably using a computer!), and tally the outcomes. Finally, repeat the entire experiment using random numbers greater than 0.9 (instead of 0.924) to generate goals against the Sabres without Hasek. Each time the experiment is performed, a different outcome will be obtained. In one comparison, the results were as follows:

                               Wins    Losses    Ties
  Scenario 1 (with Hasek)       434      378      188
  Scenario 2 (without Hasek)    318      515      167

To evaluate Hasek's value to the team over the course of an 82-game season, the outcomes above may first be converted to percentages, multiplied by 82, and then rounded to integers, yielding:

                Wins    Losses    Ties
  Scenario 1     36       31       15
  Scenario 2     26       42       14

Thus Hasek is ``worth'' about 10 wins; that is, the Sabres win about ten games a year that they would have lost if they had an ``average'' goalie.
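The solution above translates almost line for line into code. The following Python sketch is one way to do it; the function name `simulate_games`, the random seed, and the decision to reuse the same seed for both scenarios (so that both goalies face the same stream of random numbers) are our own assumptions, not part of the original experiment. The counts will differ somewhat from those in the table, as they would in any rerun of the experiment.

```python
import random


def simulate_games(save_pct, n_games=1000, shots_against=31, shots_for=25,
                   opponent_save_pct=0.90, seed=0):
    """Return (wins, losses, ties) for the Sabres over n_games simulated games."""
    rng = random.Random(seed)
    wins = losses = ties = 0
    for _ in range(n_games):
        # A shot is a goal when its uniform random number exceeds the save percentage.
        goals_against = sum(rng.random() > save_pct for _ in range(shots_against))
        goals_for = sum(rng.random() > opponent_save_pct for _ in range(shots_for))
        if goals_for > goals_against:
            wins += 1
        elif goals_for < goals_against:
            losses += 1
        else:
            ties += 1
    return wins, losses, ties


with_hasek = simulate_games(save_pct=0.924)
without_hasek = simulate_games(save_pct=0.90)

# Convert each tally to an 82-game season.
season_with = [round(82 * x / 1000) for x in with_hasek]
season_without = [round(82 * x / 1000) for x in without_hasek]

print("with Hasek:   ", with_hasek, season_with)
print("without Hasek:", without_hasek, season_without)
```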


8.4 Local Statistics

8.4.1 Introduction

Besag and Newell (1991) classify the search for clusters into three primary areas. First are ``general'' tests, designed to provide a single measure of overall pattern for a map consisting of point locations. These general tests are intended to provide a test of the null hypothesis that there is no underlying pattern, or deviation from randomness, among the set of points. Examples include the nearest neighbor test, the quadrat method, and the Moran statistic, all outlined above. In other situations, the researcher wishes to know whether there is a cluster of events around a single or small number of prespecified foci. For example, we may wish to know whether disease clusters around a toxic waste site, or whether crime clusters around a set of liquor establishments. Finally, Besag and Newell describe ``tests for the detection of clustering.'' Here there is no a priori idea of where the clusters may be; the methods are aimed at searching the data and uncovering the size and location of any possible clusters.

General tests are carried out with what are called ``global'' statistics; again, a single summary value characterizes any deviation from a random pattern. ``Local'' statistics are used to evaluate whether clustering occurs around particular points, and hence are employed for both focused tests and tests for the detection of clustering. Local statistics have been used both in a confirmatory manner, to test hypotheses, and in an exploratory manner, where the intent is more to suggest, rather than confirm, hypotheses.

Local statistics may be used to detect clusters either when the location is prespecified (focused tests) or when there is no a priori idea of cluster location. When a global test finds no significant deviation from randomness, local tests may be useful in uncovering isolated hotspots of increased incidence. When a global test does indicate a significant degree of clustering, local statistics can be useful in deciding whether (a) the study area is relatively homogeneous in the sense that local statistics are quite similar throughout the area, or (b) there are local outliers that contribute to a significant global statistic. Anselin (1995) discusses local tests in more detail.

8.4.2 Local Moran Statistic

The local Moran statistic is

$$I_i = n(y_i - \bar{y}) \sum_{j \ne i} w_{ij}(y_j - \bar{y}) \tag{8.27}$$

The sum of local Morans is equal to, up to a constant of proportionality, the global Moran; i.e., $\sum_i I_i \propto I$. For example, leaving aside the constant factor $n$, the local Moran statistic for region 1 in Figure 8.7 is

$$I_1 = (32-21)[(26-21) + (19-21)] = 33 \tag{8.28}$$

The expected value of the local Moran statistic is P wij yj y EIi

j6i

n

8:29

1

and the expression for its variance is more complicated. Anselin gives the variance of $I_i$, and assesses the adequacy of the assumption that the test statistic has a normal distribution under the null hypothesis.

8.4.3 Getis's Gi Statistic

To test whether a particular location $i$ and its surrounding regions constitute a cluster of higher (or lower) than average values on a variable ($x$) of interest, Ord and Getis (1995) have used the statistic

$$G_i = \frac{\sum_j w_{ij}(d)\,x_j - W_i \bar{x}}{s\{[nS_{1i} - W_i^2]/(n-1)\}^{1/2}} \tag{8.30}$$

where $s$ is the sample standard deviation of the $x$ values, and $w_{ij}(d)$ is equal to one if region $j$ is within a distance of $d$ from region $i$, and 0 otherwise. The sum is over all regions, including region $i$. Also,

$$W_i = \sum_j w_{ij}(d), \qquad S_{1i} = \sum_j w_{ij}^2 \tag{8.31}$$

Ord and Getis note that when the underlying variable has a normal distribution, so does the test statistic. Furthermore, the distribution is asymptotically normal even when the underlying distribution of the x-variables is not normal, if the distance $d$ is sufficiently large. Since the statistic (8.30) is written in standardized form, it can be taken as a standard normal random variable, with mean 0 and variance 1.

For region 1 in Figure 8.7, we will use weights equal to 1 for regions 1, 2, and 3, and weights equal to 0 for other regions. The $G_i$ statistic is

$$G_1 = \frac{77 - 3(21)}{6.69\sqrt{[6(3) - 9]/5}} = 1.56 \tag{8.32}$$

Since this variable has a normal distribution with mean 0 and variance 1 under the null hypothesis that region 1 is not located in a region of particularly high values, we can use a one-sided test with $\alpha = 0.05$ and a critical value of $z = 1.645$. Since 1.56 is less than 1.645, we fail to reject the null hypothesis.
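The $G_i$ calculation for region 1 can be reproduced with the Python sketch below. The function name `getis_g` is ours; following the definition above, it uses the sample standard deviation (divisor $n-1$) and a weight vector that includes region $i$ itself. The regional values are those of Figure 8.7, with weights of 1 for regions 1, 2, and 3.

```python
import math
import statistics


def getis_g(x, weights):
    """G_i statistic (Equation 8.30) for one region, given its weight vector."""
    n = len(x)
    xbar = statistics.mean(x)
    s = statistics.stdev(x)                       # sample standard deviation
    w_sum = sum(weights)                          # W_i in Equation 8.31
    s1 = sum(w * w for w in weights)              # S_1i in Equation 8.31
    numerator = sum(w * v for w, v in zip(weights, x)) - w_sum * xbar
    denominator = s * math.sqrt((n * s1 - w_sum ** 2) / (n - 1))
    return numerator / denominator


x = [32, 26, 19, 18, 17, 14]
g1 = getis_g(x, weights=[1, 1, 1, 0, 0, 0])
print(f"G_1 = {g1:.2f}")
```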

8.5 Finding Moran's I Using SPSS for Windows 9.0

Consider the six-region system in Figure 8.7. With connectivity defined by a binary 0–1 weight for adjacent regions, we have the weight matrix given by Equation 8.21. To compute the value of Moran's I in SPSS, we first convert the six regional values to z-scores. For the six regions, the z-scores are 1.64, 0.747, −0.299, −0.448, −0.598, and −1.046. Then the quantities $a_i = \sum_j w_{ij} z_j$ are found. These are simply weighted sums of the z-scores of the regions that i is connected to. For example, region 1 is connected to regions 2 and 3, so $a_1 = 0.747 - 0.299 = 0.448$. The six $a_i$ scores are 0.448, 0.299, 0.747, 0.149, −1.046, and −0.896.

Now perform a regression, using the a's as the dependent variable and the z's as the independent variable. In SPSS, click on Analyze, Regression, Linear, and define the dependent and independent variables. Then, under Options, make sure the box labeled "Include constant in equation" is NOT checked. This yields a regression coefficient of 0.446 for the numerator.

For the denominator, we again use no-intercept regression to regress six y-values on six x-values. The six "y-values" are the sums of the weights in each row (2, 4, 4, 2, 4, and 2 for rows 1–6, respectively). The six x-values are 1, 1, 1, 1, 1, and 1 (this will always be a set of n ones, where n is the number of regions). After again making sure that a constant is NOT included in the regression equation, one finds the regression coefficient is 3.0. Moran's I is simply the ratio of these two coefficients: 0.446/3 = 0.1487.

The variance of I in this example may be found from Equation 8.25:

$$V[I] = \frac{2(36)(5)(18) - 4(6)(5)(60) + 2(4)(18^2)}{7(5^2)(18^2)} = 0.033$$

The z-value associated with a test of the null hypothesis of no spatial autocorrelation is $z = [0.1487 - (-0.2)]/\sqrt{0.033} = 1.92$, where −0.2 = −1/(n − 1) is the expected value of I under the null hypothesis. This would exceed the critical value of 1.645 under a one-sided test (which we would use, for example, if our initial alternative hypothesis was that positive autocorrelation existed), and would be slightly less than the critical value of 1.96 in a two-sided test. We note, however, that we are on shaky ground in assuming that this test statistic has a normal distribution, since the number of regions is small. We also note that, in this case, the approximation of 1/(3n) described in Section 8.3.3 for the variance of I would have yielded a variance of 1/18 = 0.0555, which is not too far from that found above using Equation 8.25. The approximation of two divided by the sum of the weights, also described in Section 8.3.3, would have yielded 2/18 = 0.1111. This approximation works better for systems with a greater number of regions.
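The regression shortcut can be checked outside SPSS. The following is a minimal Python sketch, not the book's procedure: the regional values and the adjacency list are assumptions, reconstructed so that they reproduce the z-scores, the a-values, and the row sums (2, 4, 4, 2, 4, 2) quoted in this section, and the variance is computed with the standard normality-assumption formula rather than by transcribing Equation 8.25.

```python
import numpy as np

# Assumed inputs: regional values and binary adjacency reconstructed to be
# consistent with the z-scores, a_i values, and row sums quoted in the text.
x = np.array([32.0, 26.0, 19.0, 18.0, 17.0, 14.0])
neighbors = {1: [2, 3], 2: [1, 3, 4, 5], 3: [1, 2, 5, 6],
             4: [2, 5], 5: [2, 3, 4, 6], 6: [3, 5]}
n = len(x)
W = np.zeros((n, n))
for i, js in neighbors.items():
    for j in js:
        W[i - 1, j - 1] = 1.0

z = (x - x.mean()) / x.std(ddof=1)   # z-scores using the sample standard deviation
a = W @ z                            # a_i = sum_j w_ij z_j

# A no-intercept regression slope of v on u is sum(u * v) / sum(u * u).
numerator = (z @ a) / (z @ z)                          # a regressed on z
ones = np.ones(n)
denominator = (ones @ W.sum(axis=1)) / (ones @ ones)   # row sums regressed on ones
moran_i = numerator / denominator

# Variance of I under the normality assumption (standard textbook form).
s0 = W.sum()
s1 = 0.5 * ((W + W.T) ** 2).sum()
s2 = ((W.sum(axis=1) + W.sum(axis=0)) ** 2).sum()
expected_i = -1.0 / (n - 1)
var_i = (n**2 * s1 - n * s2 + 3 * s0**2) / ((n**2 - 1) * s0**2) - expected_i**2
z_stat = (moran_i - expected_i) / np.sqrt(var_i)

print(round(moran_i, 3), round(var_i, 3), round(z_stat, 2))
```

With these assumed inputs the two regression coefficients come out at about 0.446 and 3.0, so the ratio, variance, and z-value agree with the figures above to rounding.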


Exercises

1. The following residuals are observed in a regression of wheat yields on precipitation and temperature over a 6-county area:

County:                              1    2    3    4    5    6
Number of positive residuals (+):    7   10   12    9   14   15
Number of negative residuals (−):   12    8   19   10   10   10

Use the chi-square test to determine whether there is any interaction between location and the tendency of residuals to be positive or negative. If you reject the null hypothesis of no pattern, describe how you might proceed in the regression analysis.

2. A regression of sales on income and education leaves the following residuals:

(a) Use the join count statistic to determine whether there is a spatial pattern to the residuals.
(b) Use Moran's I to determine whether there is a spatial pattern to the residuals.
(c) If you reject the null hypothesis in either (a) or (b), describe how you would proceed with the regression analysis.

3. (a) Find the nearest neighbor statistic for the following pattern:
(b) Test the null hypothesis that the pattern is random by finding the z-statistic $z = 3.826(R_0 - R_e)\sqrt{\lambda n}$, where n is the number of points and $\lambda$ is the density of points.
(c) Find the chi-square statistic, $\chi^2 = (m - 1)s^2/\bar{x}$, for a set of 81 quadrats, where 1/3 of the quadrats have 0 points, 1/3 of the quadrats have 1 point, and 1/3 of the quadrats have 2 points. Then find the z-value to test the hypothesis of randomness, where

$$z = \frac{\chi^2 - (m - 1)}{\sqrt{2(m - 1)}}$$

and m is the number of cells. Compare it with the critical values z = −1.96 and z = +1.96.

4. Find the expected and observed number of black–white joins in the following pattern:

On the basis of your answer, in which direction away from random would you describe this pattern: more toward a checkerboard pattern, or more toward a clustered pattern?

5. Vacant land parcels are found at the following locations:

Find the variance and mean of the number of vacant parcels per cell, and use the variance–mean ratio to test the hypothesis that parcels are distributed randomly (against the two-tailed hypothesis that they are not).

6. Find the nearest neighbor statistic (the ratio of observed to expected mean distances to nearest neighbors) when n points are equidistant from one another on the circumference of a circle with radius r, and there is one additional point located at the center of the circle. (Hints: the area of a circle is $\pi r^2$ and the circumference of a circle is $2\pi r$.)

7. Prove that the following two z-scores are equivalent:

$$z = \frac{R - 1}{\sigma_R} \qquad \text{and} \qquad z = \frac{r_0 - r_e}{\sigma_r}$$

where

$$\sigma_R = \frac{0.52}{\sqrt{n}}, \qquad \sigma_r = \frac{0.26}{\sqrt{n\lambda}}, \qquad R = \frac{r_0}{r_e}$$

and $r_0$ and $r_e$ are the observed and expected distances to nearest neighbors, respectively. Thus there are two equivalent ways of carrying out the nearest neighbor test.

9

Some Spatial Aspects of Regression Analysis

LEARNING OBJECTIVES

• How to include spatial considerations into regression analyses
• Added-variable plots for spatial variables
• Spatial regression analysis
• Spatially varying parameters, including the expansion method and geographically weighted regression

9.1 Introduction

We have already noted that spatial autocorrelation presents difficulties in estimating regression relationships. In some cases, we may be interested in the pattern of spatially correlated residuals for its own sake. Figure 9.1 is a map I produced for an undergraduate project, showing the residuals from a regression of snowfall on temperature, elevation, and latitude. In this case, the primary purpose was to obtain a visual impression of the effect of the North American Great Lakes on snowfall patterns in New York State. One can clearly see two bands of excess snowfall, one downwind from Lake Erie, and the other downwind from Lake Ontario. The effects downwind of Lake Erie are particularly strong, ranging up to 50–60 inches a year greater than that predicted by temperature, elevation, and latitude alone. The remainder of the map has relatively small residuals. One might also speculate that the negative residuals along the northeast border of the state constitute a precipitation shadow effect, since this area is directly east of the Adirondack Mountains and much of the moisture would have precipitated out before reaching the eastern border.

Figure 9.1 Regression residuals from snowfall analysis

In the snowfall example, it was not necessary to have precise estimates of the effects of temperature, elevation, and latitude on snowfall, since primary interest was in the map pattern of the residuals. However, spatial autocorrelation in the residuals violates an underlying assumption of ordinary least-squares regression, and so alternatives must be considered when reliable regression equations are desired. Spatial regression models seek to remedy the situation by adding to the list of explanatory variables the values of x and/or y in surrounding regions as well. Some approaches to these spatial regression models are considered in Sections 9.2 and 9.3.

Up to this point, we have assumed that values of the regression coefficients were global, in the sense that they were thought to apply to the region as a whole. However, it is possible that the coefficients vary over space. Section 9.4 examines two approaches to spatially varying regression parameters. The final section provides an illustration of the various methods.

9.2 Added-Variable Plots

When regression residuals exhibit spatial autocorrelation, this suggests that the regression results may benefit from additional explanatory variables. Haining (1990b) notes that added variable plots are "graphical devices that are used to decide whether a new explanatory variable should be added to a regression" (see also Weisberg 1985, Johnson and McCulloch 1987). He identifies four situations where spatial effects may be entered into the right-hand side of a regression equation: (1) the value of y depends upon values of y nearby; (2) the value of y at a site depends not only upon values of x at the site but also upon values of x at nearby sites; (3) the value of y at a site depends upon the value of x at the site and on values of x and y at nearby sites; and (4) the size of the error at a site is related to the size of the error at nearby sites. Case (4) is statistically indistinguishable from case (3).


The idea behind added variable plots is to see whether there is a relationship between y, once it has been adjusted for the variables already in the equation, and some omitted variable. Let $x_p$ denote the omitted variable. The procedure is as follows:

(1) Obtain the residuals of the regression of y on the x-variables.
(2) Obtain the residuals of the regression of $x_p$ on the x-variables.
(3) Plot the residuals obtained in (1) on the vertical axis, and those from (2) on the horizontal axis.

The result is the relationship between $x_p$ and y, adjusted for the other x's. If the points in the plot lie along or near a straight line, this suggests that the variable should be added to the regression equation. These plots may be produced within SPSS by checking the "Produce all Partial Plots" box under the Plots section of Linear Regression.
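The three steps can also be carried out directly. The sketch below is illustrative only: the data are simulated, the names `x1` and `xp` are hypothetical, and the plot itself is summarized by a correlation and a slope rather than drawn.

```python
import numpy as np

def ols_residuals(y, X):
    """Residuals from an OLS regression of y on X (an intercept is added)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

# Hypothetical data: x1 is already in the model, xp is the candidate variable.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
xp = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * xp + rng.normal(size=n)

res_y = ols_residuals(y, x1)     # step (1): y adjusted for x1
res_xp = ols_residuals(xp, x1)   # step (2): xp adjusted for x1

# Step (3) would plot res_y (vertical) against res_xp (horizontal).
# Here the plot is summarized by its correlation and least-squares slope.
corr = np.corrcoef(res_xp, res_y)[0, 1]
slope = (res_xp @ res_y) / (res_xp @ res_xp)
print(f"correlation = {corr:.2f}, slope = {slope:.2f}")
```

A useful property of this construction is that the slope of the added variable plot equals the coefficient the new variable would receive if it were added to the full regression, so a clearly sloping, linear plot is direct evidence that the variable belongs in the equation.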

9.3 Spatial Regression

It is possible to specify a spatial regression model in the same way as the usual linear regression model, with the exception that the residuals are modeled as functions of the surrounding residuals (see, e.g., Bailey and Gatrell 1995). If we use $\varepsilon$ to denote the usual residual or error term, the residual for a particular observation is written as a linear function of the other residuals:

$$\varepsilon_i = \rho \sum_{j=1}^{n} w_{ij}\varepsilon_j + u_i \qquad (9.1)$$

where $w_{ij}$ is a measure of the connection between location i and location j (often taken as a binary connectivity measure), $\rho$ is a measure of the strength of the correlation of the residuals, and $u_i$ is the remaining error term after the correlation among residuals has been accounted for. Note that if $\rho = 0$, the model reduces to the ordinary linear regression model. To estimate the model, one can define the quantities

$$y_i^* = y_i - \rho \sum_{j=1}^{n} w_{ij} y_j, \qquad x_i^* = x_i - \rho \sum_{j=1}^{n} w_{ij} x_j \qquad (9.2)$$

Then regressions of $y^*$ vs $x^*$ are tried for a variety of $\rho$ values, beginning at zero. The residuals of each regression are inspected, and the value of $\rho$ associated with the most suitable set of residuals is adopted. Section 9.5.3 provides an example. Bailey and Gatrell note that this estimation procedure is, strictly, not one that is the best from a statistical viewpoint, and that more sophisticated approaches exist. However, it should give the analyst a good idea of the spatial effects that may be present in a model.
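The trial-and-error search over the autocorrelation parameter is easy to automate. The following sketch makes several assumptions that are not part of the text: the data are simulated, neighbors are the three nearest points, "most suitable" is taken to mean the smallest standard error of the estimate, and the function names are hypothetical.

```python
import numpy as np

def knn_weights(coords, k=3):
    """Binary weights: w_ij = 1 if j is one of the k nearest neighbors of i."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    W = np.zeros(dist.shape)
    W[np.arange(len(coords))[:, None], nearest] = 1.0
    return W

def fit_for_rho(rho, y, X, W):
    """Regress y* on x* for one trial rho; return (standard error, coefficients)."""
    y_star = y - rho * (W @ y)
    X_star = X - rho * (W @ X)
    design = np.column_stack([np.ones(len(y)), X_star])
    coef, *_ = np.linalg.lstsq(design, y_star, rcond=None)
    resid = y_star - design @ coef
    see = np.sqrt(resid @ resid / (len(y) - design.shape[1]))
    return see, coef

# Hypothetical data with spatially correlated errors (true rho = 0.2).
rng = np.random.default_rng(1)
n = 100
coords = rng.random((n, 2))
W = knn_weights(coords)
X = rng.normal(size=(n, 2))
errors = np.linalg.solve(np.eye(n) - 0.2 * W, rng.normal(size=n))
y = 1.0 + X @ np.array([2.0, -1.0]) + errors

rhos = np.arange(0.0, 0.31, 0.01)
sees = np.array([fit_for_rho(r, y, X, W)[0] for r in rhos])
best_rho = rhos[sees.argmin()]
print(f"rho = {best_rho:.2f}, standard error = {sees.min():.3f} (OLS: {sees[0]:.3f})")
```

The slope coefficients from the transformed regression are directly comparable with the ordinary least-squares slopes; the fitted intercept is not, because the column of ones is also scaled by the transformation.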

9.4 Spatially Varying Parameters

9.4.1 The Expansion Method

With linear regression, the slope and intercept parameters are "global", in the sense that they apply to all observations. The expansion method (Casetti 1972, Jones and Casetti 1992) suggests that these parameters may themselves be functions of other variables. Thus, in a linear regression equation of house prices (y) on lot size ($x_1$) and number of bedrooms ($x_2$):

$$y = b_0 + b_1 x_1 + b_2 x_2 + \varepsilon \qquad (9.3)$$

the effect of lot size on house prices ($b_1$) may itself depend upon whether there is a park nearby (for example, large lot sizes may be more valuable in a suburb if there is no other green space nearby). So, we add an expansion equation

$$b_1 = c_0 + c_1 d \qquad (9.4)$$

where d is the distance to the nearest park. We would expect $c_1$ to be positive; large distances to the nearest park would mean that $b_1$ is high, which in turn means that lot sizes have a large influence on house prices. If we substitute this expansion equation into the original equation we have

$$y = b_0 + (c_0 + c_1 d)x_1 + b_2 x_2 + \varepsilon = b_0 + c_0 x_1 + c_1 d x_1 + b_2 x_2 + \varepsilon \qquad (9.5)$$

To estimate the coefficients, we perform a linear regression of y on the variables $x_1$, $x_2$, and $dx_1$. In Equation 9.5, the quantity $dx_1$ may be thought of as a new variable, created by multiplying together distance to park (d) and lot size ($x_1$). When the coefficient $c_1$ is significant, this is known as an interaction effect; the effect of lot size on housing prices interacts with, or depends upon, the distance to the park (or alternatively, the effect of distance to the park depends upon the size of the lot).

The edited collection of Jones and Casetti (1992) contains a wide variety of applications of the expansion method. These include applications to models of welfare, population growth and development, migrant destination choice, urban development, metropolitan decentralization, and the spatial structure of agriculture. The collection also includes methodological contributions that focus upon statistical aspects of the model, including its relationship to spatial dependence in the data.
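Estimating the expanded equation amounts to an ordinary regression with an interaction column. The sketch below uses simulated data in units of thousands; the parameter values, sample size, and variable names are assumptions chosen only to show that the two expansion coefficients can be recovered.

```python
import numpy as np

# Hypothetical data (units of thousands); names and values are illustrative.
rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(6.0, 0.8, size=n)          # lot size
x2 = rng.integers(2, 7, size=n)            # bedrooms, 2 to 6
d = rng.uniform(0.0, 0.7, size=n)          # distance to nearest park
b1 = 10.0 + 20.0 * d                       # expansion equation: c0 = 10, c1 = 20
y = 20.0 + b1 * x1 + 20.0 * x2 + rng.normal(0.0, 5.0, size=n)

# Expanded model: regress y on x1, x2, and the interaction d * x1.
design = np.column_stack([np.ones(n), x1, x2, d * x1])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
b0_hat, c0_hat, b2_hat, c1_hat = coef
print(f"c0 = {c0_hat:.2f}, c1 = {c1_hat:.2f}")

def lot_size_effect(distance):
    """Implied effect of lot size on price at a given distance from the park."""
    return c0_hat + c1_hat * distance
```

Calling `lot_size_effect(0.0)` and `lot_size_effect(0.5)` shows how the estimated lot-size slope changes between a house beside the park and one farther away.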


9.4.2 Geographically Weighted Regression

In a series of articles, Fotheringham and his colleagues at Newcastle have outlined an alternative approach to the expansion method that accounts for spatially varying parameters (see, e.g., Fotheringham et al. 1998, Brunsdon et al. 1996, 1999). Their geographically weighted regression (GWR) technique is based upon "local" views of regression as observed from any location. For each location, one can estimate a regression equation where weights are attached to observations surrounding the location. Relatively large weights are given to points near the location, and smaller weights are assigned to observations far from the location. As Fotheringham et al. (2000) note:

"There is a continuous surface of parameter values. . . In the calibration of the GWR model it is assumed that observed data near to point i have more of an influence in the estimation of the [regression coefficients] than do data located farther from i" (p. 108).

More formally, the dependent variable at location i is modeled as follows:

$$y_i = b_{i0} + \sum_{j=1}^{p} b_{ij} x_{ij} + \varepsilon_i \qquad (9.6)$$

where, as is the case with ordinary linear regression, there are p independent variables, and $x_{ij}$ represents the observation on variable j at location i. The important point to note is that the b coefficients have i subscripts, indicating that they are specific to the location of observation i. One reasonable choice for the weights is a negative exponential function of squared distance

$$w_{ij} = e^{-\beta d_{ij}^2} = \exp(-\beta d_{ij}^2) \qquad (9.7)$$

so that points that are farther away will be assigned lower weights. To estimate the regression coefficients at location i, one first defines the weights ($w_{ij}$), using an initial "guess" for the value of $\beta$ (one possibility would be to use $\beta = 0$, which corresponds to the ordinary least-squares case). Then define the quantities

$$y_j^* = \sqrt{w_{ij}}\, y_j, \qquad x_j^* = \sqrt{w_{ij}}\, x_j, \qquad j = 1, \ldots, n \qquad (9.8)$$

These are the weighted observations. At location i, run a linear regression of the $y^*$ on the $x^*$, omitting observation i from the analysis. Use the resulting regression coefficients to predict the value of y at location i. Then find the squared difference between the observed value of y (denoted $y_i$) and this predicted value

$$\{y_i - \hat{y}_{\neq i}(\beta)\}^2 \qquad (9.9)$$

where $\hat{y}_{\neq i}(\beta)$ is the predicted value of the dependent variable at location i when observation i has not been used in the estimation, and the $(\beta)$ reminds us that this prediction was made using a specific value of $\beta$. After this has been repeated for each location i, one may compute the total sum of squared deviations between observed and predicted values as

$$s(\beta) = \sum_{i=1}^{n} \{y_i - \hat{y}_{\neq i}(\beta)\}^2 \qquad (9.10)$$

The next step is to repeat this procedure for many values of $\beta$, choosing as "best" the value of $\beta$ that minimizes the score $s(\beta)$. This final value of $\beta$ yields the best set of weights. The final regression coefficients at each location are given as follows: first use the final, optimal value of $\beta$ to define the weights, and then regress $y^*$ on $x^*$ using all of the observations.

9.5 Illustration

Figure 9.2 displays the location of 30 hypothetical houses in a square study area that features a park at its center. The dataset in Table 9.1 was generated by assuming that housing prices were related to lot size, number of bedrooms, and the presence of a fireplace. Furthermore, spatial effects were added in the generation of the data. The lot sizes were generated in such a way as to be spatially autocorrelated, and the effect of lot size on housing prices was made to be a function of how distant the house was from the centrally located park.

Figure 9.2 Location of thirty hypothetical houses

Table 9.1 Hypothetical data on 30 houses. All digits have been retained in the generated prices, though in practice one would expect them to be rounded to, say, the nearest hundred.

More specifically, housing prices were generated using the equations

$$\begin{aligned} p &= 20\,000 + b_1 x_1 + 20\,000 x_2 + 20\,000 x_3 + \varepsilon \\ b_1 &= 10\,000 + 20\,000 d \\ \varepsilon &\sim N(0, 20\,000^2) \end{aligned} \qquad (9.11)$$

where p is price, $x_1$ is lot size (in thousands of square feet), $x_2$ is number of bedrooms, and $x_3$ is a dummy variable indicating the presence or absence of a fireplace (1 = presence; 0 = absence). d is the distance from the centrally located park, and $\varepsilon$ is a normally distributed error term with standard deviation equal to 20 000. Houses were assigned fireplaces with probability 0.3, and were assigned a number of bedrooms by allowing integers in the range 2–6 with equal likelihoods. Lot sizes were normally distributed with mean 6 and standard deviation 0.8. Thus the "true" data follow quite closely an expansion-equation model, and we will expect that such a model will perform quite well. But for now, let us assume that we are simply faced with the data in Table 9.1, and we want to model housing prices as a function of the independent variables.
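A dataset of the same general kind can be simulated from the generating equations. The sketch below is not a reproduction of Table 9.1: the house locations are drawn uniformly in the unit square, the park is placed at (0.5, 0.5), and lot sizes are drawn independently rather than with spatial autocorrelation, since the text does not say how that autocorrelation was induced. All of these choices, and the random seed, are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
coords = rng.random((n, 2))                  # assumed: uniform in the unit square
d = np.linalg.norm(coords - 0.5, axis=1)     # distance from the park at (0.5, 0.5)
x1 = rng.normal(6.0, 0.8, size=n)            # lot size; independent draws (assumption)
x2 = rng.integers(2, 7, size=n)              # bedrooms: 2..6, equally likely
x3 = (rng.random(n) < 0.3).astype(int)       # fireplace with probability 0.3
eps = rng.normal(0.0, 20_000.0, size=n)      # error term, standard deviation 20 000

b1 = 10_000 + 20_000 * d                     # lot-size effect grows with distance
price = 20_000 + b1 * x1 + 20_000 * x2 + 20_000 * x3 + eps

# Global OLS fit of price on the three observed variables.
design = np.column_stack([np.ones(n), x1, x2, x3])
coef, *_ = np.linalg.lstsq(design, price, rcond=None)
print(np.round(coef))
```

Because the lot-size effect varies with distance while the fitted model assumes a single global slope, the residuals from this fit absorb the unmodeled spatial variation, which is the situation the following subsections try to address.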


9.5.1 Ordinary Least-Squares Regression

Table 9.2 shows the results from the ordinary least-squares regression of housing price on lot size, number of bedrooms, and presence of fireplace. The coefficients are all significant. The $r^2$ value is 0.562, and the standard error of the estimate is 29 080. The residuals display positive spatial autocorrelation, indicating potential problems with the estimation. The coefficient on lot size is a bit low, since we know from the way the data were generated that it ranges from a low of 10 000 near the park to a high of about 20 000 (= 10 000 + 20 000(0.5)) near the periphery.

9.5.2 Added-Variable Plots

We begin by making the rather arbitrary decision that neighbors are defined in this example as the three closest observations. Thus $w_{ij} = 1$ if observation j is one of the three nearest neighbors of i, and 0 otherwise.

Table 9.2 Results from OLS regression

Figure 9.3 Added variable plots (two panels, (a) and (b), each with the unstandardized residuals of y vs. x on the vertical axis)

Following Haining's example, we will consider the addition of new variables. The possibilities we will consider are

$$x_{p1,i} = \sum_{j=1}^{n} w_{ij} y_j, \qquad x_{p2,i} = \sum_{j=1}^{n} w_{ij} x_j \qquad (9.12)$$

The first suggests that y at a location is a function not only of x at that location but also of the y values in surrounding locations. The second equation suggests that the y value at a location may also be a function of the x-values in surrounding locations. To construct added variable plots for each of these potential additions to the regression equation, we need (a) the residuals of the ordinary least-squares regression (from Section 9.5.1), and (b) the residuals from regressions of each candidate variable $x_p$ on x. These residual plots are shown in panels (a) and (b) of Figure 9.3. Neither plot shows a significant correlation, and so we conclude that these variables would not improve the specification of the regression equation.

9.5.3 Spatial Regression

Following the autocorrelated errors model of Section 9.3, and using the same definition of the weights (w) used in Section 9.5.2, we define the quantities

$$y_i^* = y_i - \rho \sum_{j=1}^{n} w_{ij} y_j, \qquad x_i^* = x_i - \rho \sum_{j=1}^{n} w_{ij} x_j \qquad (9.13)$$

Figure 9.4 Minimizing the standard error of the estimate (the standard error falls from 29,080 at ρ = 0 to a minimum of 26,795 at ρ = 0.18)

Table 9.3 Spatial regression results with ρ = 0.18

                     Coefficient    Standard error    t
Intercept            2 926.9        4 914             0.60
Lot size             17 910         3 341             5.30
No. of bedrooms      22 921         3 139             7.30
Fireplace            27 233         9 003             3.02

We would like to choose a value of $\rho$ that is associated with a "good" set of residuals. Although there are different ways that this could be done, after trying different values we find that $\rho = 0.18$ minimizes the standard error of the estimate (see Figure 9.4). When $y^*$ is regressed on $x^*$ using this value of $\rho$, we obtain the results in Table 9.3. The standard error of the estimate has been reduced to 26 795, and the value of $r^2$ is now 0.883. All variables are significant as before, and the t-values for all coefficients are higher than those under ordinary least squares (Section 9.5.1). In addition, the coefficient on lot size in this equation is equal to 17 910, which is closer to its average value of about 15 000 (recall that we generated the data so that the true lot size coefficient varied from 10 000 to about 20 000).

9.5.4 Expansion Method

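Next we estimate the expansion model

$$\begin{aligned} p &= b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \varepsilon \\ b_1 &= \gamma_0 + \gamma_1 d \end{aligned} \qquad (9.14)$$

where the variables are as defined above, and $\gamma_0$ and $\gamma_1$ are the regression coefficients that tell us how the influence of lot size on housing prices varies with distance from the park. This may be rewritten as

$$p = b_0 + (\gamma_0 + \gamma_1 d)x_1 + b_2 x_2 + b_3 x_3 + \varepsilon \qquad (9.15)$$

which is identical to

$$p = b_0 + \gamma_0 x_1 + \gamma_1 d x_1 + b_2 x_2 + b_3 x_3 + \varepsilon \qquad (9.16)$$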

The results obtained when using ordinary least-squares regression on this equation are shown in Table 9.4. The $r^2$ value is equal to 0.747, and all parameters, including those associated with the expansion equation, are significant. Furthermore, all parameter values are near their "true" values, and the standard error of the estimate is 22 552. Of course, it should be kept in mind that one reason this particular approach has worked relatively well here is that the estimated model is consistent with the way in which the data were generated. We helped ourselves out by choosing to expand the model using the relation between lot size effects and distance from the park; this was a good choice because that is how the data were created!

Table 9.4 Results from expansion method

9.5.5 Geographically Weighted Regression

Using the weights defined in Equation 9.7 and the method outlined in Section 9.4.2, the optimal value of $\beta$ was found to be 4.8. This defines a set of weights that are associated with the variables in Equation 9.8. The regressions are then run once for each data point, using these weights. Figure 9.5 displays a map of the coefficient on lot size. From the figure one can see that the parameter is higher away from the park, where there is a large cluster of observations in the north. This is in keeping with our expectations, since the effect of lot size on house prices was made to be greater in peripheral locations.

Figure 9.5 Spatial variation in lot size coefficient
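The calibration procedure of Section 9.4.2 can be sketched in a few lines. This is an illustration under stated assumptions, not the computation behind Figure 9.5: the data are simulated, lot size is standardized to mean zero, the distance-decay parameter (called `beta` here) is searched over a small hand-picked grid, and the function names are hypothetical.

```python
import numpy as np

def local_coefficients(i, beta, coords, X, y, leave_out=False):
    """Weighted least squares centered on location i, w_ij = exp(-beta * d_ij^2)."""
    d2 = ((coords - coords[i]) ** 2).sum(axis=1)
    w = np.exp(-beta * d2)
    if leave_out:
        w[i] = 0.0                      # zero weight is equivalent to omitting observation i
    root_w = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * root_w[:, None], y * root_w, rcond=None)
    return coef

def cv_score(beta, coords, X, y):
    """Sum of squared leave-one-out prediction errors for one value of beta."""
    preds = np.array([X[i] @ local_coefficients(i, beta, coords, X, y, leave_out=True)
                      for i in range(len(y))])
    return float(((y - preds) ** 2).sum())

# Hypothetical data: the lot-size slope increases with distance from a central park.
rng = np.random.default_rng(4)
n = 150
coords = rng.random((n, 2))
d_park = np.linalg.norm(coords - 0.5, axis=1)
lot = rng.normal(size=n)                               # standardized lot size (assumption)
y = 20.0 + (10.0 + 20.0 * d_park) * lot + rng.normal(0.0, 2.0, size=n)
X = np.column_stack([np.ones(n), lot])                 # intercept and lot size

betas = [0.0, 2.0, 5.0, 10.0, 20.0, 50.0]
scores = [cv_score(b, coords, X, y) for b in betas]
best_beta = betas[int(np.argmin(scores))]

# Final local coefficients use all observations with the chosen beta.
local = np.array([local_coefficients(i, best_beta, coords, X, y) for i in range(n)])
lot_coef = local[:, 1]
print(best_beta, np.corrcoef(lot_coef, d_park)[0, 1])
```

Mapping `lot_coef` at the house locations gives the kind of surface described above; in this simulated example the local slopes rise with distance from the park, as they were constructed to do.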

Exercises

1. Use an added variable plot to determine whether distance to the park should be added to a regression of housing price on lot size, number of bedrooms, and presence or absence of a fireplace. Use the data in Table 9.1.

2. Using the data in Table 9.1 and geographically weighted regression, produce a map showing the spatial variation in the coefficient on number of bedrooms. Alternatively, you may provide a table showing the regression coefficient for the number of bedrooms at each of the 30 sample locations.

3. With the data in Table 9.1, first perform an ordinary least-squares regression with housing price as the dependent variable and lot size as the independent variable. Then use the expansion method, with the lot size coefficient depending upon the number of bedrooms. Interpret the results.

4. Use the spatial regression method outlined in Sections 9.3 and 9.5.3 with the data in Table 9.1 for a regression of housing prices on lot size and number of bedrooms.

10

Data Reduction: Factor Analysis and Cluster Analysis

LEARNING OBJECTIVES

• Introduction to multivariate methods for data reduction, including principal components analysis, factor analysis, and cluster analysis
• Geometric interpretations of the methods

10.1 Factor Analysis and Principal Components Analysis

Many studies of complex geographic phenomena begin with a set of data and notions of hypotheses and theories that are vague at best. Factor analysis may be used as a data reduction method, to reduce a dataset containing a large number of variables down to one of more manageable size. When many of the original variables are highly correlated, it is possible to reduce the original data from a large number of original variables to a small number of underlying factors.

A geometric interpretation helps one to understand the purpose of factor analysis. A data set consisting of n observations on p variables may be represented as n points plotted in a p-dimensional space. This is easiest to imagine when p = 1, 2, or 3, and the latter case is illustrated in Figure 10.1. The figure also shows an ellipsoidal figure that contains the majority of the data points. The idea behind factor analysis is to construct factors that represent a large proportion of the variability of the dataset. The first factor is, geometrically, the longest axis of the ellipse. The original axes correspond to variables; the longest axis of the ellipse is a new variable, which is a linear combination of the original variables. This new variable, or factor, captures as much of the variability in the dataset as possible. A second factor is derived by finding the second longest axis of the ellipse, such that this second axis is perpendicular to the first axis. The fact that the axes of the ellipse are perpendicular implies that the newly defined factors will be uncorrelated with one another; they represent separate and independent aspects of the underlying data.

Figure 10.1 Data ellipsoid in p = 3 dimensions (original axes x1, x2, and x3, with the first and second principal components marked)

A dataset characterized by an extremely elongated ellipse would be well represented by a single factor: that combination of variables would explain almost all of the variability in the original data. In the extreme case, the plotted data would fall along a single line, which would constitute the axis or single factor that would capture all of the variability in the data. At the other extreme, the data ellipse could be circular; in this case, all factors explain an equal amount of the variability in the original data, and there are no dominant factors.

In this discussion we will focus more upon the interpretation of the outputs of factor analysis, and less on its mathematical aspects. The next subsection addresses the interpretation of factor analysis results through an example using 1990 census data from Erie County, New York.

10.1.1 Illustration: 1990 Census Data for Buffalo, New York

Geographers often use many census variables in their analyses, and the set of variables can easily contain subsets that measure essentially the same phenomenon. The following example illustrates, for a small set of census data, how the number of original variables can be collapsed into a smaller number of uncorrelated factors. A 235 × 5 data table was constructed by collecting and deriving the following information for the 235 census tracts in Erie County, New York (variable labels are in parentheses):

(a) median household income (medhsinc)
(b) percentage of households headed by females (female)
(c) percentage of high school graduates who have a professional degree (educ)
(d) percentage of housing occupied by owner (tenure)
(e) percentage of residents who moved into their present dwelling before 1959 (lres)

These five variables capture different aspects of the socioeconomic and demographic character of census tracts. Do they represent separate dimensions of socioeconomic and demographic structure, or is there significant redundancy in what they measure, indicating that the variables might be reduced to a smaller number of underlying indices or factors?

A natural place to start is with the correlation matrix. Table 10.1 reveals that the highest correlations are with the median household income variable; areas of high income have low percentages of households headed by females, high percentages of homeowners, high percentages of graduates with professional degrees, and a relatively low proportion of long-term residents. Using the test of significance described in Chapter 5, all correlations with absolute value greater than $2/\sqrt{235} = 0.130$ are significant.

The second step is to examine the outcome of describing the data as an ellipsoid, as described above. The method of principal components is used to describe the p axes of the ellipse (which in turn is constructed in a p-dimensional space, where p is the number of variables). The relative lengths of the axes are called eigenvalues. They are referred to in Table 10.2 as "extraction sums of squared loadings." A "loading" is the correlation between a component or factor and the original variable. If one were to sum the squared correlations between a factor and all of the original variables, this would be equal to the eigenvalue or the length of the ellipse axis. From the table, we see that the highest eigenvalue is 2.6 and the second highest is 0.96. Note that the column displaying these values sums to five; the eigenvalues (i.e., "extraction sums of squared loadings") will always sum to the number of variables. In the extreme case there would be a single component with perfect correlations with all of the original variables. The eigenvalue for this component would be equal to $1^2 + 1^2 + \cdots + 1^2 = p$. All of the other eigenvalues would be equal to zero, and the ellipse would collapse to a single line.

Table 10.1 Correlation among variables

Table 10.2 Variance explained by each component
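The relationships among eigenvalues, loadings, and communalities can be verified numerically. The sketch below uses simulated data with the same dimensions as the Erie County table (235 observations, 5 variables); the data, the mixing matrix, and the choice of two retained components are assumptions, and the loadings are unrotated, unlike those reported after rotation in a typical factor analysis output.

```python
import numpy as np

# Hypothetical data: 235 observations on 5 variables driven by two latent factors.
rng = np.random.default_rng(5)
n, p = 235, 5
latent = rng.normal(size=(n, 2))
mixing = np.array([[0.9, 0.1], [-0.7, 0.3], [0.2, 0.9], [0.8, -0.2], [0.5, 0.1]])
data = latent @ mixing.T + 0.5 * rng.normal(size=(n, p))

R = np.corrcoef(data, rowvar=False)        # p x p correlation matrix
threshold = 2 / np.sqrt(n)                 # approximate significance cutoff for |r|

eigvals, eigvecs = np.linalg.eigh(R)       # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings: correlations between each variable (row) and each component (column).
loadings = eigvecs * np.sqrt(eigvals)

k = 2                                      # number of components retained
communalities = (loadings[:, :k] ** 2).sum(axis=1)
uniqueness = 1 - communalities

print(np.round(eigvals, 2), round(eigvals.sum(), 6))
print(np.round(communalities, 3))
```

Summing the squared loadings down a column returns that component's eigenvalue, and the eigenvalues sum to p; summing squared loadings across the retained columns gives each variable's communality.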


This table also provides us with valuable information concerning how many factors are necessary to adequately describe the data. There are two ``rules of thumb'' that are used to decide on the number of factors. One such rule is to retain components with eigenvalues greater than one. This would be an unfortunate rule to apply in this instance, since the second eigenvalue is just slightly less than one (0.96). An alternative is to plot the eigenvalues on the vertical axis and the factor number (ranging from 1 to p) on the horizontal axis of a graph. Then inspect the graph to locate a point at which the graph (termed a scree plot) ¯attens out; such a feature implies that the additional factors do not contribute much to the explanation of variability in the data set. Figure 10.2 displays a scree plot for our present example. Some judgement is called for, and we could in this instance justify the extraction of either two or three factors. Suppose that we decide to extract two factors. The next step is to inspect the loadings, or correlations between the factors and the original variables. This is a key step in the analysis, since it is where the ``meaning'' and interpretation of each factor occurs. To aid in this interpretation, the extracted component solution is rotated in the p-dimensional space, so that the loadings tend to be either high in absolute value (near plus or minus one) or low (near zero). Table 10.3 shows that the two-factor solution may be described as follows. The ®rst factor is one where income, tenure, and length of residence all ``load highly''. We can think of these variables has being combined to form a single index (the factor) that describes with a single number what the three variables represent. The second factor is associated with the other two variables ± education and family structure. It is common practice to attempt to give the factors snazzy, descriptive names. Having noted this, it is often dicult to come up

Figure 10.2 Scree plot for Erie County example


Table 10.3 Factor loadings

Table 10.4 Communalities

with something creative! The ®rst factor here might be thought of as a housing/ economic factor and the second a sociological factor. The dierence between principal components analysis and factor analysis may be summarized as follows. Principal components analysis is a descriptive method of decomposing the variation among a set of p original variables into p components. The components are linear combinations of the original variables. It is used as a prelude to factor analysis, which attempts to model the variability in the original set of variables via a reduced number of factors which is less than p. In factor analysis, values of the original variables may be reconstructed by writing them as linear combinations of the factors, plus a `ùniqueness'' term. Alternatively stated, in factor analysis part of the variability in an original variable is captured by the factors (this portion of the variability is termed the communality), and part is not captured by the factors (this portion is termed the uniqueness). Table 10.4 shows the communalities for the twofactor solution. The highest communality is for education (0.833) and the lowest for length of residence (0.587). The communalities for a variable are equal to the sum of the squared correlations of the variable with the factors. For example, the communality for education is equal to its squared correlation with factor one (0.03792) plus its squared correlation with factor two (0.9122). Length of residence has the highest uniqueness, since it is not highly correlated with the two factors. It is important to realize that the output of a factor analysis is a strong function of the input. The fact that length of residence is not strongly related to either factor does not mean that it is not an important feature of urban structure. The factors that emerge from a factor analysis are not necessarily the


"most important" ones, but rather the ones that capture the nature of the dataset. If we had a dataset with fifteen variables, and eleven of the fifteen variables were alternative measures of income, we could be certain that an income factor would emerge as the principal factor, simply because so many variables were highly intercorrelated.

Finally, one of the outputs from factor analysis is the factor scores. Instead of making p separate maps describing the spatial pattern of each variable, one is now interested in making a number of maps equal to the number of underlying factors. For each factor, and for each observation, a score may be computed as a linear combination of the original variables. The result is a new data table; instead of the original n × p table, we now have an n × k table, where k is the number of factors. Figures 10.3 and 10.4 display the factor scores on each of our two factors for the Erie County census tracts.

10.1.2 Regression Analysis on Component Scores

As we have seen, the chief use of principal components analysis is to summarize a large number of variables in terms of a set of uncorrelated components. This sounds ideal for regression analysis, where one is faced with a large number of possibly correlated variables, and the objective is to use a small number of uncorrelated variables that will be useful in explaining the variability in the dependent variable. Regression analysis may be carried out on component scores, ensuring not only that the independent variables are a parsimonious subset capturing the underlying dimensions of the full set of potential independent variables, but that they are uncorrelated as well. This idea for eliminating multicollinearity is one that is quite commonly employed (for example, see Ormrod and Cole 1996, Ackerman 1998, O'Reilly and Webster 1998). One disadvantage is that it is somewhat more difficult to interpret the regression coefficients. They now indicate how much the dependent variable changes when the component score changes by one unit, and it is more difficult to conceptualize just what a one-unit increase in the component score really implies. Hadi and Ling (1998) also note some pitfalls in the use of principal components regression.
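The idea of regressing on uncorrelated component scores can be illustrated with a minimal Python sketch. It assumes only two predictors, in which case the principal components of the standardized variables are simply their normalized sum and difference; the data values and the function names are hypothetical and are not taken from the Erie County example.

```python
import math

def mean(v):
    return sum(v) / len(v)

def zscores(v):
    """Standardize to mean 0 and (population) standard deviation 1."""
    m = mean(v)
    s = math.sqrt(sum((x - m) ** 2 for x in v) / len(v))
    return [(x - m) / s for x in v]

def covariance(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def slope(score, y):
    """Least-squares slope of y on a single predictor."""
    return covariance(score, y) / covariance(score, score)

# Hypothetical data: two correlated predictors and a dependent variable.
x1 = [2.0, 4.0, 5.0, 7.0, 9.0, 10.0]
x2 = [1.0, 3.0, 6.0, 6.0, 8.0, 11.0]
y = [3.0, 5.0, 8.0, 9.0, 12.0, 15.0]

z1, z2 = zscores(x1), zscores(x2)

# With two standardized variables, the principal components are the
# normalized sum and difference of the z-scores.
c1 = [(a + b) / math.sqrt(2) for a, b in zip(z1, z2)]
c2 = [(a - b) / math.sqrt(2) for a, b in zip(z1, z2)]

r_original = covariance(z1, z2)    # correlation between the predictors
r_components = covariance(c1, c2)  # zero, up to rounding error

# Because the scores are uncorrelated, each regression coefficient can be
# obtained from a separate bivariate regression on one component.
b1, b2 = slope(c1, y), slope(c2, y)
```

The coefficients b1 and b2 describe the change in y per unit change in a component score, which is exactly the interpretive difficulty noted above.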

10.2 Cluster Analysis

Whereas factor analysis works by searching for similar variables, cluster analysis has as its objective the grouping together of similar observations. Since it is conventional to represent each observation as a row in a data table, and each variable as a column, cluster analysis has at its core the search for similar rows of data. Factor analysis is based upon similarities among columns of data. Like factor analysis, cluster analysis may be thought of as a data reduction technique. We seek to reduce the n original observations into g groups, where


Figure 10.3 Factor 1 scores

1 ≤ g ≤ n. In achieving this reduction of n observations into a smaller number of groups, a general goal is to minimize the within-group variation and maximize the between-group variation. In Figure 10.5, there is relatively little variability within groups, as measured by the variation in the location of points around their group centroids. Relative to this within-group variability, there is much more variation in the locations of the group centroids in relation to the centroid for the entire dataset.
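The contrast between within-group and between-group variation can be made concrete with a short Python sketch. The points are hypothetical and the helper names are illustrative only; variation is measured here as the sum of squared Euclidean distances to a centroid.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def sums_of_squares(groups):
    """Return the within-group, between-group, and total sums of squares."""
    all_points = [p for g in groups for p in g]
    grand = centroid(all_points)
    within = sum(sq_dist(p, centroid(g)) for g in groups for p in g)
    between = sum(len(g) * sq_dist(centroid(g), grand) for g in groups)
    total = sum(sq_dist(p, grand) for p in all_points)
    return within, between, total

# Two hypothetical, well-separated groups in p = 3 dimensions.
groups = [
    [(1.0, 1.0, 2.0), (2.0, 1.0, 1.0), (1.0, 2.0, 1.0)],
    [(8.0, 9.0, 8.0), (9.0, 8.0, 9.0), (9.0, 9.0, 8.0)],
]
within, between, total = sums_of_squares(groups)
```

For any grouping, the within and between sums of squares add up to the total, so reducing one necessarily increases the other.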


Figure 10.4 Factor 2 scores

One of the more widespread applications of cluster analysis in geography has been in the area of geodemographics, where analysts seek to reduce a large number of subregions (e.g., census tracts) by classifying them into a small number of types (see, e.g., Chapter 10 of Plane and Rogerson 1994). Cluster analysis has also been used as a method of regionalization, where the objective is to divide a region into a smaller number of contiguous subregions. In this


Figure 10.5 Clustering in p = 3 dimensions (axes x1, x2, and x3; the group centroids and the centroid for all data are marked)

case, it is necessary to modify traditional approaches to cluster analysis slightly to ensure that the created groups are composed of contiguous subregions (see, e.g., Murtagh 1985).

Approaches to cluster analysis may be categorized into two broad types. Agglomerative or hierarchical methods start with n clusters (where n is the number of observations); each observation is therefore in its own cluster. Then two clusters are merged, so that n − 1 clusters remain. This process continues until only one cluster remains (this cluster contains all n observations). The process is hierarchical because the merger of two clusters at any stage of the analysis cannot be undone at later stages. Once two observations have been placed together in the same cluster, they stay together for the remainder of the grouping process.

In contrast, nonhierarchical or nonagglomerative methods begin with an a priori decision to form g groups. Then one begins with either an initial set of g seed points or an initial partition of the data into g groups. If one starts with a set of seed points, a partition of the data into g groups is achieved by assigning each observation to the nearest seed point. If one begins with a partition of the data into g groups, g seed point locations are calculated as the centroids of these g partitioned groups. In either case, an iterative process then takes place, where new seed points are calculated from partitions and then new partitions are created from the seed points. This process continues until no reassignments of observations from one group to another occur. The convergence of this iterative process is usually very rapid.

The nonhierarchical methods have the advantage of requiring less computational resources, and for this reason they are the preferred method when the number of observations is very large. They have the disadvantage that the number of groups must be specified prior to the analysis, though in practice it is not uncommon to find solutions for a range of g values.
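The iterative procedure just described (assign each observation to the nearest seed point, recompute the seed points as centroids, and repeat until no reassignments occur) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the routine used by any particular package: the points and starting seeds are hypothetical, and squared Euclidean distance is assumed.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))

def k_means(points, seeds, max_iter=100):
    """Alternate between assignment and centroid updates until stable."""
    seeds = list(seeds)
    labels = None
    for _ in range(max_iter):
        # Assign every observation to its nearest seed point.
        new_labels = [
            min(range(len(seeds)), key=lambda g: sq_dist(p, seeds[g]))
            for p in points
        ]
        if new_labels == labels:  # no reassignments: converged
            break
        labels = new_labels
        # Recompute each seed point as the centroid of its group.
        for g in range(len(seeds)):
            members = [p for p, lab in zip(points, labels) if lab == g]
            if members:  # keep the old seed if a group empties out
                seeds[g] = centroid(members)
    return labels, seeds

points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = k_means(points, seeds=[(0.0, 0.0), (10.0, 10.0)])
```

Running the function for several values of g, each with its own list of seeds, mirrors the practice of comparing solutions over a range of g values.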


10.2.1 More on Agglomerative Methods

With agglomerative methods, at each stage one merges the closest pair of clusters. There are many possible definitions that may be used for "closest". Consider all pairs of distances between elements of cluster A and cluster B. If there are nA elements in cluster A and nB elements in cluster B, there are nA × nB such pairs. The single linkage (or nearest neighbor) method defines the distance between clusters as the minimum distance among all of these pairs. The complete linkage (or furthest neighbor) method defines the distance between clusters as the maximum distance among all of these pairs.

One of the more commonly used methods is Ward's method. At each stage, all potential mergers will reduce the number of current clusters by one. Each of these potential mergers will result in an increase in the overall within sum of squares. (The within sum of squares may be thought of as the amount of scatter about the group centroids. With n clusters the within sum of squares is equal to zero, since there is no scatter of other members about the group centroids. With one cluster, the within sum of squares is maximal.) Ward's method chooses that merger that results in the smallest increase in the within sum of squares. This is conceptually appealing, since we would like the within-group variability to remain as small as possible.

10.2.2 Illustration: 1990 Census Data for Erie County, New York

Here we will illustrate some of the features of cluster analysis using the dataset described above in the illustration of factor analysis. Table 10.5 displays the results of a nonhierarchical k-means cluster analysis, where solutions range from k = 2 to k = 4. Three variables were used as clustering variables: the education variable, median household income, and the percentage of households headed by females. After standardization, the z-scores were used in the cluster analysis.

For the two-cluster solution, the final cluster centers reveal that the first cluster is one where there are low scores on the education and median household income variables and high values on the percentage of households headed by females. The second cluster has the opposite characteristics, since the final cluster centroid is at education and income values that are above average, and at a location where the percentage of female-headed households is below average. There are 126 observations in the first cluster, and 105 in the second (and there are five observations with missing data).

The ANOVA table reveals that all of the variables are contributing strongly to the success of the clustering, since all of the F-values are extremely high and significant. It is important to note that, since the cluster analysis is designed to make the F-statistic large by minimizing within-group variation, these F-statistics should not be interpreted in the usual way. In particular, we would expect the F-statistics to be large since we are creating clusters to make F large. Still, they can be used as rough guidelines to indicate


Table 10.5 (a) Two-cluster solution; (b) three-cluster solution; (c) four-cluster solution

the success of the clustering and the relative success that individual variables have in achieving the cluster solution.

The three-cluster solution is similar to the two-cluster solution, with the addition of a "middle" group that has values on all three variables that are close to the countywide averages. There are 41 observations in the first cluster (characterized by low levels of education and income and a high percentage of female-headed households), 141 observations in the middle group, and 46 tracts in the third group. Again, all of the F-statistics are high, implying that all three variables help to place the observations into clusters.

One of the groups in the four-cluster solution has only two observations. These two observations are characterized by an extremely high percentage of individuals with professional degrees. It appears that there are two distinct clusters, with a third, fairly large group characterized by rather average values on the variables. In addition, the cluster analysis has been helpful in locating two census tracts that could be characterized as outliers due to their high values on the education variable. Figure 10.6 depicts the location of tracts in the three-cluster solution.

An important piece of output from a hierarchical cluster analysis is the dendrogram. As its name implies, a dendrogram is a tree-like diagram. It captures the history of the hierarchical clustering process as one proceeds from left to right along it. For illustrative purposes, it is rather difficult to show the dendrogram that accompanies a hierarchical cluster analysis that has taken place for a


Figure 10.6 Three-cluster solution

very large number of observations. Instead, Figure 10.7 shows a dendrogram for a subset of 30 tracts that have been selected at random from the dataset. At the left of the dendrogram, the branches that meet indicate observations that have clustered together. For example, tracts 70 and 146 were very close together in the three-variable space and clustered together early in the process. In fact, the agglomeration schedule (shown in Table 10.6) indicates that these were the first two observations that were clustered together. The horizontal


Figure 10.7 Dendrogram

scale of the dendrogram indicates the distance between the observations or groups that are clustered together. On the left of the dendrogram, observations are close together when they cluster. On the right of the dendrogram, there is only a small number of groups and the distance between those groups is larger.
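An agglomeration schedule of the kind described above can be generated with a short Python sketch. It is illustrative only: the five points are hypothetical, Euclidean distance is assumed, and the function name and the `linkage` argument are choices made for this sketch rather than part of any package. Passing `min` gives single linkage and passing `max` gives complete linkage.

```python
import math

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def agglomerate(points, linkage=min):
    """Return the merge history as (members_a, members_b, distance) tuples.

    linkage=min gives single linkage; linkage=max gives complete linkage.
    """
    clusters = [[i] for i in range(len(points))]
    schedule = []
    while len(clusters) > 1:
        best = None
        # Examine every pair of current clusters and keep the closest pair.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(
                    euclid(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        schedule.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return schedule

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (12.0, 0.0)]
schedule = agglomerate(points, linkage=min)
```

Each row of the schedule records which two clusters merged and at what distance; the distances grow as the process moves on, which is the pattern read from left to right along the dendrogram.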


Table 10.6 Agglomeration schedule for hierarchical clustering

To decide on the number of clusters, one can imagine taking a vertical line and proceeding from left to right along the dendrogram. As one proceeds, the number of lines from the dendrogram that intersect this vertical line decreases from n to 1. A good choice for the number of groups is one where there is a fairly large horizontal range in the dendrogram where the number of groups does not change. In Figure 10.7, it would make little sense to choose five groups, since these five groups could easily be simplified into four by proceeding just a little further to the right on the dendrogram. The figure shows that there are two clear groups of tracts. The tracts that are in each of these groups may be found by proceeding to the left from each of the two parallel, horizontal lines on the dendrogram. Following all the way to the left, through all of the branches, reveals all of the tracts in each cluster. For example, one of the two clusters consists of observations 156, 171, 209, 178, 97, 145, 9, 10, 60, 131, 138, 117, 126, and 79. Note that a three-cluster solution would subdivide this particular cluster into two subclusters, and one of those subclusters would be quite small (consisting only of observations 117, 126, and 79). The next step in this analysis would be to examine the characteristics of the observations in each


cluster. For example, observations 117, 126, and 79 all have quite high values on the education variable, coupled with high median household incomes.

10.3 Data Reduction Methods in SPSS for Windows 9.0

10.3.1 Factor Analysis

Click on Analyze, then Data Reduction, and then Factor. Choose the variables that will enter into the factor analysis. Under Rotation, choose Varimax (it is not the default). It is the most commonly used rotation method, and you should use it unless you have a good reason to choose an alternative! Under Extraction, it is most common to choose Principal Components and to choose as significant Eigenvalues over 1. These are the defaults, and so unless you wish to change them you do not have to do anything. Under Scores, choose Save as Variables. This will save the factor scores as new variables by attaching a number of columns to your dataset that is equal to the number of significant factors. Under Descriptive, choose Univariate Descriptives if desired. It is also useful to check the box labeled "coefficients" under Correlation Matrix, to print out a table of the correlation coefficients among variables.

10.3.2 Cluster Analysis

Hierarchical methods. Choose Analyze, then Classify, then Hierarchical Cluster. Next, choose the variables that are to be clustered. Then, under Method, choose the clustering method to be used. Note: Ward's method, though perhaps the most commonly used, is not the default choice. In fact, in Versions 8 and 9, one must scroll down the drop-down list to find it at the end of the list of methods. Next, choose the measure of distance that will be used; squared Euclidean distance is the default, and is a reasonable (and readily understandable!) choice. Still in this section, you will likely want to choose z-scores under the box labeled "standardize"; again, it is not the default option. Under Plots, one will often want to turn off the default "icicle plot" and check the box labeled "dendrogram". Under Save, you may save cluster membership, which adds a column of data to the data table indicating the cluster to which each observation belongs. This can be done either for a single predefined choice of the number of clusters or for a predefined range of cluster numbers.
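For readers who want to see what Ward's criterion computes with squared Euclidean distance, the following Python sketch evaluates the increase in the within sum of squares for a proposed merger in two ways: directly, and through a standard shortcut based only on cluster sizes and centroids. It is a sketch under stated assumptions and says nothing about how SPSS implements the method; the clusters and function names are hypothetical.

```python
def centroid(points):
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def within_ss(cluster):
    """Scatter of a cluster's members about its own centroid."""
    c = centroid(cluster)
    return sum(sq_dist(p, c) for p in cluster)

def ward_increase(a, b):
    """Increase in the within sum of squares if clusters a and b merge."""
    return within_ss(a + b) - within_ss(a) - within_ss(b)

def ward_shortcut(a, b):
    """The same quantity computed from sizes and centroids alone."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * sq_dist(centroid(a), centroid(b))

# Hypothetical clusters: c lies close to a, while b is far from both.
a = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5)]
b = [(6.0, 5.0), (7.0, 6.0)]
c = [(2.5, 2.0)]

cost_ab = ward_increase(a, b)
cost_ac = ward_increase(a, c)
```

Ward's method would merge a with c before a with b, since that merger adds far less to the within sum of squares.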

Nonhierarchical clustering. First click on Analyze, then on Classify, and then on K-Means Cluster. After choosing the variables to cluster (and recalling that it is usually a good idea to standardize the data by computing z-scores before doing this), choose the number of clusters desired. It is usually a good idea to click on Save, and save "cluster membership", which adds a column to the data table indicating the cluster membership of each observation.
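The standardization step recommended above is simple to reproduce outside SPSS. The Python sketch below converts each variable to z-scores so that variables measured on very different scales contribute comparably to the distance calculations. The data values are hypothetical, and the use of the sample standard deviation (divisor n − 1) is an assumption of this sketch.

```python
import math

def z_scores(values):
    """Standardize a column to mean 0 and standard deviation 1.

    The sample standard deviation (divisor n - 1) is used here; that
    choice is an assumption of this sketch.
    """
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

# Hypothetical columns measured on very different scales.
income = [34000.0, 44000.0, 38000.0, 59000.0, 44000.0]
pct_female_headed = [12.0, 8.0, 15.0, 5.0, 9.0]

z_income = z_scores(income)
z_pct = z_scores(pct_female_headed)
```

Without this step, a variable such as income (in dollars) would dominate any Euclidean distance simply because of its units.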

Exercises

1. Explain and interpret the rotated factor loading table below.

Variables: % 65 yr old; % white collar workers; Median income

Factor 1:  .88  .13  .92  .17  .24
Factor 2:  .21  .86  .11  .81  .71

2. Perform a hierarchical cluster analysis using the following data, and comment on the results.

Data for cluster analysis

Region   Mean age   % nonfamily   Median income (000)
1        34         50            34
2        45         44            44
3        32         58            38
4        50         50            59
5        55         70            44
6        26         62            29
7        37         38            33
8        42         36            43
9        47         39            56
10       46         49            58
11       51         68            61
12       38         36            39
13       33         44            41
14       29         66            38

3. How many significant factors would be extracted in the following factor analysis? How many variables were in the original analysis? Explain your answer.

Factor   Eigenvalue   Cumulative percent of variance explained
1        3.0          45
2        2.5          70
3        1.5          78
4        0.9          89
5        0.3          92
6        0.2          94
7        0.2          96
8        0.2          98
9        0.1          99
10       0.1          100


4. A researcher collects the following information for a set of census tracts:

Tract   Median age   Income (000)   % nonfamily   No. of autos   % new residents   % blue collar
1       26           29             32            1              23                33
2       35           38             24            2              21                21
3       48           49             29            3              16                44
4       47           55             55            3              18                44
5       36           39             66            2              23                41
6       29           32             42            2              33                40
7       55           58             38            3              10                31
8       56           66             36            3              11                24
9       29           32             33            1              23                28
10      33           44             29            2              21                29
11      44           49             31            2              18                31
12      47           46             38            2              15                30
13      51           52             55            3              12                20
14      44           49             52            2              18                19
15      37           40             38            1              19                43
16      38           41             34            2              21                31

Use factor analysis to summarize the data. Assume that the data come from a 4 × 4 grid laid over the city as follows:

1   5   9    13
2   6   10   14
3   7   11   15
4   8   12   16

(a) Run a factor analysis to summarize the data above.
(b) How many factors are sufficient to describe the data (i.e., have eigenvalues greater than 1)?
(c) Describe the rotated factor loadings, describing each factor in terms of the most important variables that comprise it. Attempt to give names to the factors. (Note: in this part of the question, discuss only those factors with eigenvalues greater than 1.)
(d) Save the factor scores, and make a map of the scores on factor 1.

Epilogue

The primary purpose of this book has been to provide a foundation in some of the basic statistical tools that are used by geographers. The focus has been on inferential methods. Inferential statistical methods are attractive because they fit well into the time-honored framework of the scientific method.

There are, of course, limitations to the use of these methods. Many of the concerns are related to the nature of hypothesis testing. Why do we test whether two populations have the same mean? An enumeration of two communities would almost certainly show that the true, "population" means of, say, commuting distance were in fact different. Why do we test whether a true regression coefficient is zero? Independent variables will almost always have some effect on the dependent variable, even if it is small. The principal point here is that in many situations the null hypothesis is not going to be true, so why are we testing it?

A response to this concern is that the inferential framework provides not only a way of testing hypotheses, but also a way of establishing confidence intervals around estimated parameters. Thus we can state with a given level of confidence the magnitude of the difference in commuting times, and we can specify with a given level of precision the magnitude of a regression coefficient.

With the increasing availability of large datasets, there has been an appropriate development of exploratory methods. Such exploratory methods are extremely useful in "data mining" and "data trawling" to suggest new hypotheses. Ultimately, confirmation of these new hypotheses is called for, and inferential methods become more appropriate.

Where does the student of quantitative methods in geography go from here? The books by Longley et al. (1998) and Fotheringham et al. (2000) represent two good examples of recent attempts to summarize some of the new developments in the field.
They include accounts of new developments in the areas of exploratory spatial data analysis and geocomputation. In order of increasing mathematical sophistication are books on spatial data analysis by Bailey and Gatrell (1995), Haining (1990a), and Cressie (1993). These require more background than that required for this book, but that should not deter the interested student from exploring them. Even if one does not absorb all of the mathematical detail, it is possible to get a good sense of the types of questions and the range of problems that spatial analysis can address.


Selected Publications

The following is a short list of some recent examples of the use of statistical methods in geography. Full details can be found in the Reference list.

Two-sample tests: Nelson 1997
Factor analysis in regression: Ormrod and Cole 1996; O'Reilly and Webster 1998; Ackerman 1998
Logistic regression: Myers et al. 1997
Correlation: Williams and Parker 1997; Allen and Turner 1996
Spearman's rank correlation: Keim 1997
Correlation and regression: Wyllie and Smith 1996 (stepwise regression)
Regression: Fan 1996
Principal components and factor analysis: Clarke and Holly 1996; Webster 1996
Cluster analysis: Comrie 1996; Dagel 1997


Appendix A: Statistical Tables

Table A.1 Random digits




Table A.2 Normal distribution

The tabled entries represent the proportion p of area under the normal curve above the indicated values of z. (Example: .0694 or 6.94% of the area is above z = 1.48.) For negative values of z, the tabled entries represent the area less than z. (Example: .3015 or 30.15% of the area is beneath z = −.52.)


Table A.3 Student's t distribution

For various degrees of freedom (df), the tabled entries represent the critical values of t above which a specified proportion p of the t distribution falls. (Example: for df = 9, a t of 2.262 is surpassed by .025 or 2.5% of the total distribution.)


Table A.4 Cumulative distribution of Student's t distribution



Table A.5 F distribution

For various pairs of degrees of freedom (df1, df2), the tabled entries represent the critical values of F above which a proportion p of the distribution falls. (Example: for df = 4, 16 an F = 2.33 is exceeded by p = .10 of the distribution.) Tables are provided for values of p equal to .10, .05, and .01.


Table A.6 χ² distribution

For various degrees of freedom df, the tabled entries represent the values of χ² above which a proportion p of the distribution falls. (Example: for df = 5, a χ² = 11.070 is exceeded by p = .05 or 5% of the distribution.)


Table A.7 Coefficients {a(n−i+1)} for the Shapiro-Wilk W test for normality, for n = 2(1)50

Source: Wetherill (1981). Note: The notation n = 2(1)50 means that entries are provided for n = 2 to n = 50 with increments of 1.




Table A.8 Percentage points of the W test* for n = 3(1)50

*Based on fitted Johnson (1949) SB approximation; see Shapiro and Wilk (1965) for details. Source: Wetherill (1981). Note: The notation n = 3(1)50 means that entries are provided for n = 3 to n = 50 with increments of 1.


Appendix B: Review and Extension of Some Probability Theory

A discrete random variable, X, has a probability distribution (sometimes called a probability mass function) denoted by P(X = x) = p(x), where x is the value taken by X. A continuous random variable has a probability distribution (also called a probability density function or pdf) denoted by f(x). The likelihood of getting a specific value x is zero, since the distribution is continuous. The likelihood of getting a value within a range, a < x [...]

\[ E[g(X)] = \sum_{x} g(x)\, p(x), \qquad E[g(X)] = \int_{-\infty}^{\infty} g(x)\, f(x)\, dx \tag{B.10} \]

for discrete and continuous random variables, respectively. Useful rules for working with expected values are: (i) the expected value of a constant is simply the constant; (ii) the expected value of a constant times a random variable is the constant times the expected value of the random variable; and (iii) the expected value of a sum is equal to the sum of the expected values. These rules are summarized below:

\[ \begin{aligned} E[a] &= a && \text{(i)} \\ E[bX] &= bE[X] && \text{(ii)} \\ E[a + bX] &= a + bE[X] && \text{(i), (ii), and (iii)} \end{aligned} \tag{B.11} \]
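The three rules can be checked numerically for a simple discrete case. The Python sketch below uses a fair six-sided die and exact fractions; the constants a = 2 and b = 3 and the helper name `expect` are arbitrary choices made for illustration.

```python
from fractions import Fraction

# Probability mass function of a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expect(g, pmf):
    """E[g(X)] for a discrete random variable, as in Equation B.10."""
    return sum(g(x) * p for x, p in pmf.items())

a, b = 2, 3
e_x = expect(lambda x: x, pmf)               # expected value of X
e_const = expect(lambda x: a, pmf)           # rule (i): equals a
e_scaled = expect(lambda x: b * x, pmf)      # rule (ii): equals b * E[X]
e_linear = expect(lambda x: a + b * x, pmf)  # all three rules: a + b * E[X]
```

Here E[X] is 7/2, so the rules give E[3X] = 21/2 and E[2 + 3X] = 25/2, which is what the direct summation returns.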


Variance of a Random Variable

The variance of a random variable, σ² = V[X], is the expected value of the squared deviation of an observation from the mean, μ = E[X]:

\[ V[X] = \sigma^2 = E[(X - \mu)^2] \tag{B.12} \]

Using the rules for expected values above,

\[ V[X] = E[X^2 - 2X\mu + \mu^2] = E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - \mu^2 \tag{B.13} \]

To illustrate, let us return to the experiment involving the toss of a die. The variance of the random variable X in this case is equal to E[X²] − 3.5². The expected value of X² is found using Equation B.10:

\[ E[X^2] = 1^2\left(\tfrac{1}{6}\right) + 2^2\left(\tfrac{1}{6}\right) + \cdots + 6^2\left(\tfrac{1}{6}\right) = 15.17 \tag{B.14} \]

The variance is therefore equal to 15.17 − 3.5² = 2.92. To illustrate the derivation of the variance using a continuous variable, let us continue with the example of a uniform random variable. We have

\[ E[X^2] = \int_a^b x^2 \, \frac{1}{b - a} \, dx = \frac{b^3 - a^3}{3(b - a)} \tag{B.15} \]

Then

\[ V[X] = \frac{b^3 - a^3}{3(b - a)} - \frac{(a + b)^2}{4} = \frac{(b - a)^2}{12} \tag{B.16} \]
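Both variance results can be verified with a few lines of Python. The die calculation uses exact fractions; the uniform case is checked numerically with a simple midpoint rule. The interval (2, 10) and the number of integration steps are arbitrary choices made for this sketch.

```python
from fractions import Fraction

# Fair die: V[X] = E[X^2] - (E[X])^2.
faces = range(1, 7)
e_x = sum(Fraction(x, 6) for x in faces)       # 7/2
e_x2 = sum(Fraction(x * x, 6) for x in faces)  # 91/6, about 15.17
var_die = e_x2 - e_x ** 2                      # 35/12, about 2.92

def uniform_variance(a, b, steps=100_000):
    """Approximate V[X] for X uniform on (a, b) with a midpoint rule."""
    width = (b - a) / steps
    density = 1.0 / (b - a)
    e_x = e_x2 = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * width
        e_x += x * density * width
        e_x2 += x * x * density * width
    return e_x2 - e_x ** 2

approx = uniform_variance(2.0, 10.0)
exact = (10.0 - 2.0) ** 2 / 12  # the closed form (b - a)^2 / 12
```

The numerical approximation agrees with the closed form to many decimal places, which is a useful sanity check on the algebra.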

Covariance of Random Variables

How do two variables co-vary? Is there a tendency for one variable to exhibit high values when the other does? Or are the variables independent? The covariance of two random variables, X and Y, is defined as the expected value of the product of the two deviations from the means:

\[ \mathrm{Cov}[X, Y] = E[(X - \mu_X)(Y - \mu_Y)] \tag{B.17} \]

This may be rewritten in the form

\[ \mathrm{Cov}[X, Y] = E[XY] - \mu_X \mu_Y - \mu_Y \mu_X + \mu_X \mu_Y = E[XY] - \mu_X \mu_Y \tag{B.18} \]


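To find the observed covariance for a set of data, we find the average value of the product of deviations:

\[ \mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n} \tag{B.19} \]

The correlation coefficient is the standardized covariance:

\[ \rho = \frac{\mathrm{Cov}[X, Y]}{\sigma_X \sigma_Y} \tag{B.20} \]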
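The observed covariance and the correlation coefficient are straightforward to compute directly from their definitions. The Python sketch below uses the divisor n, as in Equation B.19, and a small set of hypothetical data values.

```python
import math

def mean(v):
    return sum(v) / len(v)

def covariance(x, y):
    """Observed covariance with divisor n, following Equation B.19."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def correlation(x, y):
    """Covariance divided by the product of the standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
cov_xy = covariance(x, y)
r_xy = correlation(x, y)

# The shortcut form in Equation B.18: mean of the products minus the
# product of the means.
shortcut = mean([a * b for a, b in zip(x, y)]) - mean(x) * mean(y)
```

Because the correlation is standardized, it is unaffected by the units of x and y, whereas the covariance is not.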

Bibliography

Ackerman, W.V. 1998. Socioeconomic correlates of increasing crime rates in smaller communities. Professional Geographer 50: 372-87.
Allen, J.P. and Turner, E. 1996. Spatial patterns of immigrant assimilation. Professional Geographer 48: 140-55.
Andrews, D.F. 1985. Data: a collection of problems from many fields for the student and research worker. New York: Springer-Verlag.
Anselin, L. 1992. SpaceStat: A program for the analysis of spatial data. Santa Barbara, CA: National Center for Geographic Information and Analysis.
Anselin, L. 1995. Local indicators of spatial association - LISA. Geographical Analysis 27: 93-115.
Bailey, A. and Gatrell, A. 1995. Interactive spatial data analysis. Essex: Longman (published in the U.S. by Wiley).
Besag, J. and Newell, J. 1991. The detection of clusters in rare diseases. Journal of the Royal Statistical Society, Series A 154: 143-55.
Brunsdon, C., Fotheringham, A.S., and Charlton, M. 1996. Geographically weighted regression: a method for exploring spatial nonstationarity. Geographical Analysis 28: 281-98.
Brunsdon, C., Fotheringham, A.S., and Charlton, M. 1999. Some notes on parametric significance tests for geographically weighted regression. Journal of Regional Science 39: 497-524.
Casetti, E. 1972. Generating models by the expansion method: applications to geographic research. Geographical Analysis 4: 81-91.
Clark, P.J. and Evans, F.C. 1954. Distance to nearest neighbor as a measure of spatial relationships in populations. Ecology 35: 445-53.
Clarke, A.E. and Holly, B. 1996. The organization of production in high technology industries: an empirical assessment. Professional Geographer 48: 127-39.
Cliff, A. and Ord, J.K. 1975. The comparison of means when samples consist of spatial autocorrelated observations. Environment and Planning A 7: 725-34.
Clifford, P. and Richardson, S. 1985. Testing the association between two spatial processes. Statistics and Decisions Supplement Number 2, 155-60.
Cohen, J.E. 1995. How many people can the earth support? New York: W.W. Norton and Co.
Comrie, A.C. 1996. An all-season synoptic climatology of air pollution in the U.S.-Mexico border region. Professional Geographer 48: 237-51.
Cornish, S.L. 1997. Strategies for the acquisition of market intelligence and implications for the transferability of information inputs. Annals of the Association of American Geographers 87: 451-70.
Cressie, N. 1993. Statistical analysis of spatial data. New York: Wiley.
Curtiss, J. and McIntosh, R. 1950. The interrelations of certain analytic and synthetic phytosociological characters. Ecology 31: 434-55.


Dagel, K.C. 1997. Defining drought in marginal areas: the role of perception. Professional Geographer 49: 192-202.
Easterlin, R. 1980. Birth and fortune: the impact of numbers on personal welfare. New York: Basic Books.
Fan, C. 1996. Economic opportunities and internal migration: a case study of Guangdong Province, China. Professional Geographer 48: 28-45.
Fisher, R.A. and Yates, F. 1974. Statistical Tables for Biological, Agricultural, and Medical Research, 6th edition. London: Longman.
Fotheringham, A.S. and Rogerson, P. 1993. GIS and spatial analytical problems. International Journal of Geographical Information Systems 7: 3-19.
Fotheringham, A.S. and Wong, D. 1991. The modifiable area unit problem in multivariate statistical analysis. Environment and Planning A 23: 1025-44.
Fotheringham, A.S., Charlton, M.E., and Brunsdon, C. 1998. Geographically weighted regression: a natural evolution of the expansion method for spatial data analysis. Environment and Planning A 30: 1905-27.
Fotheringham, A.S., Brunsdon, C., and Charlton, M. 2000. Quantitative geography: perspectives on spatial data analysis. London: Sage Publications.
Gehlke, C. and Biehl, K. 1934. Certain effects of grouping upon the size of the correlation coefficient in census tract material. Journal of the American Statistical Association 29: 169-70.
Gott, R. 1993. Implications of the Copernican principle for our future prospects. Nature 363: 315-19.
Griffith, D.A. 1978. A spatially adjusted ANOVA model. Geographical Analysis 10: 296-301.
Griffith, D.A. 1987. Spatial autocorrelation: a primer. Washington, DC: Association of American Geographers.
Griffith, D.A. 1996. Computational simplifications for space-time forecasting within GIS: the neighbourhood spatial forecasting model. In Spatial analysis: modelling in a GIS environment (Eds. P. Longley and M. Batty), pp. 247-60. Cambridge: Geoinformation International (distributed by Wiley).
Griffith, D.A., Doyle, P.G., Wheeler, D.C., and Johnson, D.L. 1998. A tale of two swaths: urban childhood blood-lead levels across Syracuse, New York. Annals of the Association of American Geographers 88: 640-65.
Hadi, A.S. and Ling, R.F. 1998. Some cautionary notes on the use of principal components regression. American Statistician 52(1): 15-19.
Haining, R. 1990a. Spatial data analysis in the social and environmental sciences. Cambridge: Cambridge University Press.
Haining, R. 1990b. The use of added variable plots in regression modelling with spatial data. The Professional Geographer 42: 336-45.
Johnson, B.W. and McCulloch, R.E. 1987. Added variable plots in linear regression. Technometrics 29: 427-33.
Jones III, J.P. and Casetti, E. 1992. Applications of the expansion method. London: Routledge.
Keim, B.D. 1997. Preliminary analysis of the temporal patterns of heavy rainfall across the Southeastern United States. Professional Geographer 49: 94-104.
Longley, P., Brooks, S.M., McDonnell, R., and Macmillan, B. 1998. Geocomputation: A primer. Chichester: Wiley.
MacDonald, G.M., Szeicz, J.M., Claricoates, J., and Dale, K.A. 1998. Response of the central Canadian treeline to recent climatic changes. Annals of the Association of American Geographers 88: 183-208.
Mallows, C. 1998. The zeroth problem. American Statistician 52(1): 1-9.


Meehl, P. 1990. Why summaries of research on psychological theories are often uninterpretable. Psychological Reports 66: 195-244 (Monograph Supplement 1-V66).
Moran, P.A.P. 1948. The interpretation of statistical maps. Journal of the Royal Statistical Society, Series B 10: 245-51.
Moran, P.A.P. 1950. Notes on continuous stochastic phenomena. Biometrika 37: 17-23.
Murtagh, F. 1985. A survey of algorithms for contiguity-constrained clustering and related problems. The Computer Journal 28: 82-88.
Myers, D., Lee, S.W., and Choi, S.S. 1997. Constraints of housing age and migration on residential mobility. Professional Geographer 49: 14-28.
Nelson, P.W. 1997. Migration, sources of income, and community change in the Pacific Northwest. Professional Geographer 49: 418-30.
O'Loughlin, J., Ward, M.D., Lofdahl, C.L., Cohen, J.S., Brown, D.S., Reilly, D., Gleditsch, K.S., and Shin, M. 1998. The diffusion of democracy, 1946-1994. Annals of the Association of American Geographers 88: 545-74.
Ord, J.K. and Getis, A. 1995. Local spatial autocorrelation statistics: distributional issues and an application. Geographical Analysis 27: 286-306.
O'Reilly, K. and Webster, G.R. 1998. A sociodemographic and partisan analysis of voting in three anti-gay rights referenda in Oregon. Professional Geographer 50: 498-515.
Ormrod, R.K. and Cole, D.B. 1996. The vote on Colorado's Amendment Two. Professional Geographer 48: 14-27.
Pearson, E.S. and Hartley, H.O. (Eds.) 1966. Biometrika Tables for Statisticians, Vol. 1. Cambridge: Cambridge University Press.
Plane, D. and Rogerson, P. 1991. Tracking the baby boom, the baby bust, and the echo generations: how age composition regulates US migration. Professional Geographer 43: 416-39.
Plane, D. and Rogerson, P. 1994. The geographical analysis of population: with applications to planning and business. New York: Wiley.
Robinson, W. 1950. Ecological correlation and the behavior of individuals. American Sociological Review 15: 351-57.
Rogers, A. 1975. Matrix population models. Thousand Oaks, CA: Sage Publications.
Rogerson, P. 1987. Changes in U.S. national mobility levels. Professional Geographer 39: 344-51.
Rogerson, P. and Plane, D. 1998. The dynamics of neighborhood age composition. Environment and Planning A 30: 1461-72.
Sachs, L. 1984. Applied statistics: a handbook of techniques. New York: Springer-Verlag.
Scheffé, H. 1959. The analysis of variance. New York: Wiley.
Shapiro, S.S. and Wilk, M.B. 1965. An analysis of variance test for normality. Biometrika 52: 591-611.
Slocum, T. 1990. The use of quantitative methods in major geographical journals, 1956-1986. Professional Geographer 42: 84-94.
Standing, L., Sproule, R., and Khouzam, N. 1991. Empirical statistics: IV. Illustrating Meehl's sixth law of soft psychology: everything correlates with everything. Psychological Reports 69: 123-26.
Stouffer, S. 1940. Intervening opportunities: a theory relating mobility and distance. American Sociological Review 5: 845-67.
Tukey, J.W. 1972. Some graphic and semigraphic displays. In Statistical papers in honor of George W. Snedecor (Ed. T.A. Bancroft). Ames, IA: Iowa State University Press.
Velleman, P.F. and Hoaglin, D.C. 1981. Applications, basics, and computing of exploratory data analysis. Belmont, CA: Wadsworth.
Webster, G.R. 1996. Partisan shifts in presidential and gubernatorial elections in Alabama, 1932-94. Professional Geographer 48: 379-91.


Weisberg, S. 1985. Applied linear regression. New York: Wiley. Wetherill, G.B. 1981. Intermediate statistical methods. London: Chapman Hall. Williams, K.R.S. and Parker, K.C. 1997. Trends in interdiurnal temperature variation for the Central United States, 1945±1985. Professional Geographer 49: 342±355. Wyllie, D.S. and Smith, G.C. 1996. Eects of extroversion on the routine spatial behavior of middle adolescents. Professional Geographer 48: 166±80.

Index

absolute deviation, 7
added-variable plots, in regression analysis, 180–1, 186–7
agglomerative methods of cluster analysis, 200, 201, 203–7
alternative hypotheses, 43
  testing process, 45–54
analysis of variance (ANOVA), 65–80
  assumptions, 65–7, 70
Arcview 3.1, 19–20
asymmetry of data, 7–8
autocorrelation, spatial, 15, 55–7, 75–6, 96–8, 165, 167–8
backward selection, in multiple regression analysis, 140
best-fitting regression line, 104, 106–9
binary connectivity of weights, 167–8, 181
binomial distribution, 25–7
bivariate regression, 105–22
boundaries, in spatial analysis, 14
box plots, 8–9
buffer zones, in spatial analysis, 14
cancer case distribution (example), 2–4
categorical dependent variables, 144–5
causal connections
  and linear correlations, 87–9
  regression analysis of, 104
census data analysis (examples), 99–100, 101–2, 193–7, 201–207
central limit theorem, 30, 31
chi-square (χ²) distribution, 219
chi-square (χ²) test, 157, 160–1, 163
cluster analysis, 197–207
  agglomerative methods, 200, 201, 203–7
  nonagglomerative methods, 200–2, 201–4, 207–8
clusters, 154–6, 158, 162–3, 173–5
  of residuals, 165–6
coefficient of determination, 110–12
cohort size and mobility (example), 89–91, 93, 95, 97
coin flipping (examples), 10–12, 25
combinations, 22–3
commuting (examples), 30, 141–5, 149–50

confidence intervals, 10, 30–1, 210
  in contrasts of means, 74
  for correlation coefficients, 95–7
  for one-sample tests, 47
  for regression analysis, 133
confirmatory methods of analysis, 4
continuous random variables, 24
contrasts of means, 73–5, 77
correlation, 86–102
correlation coefficients, 87
  confidence intervals for, 95–9
  differences in, 97
  Pearson's, 87, 96, 97, 101–2
  of random variables, 226
  and sample size, 93–4
  and spatial aggregation, 99–100
  Spearman's, 94–5, 97, 101–2
  true, 92–3
covariance, 86–7
  of a random variable, 225–6
  standardized, see correlation coefficients
critical values, 45–6
  effect of spatial dependence, 98
data reduction, 192–208
degrees of freedom, 47, 50
  for regression, 111, 115
dendrograms, 203–7
dependence of spatial data, see spatial autocorrelation
descriptive analysis, 5–9, 15–16
determination, coefficient of, 110–12
discrete random variables, 24
disproportional sampling, 58
dummy variables, 128–32, 165
eigenvalues, in method of principal components, 194–5
equations, ordering of, 18–21
errors
  in testing hypotheses, 43–4
  true (population), 106
expansion method, in regression analysis, 182, 185, 188–90
expected value of a random variable, 223–5
expenditure and income (example), 113–16


explained sum of squares, in regression analysis, 109–10
exploratory methods of analysis, 4, 210
extraction sums of squared loadings, 194–5
F-distribution, 50, 67, 216–18
F-statistic
  for ANOVA, 50, 66–7, 71
  in cluster analysis, 201–3
F-test, 50, 51, 70, 71
factor analysis, 192–7
factor scores, 197
factorials, 18, 22, 23
fairness, inferential analysis, 10–12
five-number summary of distribution data, 9
forward selection, in multiple regression analysis, 140
Galapagos species (example), 132–9
generalization, in scientific method, 3–4
geographically weighted regression, 183–4, 190
geometric random variables, 33, 35
Getis's Gi statistic, 174–5
global statistics, 173, 182
heteroscedasticity, see homoscedasticity
hierarchical (agglomerative) methods of cluster analysis, 200, 201, 203–7
hinges, in box plots, 9
histograms, 7–8, 26–7
homoscedasticity, 49, 52, 65
  tests, 70, 78–9
household newcomers survey (example), 24, 25–7, 28–9
housing prices (examples), 105–6, 126–8, 164–5, 182, 184–90
hypotheses
  alternative, 43, 45–54
  null, 11, 42–3
  in scientific method, 2–3, 4
  testing process, 42–62, 158–61, 210
income and expenditure (example), 113–16
income and mortality rates (example), 91–2
income and state aid (example), 116–18
independence of spatial data, see spatial autocorrelation
inferential analysis, 5, 9–12, 210
interaction effects, in regression analysis, 182
interquartile range, 6

intervening opportunities model, 32–6
join count statistic, 161–6, 167
kitchen-sink approach to multiple regression, 132–6, 140
Kolmogorov-Smirnov test, 70
Kruskal-Wallis test, 70–3
kurtosis of data, 8
leptokurtosis of data, 8
Levene's test, 70, 79
leverage values, 136, 137–8
local statistics, 173–5
log-odds, 143–4
logistic regression, 125, 143–4, 146, 147–50
Markov model of migration, 36–7
maximum likelihood approach to regression coefficients, 147
means
  contrasting, 74–5, 77
  sample, 5–6, 9, 10, 29, 30–1, 44
  true (population), 9–10, 30–1
median, 6
migration models, 32, 36–7
missing values approach to multiple regression, 134–5
misspecification errors, multiple regression, 126–8
mobility and cohort size (example), 89–91, 93, 95, 97
mode, 6
modifiable areal units, 13, 99–100, 125
Monte Carlo method, 11–12, 159–60
  simulation in testing randomness, 171–2
Moran's I statistic, 167–72, 175
mortality rates and income (example), 91–2
multicollinearity, 125–6, 129, 136–40
multiple regression, 105, 124–40, 145
nearest neighbor analysis, 161–4
nonagglomerative methods of cluster analysis, 200–1, 207–8
nonindependence of spatial data, see spatial autocorrelation
nonlinear regression, 118–20, 142–3, 145–7
nonparametric tests, 71
normal distribution (normality), 27–9, 44, 214
  in randomness testing, 170–1
  tests, 70, 79–80, 220–2


null hypotheses, 11, 42–3
  testing process, 45–54
odds, 143
one-sample tests, for the mean, 44–5, 46–9
one-sided hypotheses, 43, 45
ordering of equations, 18–21
outliers, 9, 91–2, 136–8
p-value in hypothesis testing, 45, 47, 49, 51–2
  see also probability
parentheses in equations, 18–20
Pearson's correlation coefficient, 87, 96, 97, 101–2
Π notation for products, 22, 23
platykurtosis of data, 8
point patterns, 154–64
population values, see true values
precipitation, diurnal variation, 69–70, 71–2, 74
predicted values, 34–5
principal components, method of, 194–5, 196, 197
probability, 24–5, 223–6
  see also p-value
  binomial distribution, 25–7, 29
  models, 31–9
  normal distribution, 27–9
  predicted in logistic regression, 141–5
product notation, 22, 23
proportional sampling, 58
proportions
  one-sample tests, 48–9
  two-sample tests, 53–4
quadrat analysis, 156–61
random digits, 212–13
random patterns, 155–7
random samples, 15, 58–9
random sampling, 57–8
random variables, 24, 33, 35, 223–6
randomness, testing, 158–73
range, 6
  interquartile, 6
ranked data
  correlation, 94–5
  Kruskal-Wallis test, 71–3
recreation activities (example), 130–2
regression analysis
  assumptions, 122, 125–6
  linear, 104–22
  logistic, 125, 143–4, 146, 147–50
  multiple, 105, 124–40, 145
  nonlinear, 118–20, 142–3, 145–7
  of principal component scores, 196
  problems, 145, 146
  simple (bivariate), 105–22
  spatial, 178–89
regression coefficients, 126–8
  from geographically weighted observations, 183–4
  maximum likelihood approach, 147
residuals
  join count statistic, 165–6
  in regression analysis, 106–12, 124–5, 179, 180–1
sample size, and correlation coefficients, 93–4
sample space, 24
sample values, 9–10
  absolute and standard deviation, 7
  mean, 5–6, 9, 10, 29, 30–1, 44
  variance, 6–7, 9
sampling distribution, 160
sampling frames, 57, 58
sampling procedures
  random, 57–8
  in spatial analysis, 14–15, 58–9
  systematic, 58
scatterplots, 87
Scheffé's procedure for contrasting means, 74–5, 77
scientific method, 1–4
scree plots, 194
selection of variables, for multiple regression, 132–3, 140
Shapiro-Wilk W test, 70, 79–81, 220–2
shopping trips (examples), 42–8, 128–30
Σ notation, for summation, 6, 20–2, 23
significance level, 43
simple (bivariate) regression, 105–22
simple events, 24
simple samples, 15, 24
skewness of data, 7–8
snowfall analysis (example), 179
software, 19–20, 170
  see also SPSS for Windows 9.0
SpaceStat, 170
spatial aggregation, and modifiable areal units, 99–100
spatial analysis, 2–4, 154–76, 210
  nonindependence of data, see spatial autocorrelation
  problems, 13–15, 99–100
  regression models, 179–84
  sampling procedures, 14–15, 58–9
  see also cluster analysis


spatial autocorrelation, 15, 55–7, 75–6, 97–9, 165, 167–8
spatially varying parameters, 182–4
Spearman's rank correlation coefficient, 94–5, 97, 101–2
SPSS for Windows 9.0
  for ANOVA, 76–80
  for cluster analysis, 207–8
  for correlation, 100–2
  data input, 15, 59–60
  for data reduction, 207–8
  for descriptive analysis, 15–16
  for factor analysis, 207
  for Moran's I, 175
  for regression analysis, 120–2, 145–50
  for Shapiro-Wilk test, 79
  for two-sample t-tests, 59–62
standard deviation, 7
  of residuals, 112
standard error of estimate, 112
state aid and income (example), 116–18
statistical analysis, geographical applications, 1
statistical thinking, 12–13
stem-and-leaf plots, 9
stepwise regression, 140
stratified samples, 15, 58, 59
Student's t-distribution, see t-distribution; t-test
sum of squares
  of deviations in ANOVA, 66–8, 68–9
  eigenvalues, in method of principal components, 194–5
  of residuals in regression analysis, 107–12, 124–5
summation notation, 20–2, 23
survival time model, 38–9
swimming frequency (example), 50–2, 59–62, 68–9, 70, 76–8, 79, 80
systematic sampling, 58, 59
t-distribution, 31, 46–7, 215
t-test
  with nonindependent variables, 56
  one-sample, 47–8
  two-sample, 49–52, 56, 70
test statistics, distribution, 54–5
tolerance, in multiple regression analysis, 136
travel behavior
  commuting (examples), 30, 141–5, 149–50
  intervening opportunities model, 32–6
trips to park (example), 54–5
true (population) values, 9–10
  error, 106
  mean, 9–10, 30–1
  regression line intercept, 106
  regression line slope, 106, 112–13
  variance, 6, 9
two-sample tests
  for the mean, 49–52, 56, 70
  with nonindependent variables, 56–7
  for proportions, 53–4
  t-tests on SPSS 9.0 for Windows, 59–62
two-sided hypotheses, 43, 45
Type I and II errors, in testing hypotheses, 43–4
unbiased estimates, 6
uncertainty, in statistical problems, 24
unexplained (residual) sum of squares, 109, 110–12
uniformity of spatial patterns, 155, 158, 162, 163
uniqueness, in factor analysis, 196
variability
  of data, 6–7, 30
  of spatial patterns, 157–8
variables
  categorical dependent, 144–5
  distribution, 54–5
  dummy, 128–32, 165
  random, 24, 33, 35, 223–6
  selection for multiple regression, 132–3, 140
  spatially varying parameters, 182–4
variance, 6–7, 9
  analysis (ANOVA), 65–80
  of sample means, 30–1
  true (population), 6, 9
  in two-sample tests, 49, 52
variance inflation factor, 136
variance-mean ratio (VMR), 157–61
W statistic, 70, 79–80, 220–2
weather patterns, and probability, 24–5
weights
  geographical, 182–4
  in Moran's I, 167–8
whiskers, in box plots, 9
z-statistic, 7, 28–9, 30, 214
z-test
  one-sample, 44–5, 48–9
  two-sample, 45, 53–4
zoning systems, 13, 35–6
