Modern Regression Techniques Using R
Modern Regression Techniques Using R A Practical Guide for Students and Researchers
Daniel B. Wright and Kamala London
© Daniel B. Wright and Kamala London 2009
First published 2009
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act, 1988, this publication may be reproduced, stored or transmitted in any form, or by any means, only with the prior permission in writing of the publishers, or in the case of reprographic reproduction, in accordance with the terms of licences issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.
SAGE Publications Ltd, 1 Oliver’s Yard, 55 City Road, London EC1Y 1SP
SAGE Publications Inc., 2455 Teller Road, Thousand Oaks, California 91320
SAGE Publications India Pvt Ltd, B 1/I 1 Mohan Cooperative Industrial Area, Mathura Road, New Delhi 110 044
SAGE Publications Asia-Pacific Pte Ltd, 33 Pekin Street #02-01, Far East Square, Singapore 048763
Library of Congress Control Number: 2008926086
British Library Cataloguing in Publication data A catalogue record for this book is available from the British Library ISBN 978-1-84787-902-8 ISBN 978-1-84787-903-5 (pbk)
Typeset by CEPHA Imaging Pvt. Ltd., Bangalore, India Printed in India at Replika Press Pvt Ltd Printed on paper from sustainable resources
Contents
Preface
1 Very brief introduction to R
2 The basic regression
3 ANOVA as regression
4 ANCOVA: Lord’s paradox and mediation analysis
5 Model selection and shrinkage
6 Generalized linear models (GLMs)
7 Regression splines and generalized additive models (GAMs)
8 Multilevel models
9 Robust regression
10 Conclusion – make your data cool
Glossary of R functions used in this book
References
Index
Preface

In this book we introduce several useful extensions to the basic regression model, without too much mathematics, but with several pictures and some of the basic references. Not all possible extensions are covered, but we chose a set that we think is particularly useful for psychology. We will use the freeware package R so a secondary purpose of this book is to introduce some of the facilities in R (R Development Core Team, 2008). It works like syntax in many of the other statistics packages like SPSS (which seems the most popular package in psychology, so we refer to it occasionally for comparison purposes), but it is more flexible and has more procedures. Once you get used to it, we hope that you will find it is easier than its competitors. It is free so we know you will like the price! While we provide a brief introduction to R, we also provide links to useful books and websites.

This book is divided into ten chapters. First, we explain the most basic basics of R, but point readers to where they can find more details. Next, we give an overview of what we call the basic regression and then briefly describe each of the extensions. Then, we go through the seven extensions, and finish with a conclusion. Each of these chapters includes a description and then goes through the analysis of some data.

This document grew out of a regression workshop to the Legal Psychology group at Florida International University in 2006, when Dan was on sabbatical there (and where he is now permanently), and was the basis for a talk and poster at the SARMAC 2007 conference at Bates College, Maine. It was also the basis of Modern Statistical Methods, a graduate course at University of Sussex. Many thanks to all those who provided comments!

All of the royalties from this book go to the American Partnership for Eosinophilic Disorders (www.apfed.com). See the website for more information.

Happy regressing,
Dan Wright and Kami London
GETTING THE MOST OUT OF THIS BOOK

An important part of using this book is conducting analyses in R. R is freely available on the web (http://cran.r-project.org/). You should, at least to begin with, be on a computer with internet access. R is updated frequently and some minor aspects are changed each time. This book was prepared with R 2.4 through R 2.7. R is part of the GNU project of the Free Software Foundation (http://www.fsf.org/), which promotes, well, the name makes it pretty obvious what they promote.
In this book R commands are written in dark bold Courier and R output is in gray Courier. There is a glossary at the back of the book which provides brief descriptions of all the commands/functions used in this book, so if something is unfamiliar look there first. If you want to know more about the function use the online help facility within R. To do this you should use the help function. For example, for the function mean, type either help(mean) or ?mean.

We have adopted an example-based approach. Most of the data come from real research papers. The examples were chosen because we hope that they will be of interest to most working in the social and behavioral sciences, and also because we were able to access the data. By providing examples, we hope you can match your own research needs onto these examples. The data for all these examples and the corresponding code are on http://www.sagepub.co.uk/wrightandlondon.

There are many books that cover conducting statistics in R. A list of some can be found at: http://www.r-project.org/doc/bib/R-books.html
One of our favorites is: Crawley, M. J. (2005). Statistics: An introduction using R. Chichester, UK: Wiley.
This is not written specifically for social scientists, but his clarity is excellent. He has also written The R book (2007), which is excellent for a much more detailed treatment at 950 pages. More detailed readings are given at the end of each chapter.

We assume that everybody has studied some statistics, perhaps one semester of psychology-graduate-statistics, and so understands the basics of the standard linear regression (covered briefly in Chapter 2). There are several good background books for statistics, but one stands out above all others for having the most modest authors: Wright, D. B. & London, K. (2009). First (and second) steps in statistics (2nd ed.). London: Sage Publications.
NOTE Microsoft Word and many other ‘high level’ word processing packages change some characters (including " and ') to other characters (like “ and ‘), which are not read by R. Therefore, if using one of these word processing packages we recommend turning off several of the facilities that automatically change characters from those you type. If copying code from websites, sometimes line breaks are lost, so you need to be careful with this. If you are copying and pasting commands, it may be easier to save them in Notepad or some other ‘low level’ word processing package. The text editor Tinn-R is designed for R and can be downloaded from http://www.sciviews.org/Tinn-R/ and http://sourceforge.net/projects/tinn-r.
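The effect of these changed characters can be checked from within R itself. The following is a minimal sketch and is not from the book: it uses parse to test whether a line of code is readable by R, and the helper name is_readable is a hypothetical one introduced only for this illustration.

```r
# Hypothetical helper (not from the book): TRUE if R can parse the code.
is_readable <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

# The same command typed with straight quotes and with curly quotes.
straight <- 'install.packages("foreign")'
curly    <- "install.packages(\u201cforeign\u201d)"  # as a word processor might change it

is_readable(straight)
is_readable(curly)
```

The first call should report that the command is readable and the second that it is not, which is why commands pasted from a word processor can fail even though they look correct on screen.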
1 Very brief introduction to R

Learning objectives
1. Learning some of the basic R concepts: functions, objects, assigning, packages, mirrors, CRAN, and how to read data and access packages.
2. Statistical concepts reinforced are looking at data, transforming data, and there is detailed discussion of skewness.
3. We introduce you to the bootstrap, which will be used for several examples in this book.
R was developed as a free alternative to the powerful statistics language/program S/S-Plus. R and S-Plus are similar, and many of the procedures written in one will run in the other, but R is free and S-Plus is a commercial product. R is rapidly increasing in popularity. When statisticians develop procedures they often write R functions so that others can use them. Several books for learning to use R are listed at the end of this chapter.
R AND THE INTERNET

When using R it is useful to have an internet connection. Figure 1.1 shows a schematic of how the R system can be considered. You download R from the internet onto your computer so that you can later use the software without being on the internet. The program is available both from the R home page and from one of the CRAN (Comprehensive R Archive Network) mirror sites. Mirror sites, or mirrors, are sites that are supposed to be exactly the same as the main CRAN site. This means that when people download files they do not all have to do it from the same server, which makes downloading faster.

To begin using R you have to download it from one of the many R mirror sites: http://cran.r-project.org/mirrors.html. If an entire class is downloading information you should use different mirrors. Press the Windows, the Mac, or the Linux button in the ‘precompiled binary distribution’ box. Press ‘base’ and then run the setup program. Follow the wizard’s instructions. This gives you both a very powerful statistics language and statistics package. This will allow you to do most of the statistics that you would want, but not all.

Figure 1.1 A schematic of the R-system (labels in the figure: You & your computer; CRAN; mirrors; statistician, e.g., Ripley, Efron, Wilcox)

Statisticians write their own packages for specialist purposes. Some submit these to CRAN so that others can use them. When a package is sent to CRAN it gets copied onto all the mirror sites. You can then download packages from there. For example, there is a package called foreign (R core members et al., 2008) that allows you to read data from other statistics programs directly into R. If you type:

install.packages("foreign")

a window like Figure 1.2 opens. We chose the server in Michigan (USA (MI)) since that is close to where we are preparing this chapter (in Ohio). This should install the package onto your computer. The package is now on your computer so you may access it in the future from this computer even if you are not connected to the internet, assuming it is not erased. However, because authors update their packages frequently, it is worth reinstalling packages relatively regularly. If the mirror is not perfectly up-to-date or if you are not connected to the internet it may not install. You may also get some warning messages.

Although the package is now on your computer (in a folder in the R directory) it is not active. To make it active type:

library("foreign")

Now you have access to a large number of functions that are used to import and export data between R and other statistical packages. Some packages will have been installed when you downloaded R and you will just need to load these. You will need to download others from CRAN.

Some statisticians, like Rand Wilcox, keep their packages on their own web page. In the case of Wilcox, he has written a book that effectively acts as both a manual and teaching resource for his functions. We will use some of his code in one of the examples and his code can be accessed from the web using the source function.
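The install-then-load sequence can be put together as a short sketch. This is one possible way to arrange the two commands rather than the book's own code: the check with requireNamespace is an addition here, so that the download is attempted only when the package is missing.

```r
# Install 'foreign' only if it is not already on this computer.
# install.packages() needs an internet connection and may ask for a mirror.
if (!requireNamespace("foreign", quietly = TRUE)) {
  install.packages("foreign")
}

# Installing does not make the package active; library() does.
library("foreign")

# A few of the functions that are now available.
head(ls("package:foreign"))
```

Running library("foreign") is needed in each new R session, whereas the installation only has to be repeated when you want a newer version of the package.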
We have only written ‘statistician’ on the right of Figure 1.1, but there are people from other disciplines (like computer science and psychology) who write packages for R. We did not include them because it is primarily statisticians doing the writing.
Figure 1.2 The window to choose a mirror site

Most of the introductory R books (a long list of them is on http://www.r-project.org/doc/bib/R-books.html; our favorite introduction is Crawley, 2005) go through how to use R like a calculator. Type 6+4, or something like that, and see what happens. These books also go through the different data formats. A good (short and also free) book is available on http://cran.r-project.org/doc/manuals/R-intro.pdf.
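As a small illustration of using R like a calculator, the lines below can be typed at the prompt one at a time. Only 6+4 comes from the text; the other two lines are extra examples chosen here.

```r
6 + 4        # addition, as suggested in the text
(6 + 4) / 2  # brackets work as in ordinary arithmetic
sqrt(16)     # built-in functions can be used in the same way
```

R prints the result of each line as soon as you press Enter.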
If you have not already opened R, it would be good to open it now because we will be telling you to type things in throughout the rest of this chapter.
FUNCTIONS AND OBJECTS

R works by applying functions to objects. To illustrate functions and objects we will show how to calculate the mean of four numbers. We will use the function mean to do this, but first we have to create a variable. A variable is a set of several similar objects. We can create a variable by assigning a list of values to a variable name. To make assignments you use the characters <-. So the following (and type this into R yourself) creates a variable scores that has four numbers (5, 6, 7 and 8).

scores
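The assignment and the call to mean described above can be sketched as follows. This is a minimal illustration that assumes the four values are combined with the function c before being assigned; the original command is not fully visible here, so treat the exact form as an assumption.

```r
# Assumption: the four values are combined with c() and then assigned.
scores <- c(5, 6, 7, 8)

scores        # typing the name of an object prints its contents
mean(scores)  # apply the function mean to the object scores

# For more on a function, use the help facility: help(mean) or ?mean
```

The same pattern, creating an object and then applying a function to it, is used throughout R.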