Praise for Larry Hatcher

The writing is exceptionally clear and easy to follow, and precise definitions are provided to avoid confusion. Examples are used to illustrate each concept, and those examples are, like everything in this book, clear and logically presented. Sample SAS output is provided for every analysis, with each part labeled and thoroughly explained so the reader understands the results.

Sheri Bauman, Ph.D.
Assistant Professor
Department of Educational Psychology
University of Arizona, Tucson

[Larry Hatcher] once again manages to provide clear, concise, and detailed explanations of the SAS program and procedures, including appropriate examples and sample write-ups.

Frank Pajares
Winship Distinguished Research Professor
Emory University

The Student Guide and the Exercises books are excellent choices for use in quantitative courses in psychology and education.

Bert W. Westbrook, Ph.D.
Professor of Psychology
Alumni Distinguished Undergraduate Professor
North Carolina State University
Step-by-Step Basic Statistics Using SAS®: Student Guide

Larry Hatcher, Ph.D.
The correct bibliographic citation for this manual is as follows: Hatcher, Larry. 2003. Step-by-Step Basic Statistics Using SAS®: Student Guide. Cary, NC: SAS Institute Inc.
Step-by-Step Basic Statistics Using SAS®: Student Guide

Copyright © 2003 by SAS Institute Inc., Cary, NC, USA

ISBN 1-59047-148-2

All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

U.S. Government Restricted Rights Notice: Use, duplication, or disclosure of this software and related documentation by the U.S. government is subject to the Agreement with SAS Institute and the restrictions set forth in FAR 52.227-19, Commercial Computer Software-Restricted Rights (June 1987). SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.

1st printing, April 2003

SAS Publishing provides a complete selection of books and electronic products to help customers use SAS software to its fullest potential. For more information about our e-books, e-learning products, CDs, and hardcopy books, visit the SAS Publishing Web site at support.sas.com/pubs or call 1-800-727-3228.

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
Dedication
To my friends at Saginaw Valley State University.
Contents Acknowledgments .............................................................................................ix Chapter 1: Using This Student Guide ..............................................................1 Introduction ........................................................................................................................... 3 Introduction to the SAS System ............................................................................................ 4 Contents of This Student Guide ............................................................................................ 6 Conclusion .......................................................................................................................... 11
Chapter 2: Terms and Concepts Used in This Guide ..................................13 Introduction ......................................................................................................................... 15 Research Hypotheses and Statistical Hypotheses ............................................................. 16 Data, Variables, Values, and Observations ........................................................................ 21 Classifying Variables According to Their Scales of Measurement...................................... 24 Classifying Variables According to the Number of Values They Display ............................ 27 Basic Approaches to Research........................................................................................... 29 Using Type-of-Variable Figures to Represent Dependent and Independent Variables ..................................................................................................... 32 The Three Types of SAS Files ............................................................................................ 37 Conclusion .......................................................................................................................... 45
Chapter 3: Tutorial: Writing and Submitting SAS Programs .......................47 Introduction ......................................................................................................................... 48 Tutorial Part I: Basics of Using the SAS Windowing Environment..................................... 50 Tutorial Part II: Opening and Editing an Existing SAS Program ......................................... 75 Tutorial Part III: Submitting a Program with an Error ......................................................... 94 Tutorial Part IV: Practicing What You Have Learned ....................................................... 102 Summary of Steps for Frequently Performed Activities .................................................... 105 Controlling the Size of the Output Page with the OPTIONS Statement............................ 109 For More Information......................................................................................................... 110 Conclusion ........................................................................................................................ 110
Chapter 4: Data Input .....................................................................................111 Introduction ....................................................................................................................... 113 Example 4.1: Creating a Simple SAS Data Set ............................................................... 117 Example 4.2: A More Complex Data Set ......................................................................... 122 Using PROC MEANS and PROC FREQ to Identify Obvious Problems with the Data Set........................................................................................................... 131 Using PROC PRINT to Create a Printout of Raw Data..................................................... 139 The Complete SAS Program............................................................................................. 142 Conclusion ........................................................................................................................ 144
Chapter 5: Creating Frequency Tables ........................................................145 Introduction ....................................................................................................................... 146 Example 5.1: A Political Donation Study.......................................................................... 147 Using PROC FREQ to Create a Frequency Table............................................................ 152
Examples of Questions That Can Be Answered by Interpreting a Frequency Table ........................................................................................................ 155 Conclusion ........................................................................................................ 157
Chapter 6: Creating Graphs ..........................................................................159 Introduction ....................................................................................................................... 160 Reprise of Example 5.1: the Political Donation Study....................................................... 161 Using PROC CHART to Create a Frequency Bar Chart ................................................... 162 Using PROC CHART to Plot Means for Subgroups.......................................................... 174 Conclusion ........................................................................................................................ 177
Chapter 7: Measures of Central Tendency and Variability ........................179 Introduction ....................................................................................................................... 181 Reprise of Example 5.1: The Political Donation Study...................................................... 181 Measures of Central Tendency: The Mode, Median, and Mean ...................................... 183 Interpreting a Stem-and-Leaf Plot Created by PROC UNIVARIATE ................................ 187 Using PROC UNIVARIATE to Determine the Shape of Distributions ............................... 190 Simple Measures of Variability: The Range, the Interquartile Range, and the Semi-Interquartile Range ................................................................................. 200 More Complex Measures of Central Tendency: The Variance and Standard Deviation........................................................................................................ 204 Variance and Standard Deviation: Three Formulas ......................................................... 207 Using PROC MEANS to Compute the Variance and Standard Deviation ........................ 210 Conclusion ........................................................................................................................ 214
Chapter 8: Creating and Modifying Variables and Data Sets ....................215 Introduction ....................................................................................................................... 217 Example 8.1: An Achievement Motivation Study ............................................................. 218 Using PROC PRINT to Create a Printout of Raw Data..................................................... 222 Where to Place Data Manipulation and Data Subsetting Statements............................... 225 Basic Data Manipulation ................................................................................................... 228 Recoding a Reversed Item and Creating a New Variable for the Achievement Motivation Study...................................................................................... 235 Using IF-THEN Control Statements .................................................................................. 239 Data Subsetting................................................................................................................. 248 Combining a Large Number of Data Manipulation and Data Subsetting Statements in a Single Program......................................................... 256 Conclusion ........................................................................................................................ 260
Chapter 9: z Scores........................................................................................261 Introduction ....................................................................................................................... 262 Example 9.1: Comparing Mid-Term Test Scores for Two Courses................................. 266 Converting a Single Raw-Score Variable into a z-Score Variable .................................... 268 Converting Two Raw-Score Variables into z-Score Variables .......................................... 278 Standardizing Variables with PROC STANDARD............................................................. 285 Conclusion ........................................................................................................................ 286
Chapter 10: Bivariate Correlation .................................................................287 Introduction ....................................................................................................................... 290 Situations Appropriate for the Pearson Correlation Coefficient......................................... 290 Interpreting the Sign and Size of a Correlation Coefficient ............................................... 293 Interpreting the Statistical Significance of a Correlation Coefficient ................................. 297 Problems with Using Correlations to Investigate Causal Relationships............................ 299 Example 10.1: Correlating Weight Loss with a Variety of Predictor Variables................. 303 Using PROC PLOT to Create a Scattergram.................................................................... 307 Using PROC CORR to Compute the Pearson Correlation between Two Variables................................................................................................. 313 Using PROC CORR to Compute All Possible Correlations for a Group of Variables ................................................................................................ 320 Summarizing Results Involving a Nonsignificant Correlation............................................ 324 Using the VAR and WITH Statements to Suppress the Printing of Some Correlations ........................................................................................................ 329 Computing the Spearman Rank-Order Correlation Coefficient for Ordinal-Level Variables................................................................................................. 332 Some Options Available with PROC CORR ..................................................................... 333 Problems with Seeking Significant Results ....................................................................... 335 Conclusion ........................................................................................................................ 338
Chapter 11: Bivariate Regression.................................................................339 Introduction ....................................................................................................................... 341 Choosing between the Terms Predictor Variable, Criterion Variable, Independent Variable, and Dependent Variable ............................................................... 341 Situations Appropriate for Bivariate Linear Regression .................................................... 344 Example 11.1: Predicting Weight Loss from a Variety of Predictor Variables.................. 346 Using PROC REG: Example with a Significant Positive Regression Coefficient .................................................................................................. 350 Using PROC REG: Example with a Significant Negative Regression Coefficient ........... 371 Using PROC REG: Example with a Nonsignificant Regression Coefficient..................... 379 Conclusion ........................................................................................................................ 383
Chapter 12: Single-Sample t Test .................................................................385 Introduction ....................................................................................................................... 387 Situations Appropriate for the Single-Sample t Test ......................................................... 387 Results Produced in a Single-Sample t Test..................................................................... 388 Example 12.1: Assessing Spatial Recall in a Reading Comprehension Task (Significant Results) ............................................................................................. 393 One-Tailed Tests versus Two-Tailed Tests ...................................................................... 406 Example 12.2: An Illustration of Nonsignificant Results................................................... 407 Conclusion ........................................................................................................................ 412
Chapter 13: Independent-Samples t Test ....................................................413 Introduction ....................................................................................................................... 415 Situations Appropriate for the Independent-Samples t Test ............................................. 417 Results Produced in an Independent-Samples t Test....................................................... 420
Example 13.1: Observed Consequences for Modeled Aggression: Effects on Subsequent Subject Aggression (Significant Differences)........................... 428 Example 13.2: An Illustration of Results Showing Nonsignificant Differences................. 446 Conclusion ........................................................................................................................ 450
Chapter 14: Paired-Samples t Test...............................................................451 Introduction ....................................................................................................................... 453 Situations Appropriate for the Paired-Samples t Test ....................................................... 453 Similarities between the Paired-Samples t Test and the Single-Sample t Test ................ 457 Results Produced in a Paired-Samples t Test .................................................................. 461 Example 14.1: Women’s Responses to Emotional versus Sexual Infidelity .................... 463 Example 14.2: An Illustration of Results Showing Nonsignificant Differences................. 483 Conclusion ........................................................................................................................ 487
Chapter 15: One-Way ANOVA with One Between-Subjects Factor ..........489 Introduction ....................................................................................................................... 491 Situations Appropriate for One-Way ANOVA with One Between-Subjects Factor ........... 491 A Study Investigating Aggression ..................................................................................... 494 Treatment Effects, Multiple Comparison Procedures, and a New Index of Effect Size .... 497 Some Possible Results from a One-Way ANOVA ............................................................ 500 Example 15.1: One-Way ANOVA Revealing a Significant Treatment Effect ................... 505 Example 15.2: One-Way ANOVA Revealing a Nonsignificant Treatment Effect ............. 529 Conclusion ........................................................................................................................ 537
Chapter 16: Factorial ANOVA with Two Between-Subjects Factors.........539 Introduction ....................................................................................................................... 542 Situations Appropriate for Factorial ANOVA with Two Between-Subjects Factors ........... 542 Using Factorial Designs in Research ................................................................................ 546 A Different Study Investigating Aggression....................................................................... 546 Understanding Figures That Illustrate the Results of a Factorial ANOVA......................... 550 Some Possible Results from a Factorial ANOVA.............................................................. 553 Example of a Factorial ANOVA Revealing Two Significant Main Effects and a Nonsignificant Interaction.................................................................................... 565 Example of a Factorial ANOVA Revealing Nonsignificant Main Effects and a Nonsignificant Interaction.................................................................................... 607 Example of a Factorial ANOVA Revealing a Significant Interaction ................................. 617 Using the LSMEANS Statement to Analyze Data from Unbalanced Designs................... 625 Learning More about Using SAS for Factorial ANOVA ..................................................... 627 Conclusion ........................................................................................................................ 628
Chapter 17: Chi-Square Test of Independence ............................................629 Introduction ....................................................................................................................... 631 Situations That Are Appropriate for the Chi-Square Test of Independence...................... 631 Using Two-Way Classification Tables............................................................................... 634 Results Produced in a Chi-Square Test of Independence ................................................ 637 A Study Investigating Computer Preferences ................................................................... 640 Computing Chi-Square from Raw Data versus Tabular Data ........................................... 642
Example of a Chi-Square Test That Reveals a Significant Relationship .......................... 643 Example of a Chi-Square Test That Reveals a Nonsignificant Relationship .................... 661 Computing Chi-Square from Raw Data............................................................................. 668 Conclusion ........................................................................................................................ 671
References .......................................................................................................673 Index..................................................................................................................675
Acknowledgments
During the development of these books, Caroline Brickley, Gretchen Rorie Harwood, Stephenie Joyner, Sue Kocher, Patsy Poole, and Hanna Schoenrock served as editors. All were positive, supportive, and helpful. They made the books stronger, and I thank them for their guidance. A number of other people at SAS made valuable contributions in a variety of areas. My sincere thanks go to those who reviewed the books for technical accuracy and readability: Jim Ashton, Jim Ford, Marty Hultgren, Catherine Lumsden, Elizabeth Maldonado, Paul Marovich, Ted Meleky, Annette Sanders, Kevin Scott, Ron Statt, and Morris Vaughan. I also thank Candy Farrell and Karen Perkins for production and design; Joan Stout for indexing; Cindy Puryear and Patricia Spain for marketing; and Cate Parrish for the cover designs. Special thanks to my wife Ellen, who was loving and supportive throughout.
Using This Student Guide Introduction............................................................................................ 3 Overview...................................................................................................................3 Intended Audience and Level of Proficiency .............................................................3 Platform and Version ................................................................................................3 Materials Needed......................................................................................................4 Introduction to the SAS System ............................................................ 4 Why Do You Need This Student Guide?...................................................................4 What Is the SAS System?.........................................................................................5 Who Uses SAS? .......................................................................................................5 Using the SAS System for Statistical Analyses.........................................................5 Contents of This Student Guide............................................................. 6 Overview...................................................................................................................6 Chapter 2: Terms and Concepts Used in This Guide...............................................7 Chapter 3: Tutorial: Using the SAS Windowing Environment to Write and Submit SAS Programs ..........................................................................................7 Chapter 4: Data Input...............................................................................................7 Chapter 5: Creating Frequency Tables ....................................................................7 Chapter 6: Creating Graphs.....................................................................................8 Chapter 7: Measures of Central Tendency and Variability.......................................8 Chapter 8: Creating and Modifying Variables and Data Sets...................................8 Chapter 9: Standardized Scores (z Scores).............................................................8 Chapter 10: Bivariate Correlation.............................................................................9 Chapter 11: Bivariate Regression ............................................................................9 Chapter 12: Single-Sample t Test ............................................................................9
Chapter 13: Independent-Samples t Test ................................................................9 Chapter 14: Paired-Samples t Test..........................................................................9 Chapter 15: One-Way ANOVA with One Between-Subjects Factor.......................10 Chapter 16: Factorial ANOVA with Two Between-Subjects Factors ......................10 Chapter 17: Chi-Square Test of Independence .....................................................10 References .............................................................................................................10 Conclusion.............................................................................................11
Introduction

Overview

This chapter introduces you to the SAS System, a computer application that can be used to perform statistical analyses. It explains just what SAS is and where it is installed, and it describes some of the advantages associated with using SAS for data analysis. Finally, it briefly summarizes what you will learn in each of the chapters that make up this Student Guide.

Intended Audience and Level of Proficiency

This guide is intended for those who want to learn how to use SAS to perform elementary statistical analyses. The guide assumes that many students using it have not already taken a course on elementary statistics. To assist these students, this guide briefly reviews basic terms and concepts in statistics at an elementary level. It was designed to be easily understood by first- and second-year college students.

This book was also designed to be user-friendly to those who may have little or no experience with personal computers. The beginning of Chapter 3, “Tutorial: Using the SAS Windowing Environment to Write and Submit SAS Programs,” reviews basic concepts in using Microsoft Windows, such as selecting menus, double-clicking icons, and so forth. Those who already have experience in using Windows will be able to quickly skim through this elementary material.

Platform and Version

This guide shows how to use the SAS System for Windows, as opposed to other operating environments. This is most apparent in Chapter 3, “Using the SAS Windowing Environment to Write and Submit SAS Programs.” However, the remaining chapters show how to write SAS code to perform statistical analyses, and most of this material will be useful to all SAS users, regardless of the operating environment. This is because, for the most part, the same SAS code can be used on a wide variety of operating environments to obtain the same results.

This book was designed for those using the SAS System Version 8 and later versions. It may also be helpful to those using earlier versions of SAS (such as V6 or V7). However, if you are using one of these earlier versions, it is likely that some of the SAS system options described here are not available with your version. It is also likely that some of the SAS output that you obtain will be arranged differently from the output that is presented here.
Materials Needed

To complete the activities described in this book, you will need

• access to a personal computer on which the SAS System for Windows has been installed,

• one (and preferably two) 3.5-inch disks, formatted for IBM PCs (or some other type of storage media).

Some students using this book will also use its companion volume, Step-by-Step Basic Statistics Using SAS: Exercises. The chapters in the Exercises book parallel most of the chapters contained in this Student Guide. Each chapter in the Exercises book contains two assignments for students to complete. Complete solutions are provided for the odd-numbered exercises, but not for the even-numbered ones. The Exercises book can give you useful practice in learning how to use SAS, but it is not absolutely required.
Introduction to the SAS System

Why Do You Need This Student Guide?

This Student Guide shows you how to use a computer application called the SAS System to perform elementary statistical analyses. Until recently, students in elementary statistics courses typically performed statistical computations by hand or with a pocket calculator. In recent years, however, the increased availability of computers has made it possible for students to also use statistical software packages such as SPSS and the SAS System to perform these analyses. This latter approach allows students to focus more on conceptual issues in statistics, and spend less time on the mechanics of performing mathematical operations by hand. Step by step, this Student Guide will introduce you to the SAS System, and will show you how to use it to perform a variety of statistical analyses that are commonly used in the social and behavioral sciences and in education.
What Is the SAS System?

The SAS System is a modular, integrated, and hardware-independent application. It is used as an information delivery system by business organizations, governments, and universities worldwide. SAS is used for virtually every aspect of information management in organizations, including decision support, project management, financial analysis, quality improvement, data warehousing, report writing, and presentations. However, this guide will focus on just one aspect of SAS: its ability to perform the types of statistical analyses that are appropriate for research in the social sciences and education. By the time you have completed this text, you will have accomplished two objectives: you will have learned how to perform elementary statistical analyses using SAS, and you will have become familiar with a widely used information delivery system.

Who Uses SAS?

The SAS System is widely used in business organizations and universities. Consider the following statistics from July 2002:

• SAS supports over 40 operating environments, including Windows, OS/2, and UNIX.

• SAS Institute’s computer software products are installed at over 38,400 sites in 115 countries.

• Approximately 71% of SAS installations are in business locations, 18% are education sites, and 11% are government sites. It is used for teaching and research at about 3,000 university locations.

• It is estimated that SAS software products are used by more than 3.5 million people worldwide.

• 90% of all Fortune 500 companies are SAS clients.

Using the SAS System for Statistical Analyses

SAS is a particularly powerful tool for social scientists and educators because it allows them to easily perform virtually any type of statistical analysis that may be required in their research. SAS is comprehensive enough to perform the most sophisticated multivariate analyses, but is so easy to use that undergraduates can perform simple analyses after only a short period of instruction. In a sense, the SAS System may be viewed as a library of prewritten statistical algorithms. By submitting a brief SAS program, you can access a procedure from the library
and use it to analyze a set of data. For example, below are the SAS statements used to call up the algorithm that calculates Pearson correlation coefficients:

   PROC CORR   DATA=D1;
   RUN;
The preceding statements will cause SAS to compute the Pearson correlation between every possible pair of numeric variables in your data set. Being able to call up complex procedures with such a simple statement is what makes SAS so powerful. By contrast, if you had to prepare your own programs to compute Pearson correlations by using a programming language such as FORTRAN or BASIC, it would require many statements, and there would be many opportunities for error. By using SAS instead, most of the work has already been completed, and you are able to focus on the results of the analysis rather than on the mechanics of obtaining those results.
Contents of This Student Guide

Overview

This guide has two objectives: to teach the basics of using SAS in general and, more specifically, to show how to use SAS procedures to perform elementary statistical analyses. Chapters 1–4 provide an overview of the basics of using SAS. The remaining chapters cover statistical concepts in a sequence that is representative of the sequence followed in most elementary statistics textbooks. Chapters 10–17 introduce you to inferential statistical procedures (the type of procedures that are most often used to analyze data from research). Each chapter shows you how to conduct the analysis from beginning to end. Each chapter also provides an example of how the analysis might be summarized for publication in an academic journal in the social sciences or education. For the most part, these summaries are written according to the guidelines provided in the Publication Manual of the American Psychological Association (1994).

Many students using this book will also use its companion volume, Step-by-Step Basic Statistics Using SAS: Exercises. For Chapters 3–17 in this Student Guide, the corresponding chapter in the Exercises book provides you with a hands-on exercise that enables you to practice the data analysis skills that you are learning.

The following sections provide a summary of the contents of the remaining chapters in this guide.
Chapter 2: Terms and Concepts Used in This Guide

Chapter 2 defines some important terms related to research and statistics that will be used throughout this guide. It also introduces you to the three types of files that you will work with during a typical session with SAS: the SAS program, the SAS log, and the SAS output file.

Chapter 3: Tutorial: Using the SAS Windowing Environment to Write and Submit SAS Programs

The SAS windowing environment is a powerful application that you will use to create, edit, and submit SAS programs. You will also use it to review your SAS logs and output. Chapter 3 provides a tutorial that teaches you how to use this application. Step by step, it shows you how to write simple SAS programs and interpret their results. By the end of this chapter, you should be ready to use the SAS windowing environment to write and submit SAS programs on your own.

Chapter 4: Data Input

Chapter 4 shows you how to use the DATA and INPUT statements to create SAS data sets. You will learn how to read both numeric and character variables by using a simple, list style for data input. By the end of the chapter, you will be prepared to input the data sets that will be presented throughout the remainder of this guide.

Chapter 5: Creating Frequency Tables

Chapter 5 shows you how to create frequency tables that are useful for understanding your data and answering some types of research questions. For example, imagine that you ask a sample of 150 people to tell you their age. If you then used SAS to create a frequency table for this age variable, you would be able to easily answer questions such as

• How many people are age 30?

• How many people are age 30 or younger?

• What percent of people are age 45?

• What percent of people are age 45 or younger?
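To preview where this is headed, here is a minimal sketch of the kind of program that could produce such a table. The data set name (AGESTUDY), the variable name (AGE), and the data values are hypothetical illustrations, not the guide's own example; Chapter 5 develops its own study.

   /* Read a few ages with list input (Chapter 4 covers DATA and INPUT). */
   DATA AGESTUDY;
      INPUT AGE;
      DATALINES;
   30
   30
   45
   52
   29
   ;
   RUN;

   /* Request a frequency table for AGE (Chapter 5 covers PROC FREQ).    */
   PROC FREQ DATA=AGESTUDY;
      TABLES AGE;
   RUN;

The resulting table lists each observed age along with its frequency, percent, cumulative frequency, and cumulative percent, which is what lets you answer questions like those above.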
Chapter 6: Creating Graphs

Chapter 6 shows you how to use SAS to create frequency bar charts (bar charts that indicate the number of people who displayed a given value on a variable). For example, imagine that you asked 150 people to indicate their political party. If you used SAS to create a frequency bar chart, the resulting chart would indicate the number of people who are democrats, the number who are republicans, and the number who are independents.

Chapter 6 also shows how to create bar charts that plot subgroup means. For example, assume that, in the “political party” study described above, you asked the 150 subjects to indicate both their political party and their age. You could then use SAS to create a bar chart that plots the mean age for people in each party. For instance, the resulting bar chart might show that the average age for democrats was 32.12, the average age for republicans was 41.56, and the average age for independents was 37.33.

Chapter 7: Measures of Central Tendency and Variability

Chapter 7 shows you how to compute measures of variability (e.g., the interquartile range, standard deviation, and variance) as well as measures of central tendency (e.g., the mean, median, and mode) for numeric variables. It also shows how to use stem-and-leaf plots to determine whether a distribution is skewed or approximately normal in shape.

Chapter 8: Creating and Modifying Variables and Data Sets

Chapter 8 shows how to use subsetting IF statements to create new data sets that contain a specified subgroup from the original sample. It also shows how to use mathematical operators and IF-THEN statements to recode variables and to create new variables from existing variables.

Chapter 9: Standardized Scores (z Scores)

Chapter 9 shows how to transform raw scores into standardized variables (z score variables) with a mean of 0 and a standard deviation of 1. You will learn how to do this by using the data manipulation statements that you learned about in Chapter 8. Chapter 9 also illustrates how you can review the sign and absolute magnitude of a z score to understand where a particular observation stands on the variable in question.
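As a hedged preview of the code that Chapters 6, 8, and 9 develop, the sketch below assumes a hypothetical data set named SURVEY with a character variable PARTY and a numeric variable AGE; the party code 'DEM' and the mean and standard deviation used for the z score are assumed values, not the guide's own.

   /* Frequency bar chart: number of respondents in each party (Chapter 6). */
   PROC CHART DATA=SURVEY;
      VBAR PARTY;
   RUN;

   /* Bar chart of subgroup means: mean AGE within each party (Chapter 6).  */
   PROC CHART DATA=SURVEY;
      VBAR PARTY / SUMVAR=AGE TYPE=MEAN;
   RUN;

   /* Data manipulation of the kind covered in Chapters 8 and 9:            */
   /* keep one subgroup and create a z-score variable from a raw score.     */
   DATA DEMOCRATS;
      SET SURVEY;
      IF PARTY = 'DEM';               /* subsetting IF (Chapter 8)          */
      AGE_Z = (AGE - 37.0) / 12.0;    /* z score using an assumed mean and
                                         standard deviation (Chapter 9)     */
   RUN;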
Chapter 10: Bivariate Correlation

Bivariate correlation coefficients allow you to determine the nature of the relationship between two numeric variables. Chapter 10 shows you how to use the CORR procedure to compute Pearson correlation coefficients for interval- and ratio-level variables. You will also learn to interpret the p values (probability values) that are produced by PROC CORR to determine whether a given correlation coefficient is significantly different from zero. Chapter 10 also shows how to use PROC PLOT to create a two-dimensional scattergram that illustrates the relationship between two variables.

Chapter 11: Bivariate Regression

Bivariate regression is used when you want to predict scores on an interval- or ratio-level criterion variable from an interval- or ratio-level predictor variable. Chapter 11 shows you how to use the REG procedure to compute the slope and intercept for the regression equation, along with predicted values and residuals of prediction.

Chapter 12: Single-Sample t Test

Chapter 12 shows how to use the TTEST procedure to perform a single-sample t test. This is an inferential procedure that is useful for determining whether a sample mean is significantly different from a specified population mean. You will learn how to interpret the t statistic, and the p value associated with that t statistic.

Chapter 13: Independent-Samples t Test

You use an independent-samples t test to determine whether there is a significant difference between two groups of subjects with respect to their mean scores on the dependent variable. Chapter 13 explains when to use the equal-variance t statistic versus the unequal-variance t statistic, and shows how to use the TTEST procedure to conduct this analysis.

Chapter 14: Paired-Samples t Test

The paired-samples t test is also appropriate when you want to determine whether there is a significant difference between two sample means. The paired-samples approach is indicated when each score in one sample is dependent upon a corresponding score in the second sample. This will be the case in studies in which the same subjects provide repeated measures on the same dependent variable under different conditions, or when matching procedures are used. Chapter 14 shows how to perform this analysis using the TTEST procedure.
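To give a concrete flavor of the procedures named in these summaries, here is a minimal sketch of the kinds of statements Chapters 10 through 13 build up to. The data set D1 and the variables WTLOSS, EXERCISE, and GROUP are hypothetical placeholders, not the guide's own examples.

   /* Pearson correlation between two numeric variables (Chapter 10).        */
   PROC CORR DATA=D1;
      VAR WTLOSS EXERCISE;
   RUN;

   /* Bivariate regression: predict WTLOSS from EXERCISE (Chapter 11).       */
   PROC REG DATA=D1;
      MODEL WTLOSS = EXERCISE;
   RUN;

   /* Independent-samples t test: compare two groups on WTLOSS (Chapter 13). */
   PROC TTEST DATA=D1;
      CLASS GROUP;
      VAR WTLOSS;
   RUN;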
Chapter 15: One-Way ANOVA with One Between-Subjects Factor

One-way analysis of variance (ANOVA) is an inferential procedure similar to the independent-samples t test, with one important difference: while the t test allows you to test the significance of the difference between two sample means, a one-way ANOVA allows you to test the significance of the difference between more than two sample means. Chapter 15 shows how to use the GLM procedure to perform a one-way ANOVA and then follow up with multiple comparison (post hoc) tests.

Chapter 16: Factorial ANOVA with Two Between-Subjects Factors

A one-way ANOVA, as described in Chapter 15, may be appropriate for analyzing data from an experiment in which the researcher manipulates only one independent variable. In contrast, a factorial ANOVA with two between-subjects factors may be appropriate for analyzing data from an experiment in which the researcher manipulates two independent variables simultaneously. Chapter 16 shows how to perform this type of analysis. It provides examples of results in which the main effects are significant, as well as results in which the interaction is significant.

Chapter 17: Chi-Square Test of Independence

Nonparametric statistical procedures are procedures that do not require stringent assumptions about the nature of the populations under study. Chapter 17 illustrates one of the most common nonparametric procedures: the chi-square test of independence. This test is appropriate when you want to study the relationship between two variables that assume a limited number of values. Chapter 17 shows how to conduct the test of significance and interpret the results presented in the two-way classification table created by the FREQ procedure.

References

Many statistical procedures are illustrated in this guide by showing you how to analyze fictitious data from an empirical study. Many of these “studies” are loosely based on actual investigations reported in the research literature. These studies were chosen to help introduce you to the types of empirical investigations that are often conducted in the social and behavioral sciences and in education. The “References” section at the end of this guide provides complete references for the actual studies that inspired the fictitious studies reported here.
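Again as a hedged sketch rather than the guide's own example, the statements below illustrate the two procedures just described; the data set D2 and the variables GROUP, AGGRESS, SEX, and PREF are hypothetical.

   /* One-way ANOVA followed by a Tukey multiple comparison test (Chapter 15). */
   PROC GLM DATA=D2;
      CLASS GROUP;
      MODEL AGGRESS = GROUP;
      MEANS GROUP / TUKEY;
   RUN;

   /* Chi-square test of independence (Chapter 17). The CHISQ option adds the  */
   /* chi-square statistic to the two-way classification table.                */
   PROC FREQ DATA=D2;
      TABLES SEX*PREF / CHISQ;
   RUN;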
Conclusion

This guide assumes that some of the students using it have not yet completed a course on elementary statistics. This means that some readers will be unfamiliar with terms used in data analysis, such as “observations,” “null hypothesis,” “dichotomous variables,” and so on. To remedy this, the following chapter, "Terms and Concepts Used in This Guide," provides a brief primer on basic terms and concepts in statistics. That chapter should lay a foundation that will make it easier to understand the chapters to follow.
Terms and Concepts Used in This Guide Introduction...........................................................................................15 Overview.................................................................................................................15 A Common Language for Researchers...................................................................15 Why This Chapter Is Important ...............................................................................15 Research Hypotheses and Statistical Hypotheses ..............................16 Example: A Goal-Setting Study..............................................................................16 The Research Question ..........................................................................................16 The Research Hypothesis.......................................................................................16 The Statistical Null Hypothesis................................................................................18 The Statistical Alternative Hypothesis.....................................................................19 Directional versus Nondirectional Alternative Hypotheses ......................................19 Summary ................................................................................................................21 Data, Variables, Values, and Observations ..........................................21 Defining the Instrument, Gathering Data, Analyzing Data, and Drawing Conclusions...........................................................................................21 Variables, Values, and Observations ......................................................................22 Classifying Variables According to Their Scales of Measurement......24 Introduction .............................................................................................................24 Nominal Scales .......................................................................................................25 Ordinal Scales.........................................................................................................25 Interval Scales ........................................................................................................26 Ratio Scales............................................................................................................27
Classifying Variables According to the Number of Values They Display .....................................................................................27 Overview.................................................................................................................27 Dichotomous Variables ...........................................................................................27 Limited-Value Variables ..........................................................................................28 Multi-Value Variables ..............................................................................................28 Basic Approaches to Research ............................................................29 Nonexperimental Research ....................................................................................29 Experimental Research...........................................................................................31 Using Type-of-Variable Figures to Represent Dependent and Independent Variables .....................................................................32 Overview.................................................................................................................32 Figures to Represent Types of Variables................................................................33 Using Figures to Represent the Types of Variables Assessed in a Specific Study...............................................................................................34 The Three Types of SAS Files...............................................................37 Overview.................................................................................................................37 The SAS Program...................................................................................................37 The SAS Log...........................................................................................................42 The SAS Output File ...............................................................................................44 Conclusion.............................................................................................45
Introduction

Overview

This chapter has two objectives. The first is to introduce you to basic terms and concepts related to research design and data analysis. This chapter describes the different types of variables that might be analyzed when conducting research, the classification of these variables according to their scale of measurement or other characteristics, and the differences between nonexperimental and experimental research. The chapter’s second objective is to introduce you to the three types of files that you will work with when you perform statistical analyses with SAS. These include the SAS program file, the SAS log file, and the SAS output file. After completing this chapter, you should be familiar with the fundamental terms and concepts that are relevant to data analysis, and you will have a foundation to begin learning about the SAS System in detail in subsequent chapters.

A Common Language for Researchers

Research in the behavioral sciences and in education is extremely diverse. In part, this is because the behavioral sciences represent a wide variety of disciplines, including psychology, sociology, anthropology, political science, management, and other fields. Further complicating matters is the fact that, within each discipline, a wide variety of methods are used to conduct research. These methods can include unobtrusive observation, participant observation, case studies, interviews, focus groups, surveys, ex post facto studies, laboratory experiments, and field experiments. Despite this diversity in methods used and topics investigated, most scientific investigations still share a number of characteristics. Regardless of field, most research involves an investigator who gathers data and performs analyses to determine what the data mean. In addition, most researchers in the behavioral sciences and education use a common language in reporting their research; researchers from all fields typically speak of “testing null hypotheses” and “obtaining significant p values.”

Why This Chapter Is Important

The purpose of this chapter is to review some fundamental concepts and terms that are shared in the behavioral sciences and in education. You should familiarize (or refamiliarize) yourself with this material before proceeding to the subsequent chapters, as most of the terms introduced here will be referred to again and again throughout the text. If you have not yet taken a course in statistics, this chapter will provide an elementary introduction; if you have already completed a course in statistics, it will provide a quick review.
Research Hypotheses and Statistical Hypotheses

Example: A Goal-Setting Study

Imagine that you have been hired by a large insurance company to find ways of improving the productivity of its insurance agents. Specifically, the company would like you to find ways to increase the number of insurance policies sold by the average agent. You will therefore begin a program of research to identify the determinants of agent productivity. In the course of this program, you will work with research questions, research hypotheses, and statistical hypotheses.

The Research Question

The process of research often begins by developing a clear statement of the research question (or questions). The research question is a statement of what you hope to have learned by the time the research has been completed. It is good practice to revise and refine the research question several times to ensure that you are very clear about what it is you really want to know.

In the current case, for example, you might begin with the question “What is the difference between agents who sell much insurance and agents who sell little insurance?” A more specific question might be “What variables have a causal effect on the amount of insurance sold by agents?” Upon reflection, you might realize that the insurance company really only wants to know what things management can do to cause the agents to sell more insurance. This might eliminate from consideration those variables that are not under management’s control, and it can substantially narrow the focus of the research program. This narrowing, in turn, leads to a more specific statement of the research question, such as “What variables under the control of management have a causal effect on the amount of insurance sold by agents?” Once the research question has been more clearly defined in this way, you are in a better position to develop a good hypothesis that provides a possible answer to the question.

The Research Hypothesis

A hypothesis is a statement about the predicted relationships among events or variables. A good hypothesis in the present case might identify a specific variable that is expected to have a causal effect on the amount of insurance sold by agents. For example, a research hypothesis might predict that the agents’ level of training will have a positive effect on the amount of insurance sold. Or it might predict that the agents’ level of achievement motivation will positively affect sales.
In developing the hypothesis, you might be influenced by any of a number of sources: an existing theory, some related research, or even personal experience. Let's assume that in the present situation, for example, you have been influenced by goal-setting theory. This theory states, among other things, that higher levels of work performance are achieved when difficult goals are set for employees. Drawing on goal-setting theory, you now state the following hypothesis: “The difficulty of the goals that agents set for themselves is positively related to the amount of insurance they sell.” Notice how this statement satisfies our definition for a research hypothesis, as it is a statement about the predicted relationship between two variables. The first variable can be labeled “goal difficulty,” and the second can be labeled “amount of insurance sold.” The predicted relationship between goal difficulty and amount of insurance sold is illustrated in Figure 2.1. Notice that there is an arrow extending from goal difficulty to amount of insurance sold. This arrow reflects the prediction that goal difficulty is the causal variable, and amount of insurance sold is the variable being affected.
Figure 2.1. Causal relationship between goal difficulty and amount of insurance sold, as predicted by the research hypothesis.
In Figure 2.1, you can see that the variable being affected (insurance sold) appears on the left side of the figure, and that the causal variable (goal difficulty) appears on the right. This arrangement might seem a bit unusual to you, since most figures that portray causal relationships have the order reversed (with the causal variable on the left and the variable being affected on the right). However, this guide will always use the arrangement that appears in Figure 2.1, for reasons that will become clear later. You can see that the research hypothesis stated above is quite broad in nature. In many research situations, however, it is helpful to state hypotheses that are more specific in the predictions they make. For example, assume that there is an instrument called the “Smith Goal Difficulty Scale.” Scores on this fictitious instrument can range from zero to 100, with higher scores indicating more difficult goals. If you administered this scale to a sample of agents, you could develop a more specific research hypothesis along the following lines: “Agents who score 60 or above on the Smith Goal Difficulty Scale will sell greater amounts of insurance than agents who score below 60.”
The Statistical Null Hypothesis

Beginning in Chapter 10, “Bivariate Correlation,” this guide will show you how to use the SAS System to perform tests of null hypotheses. The way that you state a specific null hypothesis will vary depending on the nature of your research question and the type of data analysis that you are performing. Generally speaking, however, a statistical null hypothesis is typically a prediction that there is no difference between groups in the population, or that there is no relationship between variables in the population.

For example, consider the research hypothesis stated in the preceding section: “Agents who score 60 or above on the Smith Goal Difficulty Scale will sell greater amounts of insurance than agents who score below 60.” Assume that you conduct a study to investigate this research hypothesis. You identify two groups of subjects:

• 50 agents who score 60 or above on the Smith Goal Difficulty Scale (the “high goal-difficulty group”).

• 50 agents who score below 60 on the Smith Goal Difficulty Scale (the “low goal-difficulty group”).
You observe these agents over a 12-month period, and record the amount of insurance that they sell. You want to investigate the following (fairly specific) research hypothesis:

Research hypothesis: The average amount of insurance sold by the high goal-difficulty group will be greater than the average amount sold by the low goal-difficulty group.

You plan to analyze the data using a statistical procedure such as a t test (which will be discussed in Chapter 13, “Independent-Samples t Test”). One way to structure this analysis is to begin with the following statistical null hypothesis:

Statistical null hypothesis: In the population, there is no difference between the high goal-difficulty group and the low goal-difficulty group with respect to their mean scores on the amount of insurance sold.

Notice that this is a prediction of no difference between the groups. You will analyze the data from your sample, and if the observed difference is large enough, you will reject this null hypothesis of no difference. Rejecting this statistical null hypothesis means that you have obtained some support for your original research hypothesis (the hypothesis that there is a difference between the groups).

Statistical null hypotheses are often represented symbolically. For example, this is how you could have symbolically represented the preceding statistical null hypothesis:

H0: µ1 = µ2

where

H0 is the symbol used to represent the null hypothesis

µ1 is the symbol used to represent the mean amount of insurance sold by Group 1 (the high goal-difficulty group) in the population

µ2 is the symbol used to represent the mean amount of insurance sold by Group 2 (the low goal-difficulty group) in the population.
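Looking ahead, a null hypothesis of this form is typically tested with an independent-samples t test, which is covered in Chapter 13. The following is only a rough sketch of how such a test might be requested in SAS; the data set name GOALS, the variable names GROUP and SALES, and the data lines are all fictitious choices made for this illustration (your own program would use your own names and data).

/* Hypothetical sketch: testing H0: mu1 = mu2 with an independent-samples t test */
DATA GOALS;
   INPUT GROUP SALES;   /* GROUP: 1 = high goal-difficulty, 2 = low goal-difficulty */
   DATALINES;
1 250000
1 180000
1 310000
2 120000
2  90000
2 150000
;
PROC TTEST DATA=GOALS;
   CLASS GROUP;   /* the variable that identifies the two groups             */
   VAR SALES;     /* the variable whose group means will be compared         */
RUN;

If the difference between the two sample means is large enough, relative to the variability in the data, the output from a procedure such as this provides the evidence you would use in deciding whether to reject the null hypothesis.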
The Statistical Alternative Hypothesis

A statistical alternative hypothesis is typically a prediction that there is a difference between groups in the population, or that there is a relationship between variables in the population. The alternative hypothesis is the counterpart to the null hypothesis; if you reject the null hypothesis, you tentatively accept the alternative hypothesis.

There are different ways that you can state alternative hypotheses. One way is simply to predict that there is a difference between the population means, without predicting which population mean is higher. Here is one way of stating that type of alternative hypothesis for the current study:

Statistical alternative hypothesis: In the population, there is a difference between the high goal-difficulty group and the low goal-difficulty group with respect to their mean scores on the amount of insurance sold.

The alternative hypothesis also can be stated symbolically:

H1: µ1 ≠ µ2

The H1 symbol above is the symbol for an alternative hypothesis. Notice that the “not equal” symbol (≠) is used to represent the prediction that the means will not be equal.

Directional versus Nondirectional Alternative Hypotheses

Nondirectional hypotheses. The preceding section illustrated a nondirectional alternative hypothesis, also known as a two-sided or two-tailed alternative hypothesis. With the type of study described here (a study in which group means are being compared), a nondirectional alternative hypothesis simply predicts that one population mean differs from the other population mean––it does not predict which population mean will be higher. You would obtain support for this nondirectional alternative hypothesis if the high goal-difficulty group sold significantly more insurance, on the average, than the low goal-difficulty group. You would also obtain support for it if the low goal-difficulty group sold significantly more insurance than the high goal-difficulty group. With a nondirectional alternative hypothesis, you are predicting some type of difference, but you are not predicting the specific nature, or direction, of the difference.
Directional hypotheses. In some situations it might be appropriate to use a directional alternative hypothesis. With the type of study described above, a directional alternative hypothesis (also known as a one-sided or one-tailed alternative hypothesis) not only predicts that there will be a difference, but also makes a specific prediction about which population will display the higher mean. For example, in the present study, previous research might lead you to predict that the population of high goal-difficulty employees will sell more insurance, on the average, than the population of low goal-difficulty employees. If this were the case, you might state the following directional alternative hypothesis:

Statistical alternative hypothesis: In the population, the mean amount of insurance sold by the high goal-difficulty group is greater than the mean amount of insurance sold by the low goal-difficulty group.

This alternative hypothesis can also be stated symbolically:

H1: µ1 > µ2

where

µ1 represents the mean amount of insurance sold by Group 1 (the high goal-difficulty group) in the population

µ2 represents the mean amount of insurance sold by Group 2 (the low goal-difficulty group) in the population.

Notice that the “greater than” symbol (>) is used to represent the prediction that the mean for the high goal-difficulty population is greater than the mean for the low goal-difficulty population.

Choosing directional versus nondirectional tests. Which type of alternative hypothesis should you use in your research? Most statistics textbooks recommend using a nondirectional, or two-sided, alternative hypothesis in most cases. The problem with the directional hypothesis is that if your obtained sample means are in the direction opposite to the one that you predict, it can cause you to fail to reject the null hypothesis even when there are very large differences between the sample means. For example, assume that you state the directional alternative hypothesis presented above (i.e., “In the population, the mean amount of insurance sold by the high goal-difficulty group is greater than the mean amount of insurance sold by the low goal-difficulty group”). Because your alternative hypothesis is a directional hypothesis, the null hypothesis you are testing is as follows:

H0: µ1 ≤ µ2

which means, “In the population, the mean amount of insurance sold by the high goal-difficulty group (Group 1) is less than or equal to the mean amount of insurance sold by the low goal-difficulty group (Group 2).”
Clearly, to reject the null hypothesis, the high goal-difficulty group (Group 1) must display a mean that is greater than that of the low goal-difficulty group (Group 2). If Group 2 displays the higher mean, then you might not reject the null hypothesis, no matter how great that difference might be. This presents a problem because the finding that Group 2 scored higher than Group 1 may be of great interest to other researchers (particularly because it is not what many would have expected). This is why, in many situations, nondirectional tests are preferred over directional tests.

Summary

In summary, research projects often begin with a statement of a research hypothesis. This allows you to develop a specific, testable statistical null hypothesis and an alternative hypothesis. The analysis of your data will lead you to one of two results:

• If the results are significant, you can reject the null hypothesis and tentatively accept the alternative hypothesis. Assuming the means are in the predicted direction, this type of result provides some support for your initial research hypothesis.

• If the results are nonsignificant, you fail to reject the null hypothesis. This type of result fails to provide support for your initial research hypothesis.
Data, Variables, Values, and Observations

Defining the Instrument, Gathering Data, Analyzing Data, and Drawing Conclusions

With the null hypothesis stated, you can now test it by conducting a study in which you gather and analyze relevant data. Data is defined as a collection of scores that are obtained when subject characteristics and/or performance are observed and recorded. For example, you can choose to test your hypothesis by conducting a simple correlational study: You identify a group of 100 agents and determine

• the difficulty of the goals that have been set for each agent

• the amount of insurance sold by each.
Different types of instruments can be used to obtain different types of data. For example, you might use a questionnaire to assess goal difficulty, but rely on company records for measures of insurance sold. Once the data are gathered, each agent will have one score indicating the difficulty of his or her goals, and a second score indicating the amount of insurance he or she has sold. You would then analyze the data to see if the agents with the more difficult goals did, in fact, sell more insurance. If so, the study results would lend some support to your research hypothesis; if not, the results would fail to provide support. In either case, you would be
able to draw conclusions regarding the tenability of your hypotheses, and would have made some progress toward answering your research question. The information learned in the current study might stimulate new questions or new hypotheses for subsequent studies, and the cycle would repeat. For example, if you obtained support for your hypothesis with a correlational study, you might choose to follow it up with a study using a different research method, perhaps an experimental study (the difference between these methods will be described below). Over time, a body of research evidence would accumulate, and researchers would be able to review this body to draw general conclusions about the determinants of insurance sales.

Variables, Values, and Observations

Definitions. When discussing data, one often speaks in terms of variables, values, and observations. Further complicating matters is the fact that researchers make distinctions between different types of variables (such as quantitative variables versus classification variables). This section discusses the distinctions between these terms.

• Variables. For the type of research discussed in this book, a variable refers to some specific characteristic of a subject that can assume one or more different values. For the subjects in the study described above, “amount of insurance sold” is an example of a variable: Some subjects had sold a large amount of insurance, and others had sold less. A different variable was “goal difficulty”: Some subjects had more difficult goals, while others had less difficult goals. Subject age was a third variable, while subject sex (male versus female) was yet another.

• Values. A value, on the other hand, refers to either a particular subject's relative standing on a quantitative variable, or a subject's classification within a classification variable. For example, “amount of insurance sold” is a quantitative variable that can assume a large number of values: One agent might sell $2,000,000 worth of insurance in one year, one might sell $100,000 worth, and another might sell $0 worth. Subject age is another quantitative variable that can assume a wide variety of values. In the sample studied, these values ranged from a low of 22 years to a high of 64 years.

• Quantitative variables. You can see that, in both of these examples, a particular value is a type of score that indicates where the subject stands on the variable. The word “score” is an appropriate substitute for the word “value” in these cases because both “amount of insurance sold” and “age” are quantitative variables: variables that represent the quantity, or amount, of the construct that is being assessed. With quantitative variables, numbers typically serve as values.

• Classification variables. A different type of variable is a classification variable or, alternatively, a qualitative variable or categorical variable. With classification variables, different values represent different groups to which the subject might belong. “Sex” is a good example of a classification variable, as it might assume only one of two values: A particular subject is classified as being either a male or a female. “Political party” is an example of a classification variable that can assume a larger number of values: A subject might be classified as being a republican, a democrat, or an independent. These variables are classification variables and not quantitative variables because the values only represent membership in a singular, specific group––membership that cannot be represented meaningfully with a numeric value.

• Observational units. In discussing data, researchers often make references to observational units, which can be defined as the individual subjects (or other objects) that serve as the source of the data. Within the behavioral sciences and education, an individual person usually serves as the observational unit under study (although it is also possible to use some other entity, such as an individual school or organization, as the observational unit). In this text, the individual person is used as the observational unit in most examples. Researchers will often refer to the “number of observations” or “number of cases” included in their data set, and this typically refers to the number of subjects who were studied.
An example. For a more concrete illustration of the concepts discussed so far, consider the data set displayed in Table 2.1:

Table 2.1
Insurance Sales Data
_______________________________________________________________________
                                    Goal difficulty   Overall
Observation   Name    Sex   Age          scores       ranking      Sales
_______________________________________________________________________
     1        Bob      M     34            97            2      $598,243
     2        Walt     M     56            80            1      $367,342
     3        Jane     F     36            67            4      $254,998
     4        Susan    F     24            40            3       $80,344
     5        Jim      M     22            37            5       $40,172
     6        Mack     M     44            24            6            $0
_______________________________________________________________________
The preceding table reports information regarding six research subjects: Bob, Walt, Jane, Susan, Jim, and Mack; therefore, we would say that the data set includes six observations. Information about a particular observation (subject) is displayed as a row running horizontally from left to right across the table. The first column of the data set (running vertically from top to bottom) is headed “Observation,” and it simply provides an observation number for each subject. The second column (headed “Name”) provides a name for each subject. The remaining five columns report information about the five research variables that are being studied. The column headed “Sex” reports subject sex, which might assume one of two values: “M” for male and “F” for female.
The column headed “Age” reports the subject's age in years. The “Goal Difficulty Scores” column reports the subject's score on a fictitious goal difficulty scale. In this example, each participant has a score on a 20-item questionnaire about the difficulty of his or her work goals. Depending on how they respond to the questionnaire, subjects receive a score ranging from a low of zero (meaning that the subject views the work goals as extremely easy) to a high of 100 (meaning that the goals are viewed as extremely difficult). The column headed “Overall Ranking” shows how the subjects were ranked by their supervisor according to their overall effectiveness as agents. A rank of 1 represents the most effective agent, and a rank of 6 represents the least effective. The column headed “Sales” reveals the amount of insurance sold by each agent (in dollars) during the most recent year.

Table 2.1 provides a very small data set with six observations and five research variables (sex, age, goal difficulty, overall ranking, and sales). One of the variables was a classification variable (sex), while the remainder were quantitative variables. The numbers or letters that appear within a particular column represent some of the values that could be assumed by that variable.
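Although writing SAS programs is not covered until later in this guide, it may help to preview how a small data set like the one in Table 2.1 could be typed into a SAS program. The sketch below is only an illustration: the data set name SALES1 and the variable names are choices made here for the example, and the dollar signs and commas are omitted from the sales figures so that they can be read as ordinary numbers.

DATA SALES1;
   INPUT NAME $ SEX $ AGE GOALDIFF RANKING SALES;   /* $ marks NAME and SEX as character variables */
   DATALINES;
Bob    M 34 97 2 598243
Walt   M 56 80 1 367342
Jane   F 36 67 4 254998
Susan  F 24 40 3  80344
Jim    M 22 37 5  40172
Mack   M 44 24 6      0
;
PROC PRINT DATA=SALES1;   /* prints the data set so that you can verify it was read correctly */
RUN;

Each line after the DATALINES statement corresponds to one observation (one agent), and each value on a line corresponds to one variable, just as in Table 2.1.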
Classifying Variables According to Their Scales of Measurement

Introduction

One of the most important schemes for classifying a variable involves its scale of measurement. Researchers generally discuss four different scales of measurement: nominal, ordinal, interval, and ratio. Before analyzing a data set, it is important to determine which scales of measurement were used because certain types of statistical procedures require specific scales of measurement. For example, a one-way analysis of variance generally requires that the dependent variable be an interval-level or ratio-level variable; the chi-square test of independence allows you to analyze nominal-level variables; other statistics make other assumptions about the scale of measurement used with the variables that are being studied.
Nominal Scales

A nominal scale is a classification system that places people, objects, or other entities into mutually exclusive categories. A variable that is measured using a nominal scale is a classification variable: It simply indicates the name of the group to which each subject belongs. The examples of classification variables provided earlier (e.g., sex and political party) also serve as examples of nominal-level variables: They tell you which group a subject belongs to, but they do not provide any quantitative information about the subjects. That is, the “sex” variable might tell you that some subjects are males and others are females, but it does not tell you that some subjects possess more of a specific characteristic relative to others. With the remaining three scales of measurement, however, some quantitative information is provided.

Ordinal Scales

Values on an ordinal scale represent the rank order of the subjects with respect to the variable that is being assessed. For example, Table 2.1 includes one variable called “Overall Ranking,” which represents the rank-ordering of the subjects according to their overall effectiveness as agents. The values on this ordinal scale represent a hierarchy of levels with respect to the construct of “effectiveness”: We know that the agent ranked “1” was perceived as being more effective than the agent ranked “2,” that the agent ranked “2” was more effective than the one ranked “3,” and so forth.

However, an ordinal scale has a serious limitation in that equal differences in scale values do not necessarily have equal quantitative meaning. For example, notice the rankings reproduced here:

Overall
ranking     Name
_______     ______
   1        Walt
   2        Bob
   3        Susan
   4        Jane
   5        Jim
   6        Mack
Notice that Walt was ranked #1 while Bob was ranked #2. The difference between these two rankings is 1 (because 2 – 1 = 1), so we might say that there is one unit of difference between Walt and Bob. Now notice that Jim was ranked #5 while Mack was ranked #6. The difference between these two rankings is also 1 (because 6 – 5 = 1), so we might say that there is also 1 unit of difference between Jim and Mack. Putting the two together, we can see that the difference in ranking between Walt and Bob is equal to the difference in ranking between Jim and Mack. But does this mean that the difference in overall effectiveness between Walt and Bob is equal to the difference in overall effectiveness between Jim and Mack? Not necessarily. It is possible that Walt was just barely superior to
Bob in effectiveness, while Jim was substantially superior to Mack. These rankings tell us very little about the quantitative differences between the subjects with regard to the underlying construct (effectiveness, in this case). An ordinal scale simply provides a rank order of who is better than whom.

Interval Scales

With an interval scale, equal differences between scale values do have equal quantitative meaning. For this reason, you can see that the interval scale provides more quantitative information than the ordinal scale. A good example of an interval scale is the Fahrenheit scale used to measure temperature. With the Fahrenheit scale, the difference between 70 degrees and 75 degrees is equal to the difference between 80 degrees and 85 degrees: the units of measurement are equal throughout the full range of the scale.

However, the interval scale also has an important limitation: it does not have a true zero point. A true zero point means that a value of zero on the scale represents zero quantity of the construct being assessed. It should be obvious that the Fahrenheit scale does not have a true zero point. When the thermometer reads zero degrees, that does not mean that there is absolutely no heat present in the environment––it is still possible for the temperature to go lower (into the negative numbers).

Researchers in the social sciences often assume that many of their “man-made” variables are measured on an interval scale. For example, in the preceding study involving insurance agents, you would probably assume that scores from the goal difficulty questionnaire constitute an interval-level scale; that is, you would likely assume that the difference between a score of 50 and 60 is approximately equal to the difference between a score of 70 and 80. Many researchers would also assume that scores from an instrument such as an intelligence test are measured at the interval level of measurement. On the other hand, some researchers are skeptical that instruments such as these have true equal-interval properties, and prefer to refer to them as quasi-interval scales. Disagreements concerning the level of measurement achieved with such paper-and-pencil instruments continue to be a controversial topic within many disciplines.

In any case, it is clear that there is no true zero point with either of the preceding instruments: a score of zero on the goal difficulty scale does not indicate the complete absence of goal difficulty, and a score of zero on an intelligence test does not indicate the complete absence of intelligence. A true zero point can be found only with variables measured on a ratio scale.
Ratio Scales

Ratio scales are similar to interval scales in that equal differences between scale values do have equal quantitative meaning. However, ratio scales also have a true zero point, which gives them an additional property: with ratio scales, it is possible to make meaningful statements about the ratios between scale values.

For example, the system of inches used with a common ruler is an example of a ratio scale. There is a true zero point with this system, in that “zero inches” does in fact indicate a complete absence of length. With this scale, it is possible to make meaningful statements about ratios. It is appropriate to say that an object four inches long is twice as long as an object two inches long. Age, as measured in years, is also on a ratio scale: a 10-year-old house is twice as old as a 5-year-old house. Notice that it is not possible to make these statements about ratios with the interval-level variables discussed above. One would not say that a person with an IQ of 160 is twice as intelligent as a person with an IQ of 80, as there is no true zero point with that scale.

Although ratio-level scales are most commonly used for reporting the physical properties of objects (e.g., height, weight), they are also common in the type of research that is discussed in this manual. For example, the study discussed above included the variables “age” and “amount of insurance sold (in dollars).” Both of these have true zero points, and are measured on ratio scales.
Classifying Variables According to the Number of Values They Display

Overview

The preceding section showed that variables can be classified according to their scale of measurement. Sometimes it is also useful to classify variables according to the number of values they display. There might be any number of approaches for doing this, but this guide uses a simple division of variables into three groups according to the number of possible values: dichotomous variables, limited-value variables, and multi-value variables.

Dichotomous Variables

A dichotomous variable is a variable that assumes just two values. These variables are sometimes called binary variables. Here are some examples of dichotomous variables:

• Suppose that you obtain Smith Anxiety Test scores from 50 male subjects and 50 female subjects. In this study, “subject sex” is a dichotomous variable, because it can assume just two values, “male” versus “female.”

• Suppose that you conduct an experiment to determine whether the herbal supplement ginkgo biloba causes improvement in a rat’s ability to learn. You begin with 20 rats, and randomly assign them to two groups. Ten rats are assigned to the 100 mg group (they receive 100 mg of ginkgo), and the other ten rats are assigned to the 0 mg group (they receive no ginkgo). In this study, the independent variable that you are manipulating is “amount of ginkgo administered.” This is a dichotomous variable because it assumes just two values: “0 mg” versus “100 mg.”
Limited-Value Variables

A limited-value variable is a variable that assumes just two to six values in your sample. Here are some examples of limited-value variables:

• Suppose that you obtain Smith Anxiety Test scores from 50 Caucasian subjects, 50 African-American subjects, and 50 Asian-American subjects. In this study, “subject race” is a limited-value variable because it assumes just three values: “Caucasian” versus “African-American” versus “Asian-American.”

• Suppose that you again conduct an experiment to determine whether ginkgo biloba causes improvements in a rat’s ability to learn. You begin with 100 rats, and randomly assign them to four groups: Twenty-five rats are assigned to the 150 mg group, 25 rats are assigned to the 100 mg group, 25 rats are assigned to the 50 mg group, and 25 rats are assigned to the 0 mg group. In this study, the independent variable that you are manipulating is still “amount of ginkgo administered.” You know that this is a limited-value variable because it assumes just four values: “0 mg” versus “50 mg” versus “100 mg” versus “150 mg.”
Multi-Value Variables

Finally, this book defines a multi-value variable as a variable that assumes more than six values in your sample. Here are some examples of multi-value variables:

• Assume that you obtain Smith Anxiety Test scores from 100 subjects. With the Smith Anxiety Test, scores (values) may range from 0–99, with higher scores indicating greater anxiety. In analyzing the data, you see that your subjects displayed a wide variety of scores, for example:
   • One subject received a score of 2.
   • One subject received a score of 5.
   • Two subjects received a score of 10.
   • Five subjects received a score of 21.
   • Seven subjects received a score of 33.
   • Eight subjects received a score of 45.
   • Nine subjects received a score of 53.
   • Seven subjects received a score of 68.
   • Six subjects received a score of 72.
   • Six subjects received a score of 81.
   • One subject received a score of 89.
   • One subject received a score of 91.

Other subjects received yet other scores. Clearly, scores on the Smith Anxiety Test constitute a multi-value variable in your sample because your subjects displayed more than six values on this variable.

• Assume that, in the ginkgo biloba study just described, you assess your dependent variable (learning) in the rats by having them work at a maze-solving task. First, you teach each rat that, if it can correctly find its way through a maze, it will be rewarded with food at the end. You then allow each rat to try to find its way through a series of mazes. Each rat is allowed 30 trials––30 opportunities to get through a maze. Your measure of learning, therefore, is the number of mazes that each rat correctly negotiates. This score can range from zero (if the rat is not successful on any of the trials), to 30 (if the rat is successful on all of the trials). A rat also can score anywhere in between these extremes. In analyzing the data, you find that the rats displayed a wide variety of scores on this “successful trials” dependent variable, for example:
   • One rat displayed zero successful trials.
   • Two rats displayed three successful trials.
   • Three rats displayed eight successful trials.
   • Four rats displayed 10 successful trials.
   • Five rats displayed 14 successful trials.
   • Six rats displayed 15 successful trials.
   • Six rats displayed 19 successful trials.
   • Two rats displayed 21 successful trials.
   • One rat displayed 27 successful trials.
   • One rat displayed 28 successful trials.

Other rats displayed yet other scores. Clearly, scores on the “successful trials” variable constitute a multi-value variable in your sample because the rats displayed more than six values on this variable.
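If you are unsure how many different values a variable actually displays in your sample, one simple check is to request a frequency table for it. The sketch below is only an illustration: the data set name GINKGO, the variable name TRIALS, and the handful of data lines are fictitious and are included just to make the example self-contained.

DATA GINKGO;
   INPUT TRIALS;   /* number of successful maze trials for one rat */
   DATALINES;
0
3
8
10
14
15
19
21
;
PROC FREQ DATA=GINKGO;
   TABLES TRIALS;   /* lists each distinct value of TRIALS along with its frequency */
RUN;

Counting the rows in the resulting frequency table tells you how many distinct values the variable displays, which is what you need to know in order to classify it as dichotomous, limited-value, or multi-value.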
Basic Approaches to Research

Nonexperimental Research

Naturally-occurring variables. Much research can be described as being either nonexperimental or experimental in nature. In nonexperimental research (also called correlational, nonmanipulative, or observational research), the researcher simply studies the naturally-occurring relationship between two or more naturally-occurring variables. A naturally-occurring variable is a variable that is not manipulated or controlled by the researcher; it is simply measured as it normally exists. The insurance study described previously is a good example of nonexperimental research, in that you simply measured two naturally-occurring variables (goal difficulty and amount of insurance sold) to determine whether they were related. If, in a different study, you investigated the relationship between IQ and college grade point average (GPA), this would also be an example of nonexperimental research.

Criterion versus predictor variables. With nonexperimental designs, researchers often refer to criterion variables and predictor variables. A criterion variable is an outcome variable that can be predicted from one or more predictor variables. The criterion variable is often the main focus of the study in that it is the outcome variable mentioned in the statement of the research problem. With our insurance example, the criterion variable is the amount of insurance sold. The predictor variable, on the other hand, is the variable that is used to predict values on the criterion. In some studies, you might even believe that the predictor variable has a causal effect on the criterion. In the insurance study, for example, the predictor variable was “goal difficulty.” Because you believed that goal difficulty can positively affect insurance sales, you conducted a study in which goal difficulty was the predictor and insurance sales was the criterion. You do not necessarily have to believe that there is a causal relationship between two variables to conduct a study such as this, however; you might simply be interested in determining whether it is possible to predict one variable from the other.

Cause-and-effect relationships. It should be noted here that nonexperimental research that investigates the relationship between just two variables generally provides very weak evidence concerning cause-and-effect relationships. The reasons for this can be seen by reviewing our study on insurance sales. If you conduct this study and find that the agents with the more difficult goals also tend to sell more insurance, does that mean that having difficult goals caused them to sell more insurance? Not necessarily. It can also be argued that selling a lot of insurance increases the agents' self-confidence, and that this causes them to set higher work goals for themselves. Under this second scenario, it was actually the insurance sales that had a causal effect on goal difficulty. As this example shows, with nonexperimental research it is often possible to obtain a single finding that is consistent with a number of different, contradictory causal explanations. Hence, a strong inference that “variable A had a causal effect on variable B” is generally not possible when one conducts simple correlational research with just two variables.

To obtain stronger evidence of cause and effect, researchers generally either analyze the relationships among a larger number of variables using sophisticated statistical procedures that are beyond the scope of this text (such as structural equation modeling), or drop the nonexperimental approach entirely and instead use experimental research methods. The nature of experimental research is discussed in the following section.
Experimental Research

General characteristics. Most experimental research can be identified by three important characteristics:

• subjects are randomly assigned to experimental conditions

• the researcher manipulates an independent variable

• subjects in different experimental conditions are treated similarly with regard to all variables except the independent variable.
To illustrate these concepts, let's describe a possible experiment in which you test the hypothesis that goal difficulty positively affects insurance sales. First you identify a group of 100 agents who will serve as subjects. Then you randomly assign 50 agents to a “difficult goal” condition. Subjects in this group are told by their superiors to make at least 25 “cold calls” (sales calls) to potential policyholders per week. Assume that this is a relatively difficult goal. The other 50 agents are randomly assigned to the “easy goal” condition. They are told to make just 5 cold calls to potential policyholders per week. To the extent possible, you see to it that agents in both groups are treated similarly with respect to everything except for the difficulty of the goals that are set for them. After one year, you determine how much new insurance each agent has sold that year. You find that the average agent in the difficult goal condition sold new policies totaling $156,000, while the average agent in the easy goal condition sold policies totaling only $121,000.

Independent versus dependent variables. It is possible to use some of the terminology associated with nonexperimental research when discussing this experiment. For example, it would be appropriate to continue to refer to the amount of insurance sold as being a criterion variable because this is the outcome variable of central interest. You also could continue to refer to goal difficulty as the predictor variable because you believe that this variable will predict sales to some extent. Notice that goal difficulty is now a somewhat different variable, however. In the nonexperimental study, goal difficulty was a naturally-occurring variable that could take on a wide variety of values (whatever score the subject received on the goal difficulty questionnaire). In the present experiment, however, goal difficulty is a manipulated variable, which means that you (as the researcher) determined what value of the variable would be assigned to each subject. In the experiment, the goal difficulty variable could assume only one of two values: Subjects were either in the difficult goal group or the easy goal group. Therefore, goal difficulty is now a classification variable that codes group membership.

Although it is acceptable to speak of predictor and criterion variables within the context of experimental research, it is more common to speak in terms of independent variables and dependent variables. The independent variable is that variable whose values (or levels) are
selected by the experimenter to determine what effect the independent variable has on the dependent variable. The independent variable is the experimental counterpart to a predictor variable. A dependent variable, on the other hand, is some aspect of the subject's behavior that is assessed to determine whether it has been affected by the independent variable. The dependent variable is the experimental counterpart to a criterion variable. In the present example experiment, goal difficulty is the independent variable, and the amount of insurance sold is the dependent variable.

Remember that the terms “predictor variable” and “criterion variable” can be used with almost any type of research––experimental or nonexperimental. However, the terms “independent variable” and “dependent variable” should be used only with experimental research––research conducted under controlled conditions with a true manipulated independent variable.

Levels of the independent variable. Researchers often speak in terms of the different levels of the independent variable. These levels are also referred to as experimental conditions or treatment conditions, and correspond to the different groups to which a subject might be assigned. The present example included two experimental conditions: a difficult goal condition, and an easy goal condition.

With respect to the independent variable, it is common to speak of the experimental group versus the control group. Generally speaking, the experimental group is the group that receives the experimental treatment of interest, while the control group is an equivalent group of subjects that does not receive this treatment. The simplest type of experiment consists of one experimental group and one control group. For example, the present study could have been redesigned so that it simply consisted of an experimental group that was assigned the goal of making 25 cold calls (the difficult goal condition), as well as a control group in which no goals were assigned (the no-goal condition). Obviously, it is possible to expand the study by creating more than one experimental group. This could be accomplished in the present case by assigning one experimental group the difficult goal of 25 cold calls and the second experimental group the easy goal of 5 cold calls. The control group could still be assigned zero goals.
Using Type-of-Variable Figures to Represent Dependent and Independent Variables

Overview

Many studies in the social sciences and education are designed to investigate the relationship between just two variables. In an experiment, researchers generally refer to these as the independent and dependent variables; in a nonexperimental study, researchers often call them the predictor and criterion variables.
Some chapters in this guide will describe studies in which a researcher investigates the relationship between predictor and criterion variables. To help you better visualize the nature of the variables being analyzed, most of these chapters will provide a type-of-variable figure: a figure that graphically illustrates the number of values that are assumed by the two variables in the study. This section begins by presenting the symbols that will represent three types of variables: dichotomous variables, limited-value variables, and multi-value variables. It then provides a few examples of the type-of-variable figures that you will see in subsequent chapters of this book.

Figures to Represent Types of Variables

Dichotomous variables. A dichotomous variable is one that assumes just two values. For example, the variable “sex” is a dichotomous variable that can assume just the values of “male” versus “female”. Below is the type-of-variable symbol that will represent a dichotomous variable:
[Symbol for a dichotomous variable: two boxes labeled “Di”]

The “Di” that appears inside the boxes is an abbreviation for “Dichotomous.” The figure includes two boxes to help you remember that a dichotomous variable is one that assumes only two values.

Limited-value variables. A limited-value variable is one that assumes only two to six values. For example, the variable “political party” would be a limited-value variable if it assumed only the values of “democrat” versus “republican” versus “independent.” Below is the type-of-variable symbol that will represent a limited-value variable:

[Symbol for a limited-value variable: three boxes labeled “Lmt”]

The “Lmt” that appears inside the boxes is an abbreviation for “Limited.” The figure includes three boxes to remind you that a limited-value variable is one that can have only two to six values.

Multi-value variables. A multi-value variable is one that assumes more than six values. For example, if you administered an IQ test to a sample of 300 subjects, then “IQ scores” would be a multi-value variable if more than six different IQ scores appeared in your sample. Below is the type-of-variable symbol that will represent a multi-value variable:

[Symbol for a multi-value variable: seven boxes labeled “Multi”]

This figure consists of seven boxes to help you remember that a multi-value variable is one that assumes more than six values in your sample.
Using Figures to Represent the Types of Variables Assessed in a Specific Study

As was stated earlier, when a study is a true experiment, the two variables that are being investigated are typically referred to as a dependent variable and an independent variable. It is possible to construct a type-of-variable figure that illustrates the nature of the dependent variable, as well as the nature of the independent variable, in a single figure.

The research hypothesis. For example, earlier this chapter developed the research hypothesis that goal difficulty will have a positive causal effect on the amount of insurance sold by insurance agents. This hypothesis was illustrated by the causal figure presented in Figure 2.1. That figure is again reproduced here as Figure 2.2. Notice that, in this figure, the dependent variable (amount of insurance sold) appears on the left, and the independent variable (goal difficulty) appears on the right.
Figure 2.2. Predicted causal relationship between goal difficulty (the independent variable) and amount of insurance sold (the dependent variable).
An experiment with two conditions. In this example, you conduct a simple experiment to investigate this research hypothesis. You begin with 100 insurance agents, and randomly assign each agent to either an experimental group or a control group. The 50 agents in the experimental group (the “difficult-goal condition”) are told to make 25 cold calls each week. The 50 agents in the control group (the “easy-goal condition”) are told to make 5 cold calls each week. After one year, you measure your dependent variable: The amount of insurance (in dollars) sold by each agent. When you review the data, you find that the agents displayed a wide variety of scores on this dependent variable: some agents sold $0 worth of insurance, some agents sold $5,000,000 worth of insurance, and most sold somewhere in between these two extremes. As a group, they displayed far more than six values on this dependent variable. The type-of-variable figure for the preceding study is shown below:
[Type-of-variable figure: Multi = Di]
When illustrating an experiment with a type-of-variable figure, this guide will use the convention of placing the symbol for the dependent variable on the left side of the equals sign (=), and placing the symbol for the independent variable on the right side of the equals sign. You can see that this convention was followed in the preceding figure: the word “Multi” on the left of the equals sign represents the fact that the dependent variable in your
study (amount of insurance sold) was a multi-value variable. You knew this, because the agents displayed more than six values on this variable. In addition, the letters “Di” on the right side of the equals sign represent the fact that the independent variable (goal difficulty) was a dichotomous variable. You knew this, because this independent variable consisted of just two values (conditions): a difficult-goal condition and an easy-goal condition.

Because the dependent variable is on the left and the independent variable is on the right, the preceding type-of-variable figure is similar to Figure 2.2, which illustrated the research hypothesis. In that figure, the dependent variable was also on the left, and the independent variable was also on the right.

The preceding type-of-variable figure could be used to illustrate any experiment in which the dependent variable was a multi-value variable and the independent variable was a dichotomous variable. In Chapter 13, “Independent-Samples t Test,” you will learn that data from this type of experiment are often analyzed using a statistical procedure called a t test.

A warning about statistical assumptions. Please note that, when you are deciding whether it is appropriate to analyze a data set with a t test, it is not sufficient to simply verify that the dependent variable is a multi-value variable and that the independent variable is a dichotomous variable. There are many statistical assumptions that must be satisfied for a t test to be appropriate, and those assumptions will not be discussed in this chapter (they will be discussed in the chapters on t tests). The type-of-variable figure was presented above to help you visualize the type of situation in which a t test is often performed. Each chapter of this guide that discusses an inferential statistical procedure (such as a t test) also will describe the assumptions that must be met in order for the test to be valid.

An experiment with three conditions. Now let’s modify the experiment somewhat, and observe how it changes the type-of-variable figure. Assume that you now have 150 subjects, and your independent variable now consists of three conditions, rather than just two:

• The 50 agents in experimental group #1 (the “difficult-goal condition”) are told to make 25 cold calls each week.

• The 50 agents in experimental group #2 (the “easy-goal condition”) are told to make 5 cold calls each week.

• The 50 agents in the control group (the “no-goal condition”) are not given any specific goals about the number of cold calls to make each week.
Assume that everything else about the study remains the same. That is, you use the same dependent variable, the number of values observed on the dependent variable still exceeds six, and so forth. If this were the case, you would illustrate this revised study with the following figure:
[Type-of-variable figure: Multi = Lmt]
Notice that “Multi” still appears to the left of the equals sign because your dependent variable has not changed. However, “Lmt” now appears to the right of the equals sign
because the independent variable now has three values rather than two. This means that the independent variable is now a limited-value variable, not a dichotomous variable.

The preceding figure could be used to illustrate any experiment in which the dependent variable was a multi-value variable, and the independent variable was a limited-value variable. In Chapter 15, “One-Way ANOVA with One Between-Subjects Factor,” you will learn that data from this type of experiment can be analyzed using a statistical procedure called a one-way ANOVA (assuming that other assumptions are met; those assumptions will be discussed in Chapter 15).

A correlational study. Finally, let’s modify the experiment one more time, and observe how it changes the type-of-variable figure. This time you are interested in the same research hypothesis, but you are doing a nonexperimental study rather than an experiment. In this study, you will not manipulate an independent variable. Instead, you will simply measure two naturally occurring variables and will determine whether they are correlated in a sample of 200 insurance agents. The two variables are:

• Goal difficulty. Each agent completes a scale that assesses the difficulty of the goals that the agent sets for himself/herself. With this scale, scores can range from 0 to 99, with higher scores representing more difficult goals. When you analyze the data, you find that this variable displays more than six values in this sample (i.e., you find that the agents get a wide variety of different scores).

• Amount of insurance sold. For each agent, you review records to determine how much insurance the agent has sold during the previous year. Assume that this variable also displays more than six observed values in your sample.
By analyzing your data, you want to determine whether there is a significant correlation between goal difficulty and the amount of insurance sold. You hope to find that agents who had high scores on goal difficulty also tended to have high scores on insurance sold. Because this is nonexperimental research, it is not appropriate to speak in terms of an independent variable and a dependent variable. Instead, you will refer to “goal difficulty” as a predictor variable, and “insurance sold” as a criterion variable. When preparing a type-of-variable figure for this type of study, the criterion variable should appear to the left of the equals sign, and the predictor variable should appear to the right.
The correlational study that was described above can be represented with the following type-of-variable figure:
[Type-of-variable figure: Multi = Multi]
The “Multi” appearing to the left of the equals sign represents the criterion variable in your study: amount of insurance sold. You knew that it was a multi-value variable, because it displayed more than six values in your sample. The “Multi” appearing on the right of the equals sign represents the predictor variable in your study: scores on the goal difficulty scale. The preceding figure could be used to illustrate any correlational study in which the criterion variable and predictor variable were both multi-value variables. In Chapter 10 of this guide (“Bivariate Correlation”), you will learn that data from this type of study are often analyzed by computing a Pearson correlation coefficient (assuming that other assumptions are met).
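Looking ahead to Chapter 10, a correlational study of this kind is typically analyzed with PROC CORR, which computes the Pearson correlation coefficient. The sketch below is only an illustration: the data set name AGENTS and the variable names GOALDIFF and SALES are choices made for this example, and the data lines simply reuse the fictitious values from Table 2.1.

DATA AGENTS;
   INPUT GOALDIFF SALES;
   DATALINES;
97 598243
80 367342
67 254998
40  80344
37  40172
24      0
;
PROC CORR DATA=AGENTS;
   VAR GOALDIFF SALES;   /* compute the Pearson correlation between these two variables */
RUN;

A positive correlation would be consistent with the prediction that agents with higher goal-difficulty scores also tend to have higher sales.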
The Three Types of SAS Files

Overview

The purpose of this section is to provide a very general overview of the procedure that you will follow when you submit a SAS program and then interpret the results. To do this, the current section will present a short SAS program and briefly describe the output that it creates. Generally speaking, you will work with three types of files when you use the SAS System: one file will contain the SAS program, one will contain the SAS log, and one will contain the SAS output. The differences between these three types of files are discussed next.

The SAS Program

A SAS program consists of a set of statements written by the user. These statements provide the SAS System with the data to be analyzed, tell SAS about the nature of the data, and indicate which statistical analyses should be performed on the data. These statements are usually typed as data lines in a file in the computer’s memory.

Some fictitious data. This section will illustrate a simple SAS program that analyzes some fictitious data. Suppose that you have administered two tests (Test 1 and Test 2) to a group of eight people. Scores on a particular test can range from 0 to 9. Table 2.2 presents the scores that the eight subjects earned on Test 1 and Test 2.
Table 2.2
Scores Earned on Test 1 and Test 2
____________________________
Subject      Test 1   Test 2
____________________________
Marsha          2        3
Charles         2        2
Jack            3        3
Cathy           3        4
Emmett          4        3
Marie           4        4
Cindy           5        3
Susan           5        4
____________________________
The way that the information is arranged in Table 2.2 is representative of the way that information is arranged in most SAS data sets. Each vertical column (running from the top to the bottom) provides information about a different variable. The headings in Table 2.2 tell us that:

• The first column provides information about the “Subject” variable: It provides the first name for each subject.

• The second column provides information about the “Test 1” variable: It provides each subject’s score on Test 1.

• The third column provides information about the “Test 2” variable: It provides each subject’s score on Test 2.

In contrast, each horizontal row in the table (running from left to right) provides information about a different subject. For example:

• The first row provides information about the subject named Marsha. Where the row for Marsha intersects with the column headed Test 1, you can see that she obtained a score of “2” on Test 1. Where the row for Marsha intersects with the column headed Test 2, you can see that she obtained a score of “3” on Test 2.

• The second row provides information about the subject named Charles. Where the row for Charles intersects with the column headed Test 1, you can see that he obtained a score of “2” on Test 1. Where the row for Charles intersects with the column headed “Test 2,” you can see that he also obtained a score of “2” on Test 2.
The rows for the remaining subjects can be interpreted in the same way.
The SAS program. Suppose that you now want to analyze subject scores on the two tests. Specifically, you want to compute the means and some other descriptive statistics for Test 1 and Test 2. Following is a complete SAS program that enters the data presented in Table 2.2. It also computes means and some other descriptive statistics for Test 1 and Test 2.

OPTIONS LS=80 PS=60;
DATA D1;
INPUT TEST1 TEST2;
DATALINES;
2 3
2 2
3 3
3 4
4 3
4 4
5 3
5 4
;
PROC MEANS DATA=D1;
TITLE1 'JANE DOE';
RUN;

It will be easier to refer to the different components of the preceding program if we assign line numbers to each line. We will then be able to use these line numbers to refer to specific statements. Therefore, the program is reproduced again below, this time with line numbers added (remember that you would not actually type these line numbers if you were writing a program to be analyzed by the SAS System––the line numbers should already appear on your computer screen if you use the SAS windowing environment and follow the directions to be provided in Chapter 3 of this guide):

 1   OPTIONS LS=80 PS=60;
 2   DATA D1;
 3   INPUT TEST1 TEST2;
 4   DATALINES;
 5   2 3
 6   2 2
 7   3 3
 8   3 4
 9   4 3
10   4 4
11   5 3
12   5 4
13   ;
14   PROC MEANS DATA=D1;
15   TITLE1 'JANE DOE';
16   RUN;
This chapter does not discuss SAS programming statements in detail. However, the preceding program will make more sense to you if the functions of its various parts are briefly explained:

• Line 1 of the preceding program contains the OPTIONS statement. This is a global statement that can be used to modify how the SAS System operates. In this example, the OPTIONS statement is used to specify how large each page of SAS output should be when it is printed.

• Line 2 contains the DATA statement. You use this statement to start the DATA step (explained below) and assign a name to the data set that you are creating.

• Line 3 contains the INPUT statement. You use this statement to assign names to the variables that SAS will work with.

• Line 4 contains the DATALINES statement. This statement tells SAS that the data lines will begin with the next line of the program.

• Lines 5–12 are the data lines that will be read by SAS. You can see that these data lines were taken directly from Table 2.2: Line 5 contains scores on Test 1 and Test 2 from Marsha; line 6 contains scores on Test 1 and Test 2 from Charles, and so on. There are eight data lines because there were eight subjects. Obviously, the subjects’ names have not been included as part of the data set (although they can be included, if you choose).

• Line 13 is the “null statement.” It is very short, consisting of a single semicolon. This null statement tells SAS that the data lines have ended.

• Line 14 contains the PROC MEANS statement. It tells SAS to compute means and other descriptive statistics for all numeric variables in the data set.

• Line 15 contains the TITLE1 statement. You use this statement to assign a title, or heading, that will appear on each page of output. Here, the title will be “JANE DOE”.

• Finally, Line 16 contains the RUN statement that signals the end of the program.
Subsequent chapters will discuss the use of the preceding statements in much more detail.

What is the single most common programming error?

For new SAS users, the most common programming error usually involves omitting a required semicolon (;). Remember that every SAS statement must end with a semicolon (in the preceding program, notice that the DATA statement ends with a semicolon, as do the INPUT statement and the PROC MEANS statement). When you obtain an error in running a SAS program, one of the first things that you should do is inspect the program for missing semicolons.
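To make this concrete, here is a brief illustration (it is not part of the example program above, and it is not meant to be typed and run). In the first fragment, the semicolon has been left off of the DATA statement. Because SAS treats everything up to the next semicolon as a single statement, it would run the DATA line and the INPUT line together as one statement, and the program would not behave as intended. The second fragment shows the corrected form.

/* Incorrect: the semicolon is missing after "DATA D1" */
DATA D1
   INPUT TEST1 TEST2;

/* Correct: each statement ends with its own semicolon */
DATA D1;
   INPUT TEST1 TEST2;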
The DATA step versus the PROC step. There is another, more fundamental way to divide a SAS program into its constituent components: it is possible to think of each SAS program as consisting of a DATA step and a PROC step. Below, we show how the preceding program can be divided in this way:

DATA step

   OPTIONS LS=80 PS=60;
   DATA D1;
      INPUT TEST1 TEST2;
   DATALINES;
   2 3
   2 2
   3 3
   3 4
   4 3
   4 4
   5 3
   5 4
   ;

PROC step

   PROC MEANS DATA=D1;
      TITLE1 'JANE DOE';
   RUN;
The differences between these steps are described below. In the DATA step, programming statements create and/or modify a SAS data set. Among other things, statements in the DATA step may
•  assign a name to the data set
•  assign names to the variables to be included in the data set
•  provide the actual data to be analyzed
•  recode existing variables
•  create new variables from existing variables (the last two tasks are illustrated in the sketch below).
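The following short program is a minimal sketch of those last two tasks. It is not part of the example analyzed in this chapter: the data set name D2, the new variable TOTAL, and the particular recode (changing a Test 1 score of 5 to a 4) are made up purely for illustration.

DATA D2;
   INPUT TEST1 TEST2;
   /* Recode an existing variable: change any Test 1 score of 5 to a 4 */
   IF TEST1 = 5 THEN TEST1 = 4;
   /* Create a new variable from existing variables */
   TOTAL = TEST1 + TEST2;
DATALINES;
2 3
5 4
;
PROC PRINT DATA=D2;
RUN;

If you submitted this program, the PROC PRINT step would simply list the resulting data set so that you could see the recoded score and the new TOTAL variable.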
In contrast to the DATA step, the PROC step includes statements that request specific statistical analyses of the data. For example, the PROC step might request that correlations be computed for all pairs of numeric variables, or might request that a t test be performed. In the preceding example, the PROC step requested that means be computed.

What text editor will I use to write my SAS program?

An editor is a computer application that allows you to create lines of text, such as the lines that constitute a SAS program. If you are working on a mainframe or midrange computer system, you might have a variety of editors that can be used to write your SAS programs; just ask the staff at your computer facility.
For many users, it is best to use the SAS windowing environment to write SAS programs. The SAS windowing environment is an integrated application that allows users to create and edit SAS programs, submit them for interactive analysis, view the results on their screens, manage files, and perform other activities. This application is available at most locations where the SAS System is installed (including personal computers). Chapter 3 of this guide provides a tutorial that shows you how to use the SAS windowing environment.

After submitting the SAS program. Once the preceding program has been submitted for analysis, SAS will create two types of files reporting the results of the analysis. One file is called the SAS log or log file, and the other file is the SAS output file. The following sections explain the purpose of these files.

The SAS Log

The SAS log is generated by SAS after you submit your program. It is a summary of notes and messages generated by SAS as your program executes. These notes and messages will help you verify that your SAS program ran correctly. Specifically, the SAS log provides
•  a reprinting of the SAS program that was submitted (minus the data lines)
•  a listing of notes indicating how many variables and observations are contained in the data set
•  a listing of any notes, warnings, or error messages generated during the execution of the SAS program.
Log 2.1 provides a reproduction of the SAS log generated for the preceding program:

NOTE: SAS initialization used:
      real time           14.54 seconds

1    OPTIONS LS=80 PS=60;
2    DATA D1;
3       INPUT TEST1 TEST2;
4    DATALINES;

NOTE: The data set WORK.D1 has 8 observations and 2 variables.
NOTE: DATA statement used:
      real time           1.59 seconds

13   ;
14   PROC MEANS DATA=D1;
15      TITLE1 'JANE DOE';
16   RUN;

NOTE: There were 8 observations read from the dataset WORK.D1.
NOTE: PROCEDURE MEANS used:
      real time           1.64 seconds

Log 2.1. Log file created by the current SAS program.
Notice that the statements constituting the SAS program have been assigned line numbers, which are reproduced in the SAS log. The data lines are not normally reproduced as part of the SAS log unless they are specifically requested. About halfway down the log, the following note appears:

NOTE: The data set WORK.D1 has 8 observations and 2 variables.
This note indicates that the data set that you created (named D1) contains 8 observations and 2 variables. You would normally check this note to verify that the data set contains all of the variables that you intended to input (in this case 2), and that it contains data from all of your subjects (in this case 8). So far, everything appears to be correct.

If you had made any errors in writing the SAS program, there also would have been ERROR messages in the SAS log. Often, these error messages provide you with some help in determining what was wrong with the program. For example, a message can indicate that SAS was expecting a program statement that was not included. Chapter 3, "Tutorial: Writing and Submitting SAS Programs," will discuss error messages in more detail, and will provide you with some practice in debugging a program with an error.

Once the error or errors have been identified, you must revise the original SAS program and resubmit it for analysis. After processing is complete, again review the new SAS log to see if the errors have been eliminated.
If the log indicates that the program ran correctly, you are free to review the results of the analyses in the SAS output file.

Very often you will submit a SAS program and, after a few seconds, the SAS output window will appear on your computer screen. Some users mistakenly assume that this means that their program ran without errors. But this is not necessarily the case. Very often some parts of your program will run correctly, but other parts will have errors. The only way to be sure is to carefully review all of the SAS log before reviewing the SAS output. Chapter 3 will lead you through these steps.

The SAS Output File

The SAS output file contains the results of the statistical analyses requested in the SAS program. An output file is sometimes called a "listing" file, because it contains a listing of the results of the analyses that were requested. Because the program above requested the MEANS procedure, the output file that was produced by this program will contain means, standard deviations, and some other descriptive statistics for the two variables. Output 2.1 presents the SAS output that would be produced by the preceding SAS program.
                                 JANE DOE                                     1

                              The MEANS Procedure

 Variable    N            Mean         Std Dev         Minimum         Maximum
 ------------------------------------------------------------------------------
 TEST1       8       3.5000000       1.1952286       2.0000000       5.0000000
 TEST2       8       3.2500000       0.7071068       2.0000000       4.0000000
 ------------------------------------------------------------------------------

Output 2.1. SAS output produced by PROC MEANS.
At the top of the output page is the name “JANE DOE.” This name appears here because “JANE DOE” was included in the TITLE1 statement of the program. Later, this guide will show you how to insert your name in the TITLE1 statement, so that your name will appear at the top of each of your output pages. Below the heading “Variable,” SAS prints the names of each of the variables being analyzed. In this case, the variables are called TEST1 and TEST2. To the right of the heading “TEST1,” descriptive statistics for Test 1 can be found. Statistics for Test 2 appear to the right of “TEST2.” Below the heading “N,” the number of valid observations being analyzed is reported. You can see that the SAS System analyzed eight observations for TEST1, and eight observations for TEST2.
The average score on each variable is reproduced under “Mean.” Standard deviations appear in the column labeled “Std Dev.” You can see that, for Test 1, the mean was 3.5 and the standard deviation was 1.1952. For Test 2, the corresponding figures were 3.25 and 0.7071. Below the headings “Minimum” and “Maximum,” you will find the lowest and highest scores observed for the two variables, respectively. Once you have obtained this output file from your analysis, you can review it on your computer monitor, or print it out at a printer. Chapter 3 will show you how to interpret your output.
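As a quick check on where the value under "Mean" comes from, you can verify the Test 1 mean by hand from the data lines shown earlier in this chapter: the eight Test 1 scores are 2, 2, 3, 3, 4, 4, 5, and 5, so the mean is (2 + 2 + 3 + 3 + 4 + 4 + 5 + 5) / 8 = 28 / 8 = 3.50, which matches the value reported by PROC MEANS.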
Conclusion

This chapter has introduced you (or reintroduced you) to the terminology that is used by researchers in the behavioral sciences and education. With this foundation, you are now ready to learn about performing data analyses with SAS.

The preceding section indicated that you must use some type of text editor to write SAS programs. For most users, it is advantageous to use the SAS windowing environment for this purpose. With the SAS windowing environment, you can write and submit SAS programs, view the results on your monitor, print the results, and save your SAS programs on a diskette––all from within one application. Chapter 3 provides a hands-on tutorial that shows you how to perform these activities within the SAS windowing environment.
Chapter 3: Tutorial: Writing and Submitting SAS Programs

Introduction
   Overview
   Materials You Will Need for This Tutorial
   Conventions and Definitions
Tutorial Part I: Basics of Using the SAS Windowing Environment
Tutorial Part II: Opening and Editing an Existing SAS Program
Tutorial Part III: Submitting a Program with an Error
Tutorial Part IV: Practicing What You Have Learned
Summary of Steps for Frequently Performed Activities
   Overview
   Starting the SAS Windowing Environment
   Opening an Existing SAS Program from a Floppy Disk
   Finding and Correcting an Error in a SAS Program
   Controlling the Size of the Output Page with the OPTIONS Statement
For More Information
Conclusion
Introduction

Overview

This chapter shows you how to use the SAS windowing environment—an application that enables you to create and edit SAS programs in a text editor, submit programs for execution, review and print the results of the analysis, and perform related activities. This chapter assumes that you are using the SAS System for Windows on an IBM-compatible computer.

The tutorial in this chapter is based on Version 8 of SAS. If you are using Version 7 of SAS, you can still use the tutorial presented here (with some minor adjustments), because the interfaces for Version 7 and Version 8 are very similar. However, if you are using Version 6 of SAS, the interface that you are using is substantially different from the Version 8 interface.

The majority of this chapter consists of a tutorial that is divided into four parts. Part I shows you how to start the SAS windowing environment, create a short SAS program, save it on a 3.5-inch floppy disk, submit it for execution, and print the resulting SAS log and SAS output files. Part II shows you how to open an existing SAS file and edit it. Part III takes you through the steps involved in debugging a program with an error. Finally, Part IV gives you the opportunity to practice what you have learned. In addition, two short sections at the end of the chapter summarize the steps that are involved in frequently performed activities, and show you how to use the OPTIONS statement to control the size of your output page.

Materials You Will Need for This Tutorial

To complete this tutorial, you will need access to a computer on which the SAS System for Windows has been installed. You will also need at least one (and preferably two) 3.5" diskettes formatted for IBM-compatible computers (as opposed to Macintosh computers).

Conventions and Definitions

Here is a brief explanation of the computer-related terms that are used in this chapter:
•  The ENTER key. Depending on the computer you are using, this key is identified by "Enter," "Return," "CR," "New Line," the symbol of an arrow looping backward, or some other identifier. This key is equivalent to the return key on a typewriter.
•  The backspace key. This is the key that allows you to delete text one letter at a time. The key is identified by the word "Backspace" or "Delete," or possibly by an arrow pointing backward.
•  Menus. This book uses the abbreviation "menu" for "pull-down menu." A menu is a list of commands that you can access by clicking a word on the menu bar at the top of a window. For example, if you click the word File on the menu bar of the Editor window, the File pull-down menu appears (this menu contains commands for working with files, as you will see later in this chapter).
•  The mouse pointer. The mouse pointer is a small icon that you move around the screen by moving your mouse around on its mouse pad. Different icons serve as the mouse pointer in different contexts, depending on where you are in the SAS windowing environment. Sometimes the mouse pointer is an arrow, sometimes it is an I-beam (I), and sometimes it is a small hand.
•  The I-beam. The I-beam is a special type of mouse pointer. It is the small icon that looks like the letter "I" and appears on the screen when you are working in the Editor window. You move the I-beam around the screen by moving the mouse around on its mouse pad. If the I-beam appears at a particular location in a SAS program and you click the left button on the mouse, that point becomes the insertion point in the program; whatever you type will be inserted at that point.
•  The cursor. The cursor is the flashing bar that appears in your SAS program when you are working in the Editor window. Anything you type will appear at the location of the cursor.
•  Insert versus overtype mode. The Insert key toggles between insert mode and overtype mode. When you are in insert mode, the text to the right of the cursor will be pushed over to the right as you type. If you are in overtype mode, the text to the right of the cursor will disappear as you type over it.
•  Pointing. When this tutorial tells you to point at an icon on your screen, it means to position the mouse pointer over that icon.
•  Clicking. When this tutorial tells you to click something on your screen, it means to put the mouse pointer on that word or icon and click the button on your mouse one time. If your mouse has more than one button, click the button on the left.
•  Double-clicking. When this tutorial tells you to double-click something on your screen, it means to put the mouse pointer on that word or icon and click the left button on your mouse twice in rapid succession. Make sure that your mouse does not move when you are clicking.
Tutorial Part I: Basics of Using the SAS Windowing Environment

Overview

This section introduces you to the basic features of the SAS windowing environment. You will learn how to start the SAS System and how to cycle between the three windows that you use in a typical session: the Editor window, the Log window, and the Output window. You will type a simple SAS program, save the program on a 3.5-inch floppy disk, and submit it for execution. Finally, you will learn how to review the SAS log and SAS output created by your program and how to print these files on a printer.

Starting the SAS System

Turn on your computer and monitor if they are not already on. If you are working in a computer lab at a university and your computer screen is blank, your computer might be in sleep mode. To activate it, press any key. After the computer has finished booting up (or waking up), the monitor displays its normal start-up screen. Figure 3.1 shows the start-up screen for computers at Saginaw Valley State University (where this book was written). Your screen will not look exactly like Figure 3.1, although it should have a gray bar at the bottom, similar to the gray bar at the bottom of Figure 3.1. On the left side of this bar is a button labeled Start.
Figure 3.1. The initial computer screen (photograph copyright © 2000 Saginaw Valley State University). The figure's callout points to the Start button: click the Start button to display a list of options; one of the options should be Programs, and one of the programs should be The SAS System for Windows V8.
This is how you start the SAS System (try this now):
➤  Use your mouse to move the mouse pointer on your screen to the Start button. Click this Start button once (click it with the left button on your mouse, if your mouse has more than one button). A menu of options appears.
➤  In this list of options is the word "Programs." Put the mouse pointer on Programs. This reveals a list of programs on the computer. One of these programs is "The SAS System for Windows V8."
➤  Put the mouse pointer on The SAS System for Windows V8 and select it (click it and release). This starts the SAS System. This process takes several seconds.
At your facility, the actual sequence for starting SAS might be different from the sequence described here. For example, it is possible that there is no item on your Programs menu labeled "The SAS System for Windows V8." If this is the case, you should ask the lab assistant or your professor for guidance regarding the correct way to start SAS at your location.
When SAS is running, three windows appear on your screen: the Explorer window, the Log window, and the Editor window. Your screen should look something like Figure 3.2.

Figure 3.2. The initial SAS screen (closing the SAS Explorer window). The figure identifies the Log window, the Explorer window, and the Editor window, and points to the close button that you click to close the Explorer window.
The Five Basic SAS System Windows

After you start SAS, you have access to five SAS windows: the Editor, Log, Output, Explorer, and Results windows. Not all of these windows are visible when you first start SAS. Of these five windows, you will use only three of them to perform the types of analyses described in this book. The three windows that you will use are briefly described here:
•  The Editor window. The Editor is a SAS program editor. It enables you to create, edit, submit, save, open, and print SAS programs. In a typical session, you will spend most of your time working within this window. When you first start SAS, the words "Editor - Untitled1" appear in the title bar for this window (the title bar is the bar that appears at the top of a window). After you save your SAS program and give it a name, that name appears in the title bar for the Editor window.
•  The Log window displays your SAS log after you submit a SAS program. The SAS log is a file generated by SAS that contains your SAS program (minus the data lines), along with a listing of notes, warnings, error messages, and other information pertaining to the execution of your program. In Figure 3.2, you can see that the Log window appears in the top half of the initial screen. The word "Log" appears in the title bar for this window.
•  The Output window. The Output window contains the results of the analyses requested by your SAS program. Although the Output window does not appear in Figure 3.2, a later section shows you how to make it appear.

This book does not show you how to use the two remaining windows (the Explorer window and the Results window). In fact, for this tutorial, the first thing you should do each time you start SAS is to close these windows. This is not because these windows are not useful; it is because this book is designed to be an elementary introduction to SAS, and these two windows enable you to perform more advanced activities that are beyond the scope of this book. For guidance in using these more advanced features of the SAS windowing environment, see Delwiche and Slaughter (1998) and Gilmore (1997). The two windows that you will close each time you start SAS are briefly described here:
•  The Explorer window appears on the left side of your computer screen when you first start SAS (the word "Explorer" appears in its title bar; see Figure 3.2). It enables you to open files, move files, copy files, and perform other file management tasks. You can use it to create libraries of SAS files and to create shortcuts to files other than SAS files. The Explorer window is helpful when you are managing a large number of files or libraries.
•  The Results window also appears on the left side of your screen when you start SAS. It is hidden beneath the Explorer window, but you can see it after you close that window. The Results window lists each section of your SAS output in outline form. When you request many different statistical procedures, it provides a concise, easy-to-navigate listing of results. You can use the Results window to view, print, and save individual sections of output. The Results window is useful when you write a SAS program that contains a large number of procedures.
What If My Computer Screen Does Not Look Like Figure 3.2?

Your computer screen might not look exactly like the computer screen illustrated in Figure 3.2. For example, your computer screen might not contain one of the windows (such as the Editor window) that appears in Figure 3.2. There are a number of possible reasons for this, and it is not necessarily a cause for alarm. This chapter was prepared using Version 8 of SAS, so if you are using a later version of SAS, your screen might differ from the one shown in Figure 3.2. Also, the computer services staff at your university might have customized the SAS System, which would make it look different at startup.

The only important consideration is this: Your SAS interface must be set up so that you can use the Editor window, Log window, and Output window. There is more than one way to achieve this. After you read the following sections, you will have a good idea of how to accomplish this, even if your screen does not look exactly like Figure 3.2.
The following sections show you how to close the two windows that you will not use, how to maximize the Editor window, and how to perform other activities that will help prepare the SAS windowing environment for writing and submitting simple SAS programs.

Closing the SAS Explorer Window

The SAS Explorer window appears on the left side of your computer screen (see Figure 3.2). At the top of this window is a title bar that contains the word "Explorer." Your first task is to close this Explorer window to create more room for the SAS Editor, Log, and Output windows. In the upper-right corner of the Explorer window (on the title bar), there is a small box with an "X" in it. This is the close button for the Explorer window. At this time, complete the following step:
➤  Put your mouse pointer on the close button for the Explorer window and click once (see Figure 3.2 for guidance; make sure that you click the close button for the Explorer window, and not for any other window). The Explorer window will close.

Closing the SAS Results Window

When the Explorer window closes, it reveals another window beneath it––the Results window. Your screen should look like Figure 3.3.

Figure 3.3. Closing the SAS Results window. The figure points to the close button that you click to close the Results window.
Your next task is to close this Results window to create more room for the Editor, Log, and Output windows. In the upper-right corner of the Results window (on its title bar), there is a small box with an "x" in it. This is the close button for the Results window.
➤  Put your mouse pointer on the close button for the Results window and click once (see Figure 3.3). The Results window will close.

Maximizing the Editor Window

After you close the Results window, the Log window and the Editor window expand to the left to fill the SAS screen. The Log window appears on the upper part of the screen, and the Editor window appears on the lower part. Your screen should look like Figure 3.4.

Figure 3.4. Maximizing the Editor window. The figure points to the maximize button that you click to expand the Editor window.
As said earlier, the title bar for the Editor window is the bar that appears at the top of the window. In Figure 3.4, the title "Editor - Untitled1" appears on the left side of the title bar. On the right side of this title bar are three buttons (don't click them yet):
•  The leftmost of the three is the minimize window button; if you click this button, the window will shrink and become hidden.
•  The middle button is the maximize window button; if you click this button, the window will become larger and fill the screen.
•  The button on the far right is the close window button; if you click this button, the window will close.
At this point, the Editor window and the Log window are both visible on your screen. A possible drawback to this arrangement is that both windows are so small that it is difficult to see very much in either window. For some SAS users it is easier to view one window at a time, allowing the "active" window to expand so that it fills the screen. With this arrangement, it is as if you have stacked the windows on top of one another, but you can see only the window that is in the foreground, on the "top" of the stack. This book shows you how to set up this arrangement. In order to rearrange your windows this way, complete the following step:
➤  Using your mouse pointer, click the maximize window button for the Editor window. This is the middle button––the one that contains the square (see Figure 3.4). Be sure that you do this for the Editor window, not for the Log window or any other window.
When clicking the maximize button in the Editor window, take care that you do not click the close button (the button on the far right). If you close this window by accident, you can reopen it by completing the following steps (do not do this now unless you have closed your Editor window by mistake):
➤  On the menu bar, put your mouse pointer on the word View and click. The View menu appears.
➤  Select Enhanced Editor. Your Editor window should return to the SAS windowing environment. You can select it by using the Window menu (a later section shows you how).
You can use this same procedure to bring back your Log and Output windows if you close them by accident.

Requesting Line Numbers and Other Options

There are a number of options that you can select to make SAS easier to use. One of the most important options is the "line numbers" option. If you request this option, SAS will automatically generate line numbers for the lines of the SAS programs that you write in the Editor. Having line numbers is useful because it helps you to know where you are in the program, and it can make it easier to copy lines of a program, move lines, and so forth.

To request line numbers, you must first open the Enhanced Editor Options dialog box. Figure 3.5 shows you how to do this. Complete the following steps:
➤  On the menu bar, select Tools. The Tools menu appears.
➤  Select Options (and continue to hold the mouse button down). A pop-up menu appears.
➤  Select Enhanced Editor and release the mouse button.
Figure 3.5. Requesting the Enhanced Editor Options dialog box. The figure's callouts: first, select the Tools menu on the menu bar; second, select Options; third, select Enhanced Editor.
The Enhanced Editor Options dialog box appears. This dialog box should be similar to the one in Figure 3.6.

Figure 3.6. Selecting appropriate options in the Enhanced Editor Options dialog box. The figure's callouts: verify that Show line numbers is selected (click inside this box if it is not checked); in the Indentation section, verify that None is selected (with a dot); and verify that Clear text on submit is not selected (that is, verify that it is not checked).
There are two tabs for Enhanced Editor options: General options and Appearance options. In Figure 3.6, you can see that the General options tab has been selected. If General options has not been selected for the dialog box on your screen, click the General tab now to bring General options to the front.

A number of options are listed in this Enhanced Editor Options dialog box. For example, in the upper-left corner of the dialog box in Figure 3.6, you can see that two of the possible options are Allow cursor movement past end of line and Drag and drop text editing. If a check mark appears in the small box to the left of an option, it means that the option has been selected. If no check mark appears, the option has not been selected. If a box for an option is empty, you can click inside the box to select that option. If a box already has a check mark, you can click inside the box to deselect it and make the check mark disappear.

There are three settings that you should always review at the beginning of a SAS session. If you do not set these options as described here, SAS will still work, but your screen might not look like the screens displayed in this chapter. If your screen does not look correct or if you are having other problems with SAS, you should go to this dialog box and verify that your options are set correctly. Set your options as follows, at the beginning of your SAS session:
•  Verify that Show line numbers is selected (that is, make sure that a check mark appears in the box for this option).
•  In the box labeled Indentation, verify that None is selected.
Here is the one option that should not be selected:
•  Verify that Clear text on submit is not selected (that is, make sure that a check mark does not appear in the box for this option).

Figure 3.6 shows the proper settings for the Enhanced Editor Options dialog box. If necessary, click inside the appropriate boxes so that these three options are set properly (you can disregard the other options). When all options are correct, complete this step:
➤  Click the OK button at the bottom of the dialog box.
This returns you to the Editor window. A single line number (the number "1") appears in the upper-left corner of the Editor window. As you begin typing the lines of a SAS program, SAS will automatically generate new line numbers. A later section of this tutorial provides you with a specific SAS program to type. First, however, you will learn more about the menu bar and the Window menu.
The Menu Bar

Figure 3.7 illustrates what your screen should look like at this time. The Editor window should now be enlarged and fill your screen. Toward the top of this window is the menu bar. The menu bar lists all of the menus that you can access while in this window: the File menu, the Edit menu, the View menu, the Tools menu, the Run menu, the Solutions menu, the Window menu, and the Help menu. You will use these menus to access commands that enable you to edit SAS programs, save them on diskettes, submit them for execution, and perform other activities.

Figure 3.7. The Editor window with line numbers. The figure identifies the menu bar and a line number, and notes that you use the Window menu to change windows.
Using the Window Menu

At this point, the Editor window should be enlarged and in the foreground. During a typical session, you will jump back and forth frequently between the Editor window, the Log window, and the Output window. To do this, you will use the Window menu. In order to bring the Log window to the front of your stack, perform the following steps:
➤  Go to the menu bar (at the top of your screen).
➤  Put your mouse pointer on the word Window and click; this pulls down the Window menu and lists the different windows that you can select.
➤  In this menu, put your mouse pointer on the word Log and then release the button on the mouse.
When you release the button, the (empty) Log window comes to the foreground. Notice that the words "SAS - [Log - (Untitled)]" now appear in the title bar at the top of your screen.
To bring the Output window to the foreground, complete the following steps:
➤  Go to the menu bar at the top of your screen.
➤  Pull down the Window menu.
➤  Select Output.
The (empty) Output window comes to the front of your stack. Notice that the words "SAS - [Output - (Untitled)]" now appear on the title bar at the top of your screen.
To go back to the Editor window, complete these steps:
➤  Go to the menu bar at the top of your screen.
➤  Pull down the Window menu.
➤  Select Editor.
The Editor window comes to the foreground. If your Editor window is not as large as you would like, you can enlarge it by clicking the bottom right corner of the window and dragging it down and to the right. When you put your mouse pointer in this corner, make sure that the double-headed arrow appears before you click and drag.

A More Concise Way of Illustrating Menu Paths

The preceding section showed you how to follow a number of different menu paths. A menu path is a sequence in which you pull down a menu and select one or more commands from that menu. For example, you are following a menu path when you go to the menu bar, pull down the Window menu, and select Editor.
In the preceding section, menu paths were illustrated by listing each step on a separate line. This was done for clarity. However, to conserve space, the remainder of this book will often list an entire menu path on a single line. Here is an example:
➤  Window ➤ Editor
The preceding menu path instructs you to go to the menu bar at the top of the screen, pull down the Window menu, and select Editor. Obviously, this is the same sequence that was described earlier, but it is now being presented in a more concise way. When possible, the remainder of this book will use this abbreviated form for specifying menu paths.

Typing a Simple SAS Program

In this section, you will prepare and submit a short SAS program. Before doing this, make sure that the Editor window is the active window (that is, it is in front of other windows). You know that the Editor is the active window if the title bar at the top of your screen includes the words "SAS - [Editor - Untitled1]." If it is not the active window, use the Window menu to bring it to the front, as described earlier.

Your cursor should now be in position to begin typing your SAS program (if it is not, use your mouse or the arrow keys on your keyboard to move the cursor down to the now-empty lines where your program will be typed). Keep these points in mind as you type your program:
•  Do not type the line numbers that appear to the left of the program (that is, the numbers 1, 2, 3, and so forth, that appear on the left side of the SAS program). These line numbers are automatically generated by the SAS System as you type your SAS program.
•  The lines of your SAS program should be left-justified (that is, begin at the left side of the window). If your cursor is in the wrong location, use your arrow keys to move it to the correct location.
•  You can type SAS statements in uppercase letters or in lowercase letters––either is acceptable.
•  If you make an error, use the backspace key to correct it. This key is identified by the word "Backspace" or "Delete," or possibly by an arrow pointing backward.
•  Be sure to press ENTER at the end of each line in the program. This moves you down to the next line.
Type this program:

 1   OPTIONS LS=80 PS=60;
 2   DATA D1;
 3      INPUT TEST1 TEST2;
 4   DATALINES;
 5   2 3
 6   2 2
 7   3 3
 8   3 4
 9   4 3
10   4 4
11   5 3
12   5 4
13   ;
14   PROC MEANS DATA=D1;
15      TITLE1 'type your name here';
16   RUN;
Some notes about the preceding program:
•  The line numbers on the far left side of the program (for example, 1, 2, and 3) are generated by the SAS Editor. Do not type these line numbers. When you get to the bottom of the screen, the Editor continues to generate new lines for you each time you press ENTER.
•  In this book some lines in the programs are indented (such as line 3 in the preceding program). You are encouraged to use indention in the same way. This is not required by the SAS System, but many programmers use indention in this way to keep sections of the program organized meaningfully.
•  Line 15 contains the TITLE1 statement. You should type your first and last names between the single quotation marks in this statement. By doing this, your name will appear at the top of your printout when you print the results of the analysis. Be sure that your single quotation marks are balanced (that is, you have one to the left of your name and one to the right of your name before the semicolon). If you leave off one of the single quotation marks, it will cause an error.
Scrolling through Your Program

The program that you have just typed is so short that you can probably see all of it on your screen at one time. However, very often you will work on longer programs that extend beyond the bottom of your screen. In these situations, it is necessary to scroll (move) through your program so that you can see the hidden parts. There are a variety of approaches that you can use to scroll through a file, and Figure 3.8 illustrates some of these approaches.
Figure 3.8. How to scroll up and down. The figure's callouts: click the up-arrow to move up one line at a time; click and drag the scroll bar to move quickly through large files; click the down-arrow to move down one line at a time.
Here is a brief description of ways to scroll through a file:
•  You can press the Page Up and Page Down keys on your keyboard.
•  You can click and drag the scroll bar that appears on the right side of your Editor window. Drag it down to see the lower sections of your program, drag it up to see the earlier sections (see Figure 3.8).
•  You can click the area that appears just above or below the scroll bar to move up or down one screen at a time.
•  You can click the up-arrow and down-arrow in the scroll bar area to move up or down one line at a time.
You can use these same techniques to scroll through any type of SAS file, whether it is a SAS program, a SAS log, or a SAS output file.

SAS File Types

After your program has been typed correctly, you should save it. This is done with the File menu and the Save As command (see the next section). The current section discusses some conventions that are followed when naming SAS files.

Most files, including SAS files, have two-part names consisting of a root and a suffix. The root is the basic file name that you can often make up. For example, if you are analyzing data from a sample of subjects who are Republicans, you might want to give the file a root name such as REPUB. If you are using a version of Windows prior to Windows 95 (such as Windows 3.1), the root part of the file name must begin with a letter, must be no more than eight characters in length, and must not contain any spaces or special characters (for example, #@":?>;*). If you are using Windows 95 or later, the root part of the file name can be up to 255 characters in length, and it can contain spaces and some special characters. It cannot contain the following characters: * | \ / ; : ? " < >.

The file name extension indicates what type of file you are working with. The extension immediately follows the root, begins with a period, and is three letters long, such as .LST, .SAS, .LOG, or .DAT. Therefore, a complete file name might appear as REPUB.LST or REPUB.SAS. The extensions are described here:

root.SAS   This file contains the SAS program that you write. Remember that a SAS program is a set of statements that causes the SAS System to read data and perform analyses on the data. If you want to include the data as part of the program in this file, you can.

root.LOG   This file contains your SAS log: a file generated by SAS that includes notes, warnings, error messages, and other information pertaining to the execution of your program.

root.LST   This file contains your SAS output: the results of your analyses as generated by SAS. The "LST" is an abbreviation for "Listing."

root.DAT   This file is a raw data file, a file containing raw data that are to be read and analyzed by SAS. You use a file like this only if the data are not already included in the .SAS file that contains the SAS program. This book does not illustrate the use of the .DAT file, because with each of the SAS programs illustrated here, the data are always included as part of the .SAS file.
Saving Your SAS Program on Floppy Disks versus Other Media

This book shows you how to save your SAS programs on 3.5-inch floppy disks. This is because it is assumed that most readers of this book are university students and that floppy disks are the media that are most readily available to students.

It is also possible to save your SAS programs in a variety of other ways. For example, in some courses, students are instructed to save their programs on their computers' hard drives. In other courses, students are told to save their programs on Zip disks or some other removable media. If you decide to save your programs on a storage medium other than a 3.5-inch floppy disk, ask your lab assistant or professor for guidance.
Saving Your SAS Program for the First Time on a Floppy Disk

To save your SAS program on a floppy disk, make sure that a 3.5-inch high-density IBM PC-formatted disk has been inserted in drive "A" on your CPU (this book assumes that drive "A" is the floppy drive). Also make sure that the Editor window containing your program is the active window (the one currently in the foreground). Then make the following selections:
➤  File ➤ Save As
You will see a Save As dialog box on your screen. This dialog box contains smaller boxes with labels such as Save in, File name, and so forth. The dialog box should resemble Figure 3.9.

Figure 3.9. The initial Save As dialog box. The figure's callouts: click the down arrow on the Save in box to get a list of other locations where you can save your file; names of files and folders might appear in the middle of the dialog box; click inside the File name box, and then type the name that you want to give to your file.
The first time you save a program, you must tell the computer which drive your diskette is in and provide a name for your file. In this example, suppose that your default destination is a folder named “V8” and that you do not want to save your file in this folder (on your computer, the default folder might have a different name). Suppose you want to save it on a floppy disk instead. With most computers, this is done in the computer’s “A” drive. It is therefore necessary to change your computer’s drive, as illustrated here.
To change the location where your file will be saved, complete these steps:
➤  Click the down arrow on the right side of the Save in box; from there you can navigate to the location where you want to save your file.
➤  Scroll up and down until you see 3 1/2 Floppy (A:).
➤  Select 3 1/2 Floppy (A:). 3 1/2 Floppy (A:) appears in the Save in box.
Now you must name your file. Complete the following steps:
➤  Click inside the box labeled File name. Your cursor appears inside this box (see Figure 3.9).
➤  Type the following name inside the File name box: DEMO.SAS.
With this done, your Save As dialog box should resemble the completed Save As dialog box that appears in Figure 3.10.

Figure 3.10. The completed Save As dialog box. The figure points to the Save button: when ready, click this Save button.
After you have completed these tasks, you are ready to save the SAS program on the floppy disk. To do this, complete the following step:
➤  Click the Save button (see Figure 3.10).
After clicking Save, a small light next to the 3.5-inch drive will light up for a few seconds, indicating that the computer is saving your program on the diskette. When the light goes off, your program has been saved under the name DEMO.SAS.
Where the Name of the SAS Program Will Appear

After you use the Save As command to name a SAS program within the Editor, the name of that file will appear on the left side of the title bar for the Editor window (remember that the title bar is the bar that appears at the top of a window). For example, if you look on the left side of the title bar for the current Editor window, you will see that it no longer contains the words "SAS - [Editor - Untitled1]" as it did before. Instead, this location now contains the words "SAS - [DEMO]." Similarly, whenever you pull down the Window menu during this session, you will see that it still contains items labeled "Log" and "Output," but that it no longer contains "Editor - Untitled1." Instead, this label has been replaced with "Demo," the name that you gave to your SAS file.

Saving a File Each Subsequent Time on a Floppy Disk

The first time you save a new file, you should use the Save As command and give it a specific name as you did earlier. Each subsequent time you save that file (during the same session) you can use the Save command, rather than the Save As command. To use the Save command, verify that your Editor window is active and make the following selections:
➤  File ➤ Save
Notice that when you save the file a second time using the Save command in this way, you do not get a dialog box. Instead, the file is saved again under the same name and in the same location that you selected earlier.

Save Your Work Often!

Sometimes you will work on a single SAS program for a long period of time. On these occasions, you should save the file once every 10 minutes or so. If you do not do this, you might lose all of the work you have done if the computer loses power or becomes inoperative during your session. However, if you have saved your work frequently, you will be able to reopen the file, and it will appear the way that it did the last time you saved it.

Submitting the SAS Program for Execution

So far you have created your SAS program and saved it as a file on your diskette. It is now time to submit it for execution.
There are at least two ways to submit SAS programs. The first way is to use the Run menu. The Run menu is identified in Figure 3.11.

Figure 3.11. How to submit a SAS program for execution. The figure's callouts: one way to submit a SAS program is to click the Run menu, and then select Submit; another way is to click the Submit button on the toolbar (that is, the "running person" icon).
To submit a SAS program by using the Run menu, make sure that the Editor is in the foreground, and then make the following selections (go ahead and try this now):
➤  Run ➤ Submit
The second way to submit a SAS program is to click the Submit button on the toolbar. This is a row of buttons near the top of the screen, below the menu bar. These buttons provide shortcuts for performing a number of activities. One of these buttons is identified with the icon of a running person (see Figure 3.11). This is the Submit button. To submit your program using the toolbar, you would do the following (do not do this now): Put your mouse pointer on the running person icon, and click it once. This submits your program for execution (note that this button is identified with a running person because after you click it, your program will be running).
What Happens after Submitting a Program

In the preceding section, you submitted your SAS program for execution. When you submit a SAS program, it disappears from the Editor window. While the program is executing, a message appears above the menu bar for the Editor, and this message indicates which PROC (SAS procedure) is running. It usually takes only a few seconds for a typical SAS program to execute. Some programs might take longer, however, depending on the size of the data set being analyzed, the number of procedures being requested, the speed of the computer's processor, and other factors.

After you submit a SAS program and it finishes executing, typically you will experience one of three possible outcomes.
•  Outcome 1: Your program runs perfectly and without any errors. If this happens, the Editor window will disappear and SAS will automatically bring the Output window to the foreground. The results of your analysis will appear in this Output window.
•  Outcome 2: Part of your program runs correctly, and part of it has errors. If this happens, it is still possible that your Editor window will disappear and the Output window will come to the foreground. In the Output window, you will see the results from those sections of the program that ran correctly. This outcome can be misleading, however, because if you are not careful, you might never realize that part of your program had errors and did not run.
•  Outcome 3: Your program has errors and no results are produced. If this happens, the Output window will never appear; you will see only the Editor window.
Outcome 2 can mislead you into believing that there were no problems with your SAS program when there may, in fact, have been problems. The point is this: after you submit a SAS program, even if SAS brings up the Output window, you should always review all of the log file prior to reviewing the output file. This is the only way to be sure you have no errors or other problems in your program.

Reviewing the Contents of Your Log File

The SAS log is a file generated by SAS that contains the program you submitted (minus the data lines), along with notes, warnings, error messages, and other information pertaining to the execution of your program. It is important to always review your log file prior to reviewing your output file to verify that the program ran as planned.

If your program ran, the Output window is probably in the foreground now. If your program did not run, the Editor window is probably in the foreground. In either case, bring the Log window to the foreground by making the following selections:
➤  Window ➤ Log
Your Log window is now your active window. In many cases, your SAS log will be fairly long and only the last part of it will be visible in the Log window. This is a problem, because it is best to begin at the beginning of your log and review it from beginning to end. If you are currently at the end of your log file, click the scroll bar on the right side of the Log window, drag it up, and release it. The beginning of your log file is now at the top of your screen.

Scroll through the log and verify that you have no error messages. If your program executed correctly, your log file should look something like Log 3.1 (notice that the SAS log contains your SAS program minus the data lines).

NOTE: SAS initialization used:
      real time           14.54 seconds

1    OPTIONS LS=80 PS=60;
2    DATA D1;
3       INPUT TEST1 TEST2;
4    DATALINES;

NOTE: The data set WORK.D1 has 8 observations and 2 variables.
NOTE: DATA statement used:
      real time           1.59 seconds

13   ;
14   PROC MEANS DATA=D1;
15      TITLE1 'JANE DOE';
16   RUN;

NOTE: There were 8 observations read from the dataset WORK.D1.
NOTE: PROCEDURE MEANS used:
      real time           1.64 seconds

Log 3.1. Log file created by the current SAS program.
What Do I Do If My Program Did Not Run Correctly?

If your program ran correctly and there are no error messages in your SAS log, you should skim this section and then skip to the following section (titled "Printing the Log File on a Printer"). If your program did not run correctly, you should follow the instructions provided here.

If there is an error message in your log file, or if for any other reason your program did not run correctly, you have two options:
•  If there is a professor or a lab assistant available, ask this person to help you debug the program and resubmit it. After the program runs correctly, continue with the next section.
•  If there is no professor or lab assistant available, go to the section titled "Tutorial Part III: Submitting a Program with an Error," which is in the second half of this chapter. That section shows you how to correct and resubmit a program. Use the guidelines provided there to debug your program and resubmit it. When your program is running correctly, continue with the next section below.

Printing the Log File on a Printer

To get a hardcopy (paper copy) of your log file, the Log window must be in the foreground. If necessary, make the following selections to bring your Log window to the front:
➤  Window ➤ Log
Then select
➤  File ➤ Print
This gives you the Print dialog box, which should look something like Figure 3.12.
Figure 3.12. The Print dialog box. The figure points to the OK button: click OK to print your file.
Assuming that all of the settings in this dialog box are acceptable, you can print your file by completing the following step (do this now):
➤  Click the OK button at the bottom of this dialog box.
Your log file should print. Go to the printer and pick it up. If other people are using the SAS System at this time, make sure that you get your own personal log file and not a log file created by someone else (your log file should have your name in the TITLE1 statement, toward the bottom).

Reviewing the Contents of Your Output File

If your program ran correctly, the results of the analysis can be viewed in the Output window. To review your output, bring the Output window to the foreground by making the following selections:
➤  Window ➤ Output
Your Output window is in the foreground. If you cannot see all of your output, it is probably because you are at the bottom of the output file. If this is the case, scroll up to see all of the output page. If your program ran correctly (and you keyed your data correctly), the output should look something like Output 3.1.

                                 JANE DOE                                     1

                              The MEANS Procedure

 Variable    N            Mean         Std Dev         Minimum         Maximum
 ------------------------------------------------------------------------------
 TEST1       8       3.5000000       1.1952286       2.0000000       5.0000000
 TEST2       8       3.2500000       0.7071068       2.0000000       4.0000000
 ------------------------------------------------------------------------------

Output 3.1. SAS output produced by PROC MEANS.
Printing Your SAS Output

Now you should print your output file on a printer. You do this by following the same procedure used to print your log file. Make sure that your Output window is in the foreground and make the following selections:
➤  File ➤ Print
This gives you the Print dialog box. If all the settings are appropriate, complete this step:
➤  Click the OK button at the bottom of the Print dialog box.
Your output file should print. When you pick it up, verify that you have your own output and not the output file created by someone else (your name should appear at the top of the output if you typed your name in the TITLE1 statement, as you were directed).
Clearing the Log and Output Windows

After you finish an analysis, it is a good idea to clear the Log and Output windows prior to doing any subsequent analyses. If you perform subsequent analyses, the new log and output files will be appended to the bottom of the log and output files created by earlier analyses.

To clear the contents of your Output window, make sure that the Output window is in the foreground and make the following selections:

Î Edit Î Clear All

The contents of the Output window should disappear from the screen. Now bring the Log window to the front:

Î Window Î Log

Clear its contents by clicking

Î Edit Î Clear All

Returning to the Editor Window

Suppose that you now want to modify your SAS program by adding new data to the data set. Before doing this, you must bring the Editor to the foreground. An earlier section warned you that after you save a SAS program, the word "Editor" will no longer appear on the Window menu. In its place, you will see the name that you gave to your SAS program. In this session, you gave the name "DEMO.SAS" to your SAS program. This means that, in the Window menu, you will now find the word "Demo" where "Editor" used to be. To bring the Editor to the foreground, you should select "Demo."

Î Window Î Demo

The Editor window containing your SAS program now appears on your screen.

What If the Editor Window Is Empty?

When you bring the Editor window to the foreground, it is possible that your SAS program has disappeared. If this is the case, it might be because you did not set the Enhanced Editor Options in the way described earlier in this chapter. Figure 3.6 showed how these options should be set. Toward the bottom of the Enhanced Editor Options dialog box, one of the options is "Clear text on submit." The directions provided earlier indicated that this option should not be selected (that is, the box for this option should not be checked). If your SAS program disappeared after you submitted it, it might be because this box was checked. If this is the case, go to the Enhanced Editor Options dialog box now and verify that this option is not selected. You should deselect it if it is selected (see the previous section titled "Requesting Line Numbers and Other Options" for directions on how to do this).

If your SAS program has disappeared from the Editor window, you can retrieve it easily by using the Recall Last Submit command. If this is necessary, verify that your Editor window is in the foreground and make the following selections (do this only if your SAS program has disappeared):

Î Run Î Recall Last Submit

Your SAS program reappears in the Editor window.

Saving Your SAS Program on a Diskette (Again)

At the end of a SAS session, you will save the most recent version of your SAS program on a diskette. If you do this, you will be able to open this most recent version of the program the next time you want to do some additional analyses. You will now save the program on the diskette in drive A. Because this is not the first time you have saved it this session, you can use the Save command rather than the Save As command. Verify that your Editor window is the active window (and that your program is actually in this window), and make the following selections:

Î File Î Save

Ending Your SAS Session

You can now end your SAS session by selecting

Î File Î Exit

A dialog box appears with the message, "Are you sure you want to end the SAS session?" Click OK to end the SAS session.
Tutorial Part II: Opening and Editing an Existing SAS Program

Overview

This section shows you how to open the SAS program that you have saved on a floppy disk. It also shows you how to edit an existing program: how to insert new lines, delete lines, copy lines, and perform other activities that are necessary to modify a SAS program.

Restarting SAS

Often you will want to open a file (on a diskette) that contains a SAS program that you created earlier. This section and the three to follow show you how to do this. Verify that your computer and monitor are turned on and are not in sleep mode. Then complete the following steps:

Î Click the Start button that appears at the bottom of your initial screen. This displays a list of options, including the word Programs.

Î Select Programs. This reveals a list of programs on the computer. One of them is The SAS System for Windows V8.

Î Select (click) The SAS System for Windows V8. This starts the SAS System.

After a few seconds, you will see the initial SAS screen, which contains the Explorer window, the Log window, and the Editor window. Your screen should look something like Figure 3.13.
Figure 3.13. Modifying the initial SAS screen (the callouts identify the close buttons for the Explorer and Results windows and the maximize button for the Editor window).
Modifying the Initial SAS System Screen

Before opening an existing SAS file, you need to modify the initial SAS System screen so that it is easier to work with. Complete the following steps.

Close the Explorer window:

Î Click the close window button (the button marked with an X) in the upper-right corner of the Explorer window (see Figure 3.13).

This reveals the Results window, which was hidden beneath the Explorer window. Now close the Results window:

Î Click the close window button in the upper-right corner of the Results window.

The remaining visible windows now expand to fill the screen. Your screen should contain only the Log window (at the top of the screen) and the Editor window (at the bottom). Remember that the Editor window is identified by the words "Editor - Untitled1" in its title bar. To maximize the Editor window, complete the following step:

Î Click the maximize window button for the Editor window (the middle button, which contains a square).

The Editor window expands and fills your screen.
Setting Line Numbers and Other Options

To change the settings for line numbers and other options, use the Enhanced Editor Options dialog box. From the Editor's menu bar, make the following selections:

Î Tools Î Options Î Enhanced Editor

This opens the Enhanced Editor Options dialog box (see Figure 3.14).

Figure 3.14. Verifying that appropriate options are selected in the Enhanced Editor Options dialog box (Show line numbers selected, Indentation set to None, and Clear text on submit not selected).
If you began Part II of this tutorial immediately after completing Part I (and if you are working at the same computer), the options that you selected in Part I should still be selected. However, if you have changed computers, or if someone else has used SAS on your computer since you used it, your options might have been changed. For that reason, it is always a good idea to check at the beginning of each SAS session to ensure that your Editor options are correct. As explained in Part I, the Enhanced Editor Options dialog box consists of two components: General options and Appearance options. The upper-left corner of the Enhanced Editor Options dialog box contains one tab labeled General and one tab labeled Appearance. You should verify that General options is at the front of the stack (that is, General options is visible), and you should click the tab labeled General if it is not visible.
The Enhanced Editor Options dialog box contains a variety of different options, but we will focus on three of them. Here are the two options that should be selected at the beginning of a SAS session:

•  Verify that Show line numbers is selected (that is, verify that a check mark appears in the box for this option).

•  In the box labeled Indentation, verify that None is selected.

This option should not be selected:

•  Verify that Clear text on submit is not selected (that is, verify that a check mark does not appear in the box for this option).
Figure 3.14 shows the proper settings for the Enhanced Editor Options dialog box. If necessary, click inside the appropriate boxes so that the three options described earlier are set properly (you can disregard the other options). When all are correct,

Î Click the OK button at the bottom of the dialog box.

This returns you to the Editor window. A single line number (the number 1) appears in the upper left corner of the Editor window.

Reviewing the Names of Files on a Floppy Disk and Opening an Existing SAS Program

Earlier, you saved your SAS program on a 3.5-inch floppy disk in drive A. This section shows you how to open this program in the Editor. To begin this process, verify that your floppy disk is in drive A and that the Editor window is the active window. Then make the following selections:

Î File Î Open

The Open dialog box appears on your screen. The Open dialog box contains a number of smaller boxes with labels such as Look in, File name, and so forth. It should look something like Figure 3.15.
Figure 3.15. The Open dialog box (click the down arrow in the Look in box to display a list of other possible locations where SAS can look for your file; after you have selected the correct Look in location, the names of your files appear in the large window below it).

Toward the top of the Open dialog box is a box labeled Look in (see Figure 3.15). This box tells SAS where it should look to find a file. The default is to look in a folder named "V8" on your hard drive. You know that this is the case if the Look in box contains the icon of a folder and the words "V8" (although the default location might be different on your computer). You have to change this default so that SAS will look on your 3.5-inch floppy disk to find your program file. To accomplish this, complete the following steps:

Î On the right side of the Look in box is a down arrow. Click this down arrow to get other possible locations where SAS can look (see Figure 3.15).

Î Scroll up and down this list of possible locations (if necessary) until you see an entry that reads 3 1/2 Floppy (A:).

Î Click the entry that reads 3 1/2 Floppy (A:).

3 1/2 Floppy (A:) appears in the Look in box. When SAS searches for files, it will look on your floppy disk.
The contents of your disk now appear below the Look in box. One of these files is "DEMO," the file that you need to open. Remember that when you first saved your file, you gave it the full name "DEMO.SAS." However, this ".SAS" extension does not appear on the SAS program files that appear in the Open dialog box. To open this file, complete the following steps:

Î Click the file named DEMO. The name DEMO appears in the File name box below the large box.

Î Click the Open button.

The SAS program that you saved under the name DEMO.SAS appears in the Editor window. You can now modify it and submit it for execution.

What If I Don't See the Name of My File on My Disk?

In the Open dialog box, if you don't see the name of a file that you know is on your diskette, there are several possible reasons. The first thing you should do is verify that you are looking in drive A, and that your floppy disk has been inserted in the drive.

If everything appears to be correct with the Look in box, the second thing you should do is review the box labeled Files of type, which also appears in the Open dialog box. Verify that the entry inside this box is SAS Files (*.sas) (see Figure 3.15). If this entry does not appear in this box, it means that SAS is looking for the wrong types of files on your disk. Click the down arrow on the right side of the Files of type box to reveal other options. Select SAS Files (*.sas), and then check the window to see if DEMO now appears there.

Another possible solution is to instruct SAS to list the names of all files on your disk, regardless of type. To do this, go to the box labeled Files of type. On the right side of this Files of type box is a down arrow. Click the down arrow to reveal different options. Select the entry that says All Files (*.*). This reveals the names of all the files on your disk, regardless of the format in which they were saved.
General Comments about Editing SAS Programs

The following sections show you how to edit an existing SAS program. Editing a SAS program involves modifying it in some way: inserting new lines, copying lines, moving lines, and so forth. Keep the following points in mind as you edit files:

•  The Undo command. The Undo command allows you to undo (reverse) your most recent editing action. For example, assume that you select (highlight) a large section of your SAS program, and then you accidentally delete it. It is possible to use the Undo command and return your program to its prior state.

When you make a mistake, you can undo your most recent editing action (do not select this now; this is for illustration only):

Î Edit Î Undo

This returns your program to the state it was in prior to the most recent editing action, whether that action involved deleting, copying, cutting, or some other activity. You can select the Undo command multiple times in a row. This allows you to undo a sequence of changes that you have made since the last time the Save command was used.

•  Using the arrow keys. Somewhere on your keyboard are keys marked with directional arrows such as ↑←↓→. They enable you to move your cursor around the SAS program.

When you want to move your cursor to a lower line in a program that you have already written, you generally use the down arrow key (↓) rather than the ENTER key. This is because pressing the ENTER key creates a new blank line in the SAS program as it moves your cursor down. Thus, you should use the ENTER key only when you want to create new lines; otherwise, rely on the arrow keys.

The following sections show you how to perform a number of editing activities. It is important that you perform these editing functions as you move through the tutorial. As you read each section, modify your SAS program in the same way that the SAS program in the book is being modified.

Inserting a Single Line in an Existing SAS Program

When editing a SAS program in the Editor, you might want to insert a single line between two existing lines as follows:

•  Place the cursor at the end of the line that is to precede the new line.

•  Press the ENTER key once.
For example, suppose that you want to insert a new line after line 4 in the following program. To do this, you must first place the cursor at the end of that line (that is, at the end of the DATALINES statement that appears on line 4). Complete the following steps:

Î Use the mouse to place the I-beam at the end of the DATALINES statement on line 4.

Î Click once. The flashing cursor appears at the point where you clicked (if you missed and the cursor is not in the correct location, use your arrow keys to move it to the correct location).

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;▌
5   2 3
6   2 2
7   3 3

Î Press ENTER. A new blank line is inserted between existing lines 4 and 5, as shown here:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   ▌
6   2 3
7   2 2
8   3 3

Your cursor (the insertion point) is now in column 1 of the new line you have created. You can now type a new data line in the blank line. Complete the following step:

Î Type the numbers 6 and 7 on line 5, as shown here:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   2 3
7   2 2
8   3 3
Let's do this one more time. Again, insert a new blank line after the DATALINES statement:

Î Use the mouse to place the I-beam at the end of the DATALINES statement on line 4.

Î Click once to place the insertion point there.

Î Press the ENTER key once. This gives you a new blank line.

Î Now type the numbers 8 and 9 in the new line that you have created, as shown here:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   8 9
6   6 7
7   2 3
8   2 2
9   3 3

Inserting Multiple Lines

You follow essentially the same procedure to insert multiple lines, with one exception: After you have positioned your insertion point, you press ENTER multiple times rather than one time. For example, complete these steps to insert three new lines between existing lines 4 and 5:

Î Use the mouse to place the I-beam at the end of the DATALINES statement on line 4.

Î Click once to place the insertion point there.

Î Press ENTER three times.

This gives you three new blank lines, as shown here:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5
6
7   ▌
8   8 9
9   6 7
10  2 3
11  2 2
12  3 3
Next, you will type some new data on the three new lines you have created, starting with line 5.

Î Use your arrow keys to move the cursor up to line 5.

Î Now type the following values on lines 5–7, so that your data set looks like the following program (after you have typed the values on a particular line, use the arrow keys to move the cursor down to the next line):

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   2 2
6   3 3
7   4 4
8   8 9
9   6 7
10  2 3
11  2 2
12  3 3

Deleting a Single Line

There are at least two ways to delete a single line in a SAS program. One way is obvious: place the cursor at the end of the line and press the backspace key to delete the line, one character at a time. A second way (and the way that you will use here) is to click and drag to highlight the line, and then delete the entire line at once. This introduces you to the concept of "clicking and dragging," which is a very important technique to use when editing a SAS program.

For example, suppose that you want to delete line 5 of the following program (the line with "2 2" on it):

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   2 2
6   3 3
7   4 4
8   8 9
9   6 7
10  2 3
11  2 2
12  3 3
Complete the following steps:

Î Place your I-beam cursor at the beginning of the data on line 5. (This means place the I-beam to the immediate left of the first "2" on line 5; do not go too far to the left of the "2" or your I-beam will turn into an arrow––if this happens you have gone too far. This might take a little practice.)

Î Click once and hold the button down (do not release it yet).

Î While holding the button down, drag your mouse to the right so that the data on line 5 are highlighted in black. (This means that you should drag to the right until the "2 2" is highlighted in black. Do not drag your mouse up or down, or you might accidentally highlight additional lines of data.)

Î After the data are highlighted in black, release the button. The data should remain highlighted.

Your program should look something like this:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   2 2
6   3 3
7   4 4
8   8 9
9   6 7
10  2 3
11  2 2
12  3 3

To delete the line you have highlighted, complete the following step:

Î Press the Backspace (DELETE) key.

The highlighted data disappear, leaving only a blank line, as shown here:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   ▌
6   3 3
7   4 4
8   8 9
9   6 7
10  2 3
11  2 2
12  3 3
Your cursor is in column 1 of the newly blank line. To make the blank line disappear, complete the following step:

Î Press the Backspace (DELETE) key again.

Your program now appears as shown here:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   3 3
6   4 4
7   8 9
8   6 7
9   2 3
10  2 2
11  3 3

Deleting a Range of Lines

You follow a similar procedure to delete a range of lines, with one exception: When you click and drag, you will drag down (as well as to the right) so that you highlight more than one line. When you press the backspace key, all of the highlighted lines will be deleted. For example, suppose that you want to delete lines 5, 6, and 7 in your program:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   3 3
6   4 4
7   8 9
8   6 7
9   2 3
10  2 2
11  3 3

Complete the following steps:

Î Place your I-beam at the beginning of the data on line 5. (Again, this means place the I-beam to the immediate left of the first "3" on line 5; do not go too far to the left of the "3" or your I-beam will turn into an arrow––if this happens you have gone too far.)

Î Click once and hold the button down (do not release it yet).
Î While holding the button down, drag your mouse down and to the right so that the data on lines 5, 6, and 7 are highlighted in black.

Î After the data are highlighted in black, release the button. The lines remain highlighted.

Your program should look something like this:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   3 3
6   4 4
7   8 9
8   6 7
9   2 3
10  2 2
11  3 3

To delete the lines you have highlighted, complete this step:

Î Press the Backspace (DELETE) key once.

The highlighted lines disappear. After deleting these lines, it is possible that one blank line will be left, as shown on line 5 here:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   ❙
6   6 7
7   2 3
8   2 2
9   3 3

To delete the blank line,

Î Press the Backspace (DELETE) key again.

Your program now appears as shown here:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   2 3
7   2 2
8   3 3
Copying a Single Line into Your Program

Copying a single line involves these steps (do not do this yet):

1. Create a new blank line where the copied line is to be inserted.
2. Click and drag to highlight the line to be copied.
3. Pull down the Edit menu and select Copy.
4. Place the cursor at the point where the line is to be pasted.
5. Pull down the Edit menu and select Paste.

For example, suppose that you want to make a copy of line 5 and place the copy before line 6 in the following program:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   2 3
7   2 2
8   3 3

First, you must create a new blank line where the copied line is to be inserted. Complete the following steps (try this now):

Î Place the I-beam at the end of the data on line 5 (that is, after the "6 7" on line 5) and click once. This places your cursor to the right of the numbers "6 7."

Î Press ENTER once.

This creates a new blank line after line 5, as shown here:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6
7   2 3
8   2 2
9   3 3

Next you highlight the data to be copied. Complete the following steps:

Î Place your I-beam at the beginning of the data on line 5. (This means place the I-beam to the immediate left of the "6" on the "6 7" line.)

Î Click once and hold the button down (do not release it yet).
Î While holding the button down, drag your mouse to the right so that the data on line 5 are highlighted in black.

Î After the data are highlighted in black, release the button. The data remain highlighted.

With this done, your program should look something like this:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6
7   2 3
8   2 2
9   3 3

Now you must select the Copy command.

Î Edit Î Copy

Nothing appears to happen, but don't worry––the highlighted text has been copied to an invisible clipboard. Now place your cursor (the insertion point) at the beginning of line 6. Complete this step:

Î Place the I-beam in column 1 of line 6 and click.

Your program should look like this:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   ▌
7   2 3
8   2 2
9   3 3

Finally, you can take the material that you copied to the clipboard and paste it at the insertion point in your program. Make the following selections:

Î Edit Î Paste
The copied data now appear on line 6. Your program should look something like this:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   6 7
7   2 3
8   2 2
9   3 3

Copying a Range of Lines

To copy a range of lines, you follow the same procedure that you use to copy a single line, with one exception: When you click and drag, you drag down (as well as to the right) so that you highlight more than one line. When you select Paste, all of the highlighted lines will be copied. For example, assume that you want to copy lines 5-7 and place the copied lines before line 8 in the following program:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   6 7
7   2 3
8   2 2
9   3 3

First, you must create a new blank line where the copied lines are to be inserted. Complete the following steps:

Î Place the I-beam at the end of the data on line 7 (that is, after the "2 3" on line 7) and click once. This places your cursor to the right of the "2 3."

Î Press ENTER once.
This creates a new blank line after line 7, as shown here:

2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   6 7
7   2 3
8
9   2 2
10  3 3

Next you highlight the data to be copied. Complete these steps:

Î Place your I-beam at the beginning of the data on line 5. (This means place the I-beam to the immediate left of the "6" on the "6 7" line.)

Î Click once and hold the button down.

Î While holding the button down, drag your mouse down and to the right so that the data on lines 5–7 are highlighted in black.

Î After the lines are highlighted in black, release the button. The lines remain highlighted.

With this done, your program should look something like this:

2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   6 7
7   2 3
8
9   2 2
10  3 3

Now you must select the Copy command.

Î Edit Î Copy

Nothing appears to happen, but don't worry––the highlighted text has been copied to an invisible clipboard. Now you need to place your cursor (the insertion point) at the beginning of line 8:

Î Place the I-beam in column 1 of line 8 and click.
Your program should look like this:

2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   6 7
7   2 3
8   ▌
9   2 2
10  3 3

Finally, take the material that you copied to the clipboard and paste your selection at the insertion point in your program.

Î Edit Î Paste

The copied data now appear on (and following) line 8. Your program should look something like this:

2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   6 7
7   2 3
8   6 7
9   6 7
10  2 3
11  2 2
12  3 3

Moving Lines

To move lines, you follow the same procedure that you use to copy lines, with one exception: When you initially pull down the Edit menu, select Cut rather than Copy. For example, you follow these steps to move a range of lines (do not actually do this now):

1. Create a new blank line where the moved lines are to be inserted.
2. Click and drag to highlight the lines to be moved.
3. Pull down the Edit menu and select Cut.
4. Place the cursor at the point where the lines are to be pasted.
5. Pull down the Edit menu and select Paste.
When you are finished, there will be one blank line at the location where the moved lines used to be. You can delete this line in the usual way: Use your mouse to place the cursor in column 1 of that blank line, and press the backspace (delete) key.

Saving Your SAS Program and Ending the SAS Session

Now it is time to save the program on your floppy disk in drive A. Because you opened DEMO.SAS from drive A, drive A is now the default drive, so it is not necessary to assign it as the default drive. You can save your file using the Save command, rather than the Save As command. Verify that your Editor window is the active window, and select:

Î File Î Save

Now end your SAS session by selecting:

Î File Î Exit

This produces a dialog box that asks if you are sure you want to end the SAS session. Click OK in this dialog box, and the SAS session ends.
Tutorial Part III: Submitting a Program with an Error

Overview

In this section, you modify an existing SAS program so that it will produce an error when you submit it. This gives you the opportunity to learn the procedure that you will follow when debugging SAS programs with errors.

Restarting SAS

Complete the following steps:

Î Click the Start button that appears at the bottom of your initial screen. This displays a list of options, including the word Programs.

Î Select Programs. This reveals a list of programs on the computer. One of them is The SAS System for Windows V8.

Î Select The SAS System for Windows V8 and release the button.

Modifying the Initial SAS System Screen

Before opening an existing SAS file, modify the initial SAS screen so that it is easier to work with. Complete the following steps.

Close the Explorer window:

Î Click the close window button in the upper-right corner of the Explorer window (see Figure 3.13).

This reveals the Results window, which was hidden beneath the Explorer window. Now close the Results window:

Î Click the close window button in the upper-right corner of the Results window.

Your screen now contains the Log window and the Editor window. To maximize the Editor window, complete this step:

Î Click the maximize window button for the Editor window (the middle button, which contains a square; see Figure 3.4).

The Editor expands and fills your screen. Finally, you need to review the Enhanced Editor Options dialog box to verify that the appropriate options have been selected. To request this dialog box, make the following selections:

Î Tools Î Options Î Enhanced Editor

The upper-left corner of the Enhanced Editor Options dialog box contains one tab labeled General and one tab labeled Appearance. Verify that the General options tab is in the foreground (that is, General options is visible), and click the tab labeled General if it is not visible.

The Enhanced Editor Options dialog box contains a variety of different options, but we will focus on three of them. Here are the two options that should be selected at the beginning of a SAS session:

•  Verify that Show line numbers is selected (that is, verify that a check mark appears in the box for this option).

•  In the box labeled Indentation, verify that None is selected.

Here is the one option that should not be selected:

•  Verify that Clear text on submit is not selected (that is, verify that a check mark does not appear in the box for this option).

Figure 3.14 shows the proper settings for the Enhanced Editor Options dialog box. If necessary, click inside the appropriate boxes so that the three options described earlier are set properly (you can disregard the other options). When all are correct, complete the following step:

Î Click the OK button at the bottom of the dialog box.

This returns you to the Editor window.
Opening an Existing SAS Program from Your Floppy Disk

Verify that your 3.5-inch floppy disk is in drive A and that the Editor window is the active window. Then make the following selections:

Î File Î Open

The Open dialog box appears on your screen. Toward the top of the Open dialog box is a box labeled Look in. This box tells the SAS System where it should look to find a file. If this box does not contain 3 1/2 Floppy (A:), you will have to change it. If this is necessary, complete these steps:

Î On the right side of the Look in box is a down arrow. Click this down arrow to get other possible locations where the SAS System can look (see Figure 3.15).

Î Scroll up and down this list of possible locations (if necessary) until you see an entry that reads 3 1/2 Floppy (A:).

Î Click the entry that reads 3 1/2 Floppy (A:).

3 1/2 Floppy (A:) appears in the Look in box. The contents of your diskette now appear in the larger box below the Look in box. One of these files should be DEMO, the file that you need to open. To open this file, complete the following steps:

Î Click DEMO. The name DEMO appears in the File name box.

Î Click the Open button.

The SAS program that you saved in the preceding section appears in the Editor window. You are now free to modify it and submit it for execution.

Submitting a Program with an Error

You will now submit a program with an error in order to see how errors are identified and corrected. With the file DEMO.SAS opened in the Editor window, change the third line from the bottom so that it requests "PROC MEENS" instead of "PROC MEANS." This will produce an error message when SAS attempts to execute the program, because there is no procedure named PROC MEENS.
Here is the modified program; notice that the third line from the bottom now requests "PROC MEENS":

OPTIONS LS=80 PS=60;
DATA D1;
INPUT TEST1 TEST2;
DATALINES;
6 7
6 7
2 3
6 7
6 7
2 3
2 2
3 3
3 4
4 3
4 4
5 3
5 4
;
PROC MEENS DATA=D1;
TITLE1 'JANE DOE';
RUN;

After you make this change, submit the program for execution in the usual way:

Î Run Î Submit

This submits your program for execution.

Reviewing the SAS Log and Correcting the Error

After you submit the SAS program, it takes SAS a few seconds to finish processing it. However, after this processing is complete the Output window will not become the active window. The fact that your Output window does not appear indicates that your program did not run correctly. To determine what was wrong with the program, you need to review the Log window. Make the following selections:

Î Window Î Log
The log file that is created by your program should look similar to Log 3.2.

NOTE: SAS initialization used:
      real time           15.44 seconds

1    OPTIONS LS=80 PS=60;
2    DATA D1;
3    INPUT TEST1 TEST2;
4    DATALINES;

NOTE: The data set WORK.D1 has 13 observations and 2 variables.
NOTE: DATA statement used:
      real time           2.86 seconds

18   ;
19   PROC MEENS DATA=D1;
ERROR: Procedure MEENS not found.
20   TITLE1 'JANE DOE';
21   RUN;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE MEENS used:
      real time           0.22 seconds

Log 3.2. Log file created by a SAS program with an error.

When you go to the SAS Log window, you will usually see the last part of the log file. You should begin reviewing at the beginning of a log file when looking for errors or other problems. Scroll to the top of the log file by completing this step (try this now):

Î Scroll up to the top of the log file either by clicking and dragging the scroll bar or by clicking the up arrow in the scroll bar area of the Log window.

Starting at the top of the log file, begin looking for warning messages, error messages, or other signs of problems, and work your way down. It is important to always begin at the top, because a single error message early in a program can sometimes cause dozens of additional error messages later in the program; if you correct the first error, the remaining error messages will often disappear.

Toward the bottom half of Log 3.2, you can see an error message that reads "ERROR: Procedure MEENS not found." The SAS program statement causing this error is the statement that immediately precedes the error message in the SAS log. That SAS program statement, along with the resulting error message, is reproduced here:

19   PROC MEENS DATA=D1;
ERROR: Procedure MEENS not found.

This error message indicates that SAS does not have a procedure named MEENS. A subsequent statement in the log indicates that SAS stopped processing this step because of the error.
Obviously, the error is that PROC MEANS was incorrectly spelled as PROC MEENS. When you see an error like this in a log file, your first impulse might be to correct the error in the log file itself. But this will not work––you must correct the error in the SAS program, not in the log file. Before doing this, however, clear the text of the current log file before you resubmit the corrected program. If you do not clear the text of this log file, the next time you submit the SAS program, SAS will append a new log file to the bottom of the existing log file, and it will be difficult to review your current log.

To correct your error, first clear your Log window by selecting

Î Edit Î Clear All

Now return to your SAS program by making the Editor the active window:

Î Window Î Demo

The SAS program that you submitted reappears in the Editor:

OPTIONS LS=80 PS=60;
DATA D1;
INPUT TEST1 TEST2;
DATALINES;
6 7
6 7
2 3
6 7
6 7
2 3
2 2
3 3
3 4
4 3
4 4
5 3
5 4
;
PROC MEENS DATA=D1;
TITLE1 'JANE DOE';
RUN;

Now correct the error in the program:

Î If necessary, scroll down to the bottom of the program. Move your cursor down to the line that contains PROC MEENS and change this statement so that it reads PROC MEANS.

Now submit the program again:

Î Run Î Submit
If the error has been corrected (and if the program contains no other errors), the Output window appears after processing is completed. The results of PROC MEANS appear in this Output window, and these results should look similar to Output 3.2.

JANE DOE                                                                1

The MEANS Procedure

Variable    N          Mean       Std Dev       Minimum       Maximum
----------------------------------------------------------------------
TEST1      13     4.1538462     1.6251233     2.0000000     6.0000000
TEST2      13     4.3846154     1.8946619     2.0000000     7.0000000
----------------------------------------------------------------------

Output 3.2. SAS output produced by PROC MEANS after correcting the error.

If SAS does not take you to the Output window, it means that your program still contains an error. If this is the case, repeat the process described earlier:

1. Go to the Log window and identify the error.
2. Clear the Log window of text.
3. Go to the Editor window, which contains the SAS program.
4. Correct the error.
5. Resubmit the program.

Saving Your SAS Program and Ending This Session

If the SAS program ran correctly, the Output window should be in the foreground. Now you must make the Editor the active window before you can save your program. Make the following selections:

Î Window Î Demo

Your SAS program is visible in the Editor window. Now save the program on the disk in drive A:

Î File Î Save

You can end your session with the SAS System by selecting

Î File Î Exit

The computer asks if you are sure you want to end the SAS session. Click OK, and the SAS session ends.
To Learn More about Debugging SAS Programs

This section has summarized the steps that you follow when debugging a SAS program with an error. A more concise summary of these steps appears in "Finding and Correcting an Error in a SAS Program" later in this chapter. To learn more about debugging SAS programs, see "Common SAS Programming Errors That Beginners Make" on this book's companion Web site (support.sas.com/companionsites). Delwiche and Slaughter (1998) also provide guidance on how to find and correct errors in SAS programs.
Tutorial Part IV: Practicing What You Have Learned

Overview

In this section, you practice the skills that you have developed in the preceding sections. You open your existing SAS program, edit it, submit it, print the resulting log and output files, and perform other activities.

Restarting the SAS System and Modifying the Initial SAS System Screen

Complete the following steps to restart the SAS System:

Î Click the Start button that appears at the bottom of your initial screen. This displays a list of options, including the word Programs.

Î Select Programs. This reveals a list of programs on the computer. One of them is The SAS System for Windows V8.

Î Select The SAS System for Windows V8 and release your button.

This produces the initial SAS System screen. Next, close the Explorer window and the Results window and maximize the Editor window:

Î Click the close window button in the upper-right corner of the Explorer window (see Figure 3.13).

Î Click the close window button in the upper-right corner of the Results window.

Î Click the maximize window button for the Editor window (the middle button, which contains a square; see Figure 3.13).

Finally, if there is any possibility that someone has changed the options in the Enhanced Editor Options dialog box, you should review this dialog box. If this is necessary, make the following selections:

Î Tools Î Options Î Enhanced Editor

The upper-left corner of the Enhanced Editor Options dialog box contains one tab labeled General and one tab labeled Appearance. Verify that the General options tab is in the foreground (that is, General options is visible), and click the tab labeled General if it is not visible.

The Enhanced Editor Options dialog box contains a variety of different options, but we will focus on three of them. Here are the two options that should be selected at the beginning of a SAS session:

•  Verify that Show line numbers is selected (that is, verify that a check mark appears in the box for this option).

•  In the box labeled Indentation, verify that None is selected.

This option should not be selected:

•  Verify that Clear text on submit is not selected (that is, verify that a check mark does not appear in the box for this option).
Figure 3.14 shows the proper settings for the Enhanced Editor Options dialog box. If necessary, click inside the appropriate boxes so that the three options described earlier are set properly (you can disregard the other options). When all are correct,

Î Click the OK button at the bottom of the dialog box.

This returns you to the Editor window.

Reviewing the Names of Files on a Floppy Disk and Opening an Existing SAS Program

Verify that your 3.5-inch floppy disk is in drive A and that the Editor window is the active window. Then make the following selections:

Î File Î Open

The Open dialog box appears on your screen. Toward the top of the Open dialog box is a box labeled Look in. This box tells the SAS System where it should look to find a file. If this box does not contain 3 1/2 Floppy (A:), you will have to change it. If this is necessary, complete the following steps:

Î On the right side of the Look in box is a down arrow. Click this down arrow to get other possible locations where the SAS System can look (see Figure 3.15).

Î Scroll up and down this list of possible locations (if necessary) until you see an entry that reads 3 1/2 Floppy (A:).

Î Click the entry that reads 3 1/2 Floppy (A:).
3 1/2 Floppy (A:) appears in the Look in box. The contents of your diskette should now appear in the larger box below the Look in box. One of these files should be DEMO, the file that you need to open. Complete the following steps:

Î Click the file named DEMO. The name DEMO appears in the File name box.

Î Click the Open button.

The SAS program that you saved in the preceding section appears in the Editor window. You are now free to modify it and submit it for execution.

Practicing What You Have Learned

Now that the file DEMO.SAS is open in the Editor, you can practice what you have learned in this tutorial. When you are not sure about how to complete certain tasks, refer to earlier sections. Within the Editor window, complete the following steps to modify your file named DEMO:

1) Insert three new lines of data into the middle of your data set (somewhere after the DATALINES statement). Make up the numbers.
2) Delete two existing lines of data (choose any two lines of data).
3) Copy four lines of data.
4) Move three lines of data.
5) Save your program on the 3.5" diskette using the File Î Save As command. Give it a new name: DEMO2.SAS.
6) Submit the program for execution.
7) Review the contents of your log file on the screen.
8) Print your log file.
9) Clear your log file from the screen by using the Edit Î Clear All command.
10) Review the contents of your output file on the screen.
11) Print your output.
12) Clear your output file from the screen by using the Edit Î Clear All command.
13) Go back to the Editor window.
14) Add one more line of data to your program.
15) Save your program again, this time using the File Î Save command (not the Save As command).
Ending the Tutorial

You can end your SAS session by making the following selections:

Î File Î Exit

The computer asks if you are sure you want to end the SAS session. Click OK, and the SAS session ends. This completes the tutorial sections of this chapter.

Summary of Steps for Frequently Performed Activities

Overview

This section summarizes the steps that you follow when performing three common activities:

•  starting the SAS windowing environment

•  opening an existing SAS program from a floppy disk

•  finding and correcting an error in a SAS program.

Starting the SAS Windowing Environment

Verify that your computer and monitor are turned on and not in sleep mode. Make sure that your initial Windows screen appears (with the Start button at the bottom).

Î Click the Start button that appears at the bottom left of your initial screen. This displays a list of options.

Î Select Programs. This reveals a list of programs on the computer.

Î Select The SAS System for Windows V8. This produces the initial SAS System screen.

Next, close the Explorer window and the Results window, and maximize the Editor window:

Î Click the close window button in the upper-right corner of the Explorer window (see Figure 3.13).

Î Click the close window button in the upper-right corner of the Results window.

Î Click the maximize window button to maximize the Editor window.
Finally, if there is any possibility that someone has changed the options in the Enhanced Editor Options dialog box, you should review this dialog box. If this is necessary, make the following selections:

Î Tools Î Options Î Enhanced Editor

The upper-left corner of the Enhanced Editor Options dialog box contains one tab labeled General and one tab labeled Appearance. Verify that General options is in the foreground (that is, General options is visible), and click the tab labeled General if it is not visible.

The Enhanced Editor Options dialog box contains a variety of different options, but we will focus on three of them. Here are the two options that should be selected at the beginning of a SAS session:

•  Verify that Show line numbers is selected (that is, verify that a check mark appears in the box for this option).

•  In the box labeled Indentation, verify that None is selected.

Here is the one option that should not be selected:

•  Verify that Clear text on submit is not selected (that is, verify that a check mark does not appear in the box for this option).

Figure 3.14 shows the proper settings for the Enhanced Editor Options dialog box. If necessary, click the appropriate boxes so that the three options described earlier are set properly (you can disregard the other options). When all are correct, complete the following step:

Î Click the OK button at the bottom of the dialog box.

This returns you to the Editor window. You are now ready to begin typing a SAS program or open an existing SAS program from a diskette.

Opening an Existing SAS Program from a Floppy Disk

Verify that your floppy disk is in drive A and that the Editor window is the active window. Then make the following selections:

Î File Î Open

The Open dialog box appears on your screen. Toward the top of the Open dialog box is a box labeled Look in. This box tells SAS where it should look to find a file. If this box does not contain 3 1/2 Floppy (A:), you will have to change it by completing the following steps:
Î On the right side of the Look in box is a down arrow. Click this down arrow to get other possible locations where SAS can look (see Figure 3.15).

Î Scroll up and down this list of possible locations (if necessary) until you see an entry that reads 3 1/2 Floppy (A:).

Î Click the entry that reads 3 1/2 Floppy (A:).

3 1/2 Floppy (A:) appears in the Look in box. The contents of your disk should now appear in the larger box below the Look in box. One of these files is the file that you need to open. Remember that the file names in this box might not contain the ".SAS" suffix, even if you included this suffix in the file name when you first saved the file. This does not mean that there is a problem. To open your file, complete the following steps:

Î Click the name of the file that you want to open. This file name appears in the File name box.

Î Click the Open button.

The SAS program saved under that file name appears in the Editor window. You are now free to modify it and submit it for execution.

Finding and Correcting an Error in a SAS Program

After you submit a SAS program, one of two things will happen:

•  If SAS does not take you to the Output window (that is, if you remain at the Editor window), it means that your SAS program did not run correctly. You need to go to the Log window and locate the error (or errors) in your program.

•  If SAS does take you to the Output window, it still does not mean that the entire SAS program ran correctly. It is always a good idea to review your SAS log for warnings, errors, and other messages prior to reviewing the SAS output.

The point is this: After submitting a SAS program, you should always review the log file prior to reviewing the SAS output, either to verify that there are no errors or to identify the nature of those errors. This section summarizes the steps in this process.

After you have submitted the SAS program and processing has stopped, make the following selections to go to the Log window:

Î Window Î Log

When you go to the SAS Log window, what you will usually see is the last part of the log file. You must scroll to the top of the log file to see the entire log:
Î Scroll up to the top of the log file by either clicking and dragging the scroll bar or by clicking the up arrow in the scroll bar area of the Log window.

Now that you are at the top of the log file, begin reviewing it for warning messages, error messages, or other signs of problems. Begin at the top of the log file and work your way down. It is important to begin at the top of the log because a single error message early in a program can sometimes cause dozens of additional error messages later in the program; if you correct the first error, the dozens of remaining error messages often disappear.

If there are no warnings, errors, or other signs of problems, go to your output file:

Î Window Î Output

If your SAS log does contain error messages, try to find the cause of these errors. Begin with the first (earliest) error message in the SAS log. Remember that your SAS log always contains the statements that make up the SAS program (minus the data lines). Review the line of your SAS program that immediately precedes the first error message––are there any problems with this line (for example, a missing semicolon, a misspelled word)? If that line appears to be correct, review the line above it. Are there any problems with that line? Continue working backward, one line at a time, until you find the error.

After you have found the error, remember that you cannot correct the error in the SAS log. Instead, you must correct the error in the SAS program. First, clear all text in the existing SAS Log and SAS Output windows. This ensures that when you resubmit your SAS program, the new log and output will not be appended to the old. Complete the following steps to delete the existing log and output files:

Î Window Î Log Î Edit Î Clear All

Î Window Î Output Î Edit Î Clear All

You will now go to the Editor window (remember that the Window menu might contain the name that you gave to your SAS program, rather than the word "Editor," which appears here):

Î Window Î Editor

You should now edit your SAS program to correct the error that you identified earlier. After you have corrected the error, save the modified program:

Î File Î Save

Now submit the revised SAS program:

Î Run Î Submit

At this point, the process repeats itself. If the program runs, you should still go to the log file to verify that there are no errors or other signs of problems. If the program did not run, you will go to the log file to look for the error (or errors) in your program. Continue this process until your program runs without errors.
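To make this search strategy concrete, here is a sketch (not taken from the tutorial program) of one of the most common beginner errors: a missing semicolon. The ERROR message that SAS writes to the log is often attached to a statement that comes after the statement that actually contains the mistake, which is why you work backward from the first error message.

DATA D1                  /* mistake: the semicolon is missing here        */
INPUT TEST1 TEST2;       /* SAS reads this line as a continuation of the  */
DATALINES;               /* DATA statement, so the errors reported in the */
2 3                      /* log point at these later lines, not at the    */
2 2                      /* line where the semicolon was omitted          */
;
PROC MEANS DATA=D1;
RUN;

Adding the missing semicolon to the end of the DATA statement (DATA D1;) corrects the problem.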
Controlling the Size of the Output Page with the OPTIONS Statement

In completing this tutorial, all of the programs that you submitted contained the following OPTIONS statement:

OPTIONS LS=80 PS=60;

The OPTIONS statement is a global statement that can be used to change the value of system options and change how the SAS System operates. For example, you can use the OPTIONS statement to suppress the printing of page numbers, to suppress the printing of dates, and to perform other tasks. In this tutorial, the OPTIONS statement was used for one purpose: to specify the size of the printed page.

The OPTIONS statement presented earlier requests a small-page format for output. The LS=80 section of this statement requests that your output have a line size of 80 characters per line (the "LS" stands for "line size"). This setting makes your output easy to view on a narrow computer screen. The PS=60 section of this statement requests that your output have a page size of 60 lines per page (the "PS" stands for "page size").

These specifications are fairly standard. Specifying a line size of 80 and a page size of 60 is fine for most programs, but it is not optimal for SAS programs that provide a great deal of information on each page. For example, when performing a factor analysis or principal component analysis, it is better to use a larger format so that each page can contain more information. The printed output from these more sophisticated analyses is easier to read if the line size is 120 characters per line rather than 80 (of course, this assumes that you have access to a large-format printer that can print 120 characters per line). To print your output in the larger format, change the OPTIONS statement on the first line of the program. Specifically, set LS=120 (you can leave the PS=60). This revised OPTIONS statement is illustrated here:

OPTIONS LS=120 PS=60;

In the OPTIONS statements presented in this section, "LS" is used as an abbreviation for the keyword "LINESIZE," and "PS" is used as an abbreviation for the keyword "PAGESIZE." If you prefer, you can also write your OPTIONS statement with the full-length keywords, as shown here:

OPTIONS LINESIZE=80 PAGESIZE=60;
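As a further illustration of the other tasks mentioned above, the NODATE and NONUMBER system options suppress the date and the page number that SAS normally prints at the top of each output page. These options can be combined with the line-size and page-size options in a single statement; the following statement is simply one possible combination, not a statement used anywhere in this tutorial:

OPTIONS LS=120 PS=60 NODATE NONUMBER;

Because OPTIONS is a global statement, the settings it specifies remain in effect for the rest of the SAS session (or until another OPTIONS statement changes them).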
For More Information

The reference section at the end of this book lists a number of books that provide additional information about using the SAS windowing environment. Delwiche and Slaughter (1998) provide a concise but comprehensive introduction to using SAS Version 7. Two books by Jodie Gilmore (Gilmore, 1997; Gilmore, 1999) provide detailed instructions for using SAS in the Windows environment. These books can be ordered from SAS at (800) 727-3228 or (919) 677-8000.
Conclusion

For students who are using SAS for the first time, learning to use the SAS windowing environment is often the most challenging task. The tutorial in this chapter has introduced you to this application. When you are performing analyses with the SAS System, you should continue to refer to this chapter to refresh your memory on how to perform specific activities. For most students, using the SAS windowing environment becomes second nature within a matter of weeks.

The next chapter in this book, Chapter 4, "Data Input," describes how to get your data into a format that can be analyzed by SAS. Before you can perform statistical analyses on your data, you must first provide SAS with information about how many variables it has to read, what names should be given to those variables, whether the variables are numeric or character, along with other information. Chapter 4 shows you the basics for creating the types of data sets that are most frequently encountered when conducting research.
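As a brief preview of the kind of information Chapter 4 covers, the sketch below shows a minimal DATA step that reads one character variable and one numeric variable. The variable names and data values here are hypothetical examples, not material taken from Chapter 4 itself; the dollar sign ($) after NAME tells SAS that NAME is a character variable, while SCORE is read as a numeric variable by default.

OPTIONS LS=80 PS=60;
DATA D1;
INPUT NAME $ SCORE;      /* $ marks NAME as character; SCORE is numeric */
DATALINES;
JANE 95
JOHN 88
;
PROC MEANS DATA=D1;
RUN;

In this sketch, PROC MEANS reports descriptive statistics only for SCORE, because character variables such as NAME are not analyzed by PROC MEANS.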
Chapter 4: Data Input

Introduction
   Overview
   The Rows and Columns that Constitute a Data Set
   Overview of Three Options for Writing the INPUT Statement
Example 4.1: Creating a Simple SAS Data Set
   Overview
   The OPTIONS Statement
   The DATA Statement
   The INPUT Statement
   The DATALINES Statement
   The Data Lines
   The Null Statement
   The PROC Statement
Example 4.2: A More Complex Data Set
   Overview
   The Study
   Data Set to Be Analyzed
   The SAS DATA Step
   Some Rules for List Input
Using PROC MEANS and PROC FREQ to Identify Obvious Problems with the Data Set
   Overview
   Adding PROC MEANS and PROC FREQ to the SAS Program
   The SAS Log
   Interpreting the Results Produced by PROC MEANS
   Interpreting the Results Produced by PROC FREQ
   Summary
Using PROC PRINT to Create a Printout of Raw Data
   Overview
   Using PROC PRINT to Print Raw Data for All of the Variables in the Data Set
   Using PROC PRINT to Print Raw Data for a Subset of Variables in the Data Set
   A Common Misunderstanding Regarding PROC PRINT
The Complete SAS Program
Conclusion
Introduction

Overview

Raw data must be converted into a SAS data set before you can analyze it with SAS statistical procedures. In this chapter you learn how to create simple SAS data sets. Most of the chapter uses the format-free approach to data input because this is the simplest approach, and it will be adequate for the types of data sets that you will encounter in this guide. You will learn how to write an INPUT statement that reads both numeric variables and character variables. You will also learn how to represent missing data in the data set that you want to analyze. After you have typed your data, you should always analyze it with a few simple procedures to verify that SAS read your data set as you intended. This chapter shows you how to use PROC MEANS, PROC FREQ, and PROC PRINT to verify that your data set has been created correctly.

The Rows and Columns that Constitute a Data Set

Suppose that you administer a short questionnaire to nine subjects. The questionnaire asks the subjects to indicate their height (in inches), weight (in pounds), and age (in years). The results that you obtain for the nine subjects are summarized in Table 4.1.

Table 4.1
Data from the Height and Weight Study
________________________________
Subject        Height  Weight  Age
________________________________
1. Marsha        64      140    20
2. Charles       68      170    28
3. Jack          74      210    20
4. Cathy         60      110    32
5. Emmett        64      130    22
6. Marie         68      170    23
7. Cindy         65      140    22
8. Susan         65      140    22
9. Fred          68      160    22
________________________________
Table 4.1 is a data set: a collection of variables and observations that could be analyzed using a statistical package such as SAS. The table is organized pretty much the way a SAS data set is organized: each row in the data set (running horizontally from left to right) represents a different observation. Because you are doing research in which the individual person is the unit of analysis, each observation in your data set is a different person. You can
see that the first row of the data set presents data from a subject named Marsha, the next row presents data from a subject named Charles, and so on.

In contrast, each column in the data set (running vertically from top to bottom) represents a different variable. The first column ("Subject") provides each subject's name and number; the second column ("Height") provides each subject's height in inches; the third column ("Weight") provides each subject's weight in pounds, and so on. By reading across a given row, you can see where each subject scored on each variable. For example, by reading across the row for subject #1 (Marsha), you can see that she stands 64 inches in height, weighs 140 pounds, and is 20 years old.

After you have entered the data in Table 4.1 into a SAS program, you could analyze it with any number of statistical procedures. For example, you could find the mean score for the three quantitative variables (height, weight, and age). Below is an example of a SAS program that will do this (remember that you would not type the line numbers appearing on the left; these numbers are used here simply to identify the lines in the program):

 1   OPTIONS LS=80 PS=60;
 2   DATA D1;
 3   INPUT SUB_NUM
 4         HEIGHT
 5         WEIGHT
 6         AGE ;
 7   DATALINES;
 8   1 64 140 20
 9   2 68 170 28
10   3 74 210 20
11   4 60 110 32
12   5 64 130 22
13   6 68 170 23
14   7 65 140 22
15   8 65 140 22
16   9 68 160 22
17   ;
18   PROC MEANS DATA=D1;
19      VAR HEIGHT WEIGHT AGE;
20      TITLE1 'JANE DOE';
21   RUN;
Later sections of this chapter will discuss various parts of the preceding SAS program: the DATA statement on line 2, the INPUT statement on lines 3–6, and so on. For now, just focus on the data set itself, which appears on lines 8–16. Notice that this is identical to the data set of Table 4.1 except that the subjects' first names have been removed, and the columns have been moved closer to one another so that there is less space between the variables. You can see that line 8 still presents data for subject #1 (Marsha), line 9 still presents data for subject #2, and so on.
The point is this: in this guide, all data sets will be arranged in the same fashion. The rows (running horizontally from left to right) will represent different observations (typically different people), and the columns (running vertically from top to bottom) will represent different variables.

Overview of Three Options for Writing the INPUT Statement

The first step in analyzing variables with SAS involves reading them as part of a DATA step, and the heart of the DATA step is the INPUT statement. The INPUT statement is the statement in which you assign names to the variables that you will analyze. There are many ways to write an INPUT statement, and some ways are much more complex than others. A later section of this chapter will provide detailed guidelines for using one specific approach. However, following is a quick overview of the three most commonly used options.

List input. List input (also called "free-formatted input") is probably the simplest way to write an INPUT statement. This is the approach that will be taught in this guide. With list input, you simply give a name to each variable and tell SAS the order in which the variables appear on a data line (i.e., which variable comes first, which comes second, and so on). This is a good approach to use when you are first learning SAS, and when you have data sets with a small number of variables.

List input is also called free-formatted input because you do not have to put a variable into any particular column on the data line. You simply have to be sure that you leave at least one blank space between each variable, so that SAS can tell one variable from another. Additional guidelines on the use of free-formatted input will be presented in the section "Example 4.2: A More Complex Data Set." Here is an example of how you could write an INPUT statement that will read the preceding data set using the free-formatted approach:

   INPUT  SUB_NUM  HEIGHT  WEIGHT  AGE;
The preceding INPUT statement tells SAS that it will read four variables for each subject. On each data line, it will first read the subject's score on SUB_NUM (the participant's subject number). On the same data line, it will then read the subject's score on HEIGHT, the subject's score on WEIGHT, and finally the subject's score on AGE.

Column input. With column input, you assign a name to each variable, and tell SAS the exact columns in which the variable will appear. For example, you might indicate that the variable SUB_NUM will appear in column 1, the variable HEIGHT will appear in columns 3 through 4, the variable WEIGHT will appear in columns 6 through 8, and the variable AGE will appear in columns 10 through 11.
Column input is a useful approach when you are working with larger data sets that contain many variables. Although column input will not be covered in detail here, you can learn more about it in Schlotzhauer and Littell (1997, pp. 41–44). Here is an example of column input:

   INPUT  SUB_NUM   1
          HEIGHT    3-4
          WEIGHT    6-8
          AGE       10-11 ;
Formatted input. Formatted input is a more complex type of column input in which you again assign names to your variables and indicate the exact columns in which they will appear. Formatted input has the advantage of making it easy to input string variables, variables whose names begin with the same root and end with a series of numbers. For example, imagine that you administered a 50-item questionnaire to a large sample of people, and wanted to use the SAS variable name V1 to represent responses to the first question, the variable name V2 for responses to the second question, and so on. It would be very time consuming if you listed each of these variable names individually in the INPUT statement. However, if you used formatted input, you could create all 50 variables very easily with the following statement:

   INPUT  @1  (V1-V50)  (1.);
To learn about formatted input, see Cody and Smith (1997), and Hatcher and Stepanski (1994). Here is an example of how the current data set could be input using the formatted input approach:

   INPUT  @1   (SUB_NUM)  (1.)
          @3   (HEIGHT)   (2.)
          @6   (WEIGHT)   (3.)
          @10  (AGE)      (2.) ;
Example 4.1: Creating a Simple SAS Data Set

Overview

This section shows you how to create a simple data set that contains just three quantitative variables (i.e., the data set from Table 4.1). You will learn to use the various components that constitute the DATA step:

•  OPTIONS statement
•  DATA statement
•  INPUT statement
•  DATALINES statement
•  data lines
•  null statement.
For reference, here is the DATA step that was presented earlier in this chapter:

   OPTIONS LS=80 PS=60;
   DATA D1;
   INPUT SUB_NUM
         HEIGHT
         WEIGHT
         AGE ;
   DATALINES;
   1 64 140 20
   2 68 170 28
   3 74 210 20
   4 60 110 32
   5 64 130 22
   6 68 170 23
   7 65 140 22
   8 65 140 22
   9 68 160 22
   ;
The OPTIONS Statement

The OPTIONS statement is not really a formal part of the DATA step; it is actually a global command that can be used to set a variety of system options. In this section you will learn how to use the OPTIONS statement to control the size of your output page when it is printed. The syntax for the OPTIONS statement is as follows:

   OPTIONS  LS=n1  PS=n2 ;
LS in the preceding OPTIONS statement is an abbreviation for "LINESIZE." This option enables you to control the maximum number of characters that will appear on each line of output. In this OPTIONS statement, n1 is the maximum number of characters that you want to appear on each printed line.

PS in the OPTIONS statement is an abbreviation for "PAGESIZE." This option enables you to control the maximum number of lines that will appear on each page of output. In this OPTIONS statement, n2 is the maximum number of lines that you want to appear on each page.

For example, suppose that you want your SAS output to have a maximum of 80 characters (letters and numbers) per line, and you want to have a maximum of 60 lines per page. The following OPTIONS statement would request this:

   OPTIONS  LS=80  PS=60 ;
The preceding is a good format to request when your output will be printed on standard letter-size paper. However, if your output will be printed on a large-format printer with a long carriage, you may want to have 120 characters per line. The following statement would request this:

   OPTIONS  LS=120  PS=60 ;
The DATA Statement

You use the DATA statement to begin the DATA step and assign a name to the data set that you are creating. The syntax for the DATA statement is as follows:

   DATA  data-set-name ;
For example, if you want to assign the name D1 to the data set that you are creating, you would use the following statement:

   DATA  D1;
You can assign just about any name you like to a data set, as long as the name conforms to the following rules for a SAS data set name:

•  The name must begin with a letter or an underscore (_).
•  The remainder of the name can include either letters or numbers.
•  If you are using SAS System Version 6 (or earlier), the name can be a maximum of eight characters; if you are using Version 7 (or later), the name can be a maximum of 32 characters.
•  The name cannot include any embedded blanks. For example, "POL PRTY" is not an acceptable name, as it includes a blank space. However, "POL_PRTY" is acceptable, because an underscore ("_") connects the first part of the name ("POL") to the second part of the name ("PRTY").
•  The name cannot contain any special characters (e.g., "*," "#") or hyphens (-).
This guide typically uses the name D1 for a SAS data set simply because it is short and easy to remember.

The INPUT Statement

You use the INPUT statement to assign names to the variables that you will analyze, and to indicate the order in which the variables will appear on the data lines. Using free-formatted input, the syntax for the INPUT statement is as follows:

   INPUT  first-variable  second-variable  third-variable . . . last-variable ;
In the INPUT statement, the first variable that you name (first-variable above) should be the first variable that SAS will encounter when reading a data line from left to right. The second variable you name (second-variable above) should be the second variable that SAS will encounter when reading a data line, and so on. The INPUT statement from the preceding height and weight study is reproduced here:

   INPUT  SUB_NUM  HEIGHT  WEIGHT  AGE ;
You can assign almost any name to a SAS variable, provided that you adhere to the rules for creating a SAS variable name. The rules for creating a SAS variable name are identical to the rules for creating a SAS data set name, and these rules were discussed in the section "The DATA Statement." That is, a SAS variable name must begin with a letter or an underscore, it must not contain any embedded blanks, and so on.
The DATALINES Statement

The DATALINES statement tells SAS that the data set will begin on the next line. Here is the DATALINES statement from the preceding program, along with the first two lines of data:

   DATALINES;
   1 64 140 20
   2 68 170 28

You use the DATALINES statement when you want to include the data as a part of the SAS program. However, this is not the only way to input data with SAS. For example, it is also possible to keep your data in a separate file and refer to that file within your SAS program by using the INFILE statement (a brief sketch of this approach appears at the end of this section). This approach has the advantage of allowing your SAS program to remain relatively short. The data sets used in this guide are fairly short, however. Therefore, to keep things simple, this guide includes the data set as part of the example SAS program, using the DATALINES statement. To learn how to use the INFILE statement, see Cody and Smith (1997), and Hatcher and Stepanski (1994, pp. 56–58).

The Data Lines

The data lines should be placed between the DATALINES statement (described above) and the null statement (to be described in the following section). Below are the data lines from the height and weight study, preceded by the DATALINES statement, and followed by the null statement (the semicolon at the end):

   DATALINES;
   1 64 140 20
   2 68 170 28
   3 74 210 20
   4 60 110 32
   5 64 130 22
   6 68 170 23
   7 65 140 22
   8 65 140 22
   9 68 160 22
   ;

The data sets in this guide are short and simple, with only one line of data for each subject. This should be adequate when you have collected data on only a few variables. When you collect data on a large number of variables, however, it will be necessary to use more than one line of data for each subject. This will require a more sophisticated approach to data input than the format-free approach used here. To learn about these more advanced approaches, see Hatcher and Stepanski (1994, pp. 31–51).
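For reference, here is a minimal sketch of the INFILE approach mentioned above. It is not used elsewhere in this guide, and the file path shown is purely hypothetical; you would substitute the location of your own raw data file:

   DATA D1;
      INFILE 'C:\MYDATA\HTWT.DAT';   /* hypothetical path to an external raw data file */
      INPUT SUB_NUM
            HEIGHT
            WEIGHT
            AGE ;
   RUN;

With this approach, the data lines themselves are stored in the external file rather than typed after a DATALINES statement.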
The Null Statement

The null statement is the shortest statement in SAS programming: it consists simply of a line with a semicolon, as shown here:

   ;

The null statement appears on the line following the end of the data set. It tells SAS that the data lines have ended. Here are the last two lines of the preceding data set, followed by the null statement:

   8 65 140 22
   9 68 160 22
   ;

Make sure that you place this semicolon by itself on the first line following the end of the data lines. A mistake that is often made by new SAS users is to instead place it at the end of the last line of data, as shown here:

   8 65 140 22
   9 68 160 22;

Do not do this; placing the semicolon at the end of the last line of data will usually result in an error message. Make sure that you always place it alone on the first line following the last line of data.

The PROC Statement

The null statement, described in the previous section, tells SAS that the DATA step has ended. When the DATA step is complete, you can then request statistical procedures that will analyze your data set. You request these statistical procedures using PROC statements. For example, below is a reproduction of (a) the last two data lines for the height and weight study, (b) the null statement, and (c) the PROC MEANS statement that tells SAS to compute the means and other descriptive statistics for three variables in the data set:

   8 65 140 22
   9 68 160 22
   ;
   PROC MEANS DATA=D1;
      VAR HEIGHT WEIGHT AGE;
      TITLE1 'JANE DOE';
   RUN;
This guide shows you how to use a variety of PROC statements to request descriptive statistics, correlations, t tests, analysis of variance, and other statistical procedures.
Example 4.2: A More Complex Data Set

Overview

The preceding section was designed to provide the "big picture" regarding the SAS DATA step. Now that you understand the fundamentals, you are ready to learn some of the details. This section describes a fictitious study from the discipline of political science, and shows you how to input the data that might be obtained from such a study. It shows you how to write a simple SAS program that will handle numeric variables, character variables, and missing data.

The Study

Suppose that you are interested in identifying variables that predict the size of financial donations that people make to political parties. You develop the following questionnaire:

1. Would you describe yourself as being generally conservative, generally liberal, or somewhere between these two extremes? (Please circle the number that represents your orientation)

   Generally Conservative   1   2   3   4   5   6   7   Generally Liberal
2. Would you like to see the size of the federal government increased or decreased?

   Greatly Decreased   1   2   3   4   5   6   7   Greatly Increased
3. Would you like to see the federal government assume an increased role or a decreased role in providing health care to our citizens?

   Decreased Role   1   2   3   4   5   6   7   Increased Role
4. What is your political party? (Check one)

   ____ Democrat   ____ Republican   ____ Other
5. What is your sex?

   ____ Female   _____ Male
6. What is your age?  ___________ years old

7. During the past year, how much money have you donated to your political party?  $ ________________
Data Set to Be Analyzed

The table of data. You administer this questionnaire to 11 people. Their responses are reproduced in Table 4.2.

Table 4.2
Data from the Political Donation Study
_______________________________________________________________
              Responses to
               questions:
              ______________   Political
Subject       Q1   Q2   Q3     party       Sex   Age   Donation
_______________________________________________________________
01. Marsha     7    6    5     D           F      32       1000
02. Charles    2    2    3     R           M       .          0
03. Jack       3    4    3     .           M      45        100
04. Cindy      6    6    5     .           F      20          .
05. Cathy      5    4    5     D           F      31          0
06. Emmett     2    3    1     R           M      54          0
07. Edward     2    1    3     .           M      21        250
08. Eric       3    3    3     R           M      43          .
09. Susan      5    4    5     D           F      32        100
10. Freda      3    2    .     R           F      18          0
11. Richard    3    6    4     R           M      21         50
_______________________________________________________________
As was the case with the height and weight data set presented earlier, the horizontal rows of Table 4.2 represent individual subjects, and the vertical columns represent different variables. The first column is headed "Subject," and below this heading you will find a subject number (e.g., "01") and a first name (e.g., "Marsha") for each subject. Notice that the subject numbers are now two-digit numbers ranging from "01" to "11." Numbers that normally would be single-digit numbers (such as "1") have been converted to two-digit numbers such as "01". This will make it easier to keep columns of numbers lined up properly when you are typing these subject numbers as part of a SAS data set.

The first row presents questionnaire responses from subject #1, Marsha. Reading from left to right, you can see that Marsha

•  circled a "7" in response to Question 1
•  circled a "6" in response to Question 2
•  circled a "5" in response to Question 3
•  indicated that she is a democrat (this is reflected by the "D" in the column "Political party")
•  indicated that she is a female (reflected by the "F" in the column headed "Sex")
•  is 32 years old
•  donated $1000 to her party (reflected by the "1000" in the "Donation" column).
The remaining rows of the table can be interpreted in the same way.

Using periods to represent missing data. The next-to-last column in Table 4.2 is "Age," and this column indicates each subject's age in years. For example, you can see that subject #1 (Marsha) is 32 years old. Where the row for subject #2 (Charles) intersects with the column headed "Age," you can see a period ("."). In this book, periods will be used to represent missing data. In this case, the period for subject #2 means that you do not have data on the Age variable for this subject. In conducting questionnaire research, you will often obtain missing data when subjects fail to complete certain questionnaire items.

When you review Table 4.2, you can see that there are missing data on other variables in addition to Age. For example,

•  Subject #3 (Jack) and subject #4 (Cindy) each have missing data for the Political party variable.
•  Subject #4 (Cindy) also has missing data on Donation.
•  Subject #10 (Freda) has missing data on Q3.

There are a few other periods in Table 4.2, and each of these periods similarly represents missing data. Later, when you write your SAS program, you will again use periods to represent missing data within the DATA step. When SAS reads a data line and encounters a period as a value, it interprets it as missing data.
The SAS DATA Step

Following is the DATA step of a SAS program that contains information from Table 4.2. You can see that all of the variables in Table 4.2 are included in this SAS data set, except for subjects' first names (such as "Marsha"). However, subject numbers such as "01" and "02" have been included as the first variable in the data set.

 1   OPTIONS LS=80 PS=60;
 2   DATA D1;
 3   INPUT SUB_NUM
 4         Q1
 5         Q2
 6         Q3
 7         POL_PRTY $
 8         SEX $
 9         AGE
10         DONATION ;
11   DATALINES;
12   01 7 6 5 D F 32 1000
13   02 2 2 3 R M  .    0
14   03 3 4 3 . M 45  100
15   04 6 6 5 . F 20    .
16   05 5 4 5 D F 31    0
17   06 2 3 1 R M 54    0
18   07 2 1 3 . M 21  250
19   08 3 3 3 R M 43    .
20   09 5 4 5 D F 32  100
21   10 3 2 . R F 18    0
22   11 3 6 4 R M 21   50
23   ;
The INPUT statement appears on lines 3–10 of the preceding program. It assigns the following SAS variable names:

•  The SAS variable name SUB_NUM will be used to represent each participant's subject number.
•  The SAS variable name Q1 will be used to represent subjects' responses to Question 1.
•  The SAS variable name Q2 will be used to represent subjects' responses to Question 2.
•  The SAS variable name Q3 will be used to represent subjects' responses to Question 3.
•  The SAS variable name POL_PRTY will be used to represent subjects' political party.
•  The SAS variable name SEX will be used to represent subjects' sex.
•  The SAS variable name AGE will be used to represent subjects' age.
•  The SAS variable name DONATION will be used to represent the size of subjects' donations to political parties.
The data set appears on lines 12–22 of the preceding program. You can see that it is identical to the data presented in Table 4.2, except that subject names have been omitted, and the columns of data have been moved together so that only one blank space separates each variable.

Some Rules for List Input

The list approach to data input is probably the easiest way to write an INPUT statement. However, there are a number of rules that you must observe to ensure that your data are read correctly by SAS. The most important of these rules are presented here.

The variables must appear on the data lines in the same sequence that they are listed in the INPUT statement. In the INPUT statement of the preceding program, the SAS variables were listed in this order: SUB_NUM Q1 Q2 Q3 POL_PRTY SEX AGE DONATION. This means that the variables must appear in exactly the same order on the data lines following the DATALINES statement.

Each variable on the data lines must be separated by at least one blank space. Below are the first three data lines from the preceding program:

   DATALINES;
   01 7 6 5 D F 32 1000
   02 2 2 3 R M  .    0
   03 3 4 3 . M 45  100

The first data line was for subject #1 (Marsha), and so the first value on her line is "01" (her subject number). Marsha circled a "7" for Question 1, a "6" for Question 2, and so forth. It was necessary to leave one blank space between the "7" and the "6" so that SAS would read them as two separate values, rather than a single value of "76."

You will notice that the variables in the data lines of the preceding program were lined up so that each variable formed a neat, orderly column. Technically, this is not necessary with list input, but it is recommended as it increases the likelihood that you will have at least one blank space between each variable and you will not make any other errors in typing your data. When you have a large number of variables, it becomes awkward to leave one blank space between each variable. In these cases, it is better to enter each variable without the blank spaces and to use either the column input approach or the formatted input approach instead of the list approach to entering data. See Cody and Smith (1997) or Hatcher and Stepanski (1994) for details.
Each missing value must be represented by a single period. Data from the first three subjects in Table 4.2 are again reproduced here:

_______________________________________________________________
              Responses to
               questions:
              ______________   Political
Subject       Q1   Q2   Q3     party       Sex   Age   Donation
_______________________________________________________________
01. Marsha     7    6    5     D           F      32       1000
02. Charles    2    2    3     R           M       .          0
03. Jack       3    4    3     .           M      45        100
_______________________________________________________________
You can see that there are some missing data in this table. For example, consider the second line of the table, which presents questionnaire responses for subject #2, Charles: in the column for "Age," there is a single period (.) where you would expect to find Charles' age. Similarly, the third line of the table presents responses for subject #3, Jack. You can see that there is a period for Jack in the "Political party" column.

If you are using the list approach to input, it is very important that you use a single period (.) to represent each instance of missing data when typing your data lines. As was mentioned earlier, SAS recognizes a single period as the symbol for missing data. For example, here again are the first three lines of data from the preceding SAS program:

   DATALINES;
   01 7 6 5 D F 32 1000
   02 2 2 3 R M  .    0
   03 3 4 3 . M 45  100

The seventh variable (from the left) in this data set is the AGE variable. You can see that, on the second line of data, a period appears in the column for the AGE variable. The second line of data is for subject #2, Charles, and this period tells SAS that you have missing data on the AGE variable for Charles. Other periods in the data set may be interpreted in the same way.

It is important that you use only one period for each instance of missing data; do not, for example, use two periods simply because the relevant variable occupies two columns. As an illustration, the following lines show an incorrect way to indicate missing data for subject #2 on the age variable (the next-to-last variable):

   DATALINES;
   01 7 6 5 D F 32 1000
   02 2 2 3 R M ..    0
   03 3 4 3 . M 45  100

In the above incorrect example, the programmer keyed two periods in the place where the second subject's age would normally be typed. But this will cause problems: because there are two periods, SAS will assume that there is missing data on two variables: AGE, as well
as DONATION (the variable next to AGE). The point is simple: use a single period to represent a single instance of missing data, regardless of how many columns the variable occupies.

In the INPUT statement, use the $ symbol to identify character variables. All of the variables discussed in this guide are either numeric variables or character variables. Numeric variables consist exclusively of numbers; they do not contain any letters of the alphabet or any special characters (symbols such as *, %, #). In the preceding data set, AGE was an example of a numeric variable because it could assume only numeric values such as 32, 45, and 20. In contrast, character variables may consist of letters of the alphabet, special characters, or numbers. In the preceding data set, POL_PRTY was an example of a character variable because it could assume the values "D" (for democrats) or "R" (for republicans). SEX was also a character variable because it could assume the values "F" and "M."

By default, SAS assumes that all of your variables will be numeric variables. If a particular variable is a character variable, you must indicate this in your INPUT statement. You do this by placing the dollar symbol ($) after the name of the variable in the INPUT statement. Leave at least one blank space between the name of the variable and the $ symbol. For example, the INPUT statement from the preceding program is again reproduced here:

   INPUT  SUB_NUM
          Q1
          Q2
          Q3
          POL_PRTY $
          SEX $
          AGE
          DONATION ;
In this program, the SAS variables Q1, Q2, and Q3 are numeric variables, so the $ symbol is not placed next to them. However, POL_PRTY is a character variable, and so the $ appears next to it. The same is true for the SEX variable.

If you are using the column input approach, you should type the $ symbol before indicating the columns in which the variable will appear. For example, here is the way the preceding INPUT statement would be typed if you were using the column input approach:

   INPUT  SUB_NUM     1
          Q1          4
          Q2          6
          Q3          8
          POL_PRTY $  10
          SEX      $  12
          AGE         14-15
          DONATION    17-20 ;
The preceding statement tells SAS that SUB_NUM appears in column 1, Q1 appears in column 4, Q2 appears in column 6, Q3 appears in column 8, POL_PRTY appears in column 10, and so on. The $ symbols next to POL_PRTY and SEX inform SAS that these variables are character variables.

Limit the values of character variables to eight characters. When using the format-free approach to inputting data, a value of a character variable can be no more than eight characters in length. Remember that the values are the actual entries that appear in the data lines. With a numeric variable, a value is usually the "score" that the subject displays on the variable. For example, the numeric variable AGE could assume values such as "32," "45," "20," and so on. With a character variable, a value is usually a name or an abbreviation consisting of letters or symbols. For example, the character variable POL_PRTY could assume the values "D" or "R." The character variable SEX could assume the values "F" or "M."

Suppose that you wanted to create a new character variable called NAME to include your subjects' first names. The values of this variable would be the subjects' first names (such as "Marsha"), and you would have to ensure that no name was over eight letters in length. Now, suppose that you drop the SUB_NUM variable, which assigns numeric subject numbers to each subject (such as "01," "02," and so on). Then you decide to replace SUB_NUM with your new character variable called NAME, which will consist of your subjects' first names. This NAME variable would be the first variable on each data line. Here is the INPUT statement for this revised program, along with the first few data lines:

   INPUT  NAME $
          Q1
          Q2
          Q3
          POL_PRTY $
          SEX $
          AGE
          DONATION ;
   DATALINES;
   Marsha  7 6 5 D F 32 1000
   Charles 2 2 3 R M  .    0
   Jack    3 4 3 . M 45  100

Notice that each value of NAME in the preceding program (such as "Marsha") is eight characters in length or shorter. This is acceptable.
However, the following data lines would not be acceptable, because the values of NAME are over eight characters in length:

   DATALINES;
   Elizabeth   7 6 5 D F 32 1000
   Christopher 2 2 3 R M  .    0
   Francisco   3 4 3 . M 45  100

Remember also that the value of a character variable must not contain any embedded blanks. This means, for example, that you cannot have a blank space in the middle of a name, as is done with the following unacceptable data line:

   Betty Lou  7 6 5 D F 32 1000
Avoid using hyphens in variable names. When listing SAS variable names in the INPUT statement, you should avoid creating any SAS variable names that include a hyphen, such as "AGE-YRS." This is because SAS usually reads a variable name containing a hyphen as a string variable (string variables were discussed in the section "Overview of Three Options for Writing the INPUT Statement"). Students learning SAS programming for the first time will sometimes write a SAS variable name that includes a hyphen, not realizing that this will cause SAS to search for a string variable. The result is often an error message and confusion.

Instead of using hyphens, it is good practice to use an underscore ("_") in SAS variable names. If you use an underscore, SAS will assume that the variable is a regular SAS variable, and not a string variable. For example, suppose that one of your variables is "age in years." You should not use the following SAS variable name to represent this variable, because SAS will interpret it as a string variable:

   AGE-YRS

Instead, you can use an underscore in the variable name, like this:

   AGE_YRS
Using PROC MEANS and PROC FREQ to Identify Obvious Problems with the Data Set

Overview

The DATA step is now complete, and you are finally ready to analyze the data set you have entered. This section shows you how to use two SAS procedures to analyze the data set:

•  PROC MEANS, which requests that the means and other descriptive statistics be computed for the numeric variables
•  PROC FREQ, which creates frequency tables for either numeric or character variables.
This section will show you the basic information that you need to know in order to use these two procedures. PROC MEANS and PROC FREQ are illustrated here so that you can perform some simple analyses to help verify that you created your data set correctly. A more detailed treatment of PROC MEANS and PROC FREQ will appear in the chapters to follow.

Adding PROC MEANS and PROC FREQ to the SAS Program

The syntax. Here is the syntax for requesting PROC MEANS and PROC FREQ:

   PROC MEANS  DATA=data-set-name ;
      VAR  variable-list ;
      TITLE1  ' your-name ' ;
   PROC FREQ  DATA=data-set-name ;
      TABLES  variable-list ;
   RUN;

In this guide, syntax is a template for a section of a SAS program. When you use syntax for guidance in writing a SAS program, you should adhere to the following guidelines:

•  If certain words are presented in uppercase type (capital letters) in the syntax, you should type those same words in your SAS program.
•  If certain words are presented in lowercase type in the syntax, you should not type those words in your SAS program. Instead, you should substitute the data set names, variable names, or key words that are appropriate for your specific analysis.
For example:

   PROC MEANS  DATA=data-set-name ;
In the preceding line, PROC MEANS and DATA= are printed in uppercase type. Therefore, you should type these words in your program just as they appear in the syntax. However, the words "data-set-name" appear in lowercase italics. Therefore, you will not type the words "data-set-name." Instead, you will type the name of the data set that you wish to analyze in your specific analysis. For example, if you wish to analyze a data set that is named D1, you would write the PROC MEANS statement this way in your SAS program:

   PROC MEANS  DATA=D1;
Most of the chapters in this guide will include syntax for performing different tasks with SAS. In each instance, you should follow the guidelines presented above for using the syntax.

Variables that you will analyze. With the preceding syntax, the entry "variable-list" that appears with the VAR and TABLES statements refers to the list of variables that you want to analyze. In analyzing data from the political donation study, suppose that you will use PROC MEANS to analyze your numeric variables (such as Q1 and AGE), and PROC FREQ to analyze your character variables (POL_PRTY and SEX).
The SAS program. Following is the entire program for analyzing data from the political donation study. This time, statements have been appended to the end of the program to request PROC MEANS and PROC FREQ. Notice how the names of actual variables have been inserted in the locations where "variable-list" had appeared in the syntax that was presented above.

 1   OPTIONS LS=80 PS=60;
 2   DATA D1;
 3   INPUT SUB_NUM
 4         Q1
 5         Q2
 6         Q3
 7         POL_PRTY $
 8         SEX $
 9         AGE
10         DONATION ;
11   DATALINES;
12   01 7 6 5 D F 32 1000
13   02 2 2 3 R M  .    0
14   03 3 4 3 . M 45  100
15   04 6 6 5 . F 20    .
16   05 5 4 5 D F 31    0
17   06 2 3 1 R M 54    0
18   07 2 1 3 . M 21  250
19   08 3 3 3 R M 43    .
20   09 5 4 5 D F 32  100
21   10 3 2 . R F 18    0
22   11 3 6 4 R M 21   50
23   ;
24   PROC MEANS DATA=D1;
25      VAR Q1 Q2 Q3 AGE DONATION;
26      TITLE1 'JOHN DOE';
27   RUN;
28   PROC FREQ DATA=D1;
29      TABLES POL_PRTY SEX;
30   RUN;
Lines 24–27 of this program contain the statements that request the MEANS procedure. Line 25 contains the VAR statement for PROC MEANS. You use the VAR statement to list the variables to be analyzed. You can see that this statement requests that PROC MEANS be performed on Q1, Q2, Q3, AGE, and DONATION. Remember that you may list only numeric variables in the VAR statement for PROC MEANS; you may not list character variables (such as POL_PRTY or SEX).
Lines 28–30 of this program contain the statements that request the FREQ procedure. Line 29 contains the TABLES statement for PROC FREQ. You use this statement to list the variables for which frequency tables will be produced. You can see that PROC FREQ will be performed on POL_PRTY and SEX. In the TABLES statement for PROC FREQ, you may list either character variables or numeric variables.

The SAS Log

After the preceding program has been submitted and executed, you should first review the SAS log file to verify that it ran without error. The log file for the preceding program is reproduced as Log 4.1.

   NOTE: SAS initialization used:
         real time           18.56 seconds

   1     OPTIONS LS=80 PS=60;
   2     DATA D1;
   3     INPUT SUB_NUM
   4           Q1
   5           Q2
   6           Q3
   7           POL_PRTY $
   8           SEX $
   9           AGE
   10          DONATION ;
   11    DATALINES;

   NOTE: The data set WORK.D1 has 11 observations and 8 variables.
   NOTE: DATA statement used:
         real time           1.43 seconds

   23    ;
   24    PROC MEANS DATA=D1;
   25       VAR Q1 Q2 Q3 AGE DONATION;
   26       TITLE1 'JOHN DOE';
   27    RUN;

   NOTE: There were 11 observations read from the dataset WORK.D1.
   NOTE: PROCEDURE MEANS used:
         real time           1.63 seconds

   28    PROC FREQ DATA=D1;
   29       TABLES POL_PRTY SEX;
   30    RUN;

   NOTE: There were 11 observations read from the dataset WORK.D1.
   NOTE: PROCEDURE FREQ used:
         real time           0.61 seconds

Log 4.1. Log file from the political donation study.
Remember that the SAS log consists of your SAS program (minus the data), along with notes, warnings, and error messages generated by SAS as it executes your program. Lines
1–11 in Log 4.1 reproduce the DATA step of your SAS program. Immediately after this, the following note appeared in the log window:

   NOTE: The data set WORK.D1 has 11 observations and 8 variables.
This note indicates that your SAS data set (named D1) has 11 observations and 8 variables. This is a good sign, because you intended to input data from 11 subjects on 8 variables. The remainder of the SAS log reveals no evidence of any problems in the SAS program, and so you can proceed to the SAS output file.

Interpreting the Results Produced by PROC MEANS

The SAS Output. The output file for the current analysis consists of two pages. Page 1 contains the results of PROC MEANS, and page 2 contains the results of PROC FREQ. The results of PROC MEANS are reproduced in Output 4.1.

                                    JOHN DOE                                  1

                              The MEANS Procedure

 Variable     N            Mean         Std Dev         Minimum         Maximum
 -------------------------------------------------------------------------------
 Q1          11       3.7272727       1.7372915       2.0000000       7.0000000
 Q2          11       3.7272727       1.7372915       1.0000000       6.0000000
 Q3          10       3.7000000       1.3374935       1.0000000       5.0000000
 AGE         10      31.7000000      12.2750877      18.0000000      54.0000000
 DONATION     9     166.6666667     323.0711996               0         1000.00
 -------------------------------------------------------------------------------

Output 4.1. Results of PROC MEANS, political donation study.
Once you have created a data set, it is a good idea to perform PROC MEANS on all numeric variables, and review the results for evidence of possible errors. It is especially important to review the information in the columns headed "N," "Minimum," and "Maximum."

Reviewing the number of valid observations. The first column in Output 4.1 is headed "Variable." In this column you will find the names of the variables that were analyzed. You can see that, as expected, PROC MEANS was performed on Q1, Q2, Q3, AGE, and DONATION.

The second column in Output 4.1 is headed "N." This column indicates the number of valid observations that were found for each variable. Where the row for Q1 intersects with the column headed "N," you will find the number "11." This indicates that PROC MEANS analyzed 11 valid cases for the variable Q1, as expected. Where the row for Q3 intersects with the column headed "N," you find the number "10," meaning that there were only 10 usable observations for the Q3 variable. However, this does not necessarily mean that there was an error: if you review the actual data set (reproduced earlier), you will note that there is
one instance of missing data for Q3 (indicated by the single period in the column for Q3 for the next-to-last subject). Similarly, although Output 4.1 indicates only 9 valid observations for DONATION, this is no cause for concern because the data set itself shows that you had missing data for two people on this variable.

Reviewing the Minimum and Maximum columns. The fifth column in Output 4.1 is headed "Minimum." This column indicates the smallest value that was observed for each variable. The last column in the output is headed "Maximum," and this column indicates the largest value that was observed for each variable. The "Minimum" and "Maximum" columns are useful for determining whether any values are out of bounds. Out-of-bounds values are scores that are either too large or too small to be possible, given the type of variable that you are analyzing. If you find any out-of-bounds values, it probably means that you made an error, either in writing your INPUT statement or in typing your data.

For example, consider the variable Q1 in Output 4.1. Where the row for Q1 intersects with the column headed "Minimum," you see the value of 2. This means that the smallest value that the SAS System observed for Q1 was 2. Where the row for Q1 intersects with the column headed "Maximum," you see the value of 7. This means that the largest value that SAS read for Q1 was 7. Remember that the variable Q1 represents responses to a questionnaire item where the possible responses ranged from a low of 1 (Generally Conservative) to a maximum of 7 (Generally Liberal). If Output 4.1 showed a "Minimum" score of 0 for Q1, this would be an invalid, out-of-bounds score (because Q1 is not supposed to go any lower than 1). Such a result might mean that you made an error in keying your data. Similarly, if the output showed a Maximum score of 9 for Q1, this would also be an invalid score (because Q1 is not supposed to go any higher than 7).

A review of minimum and maximum values in Output 4.1 does not reveal any out-of-bounds scores for any of the variables.
Interpreting the Results Produced by PROC FREQ

The SAS Output. Because the results of PROC MEANS did not reveal any obvious problems with the data set, you can proceed to the results of PROC FREQ. These results are reproduced in Output 4.2.

                                    JOHN DOE                                  2

                              The FREQ Procedure

                                              Cumulative    Cumulative
   POL_PRTY    Frequency     Percent          Frequency       Percent
   -------------------------------------------------------------------
   D                  3        37.50                  3         37.50
   R                  5        62.50                  8        100.00

                        Frequency Missing = 3

                                              Cumulative    Cumulative
   SEX         Frequency     Percent          Frequency       Percent
   -------------------------------------------------------------------
   F                  5        45.45                  5         45.45
   M                  6        54.55                 11        100.00

Output 4.2. Results of PROC FREQ, political donation study.
Reviewing the frequency tables. Two tables appear in Output 4.2: a frequency table for the variable POL_PRTY, and a frequency table for the variable SEX. In the first table, the variable name POL_PRTY appears in the upper left corner, meaning that this is the frequency table for POL_PRTY (political party). Beneath this variable name are the two possible values that the variable could assume: "D" (for democrats) and "R" (for republicans). You should always review this list of values to verify that no invalid values appear there. For example, if the values for this table had included "T" along with "D" and "R," it probably would indicate that you made an error in keying your data because "T" doesn't stand for anything meaningful in this study.

When typing character variables in a data set, case is important, so you must be consistent in using uppercase and lowercase letters. For example, when keying POL_PRTY, if you initially use an uppercase "D" to represent democrats, you should never switch to using a lowercase "d" within that data set. If you do, SAS will treat the uppercase "D" and the lowercase "d" as two completely different values. When you perform a PROC FREQ on the POL_PRTY variable, you will obtain one row of frequency information for subjects identified with the uppercase "D," and a different row of frequency information for subjects identified with the lowercase "d." In most cases, this will not be desirable.
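One way to guard against inconsistent case is to convert the values of a character variable to uppercase with the UPCASE function inside the DATA step. The following lines are only a sketch of this idea; UPCASE is a standard SAS function, but these statements are not part of the program used in this chapter. They would be placed immediately after the INPUT statement:

   POL_PRTY = UPCASE(POL_PRTY);   /* converts a value such as "d" to "D" */
   SEX = UPCASE(SEX);             /* converts a value such as "f" to "F" */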
The second column in the frequency table in Output 4.2 is headed "Frequency." This column indicates the number of subjects who were observed in each of the categories of the variable being analyzed. For example, where the row for the value "D" intersects with the column headed "Frequency," you can see the number "3." This means that 3 subjects were coded with a "D" in the data set. In other words, it means that 3 subjects were democrats. Where the row for the value "R" intersects with the column headed "Frequency," you see the number "5." This means that 5 subjects were coded with an "R" in the data set (i.e., 5 subjects were republicans).

Below the frequency table for the POL_PRTY variable, you can see the entry "Frequency Missing = 3". This section of the results produced by PROC FREQ indicates the number of observations with missing data for the variable being analyzed. This frequency missing entry for POL_PRTY indicates that there were three subjects with missing data for the political party variable.

Whenever you create a new data set, you should always perform PROC FREQ on all character variables in this manner, to verify that the results seem reasonable. A warning sign, for example, would be a very large value for "Frequency Missing." For POL_PRTY, all of the results from PROC FREQ seem reasonable, indicating no obvious problems.

The second frequency table in Output 4.2 provides results for the SEX variable. It shows that 5 subjects were coded with an "F" (5 subjects were female), and 6 subjects were coded with an "M" (6 subjects were male). There is no "Frequency Missing" entry for the SEX table, which indicates that there were no missing data for this variable. These results, too, seem reasonable, and do not indicate any obvious problems with the DATA step so far.

Summary

In summary, whenever you create a new data set, you should perform a few simple descriptive analyses to verify that there were no obvious errors in writing the INPUT statement or in typing the data. At a minimum, this should include performing PROC MEANS on your numeric variables, and performing PROC FREQ on your character variables. PROC UNIVARIATE is also useful for performing descriptive analysis on numeric variables. The results produced by PROC UNIVARIATE are somewhat more complex than those produced by PROC MEANS; for this reason, it will be covered in Chapter 7, "Measures of Central Tendency and Variability" (a brief sketch of a PROC UNIVARIATE call appears at the end of this section).

If the results produced by PROC MEANS and PROC FREQ do not reveal any obvious problems, it does not necessarily mean that your data set is free of typos or other errors. An even more thorough approach to checking your data set involves using PROC PRINT to print out the raw data, so that you can proof every subject's value on every variable. The following section shows how to do this.
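As mentioned in the Summary above, PROC UNIVARIATE can also be used for descriptive checks on numeric variables. Although the procedure is not covered until Chapter 7, the following lines are a minimal sketch of how it might be requested for the numeric variables in the political donation data set (the statement structure parallels the PROC MEANS example used earlier):

   PROC UNIVARIATE  DATA=D1;
      VAR  Q1 Q2 Q3 AGE DONATION;
      TITLE1  'JOHN DOE';
   RUN;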
Using PROC PRINT to Create a Printout of Raw Data

Overview

The PRINT procedure (PROC PRINT) is useful for generating a printout of your raw data (i.e., a printout of your data as they appear in a SAS data set). You can use PROC PRINT to review each subject's score on each variable in your data set. Whenever you create a new data set, you should always use PROC PRINT to print out the raw data before doing any other, more sophisticated analyses. You should check the output created by PROC PRINT against your original data records to verify that SAS has read your data in the way that you intended.

The first part of this section shows you how to use PROC PRINT to print raw data for all variables in a data set. Later, this section shows how you can use the VAR statement to print raw data for a subset of variables.

Using PROC PRINT to Print Raw Data for All of the Variables in the Data Set

The Syntax. Here is the syntax for the PROC step that will cause the PRINT procedure to print the raw data for all variables in your data set:

   PROC PRINT  DATA=data-set-name ;
      TITLE1  ' your-name ' ;
   RUN;
Here are the actual statements that you use with the PRINT procedure to print the raw data for the political donation study described above (a later section will show where these statements should go in your program):

   PROC PRINT  DATA=D1;
      TITLE1  'JOHN DOE';
   RUN;
Output 4.3 shows the results that are generated by the preceding statements.
                                    JOHN DOE                                  1

   Obs   SUB_NUM   Q1   Q2   Q3   POL_PRTY   SEX   AGE   DONATION

     1       1      7    6    5      D        F     32       1000
     2       2      2    2    3      R        M      .          0
     3       3      3    4    3               M     45        100
     4       4      6    6    5               F     20          .
     5       5      5    4    5      D        F     31          0
     6       6      2    3    1      R        M     54          0
     7       7      2    1    3               M     21        250
     8       8      3    3    3      R        M     43          .
     9       9      5    4    5      D        F     32        100
    10      10      3    2    .      R        F     18          0
    11      11      3    6    4      R        M     21         50

Output 4.3. Results of PROC PRINT performed on data from the political donation study (see Table 4.2).
Output created by PROC PRINT. For the most part, Output 4.3 presents a duplication of the data that appeared in Table 4.2, which was presented earlier in this chapter. The most obvious difference is the fact that the subject names that appeared in Table 4.2 do not appear in Output 4.3.

The first column, Obs (observation number), lists a unique observation number for each subject in the study. When the observations in a data set are individual subjects (as is the case with the current political donation study), the observation numbers are essentially subject numbers. This means that, in the row for observation #1, you will find data for your first subject (Marsha from Table 4.2); in the row for observation #2, you will find data for your second subject (Charles from Table 4.2), and so on. You probably remember that you did not include this Obs variable in the data set that you created. Instead, this Obs variable is automatically generated by SAS whenever you create a SAS data set.

The next column, headed SUB_NUM, shows the "subject number" variable that was input as part of your SAS data set.

The column headed Q1 contains subject responses to question #1 from the political donation questionnaire that was presented earlier. Question #1 asked, "Would you describe yourself as being generally conservative, generally liberal, or somewhere between these two extremes?" Subjects could circle any number from 1 to 7 to indicate their response, where 1 = Generally Conservative and 7 = Generally Liberal. Under the heading of Q1 in Output 4.3, you can see that subject #1 circled a 7, subject #2 circled a 2, subject #3 circled a 3, and so on.

In the columns headed Q2 and Q3, you will find subject responses to question #2 and question #3 from the political donation questionnaire. These questions also used a 7-point response format. The output shows that subject #10 has a period (.) listed under the heading Q3. This means that this subject has missing data for question #3.
Under POL_PRTY, you will find subject values for the political party variable. You will remember that this was a character variable in which the value "D" represents democrats and "R" represents republicans. You can see that subject #3, subject #4, and subject #7 do not have any values for POL_PRTY. This is because they had missing data on the political party variable.

The column headed SEX indicates subject sex. This was a character variable in which "F" represented females and "M" represented males. The column headed AGE indicates subject age. The column headed DONATION indicates the amount of the financial donation made to a political party by each subject.

Using PROC PRINT to Print Raw Data for a Subset of Variables in the Data Set

Statements for the SAS Program. In some cases, you may wish to print raw data for only a few variables in a data set. When this is the case, you should use the VAR statement in conjunction with the PROC PRINT statement. In the VAR statement, list only the names of the variables that you want to print. Below is the syntax:

   PROC PRINT  DATA=data-set-name ;
      VAR  variable-list ;
      TITLE1  ' your-name ' ;
   RUN;

For example, the following will cause PROC PRINT to print raw values only for the SEX and AGE variables:

   PROC PRINT  DATA=D1;
      VAR  SEX AGE;
      TITLE1  'JOHN DOE';
   RUN;
Output created by PROC PRINT. Output 4.4 shows the results that are generated by the preceding statements.

JOHN DOE

Obs   SEX   AGE

  1    F     32
  2    M      .
  3    M     45
  4    F     20
  5    F     31
  6    M     54
  7    M     21
  8    M     43
  9    F     32
 10    F     18
 11    M     21
Output 4.4. Results of PROC PRINT in which only the SEX and AGE variables were listed in the VAR statement.
You can see that Output 4.4 is similar to Output 4.3, with the exception that Output 4.4 includes only three variables: Obs, SEX, and AGE. As was stated earlier, Obs is not entered by the SAS user as a part of the data set; instead, it is automatically generated by SAS.

A Common Misunderstanding Regarding PROC PRINT

Students learning the SAS System for the first time often misunderstand PROC PRINT: they sometimes assume that a SAS program must contain PROC PRINT in order to generate a paper printout of their results. This is not the case. PROC PRINT simply generates a printout of your raw data (i.e., subjects' individual scores for the variables in your data set). If you have performed some other SAS procedure such as PROC MEANS or PROC FREQ, you do not have to include PROC PRINT in your program to create a paper printout of the results generated by these procedures. Use PROC PRINT only when you want to generate a listing of each subject's values on the variables in your SAS data set.
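For example, the following short program—a minimal sketch that reuses the D1 data set from this chapter—contains no PROC PRINT statement at all, yet the results produced by the MEANS procedure still appear in the output:

PROC MEANS DATA=D1;
   VAR AGE DONATION;        /* numeric variables from the political donation data set */
   TITLE1 'JOHN DOE';
RUN;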
The Complete SAS Program

To review: when you first create a SAS data set, it is very important to perform a few simple SAS procedures to verify that SAS read your data set as you intended. In most cases, this means that you should

•  perform PROC MEANS on all numeric variables in the data set (if any).
•  perform PROC FREQ on all character variables in the data set (if any).
•  perform PROC PRINT to print out the complete raw data set, including numeric and character variables.
These three procedures have been discussed separately in previous sections. However, it is often best to request all three procedures in the same SAS program when you have created a new data set. An example of such a program appears below. The program does the following:

1. inputs the political donation data set described earlier in this chapter
2. requests that PROC MEANS be performed on one subset of variables
3. requests that PROC FREQ be performed on a different subset of variables
4. includes a PROC PRINT statement that will cause the entire raw data set to be printed out (notice that the VAR statement has been omitted from the PROC PRINT section of the program):

OPTIONS LS=80 PS=60;
DATA D1;
   INPUT SUB_NUM Q1 Q2 Q3 POL_PRTY $ SEX $ AGE DONATION ;
DATALINES;
01 7 6 5 D F 32 1000
02 2 2 3 R M  .    0
03 3 4 3 . M 45  100
04 6 6 5 . F 20    .
05 5 4 5 D F 31    0
06 2 3 1 R M 54    0
07 2 1 3 . M 21  250
08 3 3 3 R M 43    .
09 5 4 5 D F 32  100
10 3 2 . R F 18    0
11 3 6 4 R M 21   50
;
PROC MEANS DATA=D1;
   VAR Q1 Q2 Q3 AGE DONATION;
   TITLE1 'JOHN DOE';
RUN;
PROC FREQ DATA=D1;
   TABLES POL_PRTY SEX;
RUN;
PROC PRINT DATA=D1;
RUN;
Conclusion

This chapter focused on the list input approach to writing the INPUT statement. This is a relatively simple approach, and it will be adequate for the types of data sets that you will encounter in this Student Guide. For more complex data sets (e.g., data sets that include more than one line of data for each observation), you might want to learn more about formatted input. This approach is described and illustrated in Cody and Smith (1997), Hatcher (2001), and Hatcher and Stepanski (1994).

After you have prepared the DATA step of your SAS program, it is good practice to analyze it with PROC FREQ (along with other procedures) to verify that there were no obvious errors in the INPUT statement. This chapter provided a quick introduction to PROC FREQ; next, Chapter 5, "Creating Frequency Tables," discusses the FREQ procedure in greater detail.
Creating Frequency Tables

Introduction .........................................................................................146
   Overview ...........................................................................................146
   Why It Is Important to Use PROC FREQ ...........................................146
Example 5.1: A Political Donation Study ............................................147
   The Study ..........................................................................................147
   Data Set to Be Analyzed ....................................................................148
   The DATA Step of the SAS Program .................................................150
Using PROC FREQ to Create a Frequency Table ................................152
   Writing the PROC FREQ Statement ..................................................152
   Output Produced by the SAS Program ..............................................152
Examples of Questions That Can Be Answered by
   Interpreting a Frequency Table .........................................................155
   The Frequency Table .........................................................................155
   The Questions ...................................................................................156
Conclusion ...........................................................................................157
Introduction

Overview

In this chapter you learn how to use the FREQ procedure to create simple, one-way frequency tables. When you use PROC FREQ to analyze a specific variable, the resulting frequency table displays

•  values for that variable that were observed in the sample that you analyzed
•  the frequency (number) of observations appearing at each value
•  the percent of observations appearing at each value
•  the cumulative frequency of observations appearing at each value
•  the cumulative percent of observations appearing at each value.
Some of the preceding statistics terms (e.g., cumulative frequency) may be new to you. Later sections of this chapter will explain these terms, and will show you how to interpret a frequency table created by the FREQ procedure.

Why It Is Important to Use PROC FREQ

After you have created a SAS data set, it is often a good idea to analyze it with PROC FREQ before going on to perform more sophisticated statistical analyses (such as analysis of variance). At a minimum, this will help you find errors in your data or program. In addition, with some types of investigations it is necessary to create a frequency table in order to answer research questions. For example, performing PROC FREQ on the correct data set can help you answer the research question "What percentage of the adult U.S. population favors the death penalty?"
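To give a sense of what such a request might look like, here is a minimal sketch. The data set name OPINION_DATA and the character variable DEATH_PEN (coded, say, "FAVOR" or "OPPOSE") are invented purely for illustration; they are not part of the example data sets used in this guide:

PROC FREQ DATA=OPINION_DATA;
   TABLES DEATH_PEN;        /* hypothetical character variable coded FAVOR or OPPOSE */
   TITLE1 'JANE DOE';
RUN;

The Percent column of the resulting frequency table would then report the percentage of respondents giving each answer.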
Example 5.1: A Political Donation Study

The Study

Suppose that you are a political scientist conducting research on campaign finance. With your current study, you wish to identify the variables that predict the size of the financial donations that people make to political causes. You develop the following questionnaire:

   What is your political party (please check one):
   ____ Democrat
   ____ Republican
   ____ Independent

   What is your sex?
   ____ Female
   ____ Male

   What is your age?  ______ years old

   During the past year, how much money have you donated to political causes?  $ ____________

   Below are a number of statements with which you may agree or disagree. For each, please circle the number that indicates the extent to which you either agree or disagree with the statement. Please use the following format in making your responses:

      7 = Agree Very Strongly
      6 = Agree Strongly
      5 = Agree
      4 = Neither Agree nor Disagree
      3 = Disagree
      2 = Disagree Strongly
      1 = Disagree Very Strongly

                                                          Circle your response
                                                          --------------------
   1. I believe that our federal government is
      generally doing a good job.                          1  2  3  4  5  6  7

   2. The federal government should raise taxes.           1  2  3  4  5  6  7

   3. The federal government should do a better job of
      maintaining our interstate highway system.           1  2  3  4  5  6  7

   4. The federal government should increase social
      security benefits to the elderly.                    1  2  3  4  5  6  7
Data Set to Be Analyzed

Responses to the questionnaire. You administer this questionnaire to 22 individuals between the ages of 33 and 59. Table 5.1 contains subject responses to the questionnaire.

Table 5.1
Subject Responses to the Political Donation Questionnaire
________________________________________________________________
                                                  Responses to
                                                  statements(b)
          Political                              ______________
Subject   party(a)    Sex   Age   Donation       Q1  Q2  Q3  Q4
________________________________________________________________
  01         D         M     47        400        4   3   6   2
  02         R         M     36        800        4   6   6   6
  03         I         F     52        200        1   3   7   2
  04         R         M     47        300        3   2   5   3
  05         D         F     42        300        4   4   5   6
  06         R         F     44       1200        2   2   5   5
  07         D         M     44        200        6   2   3   6
  08         D         M     50        400        4   3   6   2
  09         R         F     49       2000        3   1   6   2
  10         D         F     33        500        3   4   7   1
  11         R         M     49        700        7   2   6   7
  12         D         F     59        600        4   2   5   6
  13         D         M     38        300        4   1   2   6
  14         I         M     55        100        5   5   6   5
  15         I         F     52          0        5   2   6   5
  16         D         F     48        100        6   3   4   6
  17         R         F     47       1500        2   1   6   2
  18         D         M     49        500        4   1   6   2
  19         D         F     43       1000        5   2   7   3
  20         D         F     44        300        4   3   5   7
  21         I         F     38        100        5   2   4   1
  22         D         F     47        200        3   7   1   4
________________________________________________________________
(a) For the political party variable, "D" represents democrats, "R" represents republicans, and "I" represents independents.
(b) Responses to the four "agree-disagree" statements at the end of the questionnaire.
Understanding Table 5.1. In Table 5.1, the rows (running horizontally) represent different subjects, and the columns (running vertically) represent different variables.

The first column is headed "Subject." This variable simply assigns a unique subject number to each person who responded to the questionnaire. These subject numbers run from "01" to "22."

The second column is headed "Political party." With this variable, the value "D" is used to represent democrats, "R" is used to represent republicans, and "I" is used to represent independents.

The third column is headed "Sex." With this variable, the value "M" is used to represent male subjects, and the value "F" is used to represent female subjects.

The fourth and fifth columns are headed "Age" and "Donation." These columns provide each subject's age and the size of the political donation that he or she made, respectively. For example, you can see that Subject 01 was 47 years old and made a donation of $400, Subject 02 was 36 years old and made a donation of $800, and so forth.

The last four columns of Table 5.1 appear under the major heading "Responses to statements." These columns contain subject responses to the four "agree-disagree" statements that appear in the previously mentioned questionnaire:

•  Column Q1 indicates the number that each subject circled in response to the statement "I believe that our federal government is generally doing a good job." You can see that Subject 01 circled "4" (which stands for "Neither Agree nor Disagree"), Subject 02 also circled "4," Subject 03 circled "1" (which stands for "Disagree Very Strongly"), and so on.
•  Column Q2 contains responses to the statement "The federal government should raise taxes."
•  Column Q3 contains responses to the statement "The federal government should do a better job of maintaining our interstate highway system."
•  Column Q4 contains responses to the statement "The federal government should increase social security benefits to the elderly."
The DATA Step of the SAS Program

Keying the DATA step. Now you include the data that appear in Table 5.1 as part of the DATA step of a SAS program. In doing this, you arrange the data in a way that is similar to the preceding table (i.e., the first column contains a unique subject number for each participant, the second column indicates the political party to which each subject belongs, and so on). Below is the DATA step for the SAS program that contains these data:

 1   OPTIONS LS=80 PS=60;
 2   DATA D1;
 3      INPUT   SUB_NUM
 4              POL_PRTY $
 5              SEX $
 6              AGE
 7              DONATION
 8              Q1
 9              Q2
10              Q3
11              Q4 ;
12   DATALINES;
13   01 D M 47  400 4 3 6 2
14   02 R M 36  800 4 6 6 6
15   03 I F 52  200 1 3 7 2
16   04 R M 47  300 3 2 5 3
17   05 D F 42  300 4 4 5 6
18   06 R F 44 1200 2 2 5 5
19   07 D M 44  200 6 2 3 6
20   08 D M 50  400 4 3 6 2
21   09 R F 49 2000 3 1 6 2
22   10 D F 33  500 3 4 7 1
23   11 R M 49  700 7 2 6 7
24   12 D F 59  600 4 2 5 6
25   13 D M 38  300 4 1 2 6
26   14 I M 55  100 5 5 6 5
27   15 I F 52    0 5 2 6 5
28   16 D F 48  100 6 3 4 6
29   17 R F 47 1500 2 1 6 2
30   18 D M 49  500 4 1 6 2
31   19 D F 43 1000 5 2 7 3
32   20 D F 44  300 4 3 5 7
33   21 I F 38  100 5 2 4 1
34   22 D F 47  200 3 7 1 4
35   ;
Understanding the DATA step. Remember that if you were typing the preceding program, you would not actually type the line numbers (in italic) that appear on the left; they are provided here for reference. With the preceding program, the INPUT statement appears on lines 3–11. This INPUT statement assigns the following SAS variable names to your variables:

•  The SAS variable name SUB_NUM is used to represent each subject's unique subject number (i.e., "01," "02," "03," and so on).
•  The SAS variable name POL_PRTY represents the political party to which the subject belongs. In typing your data, you used the value "D" to represent democrats, "R" to represent republicans, and "I" to represent independents. The dollar sign ($) to the right of this variable name indicates that it is a character variable.
•  The SAS variable name SEX represents each subject's sex, with the value "F" representing females and "M" representing males. Again, the dollar sign ($) to the right of this variable name indicates that it is also a character variable.
•  The SAS variable name AGE indicates the subject's age in years.
•  The SAS variable name DONATION indicates the size of the political donation (in dollars) that each subject made in the past year.
•  The SAS variable name Q1 indicates subject responses to the first question using the "agree-disagree" format. You keyed a "1" if the subject circled a "1" (for "Disagree Very Strongly"), you keyed a "2" if the subject circled a "2" (for "Disagree Strongly"), and so on.
•  In the same way, the SAS variable names Q2, Q3, and Q4 represent subject responses to the second, third, and fourth questions using the "agree-disagree" format.
Notice that at least one blank space was left between adjacent values in each data line. This is required when using the list, or free-formatted, approach to data input. With the data set typed in, you can now append PROC statements below the null statement (the lone semicolon that appears on line 35). You can use these PROC statements to create frequency tables that will help you understand the variables in the data set more clearly. The following section shows you how to do this.
Using PROC FREQ to Create a Frequency Table

Writing the PROC FREQ Statement

The syntax. Following is the syntax for the PROC FREQ statement (and related statements) that will create a simple frequency table:

PROC FREQ   DATA=data-set-name ;
   TABLES   variable-list ;
   TITLE1   ' your-name ' ;
RUN;

The second line of the preceding code is the TABLES statement. In this statement you list the names of the variables for which frequency tables should be created. If you list more than one variable, you should separate each variable name by at least one space. Listing more than one variable in the TABLES statement will cause SAS to create a separate frequency table for each variable.

For example, the following statements create a frequency table for the variable AGE that appears in the preceding data set:

PROC FREQ DATA=D1;
   TABLES AGE;
   TITLE1 'JANE DOE';
RUN;
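If you instead wanted separate frequency tables for several of the variables in this data set—say, POL_PRTY, SEX, and AGE—you could simply list them all in the TABLES statement, as in the sketch below. (Only the single-variable request for AGE shown above corresponds to the output reproduced in the next section.)

PROC FREQ DATA=D1;
   TABLES POL_PRTY SEX AGE;   /* one frequency table is produced for each variable listed */
   TITLE1 'JANE DOE';
RUN;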
Output Produced by the SAS Program

The frequency table. The frequency table created by the preceding request for the variable AGE is reproduced as Output 5.1. The remainder of this section shows how to interpret the various parts of the table.
JANE DOE                                                             1

                         The FREQ Procedure

                                       Cumulative    Cumulative
 AGE    Frequency     Percent           Frequency       Percent
 ---------------------------------------------------------------
  33            1        4.55                   1          4.55
  36            1        4.55                   2          9.09
  38            2        9.09                   4         18.18
  42            1        4.55                   5         22.73
  43            1        4.55                   6         27.27
  44            3       13.64                   9         40.91
  47            4       18.18                  13         59.09
  48            1        4.55                  14         63.64
  49            3       13.64                  17         77.27
  50            1        4.55                  18         81.82
  52            2        9.09                  20         90.91
  55            1        4.55                  21         95.45
  59            1        4.55                  22        100.00

Output 5.1. Results from the FREQ procedure performed on the variable AGE.
You can see that the frequency table consists of five vertical columns, headed AGE, Frequency, Percent, Cumulative Frequency, and Cumulative Percent. The following sections describe the meaning of the information contained in each column.

The column headed with the variable name. The first column in a SAS frequency table is headed with the name of the variable that is being analyzed. You can see that the first column in Output 5.1 is labeled "AGE," the variable being analyzed here. The various values assumed by AGE appear in the first column under the heading "AGE." Reading from the top down in this column, you can see that, in this data set, the observed values of AGE were 33, 36, 38, and so on, through 59. This means that the youngest person in your data set was 33, and the oldest was 59.

The second column in the frequency table is headed "Frequency." This column reports the number of observations that appear at each value of the variable being analyzed. In the present case, it will tell us how many subjects were at age 33, how many were at age 36, and so on. For example, the first value in the AGE column is 33. If you read to the right of this value, you will find information about how many people were at age 33. Where the row for "33" intersects with the column headed "Frequency," you see the number "1." This means that just one person was at age 33. Now skip down two rows to the row for the age "38." Where the row for "38" intersects with the column headed "Frequency," you see the number "2." This means that two people were at age 38.
Reviewing various parts of the "Frequency" column reveals the following:

•  There were 3 people at age 44.
•  There were 4 people at age 47.
•  There was 1 person at age 59.
The next column is the "Percent" column. This column indicates the percent of observations appearing at each value. In the present case, it will reveal the percent of people at age 33, the percent at age 36, and so on. A particular entry in the "Percent" column is equal to the corresponding value in the "Frequency" column, divided by the total number of usable observations in the data set.

For example, where the row for the age of 33 intersects with the column headed "Percent," you see the entry 4.55. This means that 4.55% of the subjects were at age 33. This was computed by dividing the frequency of people at age 33 (which was "1") by the total number of usable observations (which was "22"). 1 divided by 22 is equal to .0455, or 4.55%.

Now go down to the row for the age of 44. Where the row for 44 intersects with the column headed "Percent," you see the entry 13.64, meaning that 13.64% of the subjects were at age 44. This was computed by dividing the frequency of people at age 44 (which was "3") by the total number of usable observations (which was "22"). 3 divided by 22 is equal to .1364, or 13.64%.

The next column is the "Cumulative Frequency" column. A particular entry in the "Cumulative Frequency" column indicates the sum of

•  the number of observations scoring at the current value in the "Frequency" column, plus
•  the number of observations scoring at each of the preceding (lower) values in the "Frequency" column.
For example, look at the point where the row for AGE = 44 intersects with the column headed “Cumulative Frequency.” At that intersection, you see the number “9.” This means that a total of 9 people were at age 44 or younger. Next, look at the point where the row for AGE = 55 intersects with the column headed “Cumulative Frequency.” At that intersection, you see the number “21.” This means that a total of 21 people were at age 55 or younger. Finally, the last entry in the “Cumulative Frequency” column is “22,” meaning that 22 people were at age 59 or younger. It also means that a total of 22 people provided valid data on this AGE variable (the last entry in the “Cumulative Frequency” column always indicates the total number of usable observations for the variable being analyzed).
The last column is the "Cumulative Percent" column. A particular entry in the "Cumulative Percent" column indicates the sum of

•  the percent of observations scoring at the current value in the "Percent" column, plus
•  the percent of observations scoring at each of the preceding (lower) values in the "Percent" column.
For example, look at the point where the row for AGE = 44 intersects with the column headed "Cumulative Percent." At that intersection, you see the number "40.91." This means that 40.91% of the subjects were at age 44 or younger. Next, look at the point where the row for AGE = 55 intersects with the column headed "Cumulative Percent." At that intersection, you see the number "95.45." This means that 95.45% of the subjects were at age 55 or younger. Finally, the last entry in the "Cumulative Percent" column is "100.00," meaning that 100% of the people were at age 59 or younger. The last figure in the "Cumulative Percent" column will always be 100%.

Frequency missing. In some instances, you will see "Frequency Missing = n" below the frequency table that was created by PROC FREQ (this entry does not appear in Output 5.1). This entry appears when you perform PROC FREQ on a variable that has at least some missing data. The entry is followed by a number that indicates the number of observations containing missing data that were encountered by SAS. For example, the following entry, appearing after the frequency table for the variable AGE, would indicate that SAS encountered five subjects with missing data on the AGE variable:

Frequency Missing = 5
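By default, observations with missing data are simply counted in this "Frequency Missing" line and excluded from the percentages reported in the body of the table. If you would rather have the missing values appear as their own row of the frequency table, the TABLES statement provides a MISSING option. The following sketch shows the general idea (consult the SAS documentation for details of how the percentages are then computed):

PROC FREQ DATA=D1;
   TABLES AGE / MISSING;   /* treat missing values as a category in the table */
   TITLE1 'JANE DOE';
RUN;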
Examples of Questions That Can Be Answered by Interpreting a Frequency Table

The Frequency Table

The frequency table for the AGE variable is reproduced again as Output 5.2. This output is identical to Output 5.1; it is repeated here so that you can refer to it while working through the questions in the following section.
JANE DOE                                                             1

                         The FREQ Procedure

                                       Cumulative    Cumulative
 AGE    Frequency     Percent           Frequency       Percent
 ---------------------------------------------------------------
  33            1        4.55                   1          4.55
  36            1        4.55                   2          9.09
  38            2        9.09                   4         18.18
  42            1        4.55                   5         22.73
  43            1        4.55                   6         27.27
  44            3       13.64                   9         40.91
  47            4       18.18                  13         59.09
  48            1        4.55                  14         63.64
  49            3       13.64                  17         77.27
  50            1        4.55                  18         81.82
  52            2        9.09                  20         90.91
  55            1        4.55                  21         95.45
  59            1        4.55                  22        100.00

Output 5.2. Results from the FREQ procedure, reproduced for purposes of answering questions.
The Questions

The companion volume to this book, Step-by-Step Basic Statistics Using SAS: Exercises, provides exercises that enable you to review what you learned in this chapter by

•  entering a new data set
•  performing PROC FREQ on one of the variables in that data set
•  answering a series of questions about the frequency table created by PROC FREQ.

Here are examples of the types of questions that you will be asked to answer. Read each of the questions presented below, review the answer provided, and verify that you understand where the answer is found in Output 5.2. Also verify that you understand why that answer is correct. If you are confused by any of the following questions and answers, go back to the relevant section of this chapter and reread that section.

•  Question: What is the lowest observed value for the AGE variable?
   Answer: 33. This is the first value listed in the AGE column of Output 5.2.

•  Question: What is the highest observed value for the AGE variable?
   Answer: 59. This is the last value listed in the AGE column of Output 5.2.

•  Question: How many people are 49 years old? (i.e., What is the frequency for people who displayed a value of 49 on the AGE variable?)
   Answer: Three. This is the Frequency entry for the AGE value of 49 in Output 5.2.

•  Question: What percent of people are 38 years old?
   Answer: 9.09%. This is the Percent entry for the AGE value of 38 in Output 5.2.

•  Question: How many people are 50 years old or younger?
   Answer: 18. This is the Cumulative Frequency entry for the AGE value of 50 in Output 5.2.

•  Question: What percent of people are 52 years old or younger?
   Answer: 90.91%. This is the Cumulative Percent entry for the AGE value of 52 in Output 5.2.

•  Question: What is the total number of valid observations for the AGE variable in this data set?
   Answer: 22. This is the final entry in the Cumulative Frequency column of Output 5.2.
Conclusion

This chapter has shown you how to use PROC FREQ to create simple frequency tables. These tables provide the numbers that enable you to verbally describe the nature of your data; they allow you to make statements such as "Nine percent of the sample were 38 years of age" or "95% of the sample were age 55 or younger."

In some cases, it is more effective to use a graph to illustrate the nature of your data. For example, you might use a bar graph to indicate the frequency of subjects at various ages. Or you might use a bar graph to illustrate the mean age for male subjects versus female subjects. SAS provides a number of procedures that enable you to create bar graphs of this sort, as well as other types of graphs and charts. The following chapter introduces you to some of these procedures.
Creating Graphs

Introduction .........................................................................................160
   Overview ...........................................................................................160
   High-Resolution versus Low-Resolution Graphics ............................160
   What to Do If Your Graphics Do Not Fit on the Page ........................161
Reprise of Example 5.1: the Political Donation Study ........................161
   The Study ..........................................................................................161
   SAS Variable Names .........................................................................161
Using PROC CHART to Create a Frequency Bar Chart .......................162
   What Is a Frequency Bar Chart? .......................................................162
   Syntax for the PROC Step .................................................................163
   Creating a Frequency Bar Chart for a Character Variable ................163
   Creating a Frequency Bar Chart for a Numeric Variable ...................165
   Creating a Frequency Bar Chart Using the LEVELS Option .............168
   Creating a Frequency Bar Chart Using the MIDPOINTS Option .......170
   Creating a Frequency Bar Chart Using the DISCRETE Option .........172
Using PROC CHART to Plot Means for Subgroups .............................174
   Plotting Means versus Frequencies ..................................................174
   The PROC Step .................................................................................174
   Output Produced by the SAS Program .............................................176
Conclusion ...........................................................................................177
Introduction

Overview

In this chapter you learn how to use the SAS System's CHART procedure to create bar charts. Most of the chapter focuses on creating frequency bar charts: figures in which the horizontal axis plots values for a variable, and the vertical axis plots frequencies. A bar for a particular value in a frequency bar chart indicates the number of observations that display that value in the data set. Frequency bar charts are useful for quickly determining which values are relatively common in a data set, and which values are less common. You'll learn how to modify your bar charts by using the LEVELS, MIDPOINTS, and DISCRETE options.

The final section of this chapter shows you how to use PROC CHART to create subgroup-mean bar charts. These are figures in which the points on the horizontal axis represent different subgroups of subjects, and the vertical axis plots values on a selected quantitative variable. A bar for a particular group illustrates the mean score displayed by that group for the quantitative variable. Subgroup-mean bar charts are useful for quickly determining which groups scored relatively high on a quantitative variable, and which groups scored relatively low.

High-Resolution versus Low-Resolution Graphics

This chapter shows you how to create low-resolution graphics, as opposed to high-resolution graphics. The difference between low-resolution and high-resolution graphics is one of appearance: high-resolution graphics have a higher-quality, more professional look, and therefore are more appropriate for publication in a research journal. Low-resolution graphics are fine for helping you review and understand your data, but the quality of their appearance is generally not good enough for publication.

This chapter presents only low-resolution graphics because the SAS programs requesting them are simpler, and they require only base SAS and SAS/STAT software, which most SAS users have access to. If you need to prepare high-resolution graphics, you need SAS/GRAPH software. For more information on producing high-quality figures with SAS/GRAPH, see SAS Institute Inc. (2000) and Carpenter and Shipp (1995).
What to Do If Your Graphics Do Not Fit on the Page

Most of the chapters in this book advise that you begin each SAS program with the following OPTIONS statement to control the size of your output page:

OPTIONS  LS=80  PS=60;

The option PS=60 is an abbreviation for PAGESIZE=60. This requests that output be printed with 60 lines per page. With some computers and printers, however, these specifications will cause some figures (e.g., bar charts) to be too large to be printed on a single page. If you have charts that are broken across two pages, try reducing the page size to 50 lines by using the following OPTIONS statement:

OPTIONS  LS=80  PS=50;
Reprise of Example 5.1: the Political Donation Study

The Study

This chapter will demonstrate how to use PROC CHART to analyze data from the fictitious political donation study that was presented in Chapter 5, "Creating Frequency Tables." In that chapter, the example involved research on campaign finance, with a questionnaire that was administered to 22 subjects. The results of the questionnaire provided demographic information about the subjects (e.g., sex, age), the size of political donations they had made recently, and some information regarding their political beliefs (sample item: "I believe that our federal government is generally doing a good job"). Subjects responded to these items using a seven-point response format in which 1 = "Disagree Very Strongly" and 7 = "Agree Very Strongly."

SAS Variable Names

When you typed your data, you used the following SAS variable names:

•  POL_PRTY represents the political party to which the subject belongs. In keying your data, you used the value "D" to represent democrats, "R" to represent republicans, and "I" to represent independents.
•  SEX represents the subject's sex, with the value "F" representing females and "M" representing males.
•  AGE represents the subject's age in years.
•  DONATION represents the size (in dollars) of the political donation that each subject made in the past year.
•  Q1 represents subject responses to the first question using the "Agree-Disagree" format. You typed a "1" if the subject circled a "1" for "Disagree Very Strongly," you typed a "2" if the subject circled a "2" for "Disagree Strongly," and so on.
•  Q2, Q3, and Q4 represent subject responses to the second, third, and fourth questions using the "Agree-Disagree" format.
Chapter 5, “Creating Frequency Tables” includes a copy of the questionnaire that was used to obtain the data. It also presented the complete SAS DATA step that input the data as a SAS data set. If you need to refamiliarize yourself with the questionnaire or the data set, review Example 5.1 in Chapter 5.
Using PROC CHART to Create a Frequency Bar Chart

What Is a Frequency Bar Chart?

Chapter 5 showed you how to use PROC FREQ to create a simple frequency table. These tables are useful for determining the number of people whose scores lie at each value of a given variable. In some cases, however, it is easier to get a sense of these frequencies by plotting them in a bar chart, rather than summarizing them in numerical form in a table. The SAS System's PROC CHART makes it easy to create this type of bar chart.

A frequency bar chart is a figure in which the horizontal axis plots values for a variable, and the vertical axis plots frequencies. A bar for a particular value in a frequency bar chart indicates the number of observations displaying that value in the data set. Frequency bar charts are useful for quickly determining which values are relatively common in a data set, and which values are less common. The following sections illustrate a variety of approaches for creating these charts.
Syntax for the PROC Step

Following is the syntax for the PROC step of a SAS program that requests a frequency bar chart with vertical bars:

PROC CHART  DATA=data-set-name;
   VBAR  variable-list / options ;
   TITLE1  ' your-name ';
RUN;

The second line of the preceding syntax presents the VBAR statement, which requests a vertical bar chart (use the HBAR statement for a horizontal bar chart). It is in the variable-list portion of the VBAR statement that you list the variables for which you want frequency bar charts. The VBAR statement ends with /options, the section in which you list particular options that you want for the charts. Some of these options will be discussed in later sections.

Creating a Frequency Bar Chart for a Character Variable

The PROC step. In this example, you will create a frequency bar chart for the variable POL_PRTY. You will recall that this is the variable for the subject's political party: democrat, republican, or independent. Following are the statements that will request a vertical bar chart plotting frequencies for POL_PRTY:

PROC CHART DATA=D1;
   VBAR POL_PRTY;
   TITLE1 'JANE DOE';
RUN;

Where these statements should appear in the SAS program. Remember that the PROC step of a SAS program should generally come after the DATA step. Chapter 5 provided the DATA step for the political donation data that will be analyzed here. To give you a sense of where the PROC CHART statements should go, the following code shows the last few data lines from the data set, followed by the statements in the PROC step:

[Lines 1–30 of the DATA step presented in Chapter 5 would appear here]
19 D F 43 1000 5 2 7 3
20 D F 44  300 4 3 5 7
21 I F 38  100 5 2 4 1
22 D F 47  200 3 7 1 4
;
PROC CHART DATA=D1;
   VBAR POL_PRTY;
   TITLE1 'JANE DOE';
RUN;
Output produced by the SAS program. Output 6.1 shows the bar chart that is created by the preceding statements.

JANE DOE                                                             1

[Vertical frequency bar chart. The vertical axis plots Frequency from 1 to 12; the horizontal axis plots POL_PRTY with one bar for each value: D (height 12), I (height 4), and R (height 6).]

Output 6.1. Results of PROC CHART performed on POL_PRTY.
The name of the variable that is being analyzed appears below the horizontal axis of the bar chart. You can see that POL_PRTY is the variable being analyzed in this case. The values that this variable assumed are used as labels for the bars in the bar chart. In Output 6.1, the value D (democrats) labels the first bar, I (independents) labels the second bar, and R (republicans) labels the last bar. Frequencies are plotted on the vertical axis of the bar chart. The height of a bar on the frequency axis indicates the frequencies associated with each value of the variable being analyzed. For example, Output 6.1 shows a frequency of 12 for subjects coded with a D, a frequency of 4 for subjects coded with an I, and a frequency of 6 for subjects coded with an R. In other words, this bar chart shows that there were 12 democrats, 4 independents, and 6 republicans in your data set.
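If you would rather have the bars run horizontally, you can substitute the HBAR statement mentioned earlier for the VBAR statement. The following sketch requests a horizontal version of the same chart; horizontal bar charts produced by PROC CHART typically also list frequency statistics beside the bars:

PROC CHART DATA=D1;
   HBAR POL_PRTY;       /* HBAR requests horizontal rather than vertical bars */
   TITLE1 'JANE DOE';
RUN;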
Creating a Frequency Bar Chart for a Numeric Variable

When you create a bar chart for a character variable, SAS will create a separate bar for each value that your character variable includes. You can see this in Output 6.1, where separate bars were created for the values D, I, and R. However, if you create a bar chart for a numeric variable (and the numeric variable assumes a relatively large number of values), SAS will typically "group" your data, and create a bar chart in which the various bars are labeled with the midpoint for each group.

The PROC step. Here are the statements necessary to create a frequency bar chart for the numeric variable AGE:

PROC CHART DATA=D1;
   VBAR AGE;
   TITLE1 'JANE DOE';
RUN;
Output produced by the SAS program. Output 6.2 shows the bar chart that is created by the preceding statements.

JANE DOE                                                             1

[Vertical frequency bar chart. The vertical axis plots Frequency from 1 to 9; the horizontal axis is labeled AGE Midpoint and shows one bar for each interval midpoint: 33 (height 1), 39 (height 3), 45 (height 9), 51 (height 7), and 57 (height 2).]

Output 6.2. Results of PROC CHART performed on AGE.
Notice that the horizontal axis is now labeled “AGE Midpoint.” The various bars in the chart are labeled with the midpoints for the groups that they represent. Table 6.1 summarizes the way that the AGE values were grouped:
Table 6.1
Criteria Used for Grouping Values of the AGE Variable
_____________________________________________________
Interval       An observation is placed in this
midpoint       interval if AGE scores fell in this range
_____________________________________________________
   33          30 ≤ AGE Score < 36
   39          36 ≤ AGE Score < 42
   45          42 ≤ AGE Score < 48
   51          48 ≤ AGE Score < 54
   57          54 ≤ AGE Score < 60
_____________________________________________________
In Table 6.1, the first value under "Interval midpoint" is 33. To the right of this midpoint is the entry "30 ≤ AGE Score < 36." This means that if a given value of AGE is greater than or equal to 30 and also is less than 36, it is placed into the interval that is identified with a midpoint of 33. The remainder of Table 6.1 can be interpreted in the same way.

The first bar in Output 6.2 shows that there is a frequency of 1 for the group identified with the midpoint of 33. This means that there was only one person whose age was in the interval from 30 to 35. The remaining bars in Output 6.2 show that

•  there was a frequency of "3" for the group identified with the midpoint of 39
•  there was a frequency of "9" for the group identified with the midpoint of 45
•  there was a frequency of "7" for the group identified with the midpoint of 51
•  there was a frequency of "2" for the group identified with the midpoint of 57.
Creating a Frequency Bar Chart Using the LEVELS Option

The preceding section showed that when you analyze a numeric variable with PROC CHART, SAS may group the values on that variable and identify each bar in the bar chart with the interval midpoint. But what if SAS does not group these values into the number of bars that you want? Is there a way to override the SAS System's default approach to grouping these values?

Fortunately, there is. PROC CHART provides a number of options that enable you to control the number and nature of the bars that appear in the bar chart. For example, the LEVELS option provides one of the easiest approaches for controlling the number of bars that will appear.

The syntax. Here is the syntax for the PROC step in which the LEVELS option is used to control the number of bars:

PROC CHART  DATA=data-set-name;
   VBAR  variable-list / LEVELS=desired-number-of-bars ;
   TITLE1  ' your-name ';
RUN;

For example, suppose that you want to have exactly six bars in your chart. The following statements would request this:

PROC CHART DATA=D1;
   VBAR AGE / LEVELS=6;
   TITLE1 'JANE DOE';
RUN;
Output produced by the SAS program. Output 6.3 shows the chart that is created by the preceding statements.

JANE DOE                                                             1

[Vertical frequency bar chart with six bars. The vertical axis plots Frequency from 1 to 8; the horizontal axis is labeled AGE Midpoint and shows the midpoints 32.5, 37.5, 42.5, 47.5, 52.5, and 57.5, with bar heights of 1, 3, 5, 8, 3, and 2, respectively.]
Output 6.3. Results of PROC CHART using the LEVELS option.
Notice that there are now six bars in the chart, as requested in the PROC step. Notice also that the midpoints in the chart have been changed to accommodate the fact that there are now bars for six groups, rather than five.
Creating a Frequency Bar Chart Using the MIDPOINTS Option

PROC CHART also enables you to specify exactly what you want the midpoints to be for the various bars in the figure. You can do this by using the MIDPOINTS option in the VBAR statement. With this approach, you can either list the exact values that each midpoint should assume, or provide a range and an interval. Both approaches are illustrated below.

Listing the exact midpoints. If your bar chart will have a small number of bars, you might want to specify the exact value for each midpoint. Here is the syntax for the PROC step that specifies midpoint values:

PROC CHART  DATA=data-set-name;
   VBAR  variable-list / MIDPOINTS=desired-midpoints ;
   TITLE1  ' your-name ';
RUN;

For example, suppose that you want to use the midpoints 20, 30, 40, 50, 60, 70, and 80 for the bars in your chart. The following statements would request this:

PROC CHART DATA=D1;
   VBAR AGE / MIDPOINTS=20 30 40 50 60 70 80;
   TITLE1 'JANE DOE';
RUN;

Providing a range and an interval. Writing the MIDPOINTS option in the manner illustrated above can be tedious if you want to have a large number of bars on your chart. In these situations, it may be easier to use the keywords TO and BY with the MIDPOINTS option. This allows you to specify the lowest midpoint, the highest midpoint, and the interval that separates each midpoint. For example, the following MIDPOINTS option asks SAS to create midpoints that range from 20 to 80, with each midpoint separated by an interval of 10 units:

PROC CHART DATA=D1;
   VBAR AGE / MIDPOINTS=20 TO 80 BY 10;
   TITLE1 'JANE DOE';
RUN;
Output produced by the SAS program. Output 6.4 shows the bar chart that is created by the preceding statements.

JANE DOE                                                             1

[Vertical frequency bar chart. The vertical axis plots Frequency from 1 to 11; the horizontal axis is labeled AGE Midpoint and shows all of the requested midpoints: 20, 30, 40, 50, 60, 70, and 80. The bars have heights of 1 (midpoint 30), 8 (midpoint 40), 11 (midpoint 50), and 2 (midpoint 60); no bars appear at the midpoints 20, 70, and 80.]

Output 6.4. Results of PROC CHART using the MIDPOINTS option.
You can see that all of the requested midpoints appear on the horizontal axis of Output 6.4. This is the case despite the fact that some of the midpoints have no bar at all, indicating a frequency of zero. For example, the last midpoint on the horizontal axis is 80, but there is no bar for this group. This is because, as you may remember, the oldest subject in your data set was 59 years old; thus there were no subjects in the group with a midpoint of 80.
Creating a Frequency Bar Chart Using the DISCRETE Option

Earlier sections have shown that when you use PROC CHART to create a frequency bar chart for a character variable (such as POL_PRTY or SEX), it will automatically create a separate bar for each value that the variable includes. However, it will not typically do this with a numeric variable; when numeric variables assume many values, PROC CHART will normally group the data, and label the axis with the midpoints for each group. But what if you want to create a separate bar for each observed value of your numeric variable? In this case, you simply specify the DISCRETE option in the VBAR statement.

The syntax. Here is the syntax for the PROC step that will cause a separate bar to be printed for each observed value of a numeric variable:

PROC CHART  DATA=data-set-name;
   VBAR  variable-list / DISCRETE ;
   TITLE1  ' your-name ';
RUN;

The following statements again create a frequency bar chart for AGE. This time, however, the DISCRETE option is used to create a separate bar for each observed value of AGE:

PROC CHART DATA=D1;
   VBAR AGE / DISCRETE;
   TITLE1 'JANE DOE';
RUN;
Output produced by the SAS program. Output 6.5 shows the bar chart that is created by the preceding statements.

JANE DOE                                                             1

[Vertical frequency bar chart with one narrow bar for each observed value of AGE. The vertical axis plots Frequency from 1 to 4; the horizontal axis is labeled AGE and shows the values 33, 36, 38, 42, 43, 44, 47, 48, 49, 50, 52, 55, and 59. The bar heights match the frequencies reported in Output 5.1: for example, a height of 4 at age 47 and heights of 3 at ages 44 and 49.]
Output 6.5. Results of PROC CHART using the DISCRETE option.
In Output 6.5, the bars are narrower than in the previous examples because there are more of them. There is now one bar for each value that AGE actually assumed in the data set. You can see that the first bar indicates the number of people at age 33, the second bar indicates the number of people at age 36, and so on. Notice that there are no bars labeled 34 or 35, as there were no people at these ages in your data set.
Using PROC CHART to Plot Means for Subgroups

Plotting Means versus Frequencies

Earlier sections of this chapter have shown how to use PROC CHART to create frequency bar charts. However, it is also possible to use PROC CHART to create subgroup-mean bar charts. These are figures in which the bars represent the means for various subgroups on a particular, quantifiable criterion.

For example, consider the political donation questionnaire that was presented in Chapter 5. One of the items on the questionnaire was "I believe that our federal government is generally doing a good job." Subjects were asked to indicate the extent to which they agreed or disagreed with this statement by circling a number from 1 ("Disagree Very Strongly") to 7 ("Agree Very Strongly"). Responses to this question were given the SAS variable name Q1 in the SAS program. A separate item on the questionnaire asked subjects to indicate whether they were democrats, republicans, or independents. Responses to this question were given the SAS variable name POL_PRTY.

It would be interesting to see if there are any differences between the ways that democrats, republicans, and independents responded to the question about whether the federal government is doing a good job (variable Q1). In fact, it is possible to compute the mean score on Q1 for each of these subgroups, and to plot these subgroup means on a chart. The following section shows how to do this with PROC CHART.

The PROC Step

Following is the syntax for the PROC CHART statements that create a subgroup-mean bar chart:

PROC CHART  DATA=data-set-name;
   VBAR  group-variable / SUMVAR=criterion-variable  TYPE=MEAN;
   TITLE1  ' your-name ';
RUN;

The second line of the preceding syntax includes the following:

   VBAR  group-variable

Here, group-variable refers to the SAS variable that codes group membership. In your program, you list POL_PRTY as the group variable because POL_PRTY indicates whether a given subject is a democrat, a republican, or an independent (the three groups that you want to compare).

The second line of the syntax also includes the following:

   / SUMVAR=criterion-variable

The slash (/) indicates that options will follow. You use the SUMVAR= option to identify the criterion variable in your analysis. This criterion variable is the variable on which means will be computed. In the present example, you want to compute mean scores on Q1, the item asking whether the federal government is doing a good job. Therefore, you will include SUMVAR=Q1 in your final program.

Finally, the second line of the syntax ends with

   TYPE=MEAN;

This option specifies that PROC CHART should compute group means on the criterion variable, as opposed to computing group sums on the criterion variable. If you wanted to compute group sums on the criterion variable, this option would be typed as:

   TYPE=SUM;

Following are the statements that request that PROC CHART create a bar chart to plot means for the three groups on the variable Q1:

PROC CHART DATA=D1;
   VBAR POL_PRTY / SUMVAR=Q1  TYPE=MEAN;
   TITLE1 'JANE DOE';
RUN;
Output Produced by the SAS Program Output 6.6 shows the bar chart that is created by the preceding PROC step. JANE DOE
1
Q1 Mean | ***** 4 + ***** ***** | ***** ***** | ***** ***** ***** | ***** ***** ***** | ***** ***** ***** 3 + ***** ***** ***** | ***** ***** ***** | ***** ***** ***** | ***** ***** ***** | ***** ***** ***** 2 + ***** ***** ***** | ***** ***** ***** | ***** ***** ***** | ***** ***** ***** | ***** ***** ***** 1 + ***** ***** ***** | ***** ***** ***** | ***** ***** ***** | ***** ***** ***** | ***** ***** ***** -------------------------------------------D I R POL_PRTY Output 6.6. Results of PROC CHART in which the criterion variable is Q1 and the grouping variable is POL_PRTY.
When this type of analysis is performed, the name of the grouping variable appears as the label for the horizontal axis. Output 6.6 shows that POL_PRTY is the grouping variable for the current analysis. The values that POL_PRTY can assume appear as labels for the individual bars in the chart. The labels for the three bars are D (democrats), I (independents), and R (republicans). In the frequency bar charts that were presented earlier in this chapter, the vertical axis plotted frequencies. However, with the type of analysis reported here, the vertical axis reports mean scores for the various groups on the criterion variable. The heading for the vertical axis is now Q1 Mean, meaning that you use this axis to determine each group’s mean score on the criterion variable, Q1.
Chapter 6: Creating Graphs 177
The height of a particular bar on the vertical axis indicates the mean score for that group on Q1 (the question about whether the federal government is doing a good job). Output 6.6 shows that •
The democrats (the bar labeled D) had a mean score that was just above 4.0 on the criterion variable, Q1.
•
The independents (the bar labeled I) had a mean score that was about 4.0.
•
The republicans (the bar labeled R) had a mean score that was just below 4.0.
Remember that 1 represented “Disagree Very Strongly” and 7 represented “Agree Very Strongly.” A score of 4 represented “Neither Agree nor Disagree.” The mean scores presented in Output 6.6 show that all three groups had a mean score close to 4.0, meaning that their mean scores were all close to the response “Neither Agree nor Disagree.” The mean score for the democrats was a bit higher (a bit closer to “Agree”), and the mean score for the republicans was a bit lower (a bit closer to “Disagree”), although we have no way of knowing whether these differences are statistically significant at this point.
Conclusion

This chapter has shown you how to use PROC CHART to create frequency bar charts and subgroup-mean bar charts. Summarizing results graphically in charts such as these can make it easier for you to identify trends in your data at a glance. These figures can also make it easier to communicate your findings to others.

When presenting your findings, it is also common to report measures of central tendency (such as the mean or the median) and measures of variability (such as the standard deviation or the interquartile range). Chapter 7, "Measures of Central Tendency and Variability," shows you how to use PROC MEANS and PROC UNIVARIATE to compute these measures, along with other measures of central tendency and variability. Chapter 7 also shows you how to create stem-and-leaf plots that can be reviewed to determine the general shape of a particular distribution of scores.
Measures of Central Tendency and Variability Introduction.........................................................................................181 Overview...............................................................................................................181 Why It Is Important to Compute These Measures.................................................181 Reprise of Example 5.1: The Political Donation Study.......................181 The Study .............................................................................................................181 SAS Variable Names ............................................................................................182 Measures of Central Tendency: The Mode, Median, and Mean.........183 Overview...............................................................................................................183 Writing the SAS Program......................................................................................183 Output Produced by the SAS Program .................................................................184 Interpreting the Mode Computed by PROC UNIVARIATE....................................185 Interpreting the Median Computed by PROC UNIVARIATE .................................186 Interpreting the Mean Computed by PROC UNIVARIATE....................................186 Interpreting a Stem-and-Leaf Plot Created by PROC UNIVARIATE ...187 Overview...............................................................................................................187 Output Produced by the SAS Program .................................................................187 Using PROC UNIVARIATE to Determine the Shape of Distributions ...................................................................................190 Overview...............................................................................................................190 Variables Analyzed ...............................................................................................190 An Approximately Normal Distribution ..................................................................191 A Positively Skewed Distribution...........................................................................194
   A Negatively Skewed Distribution ..... 196
   A Bimodal Distribution ..... 198
Simple Measures of Variability: The Range, the Interquartile Range, and the Semi-Interquartile Range ..... 200
   Overview ..... 200
   The Range ..... 200
   The Interquartile Range ..... 202
   The Semi-Interquartile Range ..... 203
More Complex Measures of Variability: The Variance and Standard Deviation ..... 204
   Overview ..... 204
   Relevant Terms and Concepts ..... 204
   Conceptual Formula for the Population Variance ..... 205
   Conceptual Formula for the Population Standard Deviation ..... 206
Variance and Standard Deviation: Three Formulas ..... 207
   Overview ..... 207
   The Population Variance and Standard Deviation ..... 207
   The Sample Variance and Standard Deviation ..... 208
   The Estimated Population Variance and Standard Deviation ..... 209
Using PROC MEANS to Compute the Variance and Standard Deviation ..... 210
   Overview ..... 210
   Computing the Sample Variance and Standard Deviation ..... 210
   Computing the Population Variance and Standard Deviation ..... 212
   Computing the Estimated Population Variance and Standard Deviation ..... 212
Conclusion ..... 214
Introduction

Overview

This chapter shows you how to perform simple procedures that help describe and summarize data. You will learn how to use PROC UNIVARIATE to compute the measures of central tendency that are most frequently used in research: the mode, the median, and the mean. You will also learn how to create and interpret a stem-and-leaf plot, which can be helpful in understanding the shape of a variable's distribution. Finally, you will learn how to use PROC MEANS to compute the variance and standard deviation of quantitative variables. You will learn how to compute the sample standard deviation and variance, as well as the estimated population standard deviation and variance.

Why It Is Important to Compute These Measures

There are a number of reasons why you need to be able to perform these procedures. At a minimum, the output produced by PROC MEANS and PROC UNIVARIATE will help you verify that you made no obvious errors in entering data or writing your program. It is always important to verify that your data are correct before going on to analyze them with more sophisticated procedures. In addition, most research journals require that you report simple descriptive statistics (e.g., means and standard deviations) for the variables you analyze. Finally, many of the later chapters in this guide build on the more basic concepts presented here. For example, in Chapter 9 you will learn how to create a type of standardized variable called a "z score" by using standard deviations, a concept taught in this chapter.
Reprise of Example 5.1: The Political Donation Study

The Study

This chapter illustrates PROC UNIVARIATE and PROC MEANS by analyzing data from the fictitious political donation study presented in Chapter 5, "Creating Frequency Tables." In that chapter, you were asked to suppose that you are a political scientist conducting research on campaign finance. You developed a questionnaire and administered it to 22 individuals. With the questionnaire, subjects provided demographic information about themselves (e.g., sex, age), indicated the size of political donations they had made recently, and responded to four items designed to assess some of their political beliefs (sample item: "I believe that our federal government is generally doing a good job"). Subjects responded to these items using a 7-point response format in which "1" = "Disagree Very Strongly" and "7" = "Agree Very Strongly."
SAS Variable Names

In entering your data, you used the following SAS variable names to represent the variables measured:

• The SAS variable SUB_NUM contains unique subject numbers assigned to each subject.

• The SAS variable POL_PRTY represents the political party to which the subject belongs. In entering your data, you used the value "D" to represent democrats, "R" to represent republicans, and "I" to represent independents.

• The SAS variable SEX represents the subject's sex, with the value "F" representing females and "M" representing males.

• The SAS variable AGE represents the subject's age in years.

• The SAS variable DONATION represents the size of the political donation (in dollars) that each subject made in the past year.

• The SAS variable Q1 represents subject responses to the first question using the "Agree-Disagree" format. You typed a "1" if the subject circled a "1" (for "Disagree Very Strongly"), you typed a "2" if the subject circled a "2" (for "Disagree Strongly"), and so forth.

• In the same way, the SAS variables Q2, Q3, and Q4 represent subject responses to the second, third, and fourth questions using the "Agree-Disagree" format.
Chapter 5, "Creating Frequency Tables," provides a copy of the questionnaire that was used to obtain the preceding data. It also presents the complete SAS DATA step that reads the data into a SAS data set.
Measures of Central Tendency: The Mode, Median, and Mean

Overview

When you assess some numeric variable (such as subject age) in the course of conducting a research study, you will typically obtain a variety of different scores––a distribution of scores. If you write an article about your study for a research journal, you will need some mechanism to describe your obtained distribution of scores. Readers will want to know what the most "typical" or "representative" score on your variable was; in the present case, they would want to know the most typical or representative age in your sample. To convey this, you will probably report one or more measures of central tendency. A measure of central tendency is a value that represents the location of a sample of data by indicating where the center of the distribution lies. There are a variety of different measures of central tendency, and each uses a somewhat different approach for determining just what the "center" of the distribution is. The measures of central tendency used most frequently in the behavioral sciences and education are the mode, the median, and the mean. This section discusses the differences among them and shows how to compute them using PROC UNIVARIATE.

Writing the SAS Program

The PROC step. The UNIVARIATE procedure in SAS can produce a wide variety of indices for describing a distribution. These include the sample size, the standard deviation, the skewness, the kurtosis, percentiles, and other indices. This section, however, focuses on just three: the mode, the median, and the mean. Below is the syntax for the PROC step that requests PROC UNIVARIATE:

   PROC UNIVARIATE   DATA=data-set-name   options ;
      VAR variable-list ;
      TITLE1 'your-name';
   RUN;
This chapter illustrates how to use a PROC UNIVARIATE statement that requests the usual default statistics, along with two options: The PLOT option (which requests a stem-and-leaf plot), and the NORMAL option (which requests statistics that test the null hypothesis that the sample data were drawn from a normally distributed population). The current section discusses only the default output (which includes the mode, median, and mean). Later sections cover the stem-and-leaf plot and the tests for normality.
Here are the actual statements requesting that PROC UNIVARIATE be performed on the variable AGE:

   PROC UNIVARIATE   DATA=D1   PLOT   NORMAL;
      VAR AGE;
      TITLE1 'JANE DOE';
   RUN;
Where these statements should appear in the SAS program. Remember that the PROC step of a SAS program should generally come after the DATA step. Chapter 5, "Creating Frequency Tables," provided the DATA step for the political donation data that will be analyzed here. To give you a sense of where the PROC UNIVARIATE statement should go, here is a reproduction of the last few data lines from the data set, followed by the statements in the current PROC step:

   [Lines 1-30 of the DATA step presented in Chapter 5 would appear here]
   20 D F 44   300  4 3 5 7
   21 I F 38   100  5 2 4 1
   22 D F 47   200  3 7 1 4
   ;
   PROC UNIVARIATE   DATA=D1   PLOT   NORMAL;
      VAR AGE;
      TITLE1 'JANE DOE';
   RUN;
Output Produced by the SAS Program

The preceding statements produced two pages of output that provide plenty of information about the variable being analyzed. The output provides:

• a moments table that includes the sample size, mean, standard deviation, variance, skewness, and kurtosis, along with other statistics

• a basic statistical measures table that includes measures of location (the mean, median, and mode), as well as measures of variability (the standard deviation, variance, range, and interquartile range)

• a tests for normality table that includes statistical tests of the null hypothesis that the sample was drawn from a normally distributed population

• a quantiles table that provides the median, 25th percentile, 75th percentile, and related information

• an extremes table that provides the five highest values and five lowest values on the variable being analyzed

• a stem-and-leaf plot, box plot, and normal probability plot.
Interpreting the Mode Computed by PROC UNIVARIATE

The mode (also called the modal score) is the most frequently occurring value or score in a sample. Technically, the mode can be assessed with either quantitative or nonquantitative (nominal-level) variables, although this chapter focuses on quantitative variables because the UNIVARIATE procedure is designed for quantitative variables only. When you are working with numeric variables, the mode is a useful measure of central tendency to report when the distribution has more than one mode, is skewed, or is markedly nonnormal in some other way. PROC UNIVARIATE prints the mode as part of the "Basic Statistical Measures" table in its output. The Basic Statistical Measures table from the current analysis of the variable AGE is reproduced here as Output 7.1.

                     Basic Statistical Measures

           Location                    Variability

       Mean     46.04545     Std Deviation            6.19890
       Median   47.00000     Variance                38.42641
       Mode     47.00000     Range                   26.00000
                             Interquartile Range      6.00000
Output 7.1. The mode as it appears in the Basic Statistical Measures table from PROC UNIVARIATE; the variable analyzed is AGE.
The mode appears on the last line of the “Location” section of the table. You can see that, for this data set, the mode is 47. This means that the most frequently occurring score on AGE was 47. You can verify that this is the case by reviewing output reproduced earlier in this guide, in Chapter 6, “Creating Graphs.” Output 6.5 contains the results of running PROC CHART on AGE; it shows that the value “47” has the highest frequency. A word of warning about the mode as computed by PROC UNIVARIATE: when there is more than one mode for a given variable, PROC UNIVARIATE prints only the mode with the lowest numerical value. For example, suppose that there were two modes for the current data set: imagine that 10 people were at age 25, and 10 additional people were at age 35. This means that the two most common scores on AGE would be 25 and 35. PROC UNIVARIATE would report only one mode for this variable: it would report 25 (because 25 was the mode with the lowest numerical value). When you have more than one mode, a note at the bottom of the Basic Statistical Measures table indicates the number of modes that were observed. This situation is discussed in the section, "A Bimodal Distribution," that appears later in this chapter. See Output 7.12.
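If you need to see every mode rather than only the lowest one, the MODES option of the PROC UNIVARIATE statement requests a table of all modes. The statements below are a sketch rather than part of the original example, and they assume that your release of SAS supports the MODES option:

   PROC UNIVARIATE   DATA=D1   MODES;
      * MODES prints a table listing each mode and how often it occurs. ;
      VAR AGE;
      TITLE1 'JANE DOE';
   RUN;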
Interpreting the Median Computed by PROC UNIVARIATE

The median (also called the median score) is the score located at the 50th percentile. This means that the median is the score below which 50% of all data appear. For example, suppose that you administer a test worth 100 points to a very large sample of students. If 50% of the students obtain a score below 71 points, then the median is 71. The median can be computed only from some type of quantitative data. It is a particularly useful measure of central tendency when you are working with an ordinal (ranked) variable. It is also useful when you are working with interval- or ratio-level variables that display a skewed distribution. Along with the mode, PROC UNIVARIATE also prints the median as part of the Basic Statistical Measures table in its output. The table from the present analysis of the variable AGE is reproduced as Output 7.2.

                     Basic Statistical Measures

           Location                    Variability

       Mean     46.04545     Std Deviation            6.19890
       Median   47.00000     Variance                38.42641
       Mode     47.00000     Range                   26.00000
                             Interquartile Range      6.00000
Output 7.2. The median as it appears in the Basic Statistical Measures table from PROC UNIVARIATE; the variable analyzed is AGE.
Output 7.2 shows that the median for the current data set is 47. You can see that, in this data set, the median and the mode happen to be the same number: 47. This result is not unusual, especially when the data set has a symmetrical distribution.

Interpreting the Mean Computed by PROC UNIVARIATE

The mean is the score that is located at the mathematical center of a distribution. It is computed by (a) summing the scores and (b) dividing by the number of observations. The mean is useful as a measure of central tendency for numeric variables that are assessed at the interval- or ratio-level, particularly when they display fairly symmetrical, unimodal distributions (later, you will see that the mean can be dramatically affected when a distribution is skewed). You may have noticed that the mean was printed as part of the Basic Statistical Measures table in Output 7.1 and 7.2. The same mean is also printed as part of the "Moments" table produced by PROC UNIVARIATE. The moments table from the current PROC UNIVARIATE analysis of AGE appears here in Output 7.3.
                                Moments

    N                          22    Sum Weights                    22
    Mean               46.0454545    Sum Observations             1013
    Std Deviation      6.19890369    Variance               38.4264069
    Skewness           -0.2047376    Kurtosis               0.19131126
    Uncorrected SS          47451    Corrected SS           806.954545
    Coeff Variation    13.4625746    Std Error Mean         1.32161071
Output 7.3. The N (sample size) and the mean as they appear in the Moments table from PROC UNIVARIATE; the variable analyzed is AGE.
Output 7.3 provides many statistics from the analysis of AGE, but this section focuses on just two. First, to the right of the heading “N” you will find the number of valid (usable) observations on which these analyses were based. Here, you can see that N = 22, which means that scores on AGE were analyzed for 22 subjects. Second, to the right of the heading “Mean” you will find the mean score for AGE. You can see that the mean score on AGE is 46.045 for the current sample. Again, this is fairly close to the mode and median of 47, which is fairly common for distributions that are largely symmetrical.
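If you want just the sample size and the mean without the rest of the PROC UNIVARIATE output, PROC MEANS (introduced later in this chapter) can be asked for only those statistics. The following is a minimal sketch, not part of the original program:

   PROC MEANS DATA=D1 N MEAN;
      * N and MEAN are statistic keywords that limit the printed output. ;
      VAR AGE;
      TITLE1 'JANE DOE';
   RUN;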
Interpreting a Stem-and-Leaf Plot Created by PROC UNIVARIATE

Overview

A stem-and-leaf plot is a special type of chart for plotting frequencies. It is particularly useful for understanding the shape of the distribution; i.e., for determining whether the distribution is approximately normal, as opposed to being skewed, bimodal, or in some other way nonnormal. The current section shows you how to interpret the stem-and-leaf plot generated by PROC UNIVARIATE. The section that follows provides examples of variables with normal and nonnormal distributions.

Output Produced by the SAS Program

Earlier this chapter indicated that, when you include the PLOT option in the PROC UNIVARIATE statement, SAS produces three figures. These figures are a stem-and-leaf plot, a box plot, and a normal probability plot. This section focuses only on the stem-and-leaf plot, which is reproduced here as Output 7.4.
   Stem Leaf                     #   Boxplot
      5 59                       2      0
      5 022                      3      |
      4 77778999                 8   +--+--+
      4 23444                    5   +-----+
      3 688                      3      |
      3 3                        1      0
        ----+----+----+----+
    Multiply Stem.Leaf by 10**+1
Output 7.4. Stem-and-leaf plot for the variable AGE produced by PROC UNIVARIATE.
Remember that the variable being analyzed in this case is AGE. In essence, the stem-and-leaf plot indicates what values appear in the data set for the variable AGE, and how many occurrences of each value appear.

Interpreting the "stems" and "leaves." Each potential value of AGE is separated into a "stem" and a "leaf." The "stem" for a given value appears under the heading "Stem." The "leaf" for each value appears under the heading "Leaf." This concept is easier to understand one score at a time. For example, immediately under the heading "Stem," you can see that the first stem is "5." Immediately under the heading "Leaf" you see a "5" and a "9." This excerpt from Output 7.4 is reproduced here:

   Stem Leaf
      5 59

Connecting the stem ("5") to the first leaf (also a "5") gives you the first potential value that AGE took on: "55." This means that the data set included one subject at age 55. Similarly, connecting the stem ("5") to the second leaf (the "9") gives you the second potential value that AGE took on: "59." In short, the plot tells you that one subject had a score on AGE of "55," and one had a score of "59." Now move down one line. The stem on the second line is again a "5," but you now have different leaves: a "0," a "2," and another "2" (see below). Connecting the stem to these leaves tells you that, in your data set, one subject had a score on AGE of "50," one had a score of "52," and another had a score of "52."

   Stem Leaf
      5 59
      5 022
One last example: Move down to the third line. The stem on the third line is now a "4." Further, you now have eight leaves: four leaves are a "7," one leaf is an "8," and three leaves are a "9," as follows:

   Stem Leaf
      5 59
      5 022
      4 77778999

If you connect the stem ("4") to these individual leaves, you learn that, in your data set, there were four subjects who had a score on AGE of "47," one subject who had a score of "48," and three subjects who had a score of "49." The remainder of the stem-and-leaf plot can be interpreted in the same way. In summary, you can see that the stem-and-leaf plot is similar to a frequency bar chart, except that it is set on its side: the values that the variable took on appear on the vertical axis, and the frequencies are plotted along the horizontal axis. This is the reverse of what you saw with the vertical bar charts created by PROC CHART in the previous chapter.

Interpreting the note at the bottom of the plot. There is one more feature in the stem-and-leaf plot that requires explanation. Output 7.4 shows that the following note appears at the bottom of the plot:

   Multiply Stem.Leaf by 10**+1

To understand the meaning of this note, you need to mentally insert a decimal point into the stem-leaf values you have just reviewed. Those values are again reproduced here:

   Stem Leaf
      5 59
      5 022
      4 77778999
      4 23444
      3 688
      3 3
Notice the blank space that separates each stem from its leaves. For example, in the first line, there is a blank space that separates the stem “5” from the leaves “5” and “9.” Technically, you are supposed to read this blank space as a decimal point (.). This means that the values in the first line are actually 5.5 (for the subject whose age was 55) and 5.9 (for the subject whose age was 59). The note at the bottom of the page tells you how to move this decimal point so that the values will return to their original metric. For this plot, the note says “Multiply Stem.Leaf by 10**+1.” This means “multiply the stem-leaf by 10 raised to the first power.” The number 10 raised to the first power is, of course, 10. So what happens to the stem-leaf 5.5 when it is multiplied by 10? It becomes 55, the subject’s actual score on AGE. And what happens to the stem-leaf 5.9 when it is multiplied by 10? It becomes 59, another subject’s actual score on AGE.
And that is how you interpret a stem-and-leaf plot. Whenever you type in a new data set, you should routinely create stem-and-leaf plots for each of your numeric variables. This will help you identify obvious errors in data entry, and will help you visualize the shape of your distribution (i.e., help you determine whether it is skewed, bimodal, or in any other way nonnormal). The next section of this chapter shows you how to do this.
Using PROC UNIVARIATE to Determine the Shape of Distributions

Overview

This guide often describes a sample of scores as displaying an "approximately normal" distribution. This means that the distribution of scores more or less follows the bell-shaped, symmetrical pattern of the normal curve. It is generally wise to review the shape of a sample of data prior to analyzing it with more sophisticated inferential statistics. This is because some inferential statistics require that your sample be drawn from a normally distributed population. When the values that you have obtained in a sample display a marked departure from normality (such as a strong skew), then it becomes doubtful that your sample was drawn from a population with a normal distribution. In some cases, this will mean that you should not analyze the data with certain types of inferential statistics. This section illustrates several different shapes that a distribution may display. Using stem-and-leaf plots, it shows how a sample may appear (a) when it is approximately normal, (b) when it is positively skewed, (c) when it is negatively skewed, and (d) when it may have multiple modes. It also shows how each type of distribution affects the mode, median, and mean computed by PROC UNIVARIATE.

Variables Analyzed

As was discussed earlier, Chapter 5, "Creating Frequency Tables," provided a fictitious political donation questionnaire. The last four items on the questionnaire presented subjects with statements, and asked the subjects to indicate the extent to which they agreed or disagreed with each statement. They responded by using a 7-point scale in which "1" represented "Disagree Very Strongly" and "7" represented "Agree Very Strongly." Responses to these four items were given the SAS variable names Q1, Q2, Q3, and Q4, respectively. This section shows some of the results produced when PROC UNIVARIATE was used to analyze responses to these items. Here are the SAS statements requesting that PROC UNIVARIATE be performed on Q1, Q2, Q3, and Q4. Notice that the PROC UNIVARIATE statement itself contains the PLOT
option (which will cause stem-and-leaf plots to be created), as well as the NORMAL option (which requests tests of the null hypothesis that the data were sampled from a normally distributed population).

   PROC UNIVARIATE   DATA=D1   PLOT   NORMAL;
      VAR Q1 Q2 Q3 Q4;
      TITLE1 'JANE DOE';
   RUN;
These statements resulted in four sets of output: one set of PROC UNIVARIATE output for each of the four variables.

An Approximately Normal Distribution

The stem-and-leaf plot. The SAS variable Q1 represents responses to the statement, "I believe that our federal government is generally doing a good job." Output 7.5 presents the stem-and-leaf plot of fictitious responses to this question.

   Stem Leaf                 #   Boxplot
      7 0                    1      |
      6 00                   2      |
      5 0000                 4   +-----+
      4 00000000             8   *--+--*
      3 0000                 4   +-----+
      2 00                   2      |
      1 0                    1      |
        ----+----+----+----+
Output 7.5. Stem-and-leaf plot of an approximately normal distribution produced by the PROC UNIVARIATE analysis of Q1.
The stem-and-leaf plot appears on the left side of Output 7.5. The box plot for the same variable appears on the right (for guidelines on how to interpret a box plot, see Schlotzhauer and Littell [1997], or Tukey [1977] ). The first line of the stem-and-leaf plot shows a stem-leaf combination of “7 0,” which means that the variable included one value of 7.0. The second line shows the stem “6,” along with two leaves of “0” and “0.” This means that the sample included two subjects with a score of 6.0. The remainder of the plot can be interpreted in the same fashion. Notice that there is no note at the bottom of the plot saying anything such as “Multiply Stem.Leaf by 10**+1.” This means that these stem-leaf values already display the appropriate unit of measurement; it is not necessary to multiply them by 10 (or any other value). The stem-and-leaf plot shows that Q1 has a symmetrical distribution centered around the score of 4.0 (in this context, the word symmetrical means that the tail extending above a
score of 4.0 is the same length as the tail extending below a score of 4.0). A response of "4" on the questionnaire corresponds to the answer "Neither Agree nor Disagree," so it appears that the most common response was for subjects to neither agree nor disagree with the statement, "I believe that our federal government is generally doing a good job." Based on the physical appearance of the stem-and-leaf plot, Q1 appears to have an approximately normal shape.

The mean, median, and mode. Output 7.6 provides the Basic Statistical Measures table and the Tests for Normality table produced by PROC UNIVARIATE in its analysis of Q1.

                     Basic Statistical Measures

           Location                    Variability

       Mean     4.000000     Std Deviation            1.41421
       Median   4.000000     Variance                 2.00000
       Mode     4.000000     Range                    6.00000
                             Interquartile Range      2.00000


                        Tests for Normality

       Test                  --Statistic---    -----p Value------

       Shapiro-Wilk          W     0.957881    Pr < W      0.4477
       Kolmogorov-Smirnov    D                 Pr > D      0.0573
       Cramer-von Mises      W-Sq              Pr > W-Sq   0.0686
       Anderson-Darling      A-Sq              Pr > A-Sq   0.1388
Output 7.6. Basic Statistical Measures table and Tests for Normality table produced by the PROC UNIVARIATE analysis of Q1.
Output 7.6 shows that, for the variable Q1, the mean is 4, the median is 4, and the mode is also 4. You may remember that the mean, median, and mode are expected to be the same number when the distribution being analyzed is symmetrical (i.e., when neither tail of the distribution is longer than the other). The stem-and-leaf plot of Output 7.5 has already shown that the distribution is symmetrical, so it makes sense that the mean, median, and mode of Q1 would all be the same value.
The test for normality. In Output 7.6, the section headed "Tests for Normality" provides the results from four statistics. Each of these statistics tests the null hypothesis that the sample was drawn from a normally distributed population. This section focuses on just one of these tests: the results for the Shapiro-Wilk W statistic appear in the row headed "Shapiro-Wilk."

In the Tests for Normality section, one of the columns is headed "Statistic." This column provides the obtained value for the statistic. In the present analysis, you can see that the obtained value for the Shapiro-Wilk W statistic is 0.957881, which rounds to .96. The next column in the Tests for Normality section is headed "p Value," which stands for "probability value." Where the row headed "Shapiro-Wilk" intersects with the column headed "p Value," you can see that the obtained p value for the current statistic is 0.4477 (this appears to the immediate right of the heading "Pr < W").

In general, a p value represents the probability that you would obtain the present statistic if the null hypothesis were true. The smaller the p value is, the less likely it is that the null hypothesis is true. A standard rule of thumb is that, if a p value is less than .05, you should assume that it is very unlikely that the null hypothesis is true, and you should reject the null hypothesis. In the present analysis, the p value of the Shapiro-Wilk statistic represents the probability that you would obtain a W statistic of this size if your sample were drawn from a normally distributed population. In general, when this p value is less than .05, you may reject the null hypothesis and conclude that your sample was probably not drawn from a normally distributed population. When the p value is greater than .05, you should fail to reject the null hypothesis and tentatively conclude that your sample probably was drawn from a normally distributed population.

In the current analysis, you can see that this p value is 0.4477 (look below the heading "p Value"). Because this p value is greater than the criterion of .05, you fail to reject the null hypothesis of normality. In other words, you tentatively conclude that your sample probably did come from a normally distributed population. In most cases, this is good news because many statistical procedures require that your data be drawn from normally distributed populations.

As is the case with many statistics, this statistic is sensitive to sample size in that it tends to be very powerful (sensitive) with large samples. This means that the W statistic may imply that your sample did not come from a normal distribution even when the sample shows only a very minor departure from normality. So keep the sample size in mind, and use caution in interpreting the results of this test, particularly when your sample size is large.
A Positively Skewed Distribution

What is a skewed distribution? A distribution is skewed if one tail is longer than the other. It shows positive skewness if the longer tail of the distribution points in the direction of higher values. In contrast, negative skewness means that the longer tail points in the direction of lower values. The present section illustrates a positively skewed distribution, and the following section covers negative skewness.

The stem-and-leaf plot. In the political donation study, the second agree-disagree item stated, "The federal government should raise taxes." Responses to this question were represented by the SAS variable Q2. The stem-and-leaf plot created when PROC UNIVARIATE analyzed Q2 is reproduced here as Output 7.7.

   Stem Leaf                 #   Boxplot
      7 0                    1      *
      6 0                    1      0
      5 0                    1      0
      4 00                   2      |
      3 00000                5   +-----+
      2 00000000             8   *--+--*
      1 0000                 4      |
        ----+----+----+----+
Output 7.7. Stem-and-leaf plot of a positively skewed distribution produced by the PROC UNIVARIATE analysis of Q2.
The stem-and-leaf plot of Output 7.7 shows that most responses to item Q2 appear around the number “2,” meaning that most subjects circled the number “2” or some nearby number. This seems reasonable, because, on the questionnaire, the response number “2” represented “Disagree Strongly,” and it makes sense that many people would disagree with the statement “The federal government should raise taxes.” However, the stem-and-leaf plot shows that a small number of people actually agreed with the statement. It shows that two people circled “4” (for “Neither Agree nor Disagree”), one person circled “5” (for “Agree”), one person circled “6” (for “Agree Strongly”), and one person circled “7” (for “Agree Very Strongly”). These responses created a long tail for the distribution––a tail that stretches out in the direction of higher numbers (such as “6” and “7”). This means that the distribution for Q2 is a positively skewed distribution. The mean, median, and mode. You can also determine whether a distribution is skewed by comparing the mean to the median. These statistics are presented in Output 7.8.
                     Basic Statistical Measures

           Location                    Variability

       Mean     2.772727     Std Deviation            1.60154
       Median   2.000000     Variance                 2.56494
       Mode     2.000000     Range                    6.00000
                             Interquartile Range      1.00000

Output 7.8. Basic Statistical Measures table produced by the PROC UNIVARIATE analysis of Q2.
Chapter 8: Creating and Modifying Variables and Data Sets

SAS expressions, such as those used in IF-THEN statements, can include the following comparison operators:

   EQ or =      is equal to
   NE or ^=     is not equal to
   GT or >      is greater than
   GE or >=     is greater than or equal to
   LT or <      is less than
   LE or <=     is less than or equal to
The Syntax of the IF-THEN Statement

The syntax is as follows:

   IF expression THEN statement ;
The "expression" usually consists of some comparison involving existing variables. The "statement" usually involves some operation performed on existing variables or new variables. To illustrate the use of the IF-THEN statement, this section again refers to the fictitious Learning Aptitude Test (abbreviated "LAT") presented earlier in this guide. A previous chapter indicated that the LAT included a verbal subtest as well as a mathematical subtest. Suppose that you have obtained LAT scores for a sample of subjects, and now wish to create a new variable called LATVGRP, which is an abbreviation for "LAT-verbal group." This variable will be created with the following provisions:

• If you do not know what a subject's LAT Verbal test score is, that subject will have a score of "." (for "missing data") on LATVGRP.

• If the subject's score is under 500 on the LAT Verbal test, the subject will have a score of 0 on LATVGRP.

• If the subject's score is 500 or greater on the LAT Verbal test, the subject will have a score of 1 on LATVGRP.
Suppose that the variable LATV already exists in your data set and that it contains each subject's actual score on the LAT Verbal test. You can now use it to create the new variable, LATVGRP, by writing the following statements:

   LATVGRP = .;
   IF LATV LT 500 THEN LATVGRP = 0;
   IF LATV GE 500 THEN LATVGRP = 1;

The preceding statements tell SAS to create a new variable called LATVGRP, and begin by setting everyone's score as equal to "." (missing). If a subject's score on LATV is less than 500, then SAS sets his or her score on LATVGRP as equal to 0. If a subject's score on LATV is greater than or equal to 500, then SAS sets his or her score on LATVGRP as equal to 1.
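After creating a recoded variable such as LATVGRP, it is often worth confirming that the recoding behaved as intended. The statements below are a sketch rather than part of the original example; they assume that the IF-THEN statements above appear in a DATA step that creates a data set named D1:

   PROC FREQ DATA=D1;
      * A frequency table for LATVGRP shows how many subjects fell into each group. ;
      TABLES LATVGRP;
      TITLE1 'JANE DOE';
   RUN;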
Using ELSE Statements

You could have performed the preceding operations more efficiently by using the ELSE statement. Here is the syntax for using the ELSE statement with the IF-THEN statement:

   IF expression THEN statement ;
      ELSE IF expression THEN statement ;

The ELSE statement provides alternative actions that SAS may take if the original IF expression is not true. For example, consider the following:

   1     LATVGRP = .;
   2     IF LATV LT 500 THEN LATVGRP = 0;
   3     ELSE IF LATV GE 500 THEN LATVGRP = 1;
The preceding tells SAS to

• create a new variable called LATVGRP, and initially assign all subjects a value of "missing"

• assign a particular subject a score of 0 on LATVGRP if that subject has an LATV score less than 500

• assign a particular subject a score of 1 on LATVGRP if that subject has an LATV score greater than or equal to 500.
Obviously, the preceding statements were identical to the earlier statements that created LATVGRP, except that the word ELSE has now been added to the beginning of the third line. In fact, these two approaches result in assigning exactly the same values on LATVGRP to each subject. So what, then, is the advantage of including the ELSE statement? The answer has to do with efficiency: When an ELSE statement is included, the actions specified in that statement are executed only if the expression in the preceding IF statement is not true. For example, consider the situation in which subject 1 has a score on LATV that is less than 500. Line 2 in the preceding statements would assign that subject a score of 0 on LATVGRP. SAS would then ignore line 3 (because it contains the ELSE statement) thus saving computer time. If line 3 did not contain the word ELSE, SAS would execute the line, checking to see whether the LATV score for subject 1 is greater than or equal to 500 (which is actually unnecessary, given what was learned in line 2). Regarding missing data, notice that line 2 of the preceding program assigns subjects to group 0 (under LATVGRP) if their scores on LATV are less than 500. Unfortunately, a score of “missing” (.) on LATV is viewed as being less than 500 (actually, SAS views it as being less than 0). This means that subjects with missing data on LATV will be assigned to group 0 under LATVGRP by line 2 of the preceding program. This is not desirable.
To prevent this from happening, you may rewrite the program in the following way:

   1     LATVGRP = .;
   2     IF LATV GE 200 AND LATV LT 500 THEN LATVGRP = 0;
   3     ELSE IF LATV GE 500 THEN LATVGRP = 1;
Line 2 of the program now tells SAS to assign subjects to group 0 only if their scores on LATV are both greater than or equal to 200 and less than 500. This modification uses the conditional AND statement, which is discussed in greater detail in the following section. Finally, remember to use the ELSE statement only in conjunction with a preceding IF statement and always to place it immediately following the relevant IF statement.

Using the Conditional Statements AND and OR

As the preceding section suggests, it is also possible to use the conditional statement AND within an IF-THEN statement or an ELSE statement. For example, consider the following:

   LATVGRP = .;
   IF LATV GT 400 AND LATV LT 500 THEN LATVGRP = 0;
   ELSE IF LATV GE 500 THEN LATVGRP = 1;

The second statement tells SAS "if LATV is greater than 400 and less than 500, then give this subject a score on LATVGRP of 0." This means that subjects are given a score of 0 only if they are both over 400 and under 500. What happens to those who have a score of 400 or less on the LATV? They are given a score of "." on LATVGRP. That is, they are classified as having missing data on LATVGRP. This is because they (along with everyone else) were given a score of "." in the first statement, and neither of the later statements replaces that "." with a 0 or a 1. However, for subjects whose scores are over 400, one of the later statements will replace the "." with either a 0 or a 1.

It is also possible to use the conditional statement OR within an IF-THEN statement or an ELSE statement. For example, suppose that you have a variable in your data set called ETHNIC. With this variable, subjects were assigned the value 5 if they were Caucasian, 6 if they were African-American, or 7 if they were Asian-American. Suppose that you now wish to create a new variable called MAJORITY: Subjects will be assigned a value of 1 on this variable if they are in the majority group (i.e., if they are Caucasians), and they will be assigned a value of 0 on this variable if they are in a minority group (if they are either African-Americans or Asian-Americans). The following statements would create this variable:

   MAJORITY = .;
   IF ETHNIC = 5 THEN MAJORITY = 1;
   ELSE IF ETHNIC = 6 OR ETHNIC = 7 THEN MAJORITY = 0;

In the preceding statements, all subjects are first assigned a value of "missing" on MAJORITY. If their value on ETHNIC is 5, their score on MAJORITY is changed to 1,
and SAS ignores the following ELSE statement. If their value on ETHNIC is not 5, then SAS moves on to the ELSE statement. There, if the subject's value on ETHNIC is either 6 or 7, the subject is assigned a value of 0 on MAJORITY.

Working with Character Variables

Using single quotation marks. When you are working with character variables (variables in which the values may consist of letters or special characters), it is important that you enclose values within single quotation marks in the IF-THEN and ELSE statements.

Converting character values to numeric values. For example, suppose that you administered the achievement motivation questionnaire (from the beginning of this chapter) to a sample of subjects. This questionnaire asked subjects to identify their sex. In entering the data, you created a character variable called SEX in which the value "F" represented female subjects and the value "M" represented male subjects. Suppose that you now wish to create a new variable called SEX2. SEX2 will be a numeric variable in which the value "0" is used to represent females and the value "1" is used to represent males. The following SAS statements could be used to create and print this new variable:

            [The first part of the DATA step appears here]
   18       7 5 5 6 1 5 F 21 3
   19       8 2 3 1 5 2 F 25 3
   20       9 4 5 4 2 5 M 23 3
   21       ;
   22       DATA D2;
   23          SET D1;
   24
   25          SEX2 = .;
   26          IF SEX = 'F' THEN SEX2 = 0;
   27          IF SEX = 'M' THEN SEX2 = 1;
   28
   29       PROC PRINT DATA=D2;
   30          VAR SEX SEX2;
   31          TITLE1 'JANE DOE';
   32       RUN;
Some notes about the preceding program:

• The last data lines from the achievement motivation study appear on lines 18–20 (to save space, only the last few lines are presented).

• A new DATA step begins on line 22. This was necessary, because you can use IF-THEN control statements only within a DATA step to create a new variable.

• Line 25 tells SAS to create a new variable called "SEX2," and begin by assigning missing data (.) to all subjects on SEX2.

• Line 26 tells SAS that, if a given subject's value on SEX is equal to "F," then her value on SEX2 should be "0."

• Line 27 tells SAS that, if a given subject's value on SEX is equal to "M," then his value on SEX2 should be "1."
On line 26, notice that the "F" is enclosed within single quotation marks. This was necessary because SEX was a character variable. However, when "SEX2" is set to zero on line 26, the zero is not enclosed within single quotation marks. This is because SEX2 is not a character variable––it is a numeric variable. Use single quotation marks only to enclose character variable values. Output 8.5 presents the results generated by the preceding program.

                  JANE DOE

           Obs    SEX    SEX2

            1      F       0
            2      M       1
            3      M       1
            4      F       0
            5      M       1
            6      F       0
            7      F       0
            8      F       0
            9      M       1
Output 8.5. Results of PROC PRINT in which SEX and SEX2 were listed in the VAR statement, achievement motivation study.
Output 8.5 presents each subject's values for the variables SEX and SEX2. Notice that, if a given subject's value for SEX is "F," her value on SEX2 is equal to zero; if a given subject's value on SEX is "M," his value on SEX2 is equal to 1. This is as expected, given the preceding IF-THEN control statements.

Converting numeric values to character values. The same conventions apply when you convert numeric values to character values: Within the IF-THEN control statements, the values for character variables should be enclosed within single quotation marks, but the values for numeric variables should not be enclosed within single quotation marks. For example, you may remember that the achievement motivation questionnaire asked subjects to indicate their major. That section of the questionnaire is reproduced here:

   8. What is your major?

      ______  Arts and Sciences  (1)
      ______  Business           (2)
      ______  Education          (3)
In entering the data, you created a numeric variable called MAJOR. You used the value "1" to represent subjects majoring in the arts and sciences, the value "2" to represent subjects majoring in business, and the value "3" to represent subjects majoring in education. Suppose that you now wish to create a new variable called MAJOR2, which will also identify the area in which the subjects majored. However, MAJOR2 will be a character variable, and the values of MAJOR2 will be three characters long. Specifically,

• the value "A&S" represents subjects majoring in the arts and sciences

• the value "BUS" represents subjects majoring in business

• the value "EDU" represents subjects majoring in education.
The following statements use IF-THEN control statements to create the new MAJOR2 variable:

   1     MAJOR2 = '  .';
   2     IF MAJOR = 1 THEN MAJOR2 = 'A&S';
   3     IF MAJOR = 2 THEN MAJOR2 = 'BUS';
   4     IF MAJOR = 3 THEN MAJOR2 = 'EDU';
Some notes about the preceding program:

• Line 1 creates the new variable MAJOR2 and initially assigns missing data to all subjects. It does this with the following statement:

      MAJOR2 = '  .';

• Notice that there is room for three characters within the single quotation marks. Within the single quotation marks there are two blank spaces and a single period (to represent "missing data"). This was important, because the subsequent IF-THEN statements assign 3-character values to MAJOR2.

• Line 2 indicates that, if a given subject has a value of "1" on MAJOR, then that subject's value on MAJOR2 should be "A&S."

• Line 3 indicates that, if a given subject has a value of "2" on MAJOR, then that subject's value on MAJOR2 should be "BUS."

• Line 4 indicates that, if a given subject has a value of "3" on MAJOR, then that subject's value on MAJOR2 should be "EDU."
Once again, notice that when line 2 includes the expression

   IF MAJOR = 1

the "1" does not appear within single quotation marks. This is because MAJOR is a numeric variable. However, when line 2 includes the statement

   THEN MAJOR2 = 'A&S';
the "A&S" does appear within single quotation marks. This is because MAJOR2 is a character variable. If a later section of your program included a PROC PRINT statement to print the contents of MAJOR and MAJOR2, the results would look something like Output 8.6.

               JANE DOE

        Obs    MAJOR    MAJOR2

         1       1       A&S
         2       1       A&S
         3       1       A&S
         4       2       BUS
         5       2       BUS
         6       2       BUS
         7       3       EDU
         8       3       EDU
         9       3       EDU
Output 8.6. Results of the PROC PRINT in which MAJOR and MAJOR2 were listed in the VAR statement, achievement motivation study.
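The PROC PRINT statements that would produce a listing like Output 8.6 are not shown above. A minimal sketch, assuming the MAJOR2 statements appear in a DATA step that creates a data set named D2, might look like this:

   PROC PRINT DATA=D2;
      * Print only the original and recoded major variables. ;
      VAR MAJOR MAJOR2;
      TITLE1 'JANE DOE';
   RUN;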
Data Subsetting

Overview

An earlier section of this chapter indicated that data subsetting statements are SAS statements that eliminate unwanted observations from a sample, so that only a specified subgroup is included in the resulting data set. Often, it is necessary to perform an analysis on only a subset of the subjects who are included in the data set. For example, you may wish to review the mean survey responses provided by just the female subjects. Or, you may wish to review mean survey responses provided by just those subjects majoring in the arts and sciences. Subsetting IF statements may be used to obtain these results.

The Syntax of Data Subsetting Statements

Here is the syntax for the statements that perform data subsetting:

   DATA new-data-set-name ;
      SET existing-data-set-name ;
      IF comparison ;

The "comparison" generally includes some existing variable and at least one comparison operator.
An Example

The SAS program. For example, suppose you wish to compute mean survey responses for just those subjects who are majoring in the arts and sciences. The following statements accomplish this:

            [The first part of the DATA step appears here]
   18       7 5 5 6 1 5 F 21 3
   19       8 2 3 1 5 2 F 25 3
   20       9 4 5 4 2 5 M 23 3
   21       ;
   22       DATA D2;
   23          SET D1;
   24
   25          IF MAJOR = 1;
   26
   27       PROC MEANS DATA=D2;
   28          TITLE1 'JANE DOE--ARTS AND SCIENCES MAJORS';
   29       RUN;
Some notes about the preceding program:

• The last data lines from the achievement motivation study appear on lines 18–20 (to save space, only the last few lines are presented).

• A new DATA step begins on line 22. This was necessary, because you can perform data subsetting only within a DATA step.

• Lines 22–23 tell SAS to create a new data set, name it "D2," and create it as a duplicate of data set D1.

• Line 25 tells SAS to retain a given observation for D2 only if that observation's value on MAJOR is equal to "1." This will retain in the data set D2 only those subjects who majored in the arts and sciences (because the number "1" was used to represent this group under the MAJOR variable).

• Lines 27–29 request that PROC MEANS be performed on the data set.

• Line 27 includes the option "DATA=D2," which specifies that PROC MEANS should be performed on the data set D2. This makes sense, because D2 is the data set that contains just the arts and sciences majors.
Output 8.7 presents the results generated by the preceding program.

                  JANE DOE--ARTS AND SCIENCES MAJORS

   The MEANS Procedure

   Variable   N          Mean       Std Dev       Minimum       Maximum
   ---------------------------------------------------------------------
   SUB_NUM    3     2.0000000     1.0000000     1.0000000     3.0000000
   Q1         3     4.3333333     2.0816660     2.0000000     6.0000000
   Q2         2     3.0000000     2.8284271     1.0000000     5.0000000
   Q3         3     3.6666667     1.5275252     2.0000000     5.0000000
   Q4         3     3.3333333     1.5275252     2.0000000     5.0000000
   Q5         3     4.6666667     2.3094011     2.0000000     6.0000000
   AGE        3    25.6666667     4.0414519    22.0000000    30.0000000
   MAJOR      3     1.0000000             0     1.0000000     1.0000000
   ---------------------------------------------------------------------

Output 8.7. Results of the PROC MEANS performed on data set consisting of arts and sciences majors only, achievement motivation study.
Some notes about this output: PROC MEANS was performed on all of the numeric variables in the data set because the VAR statement had been omitted from the SAS program. In the column headed “N,” you can see that, for most variables, the analyses were performed on just three subjects. This makes sense, because Table 8.1 showed that just three subjects were majoring in the arts and sciences. To the right of the variable name “MAJOR,” you can see that the mean score on MAJOR is 1.0, the minimum value for MAJOR is 1.0, and the maximum value for MAJOR is 1.0. This is what you would expect if your data set consisted exclusively of arts and sciences majors: Each of them should have a value on MAJOR that is equal to “1.”
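A subgroup analysis like this one can often be obtained without creating a separate data set, by placing a WHERE statement inside the PROC step. This alternative is not covered in the text above; the sketch below assumes the full achievement motivation data set D1:

   PROC MEANS DATA=D1;
      WHERE MAJOR = 1;   * Restrict this analysis to arts and sciences majors. ;
      TITLE1 'JANE DOE--ARTS AND SCIENCES MAJORS';
   RUN;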
An Example with Multiple Subsets

It is possible to write a single program that creates multiple data sets, with each data set consisting of a different subgroup of subjects. This is done with the following program:

            [The first part of the DATA step appears here]
   18       7 5 5 6 1 5 F 21 3
   19       8 2 3 1 5 2 F 25 3
   20       9 4 5 4 2 5 M 23 3
   21       ;
   22       DATA D2;
   23          SET D1;
   24          IF MAJOR = 1;
   25       PROC MEANS DATA=D2;
   26          TITLE1 'JANE DOE--ARTS AND SCIENCES MAJORS';
   27       RUN;
   28
   29       DATA D3;
   30          SET D1;
   31          IF MAJOR = 2;
   32       PROC MEANS DATA=D3;
   33          TITLE1 'JANE DOE--BUSINESS MAJORS';
   34       RUN;
   35
   36       DATA D4;
   37          SET D1;
   38          IF MAJOR = 3;
   39       PROC MEANS DATA=D4;
   40          TITLE1 'JANE DOE--EDUCATION MAJORS';
   41       RUN;
Some notes about the preceding program:

• Lines 22–24 create a new data set named D2. The subsetting IF statement on line 24 ensures that this data set will contain only subjects with a value of "1" on the variable MAJOR. This means that the data set will contain only arts and sciences majors. Lines 25–27 request that PROC MEANS be performed on this data set.

• Lines 29–31 create a new data set named D3. The subsetting IF statement on line 31 ensures that this data set will contain only subjects with a value of "2" on the variable MAJOR. This means that the data set will contain only business majors. Lines 32–34 request that PROC MEANS be performed on this data set.

• Lines 36–38 create a new data set named D4. The subsetting IF statement on line 38 ensures that this data set will contain only subjects with a value of "3" on the variable MAJOR. This means that the data set will contain only education majors. Lines 39–41 request that PROC MEANS be performed on this data set.
Specifying the initial data set in the SET statement. Notice that, throughout the preceding program, the SET statement always specifies "D1," as shown here:

   SET D1;

This is because the data set D1 was the initial data set and the only data set that contained all of the initial observations. When creating a new data set that will consist of a subset of this initial data set, you will usually want to specify the initial data set in your SET statement.

Specifying the current data set in the PROC statements. PROC MEANS statements appear on lines 25, 32, and 39 of the preceding program. Notice that, in each case, the "DATA=" option of the PROC MEANS statement always specifies the data set that has just been created. Line 25 reads "PROC MEANS DATA=D2," line 32 reads "PROC MEANS DATA=D3," and line 39 reads "PROC MEANS DATA=D4." This ensures that the first PROC MEANS is performed on the data set containing just the arts and sciences majors, the second PROC MEANS is performed on the data set containing just the business majors, and the third PROC MEANS is performed on the data set containing just the education majors.

Using Comparison Operators and the Conditional Statements AND and OR

When writing a subsetting IF statement, you may use all of the comparison operators described above (such as "LT" or "GE") as well as the conditional statements AND and OR. For example, suppose that you have created an initial data set named D1 that contains the SAS variables SEX (which represents subject sex) and AGE (which represents subject age). You now wish to create a second data set named D2, and a subject will be retained in D2 only if she is a female and she is 65 years of age or older. The following statements will accomplish this:

   DATA D2;
      SET D1;
      IF SEX = 'F' AND AGE GE 65;
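To verify that only the intended subjects were retained, you could follow this subsetting step with a PROC PRINT such as the one sketched below (not part of the original example):

   PROC PRINT DATA=D2;
      * List the two variables used in the subsetting condition. ;
      VAR SEX AGE;
      TITLE1 'JANE DOE--FEMALES AGE 65 OR OLDER';
   RUN;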
Eliminating Observations That Have Missing Data on Some Variables

Overview. One of the most common difficulties encountered by researchers in the social sciences and education is the problem of missing data. Briefly, the missing data problem involves not having scores on all variables for all subjects in a data set.
Missing data in the achievement motivation study. To illustrate the concept of missing data, Table 8.1 is reproduced here as Table 8.2:

   Table 8.2
   Data from the Achievement Motivation Study
   _____________________________________________________
                   Agree-Disagree
                      Questions
                 __________________
   Subject       Q1  Q2  Q3  Q4  Q5   Sex  Age  Major
   _____________________________________________________
   1. Marsha      6   5   5   2   6    F    22    1
   2. Charles     2   1   2   5   2    M    25    1
   3. Jack        5   .   4   3   6    M    30    1
   4. Cathy       5   6   6   .   6    F    41    2
   5. Emmett      4   4   5   2   5    M    22    2
   6. Marie       5   6   6   2   6    F    20    2
   7. Cindy       5   5   6   1   5    F    21    3
   8. Susan       2   3   1   5   2    F    25    3
   9. Fred        4   5   4   2   5    M    23    3
   _____________________________________________________
Table 8.2 uses a single period (.) to represent missing data. The table reveals missing data for the third subject (Jack) on the variable Q2: There is a single period in the location where you would expect to see Jack’s score for Q2. Similarly, the table also reveals missing data for the fourth subject (Cathy) on variable Q4.
In Chapter 4, "Data Input," you learned that you should also use a single period to represent missing data when entering data in a SAS data set. This was shown in the section "SAS Program to Read the Raw Data," earlier in this chapter. The initial DATA step from that program is reproduced below:

    1       OPTIONS  LS=80  PS=60;
    2       DATA D1;
    3          INPUT   SUB_NUM
    4                  Q1
    5                  Q2
    6                  Q3
    7                  Q4
    8                  Q5
    9                  SEX $
   10                  AGE
   11                  MAJOR;
   12       DATALINES;
   13       1 6 5 5 2 6 F 22 1
   14       2 2 1 2 5 2 M 25 1
   15       3 5 . 4 3 6 M 30 1
   16       4 5 6 6 . 6 F 41 2
   17       5 4 4 5 2 5 M 22 2
   18       6 5 6 6 2 6 F 20 2
   19       7 5 5 6 1 5 F 21 3
   20       8 2 3 1 5 2 F 25 3
   21       9 4 5 4 2 5 M 23 3
   22       ;
Line 15 from the preceding program contains data from the subject Jack. You can see that a single period appears in the location where you would normally expect Jack’s score on Q2. In the same way, line 16 contains data from the subject Cathy. You can see that a single period appears in the location where you would normally expect Cathy’s score on Q4. Eliminating observations with missing data from a new data set. Suppose that you now wish to create a new data set named D2. The new data set will be identical to the initial data set (D1) with one exception: D2 will contain only observations that have no missing data on the five achievement motivation questionnaire items (Q1, Q2, Q3, Q4, and Q5). In other words, you wish to include a subject in the new data set only if the subject answered all five of the achievement motivation questionnaire items. Once you have created the new data set, you will use PROC PRINT to print it out.
The following statements accomplish this:

      [The first part of the DATA step appears here]
19    7 5 5 6 1 5 F 21 3
20    8 2 3 1 5 2 F 25 3
21    9 4 5 4 2 5 M 23 3
22    ;
23    DATA D2;
24       SET D1;
25       IF Q1 NE . AND
26          Q2 NE . AND
27          Q3 NE . AND
28          Q4 NE . AND
29          Q5 NE . ;
30
31    PROC PRINT DATA=D2;
32       TITLE1 'JANE DOE';
33    RUN;
Some notes about the preceding program:

•  The last data lines from the achievement motivation study appear on lines 19–21.

•  A new DATA step begins on line 23. This was necessary, because you can perform data subsetting only within a DATA step.

•  Lines 23–24 tell SAS to create a new data set, name it "D2," and initially create it as a duplicate of data set D1.

•  Lines 25–29 contain a single subsetting IF statement. The comparison operator "NE" that appears in the statement stands for "is not equal to." This subsetting IF statement tells SAS to retain a given observation for data set D2 only if all of the following are true:
   •  Q1 is not equal to "missing data"
   •  Q2 is not equal to "missing data"
   •  Q3 is not equal to "missing data"
   •  Q4 is not equal to "missing data"
   •  Q5 is not equal to "missing data."

•  With this subsetting IF statement, a given subject is retained in data set D2 only if he or she had no missing data on any of the five variables listed.

•  Lines 31–33 contain the PROC PRINT statement that prints the new data set. Notice that the "DATA=D2" option on line 31 specifies that D2 should be printed, rather than D1.
Output 8.8 presents the results generated by the preceding program.

                                JANE DOE                                1

    Obs   SUB_NUM   Q1   Q2   Q3   Q4   Q5   SEX   AGE   MAJOR

     1       1       6    5    5    2    6    F     22      1
     2       2       2    1    2    5    2    M     25      1
     3       5       4    4    5    2    5    M     22      2
     4       6       5    6    6    2    6    F     20      2
     5       7       5    5    6    1    5    F     21      3
     6       8       2    3    1    5    2    F     25      3
     7       9       4    5    4    2    5    M     23      3

Output 8.8. Results of the PROC PRINT performed on data set D2, achievement motivation study.
Notice that there are only seven observations in Output 8.8. The initial data set (D1) contained nine observations, but two of these (observations for subjects Jack and Cathy) contained missing data, and were therefore not included in data set D2. If you look at the values for variables Q1, Q2, Q3, Q4, and Q5, you will not see any single periods indicating missing data.
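As a side note, SAS also provides a shorthand for this kind of screening: the NMISS function returns the number of missing values among the numeric variables listed as its arguments. The subsetting IF statement on lines 25–29 could therefore be written more compactly, as in the following sketch (the data set name D2 is reused here purely for illustration):

   DATA D2;
      SET D1;
      *  Keep a subject only if none of the five items is missing;
      IF NMISS(Q1, Q2, Q3, Q4, Q5) = 0;

Either version produces the same seven observations shown in Output 8.8; this guide uses the longer form because it makes the logic explicit.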
Combining a Large Number of Data Manipulation and Data Subsetting Statements in a Single Program

Overview

Most of the SAS programs presented in this chapter have been fairly simple in that only a few data manipulation or data subsetting statements have been included in each program. In practice, however, it is possible to include––within a single program––a relatively large number of statements that modify variables and data sets. This section provides an example of such a program.

A Longer SAS Program

This section presents a fairly long SAS program. This program includes data from the achievement motivation study (first presented in Table 8.1) that has been analyzed throughout this chapter. You will see that this single program includes a wide variety of statements that perform most of the tasks discussed in this chapter.
1     OPTIONS LS=80 PS=60;
2     DATA D1;
3        INPUT SUB_NUM
4              Q1
5              Q2
6              Q3
7              Q4
8              Q5
9              SEX $
10             AGE
11             MAJOR;
12    DATALINES;
13    1 6 5 5 2 6 F 22 1
14    2 2 1 2 5 2 M 25 1
15    3 5 . 4 3 6 M 30 1
16    4 5 6 6 . 6 F 41 2
17    5 4 4 5 2 5 M 22 2
18    6 5 6 6 2 6 F 20 2
19    7 5 5 6 1 5 F 21 3
20    8 2 3 1 5 2 F 25 3
21    9 4 5 4 2 5 M 23 3
22    ;
23    PROC PRINT DATA=D1;
24       TITLE1 'JANE DOE';
25    RUN;
26
27    DATA D2;
28       SET D1;
29
30       Q4 = 7 - Q4;
31       ACH_MOT = (Q1 + Q2 + Q3 + Q4 + Q5) / 5;
32
33       AGE2 = .;
34       IF AGE LT 25 THEN AGE2 = 0;
35       IF AGE GE 25 THEN AGE2 = 1;
36
37       SEX2 = .;
38       IF SEX = 'F' THEN SEX2 = 0;
39       IF SEX = 'M' THEN SEX2 = 1;
40
41       MAJOR2 = '  .';
42       IF MAJOR = 1 THEN MAJOR2 = 'A&S';
43       IF MAJOR = 2 THEN MAJOR2 = 'BUS';
44       IF MAJOR = 3 THEN MAJOR2 = 'EDU';
45
46    PROC PRINT DATA=D2;
47       TITLE1 'JANE DOE';
48    RUN;
49
50    DATA D3;
51       SET D2;
52       IF MAJOR2 = 'A&S';
53    PROC MEANS DATA=D3;
54       TITLE1 'JANE DOE--ARTS AND SCIENCES MAJORS';
55    RUN;
56
57    DATA D4;
58       SET D2;
59       IF MAJOR2 = 'BUS';
60    PROC MEANS DATA=D4;
61       TITLE1 'JANE DOE--BUSINESS MAJORS';
62    RUN;
63
64    DATA D5;
65       SET D2;
66       IF MAJOR2 = 'EDU';
67    PROC MEANS DATA=D5;
68       TITLE1 'JANE DOE--EDUCATION MAJORS';
69    RUN;
70
71    DATA D6;
72       SET D2;
73       IF Q1 NE . AND
74          Q2 NE . AND
75          Q3 NE . AND
76          Q4 NE . AND
77          Q5 NE . ;
78
79    PROC PRINT DATA=D6;
80       TITLE1 'JANE DOE';
81    RUN;
Some notes concerning the preceding program:

•  Lines 1–22 input the achievement motivation data that were first presented in Table 8.1.

•  Lines 23–25 request that PROC PRINT be performed on the initial data set, D1.

•  Lines 27–28 begin a new DATA step. The new data set is named D2, and is initially created as a duplicate of D1.

•  With line 30, the reversed variable Q4 is recoded.

•  With line 31, the new variable ACH_MOT is created as the average of variables Q1, Q2, Q3, Q4, and Q5.

•  Lines 33–35 create a new variable named AGE2, based on the existing variable named AGE.

•  Lines 37–39 create a new variable named SEX2, based on the existing variable named SEX.

•  Lines 41–44 create a new variable named MAJOR2, based on the existing variable named MAJOR.

•  Lines 46–48 request that PROC PRINT be performed on the new data set, D2.

•  Lines 50–51 begin a new DATA step. The new data set is named D3, and is initially created as a duplicate of D2. This is followed by a subsetting IF statement on line 52 that retains a subject only if his or her value on MAJOR2 is "A&S." This ensures that only arts and sciences majors are retained in the new data set. The statements on lines 53–55 request that PROC MEANS be performed on the new data set.

•  Lines 57–58 begin a new DATA step. The new data set is named D4, and is initially created as a duplicate of D2. This is followed by a subsetting IF statement on line 59 that retains a subject only if his or her value on MAJOR2 is "BUS." This ensures that only business majors are retained in the new data set. The statements on lines 60–62 request that PROC MEANS be performed on the new data set.

•  Lines 64–65 begin a new DATA step. The new data set is named D5, and is initially created as a duplicate of D2. This is followed by a subsetting IF statement on line 66 that retains a subject only if his or her value on MAJOR2 is "EDU." This ensures that only education majors are retained in the new data set. The statements on lines 67–69 request that PROC MEANS be performed on the new data set.

•  Lines 71–72 begin a new DATA step. The new data set is named D6, and is initially created as a duplicate of D2. This is followed by a subsetting IF statement on lines 73–77 that retains a subject only if he or she has no missing data on Q1, Q2, Q3, Q4, or Q5. The statements on lines 79–81 request that PROC PRINT be performed on the new data set.
Some General Guidelines

When writing relatively long SAS programs such as this one, it is important to keep two points in mind. First, remember that you can perform data manipulation or data subsetting only within a DATA step. This means that in most cases you should begin a new DATA step (by using the "DATA" statement) before writing the statements that create new variables, modify existing variables, or create subsets of data.

Second, you must keep track of the names that you give to new data sets, and must specify the correct data set name within a given PROC statement. For example, suppose that you create a data set called D1. In the course of a lengthy SAS program, you create a number of different data sets, all based on D1. Somewhere late in the program, you create a new data set named D5, and within this data set create a new variable named ACH_MOT. You now wish to perform PROC MEANS on ACH_MOT. To do this, you must specify the data set D5 in the PROC MEANS statement, as follows:

   PROC MEANS DATA=D5;
   RUN;
If you specify any other data set (such as D1), you will not obtain the mean for ACH_MOT, because that variable appears only within the data set named D5. In this case, SAS will write an error message to your log file.
Conclusion

This chapter has shown you how to use simple formulas, IF-THEN control statements, subsetting IF statements, and other tools to modify existing data sets. You should now be prepared to perform the types of data manipulation that are most commonly required in research in the social sciences and education.

For example, with these tools it should now be a simple matter for you to convert raw scores into standardized scores. When analyzing data, researchers often like to standardize variables so that they have a known mean (typically zero) and a known standard deviation (typically 1). Scores that have been standardized in this way are called z scores. The following chapter shows you how to use data manipulation statements to create z scores, and illustrates some of the ways that z scores can be used to answer research questions.
Chapter 9: z Scores

Introduction ..................................................................................262
   Overview .......................................................................................262
   Raw-Score Variables versus Standardized Variables .................................262
   Types of Standardized Scores ..............................................................262
   The Advantages of Working with z Scores ..............................................263
Example 9.1: Comparing Mid-Term Test Scores for Two Courses ..............266
   Data Set to Be Analyzed ....................................................................266
   The DATA Step ................................................................................267
Converting a Single Raw-Score Variable into a z-Score Variable ..............268
   Overview .......................................................................................268
   Step 1: Computing the Mean and Sample Standard Deviation ....................269
   Step 2: Creating the z-Score Variable ...................................................270
   Examples of Questions That Can Be Answered with the New z-Score Variable ..276
Converting Two Raw-Score Variables into z-Score Variables ...................278
   Overview .......................................................................................278
   Review: Data Set to Be Analyzed .........................................................278
   Step 1: Computing the Means and Sample Standard Deviations .................279
   Step 2: Creating the z-Score Variables ..................................................280
   Examples of Questions That Can Be Answered with the New z-Score Variables ..284
Standardizing Variables with PROC STANDARD .....................................285
   Isn't There an Easier Way to Do This? ...................................................285
   Why This Guide Used a Two-Step Approach ...........................................286
Conclusion ....................................................................................286
Introduction

Overview

This chapter shows you the advantages of working with standardized variables: variables with specified means and standard deviations. Most of the chapter focuses on z scores: scores that have been standardized to have a mean of zero and a standard deviation of 1. It shows you how to use SAS data manipulation statements to convert raw scores into z scores, and how to interpret the characteristics of a z score (its sign and size) to understand the relative standing of that score within a sample.

Raw-Score Variables versus Standardized Variables

All of the variables presented in this guide so far have been raw-score variables. Raw-score variables are variables that have not been transformed to have a specified mean and standard deviation. For example, if you administer an attitude scale to a group of subjects, compute their scores on the scale, and do not transform their scores in any way, then the attitude scale is a raw-score variable. Depending on the nature of your scale, the sample of scores might have almost any mean or standard deviation.

In contrast, a standardized variable is a variable that has been transformed to have a specified mean and standard deviation. For example, consider the scores on the attitude scale mentioned above. If you wanted to, you could convert these raw scores into z scores. This means that you would transform the variable so that it has a mean of zero and a standard deviation of 1. In this situation the new variable that you create (the group of z scores) is a standardized variable.

Types of Standardized Scores

In the social sciences and education, the z score is probably the most frequently used type of standardized score. A z score is a value that indicates the distance of a raw score from the mean when the distance is measured in standard deviations. In other words, a z score indicates how many standard deviations above (or below) the mean a given raw score is located. By definition, a sample of z scores has a mean of zero and a standard deviation of 1.

Another type of standardized variable that is sometimes used in the social and behavioral sciences is the T score. A sample of T scores has a mean of 50 and a standard deviation of 10. Intelligence quotient (IQ) scores are often standardized as well. A sample of IQ scores is typically standardized so that it has a mean of 100 and a standard deviation of about 15.
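Once you know how to create z scores with a SAS data manipulation statement (as shown later in this chapter), converting them to another standardized metric takes only one more data manipulation statement inside the same DATA step, because a T score is simply 50 plus 10 times the z score. The statement below is offered only as an illustration; it assumes that a z-score variable named FREN_Z has already been created (as it will be later in this chapter), and T_FREN is a hypothetical variable name chosen for this example:

   *  T_FREN is a hypothetical name (FREN_Z is the z-score variable created later in this chapter);
   *  Convert an existing z-score variable into T scores (mean = 50, SD = 10);
   T_FREN = 50 + (10 * FREN_Z);

A similar statement with 100 and 15 in place of 50 and 10 would express the scores on the IQ-like metric described above.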
The Advantages of Working with z Scores

Overview. z scores can be easier to work with than raw scores for a variety of purposes. For example, z scores enable you to immediately determine a particular score's relative position in a sample, and can also make it easier to compare scores on variables that initially had different means and/or standard deviations. These advantages are discussed below.

Immediately determining the relative position of a particular score in a sample. When you look at a z score, you can immediately determine whether the corresponding raw score is above or below the mean, and how far the raw score is from the mean. You do this by viewing the sign and the absolute magnitude of the z score (the absolute magnitude of a number is simply the size of the number, regardless of sign).

Again, assume that you have administered the attitude scale discussed earlier, and have converted the raw scores of the subjects into z scores. The preceding section stated that a sample of z scores has a mean of zero and a standard deviation of 1. You can take advantage of this fact to review a subject's z score and immediately understand that subject's position within the sample. For example, if a subject has a z score of zero, you know that she scored at the mean on the attitude scale. If her z score has a positive sign (e.g., +1.0, +2.0), then you know that she scored above the mean. If her z score has a negative sign (e.g., –1.0, –2.0), then you know that she scored below the mean.

The absolute magnitude of the subject's z score tells you how far away from the mean the corresponding raw score was located, when the raw score was measured in terms of standard deviations. If a z score has a positive sign, it tells you how many standard deviations the corresponding raw score was above the mean. For example, if a subject has a z score of +1.0, this tells you that his raw score was 1 standard deviation above the mean. If the subject has a z score of +2.0, his raw score was 2 standard deviations above the mean. The same holds true for z scores with a negative sign, except that these z scores tell you how many standard deviations the corresponding raw score was below the mean. If a given subject has a z score of –1.0, this tells you that his raw score was 1 standard deviation below the mean; if the z score was –2.0, the corresponding raw score was 2 standard deviations below the mean.

So far the discussion has focused on z scores that are whole numbers (such as "1" or "2"), but it is important to remember that z scores are typically carried out to one or more places to the right of the decimal point. It is common, for example, to see z scores with values such as 1.4 or –2.31.

Comparing scores for variables with different means and standard deviations. When you are working with a group of variables that have different means and standard deviations, it is difficult to compare scores across variables. For example, a raw score of 50 may be above the mean on Variable 1, but below the mean on Variable 2.
If you need to make comparisons across variables, it is often a good idea to first convert all raw scores on all variables into z scores. Regardless of the variable being represented, all z scores have the same interpretation (e.g., a z score of 1.0 always means that the corresponding raw score was 1 standard deviation above the mean).

To illustrate this concept in a more concrete way, imagine that you are an admissions officer at a university. Each year 5,000 people apply for admission to your school. Half of them come from states in which college applicants take the Learning Aptitude Test (LAT), an aptitude test that contains three subtests (the Verbal subtest, the Math subtest, and the Analytical subtest). Suppose that the LAT Verbal subtest has a range from 200 to 800, a mean of 500, and a standard deviation of 100. The other half come from states in which college applicants take the Higher Education Aptitude Test (HEAT). This test also consists of three subtests (for verbal, math, and analytical skills), but each subtest has a range, mean, and standard deviation that is different from those found with the LAT. For example, suppose that the HEAT Verbal subtest has a range from 1 to 30, a mean of 15, and a standard deviation of 5.

Suppose that you are reviewing the files of two people who have applied for admission to your university. Applicant A comes from a state that uses the LAT, and her raw score on the LAT Verbal subtest is 600. Applicant B comes from a state that uses the HEAT, and his raw score on the HEAT Verbal subtest is 19. Relatively speaking, which of these two had the higher score? It is very difficult to make this comparison as long as the variables are in raw-score form. However, the comparison becomes much easier once the two variables have been converted into z scores.

The formula for computing a z score is

   z = (X – X̄) / SX

where

   z  = the subject's z score
   X  = the subject's raw score
   X̄  = the sample mean
   SX = the sample standard deviation (remember that N is used in the
        denominator for this standard deviation, not N – 1).
First, we will convert Applicant A's raw score into a z score (remember that Applicant A had a raw score of 600 on the LAT Verbal subtest). Below, we substitute the appropriate values into the formula:

   z = (X – X̄) / SX = (600 – 500) / 100 = 100 / 100 = 1.0

So Applicant A had a z score of 1.0 (she stood 1 standard deviation above the mean).

Next, we convert Applicant B's raw score into a z score (remember that he had a raw score of 19 on the HEAT Verbal subtest). Below, we substitute the appropriate values into the formula (notice that a different mean and standard deviation are used for Applicant B's formula, compared to Applicant A's formula):

   z = (X – X̄) / SX = (19 – 15) / 5 = 4 / 5 = 0.8
So Applicant B had a z score of 0.8 (he stood 8/10ths of a standard deviation above the mean). Earlier, we asked which of the two applicants had the higher score. This question was difficult to answer when the variables were in raw-score form, but is easier to answer now that the variables are in z-score form. The z score for Applicant A (1.0) was slightly higher than the z score for Applicant B (0.8). In terms of entrance exam scores, Applicant A may be a somewhat stronger candidate. This illustrates one of the reasons that z scores are so important in the social sciences and education: very often, you will work with groups of variables that have different means and standard deviations, making it difficult to compare scores from one variable to another. By converting all scores to z scores, you create a common metric that makes it easier to make these comparisons.
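If you wanted SAS to carry out these two hand calculations for you, a few data manipulation statements of the kind discussed in Chapter 8 would do it. The following is only a rough sketch for illustration; the data set name APPLICANT and the variable names APPLICNT, RAW, TESTMEAN, and TESTSD are hypothetical and do not appear elsewhere in this guide:

   DATA APPLICANT;
      *  APPLICANT, APPLICNT, RAW, TESTMEAN, and TESTSD are hypothetical names used only in this sketch;
      INPUT APPLICNT $ TEST $ RAW TESTMEAN TESTSD;
      *  Apply the z-score formula to each raw score;
      Z = (RAW - TESTMEAN) / TESTSD;
      DATALINES;
   A LAT  600 500 100
   B HEAT  19  15   5
   ;
   PROC PRINT DATA=APPLICANT;
      TITLE1 'JANE DOE';
   RUN;

The printout would show a Z value of 1.0 for Applicant A and 0.8 for Applicant B, matching the hand calculations above.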
Example 9.1: Comparing Mid-Term Test Scores for Two Courses

Data Set to Be Analyzed

Suppose that you obtain test scores for 12 college students. All of the students are enrolled in a French course (French 101) and a geology course (Geology 101). All students recently took a mid-term test in each of these two courses. With the test given in French 101, scores could range from 0 to 50. The test given in Geology 101 was longer––scores on that test could range from 0 to 200. Table 9.1 presents the scores that the 12 students obtained on these two tests.

Table 9.1
Mid-Term Test Scores for Students
_______________________________________
Subject          French 101   Geology 101
_______________________________________
01. Fred              50            90
02. Susan             46           165
03. Marsha            45           170
04. Charles           41           110
05. Paul              39           130
06. Cindy             38           150
07. Jack              37           140
08. Cathy             35           120
09. George            34           155
10. John              31           180
11. Marie             29           135
12. Emmett            25           200
_______________________________________
Table 9.1 uses the same conventions that have been used with other tables in this guide: the horizontal rows represent individual subjects, and the vertical columns represent different variables (scores on mid-term tests, in this case). Where the row for a particular student intersects with the column for a particular course, the table provides the student's score on the mid-term test for that course (e.g., the table shows that Fred received a score of 50 on
the French 101 test and a score of 90 on the Geology 101 test; Susan received a score of 46 on the French 101 test and a score of 165 on the Geology 101 test, and so on).

The DATA Step

As you know, the first section of most SAS programs is the DATA step: the section in which the raw data are read to create a SAS data set. The data set of the program used in this example will include all four variables represented in Table 9.1:

•  subject numbers
•  subject names
•  scores on the French 101 test
•  scores on the Geology 101 test.
Technically, it is not necessary to create a SAS variable for subject numbers and subject names in order to compute z scores and perform the other tasks illustrated in this chapter. However, including the subject number and subject name variables will make the output somewhat easier to read. Below is the DATA step from the SAS program that will analyze the data from Table 9.1:

1     OPTIONS LS=80 PS=60;
2     DATA D1;
3        INPUT SUB_NUM
4              NAME $
5              FREN
6              GEOL;
7     DATALINES;
8     01 Fred     50    90
9     02 Susan    46   165
10    03 Marsha   45   170
11    04 Charles  41   110
12    05 Paul     39   130
13    06 Cindy    38   150
14    07 Jack     37   140
15    08 Cathy    35   120
16    09 George   34   155
17    10 John     31   180
18    11 Marie    29   135
19    12 Emmett   25   200
20    ;
Some notes concerning the preceding DATA step:

•  Line 2 assigns this data set the SAS data set name "D1."

•  Line 3 assigns the SAS variable name "SUB_NUM" to represent subject numbers.

•  Line 4 assigns the SAS variable name "NAME" to represent the students' names. Notice that the variable name is followed by a "$" to tell SAS that this will be a character variable.

•  Line 5 assigns the SAS variable name "FREN" to represent the students' scores on the French 101 test.

•  Line 6 assigns the SAS variable name "GEOL" to represent the students' scores on the Geology 101 test.

•  The actual data appear on lines 8–19. You can see that these data were taken directly from Table 9.1.
Converting a Single Raw-Score Variable into a z-Score Variable

Overview

This section shows you how to convert student scores on the French 101 mid-term test into z scores. We can convert scores on the Geology 101 test later. The approach recommended here involves two steps:

•  Step 1: Computing the mean and sample standard deviation for the raw-score variable.

•  Step 2: Using data manipulation statements to create the z-score variable.
This approach requires you to submit your SAS program twice. At Step 1, you will use PROC MEANS to determine the mean and sample standard deviation for raw scores on the French 101 test variable, FREN.

At Step 2, you will add a data manipulation statement to your SAS program. This data manipulation statement will create a new variable to be called FREN_Z, which will be the z-score version of student scores on the French 101 test. The data manipulation statement that creates FREN_Z is the formula for the z score, similar to the one presented earlier in this chapter. After you have created the new z-score variable, you will use PROC PRINT to print out values on the variable, and will use PROC MEANS to obtain descriptive statistics.
Step 1: Computing the Mean and Sample Standard Deviation

The syntax. You will use PROC MEANS to calculate the mean and sample standard deviation for the raw-score variable, FREN. The syntax is presented below:

   PROC MEANS  DATA=data-set-name  VARDEF=N  N  MEAN  STD  MIN  MAX;
      VAR  raw-score-variable ;
      TITLE1 ' your-name ';
   RUN;
In the preceding syntax, one of the options specified was the VARDEF option (see the first line). This VARDEF option specifies the divisor to be used when calculating the standard deviation. If you request VARDEF=N then PROC MEANS will compute the sample standard deviation (the formula for the sample standard deviation uses N as the divisor). In contrast, if you request VARDEF=DF then PROC MEANS will compute the estimated population standard deviation (the formula for the estimated population standard deviation uses N – 1 as the divisor).

This distinction is important because ultimately (at Step 2) you will want to insert the correct type of standard deviation into the formula that creates your z scores. When computing z scores, it is very important that you insert the sample standard deviation into the computational formula for z scores; you generally should not insert the estimated population standard deviation. This means that, when writing the PROC MEANS statement at Step 1, you should specify VARDEF=N. If you leave the VARDEF option out, then PROC MEANS will compute the estimated population standard deviation by default, and you do not want this.

A number of other options are also included in the preceding PROC MEANS statement:

•  N requests that the sample size be printed.

•  MEAN requests that the sample mean be printed.

•  STD requests that the standard deviation be printed.

•  MIN requests that the smallest observed value be printed.

•  MAX requests that the largest observed value be printed.
The remaining sections of the preceding syntax are self-explanatory.
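If you would like to see the effect of the VARDEF option for yourself, one simple, purely optional exercise is to run PROC MEANS on the same variable twice, once with each setting, and compare the values in the Std Dev column. The following sketch assumes the data set D1 and the variable FREN that are used throughout this example:

   *  Sample standard deviation (divisor = N) - this is the divisor used for z scores;
   PROC MEANS  DATA=D1  VARDEF=N   N  MEAN  STD;
      VAR FREN;
   RUN;

   *  Estimated population standard deviation (divisor = N - 1) - the SAS default;
   PROC MEANS  DATA=D1  VARDEF=DF  N  MEAN  STD;
      VAR FREN;
   RUN;

With only 12 observations, the two standard deviations will differ noticeably; with very large samples the difference becomes trivial.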
The actual SAS statements. Below are the actual statements that request that the MEANS procedure be performed on the FREN variable from the current data set (note that "FREN" is specified in the VAR statement):

   PROC MEANS  DATA=D1  VARDEF=N  N  MEAN  STD  MIN  MAX;
      VAR FREN;
      TITLE1 'JANE DOE';
   RUN;
The SAS Output. Output 9.1 presents the results generated by the preceding program.

                               JANE DOE

                        The MEANS Procedure

                     Analysis Variable : FREN

    N            Mean         Std Dev         Minimum         Maximum
   ------------------------------------------------------------------
   12      37.5000000       7.0059499      25.0000000      50.0000000
   ------------------------------------------------------------------

Output 9.1. Results of PROC MEANS performed on the raw-score variable FREN.
Some notes concerning this output:

FREN is the name of the variable being analyzed.

"N" is the number of subjects providing usable data. In this analysis, 12 students provided scores on FREN, as expected.

"Mean" is the sample mean that (at Step 2) will be inserted in your formula for computing z scores. Here, you can see that the sample mean for FREN is 37.50.

"Std Dev" is the sample standard deviation that (at Step 2) also will be inserted in your formula for computing z scores. Here, you can see that the sample standard deviation for FREN rounds to 7.01.

It is usually a good idea to check the "Minimum" and "Maximum" values to verify that there were no obvious typographical errors in the data. Here, the minimum and maximum values were 25 and 50 respectively, which seems reasonable.

Step 2: Creating the z-Score Variable

Overview. The preceding MEANS procedure provided the sample mean and sample standard deviation for the raw-score variable, FREN. You will now include these values in a
SAS data manipulation statement that will use the raw scores included in FREN to create the z-score variable, FREN_Z.

The formula for z scores. Remember that the formula for computing z scores is:

   z = (X – X̄) / SX

where

   z  = the subject's z score
   X  = the subject's raw score
   X̄  = the sample mean
   SX = the sample standard deviation

The SAS data manipulation statement. It is now necessary to convert this generic formula into a SAS data manipulation statement that does the same thing (i.e., that creates z scores by transforming raw scores). Here is the syntax for the SAS data manipulation statement that will do this:

   z-variable = (raw-variable – mean) / standard-deviation;

You can see that both the generic formula for creating z scores, as well as the SAS data manipulation statement presented above, do the same thing:
•  They begin with a subject's score on the raw variable.

•  They subtract the sample mean from that raw-variable score.

•  The resulting difference is then divided by the sample standard deviation.

•  The result is the subject's z score.
Below is the SAS data manipulation statement that creates the z-score variable, FREN_Z. Notice how the variable names, as well as the mean and standard deviation from Step 1, have been inserted in the appropriate locations in this statement:

   FREN_Z = (FREN - 37.50) / 7.01;

This statement tells SAS to create a new variable called FREN_Z by doing the following:
•  Begin with a given subject's score on FREN.

•  Subtract 37.50 from FREN (37.50 is the sample mean from Step 1).

•  The resulting difference should then be divided by 7.01 (7.01 is the sample standard deviation from Step 1).

•  The result is the subject's score on FREN_Z (a z score).
Including the data manipulation statement as part of a new SAS DATA step. In Chapter 8, "Creating and Modifying Variables and Data Sets," you learned that you can only create a new variable within a DATA step. In the present example, you are creating a new variable (FREN_Z) to compute these z scores, and this means that the data manipulation statement that creates FREN_Z must appear within a new DATA step.

In your original SAS program from Step 1, you assigned the name "D1" to your initial SAS data set. After the DATA step, you added the PROC MEANS statements that computed the sample mean and standard deviation. In order to complete the tasks required for Step 2, you can now append new SAS statements to that existing SAS program. Here is one way that you can do this:

•  Begin a new DATA step after the PROC MEANS statements that were used in Step 1.

•  Begin this new DATA step by creating a new data set named "D2." Initially, D2 will be created as a duplicate of the existing data set, D1.

•  After creating this new data set, D2, you will append the data manipulation statement that creates FREN_Z. This will ensure that the new z-score variable, FREN_Z, will be included in D2.
Following is a section of the SAS program that accomplishes this:

      [First part of the DATA step appears here]
1     10 John     31   180
2     11 Marie    29   135
3     12 Emmett   25   200
4     ;
5
6     PROC MEANS  DATA=D1  VARDEF=N  N  MEAN  STD  MIN  MAX;
7        VAR FREN;
8        TITLE1 'JANE DOE';
9     RUN;
10
11    DATA D2;
12       SET D1;
13       FREN_Z = (FREN - 37.50) / 7.01;
Some notes concerning the preceding lines:

•  Lines 1–3 present the last three lines from the data set (to conserve space, only the last few lines are presented here).

•  Lines 6–9 present the PROC MEANS statements that were used in Step 1. Obviously, it was not really necessary to include these statements in the program that you submit at Step 2; if you liked, you could have simply deleted these lines. But it is a good idea to include them so that, within a single program, you will have all of the SAS statements that are used to compute the z scores.

•  Lines 11–13 create the new data set (named D2), initially creating it as a duplicate of D1. Line 13 presents the data manipulation statement that creates the z scores, and includes them in a variable named FREN_Z.
Using PROC PRINT to print out the new z-score variable. After creating the new z-score variable, FREN_Z, you should next use PROC PRINT to print out each subject's value on this variable. Among other things, this will enable you to check your work, to verify that the new variable was created correctly. Chapter 8, "Creating and Modifying Variables and Data Sets," provided the syntax for a PROC PRINT statement. Below are the statements that use PROC PRINT to create a printout listing each subject's value for NAME, FREN, and FREN_Z:

   PROC PRINT DATA=D2;
      VAR NAME FREN FREN_Z;
      TITLE 'JANE DOE';
   RUN;

Notice that, in the PROC PRINT statement, the DATA option specifies that the analysis should be performed using the data set D2. This is important because the variable FREN_Z appears only in D2; it does not appear in D1.

Using PROC MEANS to request descriptive statistics for the new variable. Finally, you use PROC MEANS to obtain simple descriptive statistics (e.g., means, standard deviations) for any new z-score variables that you create. This is useful because you already know that a sample of z scores is supposed to have a mean of zero and a standard deviation of 1. In Step 2, you will review the results of PROC MEANS to verify that your new z-score variable FREN_Z also has a mean of zero and a standard deviation of 1. Here are the statements that request descriptive statistics for the current example:

   PROC MEANS  DATA=D2  VARDEF=N  N  MEAN  STD  MIN  MAX;
      VAR FREN_Z;
      TITLE1 'JANE DOE';
   RUN;
Two points concerning these statements:

•  The DATA option in the PROC MEANS statement specifies that the analysis should be performed on the new data set, D2.

•  The VARDEF option specifies "VARDEF=N," which ensures that the MEANS procedure will compute the sample standard deviation rather than the estimated population standard deviation. This is appropriate when you want to verify that the standard deviation of the z-score variable is close to 1.
Putting it all together. So far, this chapter has presented the SAS statements needed for Step 2. So that you will have a better idea of how all of these statements fit together, below we present (a) the last part of the initial DATA step (from Step 1), and (b) the SAS statements needed to perform the various tasks of Step 2:

   [First part of the DATA step appears here]
   10 John     31   180
   11 Marie    29   135
   12 Emmett   25   200
   ;
   PROC MEANS  DATA=D1  VARDEF=N  N  MEAN  STD  MIN  MAX;
      VAR FREN;
      TITLE1 'JANE DOE';
   RUN;

   DATA D2;
      SET D1;
      FREN_Z = (FREN - 37.50) / 7.01;

   PROC PRINT DATA=D2;
      VAR NAME FREN FREN_Z;
      TITLE 'JANE DOE';
   RUN;

   PROC MEANS  DATA=D2  VARDEF=N  N  MEAN  STD  MIN  MAX;
      VAR FREN_Z;
      TITLE1 'JANE DOE';
   RUN;
SAS Output generated by PROC PRINT. Output 9.2 presents the results generated by the PRINT procedure in the preceding program:

                         JANE DOE

             Obs    NAME        FREN     FREN_Z

               1    Fred         50      1.78317
               2    Susan        46      1.21255
               3    Marsha       45      1.06990
               4    Charles      41      0.49929
               5    Paul         39      0.21398
               6    Cindy        38      0.07133
               7    Jack         37     -0.07133
               8    Cathy        35     -0.35663
               9    George       34     -0.49929
              10    John         31     -0.92725
              11    Marie        29     -1.21255
              12    Emmett       25     -1.78317

Output 9.2. Results of PROC PRINT performed on NAME, FREN, and FREN_Z.
Some notes concerning the preceding output:

The OBS variable is generated by SAS whenever it performs PROC PRINT. It merely assigns an observation number to each subject.

The NAME column provides each student's first name.

The FREN column provides each student's raw score for the French 101 mid-term test, as it appears in Table 9.1.

The FREN_Z column provides each student's score for the new z-score variable that was created. These z scores correspond to the raw scores for the French 101 mid-term test, as they appear in Table 9.1.

Were the z scores in the FREN_Z column created correctly? You can find out by computing z scores manually for a few subjects, and verifying that your results match the results generated by SAS. For example, the first subject (Fred) had a raw score on FREN of 50. You can compute his z score by inserting this raw score in the z-score formula:

   z = (X – X̄) / SX = (50 – 37.50) / 7.01 = 12.50 / 7.01 = 1.783

So Fred's z score was 1.783, which rounds to 1.78. Output 9.2 shows that this is also the value that SAS obtained. So far, these results are consistent with the conclusion that the z scores have been computed correctly.

Reviewing the mean and standard deviation for the new z-score variable. Another way to verify that the z scores were created correctly is to perform PROC MEANS on the z-score variable, then verify that the mean for the new variable is approximately zero, and that the standard deviation is approximately 1. The preceding program included PROC MEANS statements, and the results are presented in Output 9.3.

                               JANE DOE

                        The MEANS Procedure

                    Analysis Variable : FREN_Z

    N            Mean         Std Dev         Minimum         Maximum
   ------------------------------------------------------------------
   12    -5.55112E-17       0.9994222      -1.7831669       1.7831669
   ------------------------------------------------------------------

Output 9.3. Results of PROC MEANS performed on FREN_Z.
The variable name FREN_Z tells you that this analysis was performed on the new z-score variable.

The "Mean" column contains the sample mean for this z-score variable, –5.55112E–17. You might be concerned that something is wrong because this number does not appear to be approximately zero, as we had expected. But the number is, in fact, very close to zero. The number is presented in scientific notation. The actual value presented in Output 9.3 is "–5.55112," and E–17 tells you that the decimal place must be moved 17 spaces to the left. Thus, the actual mean is –0.0000000000000000555112. Obviously, this mean is very close to zero, and should reassure us that the z-score variable was probably created correctly.

Why was the mean for FREN_Z not exactly zero? The answer is that we did not use a great deal of precision in creating FREN_Z. Here again is the data manipulation statement that created it:

   FREN_Z = (FREN - 37.50) / 7.01;

Notice that we went to only two places beyond the decimal point when we typed the sample mean (37.50) and standard deviation (7.01). If we had carried these values out a greater number of decimal places, our z-score variable would have been created with greater precision, and the mean score on FREN_Z would have been even closer to zero.

The standard deviation for FREN_Z appears below the heading "Std Dev" in Output 9.3. You can see that the standard deviation for this variable is 0.9994222. Again, this is very close to the value of 1 that is expected with a sample of z scores (the fact that it is not exactly 1, again, is due to the somewhat weak precision used in our data manipulation statement). The results suggest that the z-score variable was probably created in the correct manner.

Examples of Questions That Can Be Answered with the New z-Score Variable

Reviewing the sign and absolute magnitude of a z score. The introduction section of this chapter discussed a number of advantages of working with z scores. One of these advantages involves the fact that, by simply reviewing a z score, you can immediately determine the relative position of that score within a sample. Specifically,

•  The sign of a z score tells you whether the raw score appears above or below the mean (a positive sign means above, a negative sign means below).

•  The absolute magnitude of the z score tells you how far away from the mean the corresponding raw score was located, in terms of standard deviations (e.g., a z score of 1.2 tells you that the raw score was 1.2 standard deviations from the mean).
Output 9.4 presents the results of the PROC PRINT that were previously presented as Output 9.2. This output provides each subject's z score for the French 101 test, as created by the preceding program. This output is reproduced again so that you can see how the results that it contains can be used to answer questions about the location of specific scores within the sample. This section provides the answers for each of the questions. Be sure that you understand the reasoning that led to these answers, as you might be asked to answer similar questions as part of an exercise when you complete this chapter.

                         JANE DOE

             Obs    NAME        FREN     FREN_Z

               1    Fred         50      1.78317
               2    Susan        46      1.21255
               3    Marsha       45      1.06990
               4    Charles      41      0.49929
               5    Paul         39      0.21398
               6    Cindy        38      0.07133
               7    Jack         37     -0.07133
               8    Cathy        35     -0.35663
               9    George       34     -0.49929
              10    John         31     -0.92725
              11    Marie        29     -1.21255
              12    Emmett       25     -1.78317

Output 9.4. Results of PROC PRINT performed on NAME, FREN, and FREN_Z (to illustrate the questions that can be answered with z scores).
Questions regarding the new z-score variable, FREN_Z, that appears in Output 9.4:

1. Question: Fred's raw score on the French 101 test was 50 (Fred was Observation #1). What was the relative position of this score within the sample? Explain your answer.

   Answer: Fred's score was 1.78 standard deviations above the mean. I know that his score was above the mean because his z score was a positive value. I know that his score was 1.78 standard deviations from the mean because the absolute value of the z score was 1.78.

2. Question: Cindy's raw score on the French 101 test was 38 (Cindy was Observation #6). What was the relative position of this score within the sample? Explain your answer.

   Answer: Cindy's score was 0.07 standard deviations above the mean. I know that her score was above the mean because her z score was a positive value. I know that her score was 0.07 standard deviations from the mean because the absolute value of the z score was 0.07.

3. Question: Marie's raw score on the French 101 test was 29 (Marie was Observation #11). What was the relative position of this score within the sample? Explain your answer.
Answer: Marie’s score was 1.21 standard deviations below the mean. I know that her score was below the mean because her z score was a negative value. I know that her score was 1.21 standard deviations from the mean because the absolute value of the z score was 1.21.
Converting Two Raw-Score Variables into z-Score Variables

Overview

In many situations, it is necessary to convert one or more raw-score variables into z-score variables. In these situations, you should follow the same 2-step sequence described above: (a) compute the mean and standard deviation for each raw-score variable, and then (b) write data manipulation statements that will create the new z-score variables. You will write a separate data manipulation statement for each new z-score variable to be created. Needless to say, when you write a data manipulation statement for a given variable, it is important to insert the correct mean and standard deviation into that statement (i.e., the mean and standard deviation for the corresponding raw-score variable).

This section shows you how to convert two raw-score variables into two new z-score variables. It builds on the preceding section by analyzing the same data set (the data set with test scores for French 101 and Geology 101). Because almost all of the concepts discussed here have already been discussed in earlier sections of the chapter, the present material will be covered in less detail.

Review: Data Set to Be Analyzed

An earlier section of this chapter, "Example 9.1: Comparing Mid-Term Test Scores for Two Courses," described the four variables included in your data set:

•  The first variable was given the SAS variable name "SUB_NUM." This was a numeric variable that included each student's subject number.

•  The second variable was given the SAS variable name "NAME." This was a character variable that included each subject's first name.

•  The third variable was given the SAS variable name "FREN." This was a numeric variable that included each subject's raw score on the mid-term test given in the French 101 course.

•  The fourth variable was given the SAS variable name "GEOL." This was a numeric variable that included each subject's raw score on the mid-term test given in the Geology 101 course.
The same earlier section also provided each subject's values on these variables, then presented the SAS DATA step to read the data into a SAS data set.

Step 1: Computing the Means and Sample Standard Deviations

Your first task is to compute the sample mean and sample standard deviation for the two test score variables, FREN and GEOL. This can be done by adding the following statements to a SAS program that already contains the SAS DATA step:

   PROC MEANS  DATA=D1  VARDEF=N  N  MEAN  STD  MIN  MAX;
      VAR FREN GEOL;
      TITLE1 'JANE DOE';
   RUN;
The preceding statements are identical to the PROC MEANS statements presented earlier, except that the VAR statement now lists both FREN and GEOL. This will cause SAS to compute the mean and sample standard deviation (along with some other descriptive statistics) for both of these variables. Output 9.5 presents the results that were generated by the preceding statements.

                               JANE DOE

                        The MEANS Procedure

 Variable    N            Mean         Std Dev         Minimum         Maximum
 --------------------------------------------------------------------
 FREN       12      37.5000000       7.0059499      25.0000000      50.0000000
 GEOL       12     145.4166667      29.7530344      90.0000000     200.0000000
 --------------------------------------------------------------------

Output 9.5. Results of PROC MEANS performed on the raw-score variables FREN and GEOL.
The Mean column presents the sample mean for the two test-score variables. The Std Dev column presents the sample standard deviations. You can see that, for FREN, the mean is 37.50 and the sample standard deviation is 7.01 (obviously, these figures had to be identical to the figures presented in Output 9.1 because the same variable was analyzed). For the second variable, GEOL, the mean is 145.42, and the sample standard deviation is 29.75.
280 Step-by-Step Basic Statistics Using SAS: Student Guide
With these means and standard deviations successfully computed, you can now move on to Step 2, where they will be inserted into data manipulation statements that will create the new z-score variables.

Step 2: Creating the z-Score Variables

The SAS data manipulation statements. Earlier in this chapter, the following syntax for the data manipulation statement created a z-score variable:

   z-variable = (raw-variable – mean) / standard-deviation;

In this step you will create a z-score variable for scores on the French 101 test, and give it the SAS variable name FREN_Z. In doing this, the mean and sample standard deviation for FREN from Output 9.5 will be inserted in the formula (because you are working with the same mean and standard deviation, this data manipulation statement will be identical to the data manipulation statement for FREN_Z that was presented earlier).

   FREN_Z = (FREN - 37.50) / 7.01;

Next, you will create a z-score variable for scores on the Geology 101 test, and give it the SAS variable name GEOL_Z. In doing this, the mean and sample standard deviation for GEOL from Output 9.5 will be inserted in the formula:

   GEOL_Z = (GEOL - 145.42) / 29.75;

Including the data manipulation statements as part of a new SAS DATA step. Remember that new SAS variables can be created only within a DATA step. Therefore, within your SAS program, you will begin a new DATA step prior to writing the two preceding statements that create FREN_Z and GEOL_Z. This is done in the following excerpt from the SAS program:

      [First part of the DATA step appears here]
1     10 John     31   180
2     11 Marie    29   135
3     12 Emmett   25   200
4     ;
5
6     PROC MEANS  DATA=D1  VARDEF=N  N  MEAN  STD  MIN  MAX;
7        VAR FREN GEOL;
8        TITLE1 'JANE DOE';
9     RUN;
10
11    DATA D2;
12       SET D1;
13       FREN_Z = (FREN - 37.50) / 7.01;
14       GEOL_Z = (GEOL - 145.42) / 29.75;
Some notes about the preceding excerpt:

•  Lines 1–3 present the last three data lines from the data set. To save space, only the last few lines from the data set are reproduced.

•  Lines 6–9 present the PROC MEANS statements that cause SAS to compute the mean and standard deviation for FREN and GEOL. These statements were discussed in Step 1.

•  Lines 11–12 begin a new DATA step by creating a new data set named D2. Initially, D2 is created as a duplicate of D1.

•  Lines 13–14 present the data manipulation statements that create the new z-score variables, FREN_Z and GEOL_Z.
Using PROC PRINT to print out the new z-score variables. After you create the new z-score variables, you can use PROC PRINT to print out each subject's value on these variables. The following statements accomplish this:

   PROC PRINT DATA=D2;
      VAR NAME FREN GEOL FREN_Z GEOL_Z;
      TITLE 'JANE DOE';
   RUN;

The preceding VAR statement requests that this printout include each subject's values for the variables NAME, FREN, GEOL, FREN_Z, and GEOL_Z.

Using PROC MEANS to request descriptive statistics for the new variables. Remember that it is generally a good idea to use PROC MEANS to compute simple descriptive statistics for the new z-score variables that you create. This will enable you to verify that the mean is approximately zero, and the standard deviation is approximately 1, for each new variable. This is accomplished by the following statements:

   PROC MEANS  DATA=D2  VARDEF=N  N  MEAN  STD  MIN  MAX;
      VAR FREN_Z GEOL_Z;
      TITLE1 'JANE DOE';
   RUN;
Putting it all together. The following shows the last part of the initial DATA step, along with the SAS statements needed to perform the various tasks of Step 2:

   (First part of the DATA step appears here)
   10 John     31   180
   11 Marie    29   135
   12 Emmett   25   200
   ;
   PROC MEANS  DATA=D1  VARDEF=N  N  MEAN  STD  MIN  MAX;
      VAR FREN GEOL;
      TITLE1 'JANE DOE';
   RUN;

   DATA D2;
      SET D1;
      FREN_Z = (FREN - 37.50) / 7.01;
      GEOL_Z = (GEOL - 145.42) / 29.75;

   PROC PRINT DATA=D2;
      VAR NAME FREN GEOL FREN_Z GEOL_Z;
      TITLE 'JANE DOE';
   RUN;

   PROC MEANS  DATA=D2  VARDEF=N  N  MEAN  STD  MIN  MAX;
      VAR FREN_Z GEOL_Z;
      TITLE1 'JANE DOE';
   RUN;
SAS output generated by PROC MEANS. In the output that is generated by the preceding program, the results of the MEANS procedure performed on FREN_Z and GEOL_Z will be presented first. This will enable you to verify that there were no obvious errors in creating the new z-score variables. After this is done, you can view the results from the PRINT procedure. Output 9.6 presents the results of PROC MEANS performed on FREN_Z and GEOL_Z.

                               JANE DOE

                        The MEANS Procedure

 Variable    N            Mean         Std Dev         Minimum         Maximum
 --------------------------------------------------------------------
 FREN_Z     12    -5.55112E-17       0.9994222      -1.7831669       1.7831669
 GEOL_Z     12    -0.000112045       1.0001020      -1.8628571       1.8346218
 --------------------------------------------------------------------

Output 9.6. Results of the PROC MEANS performed on FREN_Z and GEOL_Z.
Reviewing the means and standard deviations for the new z-score variables. Output 9.3 (from earlier in this chapter) has already provided the mean and standard deviation for FREN_Z. Those results are identical to the results for FREN_Z presented in Output 9.6, and so the descriptive statistics for FREN_Z will not be reviewed again at this point.

The second row of results in Output 9.6 presents descriptive statistics for GEOL_Z. You can see that the mean for this variable is –0.000112045, which is very close to the mean of zero that you would normally expect for a z-score variable. This provides some assurance that GEOL_Z was created correctly. You may have noticed that the mean for GEOL_Z was not presented in scientific notation, as was the case for FREN_Z. This is because the mean for GEOL_Z was not quite as close to zero, and it was therefore not necessary to use scientific notation.

Where the row for GEOL_Z intersects with the column headed "Std Dev," you can see that the standard deviation for this variable is 1.0001020. This is very close to the standard deviation of 1 that you would normally expect for a z-score variable, and again provides some evidence that GEOL_Z was probably created correctly.

Because the means and standard deviations for the new z-score variables seem to be appropriate, you can now review the individual z scores in the results that were generated by PROC PRINT.

SAS output generated by PROC PRINT. Output 9.7 presents results generated by the PROC PRINT statements included in the preceding program.

                            JANE DOE

   Obs    NAME        FREN    GEOL     FREN_Z     GEOL_Z

     1    Fred         50       90     1.78317   -1.86286
     2    Susan        46      165     1.21255    0.65815
     3    Marsha       45      170     1.06990    0.82622
     4    Charles      41      110     0.49929   -1.19059
     5    Paul         39      130     0.21398   -0.51832
     6    Cindy        38      150     0.07133    0.15395
     7    Jack         37      140    -0.07133   -0.18218
     8    Cathy        35      120    -0.35663   -0.85445
     9    George       34      155    -0.49929    0.32202
    10    John         31      180    -0.92725    1.16235
    11    Marie        29      135    -1.21255   -0.35025
    12    Emmett       25      200    -1.78317    1.83462

Output 9.7. Results of PROC PRINT performed on NAME, FREN, GEOL, FREN_Z, and GEOL_Z.
In Output 9.7, the column headed “FREN” presents subjects’ raw scores for the French 101 test. Similarly, the column headed “GEOL” presents raw scores for the Geology 101 test.
At this point, however, you are more interested in the new z-score variables that appear in the columns headed "FREN_Z" and "GEOL_Z". These columns provide standardized versions of scores on the French 101 and Geology 101 tests, respectively. Because these test scores have been standardized, you can use them to answer a different set of questions. These questions and their answers are discussed in the following section.

Examples of Questions That Can Be Answered with the New z-Score Variables

The introduction section of this chapter indicated that one of the advantages of working with z scores is the fact that they enable you to compare scores on variables that otherwise would have different means and standard deviations. For example, assume that you are working with the raw-score versions of scores on the French 101 and Geology 101 tests. Scores on the French 101 test could possibly range from 0 to 50, and scores on the Geology 101 test could possibly range from 0 to 200. This resulted in the two tests having very different means and standard deviations.

Comparing scores on variables with different means and standard deviations. Suppose that you wanted to know: "Compared to the other students, did the student named Susan (Observation #2 in Output 9.7) score higher on the French 101 test or on the Geology 101 test?" This question is difficult to answer if you focus on the raw scores (the columns headed FREN and GEOL in Output 9.7). Although her score on the French 101 test was 46, and her score on the Geology 101 test was higher at 165, this certainly does not mean that she did better on the Geology 101 test; her score may have been higher there simply because the test was on a 200-point scale (rather than the 50-point scale used with the French 101 test).

Comparing scores on z-score variables. The data becomes much more meaningful when you review the z-score versions of the variables named FREN_Z and GEOL_Z. These columns show that Susan's z score on the French 101 test was 1.21, while her z score on the Geology 101 test was lower at 0.66. Because you are working with z scores, you know that both of these variables have the same mean (zero) and the same standard deviation (1). This means that you can directly compare the two scores. Clearly, Susan did better on the French 101 test than on the Geology 101 test (compared to other students).

The following section provides some questions that could be asked about the performance of students on the French 101 and Geology 101 tests. Following each question is the correct answer based on the z scores presented in Output 9.7. Be sure that you understand the reasoning that led to these answers, as you might be asked to answer similar questions on your own as part of an exercise when you complete this chapter.
Questions regarding the new z-score variables, FREN_Z and GEOL_Z, that appear in Output 9.7:

1. Question: Compared to the other students, did the student named Fred (Observation #1 in Output 9.7) score higher on the French 101 test or on the Geology 101 test? Explain your answer.

   Answer: Compared to the other students, Fred scored higher on the French 101 test than on the Geology 101 test. I know this because his z score on the French 101 test was a positive value (1.78), while his z score on the Geology 101 test was a negative value (–1.86).

2. Question: Compared to the other students, did the student named Cindy (Observation #6 in Output 9.7) score higher on the French 101 test or on the Geology 101 test? Explain your answer.

   Answer: Compared to the other students, Cindy scored higher on the Geology 101 test than on the French 101 test. I know this because both z scores were positive, and her z score on the Geology 101 test (0.15) was higher than her score on the French 101 test (0.07).

3. Question: Compared to the other students, did the student named Cathy (Observation #8 in Output 9.7) score higher on the French 101 test or on the Geology 101 test? Explain your answer.

   Answer: Compared to the other students, Cathy scored higher on the French 101 test than on the Geology 101 test. I know this because both z scores were negative, and her z score on the French 101 test (–0.36) was closer to zero than her score on the Geology 101 test (–0.85).
Standardizing Variables with PROC STANDARD

Isn't There an Easier Way to Do This?

This chapter has presented a two-step approach that can be used to standardize variables, converting raw-score variables into z-score variables. It is worth mentioning, however, that when you work with SAS, there is often more than one way to accomplish a data management or statistical analysis task. This applies to the creation of z scores as well.

The SAS System includes a procedure named STANDARD that can be used to standardize variables in a SAS data set. This procedure enables you to begin with a raw-score variable and standardize it so that it has a specified mean and standard deviation. If you specify that the new variable should have a mean of zero and a standard deviation of 1, then you have created a z-score variable. You can use the new standardized variable in subsequent analyses.

Using PROC STANDARD has a number of advantages over the approach to standardization taught in this chapter. One of these advantages is that it enables you to complete the standardization process in one step, rather than two.

Why This Guide Used a Two-Step Approach

If PROC STANDARD has this advantage, then why did the current chapter teach the somewhat more laborious, two-step procedure? This was done because it was expected that this guide will normally be used in a basic statistics course, and the present approach is somewhat more educational: you begin with the generic formula for creating z scores, and translate that generic formula into a SAS data manipulation statement that actually creates z scores. This approach should reinforce your understanding of what a z score is, and exactly how a z score is obtained.

For more detailed information on the use of the STANDARD procedure for computing z scores and other types of standardized variables, see the SAS Procedures Guide (1999c).
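To give you a sense of what this one-step approach looks like, here is a minimal sketch of how PROC STANDARD might be used to standardize the two test-score variables from this chapter. The data set names D1 and D2 are placeholders used only for illustration; they are not taken from the program shown earlier in this chapter.

   PROC STANDARD  DATA=D1  MEAN=0  STD=1  OUT=D2;
      VAR FREN GEOL;     /* raw-score variables to be standardized          */
   RUN;

   PROC MEANS  DATA=D2;   /* verify that the new means are 0 and SDs are 1   */
      VAR FREN GEOL;
   RUN;

In the output data set D2, the variables FREN and GEOL keep their original names but now contain standardized (z-score) values. That is one reason the two-step approach used in this chapter, which creates separately named variables such as FREN_Z and GEOL_Z, can be easier to follow when you are first learning about z scores.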
Conclusion

Up until this point, this guide has largely focused on basic concepts in statistics and the SAS System. You have learned the basics of how to use SAS, and have learned about SAS procedures that perform elementary types of data analysis. The next chapter will take you to a new level, however, as it presents the first inferential statistic to be covered in this text.

In Chapter 10, you will learn how to use the SAS System to compute Pearson correlation coefficients. The Pearson correlation coefficient is a measure of association that is used to investigate the relationship between two numeric variables. In Chapter 10, "Bivariate Correlation," you will learn the assumptions that underlie this statistic, will see examples in which PROC CORR is used to compute Pearson correlations, and will learn how to prepare analysis reports that summarize the results obtained from correlational research.
Chapter 10: Bivariate Correlation

Introduction ............................................................... 290
   Overview ................................................................ 290
Situations Appropriate for the Pearson Correlation Coefficient ............. 290
   Overview ................................................................ 290
   Nature of the Predictor and Criterion Variables ......................... 291
   The Type-of-Variable Figure ............................................. 291
   Example of a Study Providing Data Appropriate for This Procedure ........ 291
   Summary of Assumptions for the Pearson Correlation Coefficient .......... 293
Interpreting the Sign and Size of a Correlation Coefficient ................ 293
   Overview ................................................................ 293
   Interpreting the Sign of a Correlation Coefficient ...................... 293
   Interpreting the Size of a Correlation Coefficient ...................... 295
   The Coefficient of Determination ........................................ 296
Interpreting the Statistical Significance of a Correlation Coefficient ..... 297
   Overview ................................................................ 297
   The Null Hypothesis for the Test of Significance ........................ 297
   The Alternative Hypothesis for the Test of Significance ................. 297
   The p Value ............................................................. 298
Problems with Using Correlations to Investigate Causal Relationships ....... 299
   Overview ................................................................ 299
   Correlations and Cause-and-Effect Relationships ......................... 300
   An Initial Explanation .................................................. 300
   Alternative Explanations ................................................ 300
   Obtaining Stronger Evidence of Cause and Effect ......................... 302
   Is Correlational Research Ever Appropriate? ............................. 302
Example 10.1: Correlating Weight Loss with a Variety of Predictor
   Variables ............................................................... 303
   Overview ................................................................ 303
   The Study ............................................................... 303
   The Criterion Variable and Predictor Variables in the Analysis .......... 304
   Data Set to Be Analyzed ................................................. 305
   The DATA Step for the SAS Program ....................................... 306
Using PROC PLOT to Create a Scattergram .................................... 307
   Overview ................................................................ 307
   Why You Should Create a Scattergram Prior to Computing a Correlation
      Coefficient .......................................................... 307
   Syntax for the SAS Program .............................................. 308
   Results from the SAS Output ............................................. 310
Using PROC CORR to Compute the Pearson Correlation between Two Variables ... 313
   Overview ................................................................ 313
   Syntax for the SAS Program .............................................. 313
   Results from the SAS Output ............................................. 315
   Steps in Interpreting the Output ........................................ 315
   Summarizing the Results of the Analysis ................................. 318
Using PROC CORR to Compute All Possible Correlations for a Group of
   Variables ............................................................... 320
   Overview ................................................................ 320
   Writing the SAS Program ................................................. 321
   Results from the SAS Output ............................................. 322
Summarizing Results Involving a Nonsignificant Correlation ................. 324
   Overview ................................................................ 324
   The Results from PROC CORR .............................................. 324
   The Results from PROC PLOT .............................................. 325
   Summarizing the Results of the Analysis ................................. 328
Using the VAR and WITH Statements to Suppress the Printing of Some
   Correlations ............................................................ 329
   Overview ................................................................ 329
   Writing the SAS Program ................................................. 329
   Results from the SAS Output ............................................. 331
Computing the Spearman Rank-Order Correlation Coefficient for
   Ordinal-Level Variables ................................................. 332
   Overview ................................................................ 332
   Situations Appropriate for This Statistic ............................... 332
   Example of When to Compute the Spearman Rank-Order Correlation
      Coefficient .......................................................... 332
   Writing the SAS Program ................................................. 333
   Understanding the SAS Output ............................................ 333
Some Options Available with PROC CORR ...................................... 333
   Overview ................................................................ 333
   Where in the Program to Request Options ................................. 334
   Description of Some Options ............................................. 334
   Where to Find More Options for PROC CORR ................................ 335
Problems with Seeking Significant Results .................................. 335
   Overview ................................................................ 335
   Reprise: Null Hypothesis Testing with Just Two Variables ................ 335
   Null Hypothesis Testing with a Larger Number of Variables ............... 336
   How to Avoid This Problem ............................................... 337
Conclusion ................................................................. 338
Introduction

Overview

This chapter shows you how to use SAS to compute correlation coefficients. Most of the chapter focuses on the Pearson product-moment correlation coefficient. You use this procedure when you want to determine whether there is a significant relationship between two numeric variables that are each assessed on an interval scale or ratio scale (there are a number of additional assumptions that must also be met; these will be discussed below). The chapter also illustrates the use of the Spearman rank-order correlation coefficient, which is appropriate for variables assessed on an ordinal scale of measurement.

This chapter discusses a number of issues related to the conduct of correlational research. It shows how to interpret the sign and size of correlation coefficients, and how to determine whether they are statistically significant. It cautions against "fishing" for significant findings by computing large numbers of correlation coefficients in a single study. It also cautions against using correlational findings to draw conclusions about cause-and-effect relationships.

This chapter shows you how to use PROC PLOT to create a scattergram so that you can verify that the relationship between two variables is linear. It then illustrates the use of PROC CORR to compute Pearson correlation coefficients. It shows (a) how to compute the correlation between just two variables, (b) how to compute all possible correlations between a number of variables, and (c) how to use the VAR and WITH statements to selectively suppress the printing of some correlations. It shows how to prepare analysis reports for correlations that are statistically significant, as well as for correlations that are nonsignificant.
Situations Appropriate for the Pearson Correlation Coefficient

Overview

A correlation coefficient is a number that summarizes the nature of the relationship between two variables. Most of this chapter focuses on the Pearson product-moment correlation coefficient. The Pearson correlation coefficient is appropriate when both variables being analyzed are assessed on an interval or ratio level, and the relationship between the two variables is linear. The symbol for the Pearson product-moment correlation is r.

The first part of this section describes the types of situations in which this statistic is typically computed, and discusses a few of the assumptions underlying the procedure. A more complete summary of assumptions is presented at the end of this section.
Nature of the Predictor and Criterion Variables

Predictor variable. In computing a Pearson correlation coefficient, the predictor variable should be a numeric variable that is assessed on an interval or ratio scale of measurement.

Criterion variable. The criterion variable should also be a numeric variable that is assessed on an interval or ratio scale of measurement.

The Type-of-Variable Figure

When researchers compute Pearson correlation coefficients, they are typically studying the relationship between (a) a criterion variable that is a multi-value numeric variable and (b) a predictor variable that is also a multi-value numeric variable.

Chapter 2, "Terms and Concepts Used in This Guide," introduced you to the concept of the "type-of-variable" figure. A type-of-variable figure indicates the types of variables that are included in an analysis when the variables are classified according to the number of values that they assume. Using this scheme, all variables can be classified as being either dichotomous variables, limited-value variables, or multi-value variables. The following figure illustrates the types of variables that are typically being analyzed when computing a Pearson correlation coefficient.

        Criterion            Predictor

          Multi       =        Multi
Chapter 2 indicated that the symbol that appears to the left of the equal sign in this type of figure represents the criterion variable in the analysis. The "Multi" symbol that appears to the left of the equal sign in the above figure shows that the criterion variable in the computation of a Pearson correlation is typically a multi-value variable (a variable that assumes more than six values in your sample).

Chapter 2 also indicated that the symbol that appears to the right of the equal sign in this type of figure represents the predictor variable in the analysis. The "Multi" symbol that appears to the right of the equal sign in the above figure shows that the predictor variable in the computation of a Pearson correlation is also typically a multi-value variable.

Example of a Study Providing Data Appropriate for This Procedure

Predictor and criterion variables. Suppose that you are an industrial psychologist who is studying prosocial organizational behavior. Employees score high on prosocial organizational behavior when they do helpful things for the organization or for other employees––helpful things that are beyond their normal job responsibilities. This might include volunteering for some new assignment, helping a new employee on the job, or helping to clean up the shop.
Suppose that you want to identify variables that may be correlated with prosocial organizational behavior. Based on a review of the literature, you hypothesize that perceived organizational fairness may be related to this variable. Employees score high on perceived organizational fairness when they believe that the organization's management has treated them equitably.

Research method. You conduct a study to determine whether there is a significant correlation between prosocial organizational behavior and perceived organizational fairness in a sample of 300 employees. To assess prosocial organizational behavior, you develop a checklist of prosocial behaviors, and ask supervisors to evaluate each of their subordinates with this checklist by checking off behaviors each time they are displayed by the subordinate. To assess perceived organizational fairness, you use a questionnaire scale developed by other researchers. The questionnaire contains items such as "This organization treats me fairly." Employees circle a number from 1 to 7 to indicate the extent to which they agree or disagree with each item. You sum responses to the individual items to create a single summed score for each employee. With this variable, higher scores indicate greater agreement that the organization treats them fairly.

To analyze the data, you compute the correlation between the measure of prosocial behavior and the measure of perceived fairness. You hypothesize that there will be a positive correlation between the two variables.

Why this questionnaire data would be appropriate for this procedure. Earlier sections have indicated that, to compute a Pearson product-moment correlation coefficient, the predictor variable should be a numeric variable that is assessed on an interval or ratio scale of measurement. The predictor variable in this study consisted of scores on a questionnaire scale that is designed to assess perceived organizational fairness. Most researchers would agree that scores from this type of summated rating scale can be viewed as constituting an interval scale of measurement (assuming that the scale was developed properly).

To compute a Pearson correlation, the criterion variable in the analysis should also be a numeric variable that is assessed on an interval or ratio scale of measurement. The criterion variable in the current study was prosocial organizational behavior. A particular employee's score on this variable is the number of prosocial behaviors (as assessed by the employee's supervisor) that they have displayed in a specified period of time. This "number of prosocial behaviors" variable has equal intervals and a true zero point. Therefore, this variable appears to be assessed on the ratio level.

To review, when you compute a Pearson correlation, the predictor and the criterion variable are usually multi-value variables. To determine whether this is the case for the current study, you would use PROC FREQ to create simple frequency tables for the predictor and criterion variables (similar to those shown in Chapter 5, "Creating Frequency Tables"). You would know that both variables were multi-value variables if you observed more than six values for each of them in their frequency tables.
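As a reminder of what that check might look like, here is a minimal sketch. The data set name EMPLOYEE and the variable names FAIRNESS and PROSOC are hypothetical, since the fictitious study does not specify SAS variable names.

   PROC FREQ  DATA=EMPLOYEE;
      TABLES FAIRNESS PROSOC;   /* one simple frequency table per variable */
   RUN;

If each frequency table lists more than six distinct values, you would treat both variables as multi-value variables.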
Summary of Assumptions for the Pearson Correlation Coefficient

•  Interval-level measurement. Both the predictor and criterion variables should be assessed on an interval or ratio level of measurement.

•  Random sampling. Each subject in the sample should contribute one score on the predictor variable, and one score on the criterion variable. These pairs of scores should represent a random sample drawn from the population of interest.

•  Linearity. The relationship between the criterion variable and the predictor variable should be linear. This means that, in the population, the mean criterion scores at each value of the predictor variable should fall on a straight line. The Pearson correlation coefficient is not appropriate for assessing the strength of the relationship between two variables involved in a curvilinear relationship.

•  Bivariate normal distribution. The pairs of scores should follow a bivariate normal distribution. This means that (a) scores on the criterion variable should form a normal distribution at each value of the predictor variable and (b) scores on the predictor variable should form a normal distribution at each value of the criterion variable. Scores that represent a bivariate normal distribution form an elliptical scattergram when plotted (i.e., their scattergram is shaped like a football: relatively fat in the middle and tapered on the ends).
Interpreting the Sign and Size of a Correlation Coefficient

Overview

As was stated earlier, a correlation coefficient is a number that represents the nature of the relationship between two variables. To understand the nature of this relationship, you will review the sign of the coefficient (whether it is positive or negative), as well as the size of the coefficient (whether it is relatively close to zero or close to ±1.00). This section shows you how.

Interpreting the Sign of a Correlation Coefficient

Overview. A correlation coefficient may be either positive (+) or negative (–). The sign of the correlation tells you about the direction of the relationship between the two variables.

Positive correlation. A positive correlation means that high values on one variable tend to be associated with high values on the other variable, and low values on one variable tend to be associated with low values on the other variable.
For example, consider the fictitious industrial psychology study described above. In that study, you assessed two variables:

•  Prosocial organizational behavior. This refers to positive, helpful things that an employee might do to help his or her organization. Assume that, according to the way that you measured this variable, higher scores represent higher levels of prosocial organizational behavior.

•  Perceived organizational fairness. This refers to the extent to which the employee believes that the organization has treated him or her in a fair and equitable way. Again, assume that, according to the way that you measured this variable, higher scores represent higher levels of perceived organizational fairness.

Suppose that you have reviewed the research literature, and it suggests that employees who score high on perceived organizational fairness are likely to feel grateful to their employing organizations, and are likely to repay them by engaging in prosocial organizational behavior. Based on this idea, you conduct a correlational study. If you measured these two variables in a sample of 300 employees, you would probably find that there is a positive correlation between perceived organizational fairness and prosocial organizational behavior. Consistent with the definition provided above, you would probably find that both of the following are true:

•  employees with high scores on perceived organizational fairness would also tend to have high scores on prosocial organizational behavior

•  employees with low scores on perceived organizational fairness would also tend to have low scores on prosocial organizational behavior.

In social science research, there are countless additional examples of pairs of variables that would demonstrate positive correlations. Here are just a few examples:

•  In a sample of college students, there would probably be a positive correlation between scores on the Scholastic Aptitude Test (SAT) and subsequent grade point average in college.

•  In a sample of contestants in a body-building contest, there would probably be a positive correlation between the number of hours that they spend training and their overall scores as body builders.
Negative correlation. In contrast to a positive correlation, a negative correlation means that high values on one variable tend to be associated with low values on the second variable. To illustrate this concept, consider what kind of variables would probably show a negative correlation with prosocial organizational behavior. For example, imagine that you develop a multi-item scale designed to measure burnout among employees. For our purposes, burnout refers to the extent to which employees feel exhausted, stressed, and unable to cope on the job. Assume that, according to the way that you measured this variable, higher scores represent higher levels of burnout.
Suppose that you have reviewed the research literature, and it suggests that employees who score high on burnout are probably too exhausted to engage in any prosocial organizational behavior. Based on this idea, you conduct a correlational study. If you measured burnout and prosocial organizational behavior in a sample of employees, you would probably find that there is a negative correlation between the two variables. Consistent with the definition provided above, you would probably find that both of the following are true:

•  employees with high scores on burnout would tend to have low scores on prosocial organizational behavior

•  employees with low scores on burnout would tend to have high scores on prosocial organizational behavior.

In social science research, there are also countless examples of pairs of variables that would demonstrate negative correlations. Here are just a few examples:

•  In a sample of college students, there would probably be a negative correlation between the number of hours they spent at parties each week, and their subsequent grade point average in college.

•  In a sample of contestants in a body-building contest, there would probably be a negative correlation between the amount of junk food that they eat and their subsequent overall scores as body builders.
Interpreting the Size of a Correlation Coefficient

Overview. You interpret the size of a correlation coefficient to determine the strength of the relationship between the two variables. Generally speaking, the larger the size of the coefficient (in absolute value), the stronger the relationship. Absolute value refers to how large the correlation coefficient is, regardless of its sign. When there is a strong relationship between two variables, you are able to predict values on one variable from values on the second variable with a relatively high degree of accuracy. When there is a weak relationship between two variables, you are able to predict values on one variable from values on the second variable with a relatively low degree of accuracy.

A guide. Below is an informal guide for interpreting the approximate strength of the relationship between two variables, based on the absolute value of the coefficient:

   ±1.00  =  Perfect correlation
    ±.80  =  Strong correlation
    ±.50  =  Moderate correlation
    ±.20  =  Weak correlation
     .00  =  No correlation
For example, the above guide suggests that you should view a correlation as being relatively strong if the correlation coefficient were +.80 (or –.80). Similarly, it suggests that you should view a correlation as being relatively weak if the correlation coefficient were +.20 (or –.20).

Again, remember to consider the absolute value of the coefficient when you interpret the size of the correlation. This means that a correlation of –.50 is just as strong as a correlation of +.50; a correlation of –.75 is just as strong as a correlation of +.75, and so forth.

The above guide shows that the possible values of correlation coefficients range from –1.00 through zero through +1.00. This means that you will never obtain a Pearson product-moment correlation below –1.00, or above +1.00.

Perfect correlation. A correlation of ±1.00 is a perfect correlation. When the correlation between two variables is ±1.00, it means that you can predict values on one variable from values on the second variable with no errors. For all practical purposes, the only time you will obtain a perfect correlation is when you correlate a variable with itself.

Zero correlation. A correlation of .00 means that there is no relationship between the two variables being studied. This means that, if you know how a subject is rated on one variable, it does not allow you to predict how that subject is rated on the second variable with any accuracy.

The Coefficient of Determination

The coefficient of determination refers to the proportion of variance in one variable that is accounted for by variability in the second variable. This issue of "proportion of variance accounted for" is an important one in statistics; in the chapters that follow, you will learn some techniques for calculating the percentage of variance in a criterion variable that is accounted for by a predictor variable.

The coefficient of determination is relatively simple to compute if you have calculated a Pearson correlation coefficient. The formula is as follows:

   Coefficient of determination = r²

In other words, to compute the coefficient of determination, you simply square the correlation coefficient. For example, suppose that you find that the correlation between two variables is equal to .50. In this case:

   Coefficient of determination = r²
   Coefficient of determination = (.50)²
   Coefficient of determination = .25

So, when the Pearson correlation is equal to .50, the coefficient of determination is equal to .25. This means that 25% of the variability in the criterion variable is associated with variability in the predictor variable.
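If you wanted SAS to do this squaring for you, a minimal sketch such as the following would work. The data set name COD and the variable names R and R_SQUARED are made up here purely for illustration.

   DATA COD;
      R = .50;             /* Pearson correlation obtained from PROC CORR */
      R_SQUARED = R**2;    /* coefficient of determination                */
   RUN;

   PROC PRINT  DATA=COD;
   RUN;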
Interpreting the Statistical Significance of a Correlation Coefficient

Overview

When researchers report the results of correlational research, they typically indicate whether the correlation coefficients that they have computed are statistically significant. When researchers report that a correlation coefficient is "statistically significant," they typically mean that the coefficient is significantly different from zero. To understand the concept of statistical significance, it is necessary to first understand the concepts of the null hypothesis, the alternative hypothesis, and the p value, as they apply to correlational research. Each of these concepts is discussed in the following sections.

The Null Hypothesis for the Test of Significance

A null hypothesis is a statistical hypothesis about a population or about the relationship between two or more different populations. A null hypothesis typically states either that (a) there is no relationship between the variables being studied, or that (b) there is no difference between the populations being studied. In other words, the null hypothesis is typically a hypothesis of no relationship or no difference.

When you are conducting correlational research and are investigating the correlation between two variables, your null hypothesis will typically state that, in the population, there is no correlation between these two variables. For example, again suppose that you are studying the relationship between perceived organizational fairness and prosocial organizational behavior in a sample of employees. Your null hypothesis for this analysis might be stated as follows:

   Statistical null hypothesis (H0): ρ = 0; In the population, the correlation between perceived organizational fairness and prosocial organizational behavior is equal to zero.

In the preceding statement, the symbol "H0" is the symbol for "null hypothesis." The symbol "ρ" is the Greek letter that represents the correlation between two variables in the population. When the above null hypothesis states "ρ = 0", it is essentially stating that the correlation between these two variables is equal to zero in the population.
For example, again consider the fictitious study investigating the relationship between perceived organizational fairness and prosocial organizational behavior. The alternative hypothesis for this study would state that there is a relationship between these two variables in the population. In formal terms, it could be stated this way:

   Statistical alternative hypothesis (H1): ρ ≠ 0; In the population, the correlation between perceived organizational fairness and prosocial organizational behavior is not equal to zero.

Notice that the above alternative hypothesis was stated as a nondirectional hypothesis. It does not predict whether the actual correlation in the population is positive or negative; it simply predicts that it will not be equal to zero.

The p Value

Overview. When you use SAS to compute a Pearson correlation coefficient, it automatically provides a p value for that coefficient. This p value may range in size from .00 through 1.00. You will review this p value to determine whether the coefficient is statistically significant. This section shows how to interpret these p values.

What a p value represents. In general terms, a probability value (or p value) is the probability that you would obtain the present results if the null hypothesis were true. The exact meaning of a p value depends upon the type of analysis that you are performing. When you compute a correlation coefficient, the p value represents the probability that you would obtain a correlation coefficient this large or larger in absolute magnitude if the null hypothesis were true. Remember that the null hypothesis that you are testing states that, in the population, the correlation between the two variables is equal to zero.

Suppose that you perform your analysis, and you find that, for your sample, the obtained correlation coefficient is .15 (symbolically, r = .15). This is a relatively weak correlation––it is fairly close to zero. Assume that the p value associated with this correlation coefficient is .89 (symbolically, p = .89). Essentially, this p value is making the following statement:

•  If the null hypothesis were true, the probability that you would obtain a correlation coefficient of .15 is fairly high at 89%.
Clearly, 89% is a high probability. Under these circumstances, it seems reasonable to retain your null hypothesis. You will conclude that, in the population, the correlation between these two variables probably is equal to zero. You will conclude that your obtained correlation coefficient of r = .15 is not statistically significant. In other words, you will conclude that it is not significantly different from zero.

Now consider another fictitious outcome. Suppose that you perform your analysis, and you find that, for your sample, the obtained correlation coefficient is .70 (symbolically, r = .70). This is a relatively strong correlation. Suppose that the p value associated with this correlation coefficient is .01 (symbolically, p = .01). This p value is making the following statement:

•  If the null hypothesis were true, the probability that you would obtain a correlation coefficient of .70 is fairly low at only 1%.

Most researchers would agree that 1% is a fairly low probability. Given this low probability, it now seems reasonable to reject your null hypothesis. You will conclude that, in the population, the correlation between these two variables is probably not equal to zero. You will conclude that your obtained correlation coefficient of r = .70 is statistically significant. In other words, you will conclude that it is significantly different from zero.

Deciding whether to reject the null hypothesis. With the above examples, you saw that when the p value is a relatively large value (such as p = .89) you should not reject the null hypothesis, and that when the p value is a relatively small value (such as p = .01) you should reject the null hypothesis. This naturally leads to the question, "Just how small must the p value be to reject the null hypothesis?" The answer to this question will depend on a number of factors, such as the nature of your research and the importance of not erroneously rejecting a true null hypothesis. To keep things simple, however, this book will adopt the following guidelines:

•  If the p value is less than .05, you should reject the null hypothesis.

•  If the p value is .05 or larger, you should not reject the null hypothesis.
This means that if your p value is less than .05 (such as p = .0400, p = .0121, or p < .0001), you should reject the null hypothesis, and should conclude that your obtained correlation coefficient is statistically significant. Conversely, if your p value is .05 or larger (such as p = .0510, p = .5456, or p = .9674), you should not reject the null hypothesis, and should conclude that your obtained correlation is statistically nonsignificant.

This book will use this guideline throughout all of the remaining chapters. Most of the chapters following this one will report some type of significance test. In each of those chapters, you will continue to use the same rule: reject the null hypothesis if your p value is less than .05.
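Later sections of this chapter show PROC CORR in detail, but as a preview, a minimal sketch for the fairness example might look like the following. The data set name EMPLOYEE and the variable names FAIRNESS and PROSOC are hypothetical, as before.

   PROC CORR  DATA=EMPLOYEE;
      VAR FAIRNESS PROSOC;   /* compute the Pearson correlation for this pair */
   RUN;

The output reports the correlation coefficient for the pair along with its p value, and it is that p value that you would compare against the .05 criterion described above.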
Problems with Using Correlations to Investigate Causal Relationships

Overview

When you compute a Pearson correlation, you can review the correlation coefficient to learn about the nature of the relationship between the two variables (e.g., whether the relationship is positive or negative; whether the relationship is relatively strong or weak). However, a single Pearson correlation coefficient by itself will not tell you anything about whether there is a causal relationship between the two variables. This section discusses the concept of cause-and-effect relationships, and cautions against using simple correlational analyses to provide evidence for such relationships.

Correlations and Cause-and-Effect Relationships

Chapter 2, "Terms and Concepts Used in This Guide," discussed some of the differences between experimental research and nonexperimental (correlational) research. It indicated that correlational research generally provides relatively weak evidence concerning cause-and-effect relationships. This is especially the case when investigating the correlation between just two variables (as is the case in this chapter). This means that, if you observe a strong, significant relationship between two variables, it should not be taken as evidence that one of the variables is exerting a causal effect on the other.

An Initial Explanation

For example, suppose that you began your investigation with a hypothesis of cause and effect. Suppose that you hypothesized that perceived organizational fairness has a causal effect on prosocial organizational behavior: If you can increase employee perceptions of fairness, this will cause the employees to display an increase in prosocial organizational behavior. This cause-and-effect relationship is illustrated in Figure 10.1.
Figure 10.1. Hypothesized relationship between prosocial organizational behavior and perceived organizational fairness.
Suppose that you conduct your study, and find that there is a relatively large, positive, and significant correlation between the fairness variable and the prosocial behavior variable. Your first impulse might be to rejoice that you have obtained "proof" that fairness has a causal effect on prosocial behavior. Unfortunately, few researchers will find your evidence convincing.

Alternative Explanations

The reason has to do with the concept of "alternative explanations." You have found a significant correlation, and have offered your own explanation for what it means: It means that fairness has a causal effect on prosocial behavior. However, it will be easy for other researchers to offer very different explanations that may be equally plausible in explaining why you obtained a significant correlation. And if others can generate alternative explanations, few researchers are going to be convinced that your explanation must be correct.

For example, suppose that you have obtained a strong, positive correlation, and have presented your results at a professional conference. You tell your audience that these results prove that perceived fairness has an effect on prosocial behavior, as illustrated in Figure 10.1. At the end of your presentation, a member of the audience may rise and ask if it is not possible that there is a different explanation for your correlation. She may suggest that the two variables are correlated because of the influence of an underlying third variable. One such possible underlying variable is illustrated in Figure 10.2.
Figure 10.2. An alternative explanation for the observed correlation between prosocial organizational behavior and perceived organizational fairness.
Figure 10.2 suggests that fairness and prosocial behavior may be correlated because they are both influenced by the same underlying third variable: the personality trait of "optimism." The researcher in your audience may argue that it is reasonable to assume that optimism has a causal effect on prosocial behavior. She could argue that, if a person is generally optimistic, they will be more likely to believe that showing prosocial behaviors will result in rewards from the organization (such as promotions). This argument supports the causal arrow that goes from optimism to prosocial behavior in Figure 10.2.

The researcher in your audience could go on to argue that it is reasonable to assume that optimism also has a causal effect on perceived fairness. She could argue that optimistic people tend to look for the good in all situations, including their work situations. This could cause optimistic people to describe their organizations as treating them more fairly (compared to pessimistic people). This argument supports the causal arrow that goes from optimism to perceived fairness in Figure 10.2.

In short, the researcher in your audience could argue that there is no causal relationship between prosocial behavior and fairness at all––the only reason they are correlated is because they are both influenced by the same underlying third variable: optimism.
This is why correlational research provides relatively weak evidence of cause-and-effect relationships: It is often possible to generate more than one explanation for an observed correlation between two variables.

Obtaining Stronger Evidence of Cause and Effect

Researchers who wish to obtain stronger evidence of cause-and-effect relationships typically rely on one (or both) of two approaches. The first approach is to conduct experimental research. Chapter 2 of this guide argued that, when you conduct a true experiment, it is sometimes possible to control all important extraneous variables. This means that, when you conduct a true experiment and obtain a significant effect for your independent variable, it provides more convincing evidence that it was truly your independent variable that had an effect on the dependent variable, and not an "underlying third variable." In other words, with a well-designed experiment, it is more difficult for other researchers to generate plausible alternative explanations for your results.

Another alternative for researchers who wish to obtain stronger evidence of cause-and-effect relationships is to use correlational data, but analyze them with statistical procedures that are much more sophisticated than the procedures discussed in this text. These sophisticated correlational procedures go by names such as "path analysis," "causal modeling," and "structural equation modeling." Hatcher (1994) provides an introduction to some of these procedures.

Is Correlational Research Ever Appropriate?

None of this is meant to discourage you from conducting correlational research. There are many situations in which correlational research is perfectly appropriate. These include:

•  Situations in which it would be unethical or impossible to conduct an experiment. There are many situations in which it would be unethical or impossible to conduct a true experiment. For example, suppose that you want to determine whether physically abusing children will cause those individuals to become child abusers themselves when they grow up. Obviously, no one would wish to conduct a true experiment in which half of the child subjects are assigned to an "abused" condition, and the other half are assigned to a "nonabused" condition. Instead, you might conduct a correlational study in which you simply determine whether the way people were previously treated by their parents is correlated with the way that they now treat their own children.

•  As an early step in a research program that will eventually include experiments. In many situations, experiments are more expensive and time-consuming to conduct, relative to correlational research. Therefore, when researchers believe that two variables may be causally related, they sometimes begin their program of research by conducting a simple nonexperimental study to see if the two variables are, in fact, correlated. If yes, the researcher may then be sufficiently encouraged to proceed to more ambitious controlled studies such as experiments.
Also, remember that in many situations researchers are not even interested in testing cause-and-effect relationships. In many situations they simply wish to determine whether two variables are correlated. For example, a researcher might simply wish to know whether high scores on Test X tend to be associated with high scores on Test Y. And that is the approach that will be followed in this chapter. In general, it will discuss studies in which researchers simply wish to determine whether one variable is correlated with another. To the extent possible, it will avoid using language that implies possible cause-and-effect relationships between the variables.
Example 10.1: Correlating Weight Loss with a Variety of Predictor Variables

Overview

Most of this chapter focuses on analyzing data from a fictitious study that produces data that can be analyzed with the Pearson correlation coefficient. In this study, you will investigate the correlation between weight loss and a number of variables that might be correlated with weight loss. This section describes these variables, and shows you how to prepare the DATA step.

The Study

Hypotheses. Suppose that you are conducting a study designed to identify the variables that are predictive of weight loss in men. You want to test the following research hypotheses:

•  Hypothesis 1: Weight loss will be positively correlated with motivation: Men who are highly motivated to lose weight will tend to lose more weight than those who are less motivated.

•  Hypothesis 2: Weight loss will be positively correlated with time spent exercising: Men who exercise many hours each week will tend to lose more weight than those who exercise fewer hours each week.

•  Hypothesis 3: Weight loss will be negatively correlated with calorie consumption: Men who consume many calories each day will tend to lose less weight than those who consume fewer calories each day.

•  Hypothesis 4: Weight loss will be positively correlated with intelligence: Men who are highly intelligent will tend to lose more weight than those who are less intelligent.
Research method. To test these hypotheses, you conduct a correlational study with a group of 22 men over a 10-week period. At the beginning of the study, you administer a 5-item scale that is designed to assess each subject's motivation to lose weight. The scale consists of statements such as "It is very important to me to lose weight." Subjects respond to each item using a 7-point response format in which 1 = "Disagree Very Strongly" and 7 = "Agree Very Strongly." You sum their responses to the five items to create a single motivation score for each subject. Scores on this measure may range from 5 to 35, with higher scores representing greater motivation to lose weight. You will correlate this motivation scale with subsequent weight loss to test Hypothesis 1 (from above).

Throughout the 10-week study, you ask each subject to record the number of hours that he exercises each week. At the end of the study, you determine the average number of hours spent exercising for each subject, and correlate this number of hours spent exercising with subsequent weight loss to test Hypothesis 2.

Throughout the study, you also ask each subject to keep a log of the number of calories that he consumes each day. At the end of the study, you compute the average number of calories consumed by each subject. You will correlate this measure of daily calorie intake with subsequent weight loss. You use this correlation to test Hypothesis 3.

At the beginning of the study you also administer the Wechsler Adult Intelligence Scale (WAIS) to each subject. The combined IQ score from this instrument will serve as the measure of intelligence in your study. You will correlate IQ with subsequent weight loss to test Hypothesis 4.

Throughout the 10-week study, you weigh each subject and record his body weight in kilograms (1 kilogram is equal to approximately 2.2 pounds). When the study is completed, you subtract each subject's body weight at the end of the study from his weight at the beginning of the study. You use the resulting difference as your measure of weight loss.
Chapter 10: Bivariate Correlation 305
Data Set to Be Analyzed

Table 10.1 presents fictitious scores for each subject on each of the variables to be analyzed in this study.

Table 10.1
Variables Analyzed in the Weight Loss Study
___________________________________________________________________
              Kilograms                Hours       Calories
Subject         lost      Motivation   exercising  consumed    IQ
___________________________________________________________________
01. John         2.60          5           0         2400      100
02. George       1.00          5           0         2000      120
03. Fred         1.80         10           2         1600      130
04. Charles      2.65         10           5         2400      140
05. Paul         3.70         10           4         2000      130
06. Jack         2.25         15           4         2000      110
07. Emmett       3.00         15           2         2200      110
08. Don          4.40         15           3         1400      120
09. Edward       5.35         15           2         2000      110
10. Rick         3.25         20           1         1600       90
11. Ron          4.35         20           5         1800      150
12. Dale         5.60         20           3         2200      120
13. Bernard      6.44         20           6         1200       90
14. Walter       4.80         25           1         1600      140
15. Doug         5.75         25           4         1800      130
16. Scott        6.90         25           5         1400      140
17. Sam          7.75         25           .         1400      100
18. Barry        5.90         30           4         1600      100
19. Bob          7.20         30           5         2000      150
20. Randall      8.20         30           2         1200      110
21. Ray          7.80         35           4         1600      130
22. Tom          9.00         35           6         1600      120
___________________________________________________________________
Table 10.1 provides scores for 22 male subjects. The first subject appearing in the table is named John. Table 10.1 shows the following values for John on the study's variables:

•  He lost 2.60 kgs of weight by the end of the study.

•  His score on the motivation to lose weight scale was 5 (out of a possible 35).

•  His score on "Hours Exercising" was 0, meaning that he exercised zero hours per week on the average.

•  His score on calories was 2400, meaning that he consumed 2400 calories each day, on the average.

•  His IQ was 100 (with the WAIS, the mean IQ is 100 and the standard deviation is about 15 in the population).
Scores for the remaining subjects can be interpreted in the same way.
The DATA Step for the SAS Program

Below is the DATA step for the SAS program that will read the data presented in Table 10.1:

 1    OPTIONS  LS=80  PS=60;
 2    DATA D1;
 3       INPUT   SUB_NUM
 4               KG_LOST
 5               MOTIVAT
 6               EXERCISE
 7               CALORIES
 8               IQ;
 9    DATALINES;
10    01  2.60   5  0  2400  100
11    02  1.00   5  0  2000  120
12    03  1.80  10  2  1600  130
13    04  2.65  10  5  2400  140
14    05  3.70  10  4  2000  130
15    06  2.25  15  4  2000  110
16    07  3.00  15  2  2200  110
17    08  4.40  15  3  1400  120
18    09  5.35  15  2  2000  110
19    10  3.25  20  1  1600   90
20    11  4.35  20  5  1800  150
21    12  5.60  20  3  2200  120
22    13  6.44  20  6  1200   90
23    14  4.80  25  1  1600  140
24    15  5.75  25  4  1800  130
25    16  6.90  25  5  1400  140
26    17  7.75  25  .  1400  100
27    18  5.90  30  4  1600  100
28    19  7.20  30  5  2000  150
29    20  8.20  30  2  1200  110
30    21  7.80  35  4  1600  130
31    22  9.00  35  6  1600  120
32    ;
Some notes about the preceding program:

•  Line 1 of the preceding program contains the OPTIONS statement which, in this case, specifies the size of the printed page of output. One entry in the OPTIONS statement is "PS=60", which is an abbreviation for "PAGESIZE=60." This key word requests that each page of output have up to 60 lines of text on it. Depending on the font that you are using (and other factors), requesting PS=60 may cause the bottom of your scatterplot to be "cut off" when it is printed. If this happens, you should change the OPTIONS statement so that it requests just 50 lines of text per page. You will do this by including PS=50 in your OPTIONS statement, rather than PS=60. Your complete OPTIONS statement should appear as follows:

      OPTIONS  LS=80  PS=50;

•  You can see that lines 3–8 of the preceding program provide the INPUT statement. There, the SAS variable name SUB_NUM is used to represent "subject number," KG_LOST is used to represent "kilograms lost," MOTIVAT is used to represent "motivation to lose weight," and so forth.

•  The data themselves appear in lines 10–31. The data on these lines are identical to the data appearing in Table 10.1, except that the names of the subjects have been removed.
Using PROC PLOT to Create a Scattergram Overview A scattergram is a type of graph that is useful when you are plotting one multi-value variable against a second multi-value variable. This section explains why it is always necessary to plot your variables in a scattergram prior to computing a Pearson correlation coefficient. It then shows how to use PROC PLOT to create a scattergram, and how to interpret the output created by PROC PLOT. Why You Should Create a Scattergram Prior to Computing a Correlation Coefficient What is a scattergram? A scattergram (also called a scatterplot) is a graph that plots the individual data points from a correlational study. It is particularly useful when both of the variables being analyzed are multi-value variables. Each data point in a scattergram represents a single observation (typically a human subject). The data point indicates where the subject stands on both the predictor variable (the X variable) and the criterion variable (the Y variable). You should always create a scattergram for a given pair of variables prior to computing the correlation between those variables. This is because the Pearson correlation coefficient is appropriate only if the relationship between the variables is linear; it is not appropriate when the relationship is nonlinear. Linear versus nonlinear relationships. There is a linear relationship between two variables when their scattergram follows the form of a straight line. This means that, in the population, the mean criterion scores at each value of the predictor variable should fall on a straight line. When there is a linear relationship between X and Y, it is possible to draw a straight line through the center of the scattergram. In contrast, there is a nonlinear relationship between two variables if their scattergram does not follow the form of a straight line. For example, imagine that you have constructed a test of creativity, and have administered it to a large sample of college students. With this test, higher scores reflect higher levels of creativity. Imagine further that you obtain the Learning Aptitude Test (LAT) verbal test scores for these students, plot their LAT scores
against their creativity scores, creating a scattergram. With this scattergram, LAT scores are plotted on the horizontal axis, and creativity scores are plotted on the vertical axis. Suppose that this scattergram shows that (a) students with low LAT scores tend to have low creativity scores, (b) students with moderate LAT scores tend to have high creativity scores, and (c) students with high LAT scores tend to have low creativity scores. Such a scattergram would take the form of an upside-down "U." It would not be possible to draw a good-fitting straight line through the data points of this scattergram, and this is why we would say that there is a nonlinear (or perhaps a curvilinear) relationship between LAT scores and creativity scores.

Problems with nonlinear relationships. When you use the Pearson correlation coefficient to assess the relationship between two variables involved in a nonlinear relationship, the resulting correlation coefficient usually underestimates the actual strength of the relationship between the two variables. For example, computing the Pearson correlation between the LAT scores and creativity scores (from the preceding example) might result in a correlation coefficient of .10, which would indicate only a very weak relationship between the two variables. And yet there might actually be a fairly strong relationship between LAT scores and creativity: It may be possible to predict someone's creativity with great accuracy if you know where they stand on the LAT. Unfortunately, you would never know this if you did not first create the scattergram. The implication of all this is that you should always create a scattergram to verify that there is a linear relationship between two variables before computing a Pearson correlation for those variables. Fortunately, this is very easy to do using the SAS PLOT procedure.

Syntax for the SAS Program

Here is the syntax for requesting a scattergram with the PLOT procedure:

      PROC PLOT   DATA=data-set-name;
         PLOT  criterion-variable*predictor-variable ;
         TITLE1 ' your-name ';
      RUN;
The variable listed as the “criterion-variable” in the preceding program will be plotted on the vertical axis (the Y axis), and the “predictor-variable” will be plotted on the horizontal axis (the X axis).
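In other words, the order of the two variable names in the PLOT statement controls the orientation of the graph. As a hedged illustration (this particular program is not one of the chapter's numbered examples, but it uses only variables that already exist in data set D1), reversing the order would place MOTIVAT on the vertical axis and KG_LOST on the horizontal axis:

      PROC PLOT   DATA=D1;
         PLOT MOTIVAT*KG_LOST;
         TITLE1 'JANE DOE';
      RUN;

The correlation itself does not depend on which variable you treat as the predictor, but the appearance of the scattergram does.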
Suppose that you wish to compute the correlation between KG_LOST (kilograms lost) and MOTIVAT (the motivation to lose weight). Prior to computing the correlation, you would use PROC PLOT to create a scattergram plotting KG_LOST against MOTIVAT. The SAS statements that would create this scattergram are:

       [First part of the DATA step appears here]
28     20  8.20  30  2  1200  110
29     21  7.80  35  4  1600  130
30     22  9.00  35  6  1600  120
31     ;
32     PROC PLOT   DATA=D1;
33        PLOT KG_LOST*MOTIVAT;
34        TITLE1 'JANE DOE';
35     RUN;
Some notes about the preceding program:
• To conserve space, the preceding shows only the last few data lines from the DATA step on lines 28–30. This DATA step was presented in full in the preceding section titled "The DATA Step for the SAS Program."
• The PROC PLOT statement appears on line 32. The DATA option for this statement requests that the analysis be performed on the data set named D1.
• The PLOT statement appears on line 33. It requests that KG_LOST serve as the criterion variable (Y variable) in the plot, and that MOTIVAT serve as the predictor variable (X variable). This means that KG_LOST will appear on the vertical axis, and MOTIVAT will appear on the horizontal axis.
• Lines 34–35 present the TITLE1 and RUN statements for the program.
Results from the SAS Output

Output 10.1 presents the scattergram that was created by the preceding program.

      JANE DOE
      Plot of KG_LOST*MOTIVAT.     Legend: A = 1 obs, B = 2 obs, etc.

      [Scattergram: KG_LOST is plotted on the vertical axis (values 1 through 9), and MOTIVAT is plotted on the horizontal axis (values 5 through 35). Each "A" marks one observation, and the points rise from the lower left of the plot toward the upper right.]
Output 10.1. Scattergram plotting kilograms lost against the motivation to lose weight.
Understanding the scattergram. Notice that, in this output, the criterion variable (KG_LOST) is plotted on the vertical axis, while the predictor variable (MOTIVAT) is plotted on the horizontal axis. Each letter in a scattergram represents one or more individual subjects. For example, consider the letter “A” that appears in the top right corner of the scattergram. This letter is located directly above a score of “35” on the “MOTIVAT” axis (the horizontal axis). It is also located directly to the right of a score of “9” on the
KG_LOST axis. This means that this letter represents a person who had a score of 35 on MOTIVAT and a score of 9 on KG_LOST. In contrast, now consider the letter “A” that appears in the lower left corner of the scattergram. This letter is located directly above a score of “5” on the “MOTIVAT” axis (the horizontal axis). It is also located directly to the right of a score of “1” on the KG_LOST axis. This means that this letter represents a person who had a score of 5 on MOTIVAT and a score of 1 on KG_LOST. Each of the remaining letters in the scattergram can be interpreted in the same fashion. The legend at the top of the output says, “Legend: A = 1 obs, B = 2 obs, etc.” This means that a particular letter in the graph may represent one or more observations (human subjects, in this case). If you see the letter “A,” it means that a single person is located at that point (i.e., a single subject had that particular combination of scores on KG_LOST and MOTIVAT). If you see the letter “B,” it means that two people are located at that point, and so forth. You can see that only the letter “A” appears in this output, meaning that there was no point in the scattergram where more than one person scored. Drawing a straight line through the scattergram. The shape of the scattergram in Output 10.1 shows that there is a linear relationship between KG_LOST and MOTIVAT. This can be seen from the fact that it would be possible to draw a good-fitting straight line through the center of the scattergram. To illustrate this, Output 10.2 presents the same scattergram, this time with a straight line drawn through its center.
      JANE DOE
      Plot of KG_LOST*MOTIVAT.     Legend: A = 1 obs, B = 2 obs, etc.

      [The same scattergram as in Output 10.1 (KG_LOST on the vertical axis, MOTIVAT on the horizontal axis), this time with a good-fitting straight line drawn through the center of the data points.]
Output 10.2. Graph plotting kilograms lost against the motivation to lose weight, with straight line drawn through the center of the scattergram.
Strength of the relationship. The general shape of the scattergram also suggests that there is a fairly strong relationship between the two variables: Knowing where a subject stands on the MOTIVAT variable enables us to predict, with some accuracy, where that subject will stand on the KG_LOST variable. Later, we will compute the correlation coefficient for these two variables to see just how strong the relationship is. Positive versus negative relationships. Output 10.2 shows that the relationship between MOTIVAT and KG_LOST is positive: Large values on MOTIVAT are associated with large values on KG_LOST, and small values on MOTIVAT are associated with small values on KG_LOST. This makes intuitive sense: You would expect that the subjects who are highly motivated to lose weight would in fact be the subjects who would lose the most
weight. When there is a positive relationship between two variables, the scattergram will stretch from the lower left corner to the upper right corner of the graph (as is the case in Output 10.2). In contrast, when there is a negative relationship between two variables, the scattergram will be distributed from the upper left corner to the lower right corner of the graph. A negative relationship means that small values on the predictor variable are associated with large values of the criterion variable, and large values on the predictor variable are associated with small values on the criterion variable. Because the relationship between MOTIVAT and KG_LOST is linear, it is reasonable to proceed with the computation of a Pearson correlation for this pair of variables.
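If you intend to compute correlations for several pairs of variables, the same logic applies to each pair: each relationship should be checked for linearity first. As a hedged sketch of one convenient way to do this (this program is not one of the chapter's numbered examples, but it uses only variables already defined in data set D1), you can list several plot requests in a single PLOT statement:

      PROC PLOT   DATA=D1;
         PLOT KG_LOST*MOTIVAT  KG_LOST*EXERCISE;
         TITLE1 'JANE DOE';
      RUN;

Each plot request produces its own scattergram in the output.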
Using PROC CORR to Compute the Pearson Correlation between Two Variables

Overview

This chapter illustrates three different ways of using PROC CORR to compute correlation coefficients. It shows (a) how to compute the correlation between just two variables, (b) how to compute all possible correlations between a number of variables, and (c) how to use the VAR and WITH statements to selectively suppress the printing of some correlations. The present section focuses on the first of these: computing the correlation between just two variables. This section shows how to manage the PROC step for the analysis, how to interpret the output produced by PROC CORR, and how to prepare a report that summarizes the results.

Syntax for the SAS Program

In some instances, you may wish to compute the correlation between just two variables. Here is the syntax for the statements that will accomplish this:

      PROC CORR   DATA=data-set-name  options ;
         VAR  variable1  variable2 ;
         TITLE1 ' your-name ';
      RUN;

In the PROC CORR statement, you specify the name of the data set to be analyzed and request any options for the analysis. A section toward the end of this chapter will discuss some of the options available with PROC CORR.
You use the VAR statement to list the names of the two variables to be correlated. (The choice of which variable is "variable1" and which is "variable2" is arbitrary.) For example, suppose that you want to compute the correlation between the number of kilograms lost (KG_LOST) and the motivation to lose weight (MOTIVAT). Here are the required statements:

       [First part of the DATA step appears here]
28     20  8.20  30  2  1200  110
29     21  7.80  35  4  1600  130
30     22  9.00  35  6  1600  120
31     ;
32     PROC CORR   DATA=D1;
33        VAR KG_LOST MOTIVAT;
34        TITLE1 'JANE DOE';
35     RUN;
Some notes concerning the preceding statements:
• To conserve space, the preceding code block shows only the last few data lines from the DATA step on lines 28–30. This DATA step was presented in full in the preceding section titled "The DATA Step for the SAS Program."
• The PROC CORR statement appears on line 32. The DATA option for this statement requests that the analysis be performed on the data set named D1.
• The VAR statement appears on line 33. It requests that SAS compute the correlation between KG_LOST and MOTIVAT.
• Lines 34–35 present the TITLE1 and RUN statements for the program.
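As noted in the overview, a later section of this chapter shows how to obtain all possible correlations among a set of variables. As a hedged preview of the general form (this exact program is not one of the chapter's numbered examples), you would simply list more variable names in the VAR statement, and PROC CORR would then print the correlation for every possible pair:

      PROC CORR   DATA=D1;
         VAR KG_LOST MOTIVAT EXERCISE CALORIES IQ;
         TITLE1 'JANE DOE';
      RUN;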
Results from the SAS Output

The preceding program results in a single page of output, reproduced here as Output 10.3:

                               JANE DOE

                          The CORR Procedure

                   2  Variables:    KG_LOST  MOTIVAT

                           Simple Statistics

Variable     N         Mean      Std Dev          Sum      Minimum      Maximum
KG_LOST     22      4.98591      2.27488    109.69000      1.00000      9.00000
MOTIVAT     22     20.00000      8.99735    440.00000      5.00000     35.00000

                Pearson Correlation Coefficients, N = 22
                       Prob > |r| under H0: Rho=0

                         KG_LOST       MOTIVAT
            KG_LOST      1.00000       0.88524
                                        <.0001

… "Pr > F'." For the current analysis, this p value is 0.8350. This means that the probability of obtaining an F' this large or larger when the population variances are equal is quite large––it is 0.8350. Your obtained p value
is greater than the criterion of .05, and so you fail to reject the null hypothesis of equal variances––instead you tentatively conclude that the variances are equal. This means that you can interpret the equal variances t statistic (in the step that follows this one).

To review, here is a summary of how you are to interpret the results presented in the Equality of Variances table:
• When the "Pr > F'" is nonsignificant (greater than .05), report the t test based on equal variances.
• When the "Pr > F'" is significant (less than .05), report the t test based on unequal variances.
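For reference, output of the kind being interpreted in these steps comes from an independent-samples run of PROC TTEST. The statements below are only a minimal sketch of the general form: the data set name D1 and the grouping-variable name GROUP are placeholder assumptions (the chapter's own program, with its actual names, appears in an earlier section), while AGGRESS is the criterion variable that appears in the output below.

      PROC TTEST   DATA=D1;
         CLASS GROUP;
         VAR AGGRESS;
         TITLE1 'JANE DOE';
      RUN;

The CLASS statement names the variable that identifies the two groups, and the VAR statement names the criterion variable to be compared across those groups.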
3. Review the t test for the difference between the means. You are now ready to determine whether there is a significant difference between your sample means. To do this, you will refer to the "Statistics" table and the "T-Tests" table from your output. For your convenience, those tables are reproduced here as Output 13.4.

                                  JANE DOE                                     1

                             The TTEST Procedure

                                  Statistics

                               Lower CL             Upper CL   Lower CL
Variable   Class          N    Mean        Mean     Mean       Std Dev    Std Dev
AGGRESS    PUN           18    2.9766      4        5.0234     1.5443     2.058
AGGRESS    REW           18    6.0338      7.1111   8.1884     1.6256     2.1663
AGGRESS    Diff (1-2)          -4.542      -3.111   -1.68      1.709      2.1128

                                  Statistics

                         Upper CL
Variable   Class         Std Dev     Std Err    Minimum    Maximum
AGGRESS    PUN           3.0852      0.4851     0          8
AGGRESS    REW           3.2476      0.5106     3          11
AGGRESS    Diff (1-2)    2.7682      0.7043

                                  T-Tests

Variable    Method           Variances       DF    t Value    Pr > |t|
AGGRESS     Pooled           Equal           34      -4.42      <.0001
AGGRESS     Satterthwaite    Unequal       33.9      -4.42      <.0001

… You can see that the p value for the current analysis is "< .0001", which means that it is less than one in ten thousand. This text recommends that you reject the null hypothesis whenever your obtained p value is less than .05. The obtained p value of …
Means and standard deviations for the treatment conditions. In the formal description of results for a paper (in Item M), the third sentence states:

   ...(for emotional infidelity, M = 24.00, SD = 2.29; for sexual infidelity, M = 21.06, SD = 1.92).

In this excerpt, the symbol "M" represents the mean, and "SD" represents the standard deviation. This sentence reports the mean and standard deviation of EMOTION (which contained distress scores obtained under the emotional infidelity condition), and SEXUAL (which contained distress scores obtained under the sexual infidelity condition). These means and standard deviations appear in the results of PROC MEANS, under the headings "Mean" and "Std Dev", respectively.
Example 14.2: An Illustration of Results Showing Nonsignificant Differences

Overview

This section presents the results of an analysis of a different data set––a data set that is designed to produce nonsignificant results. This will enable you to see how nonsignificant results might appear in your output. A later section will show you how to summarize nonsignificant results in an analysis report.

The SAS Output

Output 14.9 resulted from the analysis of a different fictitious data set, one in which the means for the two treatment conditions are not significantly different.

                              The MEANS Procedure

Variable     N            Mean         Std Dev         Minimum         Maximum
-------------------------------------------------------------------------------
EMOTION     17      21.0588235       1.9193289      18.0000000      25.0000000
SEXUAL      17      20.9411765       1.5996323      18.0000000      24.0000000
-------------------------------------------------------------------------------

                              The TTEST Procedure

                                   Statistics

                            Lower CL             Upper CL   Lower CL
Difference           N      Mean        Mean     Mean       Std Dev    Std Dev
EMOTION - SEXUAL    17      -0.807      0.1176   1.0424     1.3396     1.7987

                                   Statistics

                     Upper CL
Difference           Std Dev     Std Err    Minimum    Maximum
EMOTION - SEXUAL     2.7375      0.4362     -4         3

                                   T-Tests

Difference           DF    t Value    Pr > |t|
EMOTION - SEXUAL     16       0.27      0.7909
Output 14.9. Results from PROC MEANS and PROC TTEST, infidelity study (nonsignificant differences).
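For reference, output in this two-part format comes from a PROC MEANS step followed by a PROC TTEST step that uses the PAIRED statement. The statements below are only a hedged illustration of that general form, not the chapter's own program listing; the data set name D1 is a placeholder assumption, while EMOTION and SEXUAL are the two distress variables described above.

      PROC MEANS   DATA=D1;
         VAR EMOTION SEXUAL;
         TITLE1 'JANE DOE';
      RUN;
      PROC TTEST   DATA=D1;
         PAIRED EMOTION*SEXUAL;
      RUN;

The PAIRED statement names the two variables whose difference scores are tested, which is why the TTEST tables are labeled "EMOTION - SEXUAL."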
Steps in Interpreting the Output

Overview. You would normally interpret Output 14.9 following the same steps that were listed in the previous section titled "Steps in Interpreting the Output." However, this section will focus only on those results that are most relevant to the significance test, the confidence interval, and the index of effect size.
1. Review the means on the criterion variable obtained under the two conditions. In the output from PROC MEANS, below the heading "Mean", you can see that the mean distress score obtained under the emotional infidelity condition was 21.0588, and the mean score obtained under the sexual infidelity condition was 20.9412. You can see that there does not appear to be a large difference between these two treatment means.

2. Review the t test for the difference between the means. From the results of PROC TTEST, below the title "t Value", you can see that the obtained t statistic for the current analysis is 0.27. The p value that is associated with this t statistic is 0.7909. Because this obtained p value is larger than the standard criterion of .05, you fail to reject the null hypothesis. In your report, you will indicate that the difference between the two treatment conditions is not statistically significant.

3. Review the confidence interval for the difference between the means. From the output of PROC TTEST, you can see that the observed difference between the means for the two treatment conditions is 0.1176. The 95% confidence interval for this difference ranges from –0.807 to 1.0424. Notice that this interval does include the value of zero, which is consistent with your failure to reject the null hypothesis.

4. Compute the index of effect size. The formula for computing effect size in a paired-samples t test is reproduced here:

             | X1 – X2 |
   d  =  –––––––––––––––
                 sD
The symbols X1 and X2 represent the sample means on the "distress" variable obtained under the two treatment conditions. From Output 14.9, you can see that the mean distress score obtained under the emotional infidelity condition was 21.0588, and the mean distress score obtained under the sexual infidelity condition was 20.9412. Inserting these means into the formula for the effect size index results in the following:

             | 21.0588 – 20.9412 |
   d  =  –––––––––––––––––––––––––
                     sD
The symbol sD in the formula represents the estimated standard deviation of difference scores in the population. This statistic may be found in the "Statistics" table produced by PROC TTEST. This table is reproduced here:

                                   Statistics

                            Lower CL             Upper CL   Lower CL
Difference           N      Mean        Mean     Mean       Std Dev    Std Dev
EMOTION - SEXUAL    17      -0.807      0.1176   1.0424     1.3396     1.7987

                                   Statistics

                     Upper CL
Difference           Std Dev     Std Err    Minimum    Maximum
EMOTION - SEXUAL     2.7375      0.4362     -4         3
Output 14.10. Estimated population standard deviation that is needed to compute effect size for the infidelity study (nonsignificant differences).
In Output 14.10, you can see that the estimated population standard deviation is 1.7987. Substituting this value into the formula for effect size results in the following:

             | 21.0588 – 20.9412 |
   d  =  –––––––––––––––––––––––––
                   1.7987

                 | .1176 |
   d  =  –––––––––––––––––––––––––
                   1.7987

   d  =  .0654

   d  =  .07
Thus, the index of effect size for the current analysis is .07. Cohen's guidelines (appearing in Table 14.4) indicated that a "small effect" was obtained when d = .20. The value of .07 that was obtained with the present analysis was well below this criterion, indicating that the present manipulation produced less than a small effect.

Summarizing the Results of the Analysis

Following is an analysis report that summarizes the preceding research question and results.

A) Statement of the research question: The purpose of this study is to determine whether there is a difference between emotional infidelity versus sexual infidelity with respect to the amount of psychological distress that they produce in women.

B) Statement of the research hypothesis: When they are asked to imagine how they would feel if they learned that their partner had been unfaithful, women will display higher levels of psychological distress when imagining emotional infidelity than when imagining sexual infidelity.
C) Nature of the variables: This analysis involved one predictor variable and one criterion variable.
• The predictor variable was type of infidelity. This was a dichotomous variable, was assessed on a nominal scale, and included two conditions: an emotional infidelity condition versus a sexual infidelity condition.
• The criterion variable was subjects' scores on a 4-item measure of distress. This was a multi-value variable and was assessed on an interval scale.

D) Statistical test: Paired-samples t test.

E) Statistical null hypothesis (H0): µ1 = µ2; In the study population, there is no difference between the emotional infidelity condition versus the sexual infidelity condition with respect to mean scores on the criterion variable (the measure of psychological distress).

F) Statistical alternative hypothesis (H1): µ1 ≠ µ2; In the study population, there is a difference between the emotional infidelity condition versus the sexual infidelity condition with respect to mean scores on the criterion variable (the measure of psychological distress).

G) Obtained statistic: t = 0.27

H) Obtained probability (p) value: p = .7909

I) Conclusion regarding the statistical null hypothesis: Fail to reject the null hypothesis.

J) Confidence interval: Subtracting the mean of the sexual infidelity condition from the mean of the emotional infidelity condition resulted in an observed difference of 0.12. The 95% confidence interval for this difference extended from –0.81 to 1.04.

K) Effect size: d = .07.
L) Conclusion regarding the research hypothesis: These findings fail to provide support for the study’s research hypothesis. M) Formal description of results for a paper: Results were analyzed using a paired-samples t test. This analysis revealed a statistically nonsignificant difference between the two conditions, t(16) = 0.27, p = .7909. The sample means are displayed in Figure 14.3, which shows that the mean distress score obtained under the emotional infidelity condition was similar to the mean distress score obtained under the sexual infidelity condition (for emotional infidelity, M = 21.06, SD = 1.92; for sexual infidelity, M = 20.94, SD = 1.60). The observed difference between the means was 0.12, and the 95%
confidence interval for the difference between means ranged from –0.81 to 1.04. The effect size was computed as d = .07. According to Cohen's (1969) guidelines, this represents less than a small effect.

N) Figure representing the results:
Figure 14.3. Mean scores on the measure of psychological distress as a function of the type of infidelity (nonsignificant differences).
Conclusion

In this chapter, you learned how to perform a paired-samples t test. With the information learned here, along with the information learned in Chapter 13, "Independent-Samples t Test," you should now be prepared to analyze data from many types of studies that compare two treatment conditions. But what if you are conducting an investigation that involves more than two treatment conditions? For example, you might conduct a study that investigates the effect of caffeine on learning in laboratory rats. Such a study might involve four treatment conditions: (a) a group given zero mg of caffeine, (b) a group given 1 mg of caffeine, (c) a group given 2 mg of caffeine, and (d) a group given 3 mg of caffeine. You might think that the way to analyze the data from this study would be to perform a series of t tests in which you compare every possible combination of conditions. But most researchers and statisticians would advise against this approach. Instead, most statisticians and researchers would counsel you to analyze your data using a one-way analysis of variance (abbreviated as ANOVA).
Analysis of variance is one of the most flexible and widely used statistical procedures in the behavioral sciences and education. It is essentially an expansion of the t test because it enables you to analyze data from studies that involve more than two treatment conditions. The following chapter shows you how to use SAS to perform a one-way analysis of variance.
One-Way ANOVA with One Between-Subjects Factor

Introduction ............................................................. 491
   Overview .............................................................. 491
Situations Appropriate for One-Way ANOVA with One Between-Subjects Factor  491
   Overview .............................................................. 491
   Nature of the Predictor and Criterion Variables ....................... 491
   The Type-of-Variable Figure ........................................... 492
   Example of a Study Providing Data That Are Appropriate for This Procedure  492
   Summary of Assumptions Underlying One-Way ANOVA with One Between-Subjects Factor  493
A Study Investigating Aggression ......................................... 494
   Overview .............................................................. 494
   Research Method ....................................................... 495
   The Research Design ................................................... 496
Treatment Effects, Multiple Comparison Procedures, and a New Index of Effect Size  497
   Overview .............................................................. 497
   Treatment Effects ..................................................... 497
   Multiple Comparison Procedures ........................................ 498
   R2, an Index of Variance Accounted For ................................ 499
Some Possible Results from a One-Way ANOVA ............................... 500
   Overview .............................................................. 500
   Significant Treatment Effect, All Multiple Comparison Tests are Significant  500
   Significant Treatment Effect, Two of Three Multiple Comparison Tests Are Significant  502
   Nonsignificant Treatment Effect ....................................... 504
Example 15.1: One-Way ANOVA Revealing a Significant Treatment Effect ..... 505
   Overview .............................................................. 505
   Choosing SAS Variable Names and Values to Use in the SAS Program ...... 505
   Data Set to Be Analyzed ............................................... 506
   Writing the SAS Program ............................................... 507
   Keywords for Other Multiple Comparison Procedures ..................... 510
   Output Produced by the SAS Program .................................... 511
   Steps in Interpreting the Output ...................................... 511
   Using a Figure to Illustrate the Results .............................. 525
   Analysis Report for the Aggression Study (Significant Results) ........ 526
   Notes Regarding the Preceding Analysis Report ......................... 528
Example 15.2: One-Way ANOVA Revealing a Nonsignificant Treatment Effect .. 529
   Overview .............................................................. 529
   The Complete SAS Program .............................................. 530
   Steps in Interpreting the Output ...................................... 531
   Using a Graph to Illustrate the Results ............................... 534
   Analysis Report for the Aggression Study (Nonsignificant Results) ..... 535
Conclusion ............................................................... 537
Introduction

Overview

This chapter shows how to enter data into SAS and prepare SAS programs that will perform a one-way analysis of variance (ANOVA) using the GLM procedure. This chapter focuses on between-subjects research designs: designs in which each subject is exposed to only one condition under the independent variable. It shows how to determine whether there is a significant effect for the study's independent variable, how to use multiple comparison procedures to identify the pairs of groups that are significantly different from each other, how to request confidence intervals for differences between the means, and how to interpret an index of effect size. Finally, it shows how to prepare a report that summarizes the results of the analysis.
Situations Appropriate for One-Way ANOVA with One Between-Subjects Factor

Overview

One-way ANOVA is a test of group differences: it enables you to determine whether there are significant differences between two or more treatment conditions with respect to their mean scores on a criterion variable. ANOVA has an important advantage over a t test: A t test enables you to determine whether there is a significant difference between only two groups. ANOVA, on the other hand, enables you to determine whether there is a significant difference between two or more groups. ANOVA is routinely used to analyze data from experiments that involve three or more treatment conditions. In summary, one-way ANOVA with one between-subjects factor can be used when you want to investigate the relationship between (a) a single predictor variable (which classifies group membership) and (b) a single criterion variable.

Nature of the Predictor and Criterion Variables

Predictor variable. With ANOVA, the predictor (or independent) variable is a type of classification variable: it simply indicates which group a subject is in.

Criterion variable. With analysis of variance, the criterion (or dependent) variable is typically a multi-value variable. It must be a numeric variable that is assessed on either an interval or ratio level of measurement. The criterion variable in the analysis must also satisfy a number of additional assumptions, and these assumptions are summarized in a later section.
The Type-of-Variable Figure

The figure below illustrates the types of variables that are typically being analyzed when researchers perform a one-way ANOVA with one between-subjects factor.

   [Type-of-variable figure: the criterion variable (represented by the "Multi" symbol) appears to the left of an equal sign, and the predictor variable (represented by the "Lmt" symbol) appears to the right.]
The “Multi” symbol that appears in the above figure shows that the criterion variable in an ANOVA is typically a multi-value variable (a variable that assumes more than six values in your sample). The “Lmt” symbol that appears to the right of the equal sign in the above figure shows that the predictor variable in this procedure is usually a limited-value variable (i.e., a variable that assumes just two to six values). Example of a Study Providing Data That Are Appropriate for This Procedure The Study. Suppose that you are an industrial psychologist studying work motivation and work safety. You are trying to identify interventions that may increase the likelihood that employees will engage in safe behaviors at work. In your current investigation, you are working with pizza deliverers. With this population, employers are interested in increasing the likelihood that they will display safe driving behaviors. They are particularly interested in interventions that may increase the frequency with which they come to a full stop at stop signs (as opposed to a more dangerous “rolling stop”). You are investigating two research questions: •
• Does setting safety-related goals for employees increase the frequency with which they will engage in safe behaviors?
• Do participatively set goals tend to be more effective than goals that are assigned by a supervisor without participation?
To explore these questions, you conduct an experiment in which you randomly assign 30 pizza deliverers to one of three treatment conditions:
• 10 drivers are assigned to the participative goal-setting condition. These drivers meet as a group and set goals for themselves with respect to how frequently they should come to a full stop at stop signs.
• 10 drivers are assigned to the assigned goal-setting condition. The drivers in this group meet with their supervisors, and the supervisors assign goals regarding how frequently they should come to a full stop at stop signs (unbeknownst to the drivers, these goals are the same as the goals developed by the preceding group).
• 10 drivers are assigned to the control condition. Drivers in this condition do not experience any goal-setting.
You secretly observe the drivers at stop signs over a two-month period, noting how many times each driver comes to a full stop out of 30 opportunities. You perform a one-way ANOVA to determine whether there are significant differences between the three groups with respect to their average number of full stops out of 30. The ANOVA procedure permits you to determine (a) whether the two goal-setting groups displayed a greater average number of full stops, compared to the control group, and (b) whether the participative goalsetting group displayed a greater number of full stops, compared to the assigned goal-setting group. Why these data would be appropriate for this procedure. The preceding study involved a single predictor variable and a single criterion variable. The predictor variable was “type of motivational intervention.” You know that this was a limited-value variable because it assumed only three values: a participative goal-setting condition, an assigned goal-setting condition, and a control condition. This predictor variable was assessed on a nominal scale because it indicates group membership but does not convey any quantitative information. However, remember that the predictor variable used in a one-way ANOVA may be assessed on any scale of measurement. The criterion variable in this study was the number of full stops displayed by the pizza drivers. This was a numeric variable, and you know it was assessed on a ratio scale because it had equal intervals and a true zero point. You would know that this was a multi-value variable if you used a procedure such as PROC FREQ to verify that the drivers’ scores displayed a relatively large number of values (i.e., some drivers had zero full stops out of a possible 30, other drivers had 30 full stops out of a possible 30, and still other drivers had any number of full stops between these two extremes). Before you analyze your data with ANOVA, you first want to perform a number of other preliminary analyses on your data to verify that they meet the assumptions underlying this statistical procedure. The most important of these assumptions are summarized in the following section. Note: Although the study described here is fictitious, it is based on a real study reported by Ludwig and Geller (1997). Summary of Assumptions Underlying One-Way ANOVA with One Between-Subjects Factor •
• Level of measurement. The criterion variable must be a numeric variable that is assessed on an interval or ratio level of measurement. The predictor variable may be assessed on any level of measurement, although it is essentially treated as a nominal-level (classification) variable in the analysis.
• Independent observations. A particular observation should not be dependent on any other observation in any group. In practical terms, this means that (a) each subject is exposed to just one condition under the predictor variable and (b) subject matching procedures are not used.
• Random sampling. Scores on the criterion variable should represent a random sample drawn from the study populations.
• Normal distributions. Each group should be drawn from a normally distributed population. If each group contains over 30 subjects, the test is robust against moderate departures from normality (in this context, "robust" means that the test will still provide accurate results as long as violations of the assumptions are not large). You should analyze your data with PROC UNIVARIATE using the NORMAL option to determine whether your data meet this assumption (a brief sketch appears after this list). Be warned that the tests for normality that are provided by PROC UNIVARIATE tend to be fairly sensitive when samples are large.
• Homogeneity of variance. The populations that are represented by the various groups should have equal variances on the criterion. If the number of subjects in the largest group is no more than 1.5 times greater than the number of subjects in the smallest group, the test is robust against moderate violations of the homogeneity assumption (Stevens, 1986).
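The statements below sketch one way the normality check mentioned above might be requested, group by group. This is only an illustration, not a program from this chapter: the data set name D1 and the variable names GROUP (the classification variable) and STOPS (the criterion) are placeholder assumptions.

      PROC SORT   DATA=D1;
         BY GROUP;
      RUN;
      PROC UNIVARIATE   DATA=D1  NORMAL;
         VAR STOPS;
         BY GROUP;
         TITLE1 'JANE DOE';
      RUN;

The NORMAL option adds tests of the hypothesis that the scores were drawn from a normal population, and the BY statement repeats the analysis separately for each treatment condition (which is why the data are sorted on the classification variable first).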
A Study Investigating Aggression Overview Assume that you are conducting research concerning the possible causes of aggression in children. You are aware that social learning theory (Bandura, 1977) predicts that exposure to aggressive models can cause people to behave more aggressively. You design a study in which you experimentally manipulate the amount of aggression that a model displays. You want to determine whether this manipulation affects how aggressively children subsequently behave, after they have viewed the model. Essentially, you wish to determine whether viewing a model’s aggressive behavior can lead to an increased aggressive behavior on the part of the viewer. Your research hypothesis: There will be a positive relationship between the level of aggression displayed by a model and the number of aggressive acts later demonstrated by children who observe the model. Specifically, you predict the following: •
• Children who witness a high level of aggression will demonstrate a greater number of aggressive acts than children who witness a moderate or low level of aggression.
• Children who witness a moderate level of aggression will demonstrate a greater number of aggressive acts than children who witness a low level of aggression.
You perform a single investigation to test these hypotheses. The following sections describe the research method in more detail. Note: Although the study and results presented here are fictitious, they are inspired by a real study reported by Bandura (1965).
Research Method Overview. You conduct a study in which 24 nursery-school children serve as subjects. The study is conducted in two stages. In Stage 1, you show a short videotape to your subjects. You manipulate the independent variable by varying what the children see in this videotape. In Stage 2, you assess the dependent variable (the amount of aggression displayed by the children) to determine whether it has been affected by this independent variable. The following sections refer to your independent variable as a “predictor variable.” This is because the term “predictor variable” is more general, and is appropriate regardless of whether your variable is a true manipulated independent variable (as in the present case), or a nonmanipulated subject variable (such as subject sex). Stage 1: Manipulating the predictor variable. The predictor variable in your study is the “level of aggression displayed by the model” or, more concisely, “model aggression.” You manipulate this independent variable by randomly assigning each child to one of three treatment conditions: •
• Eight children are assigned to the low-model-aggression condition. When the subjects in this group watch the videotape, they see a model demonstrate a relatively low level of aggressive behavior. Specifically, they see a model (an adult female) enter a room that contains a wide variety of toys. For 90% of the tape, the model engages in nonaggressive play (e.g., playing with building blocks). For 10% of the tape, the model engages in aggressive play (e.g., violently punching an inflatable "bobo doll").
• Another eight children are assigned to the moderate-model-aggression condition. They watch a videotape of the same model in the same playroom, but they observe the model displaying a somewhat higher level of aggressive behavior. Specifically, in this version of the tape, the model engages in nonaggressive play (again, playing with building blocks) 50% of the time, and engages in aggressive play (again, punching the bobo doll) 50% of the time.
• Finally, the remaining eight children are assigned to the high-model-aggression condition. They watch a videotape of the same model in the same playroom, but in this version the model engages in nonaggressive play 10% of the time, and engages in aggressive play 90% of the time.
Stage 2: Assessing the criterion variable. This chapter will refer to the dependent variable in the study as a “criterion variable.” Again, this is because the term “criterion variable” is a more general term that is appropriate regardless of whether your study is a true experiment (as in the present case), or is a nonexperimental investigation. The criterion variable in this study is the “number of aggressive acts displayed by the subjects” or, more concisely, “subject aggressive acts.” The purpose of your study was to determine whether certain manipulations in your videotape caused some groups of children to behave more aggressively than others. To assess this, you allowed each child to engage in a free play period immediately after viewing the videotape. Specifically, each child was individually escorted to a play room similar to the room that was shown in the tape. This playroom contained a large assortment of toys, some of which were appropriate for
nonaggressive play (e.g., building blocks), and some of which were appropriate for aggressive play (e.g., an inflatable bobo doll identical to the one in the tape). The children were told that they could do whatever they liked in the play room, and were then left to play alone. Outside of the play room, three observers watched the child through a one-way mirror. They recorded the total number of aggressive acts the child displayed during a 20-minute period in the play room (an “aggressive act” could be an instance in which the child punches the bobo doll, throws a building block, and so forth). Therefore, the criterion variable in your study is the total number of aggressive acts demonstrated by each child during this period. The Research Design The research design used in this study is illustrated in Figure 15.1. You can see that this design is represented by a figure that consists of three squares, or cells.
Figure 15.1. Research design for the aggression study.
The figure is titled “Predictor Variable: Level of Aggression Displayed by Model.” The first cell (on the left) represents the eight subjects in Level 1 (the children who saw a videotape in which the model displayed a low level of aggression). The middle cell represents the eight subjects in Level 2 (the children who saw the model display a moderate level of aggression). Finally, the cell on the right represents the eight subjects in Level 3 (the children who saw the model display a high level of aggression).
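To make the design concrete in SAS terms, the step below sketches the general form of the analysis that such a design calls for with the GLM procedure. It is only a hedged illustration, not the program developed later in this chapter: the data set name D1 and the variable names MODGROUP (which would identify each child's treatment condition) and AGGRESS (which would contain each child's number of aggressive acts) are placeholder assumptions, and it presumes that a DATA step has already created D1 with one observation per child.

      PROC GLM   DATA=D1;
         CLASS MODGROUP;
         MODEL AGGRESS = MODGROUP;
         MEANS MODGROUP / TUKEY  CLDIFF;
         TITLE1 'JANE DOE';
      RUN;

The CLASS statement identifies the classification (predictor) variable, the MODEL statement names the criterion variable and the predictor, and the options on the MEANS statement are one common way of requesting the Tukey multiple comparison test and confidence intervals for differences between means, topics discussed in the next section and illustrated in the chapter's examples.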
Treatment Effects, Multiple Comparison Procedures, and a New Index of Effect Size Overview This section introduces the three types of results that you will review when you conduct a one-way ANOVA. First, it covers the concept of treatment effects: overall differences among group means that are due to the experimental manipulations. Next, it discusses multiple comparison procedures––tests used to determine which pairs of treatment conditions are significantly different from each other. Finally, it introduces the R2 statistic–– a measure of the variance in the criterion variable that is accounted for by the predictor variable. It discusses the use of R2 as an index effect size in experiments. Treatment Effects Null and alternative hypotheses. The concept of treatment effects is best understood with reference to the concept of the null hypothesis. For an experiment with three treatment conditions (as with the current study), the statistical null hypothesis may generally be stated according to this format: Statistical null hypothesis (Ho): µ1 = µ2 = µ3; In the study population, there is no difference between subjects in the three treatment conditions with respect to their mean scores on the criterion variable. In the preceding null hypothesis, the symbol µ represents the population mean on the criterion variable for a particular treatment condition. For example µ1 represents the population mean for Treatment Condition 1, µ2 represents the population mean for Treatment Condition 2, and so on. The preceding section describes a study that was designed to determine whether exposure to aggressive models will cause subjects who view the models to behave aggressively themselves. The statistical null hypothesis for this study may be stated in this way: Statistical null hypothesis (H0): µ1 = µ2 = µ3; In the study population, there is no difference between subjects in the low-model-aggression condition, subjects in the moderate-model-aggression condition, and subjects in the high-model-aggression condition with respect to their mean scores on the criterion variable (the number of aggressive acts displayed by the subjects). For a study with three treatment conditions, the statistical alternative hypothesis may generally be stated in this way: Statistical alternative hypothesis (H1): Not all µs are equal; In the study population, there is a difference between at least two of the three treatment conditions with respect to their mean scores on the criterion variable.
The statistical alternative hypothesis appropriate for the model aggression study that was previously described may be stated in this way: Statistical alternative hypothesis (H1): Not all µs are equal; In the study population, there is a difference between at least two of the following three groups with respect to their mean scores on the criterion variable: subjects in the low-model-aggression condition, subjects in the moderate-model-aggression condition, and subjects in the highmodel-aggression condition. The following sections of this chapter will show you how to use SAS to perform a one-way analysis of variance. You will learn that, in this analysis, SAS computes an F statistic that tests the statistical null hypothesis. If this F statistic is significant, you can reject the null hypothesis that all population means are equal. You may then tentatively conclude that at least two of the population means differ from one another. In this situation, you have obtained a significant treatment effect. Treatment effects. In an experiment, a significant treatment effect refers to differences among group means that are due to the influence of the independent variable. When you conduct a true experiment and have a significant treatment effect, this means that your independent variable had some type of effect on the dependent variable. It means that at least two of the treatment conditions are significantly different from each other. In most cases, this is what you want to demonstrate. In a single-factor experiment (an experiment in which only one independent variable is being manipulated), there can be only one treatment effect. However, in a factorial experiment (an experiment in which more than one independent variable is being manipulated), there may be more than one treatment effect. The present chapter will deal exclusively with single-factor experiments; Chapter 16, “Factorial ANOVA with Two Between-Subjects Factors,” will introduce you to the concept of factorial experiments. Multiple Comparison Procedures As was stated previously, when an F statistic is large enough to enable you to reject the null hypothesis, you may tentatively conclude that, in the population, at least two of the three treatment conditions differ from one another. But which two? It is possible, for example, that Group 1 is significantly different from Group 2, but Group 2 is not significantly different from Group 3. Clearly, researchers need a tool that will enable them to determine which groups are significantly different from one another. A special type of test called a multiple comparison procedure is normally used for this purpose. A multiple comparison procedure is a statistical test that enables researchers to determine the significance of the difference between pairs of means from studies that include more than two treatment conditions. A wide variety of multiple comparison procedures are available with SAS, but this chapter will focus on just one that is very widely used: Tukey’s studentized range (HSD) test.
A later section shows how to request the Tukey test in your SAS program, and how to interpret the output that it generates. You will supplement the Tukey test with an option that requests confidence intervals for the difference between the means, somewhat similar to the confidence intervals that you obtained with the independent-samples t tests that you performed in Chapter 13. The results of these tests will give you a better understanding of the nature of the differences between your treatment conditions.

R2, an Index of Variance Accounted For

The need for an index of effect size. In Chapter 12, "Single-Sample t Test," you learned that it is possible to conduct an experiment and obtain results that are statistically significant, even though the magnitude of the treatment effect is trivial. This outcome is most likely to occur when you conduct a study with a very large number of subjects. When your sample is very large, your statistical test has a relatively large amount of power. This means that you are fairly likely to reject the null hypothesis and conclude that you have obtained significant differences, even when the magnitude of the difference is relatively small. To address this problem, researchers are now encouraged to supplement their significance tests with indices of effect size. An index of effect size is a measure of the magnitude of a treatment effect. A variety of different effect size indices are available for use in research. In the three chapters in this text that deal with t tests, you learned about the d statistic. The d statistic indicates the degree to which one sample mean differs from the second sample mean (or population mean), stated in terms of the standard deviation of the population. The d statistic is often used as an index of effect size when researchers compute t statistics.

R2 as a measure of variance accounted for. In this chapter, you will learn about a different index of effect size––R2. R2 is an index that is often reported when researchers perform analysis of variance. The R2 statistic indicates the proportion of variance in the criterion variable that is accounted for by the study's predictor variable. It is computed by dividing the sum of squares for the predictor variable (the "between-groups" sum of squares) by the sum of squares for the corrected total. Values of R2 may range from .00 to 1.00, with larger values indicating a larger treatment effect (the word "effect" is appropriate only for experimental research––not for nonexperimental research). The larger the value of R2, the larger the effect that the independent variable had on the dependent variable. For example, when a researcher conducts an experiment and obtains an R2 value of .40, she may conclude that her independent variable accounted for 40% of the variance in the dependent variable. Researchers typically hope to obtain relatively large values of R2.

Interpreting the size of R2. Chapters 12–14 in this text provided information about t tests and guidelines for interpreting the d statistic. Specifically, they provided tables that showed how the size of a d statistic indicates a "small" effect versus a "moderate" effect versus a "large" effect. Unfortunately, however, there are no similar widely accepted guidelines for interpreting R2. For example, although most researchers would agree that R2 values less than .05 are relatively trivial, there is no widely accepted criterion for how large R2 must be to be considered "large." This is because the significance of the size of R2 depends on the nature of the phenomenon being studied, and also on the size of R2 values that were obtained when other researchers have studied the same phenomenon. For example, researchers looking for ways to improve the grades of children in elementary schools might find that it is difficult to construct interventions that have much of an impact. If this is the case, an experiment that produces an R2 value of .15 may be considered a big success, and the R2 value of .15 may be considered meaningful. In contrast, researchers conducting research on reinforcement theory using laboratory rats and a "bar-pressing" procedure may find that it is easy to construct manipulations that have a major effect on the rats' bar-pressing behavior. In these studies, they may routinely obtain R2 values over .80. If this is the case, then a new experiment that produces an R2 value of .15 may be considered a failure, and the R2 value of .15 may be considered relatively trivial. The above example illustrates the problem with R2 as an index of effect size: in one situation, an R2 value of .15 was interpreted as a meaningful proportion of variance, and in a different situation, the same value of .15 was interpreted as a relatively trivial proportion of variance. Therefore, before you can interpret an R2 value from a study that you have conducted, you must first be familiar with the R2 values that have been obtained in similar research that has already been conducted by others. It is only within this context that you can determine whether your R2 value can be considered large or meaningful.

Summary. In summary, R2 is a measure of variance accounted for that may be used as an index of effect size when you analyze data from experiments that use a between-subjects design. Later sections of this chapter will show where the R2 statistic is printed in the output of PROC GLM, and how you should incorporate it into your analysis reports.
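Stated as a formula, the computation described above (in the book's own notation) is:

             SS between groups
   R2  =  –––––––––––––––––––––––
            SS corrected total

As a worked illustration with hypothetical sums of squares (these numbers are not results from any analysis in this chapter): if the between-groups sum of squares were 120 and the corrected total sum of squares were 300, then R2 = 120 / 300 = .40, and the predictor variable would account for 40% of the variance in the criterion variable.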
Some Possible Results from a One-Way ANOVA

Overview

When you conduct a study that involves three or more treatment conditions, a number of different types of outcomes are possible. Some of these possibilities are illustrated below. All these examples are based on the aggression experiment described above.

Significant Treatment Effect, All Multiple Comparison Tests Are Significant

Figure 15.2 illustrates an outcome in which both of the following are true:
• there is a significant treatment effect for the predictor variable (level of aggression displayed by the model)
• all multiple comparison tests are significant.
Figure 15.2. Mean number of aggressive acts as a function of the level of aggression displayed by the model (significant treatment effect; all multiple comparison tests are significant).
Understanding the figure. The bar labeled "Low" in Figure 15.2 represents the children who saw the model display a low level of aggression in the videotape. The bar labeled "Moderate" represents the children who saw the model display a moderate level of aggression, and the bar labeled "High" represents the children who saw the model display a high level of aggression.

The vertical axis labeled "Subject Aggressive Acts" in Figure 15.2 indicates the mean number of aggressive acts that the various groups of children displayed after viewing the videotape. You can see that the "Low" bar reflects a score of approximately "5" on this axis, meaning that the children in the low-model-aggression condition displayed an average of about five aggressive acts in the play room after viewing the videotape. In contrast, the bar for the children in the "Moderate" group shows a substantially higher score of about 14 aggressive acts, and the bar for the children in the "High" group shows an even higher score of about 23 aggressive acts.

Expected statistical results. If you analyzed the data for this figure using a one-way ANOVA, you would probably expect the overall treatment effect to be significant because at least two of the groups in Figure 15.2 display means that appear to be substantially different
from one another. You would also expect to see significant multiple comparison tests, because
• The mean for the moderate-model-aggression group appears to be substantially higher than the mean for the low-model-aggression group; this suggests that the multiple comparison test comparing these two groups would probably be significant.
• The mean for the high-model-aggression group appears to be substantially higher than the mean for the moderate-model-aggression group; this suggests that the multiple comparison test comparing these two groups would probably be significant.
• The mean for the high-model-aggression group appears to be substantially higher than the mean for the low-model-aggression group; this suggests that the multiple comparison test comparing these two groups would probably be significant.
Conclusions regarding the research hypotheses. Figure 15.2 shows that the greater the amount of aggression modeled in the videotape, the greater the number of aggressive acts subsequently displayed by the children in the play room. It would be reasonable to conclude that the results provide support for the research hypotheses that were stated in the previous section "A Study Investigating Aggression."

Note: Of course, you don't arrive at conclusions such as this by merely preparing a figure and "eyeballing" the data. Instead, you perform the appropriate statistical analyses to confirm your conclusions; these analyses will be illustrated in later sections.

Significant Treatment Effect, Two of Three Multiple Comparison Tests Are Significant

Figure 15.3 illustrates an outcome in which both of the following are true:
• there is a significant treatment effect for the predictor variable
• two of the three possible multiple comparison tests are significant.
Figure 15.3. Mean number of aggressive acts as a function of the level of aggression displayed by the model (significant treatment effect; two of three multiple comparison tests are significant).
Expected statistical results. If you analyzed the data for this figure using a one-way ANOVA, you would probably expect the overall treatment effect to be significant because at least two of the groups in Figure 15.3 have means that appear to be substantially different from one another. You would also expect to see two significant multiple comparison tests, because
• The mean for the high-model-aggression group appears to be substantially higher than the mean for the moderate-model-aggression group.
• The mean for the high-model-aggression group appears to be substantially higher than the mean for the low-model-aggression group.
In contrast, there is very little difference between the mean for the moderate-model-aggression group and the mean for the low-model-aggression group. The multiple comparison test comparing these two groups would probably not demonstrate significance.

Conclusions regarding the research hypotheses. It is reasonable to conclude that the results shown in Figure 15.3 provide partial support for your research hypotheses. The results are somewhat supportive because the high-model-aggression group was more aggressive than the other two groups. However, they were not fully supportive, as there was not a significant difference between the "Low" and "Moderate" groups.
Nonsignificant Treatment Effect Figure 15.4 illustrates an outcome in which the treatment effect for the predictor variable is nonsignificant.
Figure 15.4. Mean number of aggressive acts as a function of the level of aggression displayed by the model (treatment effect is nonsignificant).
Expected statistical results. If you analyzed the data for this figure using a one-way ANOVA, you would probably expect the overall treatment effect to be nonsignificant. This is because none of the groups in the figure appear to be substantially different from one another with respect to their mean scores on the criterion variable. Each group displays a mean of approximately 15 aggressive acts, regardless of condition. When an overall treatment effect is nonsignificant, it is normally not appropriate to further interpret the results of the multiple comparison procedures. This is important to remember because a later section of this chapter will illustrate a SAS program that will request that multiple comparison tests be computed and printed regardless of the significance of the overall treatment effect. As the researcher, you must remember to always consult this overall test first, and proceed to the multiple comparison results only if the overall treatment effect is significant. Conclusions regarding the research hypotheses. It is reasonable to conclude that the results shown in Figure 15.4 fail to provide support for your research hypotheses.
Example 15.1: One-Way ANOVA Revealing a Significant Treatment Effect Overview The steps that you follow in performing an ANOVA will vary depending on whether the treatment effect is significant. This section illustrates an analysis that results in a significant treatment effect. It shows you how to prepare the SAS program, interpret the SAS output, and summarize the results. These procedures are illustrated by analyzing fictitious data from the aggression study that was described previously. In these analyses, the predictor variable is the level of aggression displayed by the model, and the criterion variable is the number of aggressive acts displayed by the children after viewing the videotape. Choosing SAS Variable Names and Values to Use in the SAS Program Before you write a SAS program to perform an ANOVA, it is helpful to first prepare a figure similar to Figure 15.5. The purpose of this figure is to help you choose a meaningful SAS variable name for the predictor variable, and meaningful values to represent the different levels under the predictor variables. Carefully choosing meaningful variable names and values at this point will make it easier to interpret your SAS output later.
Figure 15.5. Predictor variable name and values to be used in the SAS program for the aggression study.
SAS variable name for the predictor variable. You can see that Figure 15.5 is very similar to Figure 15.1, except that variable names and values have now been added. Figure 15.5 again shows that the predictor variable in your study is the “Level of Aggression Displayed by Model.” Below this heading is “MOD_AGGR,” which will be the SAS variable name for the predictor variable in your SAS program (“MOD_AGGR” stands for “model aggression”). Of course, you can choose any SAS variable name, but it should be meaningful and must comply with the rules for SAS variable names.
Values to represent conditions under the predictor variable. Below the heading for the predictor variable are the names of the three conditions under this predictor variable: "Low," "Moderate," and "High." Below these headings for the three conditions are the values that you will use to represent these conditions in your SAS program: "L" represents children in the low-model-aggression condition, "M" represents children in the moderate-model-aggression condition, and "H" represents children in the high-model-aggression condition. Choosing meaningful letters such as L, M, and H will make it easier to interpret your SAS output later.

Data Set to Be Analyzed

Table 15.1 presents the data set that you will analyze.

Table 15.1
Variables Analyzed in the Aggression Study (Data Set Will Produce a
Significant Treatment Effect)
____________________________________________________
             Model           Subject
Subject      aggression      aggression
____________________________________________________
01           L               02
02           L               14
03           L               10
04           L               08
05           L               08
06           L               15
07           L               03
08           L               12
09           M               13
10           M               25
11           M               16
12           M               20
13           M               21
14           M               21
15           M               17
16           M               26
17           H               20
18           H               14
19           H               23
20           H               22
21           H               24
22           H               26
23           H               19
24           H               29
____________________________________________________
Understanding the columns in the table. The columns in Table 15.1 provide the variables that you will analyze in your study. The first column in Table 15.1 is headed “Subject.” This column simply assigns a subject number to each child. The second column is headed “Model aggression.” In this column, the value “L” identifies children who saw the model in the videotape display a low level of aggression, “M” identifies children who saw the model display a moderate level of aggression, and “H” identifies children who saw the model display a high level of aggression.
Finally, the column headed "Subject aggression" indicates the number of aggressive acts that each child displayed in the play room after viewing the videotape. This variable will serve as the criterion variable in your study.

Understanding the rows of the table. The rows in Table 15.1 represent individual children who participated as subjects in the study. The first row represents Subject 1. The "L" under "Model Aggression" tells you that this child was in the low condition under the predictor variable. The "02" under "Subject Aggression" tells you that this child displayed two aggressive acts after viewing the videotape. The data lines for the remaining children may be interpreted in the same way.

Writing the SAS Program

The DATA step. In preparing the SAS program, you will type the data in much the same way that they appear in Table 15.1. That is, you will have one column to contain subject numbers, one column to indicate the subjects' condition under the model-aggression predictor variable, and one column to indicate the subjects' scores on the criterion variable. Here is the DATA step for your SAS program:

OPTIONS LS=80 PS=60;
DATA D1;
   INPUT SUB_NUM  MOD_AGGR $  SUB_AGGR;
DATALINES;
01 L 02
02 L 14
03 L 10
04 L 08
05 L 08
06 L 15
07 L 03
08 L 12
09 M 13
10 M 25
11 M 16
12 M 20
13 M 21
14 M 21
15 M 17
16 M 26
17 H 20
18 H 14
19 H 23
20 H 22
21 H 24
22 H 26
23 H 19
24 H 29
;
You can see that the INPUT statement of the preceding program uses the following SAS variable names:
• The SAS variable name SUB_NUM is used to represent subject numbers.
• The SAS variable name MOD_AGGR is used to code subject condition under the model-aggression predictor variable. Values are either L, M, or H for this variable. Note that the variable name is followed by the "$" symbol to indicate that it is a character variable.
• The SAS variable name SUB_AGGR is used to contain subjects' scores on the subject aggression criterion variable.
The PROC step. Following is the syntax for the PROC step needed to perform a one-way ANOVA with one between-subjects factor, and follow it with Tukey's HSD test:

   PROC GLM   DATA = data-set-name ;
      CLASS   predictor-variable ;
      MODEL   criterion-variable = predictor-variable ;
      MEANS   predictor-variable ;
      MEANS   predictor-variable / TUKEY  CLDIFF  ALPHA=alpha-level ;
      TITLE1  'your-name' ;
   RUN;
   QUIT;

Substituting the appropriate SAS variable names into this syntax results in the following (line numbers have been added on the left; you will not actually type these line numbers):

   1   PROC GLM   DATA=D1;
   2      CLASS   MOD_AGGR;
   3      MODEL   SUB_AGGR = MOD_AGGR;
   4      MEANS   MOD_AGGR;
   5      MEANS   MOD_AGGR / TUKEY  CLDIFF  ALPHA=0.05;
   6      TITLE1  'JOHN DOE';
   7   RUN;
   8   QUIT;
Some notes about the preceding code:
• In Line 1, the PROC GLM statement requests the GLM procedure, and requests that the analysis be performed on data set D1.
• In Line 2, the CLASS statement lists the classification variable as MOD_AGGR (model aggression, the predictor variable in the experiment).
• Line 3 contains the MODEL statement for the analysis. The name of the criterion variable SUB_AGGR appears to the left of the equal sign in this statement, and the name of the predictor variable MOD_AGGR appears to the right of the equal sign.
• Line 4 contains the first MEANS statement:

     MEANS   MOD_AGGR;

  This statement requests that SAS print the group means and standard deviations for the treatment conditions under the predictor variable. You will need these means in interpreting the results, and you will report the means and standard deviations in your analysis report.
• Line 5 contains the second MEANS statement:

     MEANS   MOD_AGGR / TUKEY  CLDIFF  ALPHA=0.05;

  This second MEANS statement requests the multiple comparison procedure that will determine which pairs of treatment conditions are significantly different from each other. You should list the name of your predictor variable to the right of the word MEANS. In the preceding statement, MOD_AGGR is the name of the predictor variable in the current analysis, and so it was listed to the right of the word MEANS. The name of your predictor variable should be followed by a slash ("/") and the keywords TUKEY, CLDIFF, and ALPHA=0.05. The keyword TUKEY requests that the Tukey HSD test be performed as a multiple comparison procedure. The keyword CLDIFF requests that Tukey tests be presented as confidence intervals for the differences between the means. The keyword ALPHA=0.05 requests that alpha (the level of significance) be set at .05 for the Tukey tests, and results in the printing of 95% confidence intervals for differences between means. If you had instead used the keyword ALPHA=0.01, it would have resulted in alpha being set at .01 for the Tukey tests, and in the printing of 99% confidence intervals. If you had instead used the keyword ALPHA=0.1, it would have resulted in alpha being set at .10 for the Tukey tests, and in the printing of 90% confidence intervals. If you omit the ALPHA option, the default is .05 (an illustration of this variation appears after this list).
• Finally, lines 6, 7, and 8 contain the TITLE1, RUN, and QUIT statements for your program.
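For instance, if you wanted 99% confidence intervals rather than the default 95% intervals, the second MEANS statement could be written as shown below. This variation is offered only as an illustration and is not part of the example program analyzed in this chapter:

   MEANS   MOD_AGGR / TUKEY  CLDIFF  ALPHA=0.01;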
The complete SAS program. Here is the complete SAS program that will input your data set, perform a one-way ANOVA with one between-subjects factor, and follow with Tukey's HSD test:

OPTIONS LS=80 PS=60;
DATA D1;
   INPUT SUB_NUM  MOD_AGGR $  SUB_AGGR;
DATALINES;
01 L 02
02 L 14
03 L 10
04 L 08
05 L 08
06 L 15
07 L 03
08 L 12
09 M 13
10 M 25
11 M 16
12 M 20
13 M 21
14 M 21
15 M 17
16 M 26
17 H 20
18 H 14
19 H 23
20 H 22
21 H 24
22 H 26
23 H 19
24 H 29
;
PROC GLM   DATA=D1;
   CLASS   MOD_AGGR;
   MODEL   SUB_AGGR = MOD_AGGR;
   MEANS   MOD_AGGR;
   MEANS   MOD_AGGR / TUKEY  CLDIFF  ALPHA=0.05;
   TITLE1  'JOHN DOE';
RUN;
QUIT;
Keywords for Other Multiple Comparison Procedures

The preceding section showed you how to write a program that would request the Tukey HSD test as a multiple comparison procedure. The sections that follow will show you how to interpret the results generated by that test. However, it is possible that some readers will want to use a multiple comparison procedure other than the Tukey test. In fact, a wide variety of other tests are available with SAS. They can be requested with the MEANS statement, using the following syntax:

   MEANS   predictor-variable / mult-comp-proc  ALPHA=alpha-level ;

You should insert the keyword for the procedure that you want in the location where "mult-comp-proc" appears. With some of these procedures, you can also include the CLDIFF option to request confidence intervals for differences between means. Here is a list of keywords for some frequently used multiple comparison procedures that are available with SAS (an example appears after the list):

   BON        Bonferroni t tests of differences between means
   DUNCAN     Duncan's multiple range test
   DUNNETT    Dunnett's two-tailed t test, determining if any groups are
              significantly different from a single control
   GABRIEL    Gabriel's multiple-comparison procedure
   REGWQ      Ryan-Einot-Gabriel-Welsch multiple range test
   SCHEFFE    Scheffe's multiple-comparison procedure
   SIDAK      Pairwise t tests of differences between means, with levels
              adjusted according to Sidak's inequality
   SNK        Student-Newman-Keuls multiple range test
   T          Pairwise t tests (equivalent to Fisher's least-significant-
              difference test when cell sizes are equal)
   TUKEY      Tukey's studentized range (HSD) test
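As an illustration only (this statement is not part of the chapter's example program), the Bonferroni procedure with confidence intervals for the differences between means could be requested by substituting the BON keyword into the second MEANS statement:

   MEANS   MOD_AGGR / BON  CLDIFF  ALPHA=0.05;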
Output Produced by the SAS Program

Using the OPTIONS statement shown, the preceding program would produce four pages of output. The information that appears on each page is briefly summarized here; later sections will provide detailed guidelines for interpreting these results.
• Page 1 provides class level information and the number of observations in the data set.
• Page 2 provides the ANOVA summary table from the GLM procedure.
• Page 3 provides the results of the first MEANS statement. This MEANS statement simply requests means and standard deviations on the criterion variable for the three treatment conditions.
• Page 4 provides the results of the second MEANS statement. This includes the results of the Tukey multiple comparison tests and the confidence intervals.
Steps in Interpreting the Output

1. Make sure that everything looks correct. With most analyses, you should begin this process by analyzing the criterion variable with PROC MEANS or PROC UNIVARIATE to verify that (a) no "Minimum" observed value in your data set is lower than the theoretically lowest possible score, and (b) no "Maximum" observed value in your data set is higher than the theoretically highest possible score. Before you analyze your data with PROC GLM, you should review the section titled "Summary of Assumptions Underlying One-Way ANOVA with One Between-Subjects Factor," earlier in this chapter. See Chapter 7, "Measures of Central Tendency and Variability," for a discussion of PROC MEANS and PROC UNIVARIATE.

The output created by the GLM procedure in the preceding program also contains information that might help to identify possible errors in writing the program or in typing the data. This section shows how to review that information.
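For example, a quick screening run of this kind might look like the following sketch. It is offered only as an illustration (this step is separate from the ANOVA program itself), although the variable name matches the chapter's data set:

   PROC MEANS DATA=D1 N MEAN MIN MAX;
      VAR SUB_AGGR;      /* criterion variable to be screened */
   RUN;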
First review the class level information that appears on Page 1 of the PROC GLM output. This page is reproduced here as Output 15.1.

                              JOHN DOE                                 1
                          The GLM Procedure

                       Class Level Information

                  Class         Levels    Values
                  MOD_AGGR           3    H L M

                  Number of observations    24
Output 15.1. Class level information from one-way ANOVA performed on aggression data, significant treatment effect.
First, verify that the name of your predictor variable appears under the heading “Class.” Here, you can see that the classification variable is MOD_AGGR. Under the heading “Levels,” the output should indicate how many groups of subjects were included in your study. Output 15.1 correctly indicates that your predictor variable consists of three groups. Under the heading “Values,” the output should indicate the specific numbers or letters that you used to code this predictor variable. Output 15.1 correctly indicates that you used the values “H,” “L,” and “M.” It is important to use uppercase and lowercase letters consistently when you are coding treatment conditions under the predictor variable. For example, the preceding paragraph indicated that you used uppercase letters (H, L, and M) in coding conditions. If you had accidentally keyed a lowercase “h” instead of an uppercase “H” for one or more of your subjects, SAS would have interpreted that as a code for a new and different treatment condition. This, of course, would have led to errors in the analysis. Finally, the last line in Output 15.1 indicates the number of observations in the data set. The present example used three groups with eight subjects each, for a total N of 24. Output 15.1 indicates that your data set included 24 observations, so everything appears to be correct at this point. Page 2 of the output provides the analysis of variance table created by PROC GLM. It is reproduced here as Output 15.2.
                              JOHN DOE                                 2
                          The GLM Procedure

Dependent Variable: SUB_AGGR
                                   Sum of
Source                  DF        Squares    Mean Square   F Value   Pr > F
Model                    2     788.250000     394.125000     18.74   <.0001
Error                   21     441.750000      21.035714
Corrected Total         23    1230.000000

          R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
          0.640854      26.97924      4.586471         17.00000
Under the heading "Pr > F", you can see that the p value (probability value) associated with the preceding F statistic is 0.0001. This p value indicates that the F statistic is significant at the .0001 level. This last heading ("Pr > F") gives you the probability of obtaining an F statistic that is this large or larger, if the null hypothesis were true. In the present case, this p value is very small: it is equal to 0.0001. When a p value is less than .05, you may reject the null hypothesis, so in this case the null hypothesis of no population differences is rejected. This means that you have a significant treatment effect. In other words, you may tentatively conclude that, in the population, there is a difference between at least two of the treatment conditions.

Because you have obtained a significant F statistic, these results seem to provide support for your research hypothesis that model aggression has an effect on subject aggression. Later, you will review the group means and results of the multiple comparison procedures to see if the results are in the predicted direction. First, however, you will prepare an ANOVA summary table to summarize some of the information from Output 15.4.

3. Prepare your own version of the ANOVA summary table. Table 15.2 provides the completed ANOVA summary table for the current analysis.

Table 15.2
ANOVA Summary Table for Study Investigating the Relationship between
Level of Aggression Displayed by Model and Subject Aggression
(Significant Treatment Effect)
___________________________________________________________________
Source                 df          SS          MS          F        R2
___________________________________________________________________
Model aggression        2      788.25      394.13      18.74 *     .64
Within groups          21      441.75       21.04
Total                  23     1230.00
___________________________________________________________________
Note: N = 24.
* p < .0001
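As an informal check on the R2 entry in Table 15.2 (this check is an illustration added here, not a step required by the procedure), you can divide the between-groups sum of squares by the corrected total sum of squares:

   R2 = 788.25 / 1230.00 = .64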
To complete the preceding table, you simply transfer information from Output 15.4 to the appropriate line of the ANOVA summary table. For your convenience, Output 15.4 is reproduced here as Output 15.5.
                              JOHN DOE                                 2
                          The GLM Procedure

Dependent Variable: SUB_AGGR
                                   Sum of
Source                  DF        Squares    Mean Square   F Value   Pr > F
Model                    2     788.250000     394.125000     18.74   <.0001
Error                   21     441.750000      21.035714
Corrected Total         23    1230.000000

          R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
          0.640854      26.97924      4.586471         17.00000
The MSE (mean square error). Item N from the analysis report provides a statistic abbreviated as the "MSE." The relevant section of Item N is reproduced here:

N) Formal description of the results for a paper: Results were analyzed using a one-way ANOVA with one between-subjects factor. This analysis revealed a significant treatment effect for level of aggression displayed by the model, F(2, 21) = 18.74, MSE = 21.04, p = .0001.

The last sentence of the preceding excerpt indicates that "MSE = 21.04." Here, "MSE" stands for "Mean Square Error." It is an estimate of the error variance in your analysis. In the output from PROC GLM, you will find the MSE where the row headed "Error" intersects with the column headed "Mean Square." In Output 15.10, you can see that the mean square error is equal to 21.035714, which rounds to 21.04.
Example 15.2: One-Way ANOVA Revealing a Nonsignificant Treatment Effect Overview This section presents the results of a one-way ANOVA in which the treatment effect is nonsignificant. These results are presented so that you will be prepared to write analysis reports for projects in which nonsignificant outcomes are observed.
The Complete SAS Program

The study presented here is the same aggression study that was described in the preceding section. The data will be analyzed with the same SAS program that was presented earlier. Here, the data have been changed so that they will produce nonsignificant results. The complete SAS program, including the new data set, is presented here:

OPTIONS LS=80 PS=60;
DATA D1;
   INPUT SUB_NUM  MOD_AGGR $  SUB_AGGR;
DATALINES;
01 L 07
02 L 17
03 L 14
04 L 11
05 L 11
06 L 20
07 L 08
08 L 15
09 M 08
10 M 20
11 M 11
12 M 15
13 M 16
14 M 16
15 M 12
16 M 21
17 H 14
18 H 10
19 H 17
20 H 18
21 H 20
22 H 21
23 H 14
24 H 23
;
PROC GLM   DATA=D1;
   CLASS   MOD_AGGR;
   MODEL   SUB_AGGR = MOD_AGGR;
   MEANS   MOD_AGGR;
   MEANS   MOD_AGGR / TUKEY  CLDIFF  ALPHA=0.05;
   TITLE1  'JOHN DOE';
RUN;
QUIT;
Steps in Interpreting the Output

As with the earlier data set, the SAS program that performs this analysis produces four pages of output. This section will present just those sections of output that are relevant to preparing the ANOVA summary table, the confidence intervals table, the figure, and the analysis report. This section will review the output in a fairly abbreviated manner; for a more detailed discussion of the output of PROC GLM, see earlier sections of this chapter.

1. Determine whether the treatment effect is significant. As before, you determine whether the treatment effect is significant by reviewing the ANOVA summary table produced by PROC GLM. This table appears on page 2 of the output, and is reproduced here as Output 15.11.

                              JOHN DOE                                 2
                          The GLM Procedure

Dependent Variable: SUB_AGGR
                                   Sum of
Source                  DF        Squares    Mean Square   F Value   Pr > F
Model                    2     72.3333333     36.1666667      1.88   0.1778
Error                   21    404.6250000     19.2678571
Corrected Total         23    476.9583333

          R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
          0.151655      29.34496      4.389517         14.95833

Source                  DF       Type I SS    Mean Square   F Value   Pr > F
MOD_AGGR                 2     72.33333333    36.16666667      1.88   0.1778

Source                  DF     Type III SS    Mean Square   F Value   Pr > F
MOD_AGGR                 2     72.33333333    36.16666667      1.88   0.1778
Output 15.11. ANOVA summary table for one-way ANOVA performed on aggression data, nonsignificant treatment effect.
As with the earlier data set, you review the results of the analyses that appear in the section headed "Type III SS", as opposed to the section headed "Type I SS." To determine whether the treatment effect is significant, look to the right of the heading "MOD_AGGR". Here, you can see that the F statistic is only 1.88, with a p value of .1778. The obtained p value is greater than the standard criterion of .05, which means that this F statistic is nonsignificant. This means that you do not have a significant treatment effect for your predictor variable.

2. Prepare your own version of the ANOVA summary table. The completed ANOVA summary table for this analysis is presented here as Table 15.4.
Table 15.4
ANOVA Summary Table for Study Investigating the Relationship between
Level of Aggression Displayed by Model and Subject Aggression
(Nonsignificant Treatment Effect)
___________________________________________________________________
Source                 df          SS          MS          F        R2
___________________________________________________________________
Model aggression        2       72.33       36.17       1.88 a     .15
Within groups          21      404.63       19.27
Total                  23      476.96
___________________________________________________________________
Note: N = 24.
a F statistic is nonsignificant with alpha set at .05.
Notice how information from Output 15.11 was used to fill in the relevant sections of Table 15.4:
• Information from the line headed "MOD_AGGR" in Output 15.11 was transferred to the line headed "Model aggression" in Table 15.4.
• Information from the line headed "Error" in Output 15.11 was transferred to the line headed "Within groups" in Table 15.4.
• Information from the line headed "Corrected Total" in Output 15.11 was transferred to the line headed "Total" in Table 15.4.
• The R2 value that appeared below the heading "R-Square" in Output 15.11 was transferred below the heading "R2" in Table 15.4.
3. Review the results of the multiple comparison procedure. Notice that, unlike the previous section, this section does not advise you to review the results of the multiple comparison procedure (the Tukey test). This is because the treatment effect in the current analysis was nonsignificant, and you normally would not interpret the results of multiple comparison procedures for treatment effects that are not significant. 4. Review the confidence intervals for the differences between means. Although the results from the F statistic are nonsignificant, it may still be useful to review the size of the confidence intervals created in the analysis. Output 15.12 presents these intervals.
                              JOHN DOE                                 4
                          The GLM Procedure

          Tukey's Studentized Range (HSD) Test for SUB_AGGR

   NOTE: This test controls the Type I experimentwise error rate.

            Alpha                                      0.05
            Error Degrees of Freedom                     21
            Error Mean Square                      19.26786
            Critical Value of Studentized Range     3.56463
            Minimum Significant Difference            5.532

    Comparisons significant at the 0.05 level are indicated by ***.

                       Difference
        MOD_AGGR          Between     Simultaneous 95%
       Comparison           Means    Confidence Limits

        H - M               2.250     -3.282     7.782
        H - L               4.250     -1.282     9.782
        M - H              -2.250     -7.782     3.282
        M - L               2.000     -3.532     7.532
        L - H              -4.250     -9.782     1.282
        L - M              -2.000     -7.532     3.532
The confidence intervals for the current analysis appear below the heading “Simultaneous 95% Confidence Limits” ( ). You can see that none of the comparisons in this section are flagged with three asterisks, which means that none of the differences were significant according to the Tukey test. In addition, you can see that all of the confidence intervals in this section contain the value of zero. This is also consistent with the fact that none of the differences were statistically significant. Table 15.5 presents the confidence intervals resulting from the Tukey tests that you might prepare for a published report. The note at the bottom of the table tells the reader that all comparison are nonsignificant. You can see that all of the differences between means and confidence limits came from Output 15.12.
Table 15.5
Results of Tukey Tests Comparing High-Model-Aggression Group versus
Moderate-Model-Aggression Group versus Low-Model-Aggression Group on the
Criterion Variable (Subject Aggression), Nonsignificant Differences Observed
_____________________________________________________
                                  Simultaneous 95%
                   Difference     confidence limits
                      between     -----------------
Comparison a          means b     Lower       Upper
_____________________________________________________
High - Moderate        2.250      -3.282      7.782
High - Low             4.250      -1.282      9.782
Moderate - Low         2.000      -3.532      7.532
_____________________________________________________
Note: N = 24.
a Differences are computed by subtracting the mean for the second group from
the mean for the first group.
b With alpha set at .05, all Tukey test comparisons were nonsignificant.
Level of MOD_AGGR
JOHN DOE The GLM Procedure -----------SUB_AGGR---------N Mean Std Dev
H L M
8 8 8
17.1250000 12.8750000 14.8750000
3
4.29077083 4.45413131 4.42194204
Output 15.13. Means and standard deviations produced by the MEANS statement in analysis of aggression data, nonsignificant differences.
Below the heading “Mean” ( ) you will find the mean scores for these three groups. You can see that the mean scores were 17.13, 12.88, and 14.88 for the high, low, and moderate groups, respectively. Figure 15.7 illustrates these group means.
Figure 15.7. Mean number of aggressive acts as a function of the level of aggression displayed by model (nonsignificant F statistic).
Analysis Report for the Aggression Study (Nonsignificant Results) The results from the preceding analysis could be summarized in the following report. Notice that some results (such as the results of the Tukey test) are not discussed because the treatment effect was nonsignificant. A) Statement of the research question: The purpose of this study was to determine whether there was a relationship between (a) the level of aggression displayed by a model and (b) the number of aggressive acts later demonstrated by children who observed the model. B) Statement of the research hypothesis: There will be a positive relationship between the level of aggression displayed by a model and the number of aggressive acts later demonstrated by children who observed the model. Specifically, it is predicted that (a) children who witness a high level of aggression will demonstrate a greater number of aggressive acts than children who witness a moderate or low level of aggression, and (b) children who witness a moderate level of aggression will demonstrate a greater number of aggressive acts than children who witness low level of aggression.
C) Nature of the variables: This analysis involved one predictor variable and one criterion variable:
• The predictor variable was the level of aggression displayed by the model. This was a limited-value variable, was assessed on an ordinal scale, and included three levels: low, moderate, and high.
• The criterion variable was the number of aggressive acts displayed by the subjects after observing the model. This was a multi-value variable, and was assessed on a ratio scale.

D) Statistical test: One-way ANOVA with one between-subjects factor.
E) Statistical null hypothesis (Ho): µ1 = µ2 = µ3; In the population, there is no difference between subjects in the low-model-aggression condition, subjects in the moderate-model-aggression condition, and subjects in the high-model-aggression condition with respect to their mean scores on the criterion variable (the number of aggressive acts displayed by the subjects).

F) Statistical alternative hypothesis (H1): Not all µs are equal; In the population, there is a difference between at least two of the following three groups with respect to their mean scores on the criterion variable: subjects in the low-model-aggression condition, subjects in the moderate-model-aggression condition, and subjects in the high-model-aggression condition.

G) Obtained statistic:
F(2, 21) = 1.88
H) Obtained probability (p) value:
p = .1778
I) Conclusion regarding the statistical null hypothesis: Fail to reject the null hypothesis.
J) Multiple comparison procedure: The multiple comparison procedure was not appropriate because the F statistic for the ANOVA was nonsignificant.

K) Confidence Intervals: Confidence intervals for differences between the means are presented in Table 15.5.

L) Effect size: R2 = .15, indicating that model aggression accounted for 15% of the variance in subject aggression.
M) Conclusion regarding the research hypothesis: These findings fail to provide support for the study’s research hypothesis. N) Formal description of the results for a paper: Results were analyzed using a one-way ANOVA with one between-subjects factor. This analysis revealed a nonsignificant treatment
effect for level of aggression displayed by the model, F(2, 21) = 1.88, MSE = 19.27, p = .1778. On the criterion variable (number of aggressive acts displayed by subjects), the mean score for the high-model-aggression condition was 17.13 (SD = 4.29), the mean for the moderate-model-aggression condition was 14.88 (SD = 4.42), and the mean for the low-model-aggression condition was 12.88 (SD = 4.45). The sample means are displayed in Figure 15.7. Confidence intervals for differences between the means (based on Tukey's HSD test) are presented in Table 15.5. In the analysis, R2 was computed as .15. This indicated that model aggression accounted for 15% of the variance in subject aggression.

O) Figure representing the results:
See Figure 15.7.
Conclusion This chapter has shown how to perform an analysis of variance on data from studies in which only one independent variable is manipulated. However, researchers in the social sciences and education often conduct research in which two independent variables are manipulated simultaneously in a single study. With such investigations, it is usually not appropriate to perform two separate one-way ANOVAs on the data––one ANOVA for the first independent variable, and a separate ANOVA for the second independent variable. Instead, it is usually more appropriate to analyze the data with a different statistical procedure: a factorial ANOVA. Performing a factorial analysis of variance not only enables you to determine whether you have significant treatment effects for your two independent variables; it also enables you to test for an entirely different type of effect: an interaction. Chapter 16 introduces you to the concept of an interaction, and shows how to use the GLM procedure to perform a factorial ANOVA with two between-subject factors.
Factorial ANOVA with Two Between-Subjects Factors

Introduction
    Overview
Situations Appropriate for Factorial ANOVA with Two Between-Subjects Factors
    Overview
    Nature of the Predictor and Criterion Variables
    The Type-of-Variable Figure
    Example of a Study Providing Data Appropriate for This Procedure
    True Independent Variables versus Subject Variables
    Summary of Assumptions Underlying Factorial ANOVA with Two Between-Subjects Factors
Using Factorial Designs in Research
A Different Study Investigating Aggression
    Overview
    Research Method
    The Factorial Design Matrix
Understanding Figures That Illustrate the Results of a Factorial ANOVA
    Overview
    Example of a Figure
    Interpreting the Means on the Solid Line
    Interpreting the Means on the Broken Line
    Summary
Some Possible Results from a Factorial ANOVA
    Overview
    A Significant Main Effect for Predictor A Only
    Another Example of a Significant Main Effect for Predictor A Only
    A Significant Main Effect for Predictor B Only
    A Significant Main Effect for Both Predictor Variables
    No Main Effects
    A Significant Interaction
    Interpreting Main Effects When an Interaction Is Significant
    Another Example of a Significant Interaction
Example of a Factorial ANOVA Revealing Two Significant Main Effects and a Nonsignificant Interaction
    Overview
    Choosing SAS Variable Names and Values to Use in the SAS Program
    Data Set to Be Analyzed
    Writing the DATA Step of the SAS Program
    Data Screening and Testing Assumptions Prior to Performing the ANOVA
    Writing the SAS Program to Perform the Two-Way ANOVA
    Log File Produced by the SAS Program
    Output Produced by the SAS Program
    Steps in Interpreting the Output
    Using a Figure to Illustrate the Results
    Steps in Preparing the Graph
    Interpreting Figure 16.11
    Preparing Analysis Reports for Factorial ANOVA: Overview
    Analysis Report Concerning the Main Effect for Predictor A (Significant Effect)
    Notes Regarding the Preceding Analysis Report
    Analysis Report Regarding the Main Effect for Predictor B (Significant Effect)
    Notes Regarding the Preceding Analysis Report
    Analysis Report Concerning the Interaction (Nonsignificant Effect)
    Notes Regarding the Preceding Analysis Report
Example of a Factorial ANOVA Revealing Nonsignificant Main Effects and a Nonsignificant Interaction
    Overview
    The Complete SAS Program
    Steps in Interpreting the Output
    Using a Figure to Illustrate the Results
    Interpreting Figure 16.12
    Analysis Report Concerning the Main Effect for Predictor A (Nonsignificant Effect)
    Analysis Report Concerning the Main Effect for Predictor B (Nonsignificant Effect)
    Analysis Report Concerning the Interaction (Nonsignificant Effect)
Example of a Factorial ANOVA Revealing a Significant Interaction
    Overview
    The Complete SAS Program
    Steps in Interpreting the Output
    Using a Graph to Illustrate the Results
    Interpreting Figure 16.13
    Testing for Simple Effects
    Analysis Report Concerning the Interaction (Significant Effect)
Using the LSMEANS Statement to Analyze Data from Unbalanced Designs
    Overview
    Reprise: What Is an Unbalanced Design?
    Writing the LSMEANS Statements
    Output Produced by LSMEANS
Learning More about Using SAS for Factorial ANOVA
Conclusion
Introduction

Overview

This chapter shows how to enter data and prepare SAS programs that will perform a two-way analysis of variance (ANOVA) using the GLM procedure. This chapter focuses on factorial designs with two between-subjects factors, meaning that each subject is exposed to only one condition under each independent variable. It discusses the differences between main effects versus interaction effects in factorial ANOVA. It provides guidelines for interpreting results that do not indicate a significant interaction, and separate guidelines for interpreting results that do indicate a significant interaction. It shows how to use multiple comparison procedures to identify the pairs of groups that are significantly different from each other, how to request confidence intervals for differences between the means, how to interpret an index of effect size, and how to prepare a figure that illustrates cell means. Finally, it shows how to prepare a report that summarizes the results of the analysis.
Situations Appropriate for Factorial ANOVA with Two Between-Subjects Factors Overview Factorial ANOVA is a test of group differences that enables you to determine whether there are significant differences between two or more groups with respect to their mean scores on a criterion variable. Furthermore, it enables you to investigate group differences with respect to two independent variables (or predictor variables) at the same time. In summary, factorial ANOVA with two between-subjects factors may be used when you wish to investigate the relationship between (a) two predictor variables (each of which classifies group membership) and (b) a single criterion variable. Nature of the Predictor and Criterion Variables Predictor variables. In factorial ANOVA, the two predictor (or independent) variables are classification variables, that is, variables that indicate which group a subject is in. They may be assessed on any scale of measurement (nominal, ordinal, interval, or ratio), but they serve mainly as classification variables in the analysis. Criterion variable. The criterion (or dependent) variable is typically a multi-value variable. It must be a numeric variable that is assessed on either an interval or ratio level of measurement. The criterion variable must also satisfy a number of additional assumptions, and these assumptions are summarized in a later section.
The Type-of-Variable Figure

The figure below illustrates the types of variables that are typically being analyzed when researchers perform a factorial ANOVA with two between-subjects factors.

        Criterion          Predictors

         Multi       =     Lmt    Lmt
The "Multi" symbol that appears in the above figure shows that the criterion variable in a factorial ANOVA is typically a multi-value variable (a variable that assumes more than six values in your sample). The "Lmt" symbols that appear to the right of the equal sign in the above figure show that the two predictor variables in this procedure are usually limited-value variables (that is, variables that assume only two to six values).

Example of a Study Providing Data Appropriate for This Procedure

The study. Suppose that you are a physiological psychologist conducting research on aggression in children. You are interested in two research questions: (a) does consuming sugar cause children to behave more aggressively?, and (b) are boys more aggressive than girls? To investigate these questions, you conduct an experiment in which you randomly assign 90 children to three treatment conditions:
• 30 children are assigned to a 0-gram condition. These children consume zero grams of sugar each day over a two-month period.
• 30 children are assigned to a 20-gram condition. These children consume 20 grams of sugar each day over a two-month period.
• 30 children are assigned to a 40-gram condition. These children consume 40 grams of sugar each day over a two-month period.
You observe the children over the two-month period, and for each child you record the number of aggressive acts that the child displays each day. At the end of the two-month period, you determine whether the children in the 40-gram condition displayed a mean number of aggressive acts that is significantly higher than the mean displayed by the other two groups. This analysis helps you to determine whether consuming sugar causes children to behave more aggressively.

In the same study, however, you are also interested in investigating sex differences in aggression. At the time that subjects were assigned to conditions, you ensured that, within
each of the "sugar consumption" treatment groups, half of the children were male and half were female. This means that, in your study, both of the following are true:
• 45 children were in the male group
• 45 children were in the female group.
At the end of the two-month period, you determine whether the male group displays a mean number of aggressive acts that is significantly different from the mean number displayed by the female group. Why these data would be appropriate for this procedure. The preceding study involved two predictor variables and a single criterion variable. The first predictor variable (Predictor A) was “amount of sugar consumed.” You know that this was a limited-value variable, because it assumed only three values: a 0-gram condition, a 20-gram condition, and a 40gram condition. This predictor variable was assessed on a ratio scale, since “grams of sugar” has equal intervals and a true zero point. However, remember that the predictor variables that are used in ANOVA may be assessed on any scale of measurement. In general, they are treated as classification variables in the analysis. The second predictor variable (Predictor B) was “subject sex.” You know that this was a dichotomous variable because it involved only two values: a male group versus a female group. This variable was assessed on a nominal scale, since it indicates group membership but does not convey any quantitative information. Finally, the criterion variable in this study was the number of aggressive acts that were displayed by the children. You know that this was a multi-value variable if you verified that the children’s scores took on a relatively large number of values (that is, some children might have displayed zero aggressive acts each day, other children might have displayed 50 aggressive acts each day, and still other children might have displayed a variety of aggressive acts between these two extremes). Remember that, for our purposes, we label a variable a multi-value variable if it assumes more than six different values in the sample. The criterion variable was assessed on a ratio scale. You know this because the “number of aggressive acts” has equal intervals and a true zero point. True Independent Variables versus Subject Variables Notice that, with the preceding study, one of the predictor variables was a true independent variable, while the other predictor variable was merely a subject variable. A true independent variable is a variable that is manipulated and controlled by the researcher so that it is independent of (uncorrelated with) any other independent variable in the study. In this study, “amount of sugar consumed” (Predictor A) was a true independent variable because it was manipulated and controlled by you, the researcher. You manipulated this variable by randomly assigning subjects to either the 0-gram condition, the 20-gram condition, or the 40-gram condition.
In contrast to a true independent variable, a subject variable is a characteristic of the subject that is not directly manipulated by the researcher, but is used as a predictor variable in the study. In the preceding study, "subject sex" (Predictor B) was a subject variable. You know that this is a subject variable because sex is a characteristic of the subject that is not directly manipulated by the researcher. You know that subject sex is not a true independent variable because it is not possible to manipulate it in a direct fashion (i.e., it is not possible to randomly assign half of your subjects to be male, and half to be female). With a subject variable, you simply note which condition a subject is already in; you do not assign the subject to that condition. Other examples of subject variables include age, political party, and race.

When you perform a factorial ANOVA, it is possible to use any combination of true independent variables and subject variables. That is, you can perform an analysis in which
• both predictor variables are true independent variables
• both predictor variables are subject variables
• one predictor is a true independent variable and the other predictor is a subject variable.
Summary of Assumptions Underlying Factorial ANOVA with Two Between-Subjects Factors

• Level of measurement. The criterion variable must be a numeric variable that is assessed on an interval or ratio level of measurement. The predictor variables may be assessed on any level of measurement, although they are essentially treated as nominal-level (classification) variables in the analysis.

• Independent observations. An observation should not be dependent on any other observation in any cell (the meaning of the term “cell” will be explained in a later section). In practical terms, this means that each subject is exposed to only one condition under each predictor variable, and that subject matching procedures are not used.

• Random sampling. Scores on the criterion variable should represent a random sample that is drawn from the populations of interest.

• Normal distributions. Each cell should be drawn from a normally-distributed population. If each cell contains over 30 subjects, the test is robust against moderate departures from normality (in this context, “robust” means that the test will still provide accurate results as long as violations of the assumptions are not large). You should analyze your data with PROC UNIVARIATE using the NORMAL option to determine whether your data meet this assumption (a brief sketch of this check appears immediately after this list). Remember that the significance tests for normality provided by PROC UNIVARIATE tend to be fairly sensitive when samples are large.

• Homogeneity of variance. The populations represented by the various cells should have equal variances on the criterion. If the number of subjects in the largest cell is no more than 1.5 times greater than the number of subjects in the smallest cell, the test is robust against moderate violations of the homogeneity assumption.
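The following is a minimal sketch of how you might check the last two assumptions with SAS. It is not required at this point in the chapter, and it borrows the data set name (D1) and the variable names (CONSEQ, MOD_AGGR, and SUB_AGGR) from the aggression study that is developed later in this chapter; substitute your own names as needed.

PROC SORT DATA=D1;
   BY CONSEQ MOD_AGGR;
RUN;

* One PROC UNIVARIATE analysis per cell.  The NORMAL option ;
* requests the test of normality, and PLOT requests a       ;
* stem-and-leaf plot of the distribution in each cell.      ;
PROC UNIVARIATE DATA=D1 NORMAL PLOT;
   VAR SUB_AGGR;
   BY CONSEQ MOD_AGGR;
RUN;

* Cell sizes and variances, to judge whether the largest    ;
* cell is more than 1.5 times the size of the smallest.     ;
PROC MEANS DATA=D1 N MEAN VAR;
   CLASS CONSEQ MOD_AGGR;
   VAR SUB_AGGR;
RUN;

The PROC SORT step is needed because the BY statement in PROC UNIVARIATE assumes that the data are already sorted by the BY variables.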
Using Factorial Designs in Research Chapter 15, “One-Way ANOVA with One Between-Subjects Factor,” described a simple experiment in which you manipulated a single independent variable: the level of aggression that was displayed by a model. Because there was a single independent variable in that study, it was analyzed using a one-way ANOVA. But suppose that there are two independent variables that you wish to manipulate. In this situation, you might think that it would be necessary to conduct two separate experiments, one for each independent variable. But you would be wrong: In many cases, it will be possible (and preferable) to manipulate both independent variables in a single study. The research design that is used in these studies is called a factorial design. In a factorial design, two or more independent variables are manipulated in a single study so that the treatment conditions represent all possible combinations of the various levels of the independent variables. In theory, a factorial design might include any number of independent variables. In practice, however, it often becomes impractical to use more than three or four. This chapter illustrates factorial designs that include only two independent variables, and such designs can be analyzed using a two-way ANOVA.
A Different Study Investigating Aggression

Overview

To illustrate the concept of factorial design, imagine that you are interested in conducting a different type of study that investigates aggression in nursery-school children. You want to test the following two research hypotheses:

• Hypothesis A: There will be a positive relationship between the level of aggression displayed by a model and the number of aggressive acts later demonstrated by children who observe the model. Specifically, it is predicted that (a) children who witness a high level of aggression will demonstrate a greater number of aggressive acts than children who witness a moderate or low level of aggression, and (b) children who witness a moderate level of aggression will demonstrate a greater number of aggressive acts than children who witness a low level of aggression.

• Hypothesis B: Children who observe a model being rewarded for engaging in aggressive behavior will later demonstrate a greater number of aggressive acts, compared to children who observe a model being punished for engaging in aggressive behavior.
You perform a single investigation to test these two hypotheses. In this investigation, you will simultaneously manipulate two independent variables. One of the independent variables will be relevant to Hypothesis A, and the other independent variable will be relevant to Hypothesis B. The following sections describe the research method in more detail. Note: Although the study and results presented here are fictitious, they are inspired by the actual studies reported by Bandura (1965, 1977). Research Method Overview. Suppose that you conduct a study in which 30 nursery-school children serve as subjects. The study is conducted in two stages. In Stage 1, you show a short videotape to your subjects. You manipulate the two independent variables by varying what the children see in this videotape. In Stage 2, you assess the dependent variable (the amount of aggression displayed by the children) to determine whether it has been affected by the independent variables. The following sections refer to your independent variables as “Predictor A” and “Predictor B” rather than “Independent Variable A” and “Independent Variable B.” This is because the term “predictor variable” is more general, and is appropriate regardless of whether your variable is a true manipulated independent variable (as in the present case), or a nonmanipulated subject variable (such as subject sex). Stage 1: Manipulating Predictor A. Predictor A in your study is “the level of aggression displayed by the model” or, more concisely, “model aggression.” You manipulate this independent variable by randomly assigning each child to one of three treatment conditions: •
Ten children are assigned to the “low” condition. When the subjects in this group watch the videotape, they see a model demonstrate a relatively low level of aggressive behavior. Specifically, they see a model (an adult female) enter a room that contains a wide variety of toys. For 90% of the tape, the model engages in nonaggressive play (e.g., playing with building blocks). For 10% of the tape, the model engages in aggressive play (e.g., violently punching an inflatable “bobo doll”).
•
Another 10 children are assigned to the “moderate” condition. They watch a videotape of the same model in the same playroom, but they observe the model displaying a somewhat higher level of aggressive behavior. Specifically, in this version of the tape, the model engages in nonaggressive play (again, playing with building blocks) 50% of the time, and engages in aggressive play (again, punching the bobo doll) 50% of the time.
•
Finally, the last 10 children are assigned to the “high” condition. They watch a videotape of the same model in the same playroom, but in this version the model engages in nonaggressive play 10% of the time, and engages in aggressive play 90% of the time.
Stage 1 continued: Manipulating Predictor B: Predictor B is “the consequences for the model.” You manipulate this independent variable by randomly assigning each child to one of two treatment conditions:
•
Fifteen children are assigned to the “model rewarded” condition. Toward the end of the videotape (described above), children in this group see the model rewarded for her behavior. Specifically, the videotape shows another adult who enters the room with the model, praises her, and gives her cookies.
•
The other 15 children are assigned to the “model punished” condition. Toward the end of the same videotape, children in this group see the model punished for her behavior: Another adult enters the room with the model, scolds her, shakes her finger at her, and puts her in “time out.”
Stage 2: Assessing the criterion variable. This chapter will refer to the dependent variable in the study as a “criterion variable.” Again, this is because the term “criterion variable” is a more general term that is appropriate regardless of whether your study is a true experiment (as in the present case), or is a nonexperimental investigation. The criterion variable in this study is the “number of aggressive acts displayed by the subjects” or, more concisely, “subject aggressive acts.” The purpose of your study was to determine whether certain manipulations in your videotape caused some groups of children to behave more aggressively than others. To assess this, you allowed each child to engage in a free play period immediately after viewing the videotape. Specifically, each child was individually escorted to a playroom similar to the one shown in the tape. This playroom contained a large assortment of toys, some of which were appropriate for nonaggressive play (e.g., building blocks), and some of which were appropriate for aggressive play (e.g., an inflatable bobo doll identical to the one in the tape). The children were told that they could do whatever they liked in the play room, and were then left to play alone. Outside of the playroom, three observers watched the child through a one-way mirror. They recorded the total number of aggressive acts the child displayed during a 20-minute period in the playroom (an “aggressive act” could be an instance in which the child punches the bobo doll, throws a building block, and so on). Therefore, the criterion variable in your study is the total number of aggressive acts demonstrated by each child during this period. The Factorial Design Matrix The factorial design of this study is illustrated in Figure 16.1. You can see that this design is represented by a matrix that consists of two rows and three columns.
Figure 16.1. Factorial design used in the aggression study.
The columns of the matrix. When an experimental design is represented in a matrix such as this, it is easiest to understand if you focus on only one aspect of the matrix at a time. For example, first consider just the three columns in Figure 16.1. The three columns are headed “Predictor A: Level of Aggression Displayed by Model,” and these columns represent the various levels of the “model aggression” independent variable. The first column represents the 10 subjects in level A1 (the children who saw a videotape in which the model displayed a low level of aggression), the second column represents the 10 subjects in level A2 (the children who saw the model display a moderate level of aggression), and the last column represents the 10 subjects in level A3 (the children who saw the model display a high level of aggression).

The rows of the matrix. Now consider just the two rows in Figure 16.1. These rows are headed “Predictor B: Consequences for Model.” The first row is headed “Level B1: Model Rewarded,” and this row represents the 15 children who saw the model rewarded for her behavior. The second row is headed “Level B2: Model Punished,” and represents the 15 children who saw the model punished for her behavior.

The r × c design. It is common to refer to a factorial design as an “r × c” design, in which “r” represents the number of rows in the matrix, and “c” represents the number of columns. The present study is an example of a 2 × 3 factorial design because it has two rows and three columns. If it included four levels of model aggression rather than three, it would be referred to as a 2 × 4 factorial design.

The cells of the matrix. You can see that this matrix consists of six cells. A cell is a location in the matrix where the column for one predictor variable intersects with the row for a second predictor variable. For example, look at the cell where column A1 (low level of model aggression) intersects with row B1 (model rewarded). The entry “5 Subjects” appears in this cell, which means that there were five children who experienced this particular combination of “treatments” under the two predictor variables. More specifically,
it means that there were five subjects who both (a) saw the model engage in a low level of aggression, and (b) saw the model rewarded for her behavior. Now look at the cell in which column A2 (moderate level of model aggression) intersects with row B2 (model punished). Again, the cell contains the entry “5 Subjects,” which means that there was a different group of five children who experienced the treatments of (a) seeing the model display a moderate level of aggression and (b) seeing the model punished for her behavior. In the same way, you can see that there was a separate group of five children assigned to each of the six cells of the matrix.

Earlier, it was said that a factorial design involves two or more independent variables being manipulated so that the treatment conditions represent all possible combinations of the various levels of the independent variables. The cells of Figure 16.1 illustrate this concept. You can see that the six cells of the figure represent every possible combination of (a) level of aggression displayed by the model and (b) consequences for the model. This means that, for the children who saw a low level of model aggression, half of them saw the model rewarded, and the other half saw the model punished. The same is true for the children who saw a moderate level of model aggression, as well as for the children who saw a high level of model aggression.
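Once your data have been entered (the DATA step for this study appears later in this chapter), one quick way to verify that each of these six cells actually contains subjects is to cross-tabulate the two predictor variables with PROC FREQ. The sketch below is optional and assumes the data set name D1 and the variable names CONSEQ and MOD_AGGR that are used in the SAS program presented later in this chapter.

* Crosstabulation of the two predictor variables.  The table ;
* shows the number of subjects in each of the six cells.     ;
PROC FREQ DATA=D1;
   TABLES CONSEQ*MOD_AGGR / NOROW NOCOL NOPERCENT;
RUN;

With the NOROW, NOCOL, and NOPERCENT options, each cell of the resulting table displays only the cell frequency, which should be 5 for every combination in this study.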
Understanding Figures That Illustrate the Results of a Factorial ANOVA Overview Factorial designs are popular in research for a variety of reasons. One reason is that they allow you to test for several different types of effects in a single investigation. The types of effects that may be produced from a factorial study will be discussed in the next section. However, it is important to note that this advantage has a corresponding drawback: Because they involve different types of effects, factorial designs sometimes produce results that can be difficult to interpret, compared to the simpler results that are produced in a one-way ANOVA. Fortunately, however, this task of interpretation can be made much easier if you first prepare a figure that plots the results of the factorial study. This section shows how to interpret these figures. Example of a Figure Figure 16.2 presents one type of figure that is often used to illustrate the results of a factorial study. Notice that, with this figure, scores on the criterion variable (“Subject Aggressive Acts”) are plotted on the vertical axis.
Figure 16.2. Example of one type of figure that is often used to illustrate the results from a factorial ANOVA.
The three levels of Predictor A (level of aggression displayed by model) are plotted on the horizontal axis. The first point on this axis is labeled “Low,” and represents group A1 (the children who watched the model display a low level of aggression). The middle point is labeled “Moderate,” and represents group A2 (the children who watched the model display a moderate level of aggression). The point at the right is labeled “High,” and represents group A3 (the children who watched the model display a high level of aggression).

The two levels of Predictor B (consequences for the model) are identified by drawing two different lines in the body of the figure itself. Specifically, the mean aggression scores displayed by children who saw the model rewarded (level B1) are illustrated with small circles connected by a solid line, while the mean aggression scores displayed by the children who saw the model punished (level B2) are displayed by small triangles connected by a broken line.
circle represents the mean aggression score for the five children who (a) saw the model display a low level of aggression, and (b) saw the model rewarded. Look to the left of this circle to find the mean score for this group on the “Subject Aggressive Acts” axis. The circle for this group is found at about 13 on this axis. This means that the five children in this group displayed an average of approximately 13 aggressive acts in the playroom after they watched the videotape. Now find the next circle on the solid line, above the label “Moderate” on the horizontal axis. This circle represents the five children in the group that (a) saw the model display a moderate level of aggression, and (b) saw the model rewarded. Looking to the vertical axis on the left, you can see that this group displayed a mean score of about 18. This means that the five children in this group displayed an average of approximately 18 aggressive acts in the playroom after they watched the videotape. Finally, find the circle on the solid line above the label “High” on the horizontal axis. This circle represents the five children in the group who (a) saw the model display a high level of aggression, and (b) saw the model rewarded. Looking to the vertical axis on the left, you can see that this group displayed a mean score of 24, meaning that they engaged in an average of about 24 aggressive acts in the playroom after watching the videotape. These three circles are all connected by a single solid line, indicating that all of these subjects were in the same condition under Predictor B––the model-rewarded condition. Interpreting the Means on the Broken Line Next you will find the mean scores for the subjects in the other condition under Predictor B: the children who saw the model punished. To do this, focus on the broken line with triangles. First, find the triangle that appears above the label “Low” on the figure’s horizontal axis. This triangle provides the mean aggression score for the five children who (a) saw the model display a low level of aggression, and (b) saw the model punished. Look to the left of this triangle to find the mean score for this group on the “Subject Aggressive Acts” axis. The triangle for this group is 2, which means that the five children in this group displayed an average of approximately 2 aggressive acts in the playroom after they watched the videotape. Repeating this process for the two other triangles on the broken line shows the following: •
The five children who (a) saw the model display a moderate level of aggression, and (b) saw the model punished displayed an average of approximately 7 aggressive acts in the playroom.
•
The five children who (a) saw the model display a high level of aggression, and (b) saw the model punished displayed an average of approximately 13 aggressive acts in the playroom.
These three triangles are all connected by a single broken line, indicating that all of these subjects were in the same condition under Predictor B––the model-punished condition.
Summary The important points to remember when interpreting the graphs in this chapter are as follows: •
The possible scores on the criterion variable are represented on the vertical axis.
•
The three levels of Predictor A are represented as three different points on the horizontal axis.
•
The two levels of Predictor B are represented by drawing two different lines within the graph.
Now you are ready to learn about the different types of effects that are observed in factorial designs, and how these effects appear when they are plotted in this type of graph.
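This Guide does not require you to produce figures such as Figure 16.2 with SAS. If you would like to plot your own cell means, however, one possible approach is sketched below. It assumes the data set D1 and the variables CONSEQ, MOD_AGGR, and SUB_AGGR that are created later in this chapter, and it uses PROC SGPLOT, which is available only in SAS releases more recent than the one this Guide is based on; treat it as an optional extra rather than as part of the analyses described in this chapter.

* Compute the mean of the criterion variable for each of the ;
* six cells and write the means to a new data set.           ;
PROC MEANS DATA=D1 NOPRINT NWAY;
   CLASS CONSEQ MOD_AGGR;
   VAR SUB_AGGR;
   OUTPUT OUT=CELLMEAN MEAN=MEAN_AGG;
RUN;

* Plot the cell means, with one line for each level of       ;
* Predictor B.  With character values L, M, and H the points ;
* may not appear in Low-Moderate-High order, so you may need ;
* to recode or format MOD_AGGR to control the order.         ;
PROC SGPLOT DATA=CELLMEAN;
   SERIES X=MOD_AGGR Y=MEAN_AGG / GROUP=CONSEQ MARKERS;
   YAXIS LABEL='Mean Number of Aggressive Acts';
RUN;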
Some Possible Results from a Factorial ANOVA Overview When a predictor variable (or independent variable) in a factorial design displays a significant main effect, it means that, in the population, there is a difference between at least two of the levels of that predictor variable with respect to mean scores on the criterion variable. In a one-way analysis of variance, only one main effect is possible: the main effect for the study’s one independent variable. However, in a factorial design, there will be one main effect possible for each predictor variable included in the study. Because the present study involves two predictor variables, two types of main effects are possible: •
a main effect for Predictor A
•
a main effect for Predictor B.
However, a factorial ANOVA can also produce an entirely different type of effect that is not possible with a one-way ANOVA––it can reveal a significant interaction between Predictor A and Predictor B. When an interaction is significant, it means that the relationship between one predictor variable and the criterion variable is different at different levels of the second predictor variable (a later section will discuss interactions in more detail). The following sections show how main effects and interactions might appear when plotted in a graph.
A Significant Main Effect for Predictor A Only Figure 16.3 shows one example of how a graph may appear when there is •
A significant main effect for Predictor A. Predictor A was the level of aggression displayed by the model: low versus moderate versus high.
•
A nonsignificant main effect for Predictor B. Predictor B was consequences for the model: model-rewarded versus model-punished.
•
A nonsignificant effect for the interaction.
In other words, Figure 16.3 shows a situation in which the only significant effect is a main effect for Predictor A.
Figure 16.3. A significant main effect for Predictor A (level of aggression displayed by the model); nonsignificant main effect for Predictor B; nonsignificant interaction.
Interpreting the graph. Figure 16.3 shows that a relatively low level of aggression was displayed by subjects in the “low” condition of Predictor A (the level of aggression displayed by the model). When you look above the label “Low” on the horizontal axis, you can see that both the children in the model-rewarded group (represented with a small circle) as well as the children in the model-punished group (represented with a small triangle) display relatively low scores on aggression (the two groups demonstrated a mean of approximately 6 aggressive acts in the playroom). However, a somewhat higher level of
aggression was demonstrated by subjects in the moderate-model-aggression condition: When you look above the “Moderate” on the horizontal axis, you can see that the children in this condition displayed an average of about 12 aggressive acts. Finally, an even higher level of aggression was displayed by subjects in the high-model-aggression condition: When you look above the label “High,” you can see that the children in this condition displayed an average of approximately 18 aggressive acts. In short, this trend shows that there was a main effect for the model aggression variable. Figure 16.3 shows that, the greater the amount of aggression displayed by the model in the videotape, the greater the number of aggressive acts subsequently displayed by the children when they were in the playroom. Characteristics of main effect for Predictor A when graphed. This leads to an important point: When a figure representing the results of a factorial study displays a significant main effect for Predictor A, it will demonstrate both of the following characteristics: •
Corresponding line segments are parallel.
•
At least one set of corresponding line segments displays a relatively steep angle (or slope).
First, you need to understand that “corresponding line segments” refers to line segments that (a) run from one point on the horizontal axis for Predictor A to the next point on the same axis, and (b) appear immediately above and below each other in a figure. For example, the solid line and the broken line that run from “Low” to “Moderate” in Figure 16.3 are corresponding line segments. Similarly, the solid line and the broken line that go from “Moderate” to “High” in Figure 16.3 are also corresponding line segments.

The first of the two preceding conditions––that the lines should be parallel––conveys that the two predictor variables are not involved in an interaction. This is important because you typically will not interpret a main effect for a predictor variable if that predictor variable is involved in a significant interaction (the meaning of interaction will be discussed later in the section “A Significant Interaction”). In Figure 16.3, you can see that the lines for the two conditions under Predictor B (the solid line and the broken line) are parallel to one another. This suggests that there probably is not an interaction between Predictor A (level of aggression displayed by the model) and Predictor B (consequences for the model) in the present study.

The second condition––that at least one set of corresponding line segments should display a relatively steep angle––can be understood by again referring to Figure 16.3. Notice that the segment that begins at “Low” (the low-model-aggression condition) and extends to “Moderate” (the moderate-model-aggression condition) is not horizontal; it displays an upward angle––a positive slope. Obviously, this is because the aggression scores for the moderate-model-aggression group were higher than the aggression scores for the low-model-aggression group. When you obtain a significant effect for the Predictor A variable in your study, you should expect to see this type of angle. Similarly, you can see that the line segment that begins at “Moderate” and continues to “High” also displays an upward angle, also consistent with a significant effect for the model aggression variable.
Remember that these guidelines are merely intended to help you understand what a main effect looks like when it is plotted in a graph such as Figure 16.3. To determine whether this main effect is statistically significant, it will of course be necessary to review the results of the analysis of variance, to be discussed below. Another Example of a Significant Main Effect for Predictor A Only Figure 16.4 shows another example of a significant main effect for the model aggression factor. You know that this figure illustrates a main effect for Predictor A, because both of the following are true: •
the corresponding line segments are all parallel
•
one set of corresponding line segments displays a relatively steep angle.
Figure 16.4. Another example of a significant main effect for Predictor A (level of aggression displayed by the model); nonsignificant main effect for Predictor B, nonsignificant interaction.
Notice that the solid line and the broken line that run from “Low” to “Moderate” are parallel to each other. In addition, the solid line and the broken line that run from “Moderate” to “High” are also parallel. This tells you that there is probably not a significant interaction between Predictor A and Predictor B. Where an interaction is concerned, it is irrelevant that the lines show an upward angle from “Low” to “Moderate” and then become level from “Moderate” to “High.” The important point is that the corresponding line segments are
parallel to each other. Think of these lines as being like the rails of a railroad track that twists and curves along the landscape: As long as the two corresponding rails are always parallel to each other (regardless of how much they slope), the interaction is probably not significant. You know that Figure 16.4 illustrates a significant main effect for Predictor A because the lines demonstrate a relatively steep angle as they run from “Low” to “Moderate.” This tells you that the children who observed a moderate level of model aggression displayed a higher number of aggressive acts after viewing the videotape, compared to the children who observed a low level of model aggression. It is this difference that tells you that you probably have a significant main effect for Predictor A. You can see from Figure 16.4 that the lines do not demonstrate a relatively steep slope as they run from “Moderate” to “High.” This tells you that there was probably not a significant difference between the children who watched a moderate level of model aggression versus those who observed a high level. But this does not change the fact that you still have a significant effect for Predictor A. When a predictor variable contains three or more conditions, that predictor variable will display a significant main effect if at least two of the conditions are markedly different from each other. A Significant Main Effect for Predictor B Only How Predictor B is represented. You would expect to see a different type of pattern in a graph if the main effect for the other predictor variable (Predictor B) were significant. Earlier, you learned that Predictor A was represented in a graph by plotting three points on the horizontal axis. In contrast, you learned that Predictor B was represented by drawing different lines within the body of the graph: one line for each level of Predictor B. In the present study, Predictor B was the “consequences for the model” variable: A solid line was used to represent mean scores from the children who saw the model rewarded, and a broken line was used to represent mean scores from the children who saw the model punished. Characteristics of a main effect for Predictor B when graphed. When Predictor B is represented in a figure by plotting separate lines for its various levels, a significant main effect for Predictor B is revealed when the figure displays both of the following characteristics: •
Corresponding line segments are parallel
•
At least two of the lines are relatively separated from each other.
Interpreting the graph. For example, a main effect for Predictor B in the current study is represented by Figure 16.5. Consistent with the two preceding points, the two lines in Figure 16.5 (a) are parallel to one another (indicating that there is probably no interaction), and (b) are separated from one another.
Figure 16.5. A significant main effect for Predictor B (consequences for the model); nonsignificant main effect for Predictor A, nonsignificant interaction.
Regarding the separation between the lines: Notice that, in general, the children in the “model rewarded” condition tended to demonstrate a higher number of aggressive acts after viewing the videotape, compared to the children in the “model punished” condition. This is the general trend that you would expect, given the assumptions of social learning theory (Bandura, 1977). Notice that neither the solid line nor the broken line shows much of an angle, or slope. This indicates that there was probably not a main effect for Predictor A (level of aggression displayed by model).

A Significant Main Effect for Both Predictor Variables

It is possible to obtain significant effects for both Predictor A and Predictor B in the same investigation. When there is a significant effect for both predictor variables, you should see all of the following: •
Corresponding line segments are parallel (indicating no interaction).
•
At least one set of corresponding line segments displays a relatively steep angle (indicating a main effect for Predictor A)
•
At least two of the lines are relatively separated from each other (indicating a significant main effect for Predictor B).
Figure 16.6 shows what the graph might look like under these circumstances.
Figure 16.6. Significant main effects for both Predictor A (level of aggression displayed by the model) and Predictor B (consequences for the model); nonsignificant interaction.
From Figure 16.6, you can see that the broken line and the solid line are parallel, indicating no interaction. Both lines display an upward angle, indicating a significant effect for Predictor A (level of aggression displayed by the model): The children in the “high” condition were more aggressive than the children in the “moderate” condition, who in turn were more aggressive than the children in the “low” condition. Finally, the solid line is higher than the broken line, indicating a significant effect for Predictor B (consequences for the model): The children who saw the model rewarded tended to be more aggressive than the children who saw the model punished.

No Main Effects

Figure 16.7 shows what a graph might look like if there were no main effects for either Predictor A or Predictor B. Notice that the lines are parallel (indicating no interaction), none of the line segments display a relatively steep angle (indicating no main effect for Predictor A), and the lines are not separated (indicating no main effect for Predictor B).
Figure 16.7. Nonsignificant main effects and a nonsignificant interaction.
A Significant Interaction Overview. An earlier section indicated that, when you perform a two-way ANOVA, there are three types of effects that may be observed: (a) a main effect for Predictor A, (b) a main effect for Predictor B, and (c) an interaction between Predictor A and Predictor B. This section provides definitions for the concept of “interaction,” shows what an interaction might look like when plotted on a graph, and addresses the issue of whether main effects should typically be interpreted when an interaction is significant. Definitions for “interaction.” The concept of an interaction can be defined in several ways. For example, with respect to experimental research (in which you are actually manipulating true independent variables), the following definition can be used: •
An interaction is a condition in which the effect of one independent variable on the dependent variable is different at different levels of the second independent variable.
On the other hand, for nonexperimental research (in which you are simply measuring naturally-occurring variables rather than manipulating true independent variables), the concept of interaction can be defined in this way: •
An interaction is a condition in which the relationship between one predictor variable and the criterion variable is different at different levels of the second predictor variable.
Characteristics of an interaction when graphed. These definitions are abstract and somewhat difficult to understand at first reading. However, the concept of interaction is much easier to understand by seeing an interaction illustrated in a graph. A graph indicates that there is probably an interaction between two predictor variables when it displays the following characteristic: •
At least one set of corresponding line segments are not parallel.
An interaction illustrated in a graph. For example, Figure 16.8 displays a significant interaction between Predictor A and Predictor B in the present study. Notice that the solid line and the broken line are no longer parallel: The line representing the children who saw the model rewarded now displays a fairly steep angle, while the line for the children who saw the model punished is relatively flat. This is the key characteristic of a figure that displays a significant interaction: lines that are not parallel.
Figure 16.8. A significant interaction between Predictor A (level of aggression displayed by the model) and Predictor B (consequences for the model).
Notice how the relationships depicted in Figure 16.8 are consistent with the definition for interaction provided above: The relationship between one predictor variable (level of aggression displayed by the model) and the criterion variable (subject aggression) is different at different levels of the second predictor variable (consequences for the model). More specifically, the figure shows that there is a relatively strong, positive relationship between Predictor A (model aggression) and the criterion variable (subject aggression) for the children in the “model rewarded” level of Predictor B. For the children in this group, the more aggression that they saw from the model, the more aggressively they (the children)
behaved when they were in the playroom. The children who saw a high level of model aggression displayed an average of about 23 aggressive acts, the children who saw a moderate level of model aggression displayed an average of about 17 aggressive acts, and the children who saw a low level of model aggression displayed an average of about 9 aggressive acts. In contrast, notice that Predictor A (level of aggression displayed by the model) had essentially no effect on the children in the “model-punished” level of Predictor B. The children in this condition are represented by the broken line in Figure 16.8. You can see that this broken line is relatively flat: The children in the “high,” “moderate,” and “low” groups all displayed the same number of aggressive acts when they were in the playroom (each group displayed about 7 aggressive acts). The fact that there were no differences between the three conditions means that Predictor A had no effect on the children in the modelpunished condition. Interpreting the interaction in the figure. If you conducted the study described here and actually obtained the interaction that is illustrated in Figure 16.8, what would the results mean with respect to the effects of your two predictor variables? These results suggest that the level of aggression displayed by a model can have an effect on the level of aggression later displayed by the subjects, but only if the subjects see the model being rewarded for her aggressive behavior. However, if the subjects instead see the model being punished, then the level of aggression displayed by the model has no effect. Two caveats regarding this interpretation: First, remember that these results, like most of the results presented in this book, are fictitious, and were provided only to illustrate statistical concepts. They do not necessarily represent what researchers have discovered when conducting research on aggression. Second, remember that the interaction illustrated in Figure 16.8 is just one example of what an interaction might look like when plotted. When you perform a two-way ANOVA, a significant interaction might take on any of an infinite variety of forms. These different types of interactions will have one characteristic in common: they will all involve corresponding line segments that are not parallel. A later section will show a different example of how an interaction from the present study might appear. Interpreting Main Effects When an Interaction Is Significant The problem. When you perform a two-way ANOVA, it is possible that you will find that (a) the interaction term is statistically significant, and (b) one or both of the main effects are also statistically significant. When you prepare a report summarizing the results, you will certainly discuss the nature of your significant interaction. But is it also acceptable to discuss and interpret the main effects that were significant? There is some disagreement between statisticians in answering this question. Some statisticians argue that, if the interaction is significant, you should not interpret the main effects at all, even if they are significant. Others take a less extreme approach. They say that
it is acceptable to interpret significant main effects, as long as the primary interpretation of the results focuses on the interaction (if it is significant). Example. The interaction illustrated in Figure 16.8 provides a good example of why you must be very cautious in interpreting main effects when the predictor variables are involved in an interaction. To understand why, consider this: Based on the results presented in the figure, would it make sense to begin your discussion of the results by saying that there is a main effect for Predictor A (level of aggression displayed by the model)? Probably not––to simply say that there is a main effect for Predictor A would be somewhat misleading. It is clear that the level of aggression displayed by the model does seem to have an effect on aggression among children who see the model rewarded, but the graph suggests that the level of aggression displayed by the model probably does not have any real effect on aggression among children who see the model punished. To simply say that there is a main effect for model aggression might mislead readers into believing that exposure to aggressive models is likely to cause increased subject aggression under any circumstances (which it apparently does not). Recommendations. If it is questionable to discuss main effects under these circumstances, then how should you present your results when you have a significant interaction along with significant main effects? In situations such as this, it often makes more sense to do the following: •
Note that there was a significant interaction between the two predictor variables. Your interpretation of the results should be based primarily upon this interaction.
•
Prepare a figure (like Figure 16.8) that illustrates the nature of the interaction.
•
If appropriate, further investigate the nature of the interaction by testing for simple effects (a later section explains this concept of “simple effects”; one common way to request such tests is sketched just after this list).
•
If the main effect for a predictor variable was significant, interpret this main effect very cautiously. Remind the reader that the predictor variable was involved in an interaction, and explain how the effect of this predictor variable was different for different groups of subjects.
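Although a later section of this Guide explains simple effects in more detail, the sketch below shows one common way to request such tests: the SLICE= option of the LSMEANS statement in PROC GLM. This is offered only as an optional illustration, not as the approach this Guide itself develops, and it assumes the data set and variable names used in the aggression example presented later in this chapter.

PROC GLM DATA=D1;
   CLASS CONSEQ MOD_AGGR;
   MODEL SUB_AGGR = CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
   * Test the simple effect of model aggression separately at ;
   * each level of the consequences-for-model variable.       ;
   LSMEANS CONSEQ*MOD_AGGR / SLICE=CONSEQ;
RUN;
QUIT;

Each “slice” in the resulting output provides an F test of the model aggression effect within a single level of CONSEQ (model rewarded or model punished), which is one way to describe how the effect of one predictor differs across the levels of the other.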
Another Example of a Significant Interaction An earlier section pointed out that an interaction can assume an almost infinite variety of forms. Figure 16.9 illustrates a different type of interaction for the current study.
Figure 16.9. Another example of a significant interaction between Predictor A (level of aggression displayed by the model) and Predictor B (consequences for the model).
How would you know that these results constitute an interaction? Because one of the sets of corresponding line segments in this figure contains lines that are not parallel. Specifically, you can see that the solid line running from “Moderate” to “High” displays a fairly steep angle, while the broken line running from “Moderate” to “High” does not. Obviously, these line segments are not parallel, and that means that you have an interaction. It is true that another set of line segments in the figure are parallel to each other (i.e., the solid line and broken line running from “Low” to “Moderate” are parallel), but this is irrelevant. As long as at least one set of corresponding line segments display markedly different angles, the interaction is probably significant. As was noted before, the purpose of this section was simply to show you how main effects and interactions might appear when plotted in a graph. Obviously, you do not conclude that a main effect or interaction is significant by simply viewing a graph; instead, this is done by performing the appropriate statistical analysis using the SAS System. The remainder of this chapter shows how to perform these analyses.
Example of a Factorial ANOVA Revealing Two Significant Main Effects and a Nonsignificant Interaction Overview The steps that you follow in performing a factorial ANOVA will vary depending on whether the interaction is significant: If the interaction is significant, you will follow one set of steps, and if the interaction is nonsignificant you will follow a different set of steps. This section illustrates an analysis that results in a nonsignificant interaction, along with two significant main effects. It shows you how to prepare the SAS program, how to interpret the SAS output, and how to write up the results. These procedures are illustrated by analyzing fictitious data from the aggression study described previously. In these analyses, Predictor A is the level of aggression displayed by the model, Predictor B is the consequences for the model, and the criterion variable is the number of aggressive acts displayed by the children after viewing the videotape. Choosing SAS Variable Names and Values to Use in the SAS Program Overview. Before you write a SAS program to perform a factorial ANOVA, you might find it helpful to prepare a table similar to that in Figure 16.10. This table will help you choose (a) meaningful SAS variable names for the variables in the analysis, (b) values to represent the different levels under the predictor variables, and (c) cell designators for the cells that constitute the factorial design matrix. If you carefully choose meaningful variable names and values now, you will find it easier to interpret your SAS output later.
Figure 16.10. Variable names and values to be used in the SAS program for the aggression study.
SAS variable name for Predictor A. You can see that Figure 16.10 is very similar to Figure 16.1, except that variable names and values have now been added. For example, Figure 16.10 again shows that Predictor A in your study is the “Level Of Aggression Displayed by Model.” Below this heading (within parentheses) is “MOD_AGGR,” which will serve as the SAS variable name for Predictor A in your SAS program (“MOD_AGGR” stands for “model aggression”). Obviously, you can choose any SAS variable name that is meaningful and that complies with the rules for SAS variable names. Values to represent conditions under Predictor A. Below the heading for Predictor A are the names of the three conditions for this predictor variable: “Low,” “Moderate,” and “High.” Below these headings for the three conditions (within parentheses) are the values that you will use to represent these conditions in your SAS program. You will use the value “L” to represent children in the low-model-aggression condition, the value “M” to represent children in the moderate-model-aggression condition, and the value “H” to represent children in the high-model-aggression condition. Choosing meaningful letters such as L, M, and H will make it easier to interpret your SAS output later. SAS variable name for Predictor B. Figure 16.10 shows that Predictor B in your study is “Consequences for Model.” Below this heading (within parentheses) is “CONSEQ,” which will serve as the SAS variable name for Predictor B in your SAS program (“CONSEQ” stands for “consequences”). Values to represent conditions under Predictor B. To the right of this heading are the names for the treatment conditions under Predictor B, along with the values that will represent these conditions in your SAS program. You will use the value “MR” to represent children in the “Model Rewarded” condition, and “MP” to represent children in the “Model Punished” condition. Cell designators. Each cell in Figure 16.10 contains a cell designator that indicates the condition a child was assigned to under both predictor variables. For example, the upper left cell has the designator “Cell MR-L.” The “MR” tells you that this group of children were in the “model-rewarded” condition under Predictor B, and the “L” tells you that they were in the “low” condition under Predictor A. Now consider the cell in the middle column on the bottom row of the figure. This cell has the designator “Cell MP-M.” Here, the “MP” tells you that this group of children were in the “model-punished” condition under Predictor B, and the “M” tells you that they were in the “moderate” condition under Predictor A. When you work with these cell designators, remember that the value for the row always comes first, and the value for the column always comes second. This means that the cell at the intersection of row 2 and column 1 should be identified with the designator MP-L, not LMP. As you will soon see, being able to quickly interpret these cells designators will make it easier to write your SAS program and to interpret the results.
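This chapter works directly with the short values shown in Figure 16.10 (L, M, and H for MOD_AGGR, and MR and MP for CONSEQ). As an optional refinement that is not used in this Guide, you could define formats with PROC FORMAT so that your output displays more descriptive labels; the format names below ($CONSFMT and $AGGRFMT) are arbitrary choices for this sketch.

PROC FORMAT;
   VALUE $CONSFMT  'MR' = 'Model Rewarded'
                   'MP' = 'Model Punished';
   VALUE $AGGRFMT  'L'  = 'Low'
                   'M'  = 'Moderate'
                   'H'  = 'High';
RUN;

You could then attach the formats in a later procedure with a statement such as FORMAT CONSEQ $CONSFMT. MOD_AGGR $AGGRFMT.; so that, for example, the value MR is printed as Model Rewarded.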
Data Set to Be Analyzed

Table 16.1 presents the data set that you will analyze.

Table 16.1
Variables Analyzed in the Aggression Study
(Data Set Will Produce Significant Main Effects and a Nonsignificant Interaction)
____________________________________________________
          Consequences   Model        Subject
Subject   for model      aggression   aggression
____________________________________________________
  01          MR             L            11
  02          MR             L             7
  03          MR             L            15
  04          MR             L            12
  05          MR             L             8
  06          MR             M            24
  07          MR             M            19
  08          MR             M            20
  09          MR             M            23
  10          MR             M            29
  11          MR             H            23
  12          MR             H            29
  13          MR             H            25
  14          MR             H            20
  15          MR             H            27
  16          MP             L             4
  17          MP             L             0
  18          MP             L             9
  19          MP             L             2
  20          MP             L             8
  21          MP             M            17
  22          MP             M            20
  23          MP             M            12
  24          MP             M            17
  25          MP             M            21
  26          MP             H            12
  27          MP             H            20
  28          MP             H            21
  29          MP             H            20
  30          MP             H            18
____________________________________________________
Understanding the columns in the table. The columns of Table 16.1 provide the variables that you will analyze in your study. The first column in Table 16.1, “Subject,” assigns a unique subject number to each child. The second column is headed “Consequences for model.” This column identifies the condition to which children were assigned under Predictor B, consequences for the model. In this column, the value “MR” identifies children in the model-rewarded condition, and “MP” identifies children in the model-punished condition. You can see that children with subject numbers 1-15 were in the model-rewarded condition, and children with subject numbers 16-30 were in the model-punished condition.
The third column is headed “Model aggression.” This column identifies the condition to which children were assigned under Predictor A, level of aggression displayed by the model. In this column, the value “L” identifies children who saw the model in the videotape display a low level of aggression, “M” identifies children who saw the model display a moderate level of aggression, and “H” identifies children who saw the model display a high level of aggression. Finally, the column headed “Subject aggression” indicates the number of aggressive acts that each child displayed in the play room after viewing the videotape. This variable will serve as the criterion variable in your study. Understanding the rows of the table. The rows of Table 16.1 represent the individual children who participated as subjects in the study. The first row represents Subject 1. The “MR” under “Consequences for model” tells you that this child was in the model-rewarded condition under Predictor B. The “L” under “Model aggression” tells you that the subject was in the low condition under Predictor A. Finally, the “11” under “Subject aggression” tells you that this child displayed 11 aggressive acts after viewing the videotape. The rows for the remaining children may be interpreted in the same way. Stepping back and getting the “big picture” of Table 16.1 shows that it contains every possible combination of the levels of Predictor A and Predictor B. Notice that subjects 1-15 were in the model-rewarded condition under Predictor B, and that subjects 1-5 were in the low-model-aggression condition, subjects 6-10 were in the moderate-model-aggression condition, and subjects 11-15 were in the high-model-aggression condition under Predictor A. For subjects 16-30, the pattern repeats itself, with the exception that these subjects were in the model-punished condition under Predictor B. Writing the DATA Step of the SAS Program As you type the SAS program, you will enter the data similar to the way that they appear in Table 16.1. That is, you will have one column to contain subject numbers, one column to indicate the subjects’ condition under the consequences for model predictor variable, one column to indicate the subjects’ condition under the model aggression predictor variable,
and one column to indicate the subjects’ score on the subject aggression criterion variable. Here is the DATA step for your SAS program:

OPTIONS LS=80 PS=60;
DATA D1;
   INPUT SUB_NUM CONSEQ $ MOD_AGGR $ SUB_AGGR;
DATALINES;
01 MR L 11
02 MR L 7
03 MR L 15
04 MR L 12
05 MR L 8
06 MR M 24
07 MR M 19
08 MR M 20
09 MR M 23
10 MR M 29
11 MR H 23
12 MR H 29
13 MR H 25
14 MR H 20
15 MR H 27
16 MP L 4
17 MP L 0
18 MP L 9
19 MP L 2
20 MP L 8
21 MP M 17
22 MP M 20
23 MP M 12
24 MP M 17
25 MP M 21
26 MP H 12
27 MP H 20
28 MP H 21
29 MP H 20
30 MP H 18
;

You can see that the INPUT statement of the preceding program uses the following SAS variable names: •
The variable SUB_NUM represents subject numbers;
•
The variable CONSEQ represents the condition that each subject is in under the consequences-for-model predictor variable (values are either MR or MP for this variable;
note that the variable name is followed by the “$” symbol to indicate that it is a character variable); •
The variable MOD_AGGR represents subject condition under the model-aggression predictor variable (values are either L, M, or H for this variable; note that the variable name is also followed by the “$” symbol to indicate that it is a character variable);
•
The variable SUB_AGGR contains subjects’ scores on the criterion variable: the number of aggressive acts displayed by the subject in the playroom.
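Having entered the data, you might run a few quick checks to verify that SAS read the data file as intended; the next section describes this kind of screening in more detail, and Chapter 4 covers the procedures themselves. The following is only a minimal sketch of such checks:

* Printout of the raw data for proofreading against Table 16.1. ;
PROC PRINT DATA=D1;
RUN;

* Simple statistics for the numeric variables.  The minimum and ;
* maximum values help you spot impossible scores.               ;
PROC MEANS DATA=D1 N MEAN MIN MAX;
   VAR SUB_NUM SUB_AGGR;
RUN;

* Frequency tables for the character variables.  Any value      ;
* other than MR, MP, L, M, or H would indicate a typing error.  ;
PROC FREQ DATA=D1;
   TABLES CONSEQ MOD_AGGR;
RUN;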
Data Screening and Testing Assumptions Prior to Performing the ANOVA

Overview. Prior to performing the ANOVA, you should perform some preliminary analyses to verify that your data are valid and that you have met the assumptions underlying analysis of variance. This section summarizes these analyses, and refers you to the other sections of this Guide that show you how to perform them.

Basic data screening. Before performing any statistical analyses, you should always verify that your data are valid. This means checking for any obvious errors in typing the data or in writing the DATA step of your SAS program. At the very least, you should analyze your numeric variables with PROC MEANS to verify that the means are reasonable and that you do not have any invalid values. For guidance in doing this, see Chapter 4, “Data Input,” the section “Using PROC MEANS and PROC FREQ to Identify Obvious Problems with a Data Set.” It is also wise to create a printout of your raw data that you can audit. For guidance in doing this, again see Chapter 4, the section “Using PROC PRINT to Create a Printout of Raw Data.”

Testing assumptions underlying the procedure. The first section of this chapter included a list of the assumptions underlying factorial ANOVA. For many of these assumptions (such as “random sampling”), there is no statistical procedure for testing the assumption. The only way to verify that you have met the assumption is to conduct a careful review of how you conducted the study. On the other hand, some of these assumptions can be tested statistically using the SAS System. In particular, an excerpt of one of the assumptions is reproduced below: •
Normal distributions. Each cell should be drawn from a normally-distributed population.
You can use PROC UNIVARIATE to test this assumption. Using the PLOT option with PROC UNIVARIATE prints a stem-and-leaf plot that you can use to determine the approximate shape of the sample data’s distribution. Using the NORMAL option with PROC UNIVARIATE requests a test of the null hypothesis that the sample data were drawn from a normally-distributed population. You should perform this test separately for each of
the cells that constitute your experimental design (remember that the tests for normality provided by PROC UNIVARIATE tend to be fairly sensitive when samples are large). For guidance in doing this, see Chapter 7, “Measures of Central Tendency and Variability,” and especially the section “Using PROC UNIVARIATE to Determine the Shape of Distributions.”

Writing the SAS Program to Perform the Two-Way ANOVA

Overview. This section shows you how to write SAS statements to perform an ANOVA with two between-subjects factors. It shows how to prepare the PROC GLM statement, the CLASS statement, the MODEL statement, and the MEANS statement.

The syntax. Below is the syntax for the PROC step that is needed to perform a two-way factorial ANOVA, and to follow it with Tukey's HSD test:

PROC GLM DATA = data-set-name;
   CLASS predictorB predictorA;
   MODEL criterion-variable = predictorB predictorA predictorB*predictorA;
   MEANS predictorB predictorA predictorB*predictorA;
   MEANS predictorB predictorA / TUKEY CLDIFF ALPHA=alpha-level;
   TITLE1 'your-name';
RUN;
QUIT;

The actual code for the current analysis. Here, the appropriate SAS variable names have been substituted into the syntax (line numbers have been added on the left):
PROC GLM DATA=D1; CLASS CONSEQ MOD_AGGR; MODEL SUB_AGGR = CONSEQ MOD_AGGR CONSEQ*MOD_AGGR; MEANS CONSEQ MOD_AGGR CONSEQ*MOD_AGGR; MEANS CONSEQ MOD_AGGR / TUKEY CLDIFF ALPHA=0.05; TITLE1 'JANE DOE'; RUN; QUIT;
Some notes about the preceding code:

• The PROC GLM statement in Line 1 requests the GLM procedure, and requests that the analysis be performed on data set D1.
• The CLASS statement in Line 2 lists the two classification variables as CONSEQ (Predictor B) and MOD_AGGR (Predictor A).
• Line 3 contains the MODEL statement for the analysis. The name of the criterion variable (SUB_AGGR) appears to the left of the equal sign in this statement. To the right of the equal sign, you should list the following:
   – the two predictor variables (CONSEQ and MOD_AGGR, in this case). The names of the two predictor variables should be separated by at least one space.
   – an additional term that represents the interaction between these two predictor variables. To create this interaction term, type the names for Predictor B and Predictor A, and connect them with an asterisk ("*"). This should be typed as a single term with no spaces. For the current analysis, the interaction term was CONSEQ*MOD_AGGR.
• You will notice that in all of the statements that contain the SAS variable names for Predictor A and Predictor B, the statement lists Predictor B first, followed by Predictor A (that is, CONSEQ is always followed by MOD_AGGR). This order may seem counterintuitive, but there is a reason for it: When SAS lists the means for the various cells in the study, this output will be somewhat easier to interpret if you list Predictor B prior to Predictor A in these statements (more on this later).
• Line 4 presents the first MEANS statement:

      MEANS CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
• This statement requests means and standard deviations on the criterion variable for the various conditions under Predictor B (CONSEQ) and Predictor A (MOD_AGGR). By including the interaction term in this statement (CONSEQ*MOD_AGGR), you ensure that PROC GLM will also print means and standard deviations on the criterion variable for each of the six cells in the factorial design. You will need these means in interpreting the results.
• Line 5 presents the second MEANS statement:

      MEANS CONSEQ MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;
• In this statement, you will list the names for the two predictor variables followed by a slash and a number of options. The first option is requested by the key word TUKEY. The key word TUKEY requests that Tukey's HSD test be performed as a multiple comparison procedure, in the event that the main effects are significant (for an explanation of multiple comparison procedures, see the section "Treatment Effects, Multiple Comparison Procedures, and a New Index of Effect Size" in Chapter 15, "One-Way ANOVA with One Between-Subjects Factor"). Technically, you could have omitted the predictor variable CONSEQ from this statement because it contains only two conditions; you will remember from the preceding chapter that it is not necessary to perform a multiple comparison procedure on a predictor variable when it involves only two conditions.
• The MEANS statement also contains the key word CLDIFF, which requests that the results of the Tukey test be printed as confidence intervals for the differences between the means. The option ALPHA=0.05 requests that the significance level (alpha) be set at .05 for the Tukey tests. If you had wanted alpha set at .01, you would have used the option ALPHA=0.01, and if you had wanted alpha set at .10, you would have used the option ALPHA=0.1.

• Although the preceding MEANS statement requests the Tukey HSD test, remember that it is possible to request other multiple comparison procedures instead of the Tukey test (such as the Bonferroni t test and the Scheffe multiple comparison procedure). For guidance in doing this, see Chapter 15, the section "Keywords for Other Multiple Comparison Procedures." A brief sketch of this substitution appears after these notes.
• Finally, lines 6, 7, and 8 contain the TITLE, RUN, and QUIT statements for your program.
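For example, to request the Bonferroni t test or Scheffe's procedure in place of Tukey's HSD test, you would substitute a different keyword for TUKEY in the second MEANS statement. The two statements below are only a sketch of what that substitution looks like (BON and SCHEFFE are the PROC GLM keywords for these procedures; see the Chapter 15 section cited above for the full list of keywords):

   MEANS CONSEQ MOD_AGGR / BON     CLDIFF ALPHA=0.05;   /* Bonferroni t tests */
   MEANS CONSEQ MOD_AGGR / SCHEFFE CLDIFF ALPHA=0.05;   /* Scheffe's procedure */

Typically you would include just one such statement in a given program, in place of the TUKEY statement shown in Line 5.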
The complete SAS program. Here is the complete SAS program that will input your data set and perform a factorial ANOVA with two between-subjects factors:

   OPTIONS LS=80 PS=60;
   DATA D1;
      INPUT SUB_NUM
            CONSEQ $
            MOD_AGGR $
            SUB_AGGR;
   DATALINES;
   01 MR L 11
   02 MR L 7
   03 MR L 15
   04 MR L 12
   05 MR L 8
   06 MR M 24
   07 MR M 19
   08 MR M 20
   09 MR M 23
   10 MR M 29
   11 MR H 23
   12 MR H 29
   13 MR H 25
   14 MR H 20
   15 MR H 27
   16 MP L 4
   17 MP L 0
   18 MP L 9
   19 MP L 2
   20 MP L 8
   21 MP M 17
   22 MP M 20
   23 MP M 12
   24 MP M 17
   25 MP M 21
   26 MP H 12
   27 MP H 20
   28 MP H 21
   29 MP H 20
   30 MP H 18
   ;
   PROC GLM DATA=D1;
      CLASS CONSEQ MOD_AGGR;
      MODEL SUB_AGGR = CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
      MEANS CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
      MEANS CONSEQ MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;
      TITLE1 'JANE DOE';
   RUN;
   QUIT;
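If you wanted to carry out the data screening and normality checks described at the beginning of this chapter section before running PROC GLM, statements along the following lines could be added ahead of the GLM step. This is only a sketch using the current variable names; PROC MEANS, PROC FREQ, PROC PRINT, and PROC UNIVARIATE (with its NORMAL and PLOT options) are the procedures named in the Chapter 4 and Chapter 7 sections cited earlier, and the PROC SORT/BY arrangement shown here is simply one way to obtain a separate normality test for each of the six cells:

   PROC MEANS DATA=D1;                    /* check that means, minimums, and maximums look reasonable */
      VAR SUB_NUM SUB_AGGR;
   RUN;
   PROC FREQ DATA=D1;                     /* check the codes used for the two predictor variables */
      TABLES CONSEQ MOD_AGGR;
   RUN;
   PROC PRINT DATA=D1;                    /* printout of the raw data for auditing */
   RUN;
   PROC SORT DATA=D1;                     /* sort so that each cell can be analyzed separately */
      BY CONSEQ MOD_AGGR;
   RUN;
   PROC UNIVARIATE DATA=D1 NORMAL PLOT;   /* stem-and-leaf plot and test of normality */
      VAR SUB_AGGR;
      BY CONSEQ MOD_AGGR;                 /* one analysis for each of the six cells */
   RUN;

Reviewing the PROC MEANS and PROC FREQ output lets you spot invalid values or mistyped codes (such as a lowercase "h"), and the PROC UNIVARIATE output provides the stem-and-leaf plots and normality tests mentioned earlier.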
Log File Produced by the SAS Program

Why review the log file? After submitting your SAS program, you should always review the log file prior to reviewing the output file. Verify that the analysis was performed correctly, and look for any error messages, warning messages, or other notes indicating that something went wrong. Log 16.1 contains the log file produced by the preceding program.

   NOTE: SAS initialization used:
         real time           20.53 seconds
   1    OPTIONS LS=80 PS=60;
   2    DATA D1;
   3       INPUT SUB_NUM
   4             CONSEQ $
   5             MOD_AGGR $
   6             SUB_AGGR;
   7    DATALINES;
   NOTE: The data set WORK.D1 has 30 observations and 4 variables.
   NOTE: DATA statement used:
         real time           1.43 seconds
   38   ;
   39   PROC GLM DATA=D1;
   40      CLASS CONSEQ MOD_AGGR;
   41      MODEL SUB_AGGR = CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
   42      MEANS CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
   43      MEANS CONSEQ MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;
   44      TITLE1 'JANE DOE';
   45   RUN;
   NOTE: Means from the MEANS statement are not adjusted for other terms
         in the model. For adjusted means, use the LSMEANS statement.
   NOTE: Means from the MEANS statement are not adjusted for other terms
         in the model. For adjusted means, use the LSMEANS statement.
   46   QUIT;
   NOTE: There were 30 observations read from the dataset WORK.D1.
   NOTE: PROCEDURE GLM used:
         real time           3.29 seconds

Log 16.1. Log file produced by the SAS program.
Note regarding the number of observations and variables. Log 16.1 provides no evidence of any obvious errors in conducting the analysis. For example, following line 7 of
the program, you can see a note that says "NOTE: The data set WORK.D1 has 30 observations and 4 variables." This is a good sign, because you intended your program to have 30 observations (subjects) and four variables.

Notes regarding the LSMEANS statement. Following line 45 of the program, you can see two notes that both say the same thing: "NOTE: Means from the MEANS statement are not adjusted for other terms in the model. For adjusted means, use the LSMEANS statement." This note is not necessarily a cause for alarm. When you are performing a two-way ANOVA, it is appropriate to use the MEANS statement (rather than the LSMEANS statement) to compute group means on the criterion variable as long as your experimental design is balanced. An experimental design is balanced if the same number of observations (subjects) appear in each cell of the design. For example, Figure 16.10 illustrates the research design used in the aggression study. It shows that there are five subjects in each cell of the design (that is, there are five subjects in the cell of subjects who experienced the "low" condition under Predictor A and the "model-rewarded" condition under Predictor B, there are five subjects in the cell of subjects who experienced the "moderate" condition under Predictor A and the "model-punished" condition under Predictor B, and so forth). Because your experimental design is balanced, there is no need to adjust the means from the analysis for other terms in the model. This means that it is acceptable to use the MEANS statement in your program, and it is not necessary to use the LSMEANS statement.

In contrast, a research design is typically unbalanced if some cells in the design contain a larger number of observations (subjects) than other cells. For example, again consider Figure 16.10. If there were 20 subjects in Cell MR-L, but only five subjects in each of the remaining five cells, then the experimental design would be unbalanced. Note that if you are analyzing data from an unbalanced design, using the MEANS statement may produce marginal means that are biased. Thus, to analyze data from an unbalanced design, it is generally preferable to use the LSMEANS (least-squares means) statement in your program, rather than the MEANS statement. This is because the LSMEANS statement will estimate the marginal means over a balanced population. The marginal means estimated by the LSMEANS statement are less likely to be biased.

In summary, if your experimental design is balanced, you can ignore the note about LSMEANS that appears in your log file. Because the design of your aggression study is balanced, it is appropriate to use the MEANS statement.

Analyzing unbalanced designs. For guidance in analyzing data from studies with unequal cell sizes, see the section "Using the LSMEANS Statement to Analyze Data from Unbalanced Designs," which appears toward the end of this chapter. A brief sketch of an LSMEANS request also appears after the page-by-page summary below.

Output Produced by the SAS Program

The preceding program would produce five pages of output. Most of this output will be presented in this section. The information that appears on each page is summarized below:
• Page 1 provides class level information and the number of observations in the data set.
• Page 2 provides the ANOVA summary table from the GLM procedure.
• Page 3 provides results from the first MEANS statement. These results consist of three tables that present means and standard deviations for the criterion variable. These means and standard deviations are broken down by the various levels that constitute the study:
   – The first table provides the means observed for each level of CONSEQ.
   – The second table provides the means observed for each level of MOD_AGGR.
   – The third table provides the means observed for each of the six cells in the study's factorial design.
• Page 4 provides the results of the Tukey multiple comparison procedure for Predictor B (CONSEQ).
• Page 5 provides the results of the Tukey multiple comparison procedure for Predictor A (MOD_AGGR).
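As noted in the discussion of the log file, the LSMEANS statement becomes relevant only if the design is unbalanced. The following is a minimal sketch of how such a request might look with the current variable names; the section on unbalanced designs near the end of this chapter describes when to rely on it instead of the MEANS statement:

   PROC GLM DATA=D1;
      CLASS CONSEQ MOD_AGGR;
      MODEL SUB_AGGR = CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
      LSMEANS CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;   /* least-squares (adjusted) means */
   RUN;
   QUIT;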
Steps in Interpreting the Output

Overview. The fictitious data set that was analyzed in this section was designed so that the interaction term would be nonsignificant. When the interaction is nonsignificant, interpreting the results of a two-factor ANOVA is very similar to interpreting the results of a one-factor ANOVA. This section begins by showing you how to review specific sections of output to verify that there were no obvious errors in entering data or in writing the program. It will then show you how to determine whether the interaction is significant, how to determine whether the main effects were significant, how to prepare an ANOVA summary table, and how to review the results of the multiple comparison procedures.

1. Make sure that the output looks correct. The output created by the GLM procedure in the preceding program contains information that may help identify possible errors in writing the program or in entering the data. This section shows how to review that information. You will begin by reviewing the class level information which appears on Page 1 of the PROC GLM output. This page is reproduced here as Output 16.1.

                              JANE DOE                               1
                          The GLM Procedure
                       Class Level Information

                  Class         Levels    Values
                  CONSEQ             2    MP MR
                  MOD_AGGR           3    H L M

                  Number of observations    30
Output 16.1. Verifying that everything looks correct on the class level information page; two-way ANOVA performed on aggression data, significant main effects, nonsignificant interaction.
Below the heading "Class," you will find the names of the classification variables (the predictor variables) in your analysis. In Output 16.1, you can see that the classification variables are CONSEQ and MOD_AGGR.

Under the heading "Levels" the output should indicate how many levels (or conditions) there are for each of your predictor variables. Output 16.1 shows that there were two levels for the predictor variable CONSEQ, and there were three levels for the predictor variable MOD_AGGR. This is as it should be.

Under the heading "Values" the output should indicate the specific numbers or letters that you used to code the two predictor variables. Output 16.1 shows that you used the values "MP" and "MR" to code conditions under CONSEQ, and that you used the values "H," "L," and "M" to code conditions under MOD_AGGR. This is all correct.

Remember that it is important to use uppercase and lowercase letters consistently when you are coding treatment conditions under the predictor variables. For example, the preceding paragraph indicated that you used uppercase letters (H, L, and M) in coding conditions under MOD_AGGR. If you had accidentally keyed a lowercase "h" instead of an uppercase "H" for one or more of your subjects, the SAS System would have interpreted that as a code for a new and different treatment condition. This, of course, would have led to errors in the analysis. Again, the point is that it is important to use uppercase and lowercase letters consistently when you are coding treatment conditions, because SAS treats uppercase letters and lowercase letters as different values.

Finally, the last line in Output 16.1 indicates the number of observations in the data set. The present experimental design consisted of six cells of subjects with five subjects in each cell, for a total of 30 subjects in the study. Output 16.1 indicates that your data set included 30 observations, and so everything appears to be correct at this point.

Page 2 of the output provides the analysis of variance table created by PROC GLM. It is reproduced here as Output 16.2.
                              JANE DOE                               2
                          The GLM Procedure

Dependent Variable: SUB_AGGR
                                  Sum of
Source              DF           Squares     Mean Square    F Value    Pr > F
Model                5       1456.166667      291.233333      22.32    0.0001
Error               24        313.200000       13.050000
Corrected Total     29       1769.366667

       R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
       0.822988      21.98263      3.612478         16.43333

Source              DF         Type I SS     Mean Square    F Value    Pr > F
CONSEQ               1        276.033333      276.033333      21.15    0.0001
MOD_AGGR             2       1178.866667      589.433333      45.17    0.0001
CONSEQ*MOD_AGGR      2          1.266667        0.633333       0.05    0.9527

Source              DF       Type III SS     Mean Square    F Value    Pr > F
CONSEQ               1        276.033333      276.033333      21.15    0.0001
MOD_AGGR             2       1178.866667      589.433333      45.17    0.0001
CONSEQ*MOD_AGGR      2          1.266667        0.633333       0.05    0.9527
The MSE (mean square error). The mean square error is an estimate of the error variance in your analysis. Item N from the analysis report provides a formal description of the results for a paper. The first paragraph of this report includes a statistic abbreviated as "MSE." The relevant section of Item N is reproduced as follows:

   N) Formal description of the results for a paper: Results were
      analyzed using a factorial ANOVA with two between-subjects
      factors. This analysis revealed a significant main effect for
      level of aggression displayed by the model, F(2, 24) = 45.17,
      MSE = 13.05, p = .0001.

The last sentence of the preceding excerpt indicates that "MSE = 13.05." In the output from PROC GLM, you will find it at the location where the row labeled "Error" intersects with the column headed "Mean Square." In Output 16.10, you can see that the mean square error is equal to 13.050000 ( ), which rounds to 13.05.

Analysis Report Regarding the Main Effect for Predictor B (Significant Effect)

This section shows how to prepare a report about Predictor B, "consequences for the model." You will remember that the main effect for Predictor B was significant in this analysis. There are some differences between the way that this report was prepared, versus the way that the report for Predictor A was prepared. See the notes following this analysis report for details.

A) Statement of the research question: One purpose of this study was to determine whether there was a relationship between (a) the consequences experienced by an aggressive model and (b) the number of aggressive acts later demonstrated by children who had observed this model.

B) Statement of the research hypothesis: Children who observe a model being rewarded for engaging in aggressive behavior will later demonstrate a greater number of aggressive acts, compared to children who observe a model being punished for engaging in aggressive behavior.

C) Nature of the variables: This analysis involved two predictor variables and one criterion variable:
   • Predictor A was the level of aggression displayed by the model. This was a limited-value variable, was assessed on an ordinal scale, and included three levels: low, moderate, and high.
   • Predictor B was the consequences for the model. This was a dichotomous variable, was assessed on a nominal scale, and included two levels: model rewarded and model punished. This was the predictor variable that was relevant to the present hypothesis.
   • The criterion variable was the number of aggressive acts displayed by the subjects after observing the model. This was a multi-value variable, and was assessed on a ratio scale.

D) Statistical test: Factorial ANOVA with two between-subjects factors
E) Statistical null hypothesis (H0): µB1 = µB2; In the population, there is no difference between subjects in the model-rewarded condition versus subjects in the model-punished condition with respect to mean scores on the criterion variable (the number of aggressive acts displayed by the subjects).

F) Statistical alternative hypothesis (H1): µB1 ≠ µB2; In the population, there is a difference between subjects in the model-rewarded condition versus subjects in the model-punished condition with respect to their mean scores on the criterion variable (the number of aggressive acts displayed by the subjects).

G) Obtained statistic: F(1, 24) = 21.15

H) Obtained probability (p) value: p = .0001

I) Conclusion regarding the statistical null hypothesis: Reject the null hypothesis.

J) Multiple comparison procedure: A multiple comparison procedure was not necessary because this predictor variable included just two levels.

K) Confidence interval: Subtracting the mean of the model-punished condition from the mean of the model-rewarded condition resulted in an observed difference of 6.067. The 95% confidence interval for this difference ranged from 3.344 to 8.789.

L) Effect size: R2 = .16, indicating that consequences for the model accounted for 16% of the variance in subject aggression.

M) Conclusion regarding the research hypothesis: These findings provide support for the study's research hypothesis that children who observe a model being rewarded for engaging in aggressive behavior will later demonstrate a greater number of aggressive acts, compared to children who observe a model being punished for engaging in aggressive behavior.
N) Formal description of the results for a paper: Results were analyzed by using a factorial ANOVA with two between-subjects factors. This analysis revealed a significant main effect for consequences for the model, F(1, 24) = 21.15, MSE = 13.05, p = .0001. On the criterion variable (number of aggressive acts displayed by the subjects), the mean score for the model-rewarded condition was 19.47 (SD = 7.32), and the mean score for the model-punished condition was 13.40 (SD = 7.29). Sample means for the various conditions that constituted the study are displayed in Figure 16.11. Subtracting the mean of the model-punished condition from the mean of the model-rewarded condition resulted in an observed difference of 6.067. The 95% confidence interval for this difference ranged from 3.344 to 8.789. In the analysis, R2 for this main effect was computed as .16. This indicated that consequences for the model accounted for 16% of the variance in subject aggression.

O) Figure representing the results: See Figure 16.11.
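As a quick arithmetic check on Item L: the R2 values in this chapter are computed by dividing the sum of squares for an effect by the corrected total sum of squares, so the value of .16 can be reproduced from the sums of squares shown in the GLM output for this analysis:

   R2 = 276.03 / 1769.37 = .156, which rounds to .16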
Notes Regarding the Preceding Analysis Report

Overview. Many sections of the analysis report for Predictor B were prepared using the same format that was used for the analysis report for Predictor A. Therefore, to conserve space, this section will not repeat explanations that were provided following the analysis report for Predictor A (such as the explanation of where to look in the SAS output for the MSE statistic). Instead, this section will discuss the ways in which the report for Predictor B differed from the report from Predictor A.

Some sections of the analysis report for Predictor B were completed in a way that differed from the report for Predictor A. In most cases, this was because Predictor B consisted of only two levels (model-rewarded versus model-punished), while Predictor A consisted of three levels (low versus moderate versus high). Because Predictor B involved two levels, several sections of its analysis report used a format similar to that used for an independent-samples t test. To refresh your memory on how reports were prepared for that statistic, see Chapter 13, "Independent-Samples t Test," the section "Example 13.1: Observed Consequences for Modeled Aggression: Effects on Subsequent Subject Aggression (Significant Differences)," in the subsection "Summarizing the Results of the Analysis."

Information about the main effect for Predictor B. Much of the information about the main effect for Predictor B came from Output 16.10, in the section for the Type III sum of squares, in the row labeled CONSEQ ( ). This includes the F statistic, the p value, as well as other information.

Item F, the statistical alternative hypothesis. For Predictor B, the statistical alternative hypothesis was stated in symbolic terms as follows:

   µB1 ≠ µB2

The above format is similar to the format used with an independent-samples t test. Remember that this format is appropriate only for predictor variables that contain two levels.

Item G, the obtained statistic. Item G from the preceding analysis report presented the F statistic and degrees of freedom for Predictor B (CONSEQ). This item is reproduced again here:

   G) Obtained statistic: F(1, 24) = 21.15

The degrees of freedom for this main effect appear in parentheses above. The number 1 is the degree of freedom for the numerator used in computing the F statistic for the CONSEQ main effect. This "1" comes from the ANOVA summary table that was created by PROC GLM, and was reproduced in Output 16.10, presented earlier. Specifically, this "1" appears in Output 16.10 at the location where the row labeled "CONSEQ" intersects with the column headed "DF" ( ). The second number in item G, 24, represents the degrees of freedom for the denominator used in computing the F statistic. This "24" appears in Output 16.10 at the location where the row labeled "Error" intersects with the column headed "DF" ( ).

Item K, the confidence interval. Notice that, in the analysis report for Predictor B, Item K does not refer the reader to a table that presents the confidence intervals (as was the case with Predictor A). Instead, because there is only one confidence interval to report, it is described in the text of the report itself.

Item N, the formal description of the results for a paper. Item N of the preceding report provides means and standard deviations for the model-rewarded and model-punished conditions under Predictor B. These statistics may be found in Output 16.9, in the section for the CONSEQ variable ( ). The relevant portion of that output is presented again as Output 16.11.
                              JANE DOE                               3
                          The GLM Procedure

   Level of            -----------SUB_AGGR----------
   CONSEQ        N          Mean          Std Dev
   MP           15    13.4000000       7.28795484
   MR           15    19.4666667       7.31794923
Output 16.11. Means and standard deviations for conditions under Predictor B (consequences for model).
Analysis Report Concerning the Interaction (Nonsignificant Effect)

An earlier section stated that an interaction is an outcome for which the relationship between one predictor variable and the criterion variable is different at different levels of the second predictor variable. Output 16.10 showed that the interaction term for the current analysis was nonsignificant. Below is an example of how this result could be described in a report.

A) Statement of the research question: One purpose of this study was to determine whether there was a significant interaction between (a) the level of aggression displayed by the model, and (b) the consequences experienced by the model in the prediction of (c) the number of aggressive acts later displayed by subjects who have observed the model.

B) Statement of the research hypothesis: The positive relationship between the level of aggression displayed by the model and subsequent subject aggressive acts will be stronger for subjects who have seen the model rewarded than for subjects who have seen the model punished.

C) Nature of the variables: This analysis involved two predictor variables and one criterion variable:
   • Predictor A was the level of aggression displayed by the model. This was a limited-value variable, was assessed on an ordinal scale, and included three levels: low, moderate, and high.
   • Predictor B was the consequences for the model. This was a dichotomous variable, was assessed on a nominal scale, and included two levels: model rewarded and model punished.
   • The criterion variable was the number of aggressive acts displayed by the subjects after observing the model. This was a multi-value variable, and was assessed on a ratio scale.

D) Statistical test: Factorial ANOVA with two between-subjects factors
E) Statistical null hypothesis (H0): In the population, there is no interaction between the level of aggression displayed by the model and the consequences for the model in the prediction of the criterion variable (the number of aggressive acts displayed by the subjects).

F) Statistical alternative hypothesis (H1): In the population, there is an interaction between the level of aggression displayed by the model and the consequences for the model in the prediction of the criterion variable (the number of aggressive acts displayed by the subjects).

G) Obtained statistic: F(2, 24) = 0.05

H) Obtained probability (p) value: p = .9527

I) Conclusion regarding the statistical null hypothesis: Fail to reject the null hypothesis.

J) Multiple comparison procedure: Not relevant.

K) Confidence intervals: Not relevant.

L) Effect size: R2 = .00, indicating that the interaction term accounted for none of the variance in subject aggression.

M) Conclusion regarding the research hypothesis: These findings fail to provide support for the study's research hypothesis that the positive relationship between the level of aggression displayed by the model and subsequent subject aggressive acts will be stronger for subjects who have seen the model rewarded than for subjects who have seen the model punished.

N) Formal description of the results for a paper: Results were analyzed by using a factorial ANOVA with two between-subjects factors. This analysis revealed a nonsignificant F statistic for the interaction between the level of aggression displayed by the model and the consequences for the model, F(2, 24) = 0.05, MSE = 13.05, p = .9527. Sample means for the various conditions that constituted the study are displayed in Figure 16.11. In the analysis, R2 for this interaction effect was computed as .00. This indicated that the interaction accounted for none of the variance in subject aggression.

O) Figure representing the results: See Figure 16.11.
Notes Regarding the Preceding Analysis Report

Item B, statement of the research hypothesis. Item B (above) not only predicts that there will be an interaction, but also makes a specific prediction about the nature of that interaction. The type of interaction described here was chosen arbitrarily; remember that an interaction can be expressed in an infinite variety of forms.

Items G and H. The degrees of freedom for the interaction term, along with the F statistic and p value for the interaction term, appeared in Output 16.10. The relevant portion of that output is presented again here as Output 16.12. The information relevant to the interaction appears to the right of the heading "CONSEQ*MOD_AGGR" ( ).

   Source               DF     Type III SS     Mean Square    F Value    Pr > F
   CONSEQ                1      276.033333      276.033333      21.15    0.0001
   MOD_AGGR              2     1178.866667      589.433333      45.17    0.0001
   CONSEQ*MOD_AGGR       2        1.266667        0.633333       0.05    0.9527

The first number in Item G, "2," appears where this row intersects with the column headed "DF," and the p value of .9527 appears where it intersects with the column headed "Pr > F" ( ). The second number in Item G, "24," represents the degrees of freedom for the denominator of the F ratio. Again, these degrees of freedom may be found in Output 16.10 at the location where the row titled "Error" intersects with the column titled "DF."
Example of a Factorial ANOVA Revealing Nonsignificant Main Effects and a Nonsignificant Interaction

Overview

This section presents the results of a factorial ANOVA in which the main effects for Predictor A and Predictor B, along with the interaction term, are all nonsignificant. These results are presented so that you will be prepared to write analysis reports for projects in which nonsignificant outcomes are observed.
The Complete SAS Program

The study presented here is the same aggression study described in the preceding section. The data will be analyzed with the same SAS program that was presented earlier, in the section "Writing the SAS Program." Here, the data have been changed so that they will produce nonsignificant results. The complete SAS program, including the new data set, is presented below:

   OPTIONS LS=80 PS=60;
   DATA D1;
      INPUT SUB_NUM
            CONSEQ $
            MOD_AGGR $
            SUB_AGGR;
   DATALINES;
   01 MR L 15
   02 MR L 22
   03 MR L 19
   04 MR L 16
   05 MR L 11
   06 MR M 16
   07 MR M 24
   08 MR M 10
   09 MR M 17
   10 MR M 17
   11 MR H 17
   12 MR H 12
   13 MR H 24
   14 MR H 20
   15 MR H 15
   16 MP L 14
   17 MP L 7
   18 MP L 22
   19 MP L 15
   20 MP L 13
   21 MP M 14
   22 MP M 21
   23 MP M 11
   24 MP M 9
   25 MP M 19
   26 MP H 15
   27 MP H 9
   28 MP H 10
   29 MP H 20
   30 MP H 21
   ;
   PROC GLM DATA=D1;
      CLASS CONSEQ MOD_AGGR;
      MODEL SUB_AGGR = CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
      MEANS CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
      MEANS CONSEQ MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;
      TITLE1 'JANE DOE';
   RUN;
   QUIT;
Steps in Interpreting the Output

Overview. As was the case with the earlier data set, the SAS program performing this analysis would produce five pages of output. This section will present only those sections of output that are relevant to preparing the ANOVA summary table, the graph, and the analysis reports. You will notice that the steps listed in this section are not identical to the steps listed in the preceding section. Some of the steps (such as "1. Make sure that everything looks correct") are not included here because the key concepts have already been covered. Some other steps are not included here because they are typically not appropriate when effects are nonsignificant.

1. Determine whether the interaction term is statistically significant. As was the case before, you determine whether the interaction is significant by reviewing the ANOVA summary table produced by PROC GLM. This table appears on output page 2, and is reproduced here as Output 16.13.

                              JANE DOE                               2
                          The GLM Procedure

Dependent Variable: SUB_AGGR
                                  Sum of
Source              DF           Squares     Mean Square    F Value    Pr > F
Model                5        45.3666667       9.0733333       0.37    0.8667
Error               24       594.8000000      24.7833333
Corrected Total     29       640.1666667

       R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
       0.070867      31.44181      4.978286         15.83333

Source              DF         Type I SS     Mean Square    F Value    Pr > F
CONSEQ               1       40.83333333     40.83333333       1.65    0.2115
MOD_AGGR             2        4.06666667      2.03333333       0.08    0.9215
CONSEQ*MOD_AGGR      2        0.46666667      0.23333333       0.01    0.9906

Source              DF       Type III SS     Mean Square    F Value    Pr > F
CONSEQ               1       40.83333333     40.83333333       1.65    0.2115
MOD_AGGR             2        4.06666667      2.03333333       0.08    0.9215
CONSEQ*MOD_AGGR      2        0.46666667      0.23333333       0.01    0.9906

Output 16.13. ANOVA summary table for two-way ANOVA performed on aggression data, nonsignificant main effects, nonsignificant interaction.
As was the case with the earlier data set, you will review the results of the analyses that appear in the section headed “Type III SS” ( ), as opposed to the section headed “Type I SS.” To determine whether the interaction is significant, look to the right of the heading “CONSEQ*MOD_AGGR” ( ). Here, you can see that the F statistic is only 0.01, with a p value of .9906. This p value is larger than our standard criterion of .05, and so you know that
the interaction is nonsignificant. Since the interaction is nonsignificant, you may proceed to interpret the significance tests for the main effects.

2. Determine whether either of the two main effects is statistically significant. Information regarding the main effect for Predictor A appears to the right of the heading "MOD_AGGR" ( ). You can see that the F value for this effect is 0.08, with a nonsignificant p value of .9215. Information regarding Predictor B appears to the right of "CONSEQ" ( ). This factor displayed a nonsignificant F statistic of 1.65 (p = .2115).

3. Prepare your own version of the ANOVA summary table. The completed ANOVA summary table for this analysis is presented here as Table 16.4.

Table 16.4
ANOVA Summary Table for Study Investigating the Relationship Between Level of Aggression Displayed by Model (A), Consequences for Model (B), and Subject Aggression (Nonsignificant Interaction, Nonsignificant Main Effects)
_______________________________________________________________________
Source                          df        SS       MS       F       p      R2
_______________________________________________________________________
Model aggression (A)             2      4.07     2.03    0.08   .9215    .01
Consequences for model (B)       1     40.83    40.83    1.65   .2115    .06
A X B Interaction                2      0.47     0.23    0.01   .9906    .00
Within groups                   24    594.80    24.78
Total                           29    640.17
_______________________________________________________________________
Note: N = 30
Notice how information from Output 16.13 was used to fill in the relevant sections of Table 16.4:

• Information from the row labeled "MOD_AGGR" ( ) in Output 16.13 was transferred to the row labeled "Model aggression (A)" in Table 16.4.
• Information from the row labeled "CONSEQ" ( ) in Output 16.13 was transferred to the row labeled "Consequences for model (B)" in Table 16.4.
• Information from the row labeled "CONSEQ*MOD_AGGR" ( ) in Output 16.13 was transferred to the row labeled "A × B Interaction" in Table 16.4.
• Information from the row labeled "Error" ( ) in Output 16.13 was transferred to the row labeled "Within groups" in Table 16.4.
• Information from the row labeled "Corrected Total" ( ) in Output 16.13 was transferred to the row labeled "Total" in Table 16.4.
The R2 entries in Table 16.4 were computed by dividing the sum of squares for a particular effect by the total sum of squares. For detailed guidelines on constructing an ANOVA summary table (such as Table 16.4), see the subsection “4. Prepare your own version of the
ANOVA summary table" from the major section "Example of a Factorial ANOVA Revealing Significant Main Effects and a Nonsignificant Interaction," presented earlier.

4. Prepare a table that displays the confidence intervals for Predictor A. Notice that, unlike the previous section, this section does not advise you to review the results of the Tukey multiple comparison procedures to determine whether there are significant differences between the various levels of Predictor A or Predictor B. This is because the main effects for these two predictor variables were nonsignificant, and you normally should not interpret the results of multiple comparison procedures for main effects that are not significant. However, the Tukey test that you requested also computes confidence intervals for the differences between the means. Some readers might be interested in seeing these confidence intervals, even when the main effects are not statistically significant. To conserve space, this section will not present the SAS output that shows the results of the Tukey tests and confidence intervals. However, it will show how to summarize the information that was presented in that output for a paper.

Predictor A was the level of aggression displayed by the model. This predictor contained three levels, and therefore produced three confidence intervals for differences between the means. Because this is a fairly large amount of information to present, these confidence intervals will be summarized in Table 16.5 here:

Table 16.5
Results of Tukey Tests for Predictor A: Comparing High-Model-Aggression Group versus Moderate-Model-Aggression Group versus Low-Model-Aggression Group on the Criterion Variable (Subject Aggression)
________________________________________________________
                                      Simultaneous 95%
                    Difference        confidence limits
                    between           _________________
Comparison (b)      means (a)         Lower       Upper
________________________________________________________
High - Moderate       0.5000         -5.0599      6.0599
High - Low            0.9000         -4.6599      6.4599
Moderate - Low        0.4000         -5.1599      5.9599
________________________________________________________
Note: N = 30
(a) Differences are computed by subtracting the mean for the second group from the mean for the first group.
(b) Tukey test indicates that none of the differences between the means were significant at p < .05.
5. Summarize the confidence interval for Predictor B in the text of your paper. In contrast, Predictor B (consequences for the model) contained only two levels and therefore produced only one confidence interval for a difference between the means. This interval could be summarized in the text of your paper in this way:
Subtracting the mean of the model-punished condition from the mean of the model-rewarded condition resulted in an observed difference of 2.333. The 95% confidence interval for this difference ranged from –1.418 to 6.085.

Using a Figure to Illustrate the Results

When all main effects and interactions are nonsignificant, researchers usually do not illustrate means for the various conditions in a graph. However, this section will present a graph, simply to provide an additional example of how to prepare a graph from the cell means that are provided in the SAS output. Page 3 of the SAS output from the current program provides means and standard deviations for the various treatment conditions manipulated in the current study. These results are presented here as Output 16.14.
                              JANE DOE                               3
                          The GLM Procedure

   Level of            -----------SUB_AGGR----------
   CONSEQ        N          Mean          Std Dev
   MP           15    14.6666667       4.95215201
   MR           15    17.0000000       4.27617987

   Level of            -----------SUB_AGGR----------
   MOD_AGGR      N          Mean          Std Dev
   H            10    16.3000000       4.98998998
   L            10    15.4000000       4.69515116
   M            10    15.8000000       4.87168691

   Level of   Level of          -----------SUB_AGGR----------
   CONSEQ     MOD_AGGR    N          Mean          Std Dev
   MP         H           5    15.0000000       5.52268051
   MP         L           5    14.2000000       5.35723809
   MP         M           5    14.8000000       5.11859356
   MR         H           5    17.6000000       4.61519230
   MR         L           5    16.6000000       4.15932687
   MR         M           5    16.8000000       4.96990946
As was noted earlier, this page of output actually contains three tables of means. However, you will be interested only in the third table––the one that presents the means broken down by both Predictor A and Predictor B ( ). Figure 16.12 graphs the six cell means from the third table in Output 16.14. As before, cell means from subjects in the model-rewarded condition are displayed as circles on a solid line, while cell means from subjects in the model-punished condition are displayed as triangles on a solid line.
Figure 16.12. Mean number of aggressive acts as a function of the level of aggression displayed by the model and the consequences for the model (nonsignificant main effects and nonsignificant interaction).
To conserve space, this section will not repeat the steps that you should follow when transferring mean scores from Output 16.14 to Figure 16.12. For detailed guidelines on how to construct a graph such as that in Figure 16.12, see the subsection "Steps in Preparing the Graph" from the major section "Example of a Factorial ANOVA Revealing Significant Main Effects and a Nonsignificant Interaction," presented earlier.

Interpreting Figure 16.12

Notice how the graphic pattern of results in Figure 16.12 is consistent with the statistical results reported in Output 16.13. Specifically:

• In Figure 16.12, the corresponding line segments are parallel to one another. This is the "railroad track" pattern that you would expect when the interaction is nonsignificant (as reported in Output 16.13).
• None of the line segments going from "Low" to "Moderate" or from "Moderate" to "High" display a substantial angle. This is consistent with the fact that Output 16.13 reported a nonsignificant main effect for Predictor A (model aggression).
• Finally, the solid line for the model-rewarded group is not substantially separated from the broken line for the model-punished group. This is consistent with the finding from Output 16.13 that the main effect for Predictor B (consequences for the model) was nonsignificant
(it is true that the solid line is slightly separated from the broken line, but the separation is not enough to attain statistical significance).

Analysis Report Concerning the Main Effect for Predictor A (Nonsignificant Effect)

The analysis report in this section provides an example of how to write up the results for a predictor variable that has three levels and is not statistically significant. Items A through F of this report would be identical to items A through F of the analysis report in the section titled "Analysis Report Concerning the Main Effect for Predictor A (Significant Effect)," presented earlier. These sections stated the research question, the research hypothesis, and so on. Therefore, those items will not be presented again in this section.

(Items A through F would normally appear here)

G) Obtained statistic: F(2, 24) = 0.08

H) Obtained probability (p) value: p = .9215

I) Conclusion regarding the statistical null hypothesis: Fail to reject the null hypothesis.

J) Multiple comparison procedure: The multiple comparison procedure was not appropriate because the F statistic for the main effect was nonsignificant.

K) Confidence intervals: Confidence intervals for differences between the means are presented in Table 16.5.

L) Effect size: R2 = .01, indicating that model aggression accounted for 1% of the variance in subject aggression.

M) Conclusion regarding the research hypothesis: These findings fail to provide support for the study's research hypothesis that there will be a positive relationship between the level of aggression displayed by a model and the number of aggressive acts later demonstrated by children who observed the model.

N) Formal description of the results for a paper: Results were analyzed by using a factorial ANOVA with two between-subjects factors. This analysis revealed a nonsignificant main effect for level of aggression displayed by the model, F(2, 24) = 0.08, MSE = 24.78, p = .9215. On the criterion variable (number of aggressive acts displayed by subjects), the mean score for the high-model-aggression condition was 16.30 (SD = 4.99), the mean for the moderate-model-aggression condition was 15.80 (SD = 4.87), and
the mean for the low-model-aggression condition was 15.40 (SD = 4.70). Sample means for the various conditions that constituted the study are displayed in Figure 16.12. Confidence intervals for differences between the means (based on Tukey's HSD test) are presented in Table 16.5. In the analysis, R2 for this main effect was computed as .01. This indicated that model aggression accounted for 1% of the variance in subject aggression.
O) Figure representing the results: See Figure 16.12.
Notice that the preceding "Formal description of the results for a paper" did not discuss the results of the Tukey HSD test (other than to refer to the confidence intervals). This is because the results of multiple comparison procedures are typically not discussed when the main effect is nonsignificant.

Analysis Report Concerning the Main Effect for Predictor B (Nonsignificant Effect)

The analysis report in this section provides an example of how to write up the results for a predictor variable that has two levels and is not statistically significant. Items A through F of this report would be identical to items A through F of the analysis report in the section titled "Analysis Report Concerning the Main Effect for Predictor B (Significant Effect)," presented earlier. These sections stated the research question, the research hypothesis, and so on. Therefore, those items will not be presented again in this section.

(Items A through F would normally appear here)

G) Obtained statistic: F(1, 24) = 1.65

H) Obtained probability (p) value: p = .2115

I) Conclusion regarding the statistical null hypothesis: Fail to reject the null hypothesis.

J) Multiple comparison procedure: Not relevant.

K) Confidence intervals: Subtracting the mean of the model-punished condition from the mean of the model-rewarded condition resulted in an observed difference of 2.333. The 95% confidence interval for this difference extended from –1.418 to 6.085.

L) Effect size: R2 = .06, indicating that consequences for the model accounted for 6% of the variance in subject aggression.

M) Conclusion regarding the research hypothesis: These findings fail to provide support for the study's research
hypothesis that children who observe a model being rewarded for engaging in aggressive behavior will later demonstrate a greater number of aggressive acts, compared to children who observe a model being punished for engaging in aggressive behavior.

N) Formal description of the results for a paper: Results were analyzed by using a factorial ANOVA with two between-subjects factors. This analysis revealed a nonsignificant main effect for consequences for the model, F(1, 24) = 1.65, MSE = 24.78, p = .2115. On the criterion variable (number of aggressive acts displayed by subjects), the mean score for the model-rewarded condition was 17.00 (SD = 4.28), and the mean for the model-punished condition was 14.67 (SD = 4.95). Sample means for the various conditions that constituted the study are displayed in Figure 16.12. Subtracting the mean of the model-punished condition from the mean of the model-rewarded condition resulted in an observed difference of 2.333. The 95% confidence interval for this difference ranged from –1.418 to 6.085. In the analysis, R2 for this main effect was computed as .06. This indicated that consequences for the model accounted for 6% of the variance in subject aggression.

O) Figure representing the results: See Figure 16.12.
Analysis Report Concerning the Interaction (Nonsignificant Effect)

An earlier section of this chapter has already shown how to prepare a report for a nonsignificant interaction term. Therefore, to save space, a similar report will not be provided here. To find the previous example, see subsection "Analysis Report Concerning the Interaction (Nonsignificant Effect)" within the major section "Example of a Factorial ANOVA Revealing Significant Main Effects and a Nonsignificant Interaction."
Example of a Factorial ANOVA Revealing a Significant Interaction

Overview

This section presents the results of a factorial ANOVA in which the interaction between Predictor A and Predictor B is significant. You will remember that an interaction means that the relationship between one predictor variable and the criterion variable is different at different levels of the second predictor variable. These results are presented so that you will be prepared to write analysis reports for projects in which significant interactions are observed.

The Complete SAS Program

The study presented here is the same aggression study that was described in the preceding sections. The data will be analyzed with the same SAS program that was presented earlier in the section headed "Writing the SAS Program." Here, the data have been changed so that they will produce a significant interaction term. The complete SAS program, including the new data set, is presented below:

   OPTIONS LS=80 PS=60;
   DATA D1;
      INPUT SUB_NUM
            CONSEQ $
            MOD_AGGR $
            SUB_AGGR;
   DATALINES;
   01 MR L 10
   02 MR L 8
   03 MR L 12
   04 MR L 10
   05 MR L 9
   06 MR M 14
   07 MR M 15
   08 MR M 17
   09 MR M 16
   10 MR M 16
   11 MR H 18
   12 MR H 20
   13 MR H 20
   14 MR H 23
   15 MR H 21
   16 MP L 9
   17 MP L 8
   18 MP L 7
   19 MP L 11
   20 MP L 10
   21 MP M 10
   22 MP M 12
   23 MP M 13
   24 MP M 9
   25 MP M 9
   26 MP H 11
   27 MP H 12
   28 MP H 9
   29 MP H 13
   30 MP H 11
   ;
   PROC GLM DATA=D1;
      CLASS CONSEQ MOD_AGGR;
      MODEL SUB_AGGR = CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
      MEANS CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
      MEANS CONSEQ MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;
      TITLE1 'JANE DOE';
   RUN;
   QUIT;
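A common follow-up to a significant interaction is a test of simple effects (for example, testing the effect of CONSEQ separately at each level of MOD_AGGR). One way to request such tests in PROC GLM is the SLICE= option of the LSMEANS statement. The statement below is only a sketch of this approach and is not part of the program above; it would be added to the PROC GLM step:

   LSMEANS CONSEQ*MOD_AGGR / SLICE=MOD_AGGR;   /* tests the CONSEQ effect at each level of MOD_AGGR */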
Steps in Interpreting the Output

Overview. As was the case with the earlier data set, the SAS program performing this analysis would produce five pages of output. This section will present only those sections of output that are relevant to preparing the ANOVA summary table, the graph, and the analysis reports. You will notice that the steps listed in this section are not identical to the steps listed in the preceding sections. Some of the steps (such as "1. Make sure that everything looks correct") are not included here because the key concepts have already been covered. Other steps in this section are different because, when an interaction is significant, it is necessary to follow a special sequence of steps.

1. Determine whether the interaction term is statistically significant. As was the case before, you determine whether the interaction is significant by reviewing the ANOVA summary table produced by PROC GLM. This table appears on output page 2, and is reproduced here as Output 16.15.
                              JANE DOE                               2
                          The GLM Procedure

Dependent Variable: SUB_AGGR
                                  Sum of
Source              DF           Squares     Mean Square    F Value    Pr > F
Model                5       482.1666667      96.4333333      39.09
Error               24        59.2000000       2.4666667
Corrected Total     29       541.3666667

       R-Square     Coeff Var      Root MSE
       0.890647      12.30206      1.570563