Stata Reference Manual: Release 7, Volume 2, H-P. Stata Press, College Station, Texas
Stata Press, 4905 Lakeway Drive, College Station, Texas 77845

Copyright (c) 1985-2001 by Stata Corporation
All rights reserved
Version 7.0

Typeset in TeX
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

ISBN 1-881228-47-9 (volumes 1-4)
ISBN 1-881228-48-7 (volume 1)
ISBN 1-881228-49-5 (volume 2)
ISBN 1-881228-50-9 (volume 3)
ISBN 1-881228-51-7 (volume 4)

This manual is protected by copyright. All rights are reserved. No part of this manual may be reproduced, stored in a retrieval system, or transcribed, in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise) without the prior written permission of Stata Corporation (StataCorp).

StataCorp provides this manual "as is" without warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. StataCorp may make improvements and/or changes in the product(s) and the program(s) described in this manual at any time and without notice.

The software described in this manual is furnished under a license agreement or nondisclosure agreement. The software may be copied only in accordance with the terms of the agreement. It is against the law to copy the software onto CD, disk, diskette, tape, or any other medium for any purpose other than backup or archival purposes.

The automobile dataset appearing on the accompanying media is Copyright (c) 1979, 1993 by Consumers Union of U.S., Inc., Yonkers, NY 10703-1057 and is reproduced by permission from CONSUMER REPORTS, April 1979, April 1993.

The Stata for Windows installation software was produced using Wise Installation System, which is Copyright (c) 1994-2000 Wise Solutions, Inc. Portions of the Macintosh installation software are Copyright (c) 1990-2000 Aladdin Systems, Inc., and Raymond Lau.

Stata is a registered trademark and NetCourse is a trademark of Stata Corporation. Alpha and DEC are trademarks of Compaq Computer Corporation. AT&T is a registered trademark of American Telephone and Telegraph Company. HP-UX and HP LaserJet are registered trademarks of Hewlett-Packard Company. IBM and OS/2 are registered trademarks and AIX, PowerPC, and RISC System/6000 are trademarks of International Business Machines Corporation. Linux is a registered trademark of Linus Torvalds. Macintosh is a registered trademark and Power Macintosh is a trademark of Apple Computer, Inc. MS-DOS, Microsoft, and Windows are registered trademarks of Microsoft Corporation. Pentium is a trademark of Intel Corporation. PostScript and Display PostScript are registered trademarks of Adobe Systems, Inc. SPARC is a registered trademark of SPARC International, Inc. Stat/Transfer is a trademark of Circle Systems. Sun, SunOS, SunView, Solaris, and NFS are trademarks or registered trademarks of Sun Microsystems, Inc. TeX is a trademark of the American Mathematical Society. UNIX and OPEN LOOK are registered trademarks and X Window System is a trademark of The Open Group Limited. WordPerfect is a registered trademark of Corel Corporation.

The suggested citation for this software is StataCorp. 2001. Stata Statistical Software: Release 7.0. College Station, TX: Stata Corporation.
Title

hadimvo -- Identify multivariate outliers
Syntax

    hadimvo varlist [if exp] [in range], generate(newvar1 [newvar2]) [p(#)]
Description

hadimvo identifies multiple outliers in multivariate data using the method of Hadi (1992, 1994), creating newvar1 equal to 1 if an observation is an "outlier" and 0 otherwise. Optionally, newvar2 can also be created containing the distances from the basic subset.
Options

generate(newvar1 [newvar2]) is not optional; it identifies the new variable(s) to be created. Whether you specify two variables or one, however, is optional. newvar1, which is required, will contain 1 if the observation is an outlier in the Hadi sense and 0 otherwise. Specifying gen(odd) would call this variable odd. newvar2, if specified, will contain the distances (not the distances squared) from the basic subset. Specifying gen(odd dist) creates odd and also creates dist containing the Hadi distances.

p(#) specifies the significance level for the outlier cutoff; 0 < # < 1. The default is p(.05). Larger numbers identify a larger proportion of the sample as outliers. If # is greater than 1, it is interpreted as a percent. Thus, p(5) is the same as p(.05).
Remarks

Multivariate analysis techniques are commonly used to analyze data from many fields of study. The data often contain outliers. The search for subsets of the data which, if deleted, would change results markedly is known as the search for outliers. hadimvo provides one computer-intensive but practical method for identifying such observations.

Classical outlier detection methods (e.g., Mahalanobis distance and Wilks' test) are powerful when the data contain only one outlier, but the power of these methods decreases drastically when more than one outlying observation is present. The loss of power is usually due to what are known as masking and swamping problems (false negative and false positive decisions), but in addition, these methods often fail simply because they are affected by the very observations they are supposed to identify.

Solutions to these problems often involve an unreasonable amount of calculation and therefore computer time. (Solutions involving hundreds of millions of calculations for samples as small as 30 have been suggested.) The method developed by Hadi (1992, 1994) attempts to surmount these problems and produce an answer, albeit second best, in finite time.

A basic outline of the procedure is as follows: A measure of distance from an observation to a cluster of points is defined. A base cluster of r points is selected and then that cluster is continually redefined by taking the r + 1 points "closest" to the cluster as the new base cluster. This continues until some rule stops the redefinition of the cluster.
Ignoring many of the fine details, given k variables, the initial base cluster is defined as r = k + 1 points. The distance that is minimized in selecting these k + 1 points is a covariance-matrix distance on the variables with their medians removed. (We will use the language loosely; if we were being more precise, we would have said the distance is based on a matrix of second moments, but remember, the medians of the variables have been removed. We would also discuss how the k + 1 points must be of full column rank and how they would be expanded to include additional points if they are not.) Given the base cluster, a more standard mean-based center of the r-observation cluster is defined and the r + 1 observations closest in the covariance-matrix sense are chosen as a new base cluster. This is then repeated until the base cluster has r = int{(n + k + 1)/2} points. At this point, the method continues in much the same way, except a stopping rule based on the distance of the additional point and the user-specified p() is introduced. Simulation results are presented in Hadi (1994).
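The forward search just outlined can be sketched in a few lines of Python. This is an illustrative simplification only, not the hadimvo implementation: it omits the full-column-rank checks, the small-sample correction factor applied to the covariance matrix of the clean subset, and the exact p()-based stopping rule, and it returns squared distances for the caller to threshold.

```python
import numpy as np

def hadi_distances(X):
    """Sketch of Hadi's forward search: squared distances from the final
    basic subset.  Illustrative only -- omits hadimvo's rank checks,
    small-sample scaling, and p()-based stopping rule."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    # Step 1: distances with medians removed; the metric is a matrix of
    # second moments about the medians, not a true covariance matrix
    Z = X - np.median(X, axis=0)
    d = np.einsum('ij,jk,ik->i', Z, np.linalg.inv(Z.T @ Z / (n - 1)), Z)
    base = np.argsort(d)[:k + 1]              # initial base cluster, r = k+1
    # Step 2: grow the cluster until it holds int((n + k + 1)/2) points
    while len(base) < (n + k + 1) // 2:
        Z = X - X[base].mean(axis=0)          # mean-based center of the cluster
        S = np.cov(X[base], rowvar=False)
        d = np.einsum('ij,jk,ik->i', Z, np.linalg.inv(S), Z)
        base = np.argsort(d)[:len(base) + 1]  # keep the r+1 closest points
    # Final distances measured from the basic ("clean") subset
    Z = X - X[base].mean(axis=0)
    S = np.cov(X[base], rowvar=False)
    return np.einsum('ij,jk,ik->i', Z, np.linalg.inv(S), Z)
```

An observation would then be declared an outlier when its distance exceeds a cutoff tied to the chosen p(); in hadimvo the clean-subset covariance matrix is first rescaled by the small-sample correction factor discussed below, which this sketch skips, so its raw distances are not on the chi-squared scale.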
Examples

    . hadimvo price weight, gen(odd)
    . list if odd                            /* list the outliers */
    . summ price weight if ~odd              /* summary stats for clean data */
    . drop odd
    . hadimvo price weight, gen(odd D)
    . gen id=_n                              /* make an index variable */
    . graph D id                             /* index plot of D */
    . graph price weight [w=D]               /* 2-way scatter, outliers big */
    . graph price weight [w=1/D]             /* same, outliers small */
    . summarize D, detail
    . sort D
    . list make price weight D odd
    . hadimvo price weight mpg, gen(odd2 D2) p(.01)
    . regress price weight ... if ~odd2
Identifying outliers

You have a theory about x1, x2, ..., xk, which we will write as F(x1, x2, ..., xk). Your theory might be that x1, x2, ..., xk are jointly distributed normally, perhaps with a particular mean and covariance matrix; or your theory might be that

    x1 = b1 + b2*x2 + ... + bk*xk + u,    where u ~ N(0, s^2)

or your theory might be

    x1 = b10 + b12*x2 + b14*x4 + u1
    x2 = b20 + b21*x1 + b23*x3 + u2

or your theory might be anything else; it does not matter. You have some data on x1, x2, ..., xk, which you will assume are generated by F(.), and from that data you plan to estimate the parameters (if any) of your theory and then test your theory in the sense of how well it explains the observed data.
What if, however, some of your data are generated not by F(.) but by G(.), a different process? For example, you have a theory on how wages are assigned to employees in a firm and, for the bulk of employees, that theory is correct. There are, however, six employees at the top of the hierarchy for whom wages are set by a completely different process. Or, you have a theory on how individuals select different health insurance options except that, for a handful of individuals already diagnosed with serious illness, a different process controls the selection. Or, you are testing a drug that reduces trauma after surgery except that, for a few patients with a high level of a particular protein, the drug has no effect. Or, in another drug experiment, some of the historical data are simply misrecorded.

The data values generated by G(.) rather than F(.) are called contaminant observations. Of course, the analysis should be based only on the observations generated by F(.), but in practice we do not know which observations those are. In addition, if it happened by chance that some of the contaminants are within a reasonable distance from the center of F(.), it becomes impossible to determine whether they are contaminants. Accordingly, we adopt the following operational definition: Outliers are observations that do not conform to the pattern suggested by the majority of the observations in a dataset. Under this definition, observations generated by F(.) but located at the tail of F(.) are considered outliers. On the other hand, contaminants that are within a statistically reasonable distance from the center of F(.) are not considered outliers.

It is well worth noting that outliership is strongly related to the completeness of the theory: a grand unified theory would have no outliers because it would explain all processes (including, one supposes, errors in recording the data). Grand unified theories, however, are difficult to come by and are most often developed by synthesizing the results of many special theories.

Theoretical work has tended to focus on one special case: data containing only one outlier. As mentioned above, the single-outlier techniques often fail to identify multiple outliers, even if applied recursively. One of the classic examples is the star cluster data (a.k.a. the Hertzsprung-Russell diagram) shown in the figure below (Rousseeuw and Leroy 1987, 27). For 47 stars, the data contain the (log) light intensity and the (log) effective temperature at the star's surface. (For the sake of illustration, we treat the data here as bivariate data, not as regression data; i.e., the two variables are treated similarly with no distinction between which variable is dependent and which is independent.)

This graph presents a scatter of the data along with two ellipses expected to contain 95% of the data. The larger ellipse is based on the mean and covariance matrix of the full data. All 47 stars are inside the larger ellipse, indicating that classical single-case analysis fails to identify any outliers. The smaller ellipse is based on the mean and covariance matrix of the data without the five stars identified by hadimvo as outliers. These observations are located outside the smaller ellipse. The dramatic effects of the outliers can be seen by comparing the two ellipses. The volume of the larger ellipse is much greater than that of the smaller one and the two ellipses have completely different orientations. In fact, their major axes are nearly orthogonal to each other; the larger ellipse indicates a negative correlation (r = -0.2) whereas the smaller ellipse indicates a positive correlation (r = 0.7). (Theory would suggest a positive correlation: hot things glow.)
(Figure: scatterplot of log light intensity against log temperature for the 47 stars, with the two 95% ellipses; the five stars identified by hadimvo lie outside the smaller ellipse.)
The single-outlier techniques make calculations for each observation under the assumption that it is the only outlier and the remaining n - 1 observations are generated by F(.), producing a statistic for each of the n observations. Thinking about multiple outliers is no more difficult. In the case of two outliers, consider all possible pairs of observations (there are n(n - 1)/2 of them) and, for each pair, make a calculation assuming the remaining n - 2 observations are generated by F(.). For the three-outlier case, consider all possible triples of observations (there are n(n - 1)(n - 2)/(3 x 2) of them) and, for each triple, make a calculation assuming the remaining n - 3 observations are generated by F(.). Conceptually, this is easy but, practically, it is difficult because of the rapidly increasing number of calculations required (there are also theoretical problems in determining how many outliers to test simultaneously). Techniques designed for detecting multiple outliers, therefore, make various simplifying assumptions to reduce the calculation burden and, along the way, lose some of the theoretical foundation. This loss, however, is no reason for ignoring the problem and the (admittedly second best) solutions available today.

It is unreasonable to assume that outliers do not occur in real data. If outliers exist in the data, they can distort parameter estimation, invalidate test statistics, and lead to incorrect statistical inference. The search for outliers is not merely to improve the estimates of the current model but also to provide valuable insight into the shortcomings of the current model. In addition, outliers themselves can sometimes provide valuable clues as to where more effort should be expended. In a drug experiment, for example, the patients excluded as outliers might well be further researched to understand why they do not fit the theory.
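The combinatorial burden described above is easy to tabulate; a few lines of Python (illustrative only) count the candidate subsets whose deletion would have to be examined, using the binomial coefficient:

```python
from math import comb

n = 30  # even a small sample
for m in (1, 2, 3):
    # number of size-m candidate outlier sets to consider:
    # comb(n, m) = n! / (m! (n-m)!)
    print(m, comb(n, m))
```

Already for triples the count is in the thousands (comb(30, 3) = 4060), and allowing subsets of every size makes the total grow exponentially in n, which is why the multiple-outlier methods discussed here trade exhaustive search for simplifying assumptions.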
Multivariate, multiple outliers

hadimvo is an example of a multivariate, multiple-outlier technique. The multivariate aspect deserves some attention. In the single-equation regression techniques for identifying outliers, such as residuals and leverage, an important distinction is drawn between the dependent and independent variables, the y and the x's in y = xb + u. The notion that y is a linear function of x can be exploited and, moreover, the fact that some point (y_i, x_i) is "far" from the bulk of the other points has different meanings if that "farness" is due to y_i or x_i. A point that is far due to x_i but, despite that, still close in the y_i-given-x_i metric adds precision to the measurements of the coefficients and may not indicate a problem at all. In fact, if we have the luxury of designing the experiment, which means choosing the values of x a priori, we attempt to maximize the distance between the x's (within
the bounds of x we believe are covered by our linear model) to maximize that precision. In that extreme case, the distance of x_i carries no information as we set it prior to running the experiment. More recently, Hadi and Simonoff (1993) exploit the structure of the linear model and suggest two methods for identifying multiple outliers when the model is fitted to the data (also see [R] regression diagnostics).

In the multivariate case, we do not know the structure of the model, so (y_i, x_i) is just a point and the y is treated no differently from any of the x's, a fact which we emphasize by writing the point as (x_1i, x_2i) or simply (X_i). The technique does assume, however, that the X's are multivariate normal or at least elliptically symmetric. This leads to a problem if some of the X's are functionally related to the other X's, such as the inclusion of x and x^2, interactions such as x1*x2, or even dummy variables for multiple categories (in which one of the dummies being 1 means the other dummies must be 0). There is no good solution to this problem. One idea, however, is to perform the analysis with and without the functionally related variables and to subject all observations identified to further study (see What to do with outliers below).
An implication of hadimvo being a multivariate technique is that it would be inappropriate to apply it to (y, x) when x is the result of experimental design. The technique would know nothing of our design of x and would inappropriately treat "distance" in the x-metric the same as distance in the y-metric. Even when x is multivariate normal, unless y and x are treated similarly, it may still be inappropriate to apply hadimvo to (y, x) because of the different roles that y and x play in regression. However, one may apply hadimvo to x to identify outliers which, in this case, are called leverage points. (We should also mention here that if hadimvo is applied to x when it contains constants or any collinear variables, those variables will be correctly ignored, allowing the analysis to continue.)

It is also inappropriate to use hadimvo (and other outlier detection techniques) when the sample size is too small. hadimvo uses a small-sample correction factor to adjust the covariance matrix of the "clean" subset. Because the quantity n - (3k + 1) appears in the denominator of the correction factor, the sample size must be larger than 3k + 1. Some authors would require the sample size to be at least 5k, i.e., at least five observations per variable.

With these warnings, it is difficult to misapply this tool, assuming that you do not take the results as more than suggestive. hadimvo has a p() option that is a "significance level" for the outliers that are chosen. We quote the term significance level because, although great effort has been expended to make it a significance level, approximations are involved and it will not really have that interpretation in all cases. It can be thought of as an index between 0 and 1, with increasing values resulting in the labeling of more observations as outliers, and with the suggestion that you select a number much as you would a significance level--it is roughly the probability of identifying any given point as an outlier if the data truly were multivariate normal. Nevertheless, the terms significance level and critical values should be taken with a grain of salt. It is suggested that one examine a graphical display (e.g., an index plot) of the distances with perhaps different values of p(). The graphs give more information than a simple yes/no answer. For example, the graph may indicate that some of the observations (inliers or outliers) are only marginally so.
What to do with outliers

After a reading of the literature on outlier detection, many people are left with the incorrect impression that once outliers are identified, they should be deleted from the data and analysis should be continued. Automatic deletion (or even automatic down-weighting) of outliers is not always correct because outliers are not necessarily bad observations. On the contrary, if they are correct, they may be the most informative points in the data. For example, they may indicate that the data do not come from a normally distributed population, as is commonly assumed by almost all multivariate techniques.
The proper use of this tool is to label the outliers and then subject them to further study, not simply to discard them and continue the analysis with the rest of the data. After further study, it may indeed turn out to be reasonable to discard the outliers, but some mention of the outliers must certainly be made in the presentation of the final results. Other corrective actions may include correction of errors in the data, deletion or down-weighting of outliers, redesigning the experiment or sample survey, collecting more data, etc.
Saved Results

hadimvo saves in r():

Scalars
    r(N)      number of outliers
Methods and Formulas

hadimvo is implemented as an ado-file.

Formulas are given in Hadi (1992, 1994).
Acknowledgment

We would like to thank Ali S. Hadi of Cornell University for his assistance in writing hadimvo.
References

Gould, W. W. and A. S. Hadi. 1993. smv6: Identifying multivariate outliers. Stata Technical Bulletin 11: 28-32. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 163-168.

Hadi, A. S. 1992. Identifying multiple outliers in multivariate data. Journal of the Royal Statistical Society, Series B 54: 761-771.

------. 1994. A modification of a method for the detection of outliers in multivariate samples. Journal of the Royal Statistical Society, Series B 56: 393-396.

Hadi, A. S. and J. S. Simonoff. 1993. Procedures for the identification of multiple outliers in linear models. Journal of the American Statistical Association 88: 1264-1272.

Rousseeuw, P. J. and A. M. Leroy. 1987. Robust Regression and Outlier Detection. New York: John Wiley & Sons.
Also See

Related:  [R] mvreg, [R] regression diagnostics, [R] sureg
Title

hausman -- Hausman specification test
Syntax

    hausman, save

    hausman [, {more | less} constant alleqs skipeqs(eqlist) sigmamore
        prior(string) current(string) equations(matchlist)]

    hausman, clear

where matchlist in equations() is

    #:#[, #:#[, ...]]

For instance, equations(1:1), equations(1:1, 2:2), or equations(1:2).
Description

hausman performs Hausman's (1978) specification test.
Options

save requests that Stata save the current estimation results. hausman will later compare these results with the estimation results from another model. A model must be saved in this fashion before a test against other models can be performed.

more specifies that the most recently estimated model is the more efficient estimate. This is the default.

less specifies that the most recently estimated model is the less efficient estimate.

constant specifies that the estimated intercept(s) are to be included in the model comparison; by default, they are excluded. The default behavior is appropriate for models where the constant does not have a common interpretation across the two models.

alleqs specifies that all the equations in the model be used to perform the Hausman test; by default, only the first equation is used.

skipeqs(eqlist) specifies in eqlist the names of equations to be excluded from the test. Equation numbers are not allowed in this context, as it is the equation names, along with the variable names, that are used to identify common coefficients.

sigmamore allows you to specify that the two covariance matrices used in the test be based on a common estimate of disturbance variance (sigma^2), the variance from the fully efficient estimator. This option provides a proper estimate of the contrast variance for so-called tests of exogeneity and over-identification in instrumental variables regression; see Baltagi (1998, 291). Note that this option can only be specified when both estimators save e(sigma) or e(rmse).

prior(string) and current(string) are formatting options that allow you to specify alternate wording for the "Prior" and "Current" default labels used to identify the columns of coefficients.
equations(matchlist) specifies, by number, the pairs of equations that are to be compared. If equations() is not specified, then equations are matched on equation names. equations() handles the situation where one estimator uses equation names and the other does not. For instance, equations(1:2) means equation 1 is to be tested against equation 2. equations(1:1, 2:2) means equation 1 is to be tested against equation 1 and equation 2 is to be tested against equation 2. If equations() is specified, options alleqs and skipeqs are ignored.
clear discards the previously saved estimation results and frees some memory; it is not necessary to specify hausman, clear before specifying hausman, save.
Remarks

hausman is a general implementation of Hausman's (1978) specification test that compares an estimator that is known to be consistent with an estimator that is efficient under the assumption being tested. The null hypothesis is that the efficient estimator is a consistent and efficient estimator of the true parameters. If it is, there should be no systematic difference between the coefficients of the efficient estimator and a comparison estimator that is known to be consistent for the true parameters. If the two models display a systematic difference in the estimated coefficients, then we have reason to doubt the assumptions on which the efficient estimator is based.

To use hausman, you

    (estimate the less efficient model)
    . hausman, save
    (estimate the fully efficient model)
    . hausman

Alternatively, you can turn this around:

    (estimate the fully efficient model)
    . hausman, save
    (estimate the less efficient model)
    . hausman, less
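The statistic behind this comparison is H = (b - B)'(V_b - V_B)^(-1)(b - B), where b and V_b come from the consistent (less efficient) estimator and B and V_B from the efficient one, referred to a chi-squared distribution. A minimal numerical illustration of that formula (in Python rather than Stata, with made-up toy numbers, and ignoring hausman's handling of rank-deficient difference matrices):

```python
import numpy as np

def hausman_stat(b, V_b, B, V_B):
    """H = (b-B)' (V_b - V_B)^(-1) (b-B), with degrees of freedom len(b)."""
    d = np.asarray(b, dtype=float) - np.asarray(B, dtype=float)
    V_diff = np.asarray(V_b, dtype=float) - np.asarray(V_B, dtype=float)
    H = float(d @ np.linalg.inv(V_diff) @ d)
    return H, len(d)

# Toy numbers (not from the examples below): two coefficients whose
# estimates differ slightly between a consistent and an efficient fit.
H, df = hausman_stat(b=[1.0, 2.0], V_b=np.diag([0.04, 0.04]),
                     B=[1.1, 1.9], V_B=np.diag([0.02, 0.02]))
print(H, df)   # H = 1.0 on 2 degrees of freedom for these numbers
```

A large H relative to the chi-squared distribution with df degrees of freedom is evidence against the assumptions underlying the efficient estimator, which is exactly how the Prob>chi2 lines in the output below are read.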
Example

We are studying the factors that affect the wages of young women in the United States between 1970 and 1988 and have a panel-data sample of individual women over that time span.
    . describe

    Contains data from nlswork.dta
      obs:        28,534                      National Longitudinal Survey.
                                                Young Women 14-26 years of age
                                                in 1968
     vars:             6                      1 Aug 2000 09:48
     size:       485,078 (88.4% of memory free)

                  storage  display     value
    variable name   type   format      label      variable label
    idcode          int    %8.0g                  NLS id
    year            byte   %8.0g                  interview year
    age             byte   %8.0g                  age in current year
    msp             byte   %8.0g                  1 if married, spouse present
    ttl_exp         float  %9.0g                  total work experience
    ln_wage         float  %9.0g                  ln(wage/GNP deflator)

    Sorted by:  idcode  year
    Note:  dataset has changed since last saved
We believe that a random-effects specification is appropriate for individual-level effects in our model. We estimate a fixed-effects model that will capture all temporally constant individual-level effects.

    . xtreg ln_wage age msp ttl_exp, fe

    Fixed-effects (within) regression               Number of obs      =     28494
    Group variable (i) : idcode                     Number of groups   =      4710

    R-sq:  within  = 0.1373                         Obs per group: min =         1
           between = 0.2571                                        avg =       6.0
           overall = 0.1800                                        max =        15

                                                    F(3,23781)         =   1262.01
    corr(u_i, Xb)  = 0.1476                         Prob > F           =    0.0000

         ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   -.005485    .000837    -6.55   0.000    -.0071256   -.0038443
             msp |   .0033427   .0054868     0.61   0.542    -.0074118    .0140971
         ttl_exp |   .0383604   .0012416    30.90   0.000     .0359268    .0407941
           _cons |   1.593953   .0177538    89.78   0.000     1.559154    1.628752
    -------------+----------------------------------------------------------------
         sigma_u |  .37674223
         sigma_e |  .29751014
             rho |  .61591044   (fraction of variance due to u_i)
    -------------+----------------------------------------------------------------
    F test that all u_i=0:     F(4709,23781) =     7.76      Prob > F = 0.0000
We assume that this model is consistent for the true parameters and save the results by typing

    . hausman, save
Now, we estimate a random-effects model as a fully efficient specification of the individual effects under the assumption that they follow a random-normal distribution. These estimates are then compared to the previously saved results using the hausman command.
    . xtreg ln_wage age msp ttl_exp, re

    Random-effects GLS regression                   Number of obs      =     28494
    Group variable (i) : idcode                     Number of groups   =      4710

    R-sq:  within  = 0.1373                         Obs per group: min =         1
           between = 0.2552                                        avg =       6.0
           overall = 0.1797                                        max =        15

    Random effects u_i ~ Gaussian                   Wald chi2(3)       =   5100.33
    corr(u_i, X)   = 0 (assumed)                    Prob > chi2        =    0.0000

         ln_wage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |  -.0069749   .0006882   -10.13   0.000    -.0083238   -.0056259
             msp |   .0046594   .0051012     0.91   0.361    -.0053387    .0146575
         ttl_exp |   .0429635   .0010169    42.25   0.000     .0409704    .0449567
           _cons |   1.609916   .0159176   101.14   0.000     1.578718    1.641114
    -------------+----------------------------------------------------------------
         sigma_u |  .32648519
         sigma_e |  .29751014
             rho |  .54633481   (fraction of variance due to u_i)
    . hausman

                     ---- Coefficients ----
                 |      (b)          (B)           (b-B)     sqrt(diag(V_b-V_B))
                 |     Prior        Current     Difference          S.E.
    -------------+--------------------------------------------------------------
             age |    -.005485    -.0069749       .0014899        .0004764
             msp |    .0033427     .0046594      -.0013167        .0020206
         ttl_exp |    .0383604     .0429635      -.0046031        .0007124

               b = less efficient estimates obtained previously from xtreg
               B = fully efficient estimates obtained from xtreg

        Test:  Ho:  difference in coefficients not systematic

                chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                        =   275.44
                Prob>chi2 =    0.0000
Using the current specification, our initial hypothesis that the individual-level effects are adequately modeled by a random-effects model is resoundingly rejected. We realize, of course, that this result is based on the rest of our model specification and it is entirely possible that random effects might be appropriate for some alternate model of wages.

hausman is a generic implementation of the Hausman test and assumes that the user knows exactly what they want tested. The test between random and fixed effects is so common that Stata provides a special command for use after xtreg. We could have obtained the above test in a slightly different format by typing

    . xthausman

    Hausman specification test

                     ---- Coefficients ----
         ln_wage |      Fixed        Random
                 |     Effects       Effects      Difference
    -------------+------------------------------------------
             age |    -.005485     -.0069749       .0014899
             msp |    .0033427      .0046594      -.0013167
         ttl_exp |    .0383604      .0429635      -.0046031

        Test:  Ho:  difference in coefficients not systematic

                chi2(3) = (b-B)'[S^(-1)](b-B),  S = (S_fe - S_re)
                        =   275.44
                Prob>chi2 =    0.0000
Example

A stringent assumption of multinomial and conditional logit models is that the outcome categories for the model have the property of independence of irrelevant alternatives (IIA). Stated simply, this assumption requires that the inclusion or exclusion of categories does not affect the relative risks associated with the regressors in the remaining categories.

One classic example of a situation where this assumption would be violated involves the choice of transportation mode; see McFadden (1974). For simplicity, postulate a transportation model with four possible outcomes: rides a train to work, takes a bus to work, drives the Ford to work, and drives the Chevrolet to work. Clearly, "drives the Ford" is a closer substitute to "drives the Chevrolet" than it is to "rides a train" (at least for most people). This means that excluding "drives the Ford" from the model could be expected to affect the relative risks of the remaining options and the model would not obey the IIA assumption.

Using the data presented in [R] mlogit, we will use a simplified model to test for IIA. Choice of insurance type among indemnity, prepaid, and uninsured is modeled as a function of age and gender. The indemnity category is allowed to be the base category and the model including all three outcomes is estimated.
. mlogit insure age male

Iteration 0:   log likelihood = -555.85446
Iteration 1:   log likelihood = -551.32973
Iteration 2:   log likelihood = -551.32802

Multinomial regression                            Number of obs   =        615
                                                  LR chi2(4)      =       9.05
                                                  Prob > chi2     =     0.0598
Log likelihood = -551.32802                       Pseudo R2       =     0.0081

------------------------------------------------------------------------------
      insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Prepaid      |
         age |  -.0100251   .0060181    -1.67   0.096    -.0218204    .0017702
        male |   .5095747   .1977893     2.58   0.010     .1219148    .8972345
       _cons |   .2633838   .2787574     0.94   0.345    -.2829708    .8097383
-------------+----------------------------------------------------------------
Uninsure     |
         age |  -.0051925   .0113821    -0.46   0.648     -.027501     .017116
        male |   .4748547   .3618446     1.31   0.189    -.2343477    1.184057
       _cons |  -1.756843   .5309591    -3.31   0.001    -2.797504   -.7161824
------------------------------------------------------------------------------
(Outcome insure==Indem is the comparison group)

. hausman, save
hausman -- Hausman specification test

Under the IIA assumption, we would expect no systematic change in the coefficients if we excluded one of the outcomes from the model. (For an extensive discussion, see Hausman and McFadden 1984.) We re-estimate the model, excluding the uninsured outcome, and perform a Hausman test against the fully efficient full model.

. mlogit insure age male if insure ~= "Uninsure":insure
Iteration 0:   log likelihood =  -394.8693
Iteration 1:   log likelihood =  -390.4871
Iteration 2:   log likelihood = -390.48643

Multinomial regression                            Number of obs   =        570
                                                  LR chi2(2)      =       8.77
                                                  Prob > chi2     =     0.0125
Log likelihood = -390.48643                       Pseudo R2       =     0.0111

------------------------------------------------------------------------------
      insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Prepaid      |
         age |  -.0101521   .0060049    -1.69   0.091    -.0219214    .0016173
        male |   .5144003   .1981735     2.60   0.009     .1259875    .9028132
       _cons |   .2678043   .2775562     0.96   0.335    -.2761959    .8118046
------------------------------------------------------------------------------
(Outcome insure==Indem is the comparison group)

. hausman, alleqs less constant

                 ---- Coefficients ----
             |      (b)          (B)         (b-B)     sqrt(diag(V_b-V_B))
             |    Current       Prior     Difference          S.E.
-------------+------------------------------------------------------------
         age |   -.0101521    -.0100251     -.0001269             .
        male |    .5144003     .5095747      .0048256             .
       _cons |    .2678043     .2633838      .0044205       .012334
---------------------------------------------------------------------------
           b = less efficient estimates obtained from mlogit
           B = fully efficient estimates obtained previously from mlogit

    Test:  Ho:  difference in coefficients not systematic

              chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                      =     0.08
              Prob>chi2 =    0.9944
First, note that the somewhat subtle syntax of the if condition on the mlogit command was simply used to identify the "Uninsured" category using the insure value label; see [U] 15.6.3 Value labels.

Second, since the Hausman test is a standardized comparison of model coefficients, using it with mlogit requires that the base category be the same in both competing models. In particular, if the most frequent category (the default base category) is being removed to test for IIA, then you must use the basecategory() option in mlogit to manually set the base category to something else.

On examining the output from hausman, we see that there is no evidence that the IIA assumption has been violated.
The missing values for the square root of the diagonal of the covariance matrix of the differences are not comforting, but they are also not surprising. This covariance matrix is guaranteed to be positive definite only asymptotically, and no assurances are made about the diagonal elements. Negative values along the diagonal are possible, and the fourth column of the table is provided mainly for descriptive use.

We can also perform the Hausman IIA test against the remaining alternative in the model.

. mlogit insure age male if insure ~= "Prepaid":insure
Iteration 0:   log likelihood = -132.59915
Iteration 1:   log likelihood = -131.78009
Iteration 2:   log likelihood = -131.76808
Iteration 3:   log likelihood = -131.76807

Multinomial regression                            Number of obs   =        338
                                                  LR chi2(2)      =       1.66
                                                  Prob > chi2     =     0.4356
Log likelihood = -131.76807                       Pseudo R2       =     0.0063
------------------------------------------------------------------------------
      insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Uninsure     |
         age |  -.0041055   .0115807    -0.35   0.723    -.0268033    .0185923
        male |   .4591072   .3595663     1.28   0.202    -.2456298    1.163844
       _cons |  -1.801774   .5474476    -3.29   0.001    -2.874752   -.7287968
------------------------------------------------------------------------------
(Outcome insure==Indem is the comparison group)
. hausman, alleqs less constant

                 ---- Coefficients ----
             |      (b)          (B)         (b-B)     sqrt(diag(V_b-V_B))
             |    Current       Prior     Difference          S.E.
-------------+------------------------------------------------------------
         age |   -.0041055    -.0051925       .001087             .
        male |    .4591072     .4748547     -.0157475      .0021357
       _cons |   -1.801774    -1.756843     -.0449311      .1333464
---------------------------------------------------------------------------
           b = less efficient estimates obtained from mlogit
           B = fully efficient estimates obtained previously from mlogit

    Test:  Ho:  difference in coefficients not systematic

              chi2(3) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                      =    -0.18    chi2<0 ==> model estimated on these
                                    data fails to meet the asymptotic
                                    assumptions of the Hausman test
In this case, the chi-squared statistic is actually negative. We might interpret this as strong evidence that we cannot reject the null hypothesis. Such a result is not an unusual outcome for the Hausman test, particularly when the sample is relatively small; there are only 45 uninsured individuals in this dataset.

Are we surprised by the results of the Hausman test in this example? Not really. Judging from the z statistics on the original multinomial logit model, we were struggling to identify any structure in the data with the current specification. Even when we were willing to make the assumption of IIA and estimate the most efficient model under this assumption, few of the effects could be identified as statistically different from those on the base category. Trying to base a Hausman test on a contrast (difference) between two poor estimates is just asking too much of the existing data.
heckman assumes that wage is the dependent variable and that the first variable list (educ and age) contains the determinants of wage. The variables specified in the select() option (married, children, educ, and age) are assumed to determine whether the dependent variable is observed (the selection equation). Thus, we estimated the model

    wage = β0 + β1 educ + β2 age + u1

and we assumed that wage is observed if

    γ0 + γ1 married + γ2 children + γ3 educ + γ4 age + u2 > 0

where u1 and u2 have correlation ρ.
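A data-generating process of exactly this form can be simulated directly. The sketch below is Python for illustration only; every coefficient value in it is invented for the sketch, not an estimate from this dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
educ     = rng.integers(10, 18, n).astype(float)
age      = rng.integers(20, 60, n).astype(float)
married  = rng.integers(0, 2, n).astype(float)
children = rng.integers(0, 4, n).astype(float)

rho, sigma = 0.7, 6.0
# build (u1, u2) so that corr(u1/sigma, u2) = rho
u2 = rng.standard_normal(n)
u1 = sigma * (rho * u2 + np.sqrt(1 - rho**2) * rng.standard_normal(n))

wage = 1.0 + 1.0 * educ + 0.2 * age + u1                    # outcome equation
select = (-2.5 + 0.45 * married + 0.45 * children
          + 0.06 * educ + 0.035 * age + u2) > 0             # selection rule
wage_obs = np.where(select, wage, np.nan)                   # wage seen only if selected
```

Because u1 and u2 are correlated, the observed wages are a nonrandom subsample, which is precisely the situation the selection equation is meant to model.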
The reported results for the wage equation are interpreted exactly as though we observed wage data for all women in the sample; the coefficients on age and education level represent the estimated marginal effects of the regressors in the underlying regression equation. The results for the two ancillary parameters require some explanation. heckman does not directly estimate ρ; to constrain ρ within its valid limits, and for numerical stability during optimization, it estimates the inverse hyperbolic tangent of ρ:

    atanh ρ = (1/2) ln{ (1 + ρ) / (1 − ρ) }

heckman -- Heckman selection model

This estimate is reported as /athrho. In the bottom panel of the output, heckman undoes this transformation for you: the estimated value of ρ is .7035061. The standard error for ρ is computed using the delta method, and its confidence interval is the transformed interval of /athrho.

Similarly, σ, the standard error of the residual in the wage equation, is not directly estimated; for numerical stability, heckman instead estimates ln σ. The untransformed sigma is reported at the end of the output: 6.004797.

Finally, some researchers, especially economists, are used to the selectivity effect summarized not by ρ but by λ = ρσ. heckman reports this, too, along with an estimate of the standard error and confidence interval.
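The back-transformations heckman performs can be reproduced by hand from the reported /athrho and /lnsigma point estimates. A quick check in Python (values are the point estimates discussed in the surrounding text):

```python
import math

athrho, lnsigma = 0.8742086, 1.792559       # as reported by heckman
rho   = math.tanh(athrho)                    # undo atanh: about .7035061
sigma = math.exp(lnsigma)                    # undo log:   about 6.004797
lam   = rho * sigma                          # lambda = rho*sigma: about 4.224412

# confidence limits transform the same way; e.g., for rho:
rho_lo = math.tanh(0.5991596)                # about .5364513
rho_hi = math.tanh(1.149258)                 # about .817508
```

Because tanh maps the whole real line into (−1, 1) and exp maps it into (0, ∞), the transformed intervals automatically respect the parameters' natural bounds.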
q
□ Technical Note

If each of the equations in the model had contained many regressors, the heckman command could become quite long. An alternate way of specifying our wage model would make use of Stata's global macros. The following lines are an equivalent way of estimating our model.

. global wageeq "wage educ age"
. global seleq "married children educ age"
. heckman $wageeq, select($seleq)

□
□ Technical Note

The reported model chi-squared test is a Wald test that all coefficients in the regression model (except the constant) are 0. heckman is an estimation command, so you can use test, testnl, or lrtest to perform tests against whatever nested alternate model you choose; see [R] test, [R] testnl, and [R] lrtest.

The estimation of ρ and σ in the form atanh ρ and ln σ extends the range of these parameters to infinity in both directions, thus avoiding boundary problems during the maximization. Tests of ρ must be made in the transformed units. However, since atanh(0) = 0, the reported test for atanh ρ = 0 is equivalent to the test for ρ = 0.

The likelihood-ratio test reported at the bottom of the output is an equivalent test for ρ = 0 and is computationally the comparison of the joint likelihood of an independent probit model for the selection equation and a regression model on the observed wage data against the Heckman model likelihood. The z = 8.619 and chi-squared of 61.20, both significantly different from zero, clearly justify the Heckman selection equation with these data.

□
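Computationally, that likelihood-ratio test is just a comparison of two log likelihoods. A sketch in Python (the three log-likelihood values below are illustrative placeholders chosen to reproduce a chi-squared of 61.20; they are not values printed in this entry):

```python
# hypothetical log likelihoods for the two competing specifications:
ll_probit  = -1027.06   # independent probit for the selection equation alone
ll_regress = -4181.84   # regression model on the observed wage data alone
ll_heckman = -5178.30   # joint Heckman model

# LR statistic: twice the gain from modeling the equations jointly (1 df)
lr = -2 * ((ll_probit + ll_regress) - ll_heckman)
```

Under the null of ρ = 0, the joint model cannot improve on the two independent pieces, so lr is compared against a chi-squared distribution with one degree of freedom.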
Example

heckman supports the Huber/White/sandwich estimator of variance under the robust and cluster() options, or when pweights are used for population-weighted data; see [U] 23.11 Obtaining robust variance estimates. We can obtain robust standard errors for our wage model by specifying clustering on county of residence (the county variable).
. heckman wage educ age, select(married children educ age) cluster(county)

Iteration 0:   log likelihood = -5178.7009
Iteration 1:   log likelihood = -5178.3049
Iteration 2:   log likelihood = -5178.3045

Heckman selection model                           Number of obs   =       2000
(regression model with sample selection)          Censored obs    =        657
                                                  Uncensored obs  =       1343
                                                  Wald chi2(1)    =     272.17
Log likelihood = -5178.304                        Prob > chi2     =     0.0000

                          (standard errors adjusted for clustering on county)
------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
wage         |
   education |   .9899537   .0600061    16.50   0.000     .8723438    1.107564
         age |   .2131294    .020995    10.15   0.000       .17198    .2542789
       _cons |   .4857752   1.302103     0.37   0.709    -2.066299      3.03785
-------------+----------------------------------------------------------------
select       |
     married |   .4451721   .0731472     6.09   0.000     .3018062    .5885379
    children |   .4387068   .0312386    14.04   0.000     .3774802    .4999333
   education |   .0557318   .0110039     5.06   0.000     .0341645    .0772991
         age |   .0365098    .004038     9.04   0.000     .0285954    .0444242
       _cons |  -2.491015   .1153305   -21.60   0.000    -2.717059   -2.264972
-------------+----------------------------------------------------------------
     /athrho |   .8742086   .1403337     6.23   0.000     .5991596    1.149258
    /lnsigma |   1.792559   .0258458    69.36   0.000     1.741902    1.843216
-------------+----------------------------------------------------------------
         rho |   .7035061   .0708796                      .5364513     .817508
       sigma |   6.004797    .155199                      5.708189    6.316818
      lambda |   4.224412   .5186709                      3.207835    5.240988
------------------------------------------------------------------------------
Wald test of indep. eqns. (rho = 0):  chi2(1) =    38.81   Prob > chi2 = 0.0000
The robust standard errors tend to be a bit larger, but we do not notice any systematic differences. This is not surprising, since the data were not constructed to have any county-specific correlations or other characteristics that would deviate from the assumptions of the Heckman model.

q
Example

The default statistic produced by predict after heckman is the expected value of the dependent variable from the underlying distribution of the regression model. In our wage model, this is the expected wage rate among all women, regardless of whether they were observed to participate in the labor force.

. predict heckwage
(option xb assumed; fitted values)

It is instructive to compare these predicted wage values from the Heckman model with those from an ordinary regression model, one fit without the selection adjustment:
. regress wage educ age

      Source |       SS       df       MS              Number of obs =    1343
-------------+------------------------------           F(  2,  1340) =  227.49
       Model |  13524.0337     2  6762.01687           Prob > F      =  0.0000
    Residual |  39830.8609  1340  29.7245231           R-squared     =  0.2535
-------------+------------------------------           Adj R-squared =  0.2524
       Total |  53354.8948  1342  39.7577456           Root MSE      =   5.452

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   education |   .8965829   .0498061    18.00   0.000     .7988765    .9942893
         age |   .1465739   .0187135     7.83   0.000      .109863    .1832848
       _cons |   6.084875   .8896182     6.84   0.000     4.339679    7.830071
------------------------------------------------------------------------------

. predict regwage
(option xb assumed; fitted values)

. summarize heckwage regwage

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
    heckwage |    2000    21.15532    3.83965     14.6479   32.85949
     regwage |    2000    23.12291   3.241911    17.98218   32.66439
Since this dataset was concocted, we know the true coefficients of the wage regression equation to be 1, 0.2, and 1, respectively. We can compute the true mean wage for our sample.

. gen truewage = 1 + .2*age + 1*educ
. sum truewage

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
    truewage |    2000     21.3256   3.797904          15       32.8
Whereas the mean of the predictions from heckman is within 18 cents of the true mean wage, ordinary regression yields predictions that are on average about $1.80 per hour too high due to the selection effect. The regression predictions also show somewhat less variation than the true wages.

The coefficients from heckman are so close to the true values that they are not worth testing. Conversely, the regression equation is significantly off, but seems to give the right sense. Would we be led far astray if we relied on the OLS coefficients? The effect of age is off by over 5 cents per year of age, and the coefficient on education level is off by about 10%. We can test the OLS coefficient on education level against the true value using test.

. test educ = 1

 ( 1)  education = 1.0

       F(  1,  1340) =    4.31
            Prob > F =    0.0380
Not only is the OL$ coefficient on education substantially lower than the true parameter, the difference from the true parameter is statistically significant beyond the 5% level. We can perform a similar test for the OLS age coefficient: • test (1)
age
=
.2
age
=
.2
F(
1, 1340) = Prob > F =
8.15 0.0044
We find even stronger evidence that the OLS regression results are biased away from the true parameters.

q

Example

Several other interesting aspects of the Heckman model can be explored with predict. Continuing with our wage model, the expected wages for women conditional on participating in the labor force can be obtained with the ycond option. Let's get these predictions and compare them with actual wages for women participating in the labor force.

. predict hcndwage, ycond
. summarize wage hcndwage if wage ~= .

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
        wage |    1343    23.69217   6.3_5374    5.88497   45.80979
    hcndwage |    1343    23.68239   3.355087   16.18337    33.7567
We see that the average predictions from heckman are very close to the observed levels but do not have exactly the same mean. These conditional wage predictions are available for all observations in the dataset but can be directly compared only with observed wages where individuals are participating in the labor force.

What if we were interested in making predictions about mean wages for all women? In this case, the expected wage is 0 for those who are not expected to participate in the labor force, with expected participation determined by the selection equation. These values can be obtained with the yexpected option of predict. For comparison, a variable can be generated where the wage is set to 0 for nonparticipants.

. predict hexpwage, yexpected
. gen wage0 = wage
(657 missing values generated)
. replace wage0 = 0 if wage == .
(657 real changes made)
. summarize hexpwage wage0

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
    hexpwage |    2000    15.92511   5.949336   2.492469   32.45858
       wage0 |    2000    15.90929   12._7081          0   45.80979
Again, we note that the predictions from heckman are very close to the observed mean hourly wage rate for all women. Why aren't the predictions using ycond and yexpected exactly equal to their observed sample equivalents? For the Heckman model, unlike linear regression, the sample moments implied by the optimal solution to the model likelihood do not require that these predictions exactly match observed data. Properly accounting for the additional variation from the selection equation requires that the model use more information than just the sample moments of the observed wages.

q
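The two options correspond to standard selection-model formulas: ycond adds the nonselection-hazard correction to the linear prediction, and yexpected weights the conditional mean by the selection probability. A sketch in Python under the model's assumptions (the inputs xb, zg, rho, and sigma are placeholders, not estimates from this dataset):

```python
from scipy.stats import norm

def heckman_predictions(xb, zg, rho, sigma):
    """xb: regression linear prediction; zg: selection linear prediction."""
    mills = norm.pdf(zg) / norm.cdf(zg)          # nonselection hazard
    ycond = xb + rho * sigma * mills             # E[y | selected]
    # Pr(selected)*E[y | selected] + Pr(not selected)*0
    yexpected = norm.cdf(zg) * xb + rho * sigma * norm.pdf(zg)
    return ycond, yexpected

ycond, yexp = heckman_predictions(xb=20.0, zg=0.5, rho=0.7, sigma=6.0)
```

With positive ρ, ycond exceeds the raw linear prediction (selected individuals have above-average disturbances), while yexpected shrinks toward zero as the selection probability falls.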
Example

Stata will also produce Heckman's (1979) two-step efficient estimator of the model with the twostep option. Maximum likelihood estimation of the parameters can be time-consuming with large datasets, and the two-step estimates may provide a good alternative in such cases. Continuing with the women's wage model, we can obtain the two-step estimates with Heckman's consistent covariance estimates by typing
. heckman wage educ age, select(married children educ age) twostep

Heckman selection model -- two-step estimates     Number of obs   =       2000
(regression model with sample selection)          Censored obs    =        657
                                                  Uncensored obs  =       1343
                                                  Wald chi2(4)    =     551.37
                                                  Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
wage         |
   education |   .9825259   .0538821    18.23   0.000     .8789189    1.088133
         age |   .2118695   .0220511     9.61   0.000     .1686502    .2550888
       _cons |   .7340391   1.248331     0.59   0.557    -1.712645    3.180723
-------------+----------------------------------------------------------------
select       |
     married |   .4308575    .074208     5.81   0.000     .2854125    .5763025
    children |   .4473249   .0287417    15.56   0.000     .3909922    .5036576
   education |   .0583645   .0109742     5.32   0.000     .0368555    .0798735
         age |   .0347211   .0042293     8.21   0.000     .0264318    .0430105
       _cons |  -2.467365   .1925635   -12.81   0.000    -2.844782   -2.089948
-------------+----------------------------------------------------------------
mills        |
      lambda |   4.001615   .6065388     6.60   0.000     2.812821     5.19041
-------------+----------------------------------------------------------------
         rho |    0.67284
       sigma |  5.9473529
      lambda |  4.0016155   .6065388
------------------------------------------------------------------------------

q
□ Technical Note

The Heckman selection model depends strongly on the model being correct; much more so than ordinary regression. Running a separate probit or logit for sample inclusion followed by a regression, referred to in the literature as the two-part model (Manning, Duan, and Rogers 1987), not to be confused with Heckman's two-step procedure, is an especially attractive alternative if the regression part of the model arose because of taking a logarithm of zero values. When the goal is to analyze an underlying regression model or predict the value of the dependent variable that would be observed in the absence of selection, however, the Heckman model is more appropriate. When the goal is to predict an actual response, the two-part model is usually the better choice.

The Heckman selection model can be unstable when the model is not properly specified, or if a specific dataset simply does not support the model's assumptions. For example, let's examine the solution to another simulated problem.
. heckman yt x1 x2 x3, select(z1 z2)

Iteration 0:   log likelihood = -111.94996
Iteration 1:   log likelihood = -110.82258
Iteration 2:   log likelihood = -110.17707
Iteration 3:   log likelihood = -107.70663  (not concave)
Iteration 4:   log likelihood = -107.07729  (not concave)
 (output omitted )
Iteration 31:  log likelihood = -104.08268
Iteration 32:  log likelihood = -104.08267  (backed up)

Heckman selection model                           Number of obs   =        150
(regression model with sample selection)          Censored obs    =         87
                                                  Uncensored obs  =         63
                                                  Wald chi2(3)    =   8.64e+07
Log likelihood = -104.0827                        Prob > chi2     =     0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
yt           |
          x1 |   .8974192   .0006338  1415.93   0.000      .896177    .8986615
          x2 |  -2.525302   .0003934 -6418.57   0.000    -2.526074   -2.524531
          x3 |   2.855786   .0004417  6465.84   0.000      2.85492    2.856651
       _cons |   .6971255   .0851518     8.19   0.000     .5302311    .8640198
-------------+----------------------------------------------------------------
select       |
          z1 |  -.6830377   .0824049    -8.29   0.000    -.8445484    -.521527
          z2 |   1.004249   .1211501     8.29   0.000     .7667993    1.241699
       _cons |   -.361413   .1165081    -3.10   0.002    -.5897647   -.1330613
-------------+----------------------------------------------------------------
     /athrho |   15.12596   151.3627     0.10   0.920    -281.5395    311.7914
    /lnsigma |  -.5402571   .1206355    -4.48   0.000    -.7766984   -.3038158
-------------+----------------------------------------------------------------
         rho |          1   4.40e-11                            -1           1
       sigma |   .5825984   .0702821                       .459922    .7379968
      lambda |   .5825984   .0702821                      .4448481    .7203488
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):    chi2(1) =    25.67   Prob > chi2 = 0.0000
The model has converged to a value of ρ that is 1.0, within machine rounding tolerances. Given the form of the likelihood for the Heckman selection model, this implies a division by zero, and it is surprising that the model solution turns out as well as it does. Reparameterizing ρ has allowed the estimation to converge, but we clearly have problems with the estimates. Moreover, if this had occurred in a large dataset, waiting over 32 iterations for convergence might take considerable time.

This dataset was not intentionally developed to cause problems. It is actually generated by a "Heckman process" and, when generated starting from different random values, can be easily estimated. The luck of the draw in this case merely led to data that, despite its source, did not support the assumptions of the Heckman model.

The two-step model is generally more stable in cases where the data are problematic. It is even tolerant of estimates of ρ less than -1 and greater than 1. For these reasons, the two-step model may be preferred when exploring a large dataset. Still, if the maximum likelihood estimates cannot converge, or converge to a value of ρ that is at the boundary of acceptable values, there is scant support for estimating a Heckman selection model on the data. Heckman (1979) discusses the implications of ρ being exactly 1 or 0, together with the implications of other possible covariance relationships among the model's determinants.
Saved Results

heckman saves in e():
Scalars
    e(N)          number of observations
    e(k)          number of variables
    e(k_eq)       number of equations
    e(k_dv)       number of dependent variables
    e(N_cens)     number of censored observations
    e(df_m)       model degrees of freedom
    e(ll)         log likelihood
    e(ll_0)       log likelihood, constant-only model
    e(N_clust)    number of clusters
    e(rc)         return code
    e(chi2)       chi-squared
    e(chi2_c)     chi-squared for comparison test
    e(p_c)        p-value for comparison test
    e(p)          significance of comparison test
    e(rho)        rho
    e(ic)         number of iterations
    e(rank)       rank of e(V)
    e(rank0)      rank of e(V) for constant-only model
    e(lambda)     lambda
    e(selambda)   standard error of lambda
    e(sigma)      sigma

Macros
    e(cmd)        heckman
    e(depvar)     name(s) of dependent variable(s)
    e(title)      title in estimation output
    e(title2)     secondary title in estimation output
    e(wtype)      weight type
    e(wexp)       weight expression
    e(clustvar)   name of cluster variable
    e(mills)      variable containing nonselection hazard (inverse of Mills')
    e(method)     requested estimation method
    e(vcetype)    covariance estimation method
    e(user)       name of likelihood-evaluator program
    e(opt)        type of optimization
    e(chi2type)   Wald or LR; type of model chi-squared test
    e(chi2_ct)    Wald or LR; type of model chi-squared test corresponding to e(chi2_c)
    e(offset#)    offset for equation #
    e(predict)    program used to implement predict
    e(cnslist)    constraint numbers

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators
    e(ilog)       iteration log (up to 20 iterations)

Functions
    e(sample)     marks estimation sample
Methods and Formulas

heckman is implemented as an ado-file.

Greene (2000, 928-933) or Johnston and DiNardo (1997, 446-450) provide an introduction to the Heckman selection model. Regression estimates using the nonselection hazard (Heckman 1979) provide starting values for maximum likelihood estimation.

The regression equation is

    yj = xj β + u1j

The selection equation is

    zj γ + u2j > 0

where

    u1 ~ N(0, σ)
    u2 ~ N(0, 1)
    corr(u1, u2) = ρ
The log likelihood for observation j is

    lj = wj ln Φ{ [zjγ + (yj − xjβ)ρ/σ] / sqrt(1 − ρ²) }
         − (wj/2) {(yj − xjβ)/σ}² − wj ln{sqrt(2π) σ}         yj observed

    lj = wj ln Φ(−zjγ)                                         yj not observed

where Φ(·) is the standard cumulative normal and wj is an optional weight for observation j.
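As a concrete rendering of the per-observation likelihood, a Python sketch (this is an illustration of the formula, not Stata's internal evaluator):

```python
import numpy as np
from scipy.stats import norm

def heckman_ll_obs(y, xb, zg, rho, sigma, w=1.0, observed=True):
    """Log-likelihood contribution of one observation; w is the optional weight."""
    if not observed:
        return w * norm.logcdf(-zg)
    r = (y - xb) / sigma
    arg = (zg + r * rho) / np.sqrt(1.0 - rho**2)
    return w * (norm.logcdf(arg) - 0.5 * r**2 - np.log(np.sqrt(2 * np.pi) * sigma))

# sanity check: with rho = 0 the observed contribution separates into
# an independent probit term plus a normal-density term
ll = heckman_ll_obs(y=1.2, xb=1.0, zg=0.3, rho=0.0, sigma=2.0)
```

The separation at ρ = 0 is exactly why the likelihood-ratio test against the independent probit-plus-regression model is a valid test of ρ = 0.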
In the maximum likelihood estimation, σ and ρ are not directly estimated. Directly estimated are ln σ and

    atanh ρ = (1/2) ln{ (1 + ρ) / (1 − ρ) }

The standard error of λ = ρσ is approximated through the propagation of error (delta) method; that is,

    Var(λ) ≈ D Var{(atanh ρ, ln σ)} D′

where D is the Jacobian of λ with respect to atanh ρ and ln σ.
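That Jacobian has a closed form: with λ = tanh(a)·exp(s) for a = atanh ρ and s = ln σ, the partial derivatives are σ(1 − ρ²) and ρσ. A sketch (Python; the point estimates and variance entries are illustrative, and the off-diagonal covariance term, which heckman does use, is set to zero here):

```python
import numpy as np

a, s = 0.8742, 1.7926                      # athrho, lnsigma (illustrative values)
rho, sigma = np.tanh(a), np.exp(s)

# Jacobian of lambda = tanh(a)*exp(s) with respect to (a, s)
D = np.array([sigma * (1 - rho**2), rho * sigma])

V = np.diag([0.1403**2, 0.0258**2])        # Var{(atanh rho, ln sigma)}, covariance omitted
se_lambda = np.sqrt(D @ V @ D)
```

With the full covariance matrix in place of the diagonal V, this is the delta-method standard error heckman reports for lambda.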
where D is the Jacobian of )_ with respect to at_h p and In a. The two-step estimates are computed using H_ckman's (1979) procedure. Probit estimates of the selection equation Pr(yj
observed I zj)-
_(zj")')
are obtained. From these estimates the nonselection hazard, what Heckman (t979) referred to as the inverse of the Mills' ratio, m j, for each observa¢ion 3 is computed as
¢(zjS) mj where ¢ is the normal density. We also define
Following Heckman, the two-step parameter estimates of /3 are obtained by augmenting the regression equation with the nonselection hazard m. Thus, the regressors become [X m] and we obtain the additional parameter estimate/3,a on the variable containing the nonselection hazard. A consistent estimate of the regression disturbance variance is obtained using the residuals from the augmented regression and the parameter estimate on the nonselecfion hazard. e'e +/3_ _--]j=l N N
5j
The two-step estimate of p is then _ = /3r,L c3 Heckman derived consistent estimates of the coefficient covariance matrix based on the augmented regression.
Let W = [X m] and let D be a square diagonal matrix of rank N with (1 − ρ̂² δj) on the diagonal elements. Then

    Vtwostep = σ̂² (W′W)⁻¹ (W′DW + Q) (W′W)⁻¹

where Q = ρ̂² (W′DZ) Vp (Z′DW), Z is the matrix of selection-equation regressors, and Vp is the variance-covariance estimate from the probit estimation of the selection equation.
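The whole two-step procedure can be prototyped in a few lines. A hedged sketch in Python on simulated data (an illustration of the algebra above, not Stata's implementation; the simulated truth is β = (1, 1.5), ρ = 0.5, σ = 2):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 5000
x, z = rng.standard_normal(n), rng.standard_normal(n)
u2 = rng.standard_normal(n)
u1 = 2.0 * (0.5 * u2 + np.sqrt(0.75) * rng.standard_normal(n))  # sd 2, corr .5 with u2
y = 1.0 + 1.5 * x + u1
sel = (0.5 + 1.0 * z + u2) > 0

# step 1: probit of selection on z
Z = np.column_stack([np.ones(n), z])
negll = lambda g: -np.sum(norm.logcdf((2 * sel - 1) * (Z @ g)))
ghat = minimize(negll, np.zeros(2), method="BFGS").x

# step 2: OLS of y on [X m] over the selected sample, m = nonselection hazard
zg = Z[sel] @ ghat
m = norm.pdf(zg) / norm.cdf(zg)
W = np.column_stack([np.ones(sel.sum()), x[sel], m])
beta = np.linalg.lstsq(W, y[sel], rcond=None)[0]

# disturbance variance and rho from the formulas above
e = y[sel] - W @ beta
delta = m * (m + zg)
sigma2 = (e @ e + beta[-1] ** 2 * delta.sum()) / sel.sum()
rho_hat = beta[-1] / np.sqrt(sigma2)
```

Note that the coefficient on the hazard term, βm, estimates ρσ, which is exactly the lambda reported in the mills equation of the twostep output.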
References

Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.

Gronau, R. 1974. Wage comparisons: A selectivity bias. Journal of Political Economy 82: 1119-1155.

Heckman, J. 1976. The common structure of statistical models of truncation, sample selection, and limited dependent variables and a simple estimator for such models. The Annals of Economic and Social Measurement 5: 475-492.

------. 1979. Sample selection bias as a specification error. Econometrica 47: 153-161.

Johnston, J. and J. DiNardo. 1997. Econometric Methods. 4th ed. New York: McGraw-Hill.

Lewis, H. 1974. Comments on selectivity biases in wage comparisons. Journal of Political Economy 82: 1119-1155.

Manning, W. G., N. Duan, and W. H. Rogers. 1987. Monte Carlo evidence on the choice between sample selection and two-part models. Journal of Econometrics 35: 59-82.
Also See

Complementary:  [R] adjust, [R] constraint, [R] lincom, [R] lrtest, [R] mfx,
                [R] predict, [R] test, [R] testnl, [R] vce, [R] xi

Related:        [R] heckprob, [R] regress, [R] tobit, [R] treatreg

Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands,
                [U] 23.11 Obtaining robust variance estimates,
                [U] 23.12 Obtaining scores
Title

heckprob -- Maximum-likelihood probit estimation with selection

Syntax

heckprob depvar [varlist] [weight] [if exp] [in range],
    select([depvar_s =] varlist_s [, offset(varname) noconstant])
    [ robust cluster(varname) score(newvarlist) first noconstant noskip
      constraints(numlist) level(#) offset(varname) maximize_options ]

by ... : may be used with heckprob; see [R] by.

aweights, fweights, iweights, and pweights are allowed; see [U] 14.1.6 weight.

heckprob shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

Syntax for predict

predict [type] newvarname [if exp] [in range] [, statistic nooffset ]

where statistic is

    pmargin    Φ(xjb), success probability (the default)
    p11        Φ2(xjb, zjg, ρ), predicted probability Pr(yj_probit = 1, yj_select = 1)
    p10        Φ2(xjb, -zjg, -ρ), predicted probability Pr(yj_probit = 1, yj_select = 0)
    p01        Φ2(-xjb, zjg, -ρ), predicted probability Pr(yj_probit = 0, yj_select = 1)
    p00        Φ2(-xjb, -zjg, ρ), predicted probability Pr(yj_probit = 0, yj_select = 0)
    psel       Φ(zjg), selection probability
    pcond      Φ2(xjb, zjg, ρ)/Φ(zjg), probability of success conditional on selection
    xb         xjb, fitted values
    stdp       standard error of fitted values
    xbsel      linear prediction for selection equation
    stdpsel    standard error of the linear prediction for selection equation

Φ() is the standard normal distribution function and Φ2() is the bivariate normal distribution function.

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.

Description

heckprob estimates maximum-likelihood probit models with sample selection.
Options

select(...) specifies the variables and options for the selection equation. It is an integral part of specifying a selection model and is not optional.

robust specifies that the Huber/White/sandwich estimator of the variance is to be used in place of the conventional MLE variance estimator. robust combined with cluster() further allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.11 Obtaining robust variance estimates.

cluster(varname) specifies that the observations are independent across groups (clusters) but are not necessarily independent within groups. varname specifies to which group each observation belongs. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data. cluster() implies robust; that is, specifying robust cluster() is equivalent to typing cluster() by itself.

score(newvarlist) creates a new variable, or a set of new variables, containing the contributions to the scores for each equation and the ancillary parameter in the model. The first new variable specified will contain u1j = ∂lnLj/∂(xjβ) for each observation j in the sample, where lnLj is the jth observation's contribution to the log likelihood. The second new variable: u2j = ∂lnLj/∂(zjγ). The third: u3j = ∂lnLj/∂(atanh ρ). If only one variable is specified, only the first score is computed; if two variables are specified, only the first two scores are computed; and so on. The jth observation's contribution to the score vector is

    { ∂lnLj/∂β   ∂lnLj/∂γ   ∂lnLj/∂(atanh ρ) } = ( u1j xj   u2j zj   u3j )

The score vector can be obtained by summing over j; see [U] 23.12 Obtaining scores.

first specifies that the first-step probit estimates of the selection equation be displayed prior to estimation.

noconstant omits the constant term from the equation. This option may be specified on the regression equation, the selection equation, or both.

constraints(numlist) specifies by number the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are specified using the constraint command; see [R] constraint. See [R] reg3 for the use of constraints in multiple-equation contexts.

noskip specifies that a full maximum likelihood model with only a constant for the regression equation be estimated. This model is not displayed but is used as the base model to compute a likelihood-ratio test for the model test statistic displayed in the estimation header. By default, the overall model test statistic is an asymptotically equivalent Wald test of all the parameters in the regression equation being zero (except the constant). For many models, this option can substantially increase estimation time.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

offset(varname) is a rarely used option that specifies a variable to be added directly to Xb. This option may be specified on the regression equation, the selection equation, or both.
maximize_options control the maximization process; see [R] maximize. With the possible exception of iterate(0) and trace, you should never have to specify them.
Options for predict

pmargin, the default, calculates the univariate (marginal) predicted probability of success Pr(yj_probit = 1).

p11 calculates the bivariate predicted probability Pr(yj_probit = 1, yj_select = 1).

p10 calculates the bivariate predicted probability Pr(yj_probit = 1, yj_select = 0).

p01 calculates the bivariate predicted probability Pr(yj_probit = 0, yj_select = 1).

p00 calculates the bivariate predicted probability Pr(yj_probit = 0, yj_select = 0).

psel calculates the univariate (marginal) predicted probability of selection Pr(yj_select = 1).

pcond calculates the conditional (on selection) predicted probability of success Pr(yj_probit = 1, yj_select = 1)/Pr(yj_select = 1).

xb calculates the probit linear prediction xjb.

stdp calculates the standard error of the prediction. It can be thought of as the standard error of the predicted expected value or mean for the observation's covariate pattern. This is also referred to as the standard error of the fitted value.

xbsel calculates the linear prediction for the selection equation.

stdpsel calculates the standard error of the linear prediction for the selection equation.

nooffset is relevant only if you specified offset(varname) for heckprob. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as xjb rather than xjb + offsetj.
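The relationships among these statistics can be checked numerically: the four joint probabilities partition the sample space, p11 + p10 equals pmargin, and p11 + p01 equals psel. A sketch (Python; xjb, zjg, and ρ are arbitrary illustrative values):

```python
from scipy.stats import multivariate_normal, norm

def Phi2(a, b, rho):
    """Bivariate standard normal CDF with correlation rho."""
    return multivariate_normal(mean=[0.0, 0.0],
                               cov=[[1.0, rho], [rho, 1.0]]).cdf([a, b])

xb, zg, rho = 0.3, 0.8, 0.4
pmargin = norm.cdf(xb)
p11 = Phi2(xb, zg, rho)       # success and selected
p10 = Phi2(xb, -zg, -rho)     # success and not selected
p01 = Phi2(-xb, zg, -rho)     # failure and selected
p00 = Phi2(-xb, -zg, rho)     # failure and not selected
psel = norm.cdf(zg)
pcond = p11 / psel            # success conditional on selection
```

The sign flips on the arguments and on ρ come from negating one component of a bivariate normal vector, which negates its correlation with the other component.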
Remarks

The probit model with sample selection (Van de Ven and Van Praag 1981) assumes that there exists an underlying relationship

    yj* = xjβ + u1j                       latent equation

such that we observe only the binary outcome

    yj_probit = (yj* > 0)                 probit equation

The dependent variable, however, is not always observed. Rather, the dependent variable for observation j is observed if

    yj_select = (zjγ + u2j > 0)           selection equation

where

    u1 ~ N(0, 1)
    u2 ~ N(0, 1)
    corr(u1, u2) = ρ

When ρ ≠ 0, standard probit techniques applied to the first equation yield biased results. heckprob provides consistent, asymptotically efficient estimates for all the parameters in such models.
Example

We use the data from Pindyck and Rubinfeld (1998). In this dataset, the variables are whether children attend private school (private), number of years the family has been at the present residence (years), log of property tax (logptax), log of income (loginc), and whether one voted for an increase in property taxes (vote).

In this example, we alter the meaning of the data. Here we assume that we observe whether children attend private school only if the family votes for increasing the property taxes. This is not true in the dataset, and we make this untrue assumption only to illustrate the use of this command.

We observe whether children attend private school only if the head of household voted for an increase in property taxes. We assume that the vote is affected by the number of years in residence, the current property taxes paid, and the household income. We wish to model whether children are sent to private school based on the number of years spent in the current residence and the current property taxes paid.

. heckprob private years logptax, sel(vote=years loginc logptax)

Fitting probit model:
Iteration 0:   log likelihood = -17.122381
Iteration 1:   log likelihood = -18.407192
Iteration 2:   log likelihood = -16.141254
Iteration 3:   log likelihood = -15.953354
Iteration 4:   log likelihood = -15.887269
Iteration 5:   log likelihood = -15.883886
Iteration 6:   log likelihood = -15.883655

Fitting selection model:
Iteration 0:   log likelihood = -63.036914
Iteration 1:   log likelihood = -58.581911
Iteration 2:   log likelihood = -58.497419
Iteration 3:   log likelihood = -58.497288

Comparison:    log likelihood = -74.380943

Fitting starting values:
Iteration 0:   log likelihood = -40.895884
Iteration 1:   log likelihood = -17.920826
Iteration 2:   log likelihood = -18.375362
Iteration 3:   log likelihood = -16.067451
Iteration 4:   log likelihood = -15.84787
Iteration 5:   log likelihood = -15.760354
Iteration 6:   log likelihood = -15.753805
Iteration 7:   log likelihood = -15.753785

Fitting full model:
Iteration 0:   log likelihood = -75.010619  (not concave)
Iteration 1:   log likelihood = -74.287753
Iteration 2:   log likelihood = -74.250148
Iteration 3:   log likelihood = -74.245088
Iteration 4:   log likelihood = -74.244973
Iteration 5:   log likelihood = -74.244973

Probit model with sample selection              Number of obs    =         95
                                                Censored obs     =         36
                                                Uncensored obs   =         59
                                                Wald chi2(2)     =       1.04
Log likelihood = -74.24497                      Prob > chi2      =     0.5935

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
private      |
       years |  -.1142597   .1461711    -0.78   0.434    -.4007498    .1722304
     logptax |   .3516095    1.01648     0.35   0.729    -1.640655    2.343874
       _cons |  -2.780664   6.905814    -0.40   0.687    -16.31581    10.75448
-------------+----------------------------------------------------------------
vote         |
       years |  -.0167511   .0147735    -1.13   0.257    -.0457067    .0122045
      loginc |   .9923025   .4430003     2.24   0.025     .1240378    1.860567
     logptax |  -1.278783   .5717544    -2.24   0.025    -2.399401   -.1581649
       _cons |  -.5458214   4.070415    -0.13   0.893    -8.523689    7.432046
-------------+----------------------------------------------------------------
     /athrho |  -.8663147   1.449995    -0.60   0.550    -3.708253    1.975623
-------------+----------------------------------------------------------------
         rho |  -.6994969   .7405184                     -.9987982    .9622642
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):   chi2(1) =     0.27   Prob > chi2 = 0.6020

The output shows several iteration logs. The first iteration log corresponds to running the probit model for those observations in the sample where we have observed the outcome. The second iteration log corresponds to running the selection probit model, which models whether we observe our outcome of interest. If rho = 0, then the sum of the log likelihoods from these two models will equal the log likelihood of the probit model with sample selection; this sum is printed in the iteration log as the comparison log likelihood. The third iteration log shows starting values for the iterations. The final iteration log is for estimating the full probit model with sample selection. A likelihood-ratio test of the log likelihood for this model and the comparison log likelihood is presented at the end of the output. If we had specified the robust option, then this test would be presented as a Wald test instead of a likelihood-ratio test.
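The comparison log likelihood is just the sum of the two separately fit models' log likelihoods, and the LR statistic of rho = 0 is twice the gap between the full model and that sum. The printed values can be checked directly; the check is plain arithmetic, done here in Python rather than Stata:

```python
# Final log likelihoods as printed in the output above
ll_probit = -15.883655      # probit model, final iteration
ll_select = -58.497288      # selection model, final iteration
ll_full   = -74.244973      # full model, final iteration

ll_comparison = ll_probit + ll_select      # the "Comparison" line: -74.380943
lr_chi2 = 2 * (ll_full - ll_comparison)    # LR test of rho = 0

print(round(ll_comparison, 6))   # -74.380943
print(round(lr_chi2, 2))         # 0.27
```

Both numbers match the output: the comparison log likelihood and the chi2(1) statistic of 0.27 reported by the LR test.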
Example

In the previous example, we could obtain robust standard errors by specifying the robust option. We also eliminate the iteration logs using the nolog option.

. heckprob private years logptax, sel(vote=years loginc logptax) nolog robust

Probit model with sample selection              Number of obs    =         95
                                                Censored obs     =         36
                                                Uncensored obs   =         59
                                                Wald chi2(2)     =       2.55
Log likelihood = -74.24497                      Prob > chi2      =     0.2798

------------------------------------------------------------------------------
             |               Robust
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
private      |
       years |  -.1142597   .1113949    -1.03   0.305    -.3325896    .1040702
     logptax |   .3516095    .735811     0.48   0.633    -1.090553    1.793773
       _cons |  -2.780664   4.786602    -0.58   0.561    -12.16223    6.600903
-------------+----------------------------------------------------------------
vote         |
       years |  -.0167511   .0173344    -0.97   0.334    -.0507258    .0172237
      loginc |   .9923025   .4228035     2.35   0.019     .1636229    1.820982
     logptax |  -1.278783   .5095157    -2.51   0.012    -2.277416   -.2801505
       _cons |  -.5458214   4.543884    -0.12   0.904     -9.45167    8.360027
-------------+----------------------------------------------------------------
     /athrho |  -.8663147   1.630569    -0.53   0.595    -4.062171    2.329541
-------------+----------------------------------------------------------------
         rho |  -.6994969   .8327381                     -.9994077    .9812276
------------------------------------------------------------------------------
Wald test of indep. eqns. (rho = 0):   chi2(1) =   0.28   Prob > chi2 = 0.5952
Regardless of whether we specify the robust option, it is clear that the outcome is not significantly different from the outcome obtained by estimating the probit and selection models separately. This is not surprising since the selection mechanism estimated was invented for the example rather than born from any economic theory.

Example

It is instructive to compare the marginal predicted probabilities with the predicted probabilities we would obtain ignoring the selection mechanism. To compare the two approaches, we will synthesize data so that we know the "true" predicted probabilities.

First, we need to generate correlated error terms, which we can do using a standard Cholesky decomposition approach. For our example, we will clear any data from memory and then generate errors that have correlation of .5 using the following commands. Note that we set the seed so that interested readers might type in these same commands and obtain the same results.

. clear
. set seed 12309
. set obs 5000
. gen c1 = invnorm(uniform())
. gen c2 = invnorm(uniform())
. matrix P = (1,.5\.5,1)
. matrix A = cholesky(P)
. local fac1 = A[2,1]
. local fac2 = A[2,2]
. gen u1 = c1
. gen u2 = `fac1'*c1 + `fac2'*c2
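The same Cholesky construction can be sketched outside Stata. For P = (1,.5\.5,1), the Cholesky factor's second row is (.5, sqrt(.75)), so u2 = .5*c1 + sqrt(.75)*c2 has correlation .5 with u1 = c1. A standard-library Python version (the seed value merely echoes the example and has no special meaning):

```python
import random
import math

random.seed(12309)          # any fixed seed; 12309 echoes the example above
n = 5000

c1 = [random.gauss(0, 1) for _ in range(n)]
c2 = [random.gauss(0, 1) for _ in range(n)]

# Second row of the Cholesky factor of [[1, .5], [.5, 1]]
fac1 = 0.5
fac2 = math.sqrt(1 - 0.5 ** 2)

u1 = c1
u2 = [fac1 * a + fac2 * b for a, b in zip(c1, c2)]

def corr(x, y):
    """Sample correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = corr(u1, u2)    # close to .5 in a sample of this size
```

Because fac1**2 + fac2**2 = 1, u2 also has unit variance, which is why the normalization step that follows in the example changes the errors only slightly.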
We can check that the errors have the correct correlation using the corr command. We will also normalize the errors such that they have a standard deviation of one so that we can generate a bivariate probit model with known coefficients. We do that with the following commands.

. summarize u1
. replace u1 = u1/sqrt(r(Var))
. summarize u2
. replace u2 = u2/sqrt(r(Var))
. drop c1 c2
. gen x1 = uniform()-.5
. gen x2 = uniform()+1/3
. gen y1s = 0.5 + 4*x1 + u1
. gen y2s = 3 + .5*x1 - 3*x2 + u2
. gen y1 = (y1s>0)
. gen y2 = (y2s>0)
We have now created two dependent variables y1 and y2 that are defined by our specified coefficients. We also included error terms for each equation, and the error terms are correlated. We run heckprob to verify that the data have been correctly generated according to the model

    y1 = .5 + 4x1 + u1
    y2 = 3 + .5x1 - 3x2 + u2

where we assume that y1 is observed only if y2 = 1.

. heckprob y1 x1, sel(y2 = x1 x2) nolog

Probit model with sample selection              Number of obs    =       5000
                                                Censored obs     =       1790
                                                Uncensored obs   =       3210
                                                Wald chi2(1)     =     941.68
Log likelihood = -3600.854                      Prob > chi2      =     0.0000

------------------------------------------------------------------------------
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
y1           |
          x1 |   3.985923   .1298904    30.69   0.000     3.731342    4.240503
       _cons |   .4852946   .0464037    10.46   0.000     .3943449    .5762442
-------------+----------------------------------------------------------------
y2           |
          x1 |   .5998148   .0716655     8.37   0.000     .4593531    .7402765
          x2 |  -3.004937   .0829469   -36.23   0.000     -3.16751   -2.842364
       _cons |   3.011587   .0782817    38.47   0.000     2.858157    3.165016
-------------+----------------------------------------------------------------
     /athrho |    .574063   .0860559     6.67   0.000     .4053964    .7427295
-------------+----------------------------------------------------------------
         rho |   .5183369    .062935                      .3845569    .6307914
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0):   chi2(1) =    46.58   Prob > chi2 = 0.0000
Now that we have verified that we generated data according to a known model, we can obtain and then compare predicted probabilities from the probit model with sample selection and a (usual) probit model.

. predict pmarg
(option pmargin assumed; Pr(y1=1))
. probit y1 x1 if y2==1
 (output omitted)
. predict phat
(option p assumed; Pr(y1))

Using the (marginal) predicted probabilities from the probit model with sample selection (pmarg) and the predicted probabilities from the (usual) probit model (phat), we can also generate the "true" predicted probabilities from the synthesized y1s variable and then compare the predicted probabilities:

. gen ptrue = norm(y1s)
. summarize pmarg ptrue phat

    Variable |     Obs        Mean   Std. Dev.        Min        Max
-------------+------------------------------------------------------
       pmarg |    5000     .619436   .3209414   .0658356   .9933942
       ptrue |    5000    .6090161   .3499035   1.02e-06          1
        phat |    5000    .6723897   .3063237    .096498   .9971064
Here we see that ignoring the selection mechanism (comparing the phat variable with the true ptrue variable) results in predicted probabilities that are much higher than the true values. Looking at the marginal predicted probabilities from the model with sample selection, however, results in more accurate predictions.
Methods and Formulas

The probit equation is

    y_j = (x_j B + u_1j > 0)

The selection equation is

    z_j G + u_2j > 0

where

    u_1 ~ N(0,1)
    u_2 ~ N(0,1)
    corr(u_1, u_2) = rho

The log likelihood is

    lnL =  sum over i in S with y_i != 0 of  w_i ln Phi2(x_i B + offset_i, z_i G + offset_i, rho)
         + sum over i in S with y_i  = 0 of  w_i ln Phi2(-(x_i B + offset_i), z_i G + offset_i, -rho)
         + sum over i not in S of            w_i ln{1 - Phi(z_i G + offset_i)}

where S is the set of observations for which y_i is observed, Phi2 is the cumulative bivariate normal distribution function (with mean [0 0]'), Phi is the standard cumulative normal, and w_i is an optional weight for observation i.

In the maximum likelihood estimation, rho is not directly estimated. Directly estimated is atanh rho:

    atanh rho = (1/2) ln{ (1 + rho)/(1 - rho) }

From the form of the likelihood, it is clear that if rho = 0, then the log likelihood for the probit model with sample selection is equal to the sum of the probit model for the outcome y and the selection model. A likelihood-ratio test may therefore be performed by comparing the likelihood of the full model with the sum of the log likelihoods for the probit and selection models.
References

Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.
Heckman, J. 1979. Sample selection bias as a specification error. Econometrica 47: 153-161.
Pindyck, R. and D. Rubinfeld. 1998. Econometric Models and Economic Forecasts. 4th ed. New York: McGraw-Hill.
Van de Ven, W. P. M. M. and B. M. S. Van Praag. 1981. The demand for deductibles in private health insurance: A probit model with sample selection. Journal of Econometrics 17: 229-252.
Also See

Complementary:  [R] adjust, [R] constraint, [R] lincom, [R] lrtest, [R] mfx, [R] predict,
                [R] test, [R] testnl, [R] vce, [R] xi

Related:        [R] heckman, [R] probit, [R] treatreg

Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands,
                [U] 23.11 Obtaining robust variance estimates,
                [U] 23.12 Obtaining scores
Title

help -- Obtain on-line help

Syntax

Windows, Macintosh, and Unix(GUI):

    help  [command or topic name]
    whelp [command or topic name]

Unix(console & GUI):

    {help|man} [command or topic name]

Description

The help command displays help information on the specified command or topic. If help is not followed by a command or a topic name, a description of how to use the help system is displayed. Stata for Unix users may type help or man; they mean the same thing. Stata for Windows, Stata for Macintosh, and Stata for Unix(GUI) users may click on the Help menu. They may also type whelp something to display the help topic for something in Stata's Viewer. whelp typed by itself displays the table of contents for the on-line help.
Remarks

See [U] 8 Stata's on-line help and search facilities for a complete description of how to use help.

Technical Note

When you type help something, Stata first looks along the S_ADO path for something.hlp; see [U] 20.5 Where does Stata look for ado-files?. If nothing is found, it then looks in stata.hlp for the topic.
Also See

Complementary:  [R] search

Related:        [R] net search

Background:     [GSM] 4 Help, [GSW] 4 Help, [GSU] 4 Help,
                [U] 8 Stata's on-line help and search facilities
Title

hetprob -- Maximum-likelihood heteroskedastic probit estimation

Syntax

hetprob depvar [varlist] [weight] [if exp] [in range] , het(varlist [, offset(varname)])
    [ noconstant level(#) asis robust cluster(varname) score(newvar1 [newvar2])
      noskip offset(varname) constraints(numlist) nolrtest nolog maximize_options ]

by ... : may be used with hetprob; see [R] by.

fweights, iweights, and pweights are allowed; see [U] 14.1.6 weight.

This command shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

predict [type] newvarname [if exp] [in range] [, { p | xb | sigma } nooffset]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

hetprob estimates a maximum-likelihood heteroskedastic probit model.

See [R] logistic for a list of related estimation commands.
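In the heteroskedastic probit model, the standard deviation of the latent error is modeled as exp(z_j g) using the variables given in het(), so the success probability is Phi(x_j b / exp(z_j g)). A minimal Python sketch of this probability (the linear predictions are illustrative values, not Stata output):

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def hetprobit_p(xb, zg):
    """Pr(y != 0) when the latent error has standard deviation exp(zg)."""
    return Phi(xb / math.exp(zg))

# With zg = 0 the variance function drops out and this is ordinary probit
p_homo = hetprobit_p(1.2, 0.0)       # equals Phi(1.2)
p_hetero = hetprobit_p(1.2, 0.7)     # a larger error sd pulls the probability toward .5
```

Scaling the error up shrinks the effective index toward zero, which is why ignoring heteroskedasticity in a probit attenuates (or otherwise distorts) the estimated effects.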
Options

het(varlist [, offset(varname)]) specifies the independent variables and the offset variable, if there is one, in the variance function. het() is not optional.

noconstant suppresses the constant term (intercept) in the model.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

asis forces retention of perfect predictor variables and their associated perfectly predicted observations and may produce instabilities in maximization; see [R] probit.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.13 Weighted estimation.
cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients; see [U] 23.11 Obtaining robust variance estimates. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data, but see the svyprobit command in [R] svy estimators for a command designed for survey data.

hilite -- Highlight a subset of points in a two-way scatterplot

Example

You have data on 956 U.S. cities, including average January temperature, average July temperature, and region. The region variable is coded 1, 2, 3, and 4, with 4 standing for the West. You wish to make a graph showing the relationship between January and July temperatures, highlighting the fourth region:

. hilite tempjan tempjuly, hilite(region==4)
(figure omitted: two-way scatterplot of Average January Temperature versus Average July Temperature, with the points satisfying region==4 highlighted)
It is possible to use graph to produce graphs like this, but hilite is often more convenient.

Technical Note

By default, hilite uses '.' for the plotting symbol and additionally highlights using the 'o' symbol. Its default is equivalent to specifying symbol(.o) as one of the graph_options. You can vary the symbols used, but you must specify exactly two symbols. The first is used to plot all the data and the second is used for overplotting the highlighted subset.

Methods and Formulas

hilite is implemented as an ado-file.

References

Weesie, J. 1999. dr38: Enhancement to the hilite command. Stata Technical Bulletin 50: 17-20. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 98-101.

Also See

Related:        [R] separate

Background:     Stata Graphics Manual
Title

hist -- Categorical variable histogram

Syntax

hist varname [weight] [if exp] [in range] [, incr(#) graph_options]

fweights are allowed; see [U] 14.1.6 weight.

Description

hist graphs a histogram of varname, the result being quite similar to graph varname, histogram. hist, however, is intended for use with integer-coded categorical variables. hist determines the number of bins automatically, the x-axis is automatically labeled, and the labels are centered below the corresponding bar.

hist may only be used with categorical variables with a range of less than 50; i.e., maximum(varname) - minimum(varname) < 50.

Options

incr(#) specifies how the x-axis is to be labeled. incr(1), the default if varname reflects 25 or fewer categories, labels the minimum, minimum + 1, minimum + 2, ..., maximum. incr(2), the default if there are more than 25 categories, would label the minimum, minimum + 2, ..., etc.

graph_options refers to any of the options allowed with graph's histogram style excluding bin(), xlabel(), and xscale(). These do include, for instance, freq, ylabel(), by(), total, and saving(). See [G] histogram.
Remarks

Example

You have a categorical variable rep78 reflecting the repair records of automobiles. It is coded 1 = Poor, 2 = Fair, 3 = Average, 4 = Good, and 5 = Excellent. You could type

. graph rep78, histogram bin(5)

to obtain a histogram. You should specify bin(5) because your categorical variable takes on 5 values and you want one bar per value. (You could omit the option in this case, but only because the default value of bin() is 5; if you had 4 or 6 bars, you would have to specify it; see [G] histogram.) In any case, the resulting graph, while technically correct, is aesthetically displeasing because the numeric code 1 is on the left edge of the first bar while the numeric code 5 is on the right edge of the last bar. Using hist is better:

. hist rep78

(figure omitted: histogram of rep78, x-axis titled "Repair Record 1978", with the numeric codes centered beneath the bars)

hist not only centers the numeric codes underneath the corresponding bar, it also automatically labels all the bars.
You are cautioned: hist is not a general replacement for graph, histogram. hist is intended for use with categorical data only, which is to say, noncontinuous data. If you wanted a histogram of automobile prices, for instance, you would still want to use the graph, histogram command.
Example

You may use any of the options you would with graph, histogram. Using data collected by Voter Research and Surveys on election day, based on questionnaires completed by 15,490 voters from 300 polling places, you draw the following graph. The data were originally printed in the New York Times, November 5, 1992, and reprinted in Lipset (1993).

. hist candi [freq=pop], by(inc) total ylab yline noaxis title(Exit Polling By Family Income)

(figure omitted: histograms of candidate voted for, 1992, by family income, with panels such as "$50-75k" and "$75k+" plus a total panel; overall title "Exit Polling by Family Income")
Technical Note

In both of these examples, each bar is labeled; if your categorical variable takes on many values, you may not want to label them all. Typing

. hist myvar, incr(2)

would label every other bar. Specifying incr(3) would label every third bar, and so on.

Methods and Formulas

hist is implemented as an ado-file.

References

Lipset, S. M. 1993. The significance of the 1992 election. Political Science and Politics 26(1): 7-16.

Also See

Related:        [R] spikeplot, [G] histogram

Background:     Stata Graphics Manual
Title

hotel -- Hotelling's T-squared generalized means test

Syntax

hotel varlist [weight] [if exp] [in range] [, by(varname) notable]

aweights and fweights are allowed; see [U] 14.1.6 weight.

Description

hotel performs Hotelling's T-squared test for testing whether a set of means is zero or, alternatively, equal between two groups.

Options

by(varname) specifies a variable identifying two groups; the test of equality of means between groups is performed. If by() is not specified, a test of the means being jointly zero is performed.

notable suppresses printing the table of the means being compared.

Remarks

hotel performs Hotelling's T-squared test of whether a set of means is zero, or two sets of means are equal. It is a multivariate test that reduces to a standard t test if only one variable is specified.
Example

You wish to test whether a new fuel additive improves gas mileage in both stop-and-go and highway situations. Taking twelve cars, you fill them with gas and run them on a highway-style track, recording their gas mileage. You then refill them and run them on a stop-and-go style track. Finally, you repeat the two runs, but this time use fuel with the additive. Your dataset is

. describe

Contains data from gasexp.dta
  obs:            12
 vars:             5                          13 Jul 2000 13:22
 size:           288 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              float  %9.0g                  car id
bmpg1           float  %9.0g                  track1 before additive
ampg1           float  %9.0g                  track1 after additive
bmpg2           float  %9.0g                  track 2 before additive
ampg2           float  %9.0g                  track 2 after additive
-------------------------------------------------------------------------------
Sorted by:
To perform the statistical test, you jointly test whether the differences in before-and-after results are zero:

. gen diff1 = ampg1 - bmpg1
. gen diff2 = ampg2 - bmpg2
. hotel diff1 diff2

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+------------------------------------------------------
       diff1 |      12        1.75    2.70101         -3          5
       diff2 |      12    2.083333   2.906367       -3.5        5.5

1-group Hotelling's T-squared = 9.6980676
F test statistic: ((12-2)/(12-1)(2)) x 9.6980676 = 4.4082126

H0: Vector of means is equal to a vector of zeros
    F(2,10) = 4.4082
    Prob > F(2,10) = 0.0424

The means in before-and-after results are different at the 4.24% significance level.
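The conversion from T-squared to F displayed above, F = ((n - k)/((n - 1)k)) T-squared with n = 12 observations and k = 2 variables, is plain arithmetic and can be checked directly (in Python, since the check is just arithmetic):

```python
n, k = 12, 2
T2 = 9.6980676                        # 1-group Hotelling's T-squared from the output
F = (n - k) / ((n - 1) * k) * T2      # distributed F(k, n-k) under H0

print(round(F, 4))   # 4.4082
```

This reproduces the F(2,10) statistic printed in the output.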
Technical Note

We used Hotelling's T-squared test because we were testing two differences jointly. Had there been only one difference, we could have used a standard t test, which would have yielded the same results as Hotelling's test.

We could have performed the test like this:

. ttest ampg1 = bmpg1

    Variable |     Obs        Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
-------------+-------------------------------------------------------------------
       ampg1 |      12       22.75    .9384465    3.250874    20.68449    24.81551
       bmpg1 |      12          21    .7881701    2.730301    19.26525    22.73475
-------------+-------------------------------------------------------------------
        diff |      12        1.75    .7797144     2.70101    .0338602     3.46614

Ho: mean(ampg1 - bmpg1) = mean(diff) = 0

 Ha: mean(diff) < 0        Ha: mean(diff) ~= 0        Ha: mean(diff) > 0
      t =  2.2444               t =  2.2444                t =  2.2444
  P < t =  0.9768           P > |t| =  0.0463           P > t =  0.0232

Or like this:

. ttest diff1 = 0

    Variable |     Obs        Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
-------------+-------------------------------------------------------------------
       diff1 |      12        1.75    .7797144     2.70101    .0338602     3.46614

Degrees of freedom: 11

Ho: mean(diff1) = 0

 Ha: mean < 0              Ha: mean ~= 0              Ha: mean > 0
      t =  2.2444               t =  2.2444                t =  2.2444
  P < t =  0.9768           P > |t| =  0.0463           P > t =  0.0232

Or like this:

. hotel diff1

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+------------------------------------------------------
       diff1 |      12        1.75    2.70101         -3          5

1-group Hotelling's T-squared = 5.0373832
F test statistic: ((12-1)/(12-1)(1)) x 5.0373832 = 5.0373832

H0: Vector of means is equal to a vector of zeros
    F(1,11) = 5.0374
    Prob > F(1,11) = 0.0463
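The equivalence demonstrated in the technical note (with one variable, the 1-group T-squared is the square of the one-sample t statistic, since T-squared = n xbar^2 / s^2, and the F conversion with k = 1 leaves it unchanged) can be sketched numerically. The sample below is hypothetical, not the manual's data:

```python
import math

# Hypothetical sample of 12 differences (illustrative only)
x = [2.0, -1.0, 3.5, 0.5, 4.0, 1.0, -2.0, 2.5, 3.0, 0.0, 1.5, 2.0]
n = len(x)

mean = sum(x) / n
var = sum((v - mean) ** 2 for v in x) / (n - 1)   # sample variance

t = mean / math.sqrt(var / n)       # one-sample t statistic for H0: mean = 0
T2 = n * mean ** 2 / var            # 1-group Hotelling's T-squared with k = 1
F = (n - 1) / ((n - 1) * 1) * T2    # the F conversion with k = 1 leaves T2 unchanged
```

Whatever the data, T2 equals t squared here, which is exactly the relationship between the t = 2.2444 and T-squared = 5.0373832 values printed above.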
Example

Now consider a variation of the experiment: rather than using 12 cars and running each car with and without the fuel additive, you run 24 cars, 12 with the additive and 12 without. You have the following dataset:

. describe

Contains data from gasexp2.dta
  obs:            24
 vars:             4                          8 Sep 2000 12:19
 size:           480 (97.4% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              float  %9.0g                  car id
mpg1            float  %9.0g                  track 1
mpg2            float  %9.0g                  track 2
additive        float  %9.0g       yesno      additive?
-------------------------------------------------------------------------------
Sorted by:

. tab additive

  additive? |      Freq.     Percent        Cum.
------------+-----------------------------------
         no |         12       50.00       50.00
        yes |         12       50.00      100.00
------------+-----------------------------------
      Total |         24      100.00

This is an unpaired experiment because there is no natural pairing of the cars; we want to test that the means of mpg1 are equal for the two groups specified by additive, as are the means of mpg2:

(Continued on next page)
icd9 -- ICD-9-CM diagnostic and procedure codes

icd9[p] clean verifies that the variable contains valid ICD-9 codes and, if it does, modifies the variable to contain the codes in either of two standard formats. Use of icd9[p] clean is optional; all icd9[p] commands work equally well with cleaned or uncleaned codes. There are numerous ways of writing the same ICD-9 code, and icd9[p] clean is designed (1) to ensure consistency and (2) to make subsequent output look better.

icd9[p] generate produces new variables based on existing variables containing (cleaned or uncleaned) ICD-9 codes. icd9[p] generate, main produces newvar containing the main code. icd9[p] generate, description produces newvar containing a textual description of the ICD-9 code. icd9[p] generate, range() produces numeric newvar containing 1 if varname records an ICD-9 code in the range listed and 0 otherwise.

icd9[p] lookup and icd9[p] search are utility routines that are useful interactively. icd9[p] lookup simply displays descriptions of the codes specified on the command line, so to find out what diagnostic e913.1 means, you can type icd9 lookup e913.1. The data you have in memory are irrelevant, and remain unchanged, when using icd9[p] lookup. icd9[p] search is like icd9[p] lookup except that it turns the problem around; icd9[p] search looks for relevant ICD-9 codes from the description given on the command line. For instance, you could type icd9 search liver or icd9p search liver to obtain a list of codes containing the word "liver".

icd9[p] query displays the identity of the source from which the ICD-9 codes were obtained and the textual description that icd9[p] uses.

Note that ICD-9 codes are commonly written in two ways, with and without periods. For instance, with diagnostic codes, you can write 001, 86221, E8008, and V822, or you can write 001., 862.21, E800.8, and V82.2. With procedure codes, you can write 01, 50, 502, and 5021, or 01., 50., 50.2, and 50.21. The icd9[p] commands do not care which syntax you use or even whether you are consistent. Case also is irrelevant: V822, V82.2, v822, and v82.2 are all equivalent. Codes may be recorded with or without leading and trailing blanks.
Options for use with icd9[p] check

any tells icd9[p] check to verify that the codes fit the format of ICD-9 codes but not to check whether the codes are actually valid. This makes icd9[p] check run faster. For instance, diagnostic code 230.52 (or 23052 if you prefer) looks valid, but there is no such ICD-9 code. Without the any option, 230.52 (or 23052) would be flagged as an error. With any, 230.52 (and 23052) is not considered an error.

list tells icd9[p] check that invalid codes found in the data (1, 1.1.1, and perhaps 230.52 if any is not specified) are to be individually listed.

generate(newvar) specifies that icd9[p] check is to create new variable newvar containing, for each observation, 0 if the code is valid and a number from 1 to 10 otherwise. The positive numbers indicate the kind of problem and correspond to the listing produced by icd9[p] check. For instance, 10 means the code could be valid, but it turns out not to be on the official list.
Options for use with icd9[p] clean

dots specifies whether periods are to be included in the final format. Do you want the diagnostic codes recorded, for instance, as 86221 or 862.21? Without the dots option, the 86221 format would be used. With the dots option, the 862.21 format would be used.

pad specifies that the codes are to be padded with spaces, front and back, to make the codes line up vertically in listings. Specifying pad makes the resulting codes look better when used with most other Stata commands.
Options for use with icd9[p] generate

main, description, and range(icd9rangelist) specify what icd9[p] generate is to calculate. In all cases, varname specifies a variable containing ICD-9 codes.

main specifies that the main code is to be extracted from the ICD-9 code. For procedure codes, the main code is the first two characters. For diagnostic codes, the main code is usually the first three or four characters (the characters before the dot if the code has dots). In any case, icd9[p] generate does not care whether the code is padded with blanks in front or how strangely it might be written; icd9[p] generate will find the main code and extract it. The resulting variable is itself an ICD-9 code and may be used with the other icd9[p] subcommands. This includes icd9[p] generate, main.

description creates newvar containing descriptions of the ICD-9 codes.

long is for use with description. It specifies that the new variable, in addition to containing the text describing the code, is to contain the code, too. Without long, newvar in an observation might contain "bronchus injury-closed". With long, it would contain "862.21 bronchus injury-closed".

end modifies long (specifying end implies long) and places the code at the end of the string: "bronchus injury-closed 862.21".

range(icd9rangelist) allows you to create indicator variables equal to 1 when the ICD-9 code is in the inclusive range specified.

Options for use with icd9[p] search

or specifies that ICD-9 codes are to be searched for entries that contain any of the words specified after icd9[p] search. The default is to list only entries that contain all the words specified.
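The main-code extraction rule that icd9[p] generate, main applies to diagnostic codes (take what precedes the dot; with no dot, take the first three characters, or four for E-codes) can be restated as a small function. The Python sketch below is our illustration only, not part of icd9; in Stata you would simply use icd9 generate, main:

```python
def main_code(code):
    """Extract the ICD-9 diagnostic main code, tolerating blanks and case."""
    c = code.strip()
    if '.' in c:
        return c.split('.', 1)[0].upper()
    # No dot: E-codes have a 4-character main code, all others 3 characters
    n = 4 if c[:1].lower() == 'e' else 3
    return c[:n].upper()

# The variously written forms of a code all yield the same main code
assert main_code(' 862.21 ') == '862'
assert main_code('86221') == '862'
assert main_code('e8008') == 'E800'
assert main_code('V82.2') == 'V82'
```

Note this is the usual rule only; as the Remarks below explain, a handful of main codes (176, 764, 765, V29, V69) are not themselves defined in the official list.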
Remarks

Let us begin with diagnostic codes, the codes icd9 processes. The format of an ICD-9 diagnostic code is

    [blanks]{0-9,V,v}{0-9}{0-9}[.][0-9[0-9]][blanks]

or

    [blanks]{E,e}{0-9}{0-9}{0-9}[.][0-9[0-9]][blanks]

icd9 can deal with ICD-9 diagnostic codes written in any of the ways the above allows. Items in square brackets are optional; the code might start with some number of blanks. Braces { } indicate required items. The code then either has a digit from 0 to 9 or the letter V (uppercase or lowercase) (first line), or the letter E (uppercase or lowercase) (second line). After that, it has two or more digits, perhaps followed by a period and then up to two more digits, perhaps followed by more blanks.

All of the following meet the above definition:

    001
    001.
    001.9
    0019
    862.2
    862.22
    E800.2
    e8092
    V82.2
Meeting the above definition does not make the code valid. There are 233,100 possible codes meeting the above definition, of which 15,186 are currently defined. Examples of currently defined diagnostic codes include
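The 233,100 figure follows directly from the format: the first line allows 11 choices for the leading character (a digit or V) followed by two digits, the second allows E plus three digits, and either form may carry zero, one, or two trailing digits. The Python snippet below checks that arithmetic and also restates the format as a regular expression; the regex is our restatement for illustration, not part of icd9:

```python
import re

# Trailing-digit possibilities after the main part: none, one digit, or two
suffixes = 1 + 10 + 100                        # 111

possible = (11 * 10 * 10) * suffixes \
         + (10 * 10 * 10) * suffixes           # {0-9,V}dd...  plus  Eddd...
print(possible)   # 233100

# The two format lines as a case-insensitive regex (period optional either way)
fmt = re.compile(r'^\s*(?:[0-9V][0-9]{2}|E[0-9]{3})\.?[0-9]{0,2}\s*$', re.I)

for good in ['001', '001.9', '0019', '86222', '862.22', 'E800.2', 'e8092', 'V82.2']:
    assert fmt.match(good)
for bad in ['1.1.1', 'W123', 'E80']:
    assert not fmt.match(bad)
```

Note that 230.52 also matches this format; format validity and being on the official list are separate checks, which is exactly the distinction the any option draws.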
    code      description
    -----------------------------------------
    001       cholera*
    001.0     cholera d/t vib cholerae
    001.1     cholera d/t vib el tor
    001.9     cholera nos
     ...
    999       complic medical care nec*
    V01       communicable dis contact*
    V01.0     cholera contact
    V01.1     tuberculosis contact
    V01.2     poliomyelitis contact
    V01.3     smallpox contact
    V01.4     rubella contact
    V01.5     rabies contact
    V01.6     venereal dis contact
    V01.7     viral dis contact nec
    V01.8     communic dis contact nec
    V01.9     communic dis contact nos
     ...
    E800      rr collision nos*
    E800.0    rr collision nos-employ
    E800.1    rr coll nos-passenger
    E800.2    rr coll nos-pedestrian
    E800.3    rr coll nos-ped cyclist
    E800.8    rr coll nos-person nec
    E800.9    rr coll nos-person nos
     ...

"Main codes" refer to the part of the code to the left of the period. 001, 002, ..., 999, V01, ..., V82, and E800, ..., E999 are main codes. There are 1,182 diagnostic main codes.

The main code corresponding to a detailed code can be obtained by taking the part of the code to the left of the period, except for codes beginning with 176, 764, 765, V29, and V69. Those main codes are not defined, yet there are more detailed codes under them:
    code       description
    ------------------------------------------------------------------
    176        CODE DOES NOT EXIST, but 8 codes starting with 176 do exist:
    176.0      skin - kpsi's sarcoma
    176.1      sft tisue - kpsi's srcma
    ...
    764        CODE DOES NOT EXIST, but 44 codes starting with 764 do exist:
    764.0      lt-for-dates w/o fet mal*
    764.00     light-for-dates wtnos
    ...
    765        CODE DOES NOT EXIST, but 22 codes starting with 765 do exist:
    765.0      extreme immaturity*
    765.00     extreme immatur wtnos
    ...
    V29        CODE DOES NOT EXIST, but 6 codes starting with V29 do exist:
    V29.0      nb obsrv suspct infect
    V29.1      nb obsrv suspct neurlgcl
    ...
    V69        CODE DOES NOT EXIST, but 6 codes starting with V69 do exist:
    V69.0      lack of physical exercse
    V69.1      inapprt diet eat habits
    ...
Our solution is to define five new codes:

    code       description
    -----------------------------------------
    176        kaposi's sarcoma (Stata)*
    764        light-for-dates (Stata)*
    765        immat & preterm (Stata)*
    V29        nb suspct cnd (Stata)*
    V69        lifestyle (Stata)*
Thus, there are 15,186 + 5 = 15,191 diagnostic codes, of which 1,181 + 5 = 1,186 are main codes.

Things are less confusing with respect to the procedure codes processed by icd9p. The format of ICD-9 procedure codes is

    [blanks]{0-9}{0-9}[.][0-9[0-9]][blanks]

Thus, there are 10,000 possible procedure codes, of which 4,275 are currently valid. The first two digits represent the main code, of which 100 are feasible and 98 are currently used (00 and 17 are not used).
Descriptions

The descriptions given for each of the codes are as found in the original source. The diagnostic codes contain the addition of the five new codes by us. An asterisk on the end of a description indicates that the corresponding ICD-9 code has subcategories.

icd9[p] query reports the original source of the information on the codes:
. icd9 query
 _dta:
   Dataset obtained 24aug1999 from http://www.hcfa.gov/stats/pufiles.htm
   file http://www.hcfa.gov/stats/icd9v16.exe
   Codes 176, 764, 765, V29, and V69 defined
     176  kaposi's sarcoma (Stata)*
     764  light-for-dates (Stata)*
     765  immat & preterm (Stata)*
     V29  nb suspct cnd (Stata)*
     V69  lifestyle (Stata)*

. icd9p query
 _dta:
   Dataset obtained 24aug1999 from http://www.hcfa.gov/stats/pufiles.htm
   file http://www.hcfa.gov/stats/icd9v16.exe
Example

You have a dataset containing up to three diagnostic codes and up to two procedure codes on a sample of 1,000 patients:

. use patients, clear
. list in 1/10
  (output omitted)
Do not try to make sense of these data because, in constructing this example, the diagnostic and procedure codes were randomly selected.

Begin by noting that variable diag1 is recorded sloppily--sometimes the dot notation is used and sometimes not, and sometimes there are leading blanks. That does not matter. We decide to begin by using icd9 clean to clean up this variable:

. icd9 clean diag1
diag1 contains invalid ICD-9 codes
r(459);

icd9 clean refused because there are invalid codes among the 1,000 observations. We can use icd9 check to find and flag the problem observations (or observation, as in this case):
                                                       Number of obs =     664
                                                       F(  6,   657) =  212.55
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.6600
                                                       Adj R-squared =  0.6569
                                                       Root MSE      =  .26228

------------------------------------------------------------------------------
      ln_eat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
ln_rsales_pc |   .6611241    .026623    24.83   0.000     .6088476    .7134006
     jantemp |   .0019624   .0007601     2.58   0.010     .0004698     .003455
precipitat~n |  -.0014811   .0008433    -1.70   0.090    -.0030869    .0001247
   ln_income |   .1158486    .056352     2.06   0.040     .0051969    .2265003
  median_age |  -.0010863   .0002823    -3.85   0.000    -.0016407   -.0005319
      hhsize |  -.0050407   .0004243   -11.88   0.000    -.0058739   -.0042076
       _cons |  -1.377592   .4777641    -2.88   0.004     -2.31572    -.459463
------------------------------------------------------------------------------
Despite having data on 898 cities, your regression was estimated on only 664 cities--74% of the original 898. Some 234 observations were unused due to missing data. In this case, when you type summarize, you discover that each of the independent variables has missing values, so the problem is not that one variable is missing in 26% of the observations, but that each of the variables is missing in some observations. In fact, summarize reveals that each of the variables is missing in roughly 5% of the observations. We lost 26% of our data because, in aggregate, 26% of the observations have one or more missing variables. Thus, we impute each independent variable on the basis of the other independent variables:

. impute ln_rtl jantemp precip ln_inc medage hhsize, gen(i_ln_rtl)
4.90% (44) observations imputed
. impute jantemp ln_rtl precip ln_inc medage hhsize, gen(i_jantmp)
5.90% (53) observations imputed
impute -- Predict missing values
. impute precip ln_rtl jantemp ln_inc medage hhsize, gen(i_precip)
4.56% (41) observations imputed
. impute ln_inc ln_rtl jantemp precip medage hhsize, gen(i_ln_inc)
4.34% (39) observations imputed
. impute medage ln_rtl jantemp precip ln_inc hhsize, gen(i_medage)
4.45% (40) observations imputed
. impute hhsize ln_rtl jantemp precip ln_inc medage, gen(i_hhsize)
5.23% (47) observations imputed
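The idea behind these commands can be sketched in a few lines of Python. This is our illustration of regression-based single imputation with one predictor, not Stata's implementation, which regresses on whichever predictors are nonmissing observation by observation:

```python
# Our sketch of regression-based imputation (not Stata's impute):
# fit OLS of y on x over the complete cases, then replace each
# missing y (represented as None) with its fitted value.
def impute_simple(y, x):
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mx = sum(xi for xi, _ in pairs) / len(pairs)
    my = sum(yi for _, yi in pairs) / len(pairs)
    b = (sum((xi - mx) * (yi - my) for xi, yi in pairs)
         / sum((xi - mx) ** 2 for xi, _ in pairs))
    a = my - b * mx
    return [yi if yi is not None else a + b * xi for xi, yi in zip(x, y)]
```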
That done, we can now re-estimate the regression on the imputed variables:

. regress ln_eat i_ln_rtl i_jantmp i_precip i_ln_inc i_medage i_hhsize

      Source |       SS       df       MS              Number of obs =     898
-------------+------------------------------           F(  6,   891) =  253.41
       Model |  108.859231     6  18.1432051           Prob > F      =  0.0000
    Residual |  63.7929145   891  .071596986           R-squared     =  0.6305
-------------+------------------------------           Adj R-squared =  0.6280
       Total |  172.652145   897  .192477308           Root MSE      =  .26758

------------------------------------------------------------------------------
      ln_eat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    i_ln_rtl |   .6609061   .0245827    26.89   0.000     .6126593    .7091528
    i_jantmp |   .0021019   .0006932     3.03   0.002     .0007414    .0034625
    i_precip |  -.0013268   .0007646    -1.74   0.083    -.0028275    .0001739
    i_ln_inc |    .095863   .0510231     1.88   0.061    -.0042764    .1960024
    i_medage |  -.0011234   .0002584    -4.35   0.000    -.0016304   -.0006163
    i_hhsize |  -.0052508   .0003953   -13.28   0.000    -.0060267    -.004475
       _cons |  -1.143142   .4304284    -2.66   0.008    -1.987914   -.2983702
------------------------------------------------------------------------------
Note that the regression is now estimated on all 898 observations.

Example

impute can also be used with factor to extend factor score estimates to cases with missing data. For instance, we have a variant of the automobile dataset (see [U] 9 Stata's on-line tutorials and sample datasets) that contains a few additional variables. We will begin by factoring all but the price variable; see [R] factor.
. factor mpg-foreign, factors(4)
(obs=66)
             (principal factors; 4 factors retained)
      Factor   Eigenvalue   Difference   Proportion   Cumulative
 ----------------------------------------------------------------
        1       6.99066      5.59538       0.7596       0.7596
        2       1.39528      0.80576       0.1516       0.9112
        3       0.58952      0.29082       0.0641       0.9753
        4       0.29870      0.05618       0.0325       1.0077
        5       0.24252      0.11654       0.0264       1.0341
        6       0.12598      0.08970       0.0137       1.0478
        7       0.03628      0.05085       0.0039       1.0517
        8      -0.01457      0.01275      -0.0016       1.0501
        9      -0.02732      0.02860      -0.0030       1.0472
       10      -0.05591      0.05736      -0.0061       1.0411
       11      -0.11327      0.00564      -0.0123       1.0288
       12      -0.11891      0.02714      -0.0129       1.0159
       13      -0.14605          .        -0.0159       1.0000
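The Proportion column is each eigenvalue divided by the sum of all the eigenvalues, which is why the trailing proportions are negative and the cumulative column returns to 1.0000. A quick check of that arithmetic (our sketch, not factor's code):

```python
# Our check of the Proportion column above: each eigenvalue divided
# by the sum of all eigenvalues.
def proportions(eigenvalues):
    total = sum(eigenvalues)
    return [e / total for e in eigenvalues]
```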
                                Factor Loadings
        Variable |        1         2         3         4   Uniqueness
    -------------+----------------------------------------------------
             mpg | -0.78200  -0.02985  -0.06546   0.33951      0.26803
           rep78 | -0.51076   0.68322  -0.11181  -0.01428      0.25963
           rep77 | -0.27332   0.70653  -0.32005   0.04710      0.32145
        headroom |  0.56480   0.26549   0.29651   0.16485      0.49542
       rear_seat |  0.66135   0.20473   0.36471   0.02062      0.38727
           trunk |  0.72935   0.37095   0.28176   0.12140      0.23633
          weight |  0.95127   0.10135  -0.18056  -0.09179      0.04378
          length |  0.94621   0.19595  -0.05372  -0.10325      0.05274
            turn |  0.88264  -0.05607  -0.08502   0.01169      0.21043
    displacement |  0.92199   0.06333  -0.17349  -0.02554      0.11518
      gear_ratio | -0.82782   0.06672   0.24558  -0.10994      0.23787
           order | -0.25907   0.15344   0.01622   0.14668      0.88756
         foreign | -0.75728   0.30756   0.19130  -0.29188      0.21014
There appear to be two factors here. Let's pretend that we have given the first two factors an interpretation; we might interpret the first factor as size. We now obtain the factor scores:
. score f1 f2
(based on unrotated factors)
(2 scorings not used)

               Scoring Coefficients
        Variable |        1         2
    -------------+--------------------
             mpg | -0.02094   0.11107
           rep78 | -0.03224   0.44562
           rep77 | -0.11150   0.27942
        headroom |  0.05530   0.10017
       rear_seat |  0.03355   0.02812
           trunk |  0.04603   0.20622
          weight |  0.12250  -0.13040
          length |  0.39997   0.60223
            turn |  0.04562  -0.12825
    displacement |  0.19281   0.11611
      gear_ratio | -0.08534   0.03528
           order |  0.00638   0.06433
         foreign | -0.06469   0.28292

Although it is not revealed by this output, in 8 cases the scores could not be calculated because of missing values (we would see that if we typed summarize). To impute the factor scores to all the observations:
. impute f1 mpg-foreign, gen(i_f1)
10.81% (8) observations imputed
. impute f2 mpg-foreign, gen(i_f2)
10.81% (8) observations imputed
we _ ight
now
run
a regression
of price
(Continued
in terms
on next
of the
page)
two
thctors:
the
be calculated factor
scores
because
of
to all the
. regress price i_f1 i_f2

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =   11.88
       Model |   159223103     2  79611551.5           Prob > F      =  0.0000
    Residual |   475842293    71  6702004.13           R-squared     =  0.2507
-------------+------------------------------           Adj R-squared =  0.2296
       Total |   635065396    73  8699525.97           Root MSE      =  2588.8

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        i_f1 |   1225.347   315.7177     3.88   0.000     595.8234     1854.87
        i_f2 |   911.2878   339.9821     2.68   0.009     233.3827    1589.193
       _cons |   6262.285   301.7093    20.76   0.000     5660.694    6863.877
------------------------------------------------------------------------------
Methods and Formulas

impute is implemented as an ado-file.

Consider the command

    . impute y x1 x2 ... xk, gen(yhat) varp(v)

When y_j is not missing, yhat_j = y_j and v_j = 0.

Let y_j be an observation for which y is missing. A regressor list is formed from x1, x2, ..., xk containing all x's for which x_ij is not missing. If the resulting list is empty, yhat_j and v_j are set to missing. Otherwise, a regression of y on the list is estimated (see [R] regress) and yhat_j is defined as the predicted value of y_j (see [R] predict). v_j is defined as the square of the standard error of the prediction, as calculated by predict, stdp; see [R] predict.
References

Goldstein, R. 1996. sed10: Patterns of missing data. Stata Technical Bulletin 32: 12-13. Reprinted in Stata Technical Bulletin Reprints, vol. 6, p. 15.

------. 1996. sed10.1: Patterns of missing data, update. Stata Technical Bulletin 33: 2. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 15-16.

Little, R. J. A. and D. B. Rubin. 1987. Statistical Analysis with Missing Data. New York: John Wiley & Sons.

Mander, A. and D. Clayton. 1999. sg116: Hotdeck imputation. Stata Technical Bulletin 51: 32-34. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 196-199.

------. 2000. sg116.1: Update to hotdeck imputation. Stata Technical Bulletin 54: 26. Reprinted in Stata Technical Bulletin Reprints, vol. 9, p. 199.
Also See

Complementary:    [R] predict

Related:          [R] ipolate, [R] regress
Title

    infile -- Quick reference for reading data into Stata

Description

This entry provides a quick reference for determining which method to use for reading non-Stata data into memory. See [U] 24 Commands to input data for more details.

Remarks

Summary of the different methods

insheet

o  insheet reads text (ASCII) files created by a spreadsheet or a database program.

o  The data must be tab-separated or comma-separated, but not both simultaneously, nor can it be space-separated.

o  A single observation must be on only one line.
o  The first line in the file can optionally contain the names of the variables.

infile (free format) -- infile without a dictionary

o  The data can be space-separated, tab-separated, or comma-separated.

o  Strings with embedded spaces or commas must be enclosed in quotes (even if tab- or comma-separated).

o  A single observation can be on more than one line or there can even be multiple observations per line.

infix (fixed format)

o  The data must be in fixed-column format.

o  A single observation can be on more than one line.

o  infix has simpler syntax than infile (fixed format).

infile (fixed format) -- infile with a dictionary

o  The data may be in fixed-column format.

o  A single observation can be on more than one line.

o  infile (fixed format) has the most capabilities for reading data.
Examples

Example

        ---- top of exampl.raw ----
1       0       1       John Smith      m
0       0       1       Paul Lin        m
0       1       0       Jan Doe         f
0       0       .       Julie McDonald  f
        ---- end of exampl.raw ----

contains tab-separated data. The type command with the showtabs option shows the tabs:

. type exampl.raw, showtabs
1<T>0<T>1<T>John Smith<T>m
0<T>0<T>1<T>Paul Lin<T>m
0<T>1<T>0<T>Jan Doe<T>f
0<T>0<T>.<T>Julie McDonald<T>f
It could be read in by

. insheet a b c name gender using exampl
Example

        ---- top of examp2.raw ----
a,b,c,name,gender
1,0,1,John Smith,m
0,0,1,Paul Lin,m
0,1,0,Jan Doe,f
0,0,,Julie McDonald,f
        ---- end of examp2.raw ----

could be read in by

. insheet using examp2
Example

        ---- top of examp3.raw ----
1       0       1       "John Smith"        m
0       0       1       "Paul Lin"          m
0       1       0       "Jan Doe"           f
0       0       .       "Julie McDonald"    f
        ---- end of examp3.raw ----

contains tab-separated data with strings in double quotes.

. type examp3.raw, showtabs
Example

The _line() and _lines() directives instruct Stata how to read your data when there are multiple records per observation. You have the following in mydata2.raw:

        ---- top of mydata2.raw ----
id income educ sex age
1024 25000 HS
Male
28
1025 27000 C
Female
24
1035 26000 HS
Male
32
1036 25000 C
Female
25
        ---- end of mydata2.raw ----

You can read this with dictionary mydata2.dct, which we will just let Stata list as it simultaneously reads the data:

. infile using mydata2, clear
infile (fixed format) -- Read ASCII (text) data in fixed format with a dictionary

infile dictionary using mydata2 {
    _first(2)          * Begin reading on line 2.
    _lines(3)          * Each observation takes 3 lines.
                       * Since _line is not specified, Stata
                       * assumes that it is 1.
    int    id      "Identification Number"
           income  "Annual income"
    str2   educ    "Highest educ level"
    _line(2)           * Go to line 2 of the observation.
    str6   sex         * (values for sex are located on line 2)
    _line(3)           * Go to line 3 of the observation.
    int    age         * (values for age are located on line 3)
}
(4 observations read)
. list

          id   income   educ      sex   age
 1.     1024    25000     HS     Male    28
 2.     1025    27000      C   Female    24
 3.     1035    26000     HS     Male    32
 4.     1036    25000      C   Female    25
Now, here is the really good part: we read these variables in order, but that was not necessary. We could just as well have used the dictionary

        ---- top of mydata2p.dct ----
infile dictionary using mydata2 {
    _first(2)
    _lines(3)
    _line(1)
    int    id      "Identification number"
           income  "Annual income"
    str2   educ    "Highest educ level"
    _line(3)
    int    age
    _line(2)
    str6   sex
}
        ---- end of mydata2p.dct ----

We would obtain the same results--and just as quickly--the only difference being that our variables in the final dataset would be in the order specified: id, income, educ, age, and sex.
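The bookkeeping that _first() and _lines() imply can be sketched outside Stata as well. This Python illustration (ours, not infile's code) groups raw lines into three-line observations after skipping a one-line header:

```python
# Our sketch of _first(2) and _lines(3): skip to the first data line,
# then take each run of three lines as one observation.
def group_records(lines, first=2, per_obs=3):
    body = lines[first - 1:]          # _first(2): data start on line 2
    return [body[i:i + per_obs] for i in range(0, len(body), per_obs)]
```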
Technical Note

You can use _newline to specify where breaks occur, if you prefer:

        ---- top of highway.dct, example 5 ----
infile dictionary {
    acc_rate  "Acc. Rate/Million Miles"
    spdlimit  "Speed Limit (mph)"
    _newline
    acc_pts   "Access Pts/Mile"
}
4.58 55
4.6
2.86 60
4.4
1.61 .
2.2
3.02 60
4.7
        ---- end of highway.dct, example 5 ----

The line that reads '1.61 .' could have been read 1.61 (without the period), and the results would have been unchanged. Since dictionaries do not go to new lines automatically, a missing value is assumed for all values not found in the record.
Reading fixed-format files

Values in formatted data are sometimes packed one against the other with no intervening blanks. For instance, the highway data might appear as

        ---- top of highway.raw, example 6 ----
4.58554.6
2.86604.4
1.61  2.2
3.02604.7
        ---- end of highway.raw, example 6 ----

The first four columns of each record represent the accident rate; the next two columns, the speed limit; and the last three columns, the number of access points per mile.

To read these data, you must specify the %infmt in the dictionary. Numeric %infmts are denoted by a leading percent sign (%) followed optionally by a string of the form w or w.d, where w and d stand for two integers. The first integer, w, specifies the width of the format. The second integer, d, specifies the number of digits that are to follow the decimal point. Logic requires that d be less than or equal to w. Finally, a character denoting the format type (f, g, or e) is appended. For example, %9.2f specifies an f format that is nine characters wide and has two digits following the decimal point.
Numeric formats

The f format indicates that infile is to attempt to read the data as a number. When you do not specify the %infmt in the dictionary, infile assumes the %f format. The missing width w means that infile is to attempt to read the data in free format.

At the start of each observation, infile reads a record into its buffer and sets a column pointer to 1, indicating that it is currently on the first column. When infile processes a %f format, it moves the column pointer forward through white space. It then collects the characters up to the next occurrence of white space and attempts to interpret those characters as a number. The column pointer is left at the first occurrence of white space following those characters. If the next variable is also free format, the logic repeats.

When you explicitly specify the field width w, as in %wf, infile does not skip leading white space. Instead, it collects the next w characters starting at the column pointer and attempts to interpret the result as a number. The column pointer is left at the old value of the column pointer plus w, that is, on the first character following the specified field.
Example

If the data above are stored in highway.raw, you could create the following dictionary to read the data:

        ---- top of highway.dct, example 6 ----
infile dictionary using highway {
    acc_rate  %4f  "Acc. Rate/Million Miles"
    spdlimit  %2f  "Speed Limit (mph)"
    acc_pts   %3f  "Access Pts/Mile"
}
        ---- end of highway.dct, example 6 ----

When you explicitly indicate the field width, infile does not skip intervening characters. The first four columns are used for the variable acc_rate, the next two for spdlimit, and the last three for acc_pts.
# lines states the number of lines per observation in the file. Simple datasets typically have 1 lines. Large datasets often have many lines (sometimes called records) per observation. # lines is optional even when there is more than one line per observation because infix can sometimes figure it out for itself. Still, if # lines is not right for your data, it is best to specify the directive.

# lines appears only once in the specifications.

#: tells infix to jump to line # of the observation. Consider a file with 4 lines, meaning four lines per observation. 2: says to go to the second line of the observation. 4: says to go to the fourth line of the observation. You may jump forward or backward: infix does not care, nor is there any inefficiency in going forward to 3:, reading a few variables, jumping back to 1:, reading another variable, and jumping back again to 3:.

It is not your responsibility to ensure that, at the end of your specification, you are on the last line of the observation. infix knows how to get to the next observation because it knows where you are and it knows # lines, the total number of lines per observation.

#: may appear, and typically does, many times in the specifications.

/ is an alternative to #:. / goes forward one line. // goes forward two lines. We do not recommend the use of / because #: is better. If you are currently on line 2 of an observation and want to get to line 6, you could type ////, but your meaning is clearer if you type 6:.

/ may appear many times in the specifications.

[byte|int|float|long|double|str] varlist [#:]#[-#] instructs infix to read a variable and, sometimes, more than one.

Begin by realizing that the simplest form of this is varname #, such as sex 20. That says that variable varname is to be read from column # of the current line: variable sex is to be read from column 20, and here, sex is a one-digit number.

varname #-#, such as age 21-23, says to read the variable from the column range specified; read age from columns 21 through 23, and here, age is a three-digit number.

You can prefix the variable with a storage type. str name 25-44 means to read the string variable name from columns 25 through 44. If you do not specify str, the variable is assumed to be numeric. You can specify the numeric subtype if you wish.
infix (fixed format) -- Read ASCII (text) data in fixed format

You can specify more than one variable, with or without a type. byte q1-q5 51-55 means read variables q1, q2, q3, q4, and q5 from columns 51 through 55 and store the five variables as bytes.

Finally, you can specify the line on which the variable(s) appear. age 2:21-23 says that age is to be obtained from the second line, columns 21 through 23. Another way to do this is to put together the #: directive with the input-variable directive: 2: age 21-23. There is a difference, but not with respect to reading the variable age. Let's consider two alternatives:

    1:  str name 25-44  age 2:21-23  q1-q5 51-55
    1:  str name 25-44  2:  age 21-23  q1-q5 51-55

The difference is that the first directive says variables q1 through q5 are on line 1 whereas the second says they are on line 2. When the colon is put out front, it says on which line variables are to be found when we do not explicitly say otherwise. When the colon is put inside, it applies only to the variable under consideration.
Remarks

There are two ways to use infix. One is to type the specifications that describe how to read the fixed-format data on the command line:

. infix acc_rate 1-4 spdlimit 6-7 acc_pts 9-11 using highway.raw

The other is to type the specifications into a file,

        ---- top of highway.dct, example 1 ----
infix dictionary using highway.raw {
    acc_rate  1-4
    spdlimit  6-7
    acc_pts   9-11
}
        ---- end of highway.dct, example 1 ----

and then, inside Stata, type

. infix using highway.dct

Which you use makes no difference to Stata. The first form is more convenient if there are only a few variables, and the second form is less prone to error if you are reading a big, complicated file.

The second form allows two variations: the one we just showed--where the data are in another file--and one where the data are in the same file as the dictionary:
        ---- top of highway.dct, example 2 ----
infix dictionary {
    acc_rate  1-4
    spdlimit  6-7
    acc_pts   9-11
}
4.58 55 .46
2.86 60 4.4
1.61    2.2
3.02 60 4.7
        ---- end of highway.dct, example 2 ----

Note that in the first example, the top line of the file read infix dictionary using highway.raw whereas in the second the top line reads simply infix dictionary. When you do not say where the data are, it is implied that the data follow the dictionary.
Example

So let's complete the example we started. You have a dataset on the accident rate per million vehicle miles along a stretch of highway, the speed limit on that highway, and the number of access points per mile. You have created the dictionary file highway.dct, which contains the dictionary and the data:

        ---- top of highway.dct, example 2 ----
infix dictionary {
    acc_rate  1-4
    spdlimit  6-7
    acc_pts   9-11
}
4.58 55 .46
2.86 60 4.4
1.61    2.2
3.02 60 4.7
        ---- end of highway.dct, example 2 ----

You created this file outside of Stata using an editor or word processor. Inside Stata, you now read the data. infix lists the dictionary so you will know the directives it follows:

. infix using highway
infix dictionary {
    acc_rate  1-4
    spdlimit  6-7
    acc_pts   9-11
}
(4 observations read)

. list

       acc_rate   spdlimit   acc_pts
  1.       4.58         55       .46
  2.       2.86         60       4.4
  3.       1.61          .       2.2
  4.       3.02         60       4.7
Note that we simply typed infix using highway rather than infix using highway.dct. When we do not specify the file extension, infix assumes we mean .dct.

Example

Consider the following raw data:
        ---- top of mydata.raw ----
id income educ / sex age / rcode, answers to questions 1-5
1024 25000 HS
     Male   28
     1 1 9 5 0 3
1025 27000 C
     Female 24
     0 2 2 1 1 3
1035 26000 HS
     Male   32
     1 1 0 3 2 1
1036 25000 C
     Female 25
     1 3 1 2 3 2
        ---- end of mydata.raw ----

This dataset has 3 lines per observation, and the first line is just a comment. One possible set of specifications to read these data is

        ---- top of mydata1.dct ----
infix dictionary using mydata {
    2 first
    3 lines
    1:  id        1-4
        income    6-10
        str educ  12-13
    2:  str sex   6-11
        int age   13-14
    3:  rcode     6
        q1-q5     7-16
}
        ---- end of mydata1.dct ----

although we prefer
        ---- top of mydata2.dct ----
infix dictionary using mydata {
    2 first
    3 lines
    id        1:  1-4
    income    1:  6-10
    str educ  1:  12-13
    str sex   2:  6-11
    int age   2:  13-14
    rcode     3:  6
    q1-q5     3:  7-16
}
        ---- end of mydata2.dct ----

Either will read these data, so we will use the first and then explain why we prefer the second.

. infix using mydata1
infix dictionary using mydata {
    2 first
    3 lines
    1:  id        1-4
        income    6-10
        str educ  12-13
    2:  str sex   6-11
        int age   13-14
    3:  rcode     6
        q1-q5     7-16
}
(4 observations read)

. list in 1/2

Observation 1

        id  1024        income  25000        educ  HS
       sex  Male           age  28          rcode  1
        q1  1               q2  9              q3  5
        q4  0               q5  3

Observation 2

        id  1025        income  27000        educ  C
       sex  Female         age  24          rcode  0
        q1  2               q2  2              q3  1
        q4  1               q5  3
Now, what is better about the second? What is better is that the location of each variable is completely documented on each line, in terms of both line number and column. Since infix does not care about the order in which we read the variables, we could take the dictionary, jumble the lines, and it would still work. For instance,

        ---- top of mydata3.dct ----
infix dictionary using mydata {
    2 first
    3 lines
    str sex   2:  6-11
    rcode     3:  6
    str educ  1:  12-13
    int age   2:  13-14
    id        1:  1-4
    q1-q5     3:  7-16
    income    1:  6-10
}
        ---- end of mydata3.dct ----
will also read these data even though, for each observation, we start on line 2, go forward to line 3, jump back to line 1, and end up on line 1. It is not even inefficient to do this, because infix does not really jump to record 2, then record 3, then record 1 again, etc. infix takes what we say and organizes it efficiently. The order in which we say it makes no difference.

Well, it does make one: the order of the variables in the resulting Stata dataset will be the order we specify. In this case the reordering is senseless but, in real datasets, reordering variables is often desirable. Moreover, we often construct dictionaries, realize that we omitted a variable, and then go back and modify them. By making each line complete in and of itself, we can add new variables anywhere in the dictionary and not worry that, because of our addition, something that occurs later will no longer read correctly.

For instance, to read only the first 100 observations, you could type

. infix 1: id 1-6 str name 7-36 2: age 1-2 str sex 4 using emp1.raw in 1/100

Or, if the specifications were instead recorded in a dictionary and you wanted observations 101 through 573, you could type

. infix using emp2.dct in 101/573
Also See

Complementary:    [R] outfile, [R] outsheet, [R] save

Related:          [R] infile (fixed format), [R] insheet

Background:       [U] 24 Commands to input data, [R] infile
Title

    input -- Enter data from keyboard

Syntax

    input [varlist] [, automatic label]

Description

input allows you to type data directly into the dataset in memory. Also see [R] edit for a windowed alternative to input.
Options

automatic causes Stata to create value labels from the nonnumeric data it encounters. It also automatically widens the display format to fit the longest label. Specifying automatic implies label, even if you do not explicitly type the label option.

label allows you to type the labels (strings) instead of the numeric values for variables associated with value labels. New value labels are not automatically created unless automatic is specified.
Remarks

If there are no data in memory when you type input, you must specify a varlist. Stata will then prompt you to enter the new observations until you type end.

Example

You have data on the accident rate per million vehicle miles along a stretch of highway, along with the speed limit on that highway. You wish to type these data directly into Stata:

. input
nothing to input
r(104);

Typing input by itself does not provide enough information about your intentions. Stata needs to know the names of the variables you wish to create.

. input acc_rate spdlimit
          acc_rate   spdlimit
  1. 4.58 55
  2. 2.86 60
  3. 1.61 .
  4. end
We typed input acc_rate spdlimit, and Stata responded by repeating the variable names and then prompting us for the first observation. We then typed 4.58 and 55 and pressed Return. Stata prompted us for the second observation. We entered it and pressed Return. Stata prompted us for the third observation. We knew that the accident rate is 1.61 per million vehicle miles, but we did not know the corresponding speed limit for the highway. We typed the number we knew, 1.61, followed by a period, the missing value indicator. When we pressed Return, Stata prompted us for the fourth observation. We were finished entering our data, so we typed end in lowercase letters.

We can now list the data to verify that we have entered them correctly:

. list

        acc_rate   spdlimit
  1.        4.58         55
  2.        2.86         60
  3.        1.61          .

If you have data in memory and type input without a varlist, you will be prompted to enter additional information on all the variables. This continues until you type end.
Example

You now have an additional observation you wish to add to the dataset. Typing input by itself tells Stata that you wish to add new observations:

. input
          acc_rate   spdlimit
  4. 3.02 60
  5. end

Stata reminded us of the names of our variables and prompted us for the fourth observation. We entered the numbers 3.02 and 60 and pressed Return. Stata then prompted us for the fifth observation. We could add as many new observations as we wish. Since we needed to add only one observation, we typed end. Our dataset now has four observations.
You may add new variables to the data in memory by typing input followed by the names of the new variables. Stata will begin by prompting you for the first observation, then the second, and so on, until you type end or enter the last observation.

Example

In addition to the accident rate and speed limit, we now obtain data on the number of access points (on-ramps and off-ramps) per mile along each stretch of highway. We wish to enter the new data:

. input acc_pts
          acc_pts
  1. 4.6
  2. 4.4
  3. 2.2
  4. 4.7
When we typed input acc_pts, Stata responded by prompting us for the first observation. There are 4.6 access points per mile for the first highway, so we entered 4.6 and pressed Return. Stata then prompted us for the second observation, and so on. We entered each of the numbers. When we entered the final observation, Stata automatically stopped prompting us--we did not have to type end. Stata knows that there are four observations in memory, and since we are adding a new variable, it stops automatically.

We can, however, type end anytime we wish. If we do so, Stata fills the remaining observations on the new variables with missing. To illustrate this, we enter one more variable to our data and then list the result:

. input junk
          junk
  1. 1
  2. 2
  3. end

. list

        acc_rate   spdlimit   acc_pts   junk
  1.        4.58         55       4.6      1
  2.        2.86         60       4.4      2
  3.        1.61          .       2.2      .
  4.        3.02         60       4.7      .
You can input string variables using input, but you must remember to explicitly indicate that the variables are strings by specifying the type of the variable before the variable's name.

Example

String variables are indicated by the types str#, where # represents the storage length, or maximum length, of the variable. For instance, a str4 variable has maximum length 4, meaning it can contain the strings a, ab, abc, and abcd, but not abcde. Strings shorter than the maximum length can be stored in the variable, but strings longer than the maximum length cannot. You can create variables up to str80 in Stata.

Since a str80 variable can store strings shorter than 80 characters, you might wonder why you should not make all your string variables str80. You do not want to do this because Stata allocates space for strings based on their maximum length. It would waste the computer's memory.

Let's assume that we have no data in memory and wish to enter the following data:

. input str16 name age str6 sex
                  name        age        sex
  1. "Arthur Doyle" 22 male
  2. "Mary Hope" 37 "female"
  3. Guy Fawkes 48 male
'Fawkes' cannot be read as a number
  3. "Guy Fawkes" 48 male
  4. "Kriste Yeager" 25 female
  5. end

We first typed input str16 name age str6 sex, meaning that name is to be a str16 variable and sex a str6 variable. Since we did not specify anything about age, Stata made it a numeric variable.
Stata then prompted us to enter our data. On the first line, the name is Arthur Doyle, which we typed in double quotes. The double quotes are not really part of the string; they merely delimit the
beginning and end of the string. We followed that with Mr. Doyle's age, 22, and his sex, male. We did not bother to type double quotes around the word male because it contained no blanks or special characters. For the second observation, we did type the double quotes around female; it changed nothing. In the third observation we omitted the double quotes around the name, and Stata informed us that Fawkes could not be read as a number and reprompted us for the observation. When we omitted the double quotes, Stata interpreted Guy as the name, Fawkes as the age, and 48 as the sex. All of this would have been okay with Stata except for one problem: Fawkes looks nothing like a number, so Stata complained and gave us another chance. This time, we remembered to put the double quotes around the name.
Stata was satisfied, and we continued. We entered the fourth observation and then typed end. Here is our dataset:

. list

               name   age      sex
  1.   Arthur Doyle    22     male
  2.      Mary Hope    37   female
  3.     Guy Fawkes    48     male
  4.  Kriste Yeager    25   female

> Example

Just as we indicated which variables were strings by placing a storage type in front of the variable name, we can indicate the storage type of our numeric variables as well. Stata has five numeric storage types: byte, int, long, float, and double. When you do not specify the storage type, Stata assumes the variable is a float. You may want to review the definitions of numbers in [U] 15 Data.
There are two reasons you might want to explicitly specify the storage type: to induce additional precision or to conserve memory. The default type float has plenty of precision for most circumstances because Stata performs all calculations in double precision no matter how the data are stored. If you were storing 9-digit Social Security Numbers, however, you would want to coerce a different storage type or else the last digit would be rounded. long would be the best choice; double would work equally well, but it would waste memory.

Sometimes you do not need to store a variable as float. If the variable contains only integers between -32,768 and 32,766, it can be stored as an int and would take only half the space. If a variable contains only integers between -127 and 126, it can be stored as a byte, which would take only half again as much space. For instance, in the previous example we entered the data for age without explicitly specifying the storage type; hence, it was a float. It would have been better to store it as a byte. To do that, we would have typed

. input str16 name byte age str6 sex

          name       age    sex
  1. "Arthur Doyle" 22 male
  2. "Mary Hope" 37 "female"
  3. "Guy Fawkes" 48 male
  4. "Kriste Yeager" 25 female
  5. end
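The rounding that makes float a poor choice for 9-digit identifiers is easy to demonstrate outside of Stata. The following Python sketch (purely illustrative; Python's struct module stands in for Stata's 4-byte float and long types, and the variable names are ours) round-trips a 9-digit number through each representation:

```python
import struct

ssn = 123456789  # a 9-digit "Social Security Number"

# Round-tripped through a 4-byte IEEE float (like Stata's float type),
# the value is rounded because the mantissa holds only about 7 digits:
as_float = struct.unpack('<f', struct.pack('<f', float(ssn)))[0]
print(int(as_float))   # 123456792 -- the last digit is lost

# A 4-byte integer (like Stata's long type) holds it exactly:
as_long = struct.unpack('<i', struct.pack('<i', ssn))[0]
print(as_long)         # 123456789
```

At this magnitude, adjacent 4-byte floats are 8 apart, which is why the stored value lands on 123456792 rather than 123456789.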
Stata understands a number of shorthands. For instance,

. input int(a b) c

allows you to input three variables, a, b, and c, and makes both a and b ints and c a float. Remember,

. input int a b c

would make a an int but both b and c floats.

. input a long b double(c d) e

would make a a float, b a long, c and d doubles, and e a float.

Stata has a shorthand for variable names with numeric suffixes. Typing v1-v4 is equivalent to typing v1 v2 v3 v4. Thus,

. input int(v1-v4)

inputs four variables and stores them as ints.
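The numeric-suffix shorthand is simple to emulate. This Python sketch (an illustration only; the function name and regular expression are ours, not part of Stata) expands a v1-v4 style token the way input does:

```python
import re

def expand_varlist(token):
    """Expand a name range like 'v1-v4' into ['v1', 'v2', 'v3', 'v4']."""
    m = re.fullmatch(r'([a-zA-Z_]\w*?)(\d+)-\1(\d+)', token)
    if not m:
        return [token]          # not a numeric-suffix range; leave as-is
    stub, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
    return [f'{stub}{i}' for i in range(lo, hi + 1)]

print(expand_varlist('v1-v4'))  # ['v1', 'v2', 'v3', 'v4']
print(expand_varlist('price'))  # ['price']
```

The backreference \1 ensures that both endpoints share the same stub, so a token like price passes through unchanged.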
Technical Note

You may want to stop reading now. The rest of this section deals with using input with value labels. If you are not familiar with value labels, you should first read [U] 15.6.3 Value labels.

Remember that value labels map numbers into words and vice versa. There are two aspects to the process. First, we must define the association between numbers and words. We might tell Stata that 0 corresponds to male and 1 corresponds to female by typing label define sexlbl 0 "male" 1 "female". The correspondences are named, and in this case we have named the 0 = male, 1 = female correspondence sexlbl.

Next, we must associate this value label with a variable. If we had already entered the data and the variable were called sex, we would do this by typing label values sex sexlbl. We would have entered the data by typing 0's and 1's, but at least now when we list the data, we would see the words rather than the underlying numbers.

We can do better than that. After defining the value label, we can associate the value label with the variable at the time we input the data and tell Stata to use the value label to interpret what we type:

. label define sexlbl 0 "male" 1 "female"

. input str16 name byte(age sex:sexlbl), label

          name       age    sex
  1. "Arthur Doyle" 22 male
  2. "Mary Hope" 37 "female"
  3. "Guy Fawkes" 48 male
  4. "Kriste Yeager" 25 female
  5. end

After defining the value label, we typed our input command. Two things are noteworthy: We added the label option at the end of the command, and we typed sex:sexlbl for the name of the sex variable. The byte(...) around age and sex:sexlbl was not really necessary; it merely forced both age and sex to be stored as bytes.

Let's first decipher sex:sexlbl. sex is the name of the variable we want to input. The :sexlbl part tells Stata that the new variable is to be associated with the value label named sexlbl. The label option tells Stata that it is to look up any strings we type for labeled variables in their
corresponding value label and substitute the number when it stores the data. Thus, when we entered the first observation of our data, we typed male for Mr. Doyle's sex even though the corresponding variable is numeric. Rather than complaining that "male" could not be read as a number, Stata accepted what we typed, looked up the number corresponding to male, and stored that number in the data.
The fact that Stata has actually stored a number rather than the words male or female is almost irrelevant. Whenever we list the data or make a table, Stata will use the words male and female just as if those words were actually stored in the dataset rather than their numeric codings:

. list

               name   age      sex
  1.   Arthur Doyle    22     male
  2.      Mary Hope    37   female
  3.     Guy Fawkes    48     male
  4.  Kriste Yeager    25   female

. tabulate sex

        sex |      Freq.     Percent        Cum.
------------+-----------------------------------
       male |          2       50.00       50.00
     female |          2       50.00      100.00
------------+-----------------------------------
      Total |          4      100.00

It is only almost irrelevant since we can make use of the underlying numbers in statistical analyses. For instance, if we were to ask Stata to calculate the mean of sex by typing summarize sex, Stata would report 0.5. We would interpret that to mean that one-half of our sample is female.
Value labels are permanently associated with variables. Thus, once we associate a value label with a variable, we never have to do so again. If we wanted to add another observation to these data, we could type

. input, label

          name       age    sex
  5. "Mark Esman" 26 male
  6. end
Technical Note

The automatic option automates the definition of the value label. In the previous example, we informed Stata that male corresponds to 0 and female corresponds to 1 by typing label define sexlbl 0 "male" 1 "female". With automatic, it is not necessary to explicitly specify the mapping. Specifying the automatic option tells Stata to interpret what we type as follows:

First, see if the value is a number. If so, store that number and be done with it. If it is not a number, check the value label associated with the variable in an attempt to interpret it. If an interpretation exists, store the corresponding numeric code. If one does not exist, add a new numeric code corresponding to what was typed. Store that new number and update the value label so that the new correspondence is never forgotten.

We can use these features to reenter our age and sex data. Before reentering the data, we drop _all and label drop _all to prove that we have nothing up our sleeve:

. drop _all

. label drop _all
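The lookup rule that automatic follows can be sketched in a few lines of Python (an illustration of the logic only; the function name and data structure are ours, not Stata's):

```python
def encode(value, label_map):
    """Mimic input's automatic option: return the numeric value to store.

    label_map maps label text -> numeric code and is extended in place
    whenever a new label is encountered.
    """
    try:
        return float(value)               # 1. a number? store it as typed
    except ValueError:
        pass
    if value not in label_map:            # 2. unknown label? assign a new code
        label_map[value] = max(label_map.values(), default=0) + 1
    return label_map[value]               # 3. store the label's code

sexlbl = {}
codes = [encode(v, sexlbl) for v in ["male", "female", "male", "female"]]
print(codes)    # [1, 2, 1, 2]
print(sexlbl)   # {'male': 1, 'female': 2}
```

Once a correspondence is added to the map it persists, which is the sense in which the new correspondence "is never forgotten".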
. input str16 name byte(age sex:sexlbl), automatic

          name       age    sex
  1. "Arthur Doyle" 22 male
  2. "Mary Hope" 37 "female"
  3. "Guy Fawkes" 48 male
  4. "Kriste Yeager" 25 female
  5. end
We previously defined the value label sexlbl so that male corresponded to 0 and female corresponded to 1. The label that Stata automatically created is slightly different but just as good:

. label list sexlbl
sexlbl:
           1 male
           2 female
Also See

Complementary:   [R] save

Related:         [R] edit, [R] infile

Background:      [U] 24 Commands to input data
insheet -- Read ASCII (text) data created by a spreadsheet

Syntax

    insheet [varlist] using filename [, double [no]names comma tab clear]

If filename is specified without an extension, .raw is assumed.
Description

insheet reads into memory a disk dataset that is not in Stata format. insheet is intended for reading files created by a spreadsheet or database program. Regardless of the creator, insheet reads text (ASCII) files in which there is one observation per line and the values are separated by tabs or commas. In addition, the first line of the file can contain the variable names or not. The best thing about insheet is that if you type

. insheet using filename

insheet will read your data; that's all there is to it.

Stata has other commands for reading data. If you are not sure that insheet is what you are looking for, see [R] infile and [U] 24 Commands to input data.

If you want to save your data in "spreadsheet-style" format, see [R] outsheet.

Options
double forces Stata to store variables as doubles rather than floats; see [U] 15.2.2 Numeric storage types.

[no]names informs Stata whether variable names are included on the first line of the file. Specifying this option will speed insheet's processing--assuming you are right--but that is all. insheet can determine for itself whether the file includes variable names.

comma tells Stata that the values are comma-separated. Specifying this option will speed insheet's processing--assuming you are right--but that is all. insheet can determine for itself whether the separation character is a comma or a tab.

tab tells Stata that the values are tab-separated. Specifying this option will speed insheet's processing--assuming you are right--but that is all. insheet can determine for itself whether the separation character is a tab or a comma.

clear specifies that it is okay for the new data to replace what is currently in memory. To ensure that you do not lose something important, insheet will refuse to read new data if data are already in memory. clear is one way you can tell insheet that it is okay. The other is to drop the data yourself by typing drop _all before reading new data.
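Determining the separation character automatically amounts to a simple heuristic. A sketch of one such heuristic in Python (our own illustration; Stata's actual internal rule is not documented here) counts candidate delimiters on the first line:

```python
def sniff_delimiter(first_line):
    """Guess whether a line of data is tab- or comma-separated."""
    tabs = first_line.count('\t')
    commas = first_line.count(',')
    if tabs == 0 and commas == 0:
        return None              # neither delimiter: not insheet-style data
    return '\t' if tabs >= commas else ','

print(repr(sniff_delimiter('make\tprice\tmpg')))      # '\t'
print(repr(sniff_delimiter('make,price,mpg')))        # ','
print(repr(sniff_delimiter('"AMC Concord" 4099 22'))) # None
```

The last case, space-separated values with neither tabs nor commas, corresponds to the auto4.raw example later in this entry, which insheet cannot parse into separate variables.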
Remarks

There is nothing to using insheet. You type insheet using filename and insheet will read your data. That is, it will read your data if

1. it can find the file, and

2. the file meets insheet's expectations as to the format in which it is written.

Assuring 1 is easy enough; just realize that if you type insheet using myfile, Stata interprets this as an instruction to read myfile.raw. If your file is called myfile.txt, type insheet using myfile.txt.

As for the file's format, most spreadsheets and some database programs write data in the form insheet expects. It is easy enough to look--as we will show you--and it is even easier simply to try it and see what happens. If typing

. insheet using filename

does not produce the desired result, you will have to try one of Stata's other infile commands; see [R] infile.
> Example

You have a raw data file on automobiles called auto.raw. This file was saved by a spreadsheet and can be read by typing

. insheet using auto
(5 vars, 10 obs)
That done, we can now look at what we just loaded:

. describe

Contains data
  obs:            10
 vars:             5
 size:           310 (99.8% of memory free)

              storage  display    value
variable name   type   format     label      variable label
-------------------------------------------------------------
make            str13  %13s
price           int    %8.0g
mpg             byte   %8.0g
rep78           byte   %8.0g
foreign         str10  %10s
-------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved
. list

               make   price   mpg   rep78    foreign
  1.    AMC Concord    4099    22       3   Domestic
  2.      AMC Pacer    4749    17       3   Domestic
  3.     AMC Spirit    3799    22       .   Domestic
  4.  Buick Century    4816    20       3   Domestic
  5.  Buick Electra    7827    15       4   Domestic
  6.  Buick LeSabre    5788    18       3   Domestic
  7.     Buick Opel    4453    26       .   Domestic
  8.    Buick Regal    5189    20       3   Domestic
  9.  Buick Riviera   10372    16       3   Domestic
 10.  Buick Skylark    4082    19       3   Domestic

Note that these data contain a combination of string and numeric variables. insheet figured all that out by itself.
Technical Note

Now let's back up and look at the auto.raw file. Stata's type command will display files to the screen:

. type auto.raw
make    price   mpg     rep78   foreign
AMC Concord     4099    22      3       Domestic
AMC Pacer       4749    17      3       Domestic
AMC Spirit      3799    22      .       Domestic
Buick Century   4816    20      3       Domestic
Buick Electra   7827    15      4       Domestic
Buick LeSabre   5788    18      3       Domestic
Buick Opel      4453    26      .       Domestic
Buick Regal     5189    20      3       Domestic
Buick Riviera   10372   16      3       Domestic
Buick Skylark   4082    19      3       Domestic

These data have tab characters between the values and hence are indistinguishable from data with blanks between the values. Tab characters are difficult to see since they are invisible; type's showtabs option makes the tabs visible:

. type auto.raw, showtabs
make<T>price<T>mpg<T>rep78<T>foreign
AMC Concord<T>4099<T>22<T>3<T>Domestic
AMC Pacer<T>4749<T>17<T>3<T>Domestic
AMC Spirit<T>3799<T>22<T>.<T>Domestic
Buick Century<T>4816<T>20<T>3<T>Domestic
Buick Electra<T>7827<T>15<T>4<T>Domestic
Buick LeSabre<T>5788<T>18<T>3<T>Domestic
Buick Opel<T>4453<T>26<T>.<T>Domestic
Buick Regal<T>5189<T>20<T>3<T>Domestic
Buick Riviera<T>10372<T>16<T>3<T>Domestic
Buick Skylark<T>4082<T>19<T>3<T>Domestic
This is an example of the kind of data insheet is willing to read. The first line contains the variable names, although that is not necessary. What is necessary is that the data values have tab characters between them.

insheet would be just as happy if the data values were separated by commas. Here is another variation on auto.raw that insheet can read:

. type auto2.raw
make,price,mpg,rep78,foreign
AMC Concord,4099,22,3,Domestic
AMC Pacer,4749,17,3,Domestic
AMC Spirit,3799,22,.,Domestic
Buick Century,4816,20,3,Domestic
Buick Electra,7827,15,4,Domestic
Buick LeSabre,5788,18,3,Domestic
Buick Opel,4453,26,.,Domestic
Buick Regal,5189,20,3,Domestic
Buick Riviera,10372,16,3,Domestic
Buick Skylark,4082,19,3,Domestic

It is easier for us human beings to see the commas rather than the tabs, but computers do not care one way or the other.
> Example

The file does not have to contain variable names. Here is another variation on auto.raw without the first line, this time with commas rather than tabs separating the values:

. type auto3.raw
AMC Concord,4099,22,3,Domestic
AMC Pacer,4749,17,3,Domestic
 (output omitted)
Buick Skylark,4082,19,3,Domestic

Here is what happens when we read it:

. insheet using auto3
you must start with an empty dataset
r(18);

Oops; we still have the data from the last example in memory.

. insheet using auto3, clear
(5 vars, 10 obs)

. describe

Contains data
  obs:            10
 vars:             5
 size:           310 (99.8% of memory free)

              storage  display    value
variable name   type   format     label      variable label
-------------------------------------------------------------
v1              str13  %13s
v2              int    %8.0g
v3              byte   %8.0g
v4              byte   %8.0g
v5              str10  %10s
-------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved

. list

                 v1     v2   v3   v4         v5
  1.    AMC Concord   4099   22    3   Domestic
  2.      AMC Pacer   4749   17    3   Domestic
 (output omitted)
 10.  Buick Skylark   4082   19    3   Domestic
The only difference is that rather than the variables being nicely named make, price, mpg, rep78, and foreign, they are named v1, v2, ..., v5. We could now give our variables nicer names:

. rename v1 make

. rename v2 price
Another alternative is to specify the variable names when we read the data:

. insheet make price mpg rep78 foreign using auto3, clear
(5 vars, 10 obs)

. list

               make   price   mpg   rep78    foreign
  1.    AMC Concord    4099    22       3   Domestic
  2.      AMC Pacer    4749    17       3   Domestic
 (output omitted)
 10.  Buick Skylark    4082    19       3   Domestic

If we use this approach, we must not specify too few variables

. insheet make price mpg rep78 using auto3, clear
too few variables specified
error in line 11 of file
r(102);

or too many.

. insheet make price mpg rep78 foreign weight using auto3, clear
too many variables specified
error in line 11 of file
r(103);

That is why we recommend

. insheet using filename

It is not difficult to rename your variables afterwards should that be necessary.
> Example

About the only other thing that can go wrong is that the data are not appropriate for reading by insheet. Here is yet another version of the automobile data:

. type auto4.raw, showtabs
"AMC Concord"   4099  22  3  Domestic
"AMC Pacer"     4749  17  3  Domestic
"AMC Spirit"    3799  22  .  Domestic
"Buick Century" 4816  20  3  Domestic
"Buick Electra" 7827  15  4  Domestic
"Buick LeSabre" 5788  18  3  Domestic
"Buick Opel"    4453  26  .  Domestic
"Buick Regal"   5189  20  3  Domestic
"Buick Riviera" 10372 16  3  Domestic
"Buick Skylark" 4082  19  3  Domestic

Note that we specified type's showtabs option and no tabs are shown. These data are not tab-delimited or comma-delimited and are not the kind of data insheet is designed to read. Let's try insheet anyway:
. insheet using auto4, clear
(1 var, 10 obs)

. describe

Contains data
  obs:            10
 vars:             1
 size:           430 (99.8% of memory free)

              storage  display    value
variable name   type   format     label      variable label
-------------------------------------------------------------
v1              str39  %39s
-------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved

. list

                                           v1
  1.     AMC Concord 4099 22 3 Domestic
  2.     AMC Pacer 4749 17 3 Domestic
 (output omitted)
 10.     Buick Skylark 4082 19 3 Domestic
When insheet tries to read data that have no tabs or commas, it is fooled into thinking the data contain just one variable. If you had these data, you would have to read the data with one of Stata's other commands, such as infile (free format).

ivreg -- Instrumental variables and two-stage least squares regression

      Source |       SS       df       MS               Number of obs =      50
-------------+------------------------------            F(  3,    46) =   47.05
       Model |  46189.152     3  15396.384              Prob > F      =  0.0000
    Residual |  15053.968    46  327.260173             R-squared     =  0.7542
-------------+------------------------------            Adj R-squared =  0.7382
       Total |   61243.12    49  1249.85959             Root MSE      =   18.09

------------------------------------------------------------------------------
        rent |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     hsngval |   .0006509   .0002947     2.21   0.032     .0000577    .0012442
    pcturban |   .0815159   .2438355     0.33   0.740    -.4092994    .5723311
    hsng_hat |   .0015889   .0003984     3.99   0.000      .000787    .0023908
       _cons |   120.7065   12.42856     9.71   0.000     95.68912    145.7239
------------------------------------------------------------------------------
Since we have only a single endogenous right-hand-side variable, our test statistic is just the t statistic for the hsng_hat variable. If there were more than one endogenous right-hand-side variable, we would need to perform a joint test of all their predicted-value regressors being zero. For this simple case, the test statement would be

. test hsng_hat

 ( 1)  hsng_hat = 0.0

       F(  1,    46) =   15.91
            Prob > F =    0.0002

While the p-value from the augmented regression test is somewhat lower than the p-value from the Hausman test, both tests clearly show that OLS is not indicated for the rent equation (under the assumption that the instrumental variables estimator is a consistent estimator for our rent model).
> Example

Robust standard errors are available with ivreg:

. ivreg rent pcturban (hsngval = faminc reg2-reg4), robust

IV (2SLS) regression with robust standard errors       Number of obs =      50
                                                       F(  2,    47) =   21.74
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.5989
                                                       Root MSE      =  22.882

------------------------------------------------------------------------------
             |               Robust
        rent |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     hsngval |   .0022398   .0006931     3.23   0.002     .0008455    .0036342
    pcturban |    .081516   .4585635     0.18   0.860    -.8409949    1.004027
       _cons |   120.7065    15.7348     7.67   0.000     89.05217    152.3609
------------------------------------------------------------------------------
Instrumented:  hsngval
Instruments:   pcturban faminc reg2 reg3 reg4
------------------------------------------------------------------------------

The robust standard error for the coefficient on housing value is double what was previously estimated.
Technical Note

You may perform weighted two-stage least squares or instrumental variables estimation by specifying the [weight] qualifier with ivreg. You may perform weighted or unweighted estimation, suppressing the constant, by specifying the noconstant option. In this case, the constant is excluded from both the structural equation and the instrument list.

Acknowledgments

The robust estimate of variance with instrumental variables was first implemented in Stata by Mead Over, Dean Jolliffe, and Andrew Foster (1996).
Saved Results

ivreg saves in e():

Scalars
    e(N)          number of observations         e(r2)         R-squared
    e(mss)        model sum of squares           e(r2_a)       adjusted R-squared
    e(df_m)       model degrees of freedom       e(F)          F statistic
    e(rss)        residual sum of squares        e(rmse)       root mean square error
    e(df_r)       residual degrees of freedom    e(N_clust)    number of clusters

Macros
    e(cmd)        ivreg                          e(clustvar)   name of cluster variable
    e(version)    version number of ivreg        e(vcetype)    covariance estimation method
    e(depvar)     name of dependent variable     e(instd)      instrumented variable
    e(model)      iv                             e(insts)      instruments
    e(wtype)      weight type                    e(predict)    program used to implement predict
    e(wexp)       weight expression

Matrices
    e(b)          coefficient vector             e(V)          variance-covariance matrix of
                                                               the estimators

Functions
    e(sample)     marks estimation sample
Methods and Formulas

ivreg is implemented as an ado-file.
Variables printed in lowercase and not boldfaced (e.g., x) are scalars. Variables printed in lowercase and boldfaced (e.g., x) are column vectors. Variables printed in uppercase and boldfaced (e.g., X) are matrices.

Let v be a column vector of weights specified by the user. If no weights are specified, then v = 1. Let w be a column vector of normalized weights. If no weights are specified or if the user specified fweights or iweights, w = v. Otherwise, w = {v/(1'v)}(1'1).
The number of observations, n, is defined as 1'w. In the case of iweights, this is truncated to an integer. The sum of the weights is 1'v. Define c = 1 if there is a constant in the regression and zero otherwise. Define k as the number of right-hand-side (rhs) variables (including the constant).

Let X denote the matrix of observations on the rhs variables, y the vector of observations on the left-hand-side (lhs) variable, and Z the matrix of observations on the instruments. In the following formulas, if the user specifies weights, then X'X, X'y, y'y, Z'Z, Z'X, and Z'y are replaced by X'DX, X'Dy, y'Dy, Z'DZ, Z'DX, and Z'Dy, respectively, where D is a diagonal matrix whose diagonal elements are the elements of w. We suppress the D below to simplify the notation.

Define A as X'Z(Z'Z)^{-1}(X'Z)' and a as X'Z(Z'Z)^{-1}Z'y. The coefficient vector b is defined as A^{-1}a. Although not shown in the notation, unless hascons is specified, A and a are accumulated in deviation form and the constant is calculated separately. This comment applies to all statistics listed below.

The total sum of squares, TSS, equals y'y if there is no intercept and y'y - {(1'y)^2/n} otherwise. The degrees of freedom are n - c.
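The definitions of A, a, and b can be checked numerically. The following NumPy sketch (our illustration, not StataCorp code; the simulated data and variable names are ours, following the notation above) verifies that b = A^{-1}a reproduces the familiar two-stage computation of regressing y on the fitted values of X:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Instruments Z (with a constant) and an rhs matrix X correlated with Z
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
X = np.column_stack([np.ones(n), Z[:, 1] + 0.5 * rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

# b = A^{-1} a  with  A = X'Z (Z'Z)^{-1} Z'X  and  a = X'Z (Z'Z)^{-1} Z'y
ZZinv = np.linalg.inv(Z.T @ Z)
A = X.T @ Z @ ZZinv @ Z.T @ X
a = X.T @ Z @ ZZinv @ Z.T @ y
b = np.linalg.solve(A, a)

# Two-stage version: regress X on Z, then y on the fitted values Xhat
Xhat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
b_2sls = np.linalg.lstsq(Xhat, y, rcond=None)[0]

print(np.allclose(b, b_2sls))   # True
```

Both expressions equal (X'P_Z X)^{-1} X'P_Z y, where P_Z is the projection onto the column space of Z, which is why the two computations agree to floating-point precision.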
The error sum of squares, ESS, is defined as y'y - 2b'X'y + b'X'Xb. The degrees of freedom are n - k.

The model sum of squares, MSS, equals TSS - ESS. The degrees of freedom are k - c.

The mean square error, s^2, is defined as ESS/(n - k). The root mean square error is s, its square root.

If c = 1, then F is defined as

        F = {(b - c)'A(b - c)} / {(k - 1)s^2}

where c is a vector of k - 1 zeros and kth element 1'y/n. Otherwise, F is defined as missing. (In this case, you may use the test command to construct any F test you wish.)

The R-squared, R^2, is defined as R^2 = 1 - ESS/TSS.

The adjusted R-squared, R^2_a, is 1 - (1 - R^2)(n - c)/(n - k).

If robust is not specified, the conventional estimate of variance is s^2 A^{-1}.

For a discussion of robust variance estimates in the context of regression and regression with instrumental variables, see [R] regress, Methods and Formulas. See this same section for a discussion of the formulas for predict after ivreg.
References

Baltagi, B. H. 1998. Econometrics. New York: Springer-Verlag.

Basmann, R. L. 1957. A generalized classical method of linear estimation of coefficients in a structural equation. Econometrica 25: 77-83.

Davidson, R. and J. G. MacKinnon. 1993. Estimation and Inference in Econometrics. New York: Oxford University Press.

Johnston, J. and J. DiNardo. 1997. Econometric Methods. 4th ed. New York: McGraw-Hill.

Kmenta, J. 1997. Elements of Econometrics. 2d ed. Ann Arbor: University of Michigan Press.

Koopmans, T. C. and W. C. Hood. 1953. Studies in Econometric Method. New York: John Wiley & Sons.

Koopmans, T. C. and J. Marschak. 1950. Statistical Inference in Dynamic Economic Models. New York: John Wiley & Sons.

Over, M., D. Jolliffe, and A. Foster. 1996. sg46: Huber correction for two-stage least squares estimates. Stata Technical Bulletin 29: 24-25. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 140-142.

Theil, H. 1953. Repeated Least Squares Applied to Complete Equation Systems. Mimeograph from the Central Planning Bureau, Hague.

White, H. 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48: 817-838.

Wooldridge, J. M. 2000. Introductory Econometrics: A Modern Approach. Cincinnati, OH: South-Western College Publishing.
Also See

Complementary:   [R] adjust, [R] lincom, [R] linktest, [R] mfx, [R] predict, [R] test,
                 [R] testnl, [R] vce, [R] xi

Related:         [R] anova, [R] areg, [R] cnsreg, [R] mvreg, [R] qreg, [R] reg3,
                 [R] regress, [R] rreg, [R] sureg, [R] svy estimators, [R] xtreg,
                 [R] xtregar, [P] _robust

Background:      [U] 16.5 Accessing coefficients and standard errors,
                 [U] 23 Estimation and post-estimation commands,
                 [U] 23.11 Obtaining robust variance estimates,
                 [U] 23.13 Weighted estimation
Title

jknife -- Jackknife estimation

Syntax

    jknife "cmd" exp_list [if exp] [in range] [, {rclass | eclass | n(exp)}
        level(#) keep]

exp_list contains    newvarname = (exp)
                     (exp)
                     eexp

eexp is              specname
                     [eqno]specname

specname is          _b
                     _b[]
                     _se
                     _se[]

eqno is              ##
                     name

Distinguish between [ ], which are to be typed, and [ ], which indicate optional arguments.
Description

jknife performs jackknife estimation.

cmd defines the statistical command to be executed. cmd must be bound in double quotes. Compound double quotes (`" and "') are needed if the command itself contains double quotes.

exp_list specifies the statistics to be retrieved after the execution of cmd and on which the jackknife statistics will be calculated.

Options

rclass, eclass, and n(exp) specify where cmd saves the number of observations on which it based the calculated results. You are strongly advised to specify one of these options.

rclass specifies that cmd saves the number of observations in r(N).

eclass specifies that cmd saves the number of observations in e(N).

n(exp) allows you to specify any other expression that evaluates to the number of observations used. Specifying n(r(N)) is equivalent to specifying option rclass. Specifying n(e(N)) is equivalent to specifying option eclass. If cmd saved the number of observations in r(N1), specify n(r(N1)).
If you specify none of these options, jknife assumes that all observations in the dataset contribute to the calculated result. If that assumption is incorrect, the reported standard errors will be incorrect. For instance, say you specify

. jknife "regress y x1 x2 x3" coef=_b[x2]

and pretend that observation 42 in the dataset has x3 equal to missing. The 42nd observation plays no role in obtaining the estimates, but jknife has no way of knowing that and will use the wrong N. If, on the other hand, you specify

. jknife "regress y x1 x2 x3" coef=_b[x2], e

jknife will correctly notice that observation 42 plays no role. Option e is specified because regress is an estimation command and saves the number of observations used in e(N). When jknife runs the regression omitting the 42nd observation, jknife will observe that e(N) has the same value as when jknife previously ran the regression using all the observations. Thus, jknife will know that regress did not use the observation.

In this example, it does not matter whether you specify option eclass or n(e(N)), but specifying eclass is easier.

level(#) specifies the confidence level, in percent, for the confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

keep specifies that new variables are to be added to the dataset containing the pseudo-values of the requested statistics. For instance, if you typed

. jknife "regress y x1 x2 x3" coef=_b[x2], e keep

a new variable coef would be added to the dataset containing the pseudo-values for _b[x2]. Let b be defined as the value of _b[x2] when all observations are used to estimate the model, and let b(j) be the value when the jth observation is omitted. The pseudo-values are defined as

        pseudovalue_j = N*{b - b(j)} + b(j)

where N is the number of observations used to produce b.
Remarks

While the jackknife--developed in the late 1940s and early 1950s--is of largely historical interest today, it is still useful in searching for overly influential observations. This feature is often forgotten. In any case, the jackknife is

1. an alternative, first-order unbiased estimator for a statistic;

2. a data-dependent way to calculate the standard error of the statistic and to obtain significance levels and confidence intervals; and

3. a way of producing measures called pseudo-values for each observation, reflecting the observation's influence on the overall statistic.

The idea behind the simplest form of the jackknife--the one implemented here--is to calculate the statistic in question N times, each time omitting just one of the dataset's observations. Write S for the statistic calculated on the overall sample and S(j) for the statistic calculated when the jth observation is removed. If the statistic in question were the mean, then

        S = {(N - 1)S(j) + sj} / N

where sj is the value of the data in the jth observation. Solving for sj, we obtain

        sj = N*S - (N - 1)*S(j)

These are the pseudo-values the jackknife calculates, even though the statistic in question is not the mean. The jackknife estimate is the average of the sj's, and its estimate of the standard error of the statistic is the corresponding standard error of the mean (Tukey 1958).

The jackknife estimate of variance has been largely replaced by the bootstrap (see [R] bstrap), which is widely viewed as more efficient and robust. The use of jackknife pseudo-values to detect outliers is too often forgotten and is something the bootstrap is unable to provide. See Mosteller and Tukey (1977, 133-163) and Mooney and Duval (1993, 22-27) for more information.
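The pseudo-value arithmetic is easy to reproduce outside of Stata. This Python sketch (illustrative only; the function name is ours) jackknifes the standard deviation of the eleven Mosteller-Tukey values used in the example below, reproducing the estimate of about 1.49 and standard error of about 0.62:

```python
from statistics import stdev, mean

def jackknife(data, stat):
    """Return (estimate, std. error, pseudo-values) for statistic stat."""
    n = len(data)
    s_all = stat(data)
    pseudo = []
    for j in range(n):
        s_j = stat(data[:j] + data[j + 1:])       # statistic with obs j omitted
        pseudo.append(n * s_all - (n - 1) * s_j)  # sj = N*S - (N-1)*S(j)
    est = mean(pseudo)                            # jackknife estimate
    se = stdev(pseudo) / n ** 0.5                 # standard error of the mean
    return est, se, pseudo

x = [0.1, 0.1, 0.1, 0.4, 0.5, 1.0, 1.1, 1.3, 1.9, 1.9, 4.7]
est, se, pseudo = jackknife(x, stdev)
print(round(est, 4), round(se, 4))   # 1.4894 0.6244
print(round(pseudo[-1], 4))          # 7.7039 -- the influential last value
```

The outsized pseudo-value for the last observation is exactly the outlier-detection feature discussed above.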
Jackknifed standard deviation

> Example

Mosteller and Tukey (1977, 139-140) request a 95% confidence interval for the standard deviation of the eleven values

    0.1, 0.1, 0.1, 0.4, 0.5, 1.0, 1.1, 1.3, 1.9, 1.9, 4.7

Stata's summarize command calculates the mean and standard deviation and saves them as r(mean) and r(sd). To obtain the jackknifed standard deviation of the eleven values and to save the pseudo-values as a new variable sd, type

    . input x
              x
      1. 0.1
      2. 0.1
      3. 0.1
      4. 0.4
      5. 0.5
      6. 1.0
      7. 1.1
      8. 1.3
      9. 1.9
     10. 1.9
     11. 4.7
     12. end

    . jknife "summarize x" sd=r(sd), r keep
    command:      summarize x
    statistic:    sd=r(sd)
    n():          r(N)
    Variable |      Obs    Statistic    Std. Err.   [95% Conf. Interval]
    ---------+----------------------------------------------------------
    sd       |
     overall |       11     1.343469
      jknife |              1.489364     .6244049    .0981028    2.880625

Interpreting the output, the standard deviation reported by summarize is 1.34. The jackknife estimate is 1.49 with standard error 0.62. The 95% confidence interval for the standard deviation is .10 to 2.88. By specifying the keep option, jknife creates a new variable in our dataset, sd, for the pseudo-values.
    . list

             x          sd
      1.    .1   1.1399778
      2.    .1   1.1399778
      3.    .1   1.1399778
      4.    .4   .88931508
      5.    .5    .8242672
      6.     1   .63248882
      7.   1.1   .62031914
      8.   1.3   .62188884
      9.   1.9    .8354195
     10.   1.9    .8354195
     11.   4.7   7.7039493
The jackknife estimate is the average of sd, so sd contains the individual "values" of our statistic. We can see that the last observation is substantially larger than the others. The last observation is certainly an outlier, but whether that reflects the considerable information it contains or indicates that it should be excluded from analysis is a decision that must be based on the context of the problem. In this case, Mosteller and Tukey created the dataset by sampling from an exponential distribution, so the observation is quite informative.
> Example

Let us repeat the above example using the automobile dataset, obtaining the standard error of the standard deviation of mpg.

    . use auto, clear
    (1978 Automobile Data)

    . jknife "summarize mpg" sd=r(sd), r keep
    command:      summarize mpg
    statistic:    sd=r(sd)
    n():          r(N)

    Variable |      Obs    Statistic    Std. Err.   [95% Conf. Interval]
    ---------+----------------------------------------------------------
    sd       |
     overall |       74     5.785503
      jknife |              5.817373      .607251    4.607124    7.027623
Looking at sd more carefully,

    . summarize sd, detail

                          r(sd) pseudovalues
    -------------------------------------------------------------
          Percentiles      Smallest
     1%     2.870485      2.870485
     5%     2.870485      2.870485
    10%     2.906249      2.870485       Obs                  74
    25%     3.328494      2.870485       Sum of Wgt.          74

    50%     3.948327                     Mean           5.817373
                          Largest        Std. Dev.       5.22377
    75%     6.844408      17.34316
    90%     9.597005      19.76169       Variance       27.28778
    95%     17.34316      19.76169       Skewness        4.07202
    99%     38.60905      38.60905       Kurtosis       23.37822
    . list make mpg sd if sd > 30

               make   mpg         sd
    71.   VW Diesel    41   38.60905

In this case, the VW Diesel is the only diesel car in our dataset.
Collecting multiple statistics

> Example

jknife is not limited to collecting just one statistic. For instance, you can use summarize, detail and then obtain the jackknife estimate of the standard deviation and skewness. summarize, detail saves the standard deviation in r(sd) and the skewness in r(skewness), so you might type

    . use auto, clear
    (1978 Automobile Data)

    . jknife "summarize mpg, detail" sd=r(sd) skew=r(skewness), r
    command:      summarize mpg, detail
    statistic:    sd=r(sd)
                  skew=r(skewness)
    n():          r(N)

    Variable |      Obs    Statistic    Std. Err.   [95% Conf. Interval]
    ---------+----------------------------------------------------------
    sd       |
     overall |       74     5.785503
      jknife |              5.817373      .607251    4.607124    7.027623
    ---------+----------------------------------------------------------
    skew     |
     overall |       74     .9487176
      jknife |              1.023596     .3367242    .3525056    1.694686
Collecting coefficients and standard errors

> Example

jknife can also collect coefficients and standard errors from estimation commands. For instance, using auto.dta, we wish to obtain the jackknife estimate of the coefficients and standard errors from a regression in which we model the mileage of a car by its weight and trunk space. To do this, we could refer to the coefficients and standard errors as _b[weight], _b[trunk], _se[weight], and _se[trunk] in the exp_list, or we can simplify the typing by using the extended expressions _b and _se.

    . use auto, clear
    (1978 Automobile Data)

    . jknife "reg mpg weight trunk" _b _se, e
    command:      reg mpg weight trunk
    statistic:    b_weight=_b[weight]
                  se_weight=_se[weight]
                  b_trunk=_b[trunk]
                  se_trunk=_se[trunk]
                  b_cons=_b[_cons]
                  se_cons=_se[_cons]
    n():          e(N)
jknife -- Jackknife estimation Variable
1
Obs
Statistic
74
-.0056527
Std. Err.
[95X Conf. Interval]
b_weight overall jknife
-.0056325
.O010216
-.0076684
-.0035965
se_weight overall
74
.0007016
jkaife
.0003496
.000111
.0001284
.0005708
b_trunk overall
74
-.096229 -.0973012
jknife
.1486236
-.3935075
.1989052
b_cons overall
74
39.68913
jknife
39.65612
1.873323
35.92259
43.38965
.0218525
.0196476
.1067514
.2690423
.2907921
1.363193
se_trunk overall
74
.1274771
jknife
.0631995
se_cons overall
1.65207
74
jknife
.8269927
q
Saved Results

jknife saves in r():

    r(N#)       number of observations used in calculating statistic #
    r(stat#)    value of statistic # using all observations
    r(mean#)    jackknife estimate of statistic # (mean of pseudo-values)
    r(se#)      standard error of the mean of statistic #

Methods and Formulas

jknife is implemented as an ado-file.
References

Gould, W. 1995. sg34: Jackknife estimation. Stata Technical Bulletin 24: 25-29. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 165-170.

Mooney, C. Z. and R. D. Duval. 1993. Bootstrapping: A Nonparametric Approach to Statistical Inference. Newbury Park, CA: Sage Publications.

Mosteller, F. and J. W. Tukey. 1977. Data Analysis and Regression. Reading, MA: Addison-Wesley Publishing Company.

Tukey, J. W. 1958. Bias and confidence in not-quite large samples. Abstract in Annals of Mathematical Statistics 29: 614.

Also See

Related:      [R] bstrap, [R] statsby

Background:   [U] 16.5 Accessing coefficients and standard errors,
              [U] 23 Estimation and post-estimation commands
Title

joinby -- Form all pairwise combinations within groups

Syntax

    joinby [varlist] using filename [, unmatched({none | both | master | using})
        _merge(varname) nolabel update replace ]

Description

joinby joins, within groups formed by varlist, observations of the dataset in memory with filename, a Stata-format dataset. By join we mean "form all pairwise combinations". filename is required to be sorted by varlist. If filename is specified without an extension, '.dta' is assumed.

If varlist is not specified, joinby takes as varlist the set of variables common to the dataset in memory and in filename.

Observations unique to one or the other dataset are ignored unless unmatched() specifies differently. Whether you load one dataset and join the other or vice versa makes no difference in terms of the number of resulting observations. If there are common variables between the two datasets, however, the combined dataset will contain the values from the master data for those observations. This behavior can be modified with the update and replace options.
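As a sketch of what "form all pairwise combinations within groups" means, here is the operation in plain Python (not Stata; the function and dataset names are ours). Every master observation is paired with every using observation that shares a key value; unmatched keys are dropped, mirroring the default unmatched(none) behavior, and master values take precedence for shared variable names.

```python
def joinby(master, using, key):
    """All pairwise combinations, within groups of `key`, of master and
    using rows. Unmatched rows are dropped (like unmatched(none)); for
    shared variable names the master value takes precedence."""
    groups = {}
    for row in using:
        groups.setdefault(row[key], []).append(row)
    # {**u, **m}: master (m) overwrites shared names from using (u)
    return [{**u, **m} for m in master for u in groups.get(m[key], [])]

# toy data mirroring parent.dta (master) and child.dta (using)
parents = [{"family_id": 1025, "parent_id": 11, "x1": 20},
           {"family_id": 1025, "parent_id": 12, "x1": 27},
           {"family_id": 1026, "parent_id": 13, "x1": 30},
           {"family_id": 1026, "parent_id": 14, "x1": 26},
           {"family_id": 1030, "parent_id": 15, "x1": 32}]
children = [{"family_id": 1025, "child_id": 1, "x1": 12},
            {"family_id": 1025, "child_id": 3, "x1": 11},
            {"family_id": 1025, "child_id": 4, "x1": 10},
            {"family_id": 1026, "child_id": 2, "x1": 13},
            {"family_id": 1027, "child_id": 5, "x1": 15}]

result = joinby(parents, children, "family_id")
print(len(result))   # 2 parents x 3 children + 2 parents x 1 child = 8
```

Families 1027 and 1030, which appear in only one dataset, contribute nothing; every combined row carries the parents' value of the shared variable x1.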
Options

unmatched({none | both | master | using}) specifies whether observations unique to one of the datasets are to be kept, with the variables from the other dataset set to missing. Valid values are

    none      all unmatched observations are ignored (default)
    both      unmatched observations from the master and using data are included
    master    unmatched observations from the master data are included
    using     unmatched observations from the using data are included

nolabel prevents Stata from copying the value label definitions from the disk dataset into the dataset in memory. Even if you do not specify this option, in no event do label definitions from the disk dataset replace label definitions already in memory.

update varies the action that joinby takes when an observation is matched. By default, the master dataset is held inviolate--values from the master data are retained when the same variables are found in both datasets. If update is specified, however, the values from the using dataset are retained in cases where the master dataset contains missing.

replace, allowed with update only, specifies that even when the master dataset contains nonmissing values, they are to be replaced with corresponding values from the using dataset when the corresponding values are not equal. A nonmissing value, however, will never be replaced with a missing value.

_merge(varname) specifies the name of the variable that will mark the source of the resulting observation. The default is _merge(_merge). To preserve compatibility with earlier versions of joinby, _merge is only generated if unmatched is specified.
Remarks

The following, admittedly artificial, example illustrates joinby.
> Example

You have two datasets: child.dta and parent.dta. Both contain a family_id variable, which identifies the people who belong to the same family.

    . use child
    (Data on Children)

    . describe

    Contains data from child.dta
      obs:             5                          Data on Children
     vars:             4                          13 Jul 2000 15:29
     size:            50 (99.9% of memory free)

                  storage  display     value
    variable name   type   format      label      variable label
    -------------------------------------------------------------
    family_id       int    %8.0g                  Family Id Number
    child_id        byte   %8.0g                  Child Id Number
    x1              byte   %8.0g
    x2              int    %8.0g
    -------------------------------------------------------------
    Sorted by:  family_id

    . list

         family~d   child_id   x1    x2
      1.     1025          3   11   320
      2.     1025          1   12   300
      3.     1025          4   10   275
      4.     1026          2   13   280
      5.     1027          5   15   210
    . use parent, clear
    (Data on Parents)

    . describe

    Contains data from parent.dta
      obs:             6                          Data on Parents
     vars:             4                          13 Jul 2000 15:31
     size:           108 (99.9% of memory free)

                  storage  display     value
    variable name   type   format      label      variable label
    -------------------------------------------------------------
    family_id       int    %8.0g                  Family Id Number
    parent_id       float  %9.0g                  Parent Id Number
    x1              float  %9.0g
    x3              float  %9.0g
    -------------------------------------------------------------
    Sorted by:

    . list

         family~d   parent_id   x1    x3
      1.     1030          10   39   600
      2.     1025          11   20   643
      3.     1025          12   27   721
      4.     1026          13   30   760
      5.     1026          14   26   668
      6.     1030          15   32   684
variable which
joinby-- Form ail pairwisecombinationswithin groups
147
You want tO "join" the information for the parents and their children. The data on parents are in memory;the data on children are on disk. dhild.dta has been sorted by family_id, but parerit._ti has not, so first we sort the parent _data on famity_id: • Sort
i family_id
• joinby
family_id
using
child
• describe Co,tails
data
o.bs:I
8
vats:
6 168
Data (99.4_, of memory
free)
i
i
storage
on Parents
,
display
value
type
format
label
family¢id
int
Y,8.0g
Family
Id Number
paz_nt_id
float
Y,9.0g
Parent
Id Number
Xl
float
%9.0g
x3
float
Y.9.0g
child__d
byte
XS.0g
x2
int
Y,8.0g
variable
name
Sorted iby : Npte:
dataset
has changed
variable
Child
since
last
label
Id Number
saved
l_st 1.
family-d 1025
parent_id 12
xl 27
x3 721
child_id 3
2;.
1025
11
20
643
3
320
3. 4.
1025 1025
12 11
27 20
721 643
1 1
300 300
5,
1025
li
20
643
4
275
6. 7.
1025 1026
12 13
27 30
721 760
4 2
275 280
8.
1026
14
26
668
2
280
x2 320
Notice that
I , I
1. fami_y__d of I027, which appears only in child.dta, and family_id only in Narent. dta, are not in the combined dataset. Observations variable(s) are not in both datasets are omitted.
of 1030, which appears for which the matching
2. The x_ v_riable is in both datasets. Values for this variable in the joined dataset are the values from par_nt.dta--the dataset in memory when we issued the joinby command. If we had cMld.d_a in memory and parent.dta on di_k when we requested joinby, the values for xl wouldiha_'e been from child.dta. Values from the dataset in memory take precedence over the datasel o_i disk. q
Methods and Formulas

joinby is implemented as an ado-file.

Acknowledgment

joinby was written by Jeroen Weesie, Department of Sociology, Utrecht University, Netherlands.

Also See

Complementary:   [R] save

Related:         [R] append, [R] cross, [R] fillin, [R] merge

Background:      [U] 25 Commands for combining data
Title

kappa -- Interrater agreement

Syntax

    kap varname1 varname2 [varname3 ...] [weight] [if exp] [in range]
        [, tab wgt(wgtid) absolute ]

    kappa varlist [if exp] [in range]

fweights are allowed; see [U] 14.1.6 weight.

Description

kap (first syntax) calculates the kappa-statistic measure of interrater agreement when there are two unique raters and two or more ratings.

kapwgt defines weights for use by kap in measuring the importance of disagreements.

kap (second syntax) and kappa calculate the kappa-statistic measure in the case of two or more (nonunique) raters and two outcomes, more than two outcomes when the number of raters is fixed, and more than two outcomes when the number of raters varies. kap (second syntax) and kappa produce the same results; they merely differ in how they expect the data to be organized.

kap assumes that each observation is a subject. varname1 contains the ratings by the first rater, varname2 by the second rater, and so on.

kappa also assumes that each observation is a subject. The variables, however, record the frequencies with which ratings were assigned. The first variable records the number of times the first rating was assigned, the second variable records the number of times the second rating was assigned, and so on.
Options

tab displays a tabulation of the assessments by the two raters.

wgt(wgtid) specifies that wgtid is to be used to weight disagreements. User-defined weights can be created using kapwgt; in that case, wgt() specifies the name of the user-defined matrix. For instance, you might define

    . kapwgt mine 1 \ .8 1 \ 0 .8 1 \ 0 0 .8 1

and then

    . kap rata ratb, wgt(mine)
In addition, two prerecorded weights are available.
wgt(w) specifies weights 1 - |i-j|/(k-1), where i and j index the rows and columns of the ratings by the two raters and k is the maximum number of possible ratings.

wgt(w2) specifies weights 1 - {(i-j)/(k-1)}^2.
absolute is relevant only if wgt() is also specified; see wgt() above. Option absolute modifies how i, j, and k in the formulas below are defined and how corresponding entries are found in a user-defined weighting matrix. When absolute is not specified, i and j refer to the row and column index, not the ratings themselves. Say the ratings are recorded as {0, 1, 1.5, 2}. There are 4 ratings; k = 4 and i and j are still 1, 2, 3, and 4 in the formulas below. Index 3, for instance, corresponds to rating = 1.5. This is convenient but can, with some data, lead to difficulties. When absolute is specified, all ratings must be integers and they must be coded from the set {1, 2, 3, ...}. Not all values need be used; integer values that do not occur are simply assumed to be unobserved.
Remarks

The kappa-statistic measure of agreement is scaled to be 0 when the amount of agreement is what would be expected to be observed by chance and 1 when there is perfect agreement. For intermediate values, Landis and Koch (1977a, 165) suggest the following interpretations:

    below 0.0     Poor
    0.00-0.20     Slight
    0.21-0.40     Fair
    0.41-0.60     Moderate
    0.61-0.80     Substantial
    0.81-1.00     Almost Perfect
The case of 2 raters

> Example

Consider the classification by two radiologists of 85 xeromammograms as normal, benign disease, suspicion of cancer, or cancer (a subset of the data from Boyd et al. 1982 and discussed in the context of kappa in Altman 1991, 403-405).

    . tabulate rada radb

    Radiologist |       Radiologist B's assessment
    A's assess- |
    ment        |  normal   benign  suspect   cancer |    Total
    ------------+------------------------------------+---------
         normal |      21       12        0        0 |       33
         benign |       4       17        1        0 |       22
        suspect |       3        9       15        2 |       29
         cancer |       0        0        0        1 |        1
    ------------+------------------------------------+---------
          Total |      28       38       16        3 |       85

Our dataset contains two variables: rada, radiologist A's assessment, and radb, radiologist B's assessment. Each observation is a patient.

We can obtain the kappa measure of interrater agreement by typing
    . kap rada radb

                 Expected
    Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
    -----------------------------------------------------------------
      63.53%      30.82%      0.4728     0.0694        6.81     0.0000

Had each radiologist made his determination randomly (but with probabilities equal to the overall proportions), we would expect the two radiologists to agree on 30.8% of the patients. In fact, they agreed on 63.5% of the patients, or 47.3% of the way between random agreement and perfect agreement. The amount of agreement indicates that we can reject the hypothesis that they are making their determinations randomly.
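The agreement computation can be replicated directly from the 4 x 4 table. The Python sketch below (Python rather than Stata, for illustration only) computes the observed agreement, the chance-expected agreement, and kappa; on the radiologists' table it reproduces the figures Stata reports (63.53%, 30.82%, 0.4728).

```python
def kappa_from_table(counts):
    """Unweighted kappa from a square contingency table of counts."""
    n = sum(sum(row) for row in counts)
    k = len(counts)
    row_tot = [sum(counts[i]) for i in range(k)]
    col_tot = [sum(counts[i][j] for i in range(k)) for j in range(k)]
    p_o = sum(counts[i][i] for i in range(k)) / n          # observed
    p_e = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2  # chance
    return p_o, p_e, (p_o - p_e) / (1 - p_e)

table = [[21, 12,  0, 0],
         [ 4, 17,  1, 0],
         [ 3,  9, 15, 2],
         [ 0,  0,  0, 1]]
p_o, p_e, kappa = kappa_from_table(table)
print(round(100 * p_o, 2), round(100 * p_e, 2), round(kappa, 4))
```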
> Example

There is a difference between two radiologists disagreeing about whether a xeromammogram indicates cancer or the suspicion of cancer and disagreeing about whether it indicates cancer or is normal. The weighted kappa attempts to deal with this. kap provides two "prerecorded" weights, w and w2:

    . kap rada radb, wgt(w)

    Ratings weighted by:
    1.0000  0.6667  0.3333  0.0000
    0.6667  1.0000  0.6667  0.3333
    0.3333  0.6667  1.0000  0.6667
    0.0000  0.3333  0.6667  1.0000

                 Expected
    Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
    -----------------------------------------------------------------
      86.67%      69.11%      0.5684     0.0788        7.22     0.0000

The w weights are given by 1 - |i-j|/(k-1), where i and j index the rows and columns of the ratings by the two raters and k is the maximum number of possible ratings. In our case, the rows and columns of the 4 x 4 matrix correspond to the ratings normal, benign, suspicious, and cancerous; the weighting matrix is printed above the table. A weight of 1 indicates that an observation should count as perfect agreement. The matrix has 1s down the diagonal--when both radiologists make the same assessment, they are in agreement. A weight of, say, 0.6667 means they are in two-thirds agreement. In our matrix, they get that score if they are "one apart"--one radiologist assesses cancer and the other is merely suspicious, or one is suspicious and the other says benign, and so on. An entry of 0.3333 means they are in one-third agreement or, if you prefer, two-thirds disagreement. That is the score attached when they are "two apart". Finally, they are in complete disagreement when the weight is zero, which happens only when they are three apart--one says cancer and the other says normal. The w2 weighting, which squares the term (i-j)/(k-1) and so counts large disagreements much more heavily, is probably inappropriate here.
> Example

In addition to prerecorded weights, you can define your own weights with the kapwgt command. For instance, you might feel that suspicious and cancerous are reasonably similar, benign and normal reasonably similar, but the suspicious/cancerous group is nothing like the benign/normal group:

    . kapwgt xm 1 \ .8 1 \ 0 0 1 \ 0 0 .8 1

    . kapwgt xm
    1.0000
    0.8000  1.0000
    0.0000  0.0000  1.0000
    0.0000  0.0000  0.8000  1.0000

You name the weights--we named ours xm--and after the weight name, you enter the lower triangle of the weighting matrix, using \ to separate rows. In our example, we have four outcomes, and so we continued entering numbers until we had defined the fourth row of the weighting matrix. If you type kapwgt followed by a name and nothing else, it shows you the weights recorded under that name. Satisfied that we have entered them correctly, we now use the weights to recalculate kappa:

    . kap rada radb, wgt(xm)

    Ratings weighted by:
    1.0000  0.8000  0.0000  0.0000
    0.8000  1.0000  0.0000  0.0000
    0.0000  0.0000  1.0000  0.8000
    0.0000  0.0000  0.8000  1.0000

                 Expected
    Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
    -----------------------------------------------------------------
      80.47%      52.67%      0.5874     0.0865        6.79     0.0000
Technical Note

In addition to weights for weighting the differences in categories, you can specify Stata's traditional weights for weighting the data. In the examples above, we have 85 observations in our dataset--one for each patient. If all we knew was the table of outcomes--that there were 21 patients rated normal by both radiologists, etc.--it would be easier to enter the table into Stata and work from it. The easiest way to enter the data is with tabi; see [R] tabulate.

    . tabi 21 12 0 0 \ 4 17 1 0 \ 3 9 15 2 \ 0 0 0 1, replace

           |                  col
       row |       1        2        3        4 |    Total
    -------+------------------------------------+---------
         1 |      21       12        0        0 |       33
         2 |       4       17        1        0 |       22
         3 |       3        9       15        2 |       29
         4 |       0        0        0        1 |        1
    -------+------------------------------------+---------
     Total |      28       38       16        3 |       85

          Pearson chi2(9) =  77.8111   Pr = 0.000
tabi felt obligated to tell us the Pearson chi-squared for this table, but we do not care about it. The important thing is that, with the replace option, tabi left the table in memory:

    . list in 1/5

         row   col   pop
      1.   1     1    21
      2.   1     2    12
      3.   1     3     0
      4.   1     4     0
      5.   2     1     4

The variable row is radiologist A's assessment, col radiologist B's assessment, and pop the number so assessed by both. Thus,

    . kap row col [freq=pop]

                 Expected
    Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
    -----------------------------------------------------------------
      63.53%      30.82%      0.4728     0.0694        6.81     0.0000

If we are going to keep these data, the names row and col are not indicative of what the data reflect. We could fix that (see [U] 15.6 Dataset, variable, and value labels):

    . rename row rada
    . rename col radb
    . label var rada "Radiologist A's assessment"
    . label var radb "Radiologist B's assessment"
    . label define assess 1 normal 2 benign 3 suspect 4 cancer
    . label values rada assess
    . label values radb assess
    . label data "Altman p. 403"
kap's tab option, which can be used with or without weighted data, shows the table of assessments:

    . kap rada radb [freq=pop], tab

    Radiologist |       Radiologist B's assessment
    A's assess- |
    ment        |  normal   benign  suspect   cancer |    Total
    ------------+------------------------------------+---------
         normal |      21       12        0        0 |       33
         benign |       4       17        1        0 |       22
        suspect |       3        9       15        2 |       29
         cancer |       0        0        0        1 |        1
    ------------+------------------------------------+---------
          Total |      28       38       16        3 |       85

                 Expected
    Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
    -----------------------------------------------------------------
      63.53%      30.82%      0.4728     0.0694        6.81     0.0000
Technical Note

You have data on individual patients. There are two raters, and the possible ratings are 1, 2, 3, and 4, but neither rater ever used rating 3:

    . tabulate ratera raterb

               |           raterb
        ratera |       1        2        4 |    Total
    -----------+---------------------------+---------
             1 |       6        4        3 |       13
             2 |       5        3        3 |       11
             4 |       1        1       26 |       28
    -----------+---------------------------+---------
         Total |      12        8       32 |       52

In this case, kap would determine that the ratings are from the set {1,2,4} because those were the only values observed. kap would expect a user-defined weighting matrix to be 3 x 3 and, were it not, kap would issue an error message. In the formula-based weights, the calculation would be based on i,j = 1, 2, 3 corresponding to the three observed ratings {1,2,4}.

Specifying the absolute option would make it clear that the ratings are 1, 2, 3, and 4; it just so happens that rating = 3 was never assigned. Were a user-defined weighting matrix also specified, kap would expect it to be 4 x 4 or larger (larger because one can think of the ratings being 1, 2, 3, 4, 5, ... and it just so happens that ratings 5, 6, ... were never observed, just as rating = 3 was not observed). In the formula-based weights, the calculation would be based on i,j = 1, 2, 4.

    . kap ratera raterb, wgt(w)

    Ratings weighted by:
    1.0000  0.5000  0.0000
    0.5000  1.0000  0.5000
    0.0000  0.5000  1.0000

                 Expected
    Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
    -----------------------------------------------------------------
      79.81%      57.17%      0.5285     0.1169        4.52     0.0000

    . kap ratera raterb, wgt(w) absolute

    Ratings weighted by:
    1.0000  0.6667  0.0000
    0.6667  1.0000  0.3333
    0.0000  0.3333  1.0000

                 Expected
    Agreement    Agreement     Kappa   Std. Err.         Z      Prob>Z
    -----------------------------------------------------------------
      81.41%      55.08%      0.5862     0.1209        4.85     0.0000

If all conceivable ratings are observed in the data, then whether absolute is specified makes no difference. For instance, if rater A assigns ratings {1,2,4} and rater B assigns {1,2,3,4}, then the complete set of assigned ratings is {1,2,3,4}, the same as absolute would specify. Without absolute, it makes no difference whether the ratings are coded {1,2,3,4}, {0,1,2,3}, {1,7,9,100}, {0,1,1.5,2.0}, or otherwise.
The case of more than two raters

In the case of more than two raters, the mathematics are such that the raters are not considered unique. For instance, if there are three raters, there is no assumption that the three raters who rate the first subject are the same as the three raters who rate the second. Although we call this the more than two raters case, it can be used with two raters when the raters' identities vary.

The nonunique-rater case can be usefully broken down into three subcases: (a) there are two possible ratings, which we will call positive and negative; (b) there are more than two possible ratings, but the number of raters per subject is the same for all subjects; and (c) there are more than two possible ratings and the number of raters per subject varies. kappa handles all these cases.

To emphasize that there is no assumption of constant identity of raters across subjects, the variables specified contain counts of the number of raters rating the subject into a particular category.
> Example

(Two ratings.) Fleiss (1981, 227) offers the following hypothetical ratings by different sets of raters on 25 subjects:

               No. of   No. of                   No. of   No. of
    Subject    raters   pos. ratings   Subject   raters   pos. ratings
       1          2         2             14        4         3
       2          2         0             15        2         0
       3          3         2             16        2         2
       4          4         3             17        3         1
       5          3         3             18        2         1
       6          4         1             19        4         1
       7          3         0             20        5         4
       8          5         0             21        3         2
       9          2         0             22        4         0
      10          4         4             23        3         0
      11          5         5             24        3         3
      12          3         3             25        2         2
      13          4         4

We have entered these data into Stata; the variables are called subject, raters, and pos. kappa, however, requires that we specify variables containing the number of positive ratings and the number of negative ratings; that is, pos and raters-pos:

    . gen neg = raters-pos

    . kappa pos neg

    Two-outcomes, multiple raters:

         Kappa        Z     Prob>Z
        0.5415     5.28     0.0000

We would have obtained the same results if we had typed kappa neg pos.
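The two-outcome, multiple-rater kappa follows the between/within mean-square construction given in Methods and Formulas below. The following Python sketch (for illustration; not Stata code) applies it to Fleiss's 25 subjects and reproduces the 0.5415 reported above.

```python
def kappa_binary_multirater(m, x):
    """Kappa for two outcomes with multiple (nonunique) raters.

    m[i] = number of raters for subject i; x[i] = number of positive
    ratings. Uses the between/within mean-square form of the estimator."""
    n = len(m)
    mbar = sum(m) / n                      # average number of raters
    pbar = sum(x) / sum(m)                 # overall proportion positive
    B = sum((xi - mi * pbar) ** 2 / mi for mi, xi in zip(m, x)) / n
    W = sum(xi * (mi - xi) / mi for mi, xi in zip(m, x)) / (n * (mbar - 1))
    return (B - W) / (B + (mbar - 1) * W)

raters = [2, 2, 3, 4, 3, 4, 3, 5, 2, 4, 5, 3, 4,
          4, 2, 2, 3, 2, 4, 5, 3, 4, 3, 3, 2]
pos    = [2, 0, 2, 3, 3, 1, 0, 0, 0, 4, 5, 3, 4,
          3, 0, 2, 1, 1, 1, 4, 2, 0, 0, 3, 2]
print(round(kappa_binary_multirater(raters, pos), 4))
```

As a sanity check, perfect agreement (every subject rated all-positive or all-negative) makes the within mean square W zero, so kappa is 1.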
> Example

(More than two ratings, constant number of raters.) Each of ten subjects is rated into one of three categories by five raters (Fleiss 1981, 230):

    . list

         subject   cat1   cat2   cat3
      1.       1      1      4      0
      2.       2      2      0      3
      3.       3      0      0      5
      4.       4      4      0      1
      5.       5      3      0      2
      6.       6      1      4      0
      7.       7      5      0      0
      8.       8      0      4      1
      9.       9      1      0      4
     10.      10      3      0      2

We obtain the kappa statistic:

    . kappa cat1-cat3

         Outcome     Kappa        Z     Prob>Z
            cat1    0.2917     2.92     0.0018
            cat2    0.6711     6.71     0.0000
            cat3    0.3490     3.49     0.0002
        combined    0.4179     5.83     0.0000

The first part of the output shows the results of calculating kappa for each of the categories separately against an amalgam of the remaining categories. For instance, the cat1 line is the two-rating kappa where positive is cat1 and negative is cat2 or cat3. The test statistic, however, is calculated differently (see Methods and Formulas). The combined kappa is the appropriately weighted average of the individual kappas. Note that there is considerably less agreement about the rating of subjects into the first category than there is for the second.
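The category-by-category kappas and the combined kappa can be reproduced with a few lines of Python (again purely illustrative): each category is compared against the amalgam of the others using the two-outcome estimator, and the combined value is the average weighted by p_j*q_j, as described in Methods and Formulas.

```python
def binary_kappa(m, x):
    """Two-outcome multirater kappa (between/within mean-square form)."""
    n = len(m)
    mbar = sum(m) / n
    pbar = sum(x) / sum(m)
    B = sum((xi - mi * pbar) ** 2 / mi for mi, xi in zip(m, x)) / n
    W = sum(xi * (mi - xi) / mi for mi, xi in zip(m, x)) / (n * (mbar - 1))
    return (B - W) / (B + (mbar - 1) * W)

# counts[i][j]: number of the 5 raters placing subject i in category j
counts = [[1, 4, 0], [2, 0, 3], [0, 0, 5], [4, 0, 1], [3, 0, 2],
          [1, 4, 0], [5, 0, 0], [0, 4, 1], [1, 0, 4], [3, 0, 2]]
m = [sum(row) for row in counts]
kappas, weights = [], []
for j in range(3):                       # each category vs the rest
    xj = [row[j] for row in counts]
    pbar = sum(xj) / sum(m)
    kappas.append(binary_kappa(m, xj))
    weights.append(pbar * (1 - pbar))    # weight p_j * q_j
combined = sum(w * k for w, k in zip(weights, kappas)) / sum(weights)
print([round(k, 4) for k in kappas], round(combined, 4))
```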
> Example

Now suppose that we have the same data as in the previous example, but that it is organized differently:

    . list

         subject   rater1   rater2   rater3   rater4   rater5
      1.       1        1        2        2        2        2
      2.       2        1        1        3        3        3
      3.       3        3        3        3        3        3
      4.       4        1        1        1        1        3
      5.       5        1        1        1        3        3
      6.       6        1        2        2        2        2
      7.       7        1        1        1        1        1
      8.       8        2        2        2        2        3
      9.       9        1        3        3        3        3
     10.      10        1        1        1        3        3

In this case, you would use kap rather than kappa:

    . kap rater1 rater2 rater3 rater4 rater5

    There are 5 raters per subject:

         Outcome     Kappa        Z     Prob>Z
               1    0.2917     2.92     0.0018
               2    0.6711     6.71     0.0000
               3    0.3490     3.49     0.0002
        combined    0.4179     5.83     0.0000

Note that the information about which rater is which is not exploited when there are more than two raters.
> Example

(More than two ratings, varying number of raters.) In this unfortunate case, kappa can be calculated, but there is no test statistic for testing against kappa > 0. You do nothing differently--kappa calculates the total number of raters for each subject and, if it is not a constant, it suppresses the calculation of test statistics.

    . list

         subject   cat1   cat2   cat3
      1.       1      1      3      0
      2.       2      2      0      3
      3.       3      0      0      5
      4.       4      4      0      1
      5.       5      3      0      2
      6.       6      1      4      0
      7.       7      5      0      0
      8.       8      0      4      1
      9.       9      1      0      2
     10.      10      3      0      2

    . kappa cat1-cat3

         Outcome     Kappa        Z     Prob>Z
            cat1    0.2685
            cat2    0.6457
            cat3    0.2938
        combined    0.3816

    note: Number of ratings per subject vary; cannot calculate test
          statistics.
> Example

This case is similar to the previous example, but the data are organized differently:

    . list

         subject   rater1   rater2   rater3   rater4   rater5
      1.       1        1        2        2        2        .
      2.       2        1        1        3        3        3
      3.       3        3        3        3        3        3
      4.       4        1        1        1        1        3
      5.       5        1        1        1        3        3
      6.       6        1        2        2        2        2
      7.       7        1        1        1        1        1
      8.       8        2        2        2        2        3
      9.       9        1        3        3        .        .
     10.      10        1        1        1        3        3

In this case, we specify kap instead of kappa:

    . kap rater1-rater5

    There are between 3 and 5 (median = 5.00) raters per subject:

         Outcome     Kappa        Z     Prob>Z
               1    0.2685
               2    0.6457
               3    0.2938
        combined    0.3816

    note: Number of ratings per subject vary; cannot calculate test
          statistics.
Saved Results

kap and kappa save in r():

Scalars
    r(N)         number of subjects (kap only)
    r(prop_o)    observed proportion of agreement (kap only)
    r(prop_e)    expected proportion of agreement (kap only)
    r(kappa)     kappa
    r(z)         z statistic
    r(se)        standard error for kappa statistic

Methods and Formulas

kap, kapwgt, and kappa are implemented as ado-files.
The kappa statistic was first proposed by Cohen (1960). The generalization for weights reflecting the relative seriousness of each possible disagreement is due to Cohen (1968). The analysis-of-variance approach for k = 2 and m >= 2 is due to Landis and Koch (1977b). See Altman (1991, 403-409) or Dunn (2000, chapter 2) for an introductory treatment and Fleiss (1981, 212-236) for a more detailed treatment. All formulas below are as presented in Fleiss (1981). Let m be the number of raters and let k be the number of rating outcomes.
kap: m = 2

Define w_{ij} (i = 1,...,k, j = 1,...,k) as the weights for agreement and disagreement (wgt()) or, if not weighted, define w_{ii} = 1 and w_{ij} = 0 for i != j. If wgt(w) is specified, w_{ij} = 1 - |i-j|/(k-1). If wgt(w2) is specified, w_{ij} = 1 - {(i-j)/(k-1)}^2.

The observed proportion of agreement is

    p_o = \sum_{i=1}^{k} \sum_{j=1}^{k} w_{ij} p_{ij}

where p_{ij} is the fraction of ratings i by the first rater and j by the second. The expected proportion of agreement is

    p_e = \sum_{i=1}^{k} \sum_{j=1}^{k} w_{ij} p_{i.} p_{.j}

where p_{i.} = \sum_j p_{ij} and p_{.j} = \sum_i p_{ij}.

Kappa is given by \hat{\kappa} = (p_o - p_e)/(1 - p_e).

The standard error of \hat{\kappa} for testing against 0 is

    s_0 = \frac{1}{(1-p_e)\sqrt{n}} \sqrt{ \sum_i \sum_j p_{i.} p_{.j}
          \{ w_{ij} - (\bar{w}_{i.} + \bar{w}_{.j}) \}^2 - p_e^2 }

where n is the number of subjects being rated, \bar{w}_{i.} = \sum_j p_{.j} w_{ij}, and \bar{w}_{.j} = \sum_i p_{i.} w_{ij}. The test statistic Z = \hat{\kappa}/s_0 is assumed to be distributed N(0,1).
kappa: m > 2, k = 2

Each subject i, i = 1,...,n, is found by x_i of m_i raters to be positive (the choice as to what is labeled positive being arbitrary).

The overall proportion of positive ratings is \bar{p} = \sum_i x_i / (n\bar{m}), where \bar{m} = \sum_i m_i / n. The between-subjects mean square is (approximately)

    B = \frac{1}{n} \sum_i \frac{(x_i - m_i \bar{p})^2}{m_i}

and the within-subject mean square is

    W = \frac{1}{n(\bar{m}-1)} \sum_i \frac{x_i (m_i - x_i)}{m_i}

Kappa is then defined

    \hat{\kappa} = \frac{B - W}{B + (\bar{m}-1)W}

The standard error for testing against 0 (Fleiss and Cuzick 1979) is approximately equal to

    s_0 = \frac{1}{(\bar{m}-1)\sqrt{n \bar{m}_H}}
          \sqrt{ 2(\bar{m}_H - 1) +
          \frac{(\bar{m} - \bar{m}_H)(1 - 4\bar{p}\bar{q})}{\bar{m}\bar{p}\bar{q}} }

where \bar{m}_H is the harmonic mean of m_i and \bar{q} = 1 - \bar{p}. The test statistic Z = \hat{\kappa}/s_0 is assumed to be distributed N(0,1).
kappa: m > 2, k > 2

Let x_{ij} be the number of ratings on subject i, i = 1,...,n, into category j, j = 1,...,k. Define \bar{p}_j as the overall proportion of ratings in category j, \bar{q}_j = 1 - \bar{p}_j, and let \hat{\kappa}_j be the kappa statistic given above for k = 2 when category j is compared with the amalgam of all other categories. Kappa is (Landis and Koch 1977b)

    \bar{\kappa} = \frac{ \sum_j \bar{p}_j \bar{q}_j \hat{\kappa}_j }
                        { \sum_j \bar{p}_j \bar{q}_j }

In the case where the number of raters per subject, \sum_j x_{ij}, is a constant m for all i, Fleiss, Nee, and Landis (1979) derived the following formulas for the approximate standard errors. The standard error for testing \hat{\kappa}_j against 0 is

    s_j = \sqrt{ \frac{2}{n m (m-1)} }

and the standard error for testing \bar{\kappa} is

    \bar{s} = \frac{\sqrt{2}}{ \sum_j \bar{p}_j \bar{q}_j \sqrt{n m (m-1)} }
              \sqrt{ \Bigl(\sum_j \bar{p}_j \bar{q}_j\Bigr)^2 -
                     \sum_j \bar{p}_j \bar{q}_j (\bar{q}_j - \bar{p}_j) }
References

Altman, D. G. 1991. Practical Statistics for Medical Research. London: Chapman & Hall.

Boyd, N. F., C. Wolfson, M. Moskowitz, T. Carlile, M. Petitclerc, H. A. Ferri, E. Fishell, A. Gregoire, M. Kieman, J. D. Longley, I. S. Simor, and A. B. Miller. 1982. Observer variation in the interpretation of xeromammograms. Journal of the National Cancer Institute 68: 357-363.

Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20: 37-46.

------. 1968. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin 70: 213-220.

Dunn, G. 2000. Statistics in Psychiatry. London: Arnold.

Fleiss, J. L. 1981. Statistical Methods for Rates and Proportions. 2d ed. New York: John Wiley & Sons.

Fleiss, J. L. and J. Cuzick. 1979. The reliability of dichotomous judgments: unequal numbers of judges per subject. Applied Psychological Measurement 3: 537-542.

Fleiss, J. L., J. C. M. Nee, and J. R. Landis. 1979. Large sample variance of kappa in the case of different sets of raters. Psychological Bulletin 86: 974-977.

Gould, W. 1997. stata49: Interrater agreement. Stata Technical Bulletin 40: 2-8. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 20-28.

Landis, J. R. and G. G. Koch. 1977a. The measurement of observer agreement for categorical data. Biometrics 33: 159-174.

------. 1977b. A one-way components of variance model for categorical data. Biometrics 33: 671-679.

Steichen, T. J. and N. J. Cox. 1998a. sg84: Concordance correlation coefficient. Stata Technical Bulletin 43: 35-39. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 137-143.

------. 1998b. sg84.1: Concordance correlation coefficient, revisited. Stata Technical Bulletin 45: 21-23. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 143-145.

------. 2000. sg84.2: Concordance correlation coefficient: update for Stata 6. Stata Technical Bulletin 54: 25-26. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 169-170.
Also See
Related:    [R] tabulate
Title
kdensity -- Univariate kernel density estimation
Syntax
kdensity varname [weight] [if exp] [in range] [, nograph generate(newvarx newvardensity)
    n(#) width(#) [ biweight | cosine | epan | gauss | parzen | rectangle | triangle ]
    normal stud(#) at(varx) symbol(...) connect(...) title(string) graph_options ]

fweights and aweights are allowed; see [U] 14.1.6 weight.
Description
kdensity produces kernel density estimates and graphs the result.
Options
nograph suppresses the graph. This option is often used in combination with the generate() option.
generate(newvarx newvardensity) stores the results of the estimation. newvardensity will contain the density estimate; newvarx will contain the points at which the density is estimated.
n(#) specifies the number of points at which the density estimate is to be evaluated. The default is min(N, 50), where N is the number of observations in memory.
width(#) specifies the halfwidth of the kernel, the width of the density window around each point. If w() is not specified, then the "optimal" width is calculated and used. The optimal width is the width that would minimize the mean integrated square error if the data were Gaussian and a Gaussian kernel were used, so it is not optimal in any global sense. In fact, for multimodal and highly skewed densities, this width is usually too wide and oversmooths the density (Silverman 1986).
biweight, cosine, epan, gauss, parzen, rectangle, and triangle specify the kernel. By default, epan, specifying the Epanechnikov kernel, is used.
normal requests that a normal density be overlaid on the density estimate for comparison.
stud(#) specifies that a Student's t distribution with # degrees of freedom be overlaid on the density estimate for comparison.
at(varx) specifies a variable that contains the values at which the density should be estimated. This option allows you to more easily obtain density estimates for different variables or different subsamples of a variable and then overlay the estimated densities for comparison.
symbol(...) is graph, twoway's symbol() option for specifying the plotting symbol. The default is symbol(o); see [G] graph options.
connect(...) is graph, twoway's connect() option for how points are connected. The default is connect(l), meaning points are connected with straight lines; see [G] graph options.
title(string) is graph, twoway's title() option for specifying the title. The default title is "Kernel Density Estimate"; see [G] graph options.
graph_options are any of the other options allowed with graph, twoway; see [G] graph options.
Remarks
Kernel density estimators approximate the density f(x) from observations on x. Histograms do this too, and the histogram itself is a kind of kernel density estimate. The data are divided into nonoverlapping intervals, and counts are made of the number of data points within each interval. Histograms are bar graphs that depict these frequency counts; the bar is centered at the midpoint of each interval, and its height reflects the average number of data points in the interval.
In more general kernel density estimates, the range is still divided into intervals, and estimates of the density at the center of intervals are produced. One difference is that the intervals are allowed to overlap. One can think of sliding the interval, called a window, along the range of the data and collecting the center-point density estimates. The second difference is that, rather than merely counting the number of observations in a window, a weight between 0 and 1 is assigned, based on the distance from the center of the window, and the weighted values are summed. The function that determines these weights is called the kernel.
Kernel density estimates have the advantages of being smooth and of being independent of the choice of origin (corresponding to the location of the bins in a histogram).
See Salgado-Ugarte, Shimizu, and Taniuchi (1993) and Fox (1990) for discussions of kernel density estimators that stress their use as exploratory data analysis tools.
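The sliding-window computation described above can be sketched outside Stata in a few lines of Python. The Epanechnikov kernel is kdensity's default; the toy data and evaluation point below are our own illustration, not kdensity's internals:

```python
def epanechnikov(z):
    # Epanechnikov kernel: 0.75 * (1 - z^2) for |z| < 1, and 0 outside the window
    return 0.75 * (1.0 - z * z) if abs(z) < 1.0 else 0.0

def kdensity(data, points, width):
    """Kernel density estimate of `data` at each point in `points`:
    f(x) = (1/(n*h)) * sum_i K((x - X_i)/h), with halfwidth h = `width`."""
    n = len(data)
    return [sum(epanechnikov((x - xi) / width) for xi in data) / (n * width)
            for x in points]

# each observation inside the window contributes a weight between 0 and 1
data = [1.0, 2.0, 2.5, 3.0, 4.0]
est = kdensity(data, [2.5], 2.0)   # density estimate at x = 2.5
```

Observations at the center of the window receive the full kernel weight, and the weight falls smoothly to zero at the window's edge, which is what makes the estimate smooth and independent of any bin origin.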
Example
Goeden (1978) reports data consisting of 316 length observations of coral trout. We wish to investigate the underlying density of the lengths. To begin on familiar ground, we might draw a histogram. In [G] histogram, we suggest setting the number of bins to min(sqrt(n), 10*log10(n)), which for n = 316 is roughly 18:

. graph length, xlab ylab bin(18)
(histogram of length omitted)
The kernel density estimate, on the other hand, is smooth.

. kdensity length, xlab ylab
(graph omitted: Kernel Density Estimate of length)
Kernel density estimators are, however, sensitive to an assumption, just as are histograms. In histograms, we specify a number of bins. For kernel density estimators, we specify a width. In the graph above, we used the default width. kdensity is smarter than graph, histogram in that its default width is not a fixed constant. Even so, the default width is not necessarily best.
kdensity saves the width in the return scalar width, so typing display r(width) reveals it. Doing this, we discover that the width is approximately 20.
Widths are similar to the inverse of the number of bins in a histogram in that smaller widths provide more detail. The units of the width are the units of the variable being analyzed. The width is specified as a halfwidth, meaning that the kernel density estimator with halfwidth 20 corresponds to sliding a window of size 40 across the data.
We can specify halfwidths for ourselves using the width() option. Smaller widths do not smooth the density as much.

. kdensity length, epan w(10) xlab ylab
(graph omitted: Kernel Density Estimate)
. kdensity length, epan xlab ylab w(15)
(graph omitted: Kernel Density Estimate)
Example
When widths are held constant, different kernels can produce surprisingly different results. This is really an attribute of the kernel and width combination; for a given width, some kernels are more sensitive than others at identifying peaks in the density estimate. We can see this when using a dataset with lots of peaks. In the automobile dataset, we characterize the density of weight, the weight of the vehicles. Below, we compare the Epanechnikov and Parzen kernels.

. kdensity weight, epan nogr g(x epan)
. kdensity weight, parzen nogr g(x2 parzen)
. label var epan "Epanechnikov Density Estimate"
. label var parzen "Parzen Density Estimate"
. gr epan parzen x, xlab ylab c(ll)
(graph omitted: Epanechnikov Density Estimate and Parzen Density Estimate against Weight (lbs.))
We did not specify a width and so obtained the default width. That width is not a function of the selected kernel, but of the data. See the Methods and Formulas section for the calculation of the optimal width.
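The "optimal" width discussed above is, under the Gaussian assumptions stated in the width() option, Silverman's rule of thumb. A sketch follows; the constant 0.9 and the crude quartile positions are the textbook formula, which may differ in detail from kdensity's exact computation:

```python
def silverman_width(x):
    """Rule-of-thumb halfwidth 0.9 * min(sd, IQR/1.349) * n^(-1/5),
    which minimizes mean integrated square error for Gaussian data."""
    n = len(x)
    mean = sum(x) / n
    sd = (sum((v - mean) ** 2 for v in x) / (n - 1)) ** 0.5
    s = sorted(x)
    iqr = s[(3 * n) // 4] - s[n // 4]   # crude quartile positions
    scale = min(sd, iqr / 1.349)        # robust measure of spread
    return 0.9 * scale / n ** 0.2
```

Because the width shrinks only at the slow rate n^(-1/5), even large samples keep a fairly wide window, which is why multimodal densities tend to be oversmoothed by the default.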
Example
In examining the density estimates, we may wish to overlay a normal density or a Student's t density for comparison. Using automobile weights, we can get an idea of the distance from normality with the normal option.

. kdensity weight, epan normal xlab ylab
(graph omitted: kernel density estimate of Weight (lbs.) with normal density overlaid)
Example
Another common desire in examining density estimates is to compare two or more densities. In this example, we will compare the density estimates of the weights for the foreign and domestic cars.

. kdensity weight, nogr gen(x fx)
. kdensity weight if foreign==0, nogr gen(fx0) at(x)
. kdensity weight if foreign==1, nogr gen(fx1) at(x)
. label var fx0 "Domestic cars"
. label var fx1 "Foreign cars"
. gr fx0 fx1 x, c(ll) s(TS) xlab ylab
(graph omitted: density estimates of Weight (lbs.) for Domestic cars and Foreign cars)
Logit estimates                                   Number of obs   =        189
                                                  Prob > chi2     =     0.0000
Log likelihood = -98.777998                       Pseudo R2       =     0.1582

------------------------------------------------------------------------------
         low |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.0464796   .0373888    -1.24   0.214    -.1197603    .0268011
         lwd |   .8420615   .4055338     2.08   0.038     .0472299    1.636893
       black |   1.073456   .5150752     2.08   0.037     .0639273    2.082985
       other |    .815367   .4452979     1.83   0.067    -.0574008    1.688135
       smoke |   .8071996    .404446     2.00   0.046     .0145001    1.599899
         ptd |   1.281678   .4621157     2.77   0.006     .3759478    2.187408
          ht |   1.435227   .6482699     2.21   0.027     .1646415    2.705813
          ui |   .6576256   .4666192     1.41   0.159    -.2569313    1.572182
       _cons |  -1.216781   .9556797    -1.27   0.203    -3.089878      .656317
------------------------------------------------------------------------------
To get the odds ratio for black smokers relative to white nonsmokers (the reference group), type
. lincom black + smoke, or
 ( 1)  black + smoke = 0.0

(output omitted)

lincom computed exp(b[black] + b[smoke]) = 6.56. To see the odds ratio for white smokers relative to black nonsmokers, type

. lincom smoke - black, or
 ( 1)  - black + smoke = 0.0

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .7662425   .4430176    -0.46   0.645     .2467334    2.379603
------------------------------------------------------------------------------
Now let's add the interaction terms to the model (Hosmer and Lemeshow 1989, Table 4.10). This time we will use logistic rather than logit. By default, logistic displays odds ratios.

. logistic low age black other smoke ht ui lwd ptd agelwd smokelwd

Logit estimates                                   Number of obs   =        189
                                                  LR chi2(10)     =      42.66
                                                  Prob > chi2     =     0.0000
Log likelihood = -96.00616                        Pseudo R2       =     0.1818

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .9194513    .041896    -1.84   0.065     .8408967    1.005344
       black |    2.95383   1.532788     2.09   0.037     1.068277    8.167462
       other |   2.137589   .9919132     1.64   0.102     .8608713    5.307749
       smoke |   3.168096   1.452377     2.52   0.012     1.289956    7.780755
          ht |   3.893141     2.5752     2.05   0.040     1.064768     14.2346
          ui |   2.071284   .9931385     1.52   0.129     .8092928    5.301191
         lwd |   .1772934   .3312383    -0.93   0.354     .0045539    6.902359
         ptd |   3.426633   1.615282     2.61   0.009     1.360252    8.632086
      agelwd |    1.15883     .09602     1.78   0.075     .9851216     1.36317
    smokelwd |   .2447849   .2003996    -1.72   0.086     .0491956    1.217988
------------------------------------------------------------------------------
Hosmer and Lemeshow (1989, Table 4.13) consider the effects of smoking (smoke = 1) and low maternal weight prior to pregnancy (lwd = 1). The effect of smoking among non-low-weight mothers (lwd = 0) is given by the odds ratio 3.17 for smoke in the logistic output. The effect of smoking among low-weight mothers is given by
. lincom smoke + smokelwd
 ( 1)  smoke + smokelwd = 0.0

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .7755022   .5749508    -0.34   0.732     .1813465    3.316322
------------------------------------------------------------------------------

Note that we did not have to specify the or option. After logistic, lincom assumes or by default.
The effect of low weight (lwd = 1) is more complicated, since we fit an age x lwd interaction. We must specify the age of the mothers for the effect. The effect among 30-year-old nonsmokers is given by
. lincom lwd + 30*agelwd
 ( 1)  lwd + 30.0 agelwd = 0.0

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |    14.7669   13.56689     2.93   0.003     2.439266    89.39625
------------------------------------------------------------------------------

lincom computed exp(b[lwd] + 30*b[agelwd]) = 14.8. It seems odd that we entered it as lwd + 30*agelwd, but remember that lwd and agelwd are just lincom's (and test's) shorthand for _b[lwd] and _b[agelwd]. We could have typed

. lincom _b[lwd] + 30*_b[agelwd]
 ( 1)  lwd + 30.0 agelwd = 0.0

------------------------------------------------------------------------------
         low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |    14.7669   13.56689     2.93   0.003     2.439266    89.39625
------------------------------------------------------------------------------
Multiple-equation models
lincom also works with multiple-equation models. The only difference is how you refer to the coefficients. Recall that for multiple-equation models, coefficients are referenced using the syntax [eqno]varname, where eqno is the equation number or equation name and varname is the corresponding variable name for the coefficient; see [U] 16.5 Accessing coefficients and standard errors and [R] test for details.

Example
Consider the example from [R] mlogit (Tarlov et al. 1989; Wells et al. 1989).
. mlogit insure age male nonwhite site2 site3, nolog

Multinomial regression                            Number of obs   =        615
                                                  LR chi2(10)     =      42.99
                                                  Prob > chi2     =     0.0000
Log likelihood = -534.36165                       Pseudo R2       =     0.0387

------------------------------------------------------------------------------
      insure |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Prepaid      |
         age |   -.011745   .0061946    -1.90   0.058    -.0238862    .0003962
        male |   .5616934   .2027465     2.77   0.006     .1643175    .9590693
    nonwhite |   .9747768   .2363213     4.12   0.000     .5115955    1.437958
       site2 |   .1130359   .2101903     0.54   0.591    -.2989296    .5250013
       site3 |  -.5879879   .2279351    -2.58   0.010    -1.034733   -.1412433
       _cons |   .2697127   .3284422     0.82   0.412    -.3740222    .9134476
-------------+----------------------------------------------------------------
Uninsure     |
         age |  -.0077961   .0114416    -0.68   0.496    -.0302217    .0146294
        male |   .4518496   .3674867     1.23   0.219     -.268411      1.17211
    nonwhite |   .2170589   .4256361     0.51   0.610    -.6171725      1.05129
       site2 |  -1.211563   .4705127    -2.57   0.010    -2.133751    -.2893747
       site3 |  -.2078123   .3662926    -0.57   0.570    -.9257327      .510108
       _cons |  -1.286943   .5923219    -2.17   0.030    -2.447872    -.1260135
------------------------------------------------------------------------------
. lincom [Prepaid]male + [Prepaid]nonwhite
 ( 1)  [Prepaid]male + [Prepaid]nonwhite = 0.0

(coefficient-scale output: z = 4.70, P>|z| = 0.000, 95% CI [.8950741, 2.177866])

To view the estimate as a ratio of relative risks (see [R] mlogit for the definition and interpretation), specify the rrr option.

. lincom [Prepaid]male + [Prepaid]nonwhite, rrr
 ( 1)  [Prepaid]male + [Prepaid]nonwhite = 0.0

------------------------------------------------------------------------------
      insure |        RRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   4.648154   1.521103     4.70   0.000     2.447517    8.827451
------------------------------------------------------------------------------
Saved Results
lincom saves in r():

Scalars
    r(estimate)    point estimate
    r(se)          estimate of standard error
    r(df)          degrees of freedom
Methods and Formulas
lincom is implemented as an ado-file.
References
Hosmer, D. W., Jr., and S. Lemeshow. 1989. Applied Logistic Regression. New York: John Wiley & Sons. (Second edition forthcoming in 2001.)
Tarlov, A. R., J. E. Ware, Jr., S. Greenfield, E. C. Nelson, E. Perrin, and M. Zubkoff. 1989. The medical outcomes study. Journal of the American Medical Association 262: 925-930.
Wells, K. E., R. D. Hays, M. A. Burnam, W. H. Rogers, S. Greenfield, and J. E. Ware, Jr. 1989. Detection of depressive disorder for patients receiving prepaid or fee-for-service care. Journal of the American Medical Association 262: 3298-3302.
Also See
Related:       [R] svylc, [R] svytest, [R] test, [R] testnl
Background:    [U] 16.5 Accessing coefficients and standard errors,
               [U] 23 Estimation and post-estimation commands
Title
linktest -- Specification link test for single-equation models
Syntax
linktest [if exp] [in range] [, estimation_options ]

When if exp and in range are not specified, the link test is performed on the same sample as the previous estimation.
Description
linktest performs a link test for model specification after any single-equation estimation command, such as logistic, regress, etc.; see [R] estimation commands.
Options
estimation_options must be the same options specified with the underlying estimation command.
Remarks
The form of the link test implemented here is based on an idea of Tukey (1949), which was further described by Pregibon (1980), elaborating on work in his unpublished thesis (Pregibon 1979). See Methods and Formulas below for more details.
Example
We attempt to explain the mileage ratings of cars in our automobile dataset using the weight, engine displacement, and whether the car is manufactured outside the U.S.:

. regress mpg weight displ foreign

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------          F(  3,    70) =   45.88
       Model |  1619.71935     3  539.906448          Prob > F      =  0.0000
    Residual |  823.740114    70  11.7677159          R-squared     =  0.6629
-------------+------------------------------          Adj R-squared =  0.6484
       Total |  2443.45946    73  33.4720474          Root MSE      =  3.4304

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0067745   .0011665    -5.81   0.000    -.0091011   -.0044479
displacement |   .0019286   .0100701     0.19   0.849    -.0181556    .0220129
     foreign |  -1.600631   1.113648    -1.44   0.155    -3.821732    .6204699
       _cons |   41.84795   2.350704    17.80   0.000     37.15962    46.53628
------------------------------------------------------------------------------
Based on the R-squared, we are reasonably pleased with this model. If our model really is specified correctly, then were we to regress mpg on the prediction and the prediction squared, the prediction squared would have no explanatory power. This is what linktest does:

. linktest

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------          F(  2,    71) =   76.75
       Model |  1670.71514     2  835.357572          Prob > F      =  0.0000
    Residual |  772.744316    71  10.8837228          R-squared     =  0.6837
-------------+------------------------------          Adj R-squared =  0.6748
       Total |  2443.45946    73  33.4720474          Root MSE      =   3.299

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |  -.4127198   .6577736    -0.63   0.532    -1.724283    .8988434
      _hatsq |   .0338198    .015624     2.16   0.034     .0026664    .0649732
       _cons |   14.00705   6.713276     2.09   0.041     .6211541    27.39294
------------------------------------------------------------------------------

We find that the prediction squared does have explanatory power, so our specification is not as good as we thought.
Although linktest is formally a test of the specification of the dependent variable, it is often interpreted as a test that, conditional on the specification, the independent variables are specified incorrectly. We will follow that interpretation and now include weight-squared in our model:

. gen weight2 = weight*weight
. regress mpg weight weight2 displ foreign

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------          F(  4,    69) =   39.37
       Model |  1699.02634     4  424.756584          Prob > F      =  0.0000
    Residual |  744.433124    69  10.7888859          R-squared     =  0.6953
-------------+------------------------------          Adj R-squared =  0.6777
       Total |  2443.45946    73  33.4720474          Root MSE      =  3.2846

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0173257   .0040488    -4.28   0.000    -.0254028   -.0092486
     weight2 |   1.87e-06   6.89e-07     2.71   0.008     4.93e-07    3.24e-06
displacement |  -.0101625   .0106236    -0.96   0.342     -.031356     .011031
     foreign |  -2.560016   1.123506    -2.28   0.026    -4.801349   -.3186832
       _cons |   58.23575   6.449882     9.03   0.000     45.36859    71.10291
------------------------------------------------------------------------------

And now we perform the link test on our new model:

. linktest
      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------          F(  2,    71) =   81.08
       Model |  1699.39489     2  849.697445          Prob > F      =  0.0000
    Residual |   744.06457    71  10.4797827          R-squared     =  0.6955
-------------+------------------------------          Adj R-squared =  0.6869
       Total |  2443.45946    73  33.4720474          Root MSE      =  3.2372

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   1.141987   .7612218     1.50   0.138    -.3758456    2.659821
      _hatsq |  -.0031916   .0170194    -0.19   0.852    -.0371272    .0307441
       _cons |   -1.50305   8.196444    -0.18   0.855    -17.84629    14.84019
------------------------------------------------------------------------------

We now pass the link test.
Example
Above we followed a standard misinterpretation of the link test: when we discovered a problem, we focused on the explanatory variables of our model. It is at least worth considering varying exactly what the link test tests. The link test told us that our dependent variable was misspecified. For those with an engineering background, mpg is indeed a strange measure. It would make more sense to model energy consumption, gallons per mile, in terms of weight and displacement:

. regress gpm weight displ foreign

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------          F(  3,    70) =   76.33
       Model |  .009157962     3  .003052654          Prob > F      =  0.0000
    Residual |  .002799666    70  .000039995          R-squared     =  0.7659
-------------+------------------------------          Adj R-squared =  0.7558
       Total |  .011957628    73  .000163803          Root MSE      =  .00632

------------------------------------------------------------------------------
         gpm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   .0000144   2.15e-06     6.72   0.000     .0000102    .0000187
displacement |   .0000186   .0000186     1.00   0.319    -.0000184    .0000557
     foreign |   .0066981   .0020531     3.26   0.002     .0026034    .0107928
       _cons |   .0008917   .0043337     0.21   0.838    -.0077515     .009535
------------------------------------------------------------------------------

This model looks every bit as reasonable as our original model:

. linktest

      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------          F(  2,    71) =  117.06
       Model |  .009175219     2  .004587609          Prob > F      =  0.0000
    Residual |  .002782409    71  .000039189          R-squared     =  0.7673
-------------+------------------------------          Adj R-squared =  0.7608
       Total |  .011957628    73  .000163803          Root MSE      =  .00626

------------------------------------------------------------------------------
         gpm |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .6608413   .5152751     1.28   0.204    -.3665877     1.68827
      _hatsq |   3.275857   4.936655     0.66   0.509    -6.567553    13.11927
       _cons |    .008365   .0130468     0.64   0.523    -.0176496    .0343795
------------------------------------------------------------------------------

We pass the link test in this more parsimonious specification.
Example
The link test can be used with any single-equation estimation procedure, not solely regression. Let's turn our problem around and attempt to explain whether a car is manufactured outside the U.S. by its mileage rating and weight. To save paper, we will specify logit's nolog option, which suppresses the iteration log:

. logit foreign mpg weight, nolog

Logit estimates                                   Number of obs   =         74
                                                  LR chi2(2)      =      35.72
                                                  Prob > chi2     =     0.0000
Log likelihood = -27.175156                       Pseudo R2       =     0.3966

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -.1685869   .0919174    -1.83   0.067    -.3487418      .011568
      weight |  -.0039067   .0010116    -3.86   0.000    -.0058894     -.001924
       _cons |   13.70837   4.518707     3.03   0.002     4.851864     22.56487
------------------------------------------------------------------------------

When you run linktest after logit, the result is another logit specification:

. linktest, nolog

Logit estimates                                   Number of obs   =         74
                                                  LR chi2(2)      =      36.83
                                                  Prob > chi2     =     0.0000
Log likelihood = -26.615714                       Pseudo R2       =     0.4090

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   .8438531   .2738759     3.08   0.002     .3070661      1.38064
      _hatsq |  -.1559115   .1568642    -0.99   0.320    -.4633596     .1515366
       _cons |   .2630557   .4299598     0.61   0.541      -.57965     1.105761
------------------------------------------------------------------------------
The link test reveals no problems with our specification. If there had been a problem, we would have been virtually forced to accept the misinterpretation of the link test; we would have reconsidered our specification of the independent variables. When using logit, we have no control over the specification of the dependent variable other than to change likelihood functions. We admit to seeing a dataset once where the link test rejected the logit specification. We did change the likelihood function, re-estimating the model using probit, and satisfied the link test. Probit has thinner tails than logit. In general, however, you will not be so lucky.
Technical Note
You should specify exactly the same options with linktest as you do with the estimation command, although you do not have to follow this advice as literally as we did in the preceding example. logit's nolog option merely suppresses a part of the output, not what is estimated. We specified nolog both times to save paper.
If you are testing a cox model with censored observations, however, you must specify the dead() option on linktest as well. If you are testing a tobit model, you must specify the censoring points just as you do with the tobit command.
If you are not sure which options are important, duplicate exactly what you specified on the estimation command.
If you do not specify if exp or in range with linktest, Stata will by default perform the link test on the same sample as the previous estimation. Suppose that you omitted some data when performing your estimation, but want to calculate the link test on all the data, which you might do if you believed the model is appropriate for all the data. To do this, you would type

. linktest if e(sample) | !e(sample)
Saved Results
linktest saves in r():

Scalars
    r(t)     t statistic on _hatsq
    r(df)    degrees of freedom

linktest is not an estimation command in the sense that it leaves previous estimation results unchanged. For instance, one runs a regression and then performs the link test. Typing regress without arguments still replays the original regression.
In terms of integrating an estimation command with linktest, linktest assumes that the name of the estimation command is stored in e(cmd) and that the name of the dependent variable is in e(depvar). After estimation, it assumes that the number of degrees of freedom for the t test is given by e(df_r) if the macro is defined.
If the estimation command reports Z statistics instead of t statistics, linktest will also report Z statistics. The Z statistic, however, is still returned in r(t), and r(df) is set to a missing value.
Methods and Formulas
linktest is implemented as an ado-file.
The link test is based on the idea that if a regression or regression-like equation is properly specified, one should not be able to find any additional independent variables that are significant except by chance. One kind of specification error is called a link error. In regression, this means that the dependent variable needs a transformation or "link" function to properly relate to the independent variables. The idea of a link test is to add an independent variable to the equation that is especially likely to be significant if there is a link error.
Let

    y = f(Xb)

be the model and b be the parameter estimates. linktest calculates

    _hat = Xb

and

    _hatsq = _hat^2

The model is then refit with these two variables, and the test is based on the significance of _hatsq. This is the form suggested by Pregibon (1979), based on an idea of Tukey (1949). Pregibon (1980) suggests a slightly different method that has come to be known as "Pregibon's goodness-of-link test". We preferred the older version because it is universally applicable, straightforward, and a good second-order approximation. It is universally applicable in the sense that it can be applied to any single-equation estimation technique, whereas Pregibon's more recent tests are estimation-technique specific.
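The refit described above can be sketched in plain Python; the tiny normal-equations solver and the exactly-linear toy data are our illustrations, not linktest's internals:

```python
def lstsq(X, y):
    """Least squares via the normal equations (X'X) b = X'y,
    solved by Gauss-Jordan elimination (no pivoting; the toy
    data below are well conditioned enough for that)."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    v = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    for i in range(k):
        p = A[i][i]
        A[i] = [a / p for a in A[i]]
        v[i] /= p
        for r in range(k):
            if r != i:
                f = A[r][i]
                A[r] = [a - f * ai for a, ai in zip(A[r], A[i])]
                v[r] -= f * v[i]
    return v

def linktest(X, y):
    """Fit y on X, form _hat = Xb and _hatsq = _hat^2, then refit y on
    [1, _hat, _hatsq]; a significant _hatsq signals a link error."""
    b = lstsq(X, y)
    hat = [sum(bi * xi for bi, xi in zip(b, row)) for row in X]
    Z = [[1.0, h, h * h] for h in hat]
    return lstsq(Z, y)   # coefficients [_cons, _hat, _hatsq]

# exactly linear toy data: the specification is correct, so _hatsq ~ 0
X = [[1.0, float(i)] for i in range(10)]
y = [2.0 + 3.0 * row[1] for row in X]
cons, bhat, bhatsq = linktest(X, y)
```

Because the toy model is exactly specified, the refit loads everything on _hat and the _hatsq coefficient is numerically zero; with a misspecified model, _hatsq picks up the curvature that the linear index missed.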
References
Pregibon, D. 1979. Data Analytic Methods for Generalized Linear Models. Ph.D. dissertation, University of Toronto.
------. 1980. Goodness of link tests for generalized linear models. Applied Statistics 29: 15-24.
Tukey, J. W. 1949. One degree of freedom for non-additivity. Biometrics 5: 232-242.

Also See
Related:    [R] estimation commands, [R] lrtest, [R] test, [R] testnl
Title
list -- List values of variables

Syntax
list [varlist] [if exp] [in range] [, [no]display nolabel noobs doublespace ]

by ... : may be used with list; see [R] by.
varlist may contain time-series operators; see [U] 14.4.3 Time-series varlists.

Description
list displays the values of variables. If no varlist is specified, the values of all the variables are displayed. Also see browse in [R] edit.

Options
[no]display forces the format into display or tabular (nodisplay) format. If you do not specify one of these two options, then Stata chooses based on its judgment of which would be most readable.
nolabel displays the numeric codes rather than the label values (strings).
noobs suppresses printing of the observation numbers.
doublespace requests that a blank line be inserted between observations. This option is ignored in display format.
Remarks
list typed by itself lists all the observations and all the variables in the dataset. If you specify a varlist, only those variables are listed. Specifying one or both of in range and if exp limits the observations listed.

Example
list has two output formats, known as tabular and display. The tabular format is suitable for listing a few variables, whereas the display format is suitable for listing an unlimited number of variables. Stata chooses automatically between those two formats:

. list in 1/2
Observation 1

        make   AMC Concord      price       4,099        mpg          22
       rep78             3   headroom         2.5      trunk          11
      weight         2,930     length         186       turn          40
    displa~t           121   gear_r~o        3.58    foreign    Domestic

Observation 2

        make     AMC Pacer      price       4,749        mpg          17
       rep78             3   headroom         3.0      trunk          11
      weight         3,350     length         173       turn          40
    displa~t           258   gear_r~o        2.53    foreign    Domestic

. list make mpg weight displ rep78 in 1/5

              make   mpg   weight   displa~t   rep78
 1.    AMC Concord    22    2,930        121       3
 2.      AMC Pacer    17    3,350        258       3
 3.     AMC Spirit    22    2,640        121       .
 4.  Buick Century    20    3,250        196       3
 5.  Buick Electra    15    4,080        350       4

The first case is an example of display format; the second is an example of tabular format. The tabular format is more readable and takes less space, but it is effective only if the variables can fit on a single line across the screen. Stata chose to list all twelve variables in display format, but when the varlist was restricted to five variables, Stata chose tabular format.
If you are dissatisfied with Stata's choice, you can make the decision yourself. Specify the display option to force display format and the nodisplay option to force tabular format.
Example
You can make the list easier to read by specifying the doublespace option:

. list make mpg weight displ foreign in 51/55, noobs double

           make   mpg   weight   displa~t    foreign
  Pont. Phoenix    19    3,420        231   Domestic

  Pont. Sunbird    24    2,690        151   Domestic

      Audi 5000    17    2,830        131    Foreign

       Audi Fox    23    2,070         97    Foreign

       BMW 320i    25    2,650        121    Foreign
Technical Note
You can suppress the use of value labels by specifying the nolabel option. For instance, the variable foreign in the examples above really contains numeric codes, 0 meaning Domestic and 1 meaning Foreign. When you list the variable, however, you see the corresponding value labels rather than the underlying numeric codes:

. list foreign in 51/55

       foreign
 51.  Domestic
 52.  Domestic
 53.   Foreign
 54.   Foreign
 55.   Foreign

Specifying the nolabel option displays the underlying numeric codes:

. list foreign in 51/55, nolabel

      foreign
 51.        0
 52.        0
 53.        1
 54.        1
 55.        1
Also See
Related:    [R] edit, [P] display, [P] tabdisp
Title
lnskew0 -- Find zero-skewness log or Box-Cox transform

Syntax
lnskew0 newvar = exp [if exp] [in range] [, level(#) delta(#) zero(#) ]

bcskew0 newvar = exp [if exp] [in range] [, level(#) delta(#) zero(#) ]
Description
lnskew0 creates newvar = ln(±exp − k), choosing k and the sign of exp so that the skewness of newvar is zero.
bcskew0 creates newvar = (exp^λ − 1)/λ, the Box-Cox power transformation (Box and Cox 1964), choosing λ so that the skewness of newvar is zero. exp must be strictly positive. Also see [R] boxcox for maximum likelihood estimation of λ.

Options
level(#) specifies the confidence level for a confidence interval for k (lnskew0) or λ (bcskew0). Unlike usual, the confidence interval is calculated only if level() is specified. As usual, # is specified as an integer; 95 means 95% confidence intervals. The level() option is honored only if the number of observations exceeds 7.
delta(#) specifies the increment used for calculating the derivative of the skewness function with respect to k (lnskew0) or λ (bcskew0). The default values are 0.02 for lnskew0 and 0.01 for bcskew0.
zero(#) specifies a value for skewness to determine convergence that is small enough to be considered zero, and is by default 0.001.
Remarks

Example
Using our automobile dataset (see [U] 9 Stata's on-line tutorials and sample datasets), we want to generate a new variable equal to ln(mpg - k) that is approximately normally distributed. mpg records the miles per gallon for each of our cars. One feature of the normal distribution is that it has skewness 0:

. lnskew0 lnmpg = mpg

    Transform |          k    [95% Conf. Interval]       Skewness
--------------+---------------------------------------------------
    ln(mpg-k) |   5.383659        (not calculated)      -7.05e-06
This created the new variable lnmpg = ln(mpg - 5.384):

. describe lnmpg

              storage  display     value
variable name   type   format      label      variable label
--------------------------------------------------------------
lnmpg           float  %9.0g                  ln(mpg-5.383659)

Since we did not specify the level() option, no confidence interval was calculated. At the outset, we could have typed

. lnskew0 lnmpg = mpg, level(95)

    Transform |          k    [95% Conf. Interval]       Skewness
--------------+---------------------------------------------------
    ln(mpg-k) |   5.383659    -17.12339    9.892416     -7.05e-06
The confidence interval is calculated under the assumption that ln(mpg − k) really does have a normal distribution. It would be perfectly reasonable to use lnskew0 even if we did not believe that the transformed variable would have a normal distribution (if we literally wanted the zero-skewness transform), although in that case the confidence interval would be an approximation of unknown quality to the true confidence interval. If we now wanted to test the believability of the confidence interval, we could test our new variable lnmpg using swilk with the lnnormal option.
Technical Note

lnskew0 (and bcskew0) reports the resulting skewness of the variable merely to reassure you of the accuracy of its results. In our example above, lnskew0 found k such that the resulting skewness was −7 × 10^−6, a number close enough to zero for all practical purposes. If you wanted to make it even smaller, you could specify the zero() option. Typing lnskew0 new=mpg, zero(1e-8) changes the estimated k to 5.383552 from 5.383659 and reduces the calculated skewness to −2 × 10^−11.

When you request a confidence interval, it is possible that lnskew0 will report the lower confidence limit as '.', which should be taken as indicating a lower confidence limit of k_L = −∞. (This cannot happen with bcskew0.)

As an example, consider a sample of size n on x, and assume the skewness of x is positive, but not significantly so at the desired significance level, say 5%. Then no matter how large and negative you make k_L, there is no value extreme enough to make the skewness of ln(x − k_L) equal the corresponding percentile (97.5 for a 95% confidence interval) of the distribution of skewness in a normal distribution of the same sample size. You cannot because the distribution of ln(x − k_L) tends to that of x (apart from location and scale shift) as k_L → −∞. This "problem" never applies to the upper confidence limit k_U, because the skewness of ln(x − k_U) tends to −∞ as k_U tends upward to the minimum value of x.
Example

In the example above, using lnskew0 with a variable like mpg is probably undesirable. mpg has a natural zero and we are shifting that zero arbitrarily. On the other hand, use of lnskew0 with a variable such as temperature measured in Fahrenheit or Celsius would be more appropriate, as the zero is indeed arbitrary.
For a variable like mpg, it makes more sense to use the Box-Cox power transform (Box and Cox 1964):

    y^(λ) = (y^λ − 1)/λ

λ is free to take on any value, but note that y^(1) = y − 1, y^(0) = ln(y), and y^(−1) = 1 − 1/y.

bcskew0 works like lnskew0:
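The same zero-skewness search applies to the power family. A hedged Python sketch follows (illustrative only; the function names are invented and this is not StataCorp's implementation): for positive data, shrinking λ compresses the right tail, so the skewness of the transform increases with λ and a bisection finds the zero crossing.

```python
import math

def skewness(xs):
    # Sample skewness m3 / m2^(3/2).
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((v - mean) ** 2 for v in xs) / n
    m3 = sum((v - mean) ** 3 for v in xs) / n
    return m3 / m2 ** 1.5

def boxcox(v, lam):
    # y^(lam) = (y^lam - 1)/lam; the lam -> 0 limit is ln(y).
    return math.log(v) if lam == 0 else (v ** lam - 1.0) / lam

def bcskew0(xs, lo=-5.0, hi=5.0, zero=1e-6):
    # Bisect for the lambda that zeroes the skewness of the transform.
    f = lambda lam: skewness([boxcox(v, lam) for v in xs])
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if abs(f(mid)) < zero:
            return mid
        if f(mid) > 0:
            hi = mid        # transform still right-skewed: lower lambda
        else:
            lo = mid
    return (lo + hi) / 2.0
```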
    . bcskew0 bcmpg = mpg, level(95)

                     Transform |      L     [95% Conf. Interval]    Skewness
    ---------------------------+------------------------------------------------
                   (mpg^L-1)/L | -.3673283  -1.212752   .4339645    .0001898
It is worth noting that the 95% confidence interval includes λ = −1 (λ is labeled L in the output), which has a rather more pleasing interpretation (gallons per mile) than (mpg^(−.3673) − 1)/(−.3673). The confidence interval, however, is calculated under the assumption that the power-transformed variable is normally distributed. It makes perfect sense to use bcskew0 even when one does not believe that the transformed variable will be normally distributed, but in that case the confidence interval is an approximation of unknown quality. If one believes that the transformed data are normally

Command logs are always straight ASCII text files, and this makes them easy to convert into do-files. (In this respect, it would make more sense if the default extension of a command log file was .do, because command logs are do-files. The default is .txt, not .do, however, to keep you from accidentally overwriting your important do-files.)

Full logs are recorded in one of two formats: SMCL (Stata Markup and Control Language) or text (meaning ASCII). The default is SMCL, but set logtype can change that, or you can specify an option to state the format you wish. We recommend SMCL because it preserves fonts and colors. SMCL logs can be converted to ASCII text or to other formats using the translate command; see [R] translate. translate can also be used to produce printable versions of SMCL logs, or you can print SMCL logs by pulling down File and choosing Log. SMCL logs can be viewed in the Viewer, as can any file; see [R] view.
log -- Echo copy of session to file or device
log or cmdlog, typed without arguments, reports the status of logging.

log using and cmdlog using open a log file. log close and cmdlog close close the file. Between times, log off and cmdlog off, and log on and cmdlog on, can temporarily suspend and resume logging.

set logtype specifies the default format in which full logs are to be recorded. Initially, full logs are set to be recorded in SMCL format.

set linesize specifies the width of the screen currently being used (and so really has nothing to do with logging). Few Stata commands currently respect linesize, but this will change in the future.
Options

Options for use with both log and cmdlog

append specifies that results are to be appended onto the end of an already existing file. If the file does not already exist, a new file is created.

replace specifies that filename, if it already exists, is to be overwritten, and so is an alternative to append. When you do not specify either replace or append, the file is assumed to be new; if it already exists, an error message is issued and logging is not started.
Options for use with log

text and smcl specify the format in which the log is to be recorded. The default is complicated to describe but is what you would expect:

If you specify the file as filename.smcl, then the default is to write the log in SMCL format (regardless of the value of set logtype).

If you specify the file as filename.log, then the default is to write the log in text format (regardless of the value of set logtype).

If you type filename without an extension and specify neither the smcl nor the text option, the default is to write the file according to the value of set logtype. If you have not reset set logtype, then that default is SMCL. In addition, the filename you specified will be fixed to read filename.smcl if a SMCL log is being created or filename.log if a text log is being created.

If you specify either of the options text or smcl, then what you specify determines how the log is written. If filename was specified without an extension, the appropriate extension is added for you.
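The extension defaulting just described is mechanical enough to sketch. Below is a hedged Python rendering of those rules; resolve_log is a made-up helper name, not a Stata command, and fmt stands in for the smcl/text options while logtype stands in for set logtype.

```python
def resolve_log(filename, logtype="smcl", fmt=None):
    # fmt mirrors the smcl/text options; logtype mirrors set logtype.
    if fmt is None:
        if filename.endswith(".smcl"):
            fmt = "smcl"             # filename.smcl -> SMCL regardless
        elif filename.endswith(".log"):
            fmt = "text"             # filename.log -> text regardless
        else:
            fmt = logtype            # no extension -> follow set logtype
    if not (filename.endswith(".smcl") or filename.endswith(".log")):
        filename += ".smcl" if fmt == "smcl" else ".log"
    return filename, fmt
```

So resolve_log("myfile") yields a .smcl file under the shipped default, while an explicit text format or a .log extension yields a text log.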
Remarks

For a detailed explanation of logs, see [U] 18 Printing and preserving output.
Note that when you open a full log, the default is to show the name of the file and a time and date stamp:

    . log using myfile

           log:  C:\data\proj1\myfile.smcl
      log type:  smcl
     opened on:  12 Dec 2000, 19:28:23
The above information will appear in the log. If you do not want this information to appear, precede the command by quietly:

    . quietly log using myfile

quietly will not suppress any error messages or anything else you need to know.
Similarly, when you close a full log, the default is to show the full information:

    . log close
           log:  C:\data\proj1\myfile.smcl
     closed on:  12 Dec 2000, 12:32:41
and that information will appear in the log. If you want to suppress that, type quietly log close.
Saved Results

log and cmdlog save in r():

Macros
    r(filename)    name of file
    r(status)      on or off
    r(type)        text or smcl
Also See

Complementary:   [R] translate; [R] more, [R] query

Background:      [GSM] 17 Logs: Printing and saving output,
                 [GSU] 17 Logs: Printing and saving output,
                 [GSW] 17 Logs: Printing and saving output,
                 [U] 10 --more-- conditions,
                 [U] 14.6 File-naming conventions,
                 [U] 18 Printing and preserving output
Title

logistic -- Logistic regression
Syntax

    logistic depvar varlist [weight] [if exp] [in range] [, level(#) robust
        cluster(varname) score(newvarname) asis offset(varname) coef
        maximize_options]

    lfit [depvar] [weight] [if exp] [in range] [, group(#) table outsample
        all beta(matname)]

    lstat [depvar] [weight] [if exp] [in range] [, cutoff(#) all
        beta(matname)]

    lroc [depvar] [weight] [if exp] [in range] [, nograph all beta(matname)
        graph_options]

    lsens [depvar] [weight] [if exp] [in range] [, nograph genprob(varname)
        gensens(varname) genspec(varname) replace all beta(matname)
        graph_options]

by ... : may be used with logistic; see [R] by.

logistic allows fweights and pweights; lfit, lstat, lroc, and lsens allow only fweights; see [U] 14.1.6 weight.

logistic shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

logistic may be used with sw to perform stepwise estimation; see [R] sw.
Syntax for predict

    predict [type] newvarname [if exp] [in range] [, statistic rules asif
        nooffset]

where statistic is

      p            probability of a positive outcome (the default)
      xb           x_j b, fitted values
      stdp         standard error of the prediction
    * dbeta        Pregibon (1981) Δβ̂ influence statistic
    * deviance     deviance residual
    * dx2          Hosmer and Lemeshow (1989) ΔX² influence statistic
    * ddeviance    Hosmer and Lemeshow (1989) ΔD influence statistic
    * hat          Pregibon (1981) leverage
    * number       sequential number of the covariate pattern
    * residuals    Pearson residuals; adjusted for number sharing covariate pattern
    * rstandard    standardized Pearson residuals; adjusted for number sharing covariate pattern

Unstarred statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample. Starred statistics are calculated only for the estimation sample even when if e(sample) is not specified.
Description

logistic estimates a logistic regression of depvar on varlist, where depvar is a 0/1 variable (or, more precisely, a 0/non-0 variable). Without arguments, logistic redisplays the last logistic estimates. logistic displays estimates as odds ratios; to view coefficients, type logit after running logistic. To obtain odds ratios for any covariate pattern relative to another, see [R] lincom.

lfit displays either the Pearson goodness-of-fit test or the Hosmer-Lemeshow goodness-of-fit test.

lstat displays various summary statistics, including the classification table.

lroc graphs and calculates the area under the ROC curve.

lsens graphs sensitivity and specificity versus probability cutoff and optionally creates new variables containing these data.

lfit, lstat, lroc, and lsens can produce statistics and graphs either for the estimation sample or for any set of observations. However, they always use the estimation sample by default. When weights, if, or in are used with logistic, it is not necessary to repeat them with these commands when you want statistics computed for the estimation sample. Specify if, in, or the all option only when you want statistics computed for a set of observations other than the estimation sample. Specify weights (only fweights are allowed with these commands) only when you want to use a different set of fweights.

By default, lfit, lstat, lroc, and lsens use the last model estimated by logistic. Alternatively, the model can be specified by inputting a vector of coefficients with the beta() option and passing the name of the dependent variable depvar to the commands.

The lfit, lstat, lroc, and lsens commands may also be used after logit or probit.

Here is a list of other estimation commands that may be of interest. See [R] estimation commands for a complete list. See Gould (2000) for a discussion of the interpretation of logistic regression.
    blogit       [R] glogit          Maximum-likelihood logit regression for grouped data
    bprobit      [R] glogit          Maximum-likelihood probit regression for grouped data
    clogit       [R] clogit          Conditional (fixed-effects) logistic regression
    cloglog      [R] cloglog         Maximum-likelihood complementary log-log estimation
    glm          [R] glm             Generalized linear models
    glogit       [R] glogit          Weighted least-squares logit regression for grouped data
    gprobit      [R] glogit          Weighted least-squares probit regression for grouped data
    heckprob     [R] heckprob        Maximum-likelihood probit estimation with selection
    hetprob      [R] hetprob         Maximum-likelihood heteroskedastic probit estimation
    logit        [R] logit           Maximum-likelihood logit regression
    mlogit       [R] mlogit          Maximum-likelihood multinomial (polytomous) logistic regression
    nlogit       [R] nlogit          Maximum-likelihood nested logit estimation
    ologit       [R] ologit          Maximum-likelihood ordered logit regression
    oprobit      [R] oprobit         Maximum-likelihood ordered probit regression
    probit       [R] probit          Maximum-likelihood probit regression
    scobit       [R] scobit          Maximum-likelihood skewed logit estimation
    svylogit     [R] svy estimators  Survey version of logit
    svymlogit    [R] svy estimators  Survey version of mlogit
    svyologit    [R] svy estimators  Survey version of ologit
    svyoprobit   [R] svy estimators  Survey version of oprobit
    svyprobit    [R] svy estimators  Survey version of probit
    xtclog       [R] xtclog          Random-effects and population-averaged cloglog models
    xtlogit      [R] xtlogit         Fixed-effects, random-effects, and population-averaged logit models
    xtprobit     [R] xtprobit        Random-effects and population-averaged probit models
    xtgee        [R] xtgee           GEE population-averaged generalized linear models
Options

Options for logistic

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.13 Weighted estimation.

cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE) but not the estimated coefficients; see [U] 23.11 Obtaining robust variance estimates. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data, but see the svylogit command in [R] svy estimators for a command designed especially for survey data.

cluster() implies robust; specifying robust cluster() is equivalent to typing cluster() by itself.
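For intuition, here is a hedged Python sketch of the robust/cluster() calculation for a toy logit with an intercept and one covariate. It is illustrative only, not Stata's implementation: the sandwich is A⁻¹BA⁻¹, where A is the negative Hessian and B sums score outer products, summed within cluster first when cluster() is given; the finite-sample scale factors Stata applies are omitted, and the function names are invented.

```python
import math

def fit_logit(xs, ys, iters=50):
    # Newton-Raphson for a logit with intercept a and slope b.
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            g0 += y - p
            g1 += (y - p) * x
            w = p * (1 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (-h01 * g0 + h00 * g1) / det
    return a, b, (h00, h01, h11)     # Hessian pieces at (near-)convergence

def cluster_robust_vcov(xs, ys, cl):
    a, b, (h00, h01, h11) = fit_logit(xs, ys)
    # Sum scores u_j = (y_j - p_j)(1, x_j) within each cluster first.
    sums = {}
    for x, y, c in zip(xs, ys, cl):
        p = 1.0 / (1.0 + math.exp(-(a + b * x)))
        s0, s1 = sums.get(c, (0.0, 0.0))
        sums[c] = (s0 + (y - p), s1 + (y - p) * x)
    b00 = sum(s0 * s0 for s0, s1 in sums.values())
    b01 = sum(s0 * s1 for s0, s1 in sums.values())
    b11 = sum(s1 * s1 for s0, s1 in sums.values())
    det = h00 * h11 - h01 * h01
    i00, i01, i11 = h11 / det, -h01 / det, h00 / det   # A^-1 (symmetric)
    # V = A^-1 B A^-1 (scale factors omitted)
    m00 = i00 * b00 + i01 * b01
    m01 = i00 * b01 + i01 * b11
    m10 = i01 * b00 + i11 * b01
    m11 = i01 * b01 + i11 * b11
    v00 = m00 * i00 + m01 * i01
    v11 = m10 * i01 + m11 * i11
    return a, b, v00, v11            # coefficients and robust variances
```

With one observation per "cluster" this reduces to the ordinary robust estimator; grouping correlated observations into shared clusters is what changes the standard errors, exactly as in the hospital example below.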
score(newvarname) creates newvar containing u_j = ∂lnL_j/∂(x_j b) for each observation j in the sample. The score vector is Σ ∂lnL_j/∂b = Σ u_j x_j; i.e., the product of newvar with each covariate summed over observations. See [U] 23.12 Obtaining scores.

asis forces retention of perfect predictor variables and their associated perfectly predicted observations and may produce instabilities in maximization; see [R] probit.

offset(varname) specifies that varname is to be included in the model with coefficient constrained to be 1.

coef causes logistic to report the estimated coefficients rather than the odds ratios (exponentiated coefficients). coef may be specified when the model is estimated or used later to redisplay results. coef affects only how results are displayed and not how they are estimated.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them.

Options for lfit, lstat, lroc, and lsens

group(#) (lfit only) specifies the number of quantiles to be used to group the data for the Hosmer-Lemeshow goodness-of-fit test. group(10) is typically specified. If this option is not given, the Pearson goodness-of-fit test is computed using the covariate patterns in the data as groups.

table (lfit only) displays a table of the groups used for the Hosmer-Lemeshow or Pearson goodness-of-fit test with predicted probabilities, observed and expected counts for both outcomes, and totals for each group.

outsample (lfit only) adjusts the degrees of freedom for the Pearson and Hosmer-Lemeshow goodness-of-fit tests for samples outside of the estimation sample. See the section Samples other than the estimation sample later in this entry.

all requests that the statistic be computed for all observations in the data, ignoring any if or in restrictions specified with logistic.

beta(matname) specifies a row vector containing coefficients for a logistic model. The columns of the row vector must be labeled with the corresponding names of the independent variables in the data. The dependent variable depvar must be specified immediately after the command name. See the section Models other than the last estimated model later in this entry.

cutoff(#) (lstat only) specifies the value for determining whether an observation has a predicted positive outcome. An observation is classified as positive if its predicted probability is >= #. The default is 0.5.

nograph (lroc and lsens) suppresses graphical output.

graph_options (lroc and lsens) are any of the options allowed with graph, twoway; see [G] graph options.

genprob(varname), gensens(varname), and genspec(varname) (lsens only) specify the names of new variables created to contain, respectively, the probability cutoffs and the corresponding sensitivity and specificity.

replace (lsens only) requests that if existing variables are specified for genprob(), gensens(), or genspec(), they should be overwritten.
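For the logit likelihood, the per-observation quantity stored by score() has a simple closed form: u_j = y_j − p_j, the response minus the fitted probability. A hedged Python sketch of that identity (illustrative, not Stata; logit_scores is a made-up name):

```python
import math

def logit_scores(xb, ys):
    # u_j = dlnL_j / d(x_j b) = y_j - p_j, with p_j = invlogit(x_j b)
    return [y - 1.0 / (1.0 + math.exp(-v)) for v, y in zip(xb, ys)]
```

At the maximum likelihood estimates, these scores sum to zero against every covariate, which is why they are the building block of the robust variance calculation above.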
Options for predict

p, the default, calculates the probability of a positive outcome.

xb calculates the linear prediction.

stdp calculates the standard error of the linear prediction.

dbeta calculates the Pregibon (1981) Δβ̂ influence statistic, a standardized measure of the difference in the coefficient vector due to deletion of the observation along with all others that share the same covariate pattern. In Hosmer and Lemeshow (1989) jargon, this statistic is M-asymptotic; that is, adjusted for the number of observations that share the same covariate pattern.

deviance calculates the deviance residual.

dx2 calculates the Hosmer and Lemeshow (1989) ΔX² influence statistic, reflecting the decrease in the Pearson X² due to deletion of the observation and all others that share the same covariate pattern.

ddeviance calculates the Hosmer and Lemeshow (1989) ΔD influence statistic, which is the change in the deviance residual due to deletion of the observation and all others that share the same covariate pattern.

hat calculates the Pregibon (1981) leverage, or the diagonal elements of the hat matrix, adjusted for the number of observations that share the same covariate pattern.

number numbers the covariate patterns; observations with the same covariate pattern have the same number. Observations not used in estimation have number set to missing. The "first" covariate pattern is numbered 1, the second 2, and so on.

residuals calculates the Pearson residual as given by Hosmer and Lemeshow (1989) and adjusted for the number of observations that share the same covariate pattern.

rstandard calculates the standardized Pearson residual as given by Hosmer and Lemeshow (1989) and adjusted for the number of observations that share the same covariate pattern.

rules requests that Stata use any "rules" that were used to identify the model when making the prediction. By default, Stata calculates missing for excluded observations. See [R] logit for an example.

asif requests that Stata ignore the rules and the exclusion criteria and calculate predictions for all observations possible using the estimated parameters from the model. See [R] logit for an example.

nooffset is relevant only if you specified offset(varname) for logistic. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as x_j b rather than x_j b + offset_j.
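The covariate-pattern bookkeeping behind the number statistic can be sketched in a few lines of Python (illustrative only; the helper name is invented): rows with identical covariate values share a pattern number, assigned in order of first appearance.

```python
def covariate_pattern_numbers(rows):
    # rows: list of covariate-value lists; returns a pattern id per row
    seen = {}
    out = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen[key] = len(seen) + 1   # "first" pattern is numbered 1
        out.append(seen[key])
    return out
```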
Remarks

Remarks are presented under the headings

    logistic and logit
    Robust estimate of variance
    lfit
    lstat
    lroc
    lsens
    Samples other than the estimation sample
    Models other than the last estimated model
    predict after logistic
logistic and logit

logistic provides an alternative and preferred way to estimate maximum-likelihood logit models, the other choice being logit described in [R] logit.

First, let us dispose of some confusing terminology. We use the words logit and logistic to mean the same thing: maximum likelihood estimation. To some, one or the other of these words connotes transforming the dependent variable and using weighted least squares to estimate the model, but that is not how we use either word here. Thus, the logit and logistic commands produce the same results.

The logistic command is generally preferred to logit because logistic presents the estimates in terms of odds ratios rather than coefficients. To a few, this may seem a disadvantage, but you can type logit without arguments after logistic to see the underlying coefficients.

Nevertheless, [R] logit is still worth reading because logistic shares the same features as logit, including omitting variables due to collinearity or one-way causation.

For an introduction to logistic regression, see Lemeshow and Hosmer (1998) or Pagano and Gauvreau (2000, 470-487); for a thorough discussion, see Hosmer and Lemeshow (1989; second edition forthcoming in 2001).
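The relationship between the two displays is purely presentational: a coefficient b with standard error se becomes the odds ratio exp(b), with the confidence interval obtained by exponentiating the coefficient's interval; the displayed standard error of an odds ratio is the delta-method value se × exp(b). A hedged Python sketch with made-up numbers (this is an illustration, not output from any real estimation):

```python
import math

def odds_ratio_row(b, se, z=1.959964):
    # (odds ratio, displayed std. err., lower 95% CI, upper 95% CI)
    return (math.exp(b), se * math.exp(b),
            math.exp(b - z * se), math.exp(b + z * se))

# hypothetical coefficient and standard error for illustration
or_, or_se, lo, hi = odds_ratio_row(0.9233, 0.3867)
```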
Example

Consider the following dataset from a study of risk factors associated with low birth weight described in Hosmer and Lemeshow (1989, appendix 1).

    . describe

    Contains data from lbw.dta
      obs:           189                          Hosmer & Lemeshow data
     vars:            11                          18 Jul 2000 16:27
     size:         3,402 (95.1% of memory free)

    variable name   storage   display    value
                    type      format     label      variable label
    ------------------------------------------------------------------------
    id              int       %8.0g                 identification code
    low             byte      %8.0g                 birth weight
    age             byte      %8.0g
    lwt             int       %8.0g
    race            byte      %8.0g      race       race
                 (remaining output omitted)
    . xi: logistic low age lwt i.race smoke ptl ht ui, robust
    i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)

    Logit estimates                               Number of obs   =        189
                                                  Wald chi2(8)    =      33.22
                                                  Prob > chi2     =     0.0003
                                                  Pseudo R2       =     0.1416

    ------------------------------------------------------------------------------
                 |               Robust
             low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .9732636   .0329376    -0.80   0.423     .9108015    1.040009
             lwt |   .9849634   .0070209    -2.13   0.034     .9712984    .9988206
        _Irace_2 |   3.534767   1.793616     2.49   0.013     1.307504    9.556051
        _Irace_3 |   2.368079   1.026563     1.99   0.047     1.012512    5.538501
           smoke |   2.517698   .9736416     2.39   0.017     1.179852    5.372537
             ptl |   1.719161   .7072902     1.32   0.188     .7675715    3.850476
              ht |   6.249602   4.102026     2.79   0.005     1.726445     22.6231
              ui |     2.1351   1.042775     1.55   0.120     .8197749    5.560858
    ------------------------------------------------------------------------------
Additionally, robust allows you to specify cluster() and is then able, within cluster, to relax the assumption of independence. To illustrate this, we have made some fictional additions to the low-birth-weight data.

Pretend that these data are not a random sample of mothers but instead are a random sample of mothers from a random sample of hospitals. In fact, that may be true; we do not know the history of these data, but we can pretend in any case.

Hospitals specialize, and it would not be too incorrect to say that some hospitals specialize in more difficult cases. We are going to show two extremes. In one, all hospitals are alike, but we are going to estimate under the possibility that they might differ. In the other, hospitals are strikingly different. In both cases, we assume patients are drawn from 20 hospitals.

In both examples, we will estimate the same model, and we will type the same command to estimate it. Below are the same data we have been using but with a new variable, hospid, which identifies from which of the 20 hospitals each patient was drawn (and which we have made up):
    . xi: logistic low age lwt i.race smoke ptl ht ui, robust cluster(hospid)
    i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)

    Logit estimates                               Number of obs   =        189
                                                  Wald chi2(8)    =      49.67
                                                  Prob > chi2     =     0.0000
    Log likelihood = -100.724                     Pseudo R2       =     0.1416

                        (standard errors adjusted for clustering on hospid)
    ------------------------------------------------------------------------------
                 |               Robust
             low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .9732636   .0397476    -0.66   0.507      .898396     1.05437
             lwt |   .9849634   .0057101    -2.61   0.009     .9738352    .9962187
        _Irace_2 |   3.534767   2.013285     2.22   0.027     1.157563    10.79386
        _Irace_3 |   2.368079   .8451325     2.42   0.016     1.176562    4.766257
           smoke |   2.517698   .8284259     2.81   0.005     1.321062     4.79826
             ptl |   1.719161   .6676221     1.40   0.163     .8030814    3.680219
              ht |   6.249602   4.066275     2.82   0.005     1.745911    22.37086
              ui |     2.1351   1.093144     1.48   0.138     .7827337    5.824014
    ------------------------------------------------------------------------------
The standard errors are quite similar to the standard errors we have previously obtained, whether we used the robust or the conventional estimators. In this example, we invented the hospital ids randomly.

Here are the results of the estimation with the same data but with a different set of hospital ids:

    . xi: logistic low age lwt i.race smoke ptl ht ui, robust cluster(hospid)
    i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)

    Logit estimates                               Number of obs   =        189
                                                  Wald chi2(8)    =       7.19
                                                  Prob > chi2     =     0.5167
    Log likelihood = -100.724                     Pseudo R2       =     0.1416

                        (standard errors adjusted for clustering on hospid)
    ------------------------------------------------------------------------------
                 |               Robust
             low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .9732636   .0293064    -0.90   0.368     .9174862    1.032432
             lwt |   .9849634   .0106123    -1.41   0.160     .9643817    1.005984
        _Irace_2 |   3.534767   3.120338     1.43   0.153     .6265521     19.9418
        _Irace_3 |   2.368079   1.297738     1.57   0.116     .8089594    6.932114
           smoke |   2.517698   1.570287     1.48   0.139     .7414969    8.548654
             ptl |   1.719161   .6799153     1.37   0.171     .7919046    3.732161
              ht |   6.249602   7.165454     1.60   0.110      .660558    59.12808
              ui |     2.1351   1.411977     1.15   0.251     .5841231    7.804266
    ------------------------------------------------------------------------------
Note the strikingly larger standard errors. What happened? In these data, women most likely to have low-birth-weight babies are sent to certain hospitals, and the decision on likeliness is based not just on age, smoking history, etc., but on other things that doctors can see but that are not recorded in our data. Thus, merely because a woman is at one of the centers identifies her as more likely to have a low-birth-weight baby.

So much for our fictional example. The rest of this section uses the real low-birth-weight data. To remind you, the last model we left off with was
    . xi: logistic low age lwt i.race smoke ptl ht ui
    i.race            _Irace_1-3          (naturally coded; _Irace_1 omitted)

    Logit estimates                               Number of obs   =        189
                                                  LR chi2(8)      =      33.22
                                                  Prob > chi2     =     0.0001
    Log likelihood = -100.724                     Pseudo R2       =     0.1416

    ------------------------------------------------------------------------------
             low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .9732636   .0354759    -0.74   0.457     .9061578    1.045339
             lwt |   .9849634   .0068217    -2.19   0.029     .9716834    .9984249
        _Irace_2 |   3.534767   1.860737     2.40   0.016     1.259736    9.918406
        _Irace_3 |   2.368079   1.039949     1.96   0.050     1.001356    5.600207
           smoke |   2.517698    1.00916     2.30   0.021     1.147676    5.523162
             ptl |   1.719161   .5952579     1.56   0.118     .8721455    3.388787
              ht |   6.249602   4.322408     2.65   0.008     1.611152    24.24199
              ui |     2.1351   .9808153     1.65   0.099     .8677528      5.2534
    ------------------------------------------------------------------------------
lfit

lfit computes goodness-of-fit tests: either the Pearson X² test or the Hosmer-Lemeshow test.

By default, lfit, lstat, lroc, and lsens compute statistics for the estimation sample using the last model estimated by logistic. However, samples other than the estimation sample can be specified; see the section Samples other than the estimation sample later in this entry. Models other than the last model estimated by logistic can also be specified; see the section Models other than the last estimated model later in this entry.
Example

lfit, typed without options, presents the Pearson X² goodness-of-fit test for the estimated model. The Pearson X² goodness-of-fit test is a test of the observed against expected number of responses using cells defined by the covariate patterns; see predict with the number option below for the definition of covariate patterns.

    . lfit
    Logistic model for low, goodness-of-fit test

           number of observations =       189
     number of covariate patterns =       182
            Pearson chi2(173)     =    179.24
                      Prob > chi2 =    0.3567

Our model fits reasonably well. We should note, however, that the number of covariate patterns is close to the number of observations, making the applicability of the Pearson X² test questionable, but not necessarily inappropriate. Hosmer and Lemeshow (1989) suggest regrouping the data by ordering on the predicted probabilities and then forming, say, 10 nearly equal-size groups. lfit with the group() option does this:

    . lfit, group(10)
    Logistic model for low, goodness-of-fit test
    (Table collapsed on quantiles of estimated probabilities)

           number of observations =       189
                 number of groups =        10
          Hosmer-Lemeshow chi2(8) =      9.65
                      Prob > chi2 =    0.2904

Again, we cannot reject our model. If you specify the table option, lfit displays the groups along with the expected and observed number of positive responses (low-birth-weight babies):

    . lfit, group(10) table
    Logistic model for low, goodness-of-fit test
    (Table collapsed on quantiles of estimated probabilities)

      _Group    _Prob   _Obs_1   _Exp_1   _Obs_0   _Exp_0   _Total
           1   0.0827        0      1.2       19     17.8       19
           2   0.1276        2      2.0       17     17.0       19
           3   0.2015        6      3.2       13     15.8       19
           4   0.2432        1      4.3       18     14.7       19
           5   0.2792        7      4.9       12     14.1       19
           6   0.3138        7      5.6       12     13.4       19
           7   0.3872        6      6.5       13     12.5       19
           8   0.4828        7      8.2       12     10.8       19
           9   0.5941       10     10.3        9      8.7       19
          10   0.8391       13     12.8        5      5.2       18

           number of observations =       189
                 number of groups =        10
          Hosmer-Lemeshow chi2(8) =      9.65
                      Prob > chi2 =    0.2904
Technical Note

lfit with the group() option puts all observations with the same predicted probabilities into the same group. If, as in the previous example, we request 10 groups, the groups that lfit makes are [p0, p10], (p10, p20], (p20, p30], ..., (p90, p100], where pk is the kth percentile of the predicted probabilities, with p0 the minimum and p100 the maximum.

If there are large numbers of ties at the quantile boundaries, as will frequently happen if all independent variables are categorical and there are only a few of them, the sizes of the groups will be uneven. If the totals in some of the groups are small, the X² statistic for the Hosmer-Lemeshow test may be unreliable. In this case, either fewer groups should be specified, or the Pearson goodness-of-fit test may be a better choice.
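In outline, the grouping and statistic can be sketched as follows (a hedged Python illustration, not Stata's exact tie-handling): sort on the predicted probabilities, cut at quantile boundaries while keeping ties together, and accumulate (observed − expected)²/expected over both outcomes in each group.

```python
def hosmer_lemeshow(probs, ys, groups=10):
    # probs: predicted probabilities; ys: 0/1 outcomes
    pairs = sorted(zip(probs, ys))
    n = len(pairs)
    chi2, i = 0.0, 0
    for g in range(1, groups + 1):
        j = round(n * g / groups)
        # push the boundary past ties so equal probabilities share a group
        while j < n and pairs[j][0] == pairs[j - 1][0]:
            j += 1
        if j <= i:
            continue                       # boundary swallowed by ties
        block = pairs[i:j]
        obs1 = sum(y for _, y in block)
        exp1 = sum(p for p, _ in block)
        obs0, exp0 = len(block) - obs1, len(block) - exp1
        chi2 += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
        i = j
    return chi2
```

When ties swallow a quantile boundary, neighboring groups merge, which is exactly how the uneven group sizes described above arise.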
Example

The table option can be used without the group() option. We would not want to specify this for our current model because there were 182 covariate patterns in the data, caused by the inclusion of the two continuous variables age and lwt in the model. As an aside, we estimate a simpler model and specify table with lfit:

    . logistic low _Irace_2 _Irace_3 smoke ui

    Logit estimates                               Number of obs   =        189
                                                  LR chi2(4)      =      18.80
                                                  Prob > chi2     =     0.0009
    Log likelihood = -107.93404                   Pseudo R2       =     0.0801

    ------------------------------------------------------------------------------
             low | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
        _Irace_2 |   3.052746   1.498084     2.27   0.023     1.166749    7.987368
        _Irace_3 |   2.922593   1.189226     2.64   0.008      1.31646    6.488269
           smoke |   2.945742   1.101835     2.89   0.004      1.41517    6.131701
              ui |   2.419131   1.047358     2.04   0.041      1.03546    5.651783
    ------------------------------------------------------------------------------
    . lfit, table
    Logistic model for low, goodness-of-fit test
    (Table collapsed on quantiles of estimated probabilities)

      _Group    _Prob   _Obs_1   _Exp_1   _Obs_0   _Exp_0   _Total
           1   0.1230        3      4.9       37     35.1       40
           2   0.2533        1      1.0        3      3.0        4
           3   0.2907       16     13.7       31     33.3       47
           4   0.2923       15     12.6       28     30.4       43
           5   0.2997        3      3.9       10      9.1       13
           6   0.4978        4      4.0        4      4.0        8
           7   0.4998        4      4.5        5      4.5        9
           8   0.5087        2      1.5        1      1.5        3
           9   0.5469        2      4.4        6      3.6        8
          10   0.5577        6      5.6        4      4.4       10
          11   0.7449        3      3.0        1      1.0        4

      _Group    _Prob   _Irace_2   _Irace_3   smoke   ui
           1   0.1230          0          0       0    0
           2   0.2533          0          0       0    1
           3   0.2907          0          1       0    0
           4   0.2923          0          0       1    0
           5   0.2997          1          0       0    0
           6   0.4978          0          1       0    1
           7   0.4998          0          0       1    1
           8   0.5087          1          0       0    1
           9   0.5469          0          1       1    0
          10   0.5577          1          0       1    0
          11   0.7449          0          1       1    1

           number of observations =       189
     number of covariate patterns =        11
                  Pearson chi2(6) =      5.71
                      Prob > chi2 =    0.4569
chi2 =
6_67 0_5721
Note that we did not specify an if statement with lfit since we wanted to use the estimation sample. Since the test is nonsignificant, we are satisfied with the fit of our model.

Running lroc gives a measure of the discrimination:

. lroc, nograph

Logistic model for low

number of observations =       94
area under ROC curve   =   0.8158

Now we test the calibration of our model by performing a goodness-of-fit test on the validation sample. We specify the outsample option so that the degrees of freedom are 10 rather than 8.

. lfit if group==2, group(10) table outsample

Logistic model for low, goodness-of-fit test

(Table collapsed on quantiles of estimated probabilities)

  _Group |   _Prob   _Obs_1   _Exp_1   _Obs_0   _Exp_0   _Total
---------+------------------------------------------------------
       1 |  0.0725        1      0.4        9      9.6       10
       2 |  0.1202        4      0.8        5      8.2        9
       3 |  0.1549        3      1.3        7      8.7       10
       4 |  0.1888        1      1.5        8      7.5        9
       5 |  0.2609        3      2.2        7      7.8       10
       6 |  0.3258        4      2.7        5      6.3        9
       7 |  0.4217        2      3.7        8      6.3       10
       8 |  0.4915        3      4.1        6      4.9        9
       9 |  0.6265        4      5.5        6      4.5       10
      10 |  0.9737        4      7.1        5      1.9        9

  number of observations =        95
        number of groups =        10
Hosmer-Lemeshow chi2(10) =     28.03
             Prob > chi2 =    0.0018

We must acknowledge that our model does not fit well on the validation sample. The model's discrimination in the validation sample is appreciably lower as well.

. lroc if group==2, nograph

Logistic model for low

number of observations =       95
area under ROC curve   =   0.5839
Models other than the last estimated model

By default, lfit, lstat, lroc, and lsens use the last model estimated by logistic. One can specify other models using the beta() option.

> Example

Suppose that someone publishes the following logistic model of low birth weight:

    Pr(low = 1) = F(-0.02 age - 0.01 lwt + 1.3 black + 1.1 smoke + 0.5 ptl + 1.8 ht + 0.8 ui + 0.5)

where F is the cumulative logistic distribution. Note that these coefficients are not odds ratios; they are the equivalent of what logit produces.

We can see whether this model fits our data. First, we enter the coefficients as a row vector and label its columns with the names of the independent variables plus _cons for the constant (see [P] matrix define and [P] matrix rownames).

. matrix input b = (-.02 -.01 1.3 1.1 .5 1.8 .8 .5)
. matrix colnames b = age lwt black smoke ptl ht ui _cons

We run lfit using the beta() option to specify b. The dependent variable is entered right after the command name, and the outsample option gives the proper degrees of freedom.

. lfit low, beta(b) group(10) outsample

Logistic model for low, goodness-of-fit test

(Table collapsed on quantiles of estimated probabilities)

  number of observations =       189
        number of groups =        10
Hosmer-Lemeshow chi2(10) =     27.33
             Prob > chi2 =    0.0023

Although the fit of the model is poor, lroc shows that it does exhibit some predictive ability.

. lroc low, beta(b) nograph

Logistic model for low

number of observations =      189
area under ROC curve   =   0.7275
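The probabilities implied by the published coefficients can also be computed directly from b without re-estimating anything. A sketch, assuming b has been defined as above; the variable black is our construction from race (the coding is an assumption on our part):

```stata
* Sketch: predicted probabilities from the published vector b, by hand
gen black = (race==2)              // assumption: race code 2 means black
matrix score double xb = b         // xb = x_j b using b's column names
gen double p_pub = exp(xb)/(1 + exp(xb))
```

These are the same probabilities that lfit and lroc use internally when beta(b) is specified.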
predict after logistic

predict is used after logistic to obtain predicted probabilities, residuals, and influence statistics for the estimation sample. The suggested diagnostic graphs below are from Hosmer and Lemeshow (1989), where they are more elaborately explained. Also see Collett (1991, 120-160) for a thorough discussion of model checking.

predict without options

Typing predict p after estimation calculates the predicted probability of a positive outcome. We previously ran the model logistic low age lwt _Irace_2 _Irace_3 smoke ptl ht ui. We obtain the predicted probabilities of a positive outcome by typing

. predict p
(option p assumed; Pr(low))
. summarize p low

    Variable |     Obs        Mean    Std. Dev.       Min        Max
           p |     189    .3121693    .1918915   .0272559   .8391283
         low |     189    .3121693    .4646093          0          1

predict with the xb and stdp options

predict with the xb option calculates the linear combination xjb, where xj are the independent variables in the jth observation and b is the estimated parameter vector. This is sometimes known as the index function since the cumulative distribution function indexed at this value is the probability of a positive outcome.

With the stdp option, predict calculates the standard error of the prediction, which is not adjusted for replicated covariate patterns in the data. The influence statistics described below are adjusted for replicated covariate patterns in the data.
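The link between the index and the probability can be checked directly. A small sketch, continuing with the model above:

```stata
* Sketch: the probability is the logistic cdf evaluated at the index
predict xb, xb
gen double p_hand = exp(xb)/(1 + exp(xb))   // same values as predict's p
```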
predict with the residuals option

predict can calculate more than predicted probabilities. The Pearson residual is defined as the square root of the contribution of the covariate pattern to the Pearson X2 goodness-of-fit statistic, signed according to whether the observed number of positive responses within the covariate pattern is less than or greater than expected. For instance,

. predict r, residuals
. summarize r, detail

                        Pearson residual
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -1.750923      -2.283885
 5%    -1.129907      -1.750923
10%    -.9581174      -1.636279       Obs                 189
25%    -.6545911      -1.636279       Sum of Wgt.         189

50%    -.3806923                      Mean          -.0242299
                       Largest        Std. Dev.      .9970949
75%     .8162894        2.23879
90%     1.510355       2.317558       Variance       .9941981
95%     1.747948       3.002206       Skewness       .8618271
99%     3.002206       3.126763       Kurtosis       3.038448

We notice the prevalence of a few large positive residuals:
. sort r
. list id r low p age race in -5/l

            id          r   low          p   age    race
185.        33   2.224501     1    .166329    19   white
186.        57    2.23879     1   .1569594    15   white
187.        16   2.317558     1   .0998678    27   other
188.        77   3.002206     1   .0927932    26   white
189.        36   3.126763     1   .1681123    24   white
predict with the number option

Covariate patterns play an important role in logistic regression. Two observations are said to share the same covariate pattern if the independent variables for the two observations are identical. Although one thinks of having individual observations, the statistical information in the sample can be summarized by the covariate patterns, the number of observations with that covariate pattern, and the number of positive outcomes within the pattern. Depending on the model, the number of covariate patterns can approach or be equal to the number of observations, or it can be considerably less.

All the residual and diagnostic statistics calculated by Stata are in terms of covariate patterns, not observations. That is, all observations with the same covariate pattern are given the same residual and diagnostic statistics. Hosmer and Lemeshow (1989) argue that such "M-asymptotic" statistics are more useful than "N-asymptotic" statistics.

To understand the difference, think of an observed positive outcome with predicted probability of 0.8. Taking the observation in isolation, the "residual" must be positive: we expected 0.8 positive responses and observed 1. This may indeed be the "correct" residual, but not necessarily. Under the M-asymptotic definition, we ask how many successes we observed across all observations with this covariate pattern. If that number were, say, 6, and there were a total of 10 observations with this covariate pattern, then the residual is negative for the covariate pattern: we expected 8 positive outcomes but observed 6. predict makes this kind of calculation and then attaches the same residual to all observations in the covariate pattern.

Thus, there may be occasions when you want to find all observations sharing a covariate pattern. number allows you to do this:

. predict pattern, number
. summarize pattern

    Variable |     Obs        Mean    Std. Dev.       Min        Max
     pattern |     189     89.2328    53.16573         1        182

We previously estimated the model logistic low age lwt _Irace_2 _Irace_3 smoke ptl ht ui over 189 observations. There are 182 covariate patterns in our data.
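As a quick arithmetic check of the M-asymptotic residual just described (6 successes observed where 8 were expected among 10 observations), the Pearson residual for that hypothetical covariate pattern can be computed by hand:

```stata
* Pearson residual for a pattern with m=10, p=0.8, and 6 observed successes
* (hypothetical numbers taken from the text above)
display (6 - 10*0.8)/sqrt(10*0.8*0.2)
```

The result is about -1.58: negative, as the M-asymptotic argument claims, even though each positive outcome viewed in isolation would have a positive residual.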
predict with the deviance option

The deviance residual is defined as the square root of the contribution to the likelihood-ratio test statistic of a saturated model versus the fitted model. It has slightly different properties from the Pearson residual (see Hosmer and Lemeshow 1989):

. predict d, deviance
. summarize d, detail

                       deviance residual
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -1.843472      -1.911621
 5%     -1.33477      -1.843472
10%    -1.148316      -1.843472       Obs                 189
25%    -.8445325      -1.674869       Sum of Wgt.         189

50%    -.5202702                      Mean          -.1228811
                       Largest        Std. Dev.      1.049237
75%     .9129041       1.894089
90%     1.541558       1.924457       Variance       1.100898
95%     1.673338       2.146583       Skewness       .6598857
99%     2.146583       2.180542       Kurtosis       2.036938
predict with the rstandard option

Pearson residuals do not have a standard deviation equal to 1. As a fine point, rstandard generates Pearson residuals normalized to have an expected standard deviation equal to 1.

. predict rs, rstandard
. summarize r rs

    Variable |     Obs        Mean    Std. Dev.        Min        Max
           r |     189   -.0242299    .9970949  -2.283885   3.126763
          rs |     189   -.0279135    1.026406    -2.4478   3.149081

. correlate r rs
(obs=189)

             |        r       rs
           r |   1.0000
          rs |   0.9998   1.0000

Remember that we previously created r containing the (unstandardized) Pearson residuals. In these data, whether you use standardized or unstandardized residuals does not much matter.
predict with the hat option

hat calculates the leverage of a covariate pattern, a scaled measure of distance in terms of the independent variables. Large values indicate covariate patterns "far" from the average covariate pattern, patterns that can have a large effect on the estimated model even if the corresponding residual is small. This suggests the following:

. predict h, hat
. graph h r, border yline(0) ylab xlab

[Graph omitted: scatterplot of leverage (hat) against the Pearson residual, with a vertical line at 0.]

The points to the left of the vertical line are observed negative outcomes; in this case, our data contain almost as many covariate patterns as observations, so most covariate patterns are unique. In such unique patterns, we observe either 0 or 1 success and expect p, thus forcing the sign of the residual. If we had fewer covariate patterns, which is to say, if we did not have continuous variables in our model, there would be no such interpretation, and we would not have drawn the vertical line at 0.

Points on the left and right edges of the graph represent large residuals: covariate patterns that are not fitted well by our model. Points at the top of the graph represent high-leverage patterns. When analyzing the influence of observations on the model, we are most interested in patterns with high leverage and small residuals, patterns that might otherwise escape our attention.

predict with the dx2 option

There are many ways to measure influence, of which hat is one example. dx2 measures the decrease in the Pearson X2 goodness-of-fit statistic that would be caused by deleting an observation (and all others sharing the covariate pattern):
. predict dx2, dx2
. graph dx2 p, border ylab xlab

[Graph omitted: scatterplot of dx2 against the predicted probability of a positive outcome.]

Paraphrasing Hosmer and Lemeshow (1989), the points going from the top left to the bottom right correspond to covariate patterns with the number of positive outcomes equal to the number in the group; the points on the other curve correspond to 0 positive outcomes. In our data, most of the covariate patterns are unique, so the points tend to lie along one or the other curve; the points that are off the curves correspond to the few repeated covariate patterns in our data in which all the outcomes are not the same.

We examine this graph for large values of dx2; there are two at the top left.
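dx2 corresponds to the squared standardized Pearson residual in Hosmer and Lemeshow's M-asymptotic formulation. A sketch of the relationship, assuming r and h from the earlier predict calls are still in memory:

```stata
* Sketch: dx2 recomputed as the squared standardized Pearson residual
gen double dx2_hand = r^2/(1 - h)
summarize dx2 dx2_hand
```

The two should agree up to rounding.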
predict with the ddeviance option

Another measure of influence is the change in the deviance residuals due to deletion of a covariate pattern:

. predict dd, ddeviance

As with dx2, one typically graphs ddeviance against the probability of a positive outcome. We direct you to Hosmer and Lemeshow (1989) for an example and the interpretation of this graph.

predict with the dbeta option

One of the more direct measures of influence of interest to model fitters is the Pregibon (1981) dbeta measure, a measure of the change in the coefficient vector that would be caused by deleting an observation (and all others sharing the covariate pattern):
. predict db, dbeta
. graph db p, border ylab xlab

[Graph omitted: scatterplot of dbeta against Pr(low).]
One observation has a large effect on the estimated coefficients. We can easily find this point:

. sort db
. list in l

Observation 189

     id          188      low            0      age           25
     lwt          95      race       white      smoke          1
     ptl           3      ht             0      ui             1
     ftv           0      bwt         3637      _Irace_2       0
     _Irace_3      0      pattern      117      p       .8391283
     d     -1.911621      r      -2.283885      rs       -2.4478
     dx2    5.991726      dd      4.197658      h       .1294439
     db     .8909163
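The influence statistics in this listing are mutually consistent and can be cross-checked by hand; the arithmetic below plugs the listed values of r, d, and h into the M-asymptotic relationships (our reading of the formulas):

```stata
* Cross-checking observation 189's influence statistics from r, d, and h
display 2.283885^2/(1 - .1294439)                // dx2, about 5.9917
display 1.911621^2/(1 - .1294439)                // dd,  about 4.1977
display 2.283885^2*.1294439/(1 - .1294439)^2     // db,  about .8909
```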
Hosmer and Lemeshow (1989) suggest a graph that combines two of the influence measures:

. graph dx2 p [w=db], border ylab xlab t1("Symbol size proportional to dBeta")

[Graph omitted: dx2 against Pr(low), with symbol size proportional to dbeta.]

We can easily spot the most influential points by the dbeta and dx2 measures.
Saved Results

logistic saves in e():

Scalars
    e(N)           number of observations
    e(ll)          log likelihood
    e(ll_0)        log likelihood, constant-only model
    e(df_m)        model degrees of freedom
    e(r2_p)        pseudo R-squared
    e(N_clust)     number of clusters
    e(chi2)        chi-squared

Macros
    e(cmd)         logistic
    e(depvar)      name of dependent variable
    e(wtype)       weight type
    e(wexp)        weight expression
    e(clustvar)    name of cluster variable
    e(vcetype)     covariance estimation method
    e(chi2type)    Wald or LR; type of model chi-squared test
    e(offset)      offset
    e(predict)     program used to implement predict

Matrices
    e(b)           coefficient vector
    e(V)           variance-covariance matrix of the estimators

Functions
    e(sample)      marks estimation sample

lfit saves in r():

Scalars
    r(N)           number of observations
    r(df)          degrees of freedom
    r(m)           number of covariate patterns or groups
    r(chi2)        chi-squared

lstat saves in r():

Scalars
    r(P_corr)      percent correctly classified
    r(P_p1)        sensitivity
    r(P_n0)        specificity
    r(P_p0)        false positive rate given true negative
    r(P_n1)        false negative rate given true positive
    r(P_1p)        positive predictive value
    r(P_0n)        negative predictive value
    r(P_0p)        false positive rate given classified positive
    r(P_1n)        false negative rate given classified negative

lroc saves in r():

Scalars
    r(N)           number of observations
    r(area)        area under the ROC curve

lsens saves in r():

Scalars
    r(N)           number of observations
Methods and Formulas

logistic, lfit, lstat, lroc, and lsens are implemented as ado-files.

Define xj as the (row) vector of independent variables, augmented by 1, and b as the corresponding estimated parameter (column) vector. The logistic regression model is estimated by logit; see [R] logit for details of estimation.

The odds ratio corresponding to the ith coefficient is ψi = exp(bi). The standard error of the odds ratio is si^ψ = ψi si, where si is the standard error of bi estimated by logit.
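A minimal sketch of these two formulas in action; the model here (foreign on mpg from the automobile data) is purely illustrative:

```stata
* Sketch: odds ratio and its standard error recovered from a logit fit
* psi = exp(b) and se(psi) = psi*se(b)
quietly logit foreign mpg
display exp(_b[mpg])             // the odds ratio for mpg
display exp(_b[mpg])*_se[mpg]    // its standard error
```

These reproduce the Odds Ratio and Std. Err. columns that logistic would display for the same model.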
Define Ij = xjb as the predicted index of the jth observation. The predicted probability of a positive outcome is

    pj = exp(Ij) / {1 + exp(Ij)}
lfit

Let M be the total number of covariate patterns among the N observations. View the data as collapsed on covariate patterns j = 1, 2, ..., M, and define mj as the total number of observations having covariate pattern j and yj as the total number of positive responses among observations with covariate pattern j. Define pj as the predicted probability of a positive outcome in covariate pattern j.

The Pearson X2 goodness-of-fit statistic is

             M
    X2  =   sum   (yj - mj pj)^2 / {mj pj (1 - pj)}
            j=1

This X2 statistic has approximately M - k degrees of freedom for the estimation sample, where k is the number of independent variables including the constant. For a sample outside of the estimation sample, the statistic has M degrees of freedom.

The Hosmer-Lemeshow goodness-of-fit X2 (Hosmer and Lemeshow 1980; Lemeshow and Hosmer 1982; Hosmer, Lemeshow, and Klar 1988) is calculated similarly, except rather than using the M covariate patterns as the group definition, the quantiles of the predicted probabilities are used to form groups. Let G = # be the number of quantiles requested with group(#).

"… predicts failure perfectly". It will then inform us about the fixup it takes and estimate what can be estimated of our model.
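The Pearson statistic above can be computed by hand by collapsing on covariate patterns. A minimal sketch, assuming a fitted logit of y on x1 and x2 is in memory (the variable names are placeholders):

```stata
* Sketch: the Pearson chi-squared computed over covariate patterns
predict double p
egen patt = group(x1 x2)
egen m  = count(y), by(patt)       // m_j: observations in pattern j
egen y1 = sum(y),   by(patt)       // y_j: positive responses in pattern j
bysort patt: gen double x2j = (y1 - m*p)^2/(m*p*(1-p)) if _n == 1
quietly summarize x2j
display r(sum)                     // the Pearson X2 statistic
```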
logit (and logistic and probit) will also occasionally display messages such as

    note:  4 failures and 0 successes completely determined.
There are two causes for a message like this. Let us deal with the most unlikely case first. This case occurs when a continuous variable (or a combination of a continuous variable with other continuous or dummy variables) is simply a great predictor of the dependent variable. Consider Stata's auto.dta dataset with 6 observations removed.

. use auto
(1978 Automobile Data)
. drop if foreign==0 & gear_ratio>3.1
(6 observations deleted)
. logit foreign mpg weight gear_ratio, nolog

Logit estimates                               Number of obs   =         68
                                              LR chi2(3)      =      72.64
                                              Prob > chi2     =     0.0000
Log likelihood = -6.4874814                   Pseudo R2       =     0.8484

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -.4944907   .2655508    -1.86   0.063    -1.014961    .0259792
      weight |  -.0060919    .003101    -1.96   0.049    -.0121696    -.000014
  gear_ratio |   15.70509   8.166234     1.92   0.054    -.3004359    31.71061
       _cons |  -21.39527   25.41486    -0.84   0.400    -71.20747    28.41694
------------------------------------------------------------------------------
note: 4 failures and 0 successes completely determined.
Note that there are no missing standard errors in the output. If you receive the "completely determined" message and have one or more missing standard errors in your output, see the second case discussed below.

Note gear_ratio's large coefficient. logit thought that the 4 observations with the smallest predicted probabilities were essentially predicted perfectly.

. predict p
(option p assumed; Pr(foreign))
. sort p
. list p in 1/4

              p
  1.   1.34e-10
  2.   6.26e-09
  3.   7.84e-09
  4.   1.49e-08

If this happens to you, there is no need to do anything. Computationally, the model is sound. It is the second case discussed below that requires careful examination.
The second case occurs when the independent terms are all dummy variables or continuous ones with repeated values (e.g., age). In this case, one or more of the estimated coefficients will have missing standard errors. For example, consider this dataset consisting of 5 observations.

. list

        y   x1   x2
  1.    0    0    0
  2.    0    1    0
  3.    1    1    0
  4.    0    0    1
  5.    1    0    1

. logit y x1 x2, nolog

Logit estimates                               Number of obs   =          5
                                              LR chi2(2)      =       1.18
                                              Prob > chi2     =     0.5530
Log likelihood = -2.7725887                   Pseudo R2       =     0.1761

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   18.26157          2     9.13   0.000     14.34164     22.1815
          x2 |   18.26157          .        .       .            .           .
       _cons |  -18.26157   1.414214   -12.91   0.000    -21.03338   -15.48976
------------------------------------------------------------------------------
note: 1 failure and 0 successes completely determined.

. predict p
(option p assumed; Pr(y))
. list

        y   x1   x2          p
  1.    0    0    0   1.17e-08
  2.    0    1    0         .5
  3.    1    1    0         .5
  4.    0    0    1         .5
  5.    1    0    1         .5

Two things are happening here. The first is that logit is able to fit the outcome (y = 0) for the covariate pattern x1 = 0 and x2 = 0 (i.e., the first observation) perfectly. It is this observation that is the "1 failure ... completely determined". The second thing to note is that if this observation is dropped, x1, x2, and the constant are collinear.
This is the cause of the message "completely determined" and the missing standard errors. It happens when you have a covariate pattern (or patterns) with only one outcome, and there is collinearity when the observations corresponding to this covariate pattern are dropped.

If this happens to you, confirm the causes. First identify the covariate pattern with only one outcome. (For your data, replace x1 and x2 with the independent variables of your model.)

. egen pattern = group(x1 x2)
. quietly logit y x1 x2
. predict p
(option p assumed; Pr(y))
. summarize p

    Variable |     Obs        Mean    Std. Dev.        Min        Max
           p |       5          .4    .2236068   1.17e-08         .5

If successes were completely determined, that means there are predicted probabilities that are almost 1. If failures were completely determined, that means there are predicted probabilities that are almost 0. The latter is the case here, so we locate the corresponding value of pattern:

. tab pattern if p < 1e-7

  group(x1 |
       x2) |      Freq.     Percent        Cum.
-----------+-----------------------------------
         1 |          1      100.00      100.00
-----------+-----------------------------------
     Total |          1      100.00

Once we omit this covariate pattern from the estimation sample, logit can deal with the collinearity:

. logit y x1 x2 if pattern~=1, nolog
note: x2 dropped due to collinearity

Logit estimates                               Number of obs   =          4
                                              LR chi2(1)      =       0.00
                                              Prob > chi2     =     1.0000
Log likelihood = -2.7725887                   Pseudo R2       =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |          0          2     0.00   1.000    -3.919928    3.919928
       _cons |          0   1.414214     0.00   1.000    -2.771808    2.771808
------------------------------------------------------------------------------
We omit the collinear variable. Then we must decide whether to include or to omit the observations with pattern = 1. We could include them

. logit y x1, nolog

Logit estimates                               Number of obs   =          5
                                              LR chi2(1)      =       0.14
                                              Prob > chi2     =     0.7098
Log likelihood = -3.2958369                   Pseudo R2       =     0.0206

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .6931472   1.870827     0.37   0.711    -2.973605      4.3599
       _cons |  -.6931472   1.224742    -0.57   0.571    -3.093597    1.707302
------------------------------------------------------------------------------

or exclude them:

. logit y x1 if pattern~=1, nolog

Logit estimates                               Number of obs   =          4
                                              LR chi2(1)      =       0.00
                                              Prob > chi2     =     1.0000
Log likelihood = -2.7725887                   Pseudo R2       =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |          0          2     0.00   1.000    -3.919928    3.919928
       _cons |          0   1.414214     0.00   1.000    -2.771808    2.771808
------------------------------------------------------------------------------
If the covariate pattern that predicts outcome perfectly is meaningful, you may want to exclude these observations from the model. In this case, one would report that covariate pattern such and such predicted outcome perfectly and that the best model for the rest of the data is .... But, more likely, the perfect prediction was simply the result of having too many predictors in the model. In this case, one would omit the extraneous variable(s) from further consideration and report the best model for all the data.

Obtaining predicted values

Once you have estimated a logit model, you can obtain the predicted probabilities using the predict command for both the estimation sample and other samples; see [U] 23 Estimation and post-estimation commands and [R] predict. Here we will make only a few additional comments.

predict without arguments calculates the predicted probability of a positive outcome, i.e., Pr(yj = 1) = F(xjb). With the xb option, it calculates the linear combination xjb, where xj are the independent variables in the jth observation and b is the estimated parameter vector. This is sometimes known as the index function since the cumulative distribution function indexed at this value is the probability of a positive outcome.

In both cases, Stata remembers any "rules" used to identify the model and calculates missing for excluded observations unless rules or asif is specified. This is covered in the following example.

For information about the other statistics available after predict, see [R] logistic.
> Example

In the previous example, we estimated the logit model logit foreign rep_is_1 rep_is_2. To obtain predicted probabilities:

. predict p
(option p assumed; Pr(foreign))
(10 missing values generated)
. summarize foreign p

    Variable |     Obs        Mean    Std. Dev.       Min        Max
     foreign |      58    .2068966    .4086186         0          1
           p |      48         .25    .1956984        .1         .5

Stata remembers any "rules" used to identify the model and sets predictions to missing for any excluded observations. In the previous example, logit dropped the variable rep_is_1 from our model and excluded 10 observations. Thus, when we typed predict p, those same 10 observations were again excluded and their predictions were set to missing.

predict's rules option will use the rules in the prediction. During estimation, we were told "rep_is_1~=0 predicts failure perfectly", so the rule is that when rep_is_1 is not zero, one should predict 0 probability of success or a positive outcome:

. predict p2, rules
. summarize foreign p p2

    Variable |     Obs        Mean    Std. Dev.       Min        Max
     foreign |      58    .2068966    .4086186         0          1
           p |      48         .25    .1956984        .1         .5
          p2 |      58    .2068966    .2016268         0         .5

predict's asif option will ignore the rules and the exclusion criteria and calculate predictions for all observations possible using the estimated parameters from the model:

. predict p3, asif
. summarize foreign p p2 p3

    Variable |     Obs        Mean    Std. Dev.       Min        Max
     foreign |      58    .2068966    .4086186         0          1
           p |      48         .25    .1956984        .1         .5
          p2 |      58    .2068966    .2016268         0         .5
          p3 |      58    .2931034    .2016268        .1         .5

Which is right? What predict does by default is the most conservative approach. If a large number of observations had been excluded due to a simple rule, one could be reasonably certain that the rules prediction is correct. The asif prediction is only correct if the exclusion is a fluke and you would be willing to exclude the variable from the analysis anyway. In that case, however, you should re-estimate the model to include the excluded observations.
Saved Results

logit saves in e():

Scalars
    e(N)           number of observations
    e(ll)          log likelihood
    e(ll_0)        log likelihood, constant-only model
    e(df_m)        model degrees of freedom
    e(r2_p)        pseudo R-squared
    e(N_clust)     number of clusters
    e(chi2)        chi-squared

Macros
    e(cmd)         logit
    e(depvar)      name of dependent variable
    e(wtype)       weight type
    e(wexp)        weight expression
    e(clustvar)    name of cluster variable
    e(vcetype)     covariance estimation method
    e(chi2type)    Wald or LR; type of model chi-squared test
    e(offset)      offset
    e(predict)     program used to implement predict

Matrices
    e(b)           coefficient vector
    e(V)           variance-covariance matrix of the estimators

Functions
    e(sample)      marks estimation sample
Methods and Formulas

The word logit is due to Berkson (1944) and is by analogy with the word probit. For an introduction to probit and logit, see, for example, Aldrich and Nelson (1984), Hamilton (1992), Johnston and DiNardo (1997), Long (1997), or Powers and Xie (2000).

The likelihood function for logit is

    ln L  =  sum over j in S of  wj ln F(xjb)  +  sum over j not in S of  wj ln{1 - F(xjb)}

where S is the set of all observations j such that yj ~= 0, F(z) = e^z/(1 + e^z), and wj denotes the optional weights. ln L is maximized as described in [R] maximize.

If robust standard errors are requested, the calculation described in Methods and Formulas of [R] regress is carried forward with uj = {1 - F(xjb)}xj for the positive outcomes and -F(xjb)xj for the negative outcomes. qc is given by its asymptotic-like formula.
References

Aldrich, J. H. and F. D. Nelson. 1984. Linear Probability, Logit, and Probit Models. Newbury Park, CA: Sage Publications.

Berkson, J. 1944. Application of the logistic function to bio-assay. Journal of the American Statistical Association 39: 357-365.

Cleves, M. and A. Tosetto. 2000. sg139: Logistic regression when binary outcome is measured with uncertainty. Stata Technical Bulletin 55: 20-23.

Hamilton, L. C. 1992. Regression with Graphics. Pacific Grove, CA: Brooks/Cole Publishing Company.

Hosmer, D. W., Jr., and S. Lemeshow. 1989. Applied Logistic Regression. New York: John Wiley & Sons. (Second edition forthcoming in 2001.)

Johnston, J. and J. DiNardo. 1997. Econometric Methods. 4th ed. New York: McGraw-Hill.

Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.

Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.

Powers, D. A. and Y. Xie. 2000. Statistical Methods for Categorical Data Analysis. San Diego, CA: Academic Press.

Pregibon, D. 1981. Logistic regression diagnostics. Annals of Statistics 9: 705-724.
Also See

Complementary:  [R] clogit, [R] cloglog, [R] cusum, [R] glm, [R] glogit, [R] logistic,
                [R] nlogit, [R] probit, [R] scobit, [R] svy estimators, [R] xtclog,
                [R] xtgee, [R] xtlogit, [R] xtprobit

Related:        [R] adjust, [R] lincom, [R] linktest, [R] lrtest, [R] mfx, [R] predict,
                [R] roc, [R] sw, [R] test, [R] testnl, [R] vce, [R] xi

Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands,
                [U] 23.11 Obtaining robust variance estimates,
                [U] 23.12 Obtaining scores,
                [R] maximize
loneway — Large one-way ANOVA, random effects, and reliability

Syntax

    loneway response_var group_var [weight] [if exp] [in range] [, mean median exact level(#) ]

by ... : may be used with loneway; see [R] by.

aweights are allowed; see [U] 14.1.6 weight.

Description

loneway estimates one-way analysis-of-variance (ANOVA) models on datasets with a large number of levels of group_var and presents different ancillary statistics from oneway (see [R] oneway):

    Feature                                        oneway    loneway
    Estimate one-way model
      on fewer than 376 levels                        x          x
      on more than 376 levels                                    x
    Bartlett's test for equal variance                x
    Multiple-comparison tests                         x
    Intragroup correlation and S.E.                              x
    Intragroup correlation confidence interval                   x
    Est. reliability of group-averaged score                     x
    Est. S.D. of group effect                                    x
    Est. S.D. within group                                       x

Options

mean specifies that the expected value of the F(k-1, N-k) distribution be used as the reference point Fm in the estimation of ρ instead of the default value of 1.

median specifies that the median of the F(k-1, N-k) distribution be used as the reference point Fm in the estimation of ρ instead of the default value of 1.

exact requests that exact confidence intervals be computed, as opposed to the default asymptotic confidence intervals. This option is allowed only if the groups are equal in size and weights are not used.

level(#) specifies the confidence level, in percent, for confidence intervals of the coefficients. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
_
Re.m
loneway-
i
Large one-wayANOVA,random effects,and reliability
261
> Example lonewa't's output looks like that of oneway except that, at the end, additional information is presented. Jsing our automobile dataset (see [U]'9 Stata's on-line tutorials and sample datasets), we have eated a (numeric) variable called ma_u:facturer_grp identifying the manufacturer of each car an within each manufacturer we have retained a maximum of four models, selecting those with',the h Jcest mpg. We can compute the intradlass correlation of mpg for all manufacturers with at least You models as follows: . 'loneway mpg manufacturer_grp if nummake == 4 One-way Analysis of VarianCe for mpg: Mileage (mpg)
S_urce
SS
df
Number of obs =
36
R-squared :
0. 5228
MS
F
Between|manufactu/~p Withi_ manufactur_p
621.88889 567.75
8 27
77,736111 21.027778
Total
1189. 6389
35
33.989683
Intraclass correlation 0.402T0
Asy. S.E O.18770
Prob > F
3.70
O.0049
[957 Conf. Interval] O.03481
0.77060
Estimatec_SD of manufactur_p effect 3.765247 Estimated SD within manufactur-p 4.585605 Est. reliability of a manufactur-p mean .7294979 (evau%uatedat 11=4.00)
q
In additi(,n to the standard one-way ANOVAoutput, lonewayproduces the R-squared, estimated standardde,,iation of the group effect, estimated standard deviation within group, the intragroup correlation he estimated reliability of the group-a_eraged mean, and, in the case of unweighted data. the asyrr/ptc :ic standard error and confidence interval for the intragroup correlation.
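The summary numbers follow from the ANOVA table by simple arithmetic; a sketch of the computation on the displayed values:

```stata
* R-squared and F recomputed from the ANOVA table above
display 621.88889/1189.6389     // R-squared, about 0.5228
display 77.736111/21.027778     // F statistic, about 3.70
```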
R-squared

The R-squared is, of course, simply the underlying R² for a regression of response_var on the levels of group_var, or mpg on the various manufacturers in this case.
The random effects ANOVA model loneway assumes that we observe a variable Yij measured for r_, elements within k groups or classes such that Yij
::=
,tZ+ Ct" i -I-
6ij,
i = 1,!2,...,
k.
3 = 1.2 .....
ni
and %. and _]ij are independent zero-mean randon3 variables with variance a,] and cr2, respectively. This is the random-effects ANOVAmodel, also kno_'n as the components of variance model, in which it is t}_picall31assumed thak the Yij are normally d_stributed.
The interpretation with respect to our example is that the observed value of our response variable, mpg, is created in two steps. First, the ith manufacturer is chosen and a value alpha_i is determined -- the typical mpg for that manufacturer less the overall mpg mu. Then a deviation, eps_ij, is chosen for the jth model within this manufacturer. This is how much that particular automobile differs from the typical mpg value for models from this manufacturer. For our sample of 36 car models, the estimated standard deviations are sigma_alpha = 3.8 and sigma_eps = 4.6. Thus, a little more than half of the variation in mpg between cars is attributable to the car model, with the rest attributable to differences between manufacturers. These standard deviations differ from those that would be produced by a (standard) fixed-effects regression in that the regression would require the sum within each manufacturer of the eps_ij, e_i. for the ith manufacturer, to be zero, while these estimates merely impose the constraint that the sum is expected to be zero.
Intraclass correlation

There are various estimators of the intraclass correlation, such as the pairwise estimator, which is defined as the Pearson product-moment correlation computed over all possible pairs of observations that can be constructed within groups. For a discussion of various estimators, see Donner (1986). loneway computes what is termed the analysis of variance, or ANOVA, estimator. This intraclass correlation is the theoretical upper bound on the variation in response_var that is explainable by group_var, of which R-squared is an overestimate because of the serendipity of fitting. Note that this correlation is comparable to an R-squared; you do not have to square it.

In our example, the intra-manufacturer correlation, the correlation of mpg within manufacturer, is 0.40. Since aweights were not used and the default correlation was computed, i.e., the mean and median options were not specified, loneway also provided the asymptotic confidence interval and standard error of the intraclass correlation estimate.
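For concreteness, the pairwise estimator mentioned above can be sketched outside Stata; this Python fragment (toy data invented for illustration) forms every ordered within-group pair and computes the Pearson correlation over the pairs:

```python
from itertools import permutations

# Hypothetical toy data to illustrate the *pairwise* estimator described
# above (loneway itself computes the ANOVA estimator, not this one).
groups = [[14, 17, 18], [22, 25, 24], [30, 28, 33]]

# All ordered within-group pairs (each unordered pair appears both ways)
pairs = [(a, b) for g in groups for a, b in permutations(g, 2)]
xs = [p[0] for p in pairs]
ys = [p[1] for p in pairs]
n = len(pairs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in pairs) / n
sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
r_pairwise = cov / (sx * sy)
print(round(r_pairwise, 4))   # high here because the toy groups are well separated
```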
Estimated reliability of the group-averaged score

The estimated reliability of the group-averaged score or mean has an interpretation similar to that of the intragroup correlation; it is a comparable number if we average response_var by group_var, or mpg by manufacturer in our example. It is the theoretical upper bound of a regression of manufacturer-averaged mpg on characteristics of manufacturers. Why would we want to collapse our 36-observation dataset into a 9-observation dataset of manufacturer averages? Because the 36 observations might be a mirage. When General Motors builds cars, do they sometimes put a Pontiac label and sometimes a Chevrolet label on them, so that it appears in our data as if we have two cars when we really have only one, replicated? If that were the case, and if it were the case for many other manufacturers, then we would be forced to admit that we do not have data on 36 cars; we instead have data on 9 manufacturer-averaged characteristics.
Saved Results

loneway saves in r():

Scalars
    r(N)        number of observations
    r(rho)      intraclass correlation
    r(lb)       lower bound of 95% CI for rho
    r(ub)       upper bound of 95% CI for rho
    r(rho_t)    estimated reliability
    r(se)       asymp. SE of intraclass correlation
    r(sd_w)     estimated SD within group
    r(sd_b)     estimated SD of group effect
Methods and Formulas

loneway is implemented as an ado-file.

The mean squares in loneway's ANOVA table are computed as follows:
    MS_alpha = sum_i w_i. (ybar_i. - ybar..)^2 / (k - 1)

and

    MS_eps = sum_i sum_j w_ij (y_ij - ybar_i.)^2 / (N - k)

in which

    w_i. = sum_j w_ij        w.. = sum_i w_i.
    ybar_i. = sum_j w_ij y_ij / w_i.        ybar.. = sum_i w_i. ybar_i. / w..

The corresponding expected values of these mean squares are

    E(MS_alpha) = sigma_eps^2 + g sigma_alpha^2        and        E(MS_eps) = sigma_eps^2

in which

    g = ( w.. - sum_i w_i.^2 / w.. ) / (k - 1)
Note that in the unweighted case, we get

    g = ( N - sum_i n_i^2 / N ) / (k - 1)
As expected, g = m for the case of no weights and equal group sizes in the data, i.e., n_i = m for all i. Replacing the expected values with the observed values and solving yields the ANOVA estimates of sigma_alpha^2 and sigma_eps^2. Substituting these into the definition of the intraclass correlation

    rho = sigma_alpha^2 / (sigma_alpha^2 + sigma_eps^2)

yields the ANOVA estimator of the intraclass correlation:

    rho_A = (F_obs - 1) / (F_obs - 1 + g)
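As a check of these formulas (an illustrative Python sketch, not Stata code), plugging in the sums of squares from the loneway output shown earlier reproduces the reported intraclass correlation:

```python
# Numbers taken from the loneway example output above:
# 9 manufacturers, 4 models each (N = 36, k = 9).
N, k, n_i = 36, 9, 4

MS_alpha = 621.88889 / (k - 1)      # between mean square, ~77.736111
MS_eps   = 567.75    / (N - k)      # within mean square,  ~21.027778
F_obs    = MS_alpha / MS_eps        # about 3.70, as in the table

# Unweighted case: g = (N - sum(n_i^2)/N) / (k - 1); equal groups give g = n_i
g = (N - k * n_i**2 / N) / (k - 1)

rho_A = (F_obs - 1) / (F_obs - 1 + g)
print(round(rho_A, 5))              # about 0.40270, matching the output
```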
Note that F_obs is the observed value of the F statistic from the ANOVA table. For the case of no weights and equal n_i, rho_A = roh, the intragroup correlation defined by Kish (1965). Two slightly different estimators are available through the mean and median options (Gleason 1997). If either of these options is specified, the estimate of rho becomes

    rho = (F_obs - F_m) / { F_obs + (g - 1) F_m }

For the mean option, F_m = E(F_{k-1,N-k}) = (N - k)/(N - k - 2), i.e., the expected value of the ANOVA table's F statistic. For the median option, F_m is simply the median of the F statistic. Note that setting F_m to 1 gives rho_A, so for large samples these different point estimators are essentially the same. Also, since the intraclass correlation of the random-effects model is by definition nonnegative, for any of the three possible point estimators, rho is truncated to zero if F_obs is less than F_m.
For the case of no weighting, interval estimators for rho_A are computed. If the groups are equal-sized (all n_i equal) and the exact option is specified, the following (under the assumption that the y_ij are normally distributed) 100(1 - a)% confidence interval is computed:

    [ (F_obs - F_m F_u) / { F_obs + (g - 1) F_m F_u } ,  (F_obs - F_m F_l) / { F_obs + (g - 1) F_m F_l } ]

with F_m = 1, F_l = F_{a/2, k-1, N-k}, and F_u = F_{1-a/2, k-1, N-k}, F_{., k-1, N-k} being the cumulative distribution function for the F distribution with k - 1 and N - k degrees of freedom. Note that if mean or median is specified, F_m is defined as above. If the groups are equal-sized and exact is not specified, the following asymptotic 100(1 - a)% confidence interval for rho_A is computed:

    [ rho_A - z_{a/2} sqrt{V(rho_A)} ,  rho_A + z_{a/2} sqrt{V(rho_A)} ]

where z_{a/2} is the 100(1 - a/2) percentile of the standard normal distribution and sqrt{V(rho_A)} is the asymptotic standard error of rho defined below. Note that this confidence interval is also available for the case of unequal groups. It is not applicable, and therefore not computed, for the estimates of rho provided by the mean and median options. Again, since the intraclass coefficient is nonnegative, if the lower bound is negative for either confidence interval, it is truncated to zero. As might be expected, the coverage probability of a truncated interval is higher than its nominal value.

The asymptotic variance of rho_A, assuming the y_ij are normally distributed, is also computed when appropriate, namely, for unweighted data and when rho_A is computed (neither the mean nor the median options are specified):
    V(rho_A) = [ 2 (1 - rho)^2 / g^2 ] (A + B + C)

with

    A = { 1 + rho (g - 1) }^2 / (N - k)

    B = (1 - rho) { 1 + rho (2g - 1) } / (k - 1)

    C = rho^2 { sum_i n_i^2 - 2 N^{-1} sum_i n_i^3 + N^{-2} (sum_i n_i^2)^2 } / (k - 1)^2
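Again as a numerical check (illustrative Python, not Stata), evaluating V(rho_A) at the example's values, nine groups of four with rho_A = 0.40270, reproduces the reported asymptotic standard error and confidence interval:

```python
# Values from the loneway example above: N = 36, k = 9, g = 4, rho_A = 0.40270
N, k, g = 36, 9, 4.0
n = [4] * k                    # nine equal-sized groups
rho = 0.40270                  # the ANOVA point estimate rho_A

s2 = sum(v**2 for v in n)
s3 = sum(v**3 for v in n)
A = (1 + rho * (g - 1)) ** 2 / (N - k)
B = (1 - rho) * (1 + rho * (2 * g - 1)) / (k - 1)
C = rho**2 * (s2 - 2 * s3 / N + (s2 / N) ** 2) / (k - 1) ** 2
V = 2 * (1 - rho) ** 2 / g**2 * (A + B + C)
se = V ** 0.5                  # about 0.18770, as reported

z = 1.959964                   # 97.5th percentile of the standard normal
lo, hi = max(rho - z * se, 0), rho + z * se
print(round(se, 5), round(lo, 5), round(hi, 5))
```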
and rho_A is substituted for rho (Donner 1986).

The estimated reliability of the group-averaged score, known as the Spearman-Brown prediction formula in the psychometric literature (Winer, Brown, and Michels 1991, 1014), is

    rho_t = t rho / { 1 + (t - 1) rho }

for group size t. loneway computes rho_t for t = g.

The estimated standard deviation of the group effect is sigma_alpha = sqrt{(MS_alpha - MS_eps)/g}. This comes from the assumption that an observation is derived by adding a group effect to a within-group effect. The estimated standard deviation within group is the square root of the mean square due to error, sqrt{MS_eps}.
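These last formulas can also be verified against the example output (an illustrative Python sketch; the MS values are read off the ANOVA table shown earlier):

```python
# Mean squares and g from the loneway example above; t = g = 4.
MS_alpha, MS_eps, g = 77.736111, 21.027778, 4.0

sd_between = ((MS_alpha - MS_eps) / g) ** 0.5   # SD of group effect, ~3.765247
sd_within  = MS_eps ** 0.5                      # SD within group,    ~4.585605

rho = 0.40270                                   # ANOVA intraclass estimate
t = g
rho_t = t * rho / (1 + (t - 1) * rho)           # Spearman-Brown, ~0.7295
print(round(sd_between, 6), round(sd_within, 6), round(rho_t, 4))
```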
Acknowledgment

We would like to thank John Gleason of Syracuse University for his contributions to this improved version of loneway.
References

Donner, A. 1986. A review of inference procedures for the intraclass correlation coefficient in the one-way random effects model. International Statistical Review 54: 67-82.

Gleason, J. R. 1997. sg65: Computing intraclass correlations and large ANOVAs. Stata Technical Bulletin 35: 25-31. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 167-176.

Kish, L. 1965. Survey Sampling. New York: John Wiley & Sons.

Winer, B. J., D. R. Brown, and K. M. Michels. 1991. Statistical Principles in Experimental Design. 3d ed. New York: McGraw-Hill.
Also See

Related:  [R] oneway
lorenz -- Inequality measures
Remarks

Stata should have commands for inequality measures but, at the date that this manual was written, Stata does not. Stata users, however, have developed an excellent suite of commands, many of which have been published in the Stata Technical Bulletin (STB).
Issue    insert    author(s)                       command               description
------------------------------------------------------------------------------------------------
STB-48   gr35      N. J. Cox                       psm, qsm,             Diagnostic plots for assessing Singh-Maddala
                                                   pdagum, qdagum        and Dagum distributions fitted by MLE
STB-23   sg30      E. Whitehouse                   lorenz, inequal,      Measures of inequality in Stata
                                                   atkinson, relsgini
STB-23   sg31      R. Goldstein                    rspread               Measures of diversity: absolute and relative
STB-48   sg104     S. P. Jenkins                   sumdist, xfrac,       Analysis of income distributions
                                                   ineqdeco, geivars,
                                                   ineqfac, povdeco
STB-48   sg106     S. P. Jenkins                   smfit, dagumfit       Fitting Singh-Maddala and Dagum
                                                                         distributions by maximum likelihood
STB-48   sg107     S. P. Jenkins, P. Van Kerm      glcurve               Generalized Lorenz curves and related graphs
STB-49   sg107.1   S. P. Jenkins, P. Van Kerm      glcurve               update; install this version
STB-48   sg108     P. Van Kerm                     poverty               Computing poverty indices
STB-51   sg115     D. Jolliffe, B. Krushelnytskyy  ineqerr               Bootstrap standard errors for indices
                                                                         of inequality
STB-51   sg117     D. Jolliffe, A. Semykina        sepov                 Robust standard errors for the Foster-Greer-
                                                                         Thorbecke class of poverty indices

Additional commands may be available; enter Stata and type search inequality measures.
To download and install the Jenkins sumdist command from the Internet, for instance, you could

1. Pull down Help and select STB and User-written Programs.
2. Click on http://www.stata.com.
3. Click on stb.
4. Click on stb48.
5. Click on sg104.
6. Click on click here to install.

or you could instead do the following:
1. Navigate to the appropriate STB issue:
   a. Type net from http://www.stata.com
      Type net cd stb
      Type net cd stb48
   or
   b. Type net from http://www.stata.com/stb/stb48
2. Type net describe sg104
3. Type net install sg104
References

Cox, N. J. 1999. gr35: Diagnostic plots for assessing Singh-Maddala and Dagum distributions fitted by MLE. Stata Technical Bulletin 48: 2-4. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 72-74.

Goldstein, R. 1995. sg31: Measures of diversity: absolute and relative. Stata Technical Bulletin 23: 23-26. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 150-154.

Jenkins, S. P. 1999a. sg104: Analysis of income distributions. Stata Technical Bulletin 48: 4-18. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 243-260.

------. 1999b. sg106: Fitting Singh-Maddala and Dagum distributions by maximum likelihood. Stata Technical Bulletin 48: 19-25. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 261-268.

Jenkins, S. P. and P. Van Kerm. 1999a. sg107: Generalized Lorenz curves and related graphs. Stata Technical Bulletin 48: 25-29. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 269-274.

------. 1999b. sg107.1: Generalized Lorenz curves and related graphs. Stata Technical Bulletin 49: 23. Reprinted in Stata Technical Bulletin Reprints, vol. 9, p. 171.

Jolliffe, D. and B. Krushelnytskyy. 1999. sg115: Bootstrap standard errors for indices of inequality. Stata Technical Bulletin 51: 28-32. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 191-196.

Jolliffe, D. and A. Semykina. 1999. sg117: Robust standard errors for the Foster-Greer-Thorbecke class of poverty indices. Stata Technical Bulletin 51: 34-36. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 200-203.

Van Kerm, P. 1999. sg108: Computing poverty indices. Stata Technical Bulletin 48: 29-33. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 274-278.

Whitehouse, E. 1995. sg30: Measures of inequality in Stata. Stata Technical Bulletin 23: 20-23. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 146-150.
lrtest -- Likelihood-ratio test after model estimation
Syntax

    lrtest [, saving(name) using(name) model(name) df(#) ]

where name may be a name or a number but may not exceed 4 characters.
Description

lrtest saves information about and performs likelihood-ratio tests between pairs of maximum likelihood models such as those estimated by cox, logit, logistic, poisson, etc. lrtest may be used with any estimation command that reports a log-likelihood value or, equivalently, displays output like that described in [R] maximize.

lrtest, typed without arguments, performs a likelihood-ratio test of the most recently estimated model against the model previously saved by lrtest, saving(0). It is your responsibility to ensure that the most recently estimated model is nested within the previously saved model.

lrtest provides an important alternative to test for maximum likelihood models.
Options

saving(name) specifies that the summary statistics associated with the most recently estimated model are to be saved as name. If no other options are specified, the statistics are saved and no test is performed. The larger model is typically saved by typing lrtest, saving(0).

using(name) specifies the name of the larger model against which a model is to be tested. If this option is not specified, using(0) is assumed.

model(name) specifies the name of the nested model (a constrained model) to be tested. If not specified, the most recently estimated model is used.

df(#) is seldom specified; it overrides the automatic degrees-of-freedom calculation.
Remarks

The standard use of lrtest is

1. Estimate the larger model using one of Stata's estimation commands and then type lrtest, saving(0).
2. Estimate an alternative, nested model (a constrained model) and then type lrtest.
> Example

You have data on infants born with low birth weights along with characteristics of the mother (Hosmer and Lemeshow 1989; more fully described in [R] logistic). You estimate the following model:
. logistic low age lwt race2 race3 smoke ptl ht ui

Logit estimates                               Number of obs =        189
                                              LR chi2(8)    =      33.22
                                              Prob > chi2   =     0.0001
Log likelihood = -100.724                     Pseudo R2     =     0.1416

     low   Odds Ratio   Std. Err.      z    P>|z|    [95% Conf. Interval]
     age     .9732636    .0354759   -0.74   0.457    .9061578    1.045339
     lwt     .9849634    .0068217   -2.19   0.029    .9716834    .9984249
   race2     3.534767    1.860737    2.40   0.016    1.259736    9.918406
   race3     2.368079    1.039949    1.96   0.050    1.001356    5.600207
   smoke     2.517698     1.00916    2.30   0.021    1.147676    5.523162
     ptl     1.719161    .5952579    1.56   0.118    .8721455    3.388787
      ht     6.249602    4.322408    2.65   0.008    1.611152    24.24199
      ui       2.1351    .9808153    1.65   0.099    .8677528      5.2534
You now wish to test the constraint that the coefficients on age, lwt, ptl, and ht are all zero (or, equivalently in this case, that the odds ratios are all 1). One solution is

. test age lwt ptl ht
 ( 1)  age = 0.0
 ( 2)  lwt = 0.0
 ( 3)  ptl = 0.0
 ( 4)  ht = 0.0

           chi2(  4) =    12.38
         Prob > chi2 =   0.0147
This test is based on the inverse of the information matrix and is therefore based on a quadratic approximation to the likelihood function; see [R] test. A more precise test would be to re-estimate the model, applying the proposed constraints, and then calculate the likelihood-ratio test. lrtest assists you in doing this.

You first save the statistics associated with the current model:

. lrtest, saving(0)
The "name" 0 was not chosen arbitrarily, although we could have chosen any name. Why we chose 0 will become clear shortly. Having saved the information on the current model, we now estimate the constrained model, which in this case is the model omitting age, lwt, ptl, and ht:

. logistic low race2 race3 smoke ui
Logit estimates                               Number of obs =        189
                                              LR chi2(4)    =      18.80
                                              Prob > chi2   =     0.0009
Log likelihood = -107.93404                   Pseudo R2     =     0.0801

     low   Odds Ratio   Std. Err.      z    P>|z|    [95% Conf. Interval]
   race2     3.052746    1.498084    2.27   0.023    1.166749    7.987368
   race3     2.922593    1.189226    2.64   0.008     1.31646    6.488269
   smoke     2.945742    1.101835    2.89   0.004     1.41517    6.131701
      ui     2.419131    1.047358    2.04   0.041     1.03546    5.651783
That done, typing lrtest will compare this model with the model we previously saved:

. lrtest
Logistic:  likelihood-ratio test               chi2(4)     =      14.42
                                               Prob > chi2 =     0.0061
The more precise syntax for the test is lrtest, using(0), meaning that the current model is to be compared with the model saved as 0. The name 0, as we previously said, is special: when you do not specify the name of the using() model, using(0) is assumed. Thus, saving the original model as 0 saved us some typing when we performed the test.
Comparing results, test reported that age, lwt, ptl, and ht were jointly significant at the 1.5% level; lrtest reports that they are significant at the 0.6% level. lrtest's results should be viewed as more accurate.
> Example

Typing lrtest, saving(0) and later lrtest by itself is the way lrtest is most commonly used, although here is how we might use the other options:

. logit chd age age2 sex          (estimate full model)
. lrtest, saving(0)               (save results)
. logit chd age sex               (estimate simpler model)
. lrtest                          (obtain test)
. lrtest, saving(1)               (save logit results as 1)
. logit chd sex                   (estimate simplest model)
. lrtest                          (compare with full model)
. lrtest, using(1)                (compare with model 1)
. lrtest, model(1)                (repeat test against full model)
> Example

Returning to the low birth weight data in the first example, you now wish to test that the coefficient on race2 is equal to that on race3. The base model is still stored in 0, so you need only estimate the constrained model and perform the test. Letting z be the index of the logit, the base model is

    z = beta_0 + beta_1 age + beta_2 lwt + beta_3 race2 + beta_4 race3 + ...

If beta_3 = beta_4, this can be written

    z = beta_0 + beta_1 age + beta_2 lwt + beta_3 (race2 + race3) + ...

To estimate the constrained model, we create a variable equal to the sum of race2 and race3 and estimate the model including that variable in their place:
. gen race23 = race2 + race3
. logistic low age lwt race23 smoke ptl ht ui

Logit estimates                               Number of obs =        189
                                              LR chi2(7)    =      32.67
                                              Prob > chi2   =     0.0000
Log likelihood = -100.9997                    Pseudo R2     =     0.1392

     low   Odds Ratio   Std. Err.      z    P>|z|    [95% Conf. Interval]
     age     .9716799    .0352638   -0.79   0.429    .9049649    1.043313
     lwt     .9864971    .0064627   -2.08   0.038    .9739114    .9992453
  race23     2.728186    1.080206    2.53   0.011    1.255586    5.927907
   smoke     2.664498    1.052379    2.48   0.013    1.228633    5.778414
     ptl     1.709129    .5924775    1.55   0.122    .8663666    3.371691
      ht     6.116391    4.215585    2.63   0.009     1.58425    23.61385
      ui      2.09936    .9699702    1.61   0.108    .8487997    5.192407
Comparing this model with our original model, we obtain

. lrtest
Logistic:  likelihood-ratio test               chi2(1)     =       0.55
                                               Prob > chi2 =     0.4577

By comparison, typing test race2=race3 after estimating our base model results in a significance level of .4572.
Saved Results

lrtest saves in r():

Scalars
    r(p)       two-sided p-value
    r(df)      degrees of freedom
    r(chi2)    chi-squared

Programmers desiring that an estimation command be compatible with lrtest should note that lrtest requires the following macros to be defined:

    e(cmd)     name of estimation command
    e(ll)      log-likelihood value
    e(df_m)    model degrees of freedom
    e(N)       number of observations
Methods and Formulas

lrtest is implemented as an ado-file.

Let L0 and L1 be the log-likelihood values associated with the full and constrained models, respectively. Then chi2 = -2(L1 - L0) with d0 - d1 degrees of freedom, where d0 and d1 are the model degrees of freedom associated with the full and constrained models (Judge et al. 1985, 216-217).
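As an illustrative check (Python, not Stata), the chi-squared statistic from the first example can be recomputed from the two log likelihoods:

```python
import math

# Log likelihoods from the example above: full model -100.724,
# constrained model -107.93404, with 8 and 4 model degrees of freedom.
L0, L1 = -100.724, -107.93404
d0, d1 = 8, 4
chi2 = -2 * (L1 - L0)          # 14.42 to two decimals
df = d0 - d1                   # 4

# For df = 4 the chi-squared survival function has the closed form
# P(X > x) = exp(-x/2) * (1 + x/2), which avoids needing scipy here.
p = math.exp(-chi2 / 2) * (1 + chi2 / 2)
print(round(chi2, 2), df, round(p, 4))   # matches the lrtest output above
```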
References

Hosmer, D. W., Jr., and S. Lemeshow. 1989. Applied Logistic Regression. New York: John Wiley & Sons. (Second edition forthcoming in 2001.)

Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.

Pérez-Hoyos, S. and A. Tobias. 1999. sg111: A modified likelihood-ratio test command. Stata Technical Bulletin 49: 24-25. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 171-173.

Wang, Z. 2000. sg133: Sequential and drop one term likelihood-ratio tests. Stata Technical Bulletin 54: 46-47. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 332-334.
Also See

Related:  [R] estimation commands, [R] linktest, [R] test, [R] testnl
ltable -- Life tables for survival data
Syntax

    ltable timevar [deadvar] [weight] [if exp] [in range] [, by(groupvar) level(#)
        survival failure hazard intervals(interval) test tvid(varname) noadjust
        notab graph graph_options noconf ]

fweights are allowed; see [U] 14.1.6 weight.
Description

ltable displays and graphs life tables for individual-level or aggregate data and optionally presents the likelihood-ratio and log-rank tests for equivalence of groups. ltable also allows examining the empirical hazard function through aggregation. Also see [R] st sts for alternative commands.

timevar specifies the time of failure or censoring. If deadvar is not specified, all values of timevar are interpreted as failure times; otherwise, timevar is interpreted as a failure time where deadvar != 0 and as a censoring time otherwise. Observations with timevar or deadvar equal to missing are ignored.

Note carefully that deadvar does not specify the number of failures. An observation with deadvar equal to 1 or 50 has the same interpretation: the observation records one failure. Specify frequency weights for aggregated data (e.g., ltable time [freq=number]).
Options

by(groupvar) creates separate tables (or graphs within the same image) for each value of groupvar; groupvar may be string or numeric.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [R] level.

survival, failure, and hazard indicate the table to be displayed. If not specified, the default is the survival table. Specifying failure would display the cumulative failure table. Specifying survival failure would display both the survival and the cumulative failure table. If graph is specified, multiple tables may not be requested.

intervals(interval) specifies the time intervals into which the data are to be aggregated for tabular presentation. A single numeric argument is interpreted as the width of the interval. For instance, interval(2) aggregates data into the time intervals 0 <= t < 2, 2 <= t < 4, and so on. Not specifying interval() is equivalent to specifying interval(1). Since in most data, failure times are recorded as integers, this amounts to no aggregation except that implied by the recording of the time variable and so produces Kaplan-Meier product-limit estimates of the survival curve (with an actuarial adjustment; see the noadjust option below). Also see [R] st sts list. Although it is possible to examine survival and failure without aggregation, some form of aggregation is almost always required for examining the hazard.
When more than one argument is specified, time intervals are aggregated as specified. For instance, interval(0,2,8,16) aggregates data into the intervals 0 <= t < 2, 2 <= t < 8, 8 <= t < 16, and (if necessary) the open-ended interval t >= 16.

interval(w) is equivalent to interval(0,7,15,30,60,90,180,360,540,720), corresponding to one week, (roughly) two weeks, one month, two months, three months, six months, 1 year, 1.5 years, and 2 years when failure times are recorded in days. The w suggests widening intervals.
presents two X2 measures of the differences does nothing if by () is not specified.
between groups when by()
is specified, test
tvid(varname) is for use with longitudinal data with time-varying parameters as processed by cox; see [R] cox. Each subject appears in the data more than once and equal values of varname identify observations referring to the same subject. When tvid() is specified, only the last observation on each subject is used in making the table. The order of the data does not matter, and "last" here means the last observation chronologically.

noadjust suppresses the actuarial adjustment for deaths and censored observations. The default is to consider the adjusted number at risk at the start of the interval as the total at the start minus (the number of censored)/2. If noadjust is specified, the number at risk is simply the total at the start, corresponding to the standard Kaplan and Meier assumption. noadjust should be specified when using ltable to list results corresponding to those produced by sts list; see [R] st sts list.

notab suppresses displaying the table. This option is often used with graph.

graph requests that the table be presented graphically as well as in tabular form; when notab is also specified, only the graph is presented. When specifying graph, only one table can be calculated and graphed at a time; see survival, failure, and hazard above.

graph_options are any of the options allowed with graph, twoway; see [G] graph options. When noconf is specified, twoway's connect() and symbol() options may be specified with one argument; the default is connect(l) symbol(O). When noconf is not specified, the connect() and symbol() options may be specified with one or three arguments. The default is connect(l||) and symbol(Oii), drawing the confidence band as vertical lines at each point. When you specify one argument, you modify the first argument of the default. When you specify three, you completely control the graph. Thus, connect(lll) would draw the confidence band as separate curves around the survival, failure, or hazard.

noconf suppresses graphing the confidence intervals around survival, failure, or hazard.
Remarks

Life tables date back to the seventeenth century; Edmund Halley (1693) is often credited with their development. ltable is for use with "cohort" data and, although one often thinks of such tables as following a population from the "birth" of the first member to the "death" of the last, more generally, such tables can be thought of as a reasonable way to list any kind of survival data.

For an introductory discussion of life tables, see Pagano and Gauvreau (2000, 489-495); for an intermediate discussion, see, for example, Armitage and Berry (1994, 470-477) or Selvin (1996, 311-355); and for a more complete discussion, see Chiang (1984).
> Example

In Pike (1966), two groups of rats were exposed to a carcinogen and the number of days to death from vaginal cancer was recorded (reprinted in Kalbfleisch and Prentice 1980, 2):

Group 1   143  164  188  188  190  192  206  209  213  216  220  227  230  234
          246  265  304  216* 244*
Group 2   142  155  163  198  205  232  232  233  233  233  233  239  240  261
          280  280  296  296  323  204* 344*

The '*' on a few of the entries indicates that the observation was censored as of the recorded day; the rat had still not died due to vaginal cancer but was withdrawn from the experiment for other reasons.
Having entered these data into Stata, the first few observations are

. list in 1/5

          t   group   died
  1.    143       1      1
  2.    164       1      1
  3.    188       1      1
  4.    188       1      1
  5.    190       1      1
That is, the first observation records a rat from group 1 that died on the 143rd day. The variable died records whether that rat died or was withdrawn (censored):

. list if died==0

          t   group   died
 18.    216       1      0
 19.    244       1      0
 39.    204       2      0
 40.    344       2      0
Four rats, two from each group, did not die but were withdrawn. The survival table for group 1 is

. ltable t died if group==1

                 Beg.                                Std.
    Interval    Total   Deaths   Lost   Survival    Error    [95% Conf. Int.]
   143   144      19       1       0     0.9474    0.0512    0.6812    0.9924
   164   165      18       1       0     0.8947    0.0704    0.6408    0.9726
   188   189      17       2       0     0.7895    0.0935    0.5319    0.9153
   190   191      15       1       0     0.7368    0.1010    0.4789    0.8810
   192   193      14       1       0     0.6842    0.1066    0.4279    0.8439
   206   207      13       1       0     0.6316    0.1107    0.3790    0.8044
   209   210      12       1       0     0.5789    0.1133    0.3321    0.7626
   213   214      11       1       0     0.5263    0.1145    0.2872    0.7188
   216   217      10       1       1     0.4709    0.1151    0.2410    0.6713
   220   221       8       1       0     0.4120    0.1148    0.1937    0.6194
   227   228       7       1       0     0.3532    0.1125    0.1502    0.5648
   230   231       6       1       0     0.2943    0.1080    0.1105    0.5070
   234   235       5       1       0     0.2355    0.1012    0.0751    0.4259
   244   245       4       0       1     0.2355    0.1012    0.0751    0.4459
   246   247       3       1       0     0.1570    0.0931    0.0312    0.3721
   265   266       2       1       0     0.0785    0.0724    0.0056    0.2864
   304   305       1       1       0     0.0000
The reported survival rates are the survival rates at the end of the interval. Thus, 94.7% of rats survived 144 days or more.

> Example

When you do not specify the intervals, ltable uses unit intervals. The only aggregation performed on the data was aggregation due to deaths or withdrawals occurring on the same "day". If we wanted to see the table aggregated into 30-day intervals, we would type

. ltable t died if group==1, interval(30)

                 Beg.                                Std.
    Interval    Total   Deaths   Lost   Survival    Error    [95% Conf. Int.]
   120   150      19       1       0     0.9474    0.0512    0.6812    0.9924
   150   180      18       1       0     0.8947    0.0704    0.6408    0.9726
   180   210      17       6       0     0.5789    0.1133    0.3321    0.7626
   210   240      11       6       1     0.2481    0.1009    0.0847    0.4552
   240   270       4       2       1     0.1063    0.0786    0.0139    0.3090
   300   330       1       1       0     0.0000

The interval printed 120 150 means the interval including 120 and up to, but not including, 150. The reported survival rate is the survival rate just after the close of the interval. When you specify more than one number as the argument to interval(), you specify not the widths but the cutoff points themselves.
. ltable t died if group==1, interval(120,180,210,240,330)

                 Beg.                                Std.
    Interval    Total   Deaths   Lost   Survival    Error    [95% Conf. Int.]
   120   180      19       2       0     0.8947    0.0704    0.6408    0.9726
   180   210      17       6       0     0.5789    0.1133    0.3321    0.7626
   210   240      11       6       1     0.2481    0.1009    0.0847    0.4552
   240   330       4       3       1     0.0354    0.0486    0.0006    0.2245
If any of the underlying failure or censoring times are larger than the last cutoff specified, they are treated as being in the open-ended interval:

. ltable t died if group==1, interval(120,180,210,240)

                 Beg.                                Std.
    Interval    Total   Deaths   Lost   Survival    Error    [95% Conf. Int.]
   120   180      19       2       0     0.8947    0.0704    0.6408    0.9726
   180   210      17       6       0     0.5789    0.1133    0.3321    0.7626
   210   240      11       6       1     0.2481    0.1009    0.0847    0.4552
   240    .        4       3       1     0.0354    0.0486    0.0006    0.2245

Whether the last interval is treated as open-ended or not makes no difference for survival and failure tables, but it does affect hazard tables. If the interval is open-ended, the hazard is not calculated for it.
> Example

The by(varname) option specifies that separate tables are to be presented for each value of varname. Remember that our rat dataset contains two groups:

. ltable t died, by(group) interval(30)

                 Beg.                                Std.
    Interval    Total   Deaths   Lost   Survival    Error    [95% Conf. Int.]
group = 1
   120   150      19       1       0     0.9474    0.0512    0.6812    0.9924
   150   180      18       1       0     0.8947    0.0704    0.6408    0.9726
   180   210      17       6       0     0.5789    0.1133    0.3321    0.7626
   210   240      11       6       1     0.2481    0.1009    0.0847    0.4552
   240   270       4       2       1     0.1063    0.0786    0.0139    0.3090
   300   330       1       1       0     0.0000
group = 2
   120   150      21       1       0     0.9524    0.0465    0.7072    0.9932
   150   180      20       2       0     0.8571    0.0764    0.6197    0.9516
   180   210      18       2       1     0.7592    0.0939    0.5146    0.8920
   210   240      15       7       0     0.4049    0.1099    0.1963    0.6053
   240   270       8       2       0     0.3037    0.1031    0.1245    0.5057
   270   300       6       4       0     0.1012    0.0678    0.0172    0.2749
   300   330       2       1       0     0.0506    0.0493    0.0035    0.2073
   330   360       1       0       1     0.0506    0.0493    0.0035    0.2073
> Example

A failure table is simply a different way of looking at a survival table: failure is 1 - survival:

. ltable t died if group==1, interval(30) failure

                  Beg.                    Cum.      Std.
     Interval     Total   Deaths   Lost  Failure    Error    [95% Conf. Int.]

   120    150        19        1      0    0.0526  0.0512     0.0076   0.3188
   150    180        18        1      0    0.1053  0.0704     0.0274   0.3592
   180    210        17        6      0    0.4211  0.1133     0.2374   0.6679
   210    240        11        6      1    0.7519  0.1009     0.5448   0.9153
   240    270         4        2      1    0.8937  0.0786     0.6910   0.9861
   300    330         1        1      0    1.0000       .          .        .
q
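The failure column above can be reproduced by hand. The following Python sketch (not Stata code) assumes ltable's actuarial convention, in which patients censored during an interval count as at risk for half of it:

```python
# Sketch of the cumulative-failure calculation behind the table above.
# Assumption: within an interval, the effective number at risk is
# (beginning total) - (lost)/2, the standard actuarial adjustment.

def cum_failure(rows):
    """rows: (beg_total, deaths, lost) per interval.
    Returns the cumulative failure (1 - survival) after each interval."""
    survival, out = 1.0, []
    for beg, deaths, lost in rows:
        at_risk = beg - lost / 2.0          # actuarial censoring adjustment
        survival *= 1.0 - deaths / at_risk  # conditional survival this interval
        out.append(round(1.0 - survival, 4))
    return out

# group==1 rat data, interval(30), taken from the table above
rows = [(19, 1, 0), (18, 1, 0), (17, 6, 0), (11, 6, 1), (4, 2, 1), (1, 1, 0)]
print(cum_failure(rows))  # [0.0526, 0.1053, 0.4211, 0.7519, 0.8937, 1.0]
```

The printed values match the Cum. Failure column of the table.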
> Example

Selvin (1996, 332) presents follow-up data from Cutler and Ederer (1958) on six cohorts of kidney cancer patients. The goal is to estimate the 5-year survival probability.

                                       With-
  Year   Interval   Alive  Deaths  Lost  drawn
  1946   0-1            9       4     1
         1-2            4       0     0
         2-3            4       0     0
         3-4            4       0     0
         4-5            4       0     0
         5-6            4       0     0      4
  1947   0-1           18       7     0
         1-2           11       0     0
         2-3           11       1     0
         3-4           10       2     2
         4-5            6       0     0      6
  1948   0-1           21      11     0
         1-2           10       1     2
         2-3            7       0     0
         3-4            7       0     0      7
  1949   0-1           34      12     0
         1-2           22       3     3
         2-3           16       1     0     15
  1950   0-1           19       5     1
         1-2           13       1     1     11
  1951   0-1           25       8     2     15

The following is the Stata dataset corresponding to the table:

. list

        year     t   died   pop
   1.   1946    .5      1     4
   2.   1946    .5      0     1
   3.   1946   5.5      0     4
   4.   1947    .5      1     7
   5.   1947   2.5      1     1
  etc.
As summary data may often come in the form shown above, it is worth understanding exactly how the data were translated for use with ltable. t records the time of death or censoring (lost to follow-up or withdrawal), died contains 1 if the observation records a death and 0 if it instead records lost or withdrawn patients, and pop records the number of patients in the category. The first line of the table stated that, in the 1946 cohort, there were 9 patients at the start of the interval 0-1, and during the interval, 4 died and 1 was lost to follow-up. Thus, we entered in observation 1 that at t = .5, 4 patients died and, in observation 2, that at t = .5, 1 patient was censored. We ignored the information on the total population because ltable will figure that out for itself.
The second line of the table indicated that in the interval 1-2, 4 patients were still alive at the beginning of the interval and, during the interval, 0 died or were lost to follow-up. Since no patients died or were censored, we entered nothing into our data. Similarly, we entered nothing for lines 3, 4, and 5 of the table. The last line for 1946 stated that, in the interval 5-6, 4 patients were alive at the beginning of the interval and that those 4 patients were withdrawn. In observation 3, we entered that there were 4 censorings at t = 5.5.

The fact that we chose to record the times of deaths or censorings as midpoints of intervals does not matter; we could just as well have recorded the times as 0.8 and 5.8. By default, ltable will form intervals 0-1, 1-2, and so on, and place observations into the intervals to which they belong. We suggest using 0.5 and 5.5 because those numbers correspond to the underlying assumptions made by ltable in making its calculations. Using midpoints reminds you of the assumptions.
To obtain the survival rates, we type

. ltable t died [freq=pop]

                  Beg.                            Std.
     Interval     Total   Deaths   Lost  Survival   Error    [95% Conf. Int.]

     0      1       126       47     19    0.5966  0.0455     0.5017   0.6792
     1      2        60        5     17    0.5386  0.0479     0.4405   0.6269
     2      3        38        2     15    0.5033  0.0508     0.4002   0.5977
     3      4        21        2      9    0.4423  0.0602     0.3225   0.5554
     4      5        10        0      6    0.4423  0.0602     0.3225   0.5554
     5      6         4        0      4    0.4423  0.0602     0.3225   0.5554
We estimate the 5-year survival rate as .4423 and the 95% confidence interval as .3225 to .5554. Selvin (1996, 336), in presenting these results, lists the survival in the interval 0-1 as 1, in 1-2 as .597, in 2-3 as .539, and so on. That is, relative to us, he shifted the rates down one row and inserted a 1 in the first row. In his table, the survival rate is the survival rate at the start of the interval. In our table, the survival rate is the survival rate at the end of the interval (or, equivalently, at the start of the next interval). This is, of course, simply a difference in the way the numbers are presented and not in the numbers themselves.

q
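The cumulative survival column can be checked by hand. The following Python sketch (not Stata code) applies the actuarial (Cutler-Ederer) estimate, assuming the standard convention that patients censored during an interval are at risk for half of it:

```python
# A minimal sketch of the actuarial survival estimate behind the table above.
# Assumption: censored patients count as exposed for half the interval, so the
# effective denominator in each interval is (beg_total) - (censored)/2.

def life_table_survival(rows):
    """rows: (beg_total, deaths, censored) per interval.
    Returns cumulative survival at the end of each interval."""
    s, out = 1.0, []
    for beg, d, c in rows:
        s *= 1.0 - d / (beg - c / 2.0)
        out.append(round(s, 4))
    return out

# Aggregated kidney-cancer counts from the table above
rows = [(126, 47, 19), (60, 5, 17), (38, 2, 15), (21, 2, 9), (10, 0, 6), (4, 0, 4)]
print(life_table_survival(rows))
# [0.5966, 0.5386, 0.5033, 0.4423, 0.4423, 0.4423] -- 5-year survival .4423
```

The final entries reproduce the .4423 five-year survival estimate quoted above.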
> Example

The discrete hazard function is the rate of failure: the number of failures occurring within a time interval divided by the width of the interval (assuming there are no censored observations). While the survival and failure tables are meaningful at the "individual" level, with intervals so narrow that each contains only a single failure, that is not true for the discrete hazard. If all intervals contained one death and if all intervals were of equal width, the hazard function would be 1/Delta-t and so appear to be a constant!

The empirically determined discrete hazard function can only be revealed by aggregation. Gross and Clark (1975, 37) print data on malignant melanoma at the M. D. Anderson Tumor Clinic between 1944 and 1960. The interval is the time from initial diagnosis:
  Interval   Number   Number lost     Number
  (years)    dying    to follow-up    withdrawn alive
   0-1         312         19               77
   1-2          96          3               71
   2-3          45          4               58
   3-4          29          3               27
   4-5           7          5               35
   5-6           9          1               36
   6-7           3          0               17
   7-8           1          2               10
   8-9           3          0                8
   9+           32          0                0
For our statistical purposes, there is no difference between the number lost to follow-up (patients who disappeared) and the number withdrawn alive (patients dropped by the researchers); both are censored. We have entered the data into Stata; here is a small amount of it:

. list

         t   d   pop
   1.   .5   1   312
   2.   .5   0    19
   3.   .5   0    77
   4.  1.5   1    96
   5.  1.5   0     3
   6.  1.5   0    71
We entered each group's time of death or censoring as the midpoint of the intervals, recording d as 1 for deaths and 0 for censorings. The hazard table is

. ltable t d [freq=pop], hazard interval(0,1,2,3,4,5,6,7,8,9)
                  Beg.     Cum.      Std.             Std.
     Interval     Total   Failure    Error   Hazard   Error   [95% Conf. Int.]

     0      1       913    0.3607   0.0163   0.4401  0.0243    0.3924   0.4877
     1      2       505    0.4918   0.0176   0.2286  0.0232    0.1831   0.2740
     2      3       335    0.5671   0.0182   0.1599  0.0238    0.1133   0.2064
     3      4       228    0.6260   0.0188   0.1461  0.0271    0.0931   0.1991
     4      5       169    0.6436   0.0190   0.0481  0.0182    0.0125   0.0837
     5      6       122    0.6746   0.0200   0.0909  0.0303    0.0316   0.1502
     6      7        76    0.6890   0.0208   0.0455  0.0262    0.0000   0.0969
     7      8        56    0.6952   0.0213   0.0202  0.0202    0.0000   0.0598
     8      9        43    0.7187   0.0235   0.0800  0.0462    0.0000   0.1705
     9      .        32    1.0000        .        .       .         .        .
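The Hazard column can be checked by hand. The following Python sketch (not Stata code) assumes the within-interval hazard estimator commonly used with actuarial tables: deaths divided by person-intervals at risk, where both the censored observations and the deaths themselves are discounted by half an interval.

```python
# Hedged sketch of the interval hazard in the table above. Assumptions:
# censored observations count as at risk for half the interval, and deaths
# occur, on average, at the middle of the interval, so they too contribute
# half an interval of exposure.

def interval_hazard(beg, deaths, censored, width=1.0):
    at_risk = beg - censored / 2.0                     # actuarial adjustment
    return deaths / (width * (at_risk - deaths / 2.0))  # deaths per unit time

# First three intervals of the melanoma data (lost + withdrawn are pooled)
for beg, d, c in [(913, 312, 19 + 77), (505, 96, 3 + 71), (335, 45, 4 + 58)]:
    print(round(interval_hazard(beg, d, c), 4))  # 0.4401, 0.2286, 0.1599
```

The printed values reproduce the first three entries of the Hazard column.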
We specified the interval() option as we did, and not as interval(1) (or omitting the option altogether), to force the last interval to be open-ended. Had we not, and had we recorded t as 9.5 for observations in that interval (as we did), ltable would have calculated a hazard rate for the "interval". In this case, the result of that calculation would have been 2, but no matter what the result, it would have been meaningless since we do not know the width of the interval.

You are not limited to merely examining a column of numbers. With the graph option, you can see the result graphically:

. ltable t d [freq=pop], hazard i(0,1,2,3,4,5,6,7,8,9) graph notab
> xlab(0,2,4,6,8,10) border
(figure omitted: estimated hazard plotted against time (years), with 95% confidence intervals)
The vertical lines in the graph represent the 95% confidence intervals for the hazard; specifying noconf would have suppressed them. Among the options we did specify, although it is not required, notab suppressed printing the table, saving us some paper. xlab() and border were passed through to graph.

q
> Example

You can graph the survival function the same way you graph the hazard function: just omit the hazard option.
q
Methods and Formulas

ltable is implemented as an ado-file.

Let tau_i be the individual failure or censoring times. The data are aggregated into intervals given by t_j, j = 1, ..., J, and t_{J+1} = infinity, with each interval containing counts for t_j <= tau < t_{j+1}.

merge -- Merge datasets

> Example

You have two datasets stored on disk that you wish to merge into a single dataset. The first dataset, called odd.dta, contains the first five positive odd numbers. The second dataset, called even.dta, contains the fifth through eighth positive even numbers. (Our example is admittedly not realistic, but it does illustrate the concept.) The datasets are

. use odd
(First five odd numbers)
. list

        number   odd
   1.        1     1
   2.        2     3
   3.        3     5
   4.        4     7
   5.        5     9

. use even
(5th through 8th even numbers)
. list

        number   even
   1.        5    10
   2.        6    12
   3.        7    14
   4.        8    16
We will join these two datasets using a one-to-one merge. Since the even dataset is already in memory (we just used it above), we type merge using odd. The result is

. merge using odd
number was int now float
. list

        number   even   odd   _merge
   1.        5     10     1        3
   2.        6     12     3        3
   3.        7     14     5        3
   4.        8     16     7        3
   5.        5      .     9        2
The first thing you will notice is the new variable _merge. Every time Stata merges two datasets, it creates this variable and assigns a value of 1, 2, or 3 to each observation. The value 1 indicates that the resulting observation occurred only in the master dataset, 2 indicates the observation occurred only in the using dataset, and 3 indicates the observation occurred in both datasets and is thus the result of joining an observation from the master dataset with an observation from the using dataset. In this case, the first four observations are marked by _merge equal to 3, and the last observation by _merge equal to 2. The first four observations are the result of joining observations from the two datasets, and the last observation is the result of adding a new observation from the using dataset. These values reflect the fact that the original dataset in memory had four observations, and the odd dataset stored on disk had five observations. The new last observation is from the odd dataset exclusively: number is 5, odd is 9, and even has been filled in with missing.

Notice that number takes on the values 5 through 8 for the first four observations. Those are the values of number from the original dataset in memory (the even dataset) and conflict with the values of number stored in the first four observations of the odd dataset. number in that dataset took on the values 1 through 4, and those values were lost during the merge process. When Stata joins observations and there is a conflict between the value of a variable in memory and the value stored in the using dataset, Stata by default retains the value stored in memory.
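The one-to-one semantics just described can be sketched in Python (an illustration only, not Stata code): observations are joined by position, the master's values win on conflict, and _merge records where each observation came from.

```python
# Sketch of one-to-one merge semantics as described above. Assumption:
# observations are represented as dicts; None-padding and the _merge codes
# (1 = master only, 2 = using only, 3 = both) mimic the text's description.

def merge_one_to_one(master, using):
    """master, using: lists of dicts. Returns the merged list of dicts."""
    out = []
    for i in range(max(len(master), len(using))):
        m = master[i] if i < len(master) else None
        u = using[i] if i < len(using) else None
        row = dict(u or {})
        row.update(m or {})  # on conflict, master values take precedence
        row["_merge"] = 3 if m and u else (1 if m else 2)
        out.append(row)
    return out

even = [{"number": n, "even": 2 * n} for n in (5, 6, 7, 8)]
odd = [{"number": n, "odd": 2 * n - 1} for n in (1, 2, 3, 4, 5)]
result = merge_one_to_one(even, odd)
print(result)  # row 1 keeps number 5 from the master; row 5 has _merge 2
```

As in the listing above, the fifth row comes from the using data alone, and the master's conflicting number values survive in the first four rows.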
When the command merge using odd was issued, Stata responded with "number was int now float". Let's describe the datasets in this example:

. describe using odd
Contains data                    First five odd numbers
  obs:            5              5 Jul 2000 17:03
 vars:            2

              storage  display    value
variable name   type   format     label      variable label
number          float  %9.0g
odd             float  %9.0g                 Odd numbers

Sorted by:

. describe using even

Contains data                    5th through 8th even numbers
  obs:            4              11 Jul 2000 14:12
 vars:            2
 size:           40

              storage  display    value
variable name   type   format     label      variable label
number          int    %8.0g
even            float  %9.0g                 Even numbers

Sorted by:

Note that number is stored as a float in odd.dta but as an int in even.dta; see [U] 15.2.2 Numeric storage types. When you merge two datasets, Stata engages in automatic variable promotion; that is, if there are conflicts in numeric storage types, the more precise storage type is used. The resulting dataset, therefore, has number stored as a float, and Stata told you this when it said "number was int now float".
Match merge

In a match merge, observations are joined if the values of the variables in the varlist are the same. Since the values must be the same, obviously the variables in the varlist must appear in both the master and the using datasets.

A match merge proceeds by taking an observation from the master dataset and one from the using dataset and comparing the values of the variables in the varlist. If the varlist values match, the observations are joined. If the varlist values do not match, the observation from the earlier dataset (the dataset whose varlist value comes first in the sort order) is joined with a pseudo-observation from the later dataset (the other dataset). All the variables in the pseudo-observation contain missing values. The actual observation from the later dataset is retained and compared with the next observation in the earlier dataset, and the process repeats.
> Example

The result is not nearly so incomprehensible as the explanation. Let's return to the datasets used in the previous example and merge the two datasets on the variable number. We first use the even dataset and then type merge number using odd:

. use even
(5th through 8th even numbers)
. merge number using odd
master data not sorted
r(5);

Instead of merging the datasets, Stata reports the error message "master data not sorted". Match merges require that the data be sorted in the order of the varlist, which in this case means ascending order of number. If you look at the previous example, you will observe that the data are in such an order, so the message is more than a little confusing. Before Stata can merge two datasets, however, the data must not only be sorted but Stata must know that they are sorted. The basis of Stata's knowledge is the internal information it keeps on the sort order, and Stata reveals the extent of its knowledge whenever you describe the dataset:

. describe
Contains data from even.dta
  obs:            4              5th through 8th even numbers
 vars:            2              11 Jul 2000 14:12
 size:           40              (99.9% of memory free)

              storage  display    value
variable name   type   format     label      variable label
number          int    %8.0g
even            float  %9.0g                 Even numbers

Sorted by:
The last line of the description shows that the data are "Sorted by:" nothing. We tell Stata to sort the data (or to learn that it is already sorted) with the sort command:

. sort number
. describe

Contains data from even.dta
  obs:            4              5th through 8th even numbers
 vars:            2              11 Jul 2000 14:12
 size:           40              (99.8% of memory free)

              storage  display    value
variable name   type   format     label      variable label
number          int    %8.0g
even            float  %9.0g                 Even numbers

Sorted by:  number

Now when we describe the dataset, Stata informs us that the data are sorted by number. Now that Stata knows the data are sorted, let's try again:

. merge number using odd
using data not sorted
r(5);
Stata still refuses to carry out our request, this time complaining that the using data are not sorted. Both datasets, the master and the using, must be in ascending order of number before Stata can perform a merge.

As before, if you look at the previous example you will discover that odd.dta is in ascending order of number, but as before, Stata does not know this yet. We need to save the data we just sorted, use the odd data, sort it, and re-save it:

. save even, replace
file even.dta saved
. use odd
(First five odd numbers)
. sort number
. save odd, replace
file odd.dta saved
Now we should be able to merge the two datasets:

. use even
(5th through 8th even numbers)
. merge number using odd
number was int now float
. list

        number   even   odd   _merge
   1.        5     10     9        3
   2.        6     12     .        1
   3.        7     14     .        1
   4.        8     16     .        1
   5.        1      .     1        2
   6.        2      .     3        2
   7.        3      .     5        2
   8.        4      .     7        2
It worked! Let's understand what happened. Even though both datasets were sorted by number, we immediately discern that the result is no longer in ascending order of number. It will be easier to understand what happened if we re-sort the data and then list the data again:

. sort number
. list

        number   even   odd   _merge
   1.        1      .     1        2
   2.        2      .     3        2
   3.        3      .     5        2
   4.        4      .     7        2
   5.        5     10     9        3
   6.        6     12     .        1
   7.        7     14     .        1
   8.        8     16     .        1

Notice that number now goes from 1 to 8, with no repeated values and no values left out of the sequence. Recall that the odd dataset defined observations for number between 1 and 5, whereas the even dataset defined observations between 5 and 8. Thus, the variable odd is defined for number equal to 1 through 5, and even is defined for number equal to 5 through 8.

For instance, in the first observation, number is 1, even is missing, and odd is 1. The value of _merge, 2, indicates that this observation came from the using dataset, odd.dta. In the last observation, number is 8, even is 16, and odd is missing. The value of _merge, 1, indicates that this observation came from the master dataset, even.dta.
The fifth observation is worth comment. number is 5, even is 10, and odd is 9. Both even and odd are defined, since both the even and the odd datasets had information for number equal to 5. The value of _merge, 3, also tells us that both datasets contributed to the formation of the observation.

q
> Example

Although the previous example demonstrated, in glorious detail, how the match-merging process works, it was not a practical example of how you will ordinarily employ it. Here is a more realistic application. You have two datasets containing information on automobiles. The identifying variable in each dataset is make, a string variable containing the manufacturer and the model. By identifying variable, we mean a variable that is unique for every observation in the dataset. Values for make (for instance, "Honda Accord") are sufficient for identifying each observation. One dataset, autotech.dta, also contains mpg, weight, and length. The other dataset, autocost.dta, contains price and rep78, the 1978 repair record.

. describe using autotech

Contains data                    1978 Automobile Data
  obs:           74              11 Jul 2000 13:55
 vars:            4
 size:        2,072

              storage  display    value
variable name   type   format     label      variable label
make            str18  %18s                  Make and Model
mpg             int    %8.0g                 Mileage (mpg)
weight          int    %8.0g                 Weight (lbs.)
length          int    %8.0g                 Length (in.)

Sorted by:  make
. describe using autocost

Contains data                    1978 Automobile Data
  obs:           74              11 Jul 2000 13:55
 vars:            3
 size:        1,924

              storage  display    value
variable name   type   format     label      variable label
make            str18  %18s                  Make and Model
price           int    %8.0g                 Price
rep78           int    %8.0g                 Repair Record 1978

Sorted by:  make
We assume that you want to merge these two datasets into a single dataset:

. use autotech
(Automobile Models)
. merge make using autocost
Let's now examine the result:

. describe

Contains data from autotech.dta
  obs:           74              1978 Automobile Data
 vars:            7              11 Jul 2000 13:55
 size:        2,442              (99.6% of memory free)

              storage  display    value
variable name   type   format     label      variable label
make            str18  %18s                  Make and Model
mpg             int    %8.0g                 Mileage (mpg)
weight          int    %8.0g                 Weight (lbs.)
length          int    %8.0g                 Length (in.)
price           int    %8.0g                 Price
rep78           int    %8.0g                 Repair Record 1978
_merge          byte   %8.0g

Sorted by:
Note:  dataset has changed since last saved
We have a single dataset containing all the information from the two original datasets, or at least it appears that we do. Before accepting that conclusion, we need to verify the result. We think that we entered data for the same cars in each dataset, so every variable should be defined for every car. Although we know it is unlikely, we recognize the possibility that we made a mistake and accidentally left some cars out of one or the other dataset. We can reassure ourselves of our infallibility by tabulating _merge:

. tabulate _merge

     _merge |      Freq.     Percent        Cum.
------------+-----------------------------------
          3 |         74      100.00      100.00
------------+-----------------------------------
      Total |         74      100.00
We see that _merge is 3 for every observation in the dataset. We made no mistake; for every observation in autocost.dta, there was an observation in autotech.dta, and vice versa.
Now pretend that we have another dataset containing additional information on these automobiles, automore.dta, and we want to merge that dataset as well. Before we can do so, we must sort the data we have in memory by make, since after a merge the sort order may have changed:

. sort make
. merge make using automore
_merge already defined
r(110);

After sorting the data, Stata refused to merge the new dataset, complaining instead that _merge is already defined. Every time Stata merges datasets, it wants to create a variable called _merge (or varname if the _merge(varname) option was specified). In this case, there is an _merge variable left over from the last time we merged. We have three choices: we can rename the variable, we can drop it, or we can specify a different variable name with the _merge() option. In this case _merge contains no useful information (we already verified that the previous merge went as expected), so we drop it and try again:

. drop _merge
. merge make using automore

Stata performed our request; whatever new variables were contained in automore.dta are now contained in our single, master dataset. Perhaps. One should not jump to conclusions. After a match merge, you should always tabulate _merge to verify that the expected actually happened, as we do below:

. tabulate _merge

     _merge |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1        1.33        1.33
          2 |          1        1.33        2.67
          3 |         73       97.33      100.00
------------+-----------------------------------
      Total |         75      100.00

Surprise! In this case something strange did happen. Some 73 of the observations merged as we anticipated. However, the new dataset automore.dta added one new car to the dataset (identified by _merge equal to 2) and failed to define new variables for another car in our original dataset (identified by _merge equal to 1). Perhaps this is what should happen, but it is more likely that we have a mistake in automore.dta. We probably misidentified one car so that to Stata it appeared as data on a new car, resulting in one new observation and missing data on another. If this happened to us, we would figure out why it happened. We would type list make if _merge==1 to learn the identity of the car that did not appear in automore.dta, and we would type list make if _merge==2 to learn the identity of the car that automore.dta added to our data.

q
Technical Note

It is difficult to overemphasize the importance of tabulating _merge no matter how sure you are that you have no errors. It takes only a second and can save you hours of grief.

Along the same lines, one-to-one merges are a bad idea. In the example above, we could have performed all the merges as one-to-one merges and saved a small amount of typing. Let's examine what would have happened. We first merged autotech.dta with autocost.dta by typing merge make using autocost. We could have performed a one-to-one merge by typing merge using autocost. The result would be the same; the datasets line up and are in the same sort order, so sequentially matching the observations from the two datasets would have resulted in a perfectly matched dataset.

In the second case, we merged the data in memory with automore.dta by typing merge make using automore. A one-to-one merge would have led to disaster, and we would never have known it! If we typed merge using automore, Stata would sequentially, and blindly, join observations. Since there are the same number of observations in each dataset, everything would appear to merge perfectly.

We speculated in the previous example that we had an error in automore.dta. Remember that automore.dta included data on one new car and lacked data on an existing car. Even if there were no error, things would have gone awry. No matter what, the data in memory and automore.dta do not match. For instance, assume that this new car is the first observation of automore.dta and that it is some (perhaps mistaken) model of Ford. Assume that the first observation of the data in memory is on a Chevrolet. Stata could and would silently join data on the Chevrolet with data on the Ford, and thereafter data on a Volvo with data on a Saab, and even data on a Volkswagen with data on a Cadillac, and you would never know.

Every dataset should carry a variable or a set of variables that uniquely identifies each observation, and then you should always use those variables when merging data. Ignore this advice at your own peril.
Technical Note

Circumstances may arise when you will merge two datasets knowing there will be mismatches. Say you have an analysis dataset on patients from the cancer ward of a particular hospital, and you have just received another dataset containing their demographic information. Actually, this other dataset contains not just their demographic information but the demographic information on every patient in the hospital during the year. You could

. merge patid using demog
. drop if _merge==2

or

. merge patid using demog, nokeep

The nokeep option tells merge not to store observations from the using data that do not appear in the master. There is an advantage in this. When we merged and dropped, we stored the irrelevant observations and then discarded them, so the data in memory temporarily grew. When we merge with the nokeep option, the data never grow beyond what is absolutely necessary.

In our automobile example, we had a single identifying variable. Sometimes you will have multiple identifying variables: variables that, taken together, are unique for every observation. Let's imagine that, rather than having a single variable called make, we had two variables: manuf and model. manuf contains the manufacturer and model contains the model. Rather than having a single variable recording, say, "Honda Accord", we have two variables, one recording "Honda" and another recording "Accord". Stata can deal with this type of data. You can go back through our previous example and substitute manuf model everywhere you see make. For instance, rather than typing merge make using autocost, we would have typed merge manuf model using autocost.
Now let's make one more change in our assumptions. Let's assume that manuf and model are not string variables but are instead numerically coded variables. Perhaps the number 15 stands for Honda in the manuf variable and the number 2 stands for Accord in the model variable. We do not have to remember our numeric codes because we have smartly created value labels telling Stata what number stands for what string of characters. We now go back to the step where we merged autotech.dta with autocost.dta:

. use autotech
(Automobile models)
. merge manuf model using autocost
(label manuf already defined)
(label model already defined)

Stata makes two minor comments but otherwise carries out our request. It notes that the labels manuf and model are already defined. The messages refer to the value labels named manuf and model. Both datasets contain value label definitions that turn the numeric codes for manufacturer and model into words. When Stata merged the two datasets, it already had one set of definitions in memory (obtained when we typed use autotech) and thus ignored the second set of definitions contained in autocost.dta. Stata felt obliged to mention the second set of definitions while otherwise ignoring them, since they might contain different codings. In this case, we know they are the same since we created them. (Hint: You should never give the same name to value labels containing different codings.)
When performing a match merge, the master and/or using datasets may have multiple observations with the same varlist value. These multiple observations are joined sequentially, as in a one-to-one merge. If the datasets have an unequal number of observations with the same varlist value, the last such observation in the shorter dataset is replicated until the number of observations is equal.

> Example

The process of replicating the observation from the shorter dataset is known as spreading and can be put to practical use. Suppose you have two datasets. dollars.dta contains the dollar sales and costs of your firm, by region, for the last year:

. use dollars
(Regional Sales & Costs)
. list

        region      sales      cost
   1.   NE        360,523   138,097
   2.   N Cntrl   419,472   227,677
   3.   South     532,399   330,499
   4.   West      310,565   165,348

sforce.dta contains the names of the individuals in your sales force, along with the region in which they operate:

. use sforce
(Sales Force)
. list

        region    name
   1.   NE        Ecklund
   2.   NE        Franks
   3.   N Cntrl   Krantz
   4.   N Cntrl   Phipps
   5.   N Cntrl   Willis
   6.   South     Anderson
   7.   South     Dubnoff
   8.   South     Lee
   9.   South     McNiel
  10.   West      Charles
  11.   West      Grant
  12.   West      Cobb
You now wish to merge these two datasets by region, spreading the sales and cost information across all observations for which it is relevant; that is, you want to add the variables sales and cost to the sales force data. The variable sales will assume the value $360,523 for the first two observations, $419,472 for the next three observations, and so on.
. merge region using dollars
(label region already defined)
. list

        region    name        sales      cost   _merge
   1.   NE        Ecklund   360,523   138,097        3
   2.   NE        Franks    360,523   138,097        3
   3.   N Cntrl   Krantz    419,472   227,677        3
   4.   N Cntrl   Phipps    419,472   227,677        3
   5.   N Cntrl   Willis    419,472   227,677        3
   6.   South     Anderson  532,399   330,499        3
   7.   South     Dubnoff   532,399   330,499        3
   8.   South     Lee       532,399   330,499        3
   9.   South     McNiel    532,399   330,499        3
  10.   West      Charles   310,565   165,348        3
  11.   West      Grant     310,565   165,348        3
  12.   West      Cobb      310,565   165,348        3

Even though there are 12 observations in the sales force data and only 4 observations in the sales and cost data, all the records merged. dollars.dta contained one observation for the NE region; sforce.dta contained two observations for the same region. Thus, the single observation in dollars.dta was matched to both of the observations in sforce.dta. In technical jargon, the single record in dollars.dta was replicated, or spread, across the observations in sforce.dta.

q
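Spreading is, in effect, a many-to-one match merge. The following Python sketch (not Stata code) illustrates the idea with a dictionary lookup; it assumes the shorter dataset has unique keys, which is the situation in the example above:

```python
# Sketch of spreading: every master observation with a given key receives a
# copy of the single matching observation from the shorter (using) dataset.
# Assumption: keys are unique in `using`, as in the dollars.dta example.

def spread_merge(master, using, key):
    lookup = {u[key]: u for u in using}
    out = []
    for m in master:
        matched = m[key] in lookup
        row = {**lookup.get(m[key], {}), **m, "_merge": 3 if matched else 1}
        out.append(row)
    return out

sforce = [{"region": "NE", "name": "Ecklund"},
          {"region": "NE", "name": "Franks"},
          {"region": "West", "name": "Cobb"}]
dollars = [{"region": "NE", "sales": 360523},
           {"region": "West", "sales": 310565}]
for r in spread_merge(sforce, dollars, "region"):
    print(r["name"], r["sales"])  # both NE rows get the same sales figure
```

Both NE salespeople end up with the single NE sales figure, mirroring the spread shown in the listing above.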
Updating data

merge with the update option varies merge's actions when an observation in the master is matched with an observation in the using dataset. Without the update option, merge leaves the values in the master dataset alone and adds the data for the new variables. With the update option, merge adds the new variables, but it also replaces missing values in the master observation with corresponding values from the using. (Missing values mean numeric missing (.) and empty strings ("").) The values for _merge are extended:

  _merge   meaning
       1   obs. from master data
       2   obs. from using data
       3   obs. from both, master agrees with using
       4   obs. from both, missing in master updated
       5   obs. from both, master disagrees with using

In the case of _merge = 5, the master values are retained unless replace is specified, in which case the master values are updated just as if they had been missing.

Pretend dataset 1 contains variables id, a, and b; dataset 2 contains id, a, and z. You merge the two datasets by id, dataset 1 being the master dataset in memory and dataset 2 the using dataset on disk. Consider two observations that match, and call the values from the first dataset id1, etc., and those from the second id2, etc. The resulting dataset will have variables id, a, b, z, and _merge. merge's typical logic is

1. The fact that the observations match means id1 = id2. Set id = id1.
2. Variable a occurs in both datasets. Ignore a2, and set a = a1.
3. Variable b occurs in only dataset 1. Set b = b1.
4. Variable z occurs in only dataset 2. Set z = z2.
5. Set _merge = 3.

With update, the logic is modified:

1. (Unchanged.) Since the observations match, id1 = id2. Set id = id1.
2. Variable a occurs in both datasets:
   a. If a1 = a2, set a = a1 and set _merge = 3.
   b. If a1 contains missing and a2 is nonmissing, set a = a2 and set _merge = 4, indicating an update was made.
   c. If a2 contains missing, set a = a1 and set _merge = 3 (indicating no update).
   d. If a1 != a2 and both contain nonmissing, set a = a1 or, if replace was specified, a = a2; but, regardless, set _merge = 5, indicating a disagreement.

Rules 3 and 4 remain unchanged.
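The per-variable rules above can be sketched directly in Python (an illustration only, not Stata code), with None playing the role of Stata's missing value:

```python
# Sketch of the update/replace logic spelled out above, for a single variable
# that occurs in both datasets. Assumption: None stands in for Stata's
# missing value (. or "").

def update_value(master_val, using_val, replace=False):
    """Return (new_value, _merge) per rules 2a-2d above."""
    if master_val == using_val:
        return master_val, 3                       # 2a: agreement
    if master_val is None and using_val is not None:
        return using_val, 4                        # 2b: missing in master, updated
    if using_val is None:
        return master_val, 3                       # 2c: nothing to update
    # 2d: both nonmissing and different: disagreement
    return (using_val if replace else master_val), 5

# mpg pairs (original, updates) from the automobile example that follows
pairs = [(29, None), (None, 22), (24, 24), (None, 14), (19, 19), (26, 25), (23, 23)]
print([update_value(m, u) for m, u in pairs])
# [(29, 3), (22, 4), (24, 3), (14, 4), (19, 3), (26, 5), (23, 3)]
```

With replace=True, the disagreement case (26, 25) would yield (25, 5) instead, matching the manual's description of replace.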
> Example

In original.dta you have data on some cars, including the make, price, and mileage rating. In updates.dta you have some updated data on these cars, along with a new variable recording engine displacement. The data contain

. use original, clear
(original data)
. list

        make              price   mpg
   1.   Chev. Chevette    3,299    29
   2.   Chev. Malibu      4,504     .
   3.   Datsun 510        5,079    24
   4.   Merc. XR-7        6,303     .
   5.   Olds Cutlass      4,733    19
   6.   Renault Le Car    3,895    26
   7.   VW Dasher         7,140    23

. use updates, clear
(updates, mpg and displacement)
. list

        make              mpg   displac~t
   1.   Chev. Chevette      .         231
   2.   Chev. Malibu       22         200
   3.   Datsun 510         24         119
   4.   Merc. XR-7         14         302
   5.   Olds Cutlass       19         231
   6.   Renault Le Car     25          79
   7.   VW Dasher          23          97
By updating our data, we obtain

. use original, clear
(original data)

. merge make using updates, update

. list
            make      price    mpg    displac~t    _merge
  1. Chev. Chevette   3,299     29          231         3
  2. Chev. Malibu     4,504     22          200         4
  3. Datsun 510       5,079     24          119         3
  4. Merc. XR-7       6,303     14          302         4
  5. Olds Cutlass     4,733     19          231         3
  6. Renault Le Car   3,895     26           79         5
  7. VW Dasher        7,140     23           97         3
All observations merged, because all have _merge >= 3. The observations having _merge = 3 have mpg just as it was recorded in the original dataset. In observation 1, mpg is 29 because the updated dataset had mpg = .; in observation 3, mpg remains 24 because the updated dataset also stated that mpg is 24.

The observations having _merge = 4 have had their mpg data updated. The mpg variable was missing in observations 2 and 4, and new values were obtained from the update data.

The observation having _merge = 5 has its mpg just as it was recorded in the original dataset, just as do the _merge = 3 observations, but there is an important difference: there is a disagreement about the value of mpg; the original claims it is 26 and the update, 25. Had we specified the replace option, mpg would now contain the updated 25, but the observation would still be marked _merge = 5. replace affects only which value is retained in the case of disagreement.
References
Nash, J. D. 1994. dm19: Merging raw data and dictionary files. Stata Technical Bulletin 20: 3-5. Reprinted in Stata Technical Bulletin Reprints, vol. 4, pp. 22-25.
Weesie, J. 2000. dm75: Safe and easy matched merging. Stata Technical Bulletin 53: 6-17. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 62-77.

Also See
Complementary:  [R] save, [R] sort
Related:        [R] append, [R] cross, [R] joinby
Background:     [U] 25 Commands for combining data
Title
meta -- Meta-analysis

Remarks
Stata should have a meta-analysis command, but as of the date that this manual was written, Stata does not. Stata users, however, have developed an excellent suite of commands for performing meta-analysis, many of which have been published in the Stata Technical Bulletin (STB).
Issue    insert    author(s)                  command    description
---------------------------------------------------------------------------------
STB-38   sbe16     S. Sharp, J. Sterne        meta       meta-analysis for an outcome of two
                                                         exposures or two treatment regimens
STB-42   sbe16.1   S. Sharp, J. Sterne        meta       update of sbe16
STB-43   sbe16.2   S. Sharp, J. Sterne        meta       update; install this version
STB-41   sbe19     T. J. Steichen             metabias   performs the Begg and Mazumdar (1994)
                                                         adjusted rank correlation test for
                                                         publication bias and the Egger et al.
                                                         (1997) regression asymmetry test for
                                                         publication bias
STB-44   sbe19.1   T. J. Steichen, M. Egger,  metabias   update of sbe19
                   J. Sterne
STB-57   sbe19.2   T. J. Steichen             metabias   update; install this version
STB-41   sbe20     A. Tobias                  galbr      performs the Galbraith plot (1988),
                                                         which is useful for investigating
                                                         heterogeneity in meta-analysis
STB-56   sbe20.1   A. Tobias                  galbr      update; install this version
STB-42   sbe22     J. Sterne                  metacum    performs cumulative meta-analysis,
                                                         using fixed- or random-effects
                                                         models, and graphs the result
STB-42   sbe23     S. Sharp                   metareg    extends a random-effects
                                                         meta-analysis to estimate the extent
                                                         to which one or more covariates,
                                                         with values defined for each study
                                                         in the analysis, explain
                                                         heterogeneity in the treatment
                                                         effects
STB-44   sbe24     M. J. Bradburn,            metan,     meta-analysis of studies with two
                   J. J. Deeks, D. G. Altman  funnel,    groups; funnel plot of precision
                                              labbe      versus treatment effect; L'Abbe plot
STB-45   sbe24.1   M. J. Bradburn,            funnel     update; install this version
                   J. J. Deeks, D. G. Altman
STB-47   sbe26     A. Tobias                  metainf    graphical technique to look for
                                                         influential studies in the
                                                         meta-analysis estimate
STB-56   sbe26.1   A. Tobias                  metainf    update; install this version
STB-49   sbe28     A. Tobias                  metap      combines p-values using either
                                                         Fisher's method or Edgington's
                                                         method
STB-56   sbe28.1   A. Tobias                  metap      update; install this version
STB-57   sbe39     T. J. Steichen             metatrim   performs the Duval and Tweedie
                                                         (2000) nonparametric "trim and fill"
                                                         method of accounting for publication
                                                         bias in meta-analysis

Additional commands may be available; enter Stata and type search meta analysis.
To download and install from the Internet the Sharp and Sterne meta command, for instance, you could

1. Pull down Help and select STB and User-written Programs.
2. Click on http://www.stata.com.
3. Click on stb.
4. Click on stb49.
5. Click on sbe28.
6. Click on click here to install.

or you could instead do the following:

1. Navigate to the appropriate STB issue:
   a. Type net from http://www.stata.com
      Type net cd stb
      Type net cd stb49
   or
   b. Type net from http://www.stata.com/stb/stb49
2. Type net describe sbe28
3. Type net install sbe28
References
Begg, C. B. and M. Mazumdar. 1994. Operating characteristics of a rank correlation test for publication bias. Biometrics 50: 1088-1101.
Bradburn, M. J., J. J. Deeks, and D. G. Altman. 1998a. sbe24: metan--an alternative meta-analysis command. Stata Technical Bulletin 44: 4-15. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 86-100.
------. 1998b. sbe24.1: Correction to funnel plot. Stata Technical Bulletin 45: 21. Reprinted in Stata Technical Bulletin Reprints, vol. 8, p. 100.
Egger, M., G. D. Smith, M. Schneider, and C. Minder. 1997. Bias in meta-analysis detected by a simple, graphical test. British Medical Journal 315: 629-634.
Galbraith, R. F. 1988. A note on graphical display of estimated odds ratios from several clinical trials. Statistics in Medicine 7: 889-894.
L'Abbe, K. A., A. S. Detsky, and K. O'Rourke. 1987. Meta-analysis in clinical research. Annals of Internal Medicine 107: 224-233.
Sharp, S. 1998. sbe23: Meta-analysis regression. Stata Technical Bulletin 42: 16-22. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 148-155.
Sharp, S. and J. Sterne. 1997. sbe16: Meta-analysis. Stata Technical Bulletin 38: 9-14. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 100-106.
------. 1998a. sbe16.1: New syntax and output for the meta-analysis command. Stata Technical Bulletin 42: 6-8. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 106-108.
------. 1998b. sbe16.2: Corrections to the meta-analysis command. Stata Technical Bulletin 43: 15. Reprinted in Stata Technical Bulletin Reprints, vol. 8, p. 84.
Steichen, T. J. 1998. sbe19: Tests for publication bias in meta-analysis. Stata Technical Bulletin 41: 9-15. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 125-133.
------. 2000a. sbe19.2: Update of tests for publication bias in meta-analysis. Stata Technical Bulletin 57: 4.
------. 2000b. sbe39: Nonparametric trim and fill analysis of publication bias in meta-analysis. Stata Technical Bulletin 57: 8-14.
Steichen, T. J., M. Egger, and J. Sterne. 1998. sbe19.1: Tests for publication bias in meta-analysis. Stata Technical Bulletin 44: 3-4. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 84-85.
Sterne, J. 1998. sbe22: Cumulative meta-analysis. Stata Technical Bulletin 42: 13-16. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 143-147.
Tobias, A. 1998. sbe20: Assessing heterogeneity in meta-analysis: the Galbraith plot. Stata Technical Bulletin 41: 15-17. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 133-136.
------. 1999a. sbe26: Assessing the influence of a single study in the meta-analysis estimate. Stata Technical Bulletin 47: 15-17. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 108-110.
------. 1999b. sbe28: Meta-analysis of p-values. Stata Technical Bulletin 49: 15-17. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 138-140.
------. 2000a. sbe20.1: Update of galbr. Stata Technical Bulletin 56: 14.
------. 2000b. sbe26.1: Update of metainf. Stata Technical Bulletin 56: 15.
------. 2000c. sbe28.1: Update of metap. Stata Technical Bulletin 56: 15.
Title
mfx -- Obtain marginal effects or elasticities after estimation

Syntax
mfx compute [if exp] [in range] [, dydx eyex dyex eydx at(atlist)
        eqlist(eqnames) predict(predict_option) nonlinear nodiscrete
        noesample nowght nose level(#) ]

mfx replay [, level(#)]

where atlist is

        { mean | median | zero [varname=# [, varname=#] [...]] }
        numlist     (for single-equation estimators only)
        matname     (for single-equation estimators only)

Description
mfx compute numerically calculates the marginal effects or the elasticities and their standard errors after estimation. Exactly what mfx can calculate is determined by the previous estimation command and the predict(predict_option) option. At which points the marginal effects or elasticities are to be evaluated is determined by the at(atlist) option. By default, mfx calculates the marginal effects or the elasticities at the means of the independent variables by using the default prediction option associated with the previous estimation command.

mfx replay replays the results of the previous mfx computation.
Options
dydx specifies that marginal effects are to be calculated. It is the default.

eyex specifies that elasticities are to be calculated in the form of ∂ln y/∂ln x.

dyex specifies that elasticities are to be calculated in the form of ∂y/∂ln x.

eydx specifies that elasticities are to be calculated in the form of ∂ln y/∂x.

at(atlist) specifies the points around which the marginal effects or the elasticities are to be estimated. The default is to estimate the effect around the means of the independent variables.

at(mean | median | zero [varname=# [, varname=#] [...]]) specifies that the marginal effects or the elasticities are to be evaluated at the means, at the medians of the independent variables, or at zeros. It also allows users to specify particular values for one or more independent variables, assuming the rest are means, medians, or zeros. For instance,

. probit foreign mpg weight price
. mfx compute, at(mean mpg=30)
at(numlist) specifies that the marginal effects or the elasticities are to be evaluated at the numlist. If there is a constant term in the model, add a 1 to the numlist. This option is for single-equation estimators only. For instance,

. probit foreign mpg weight price
. mfx compute, at(21 3000 6000 1)

at(matname) specifies the points in a matrix format. A 1 is also needed if there is a constant term in the model. This option is for single-equation estimators only. For instance,

. probit foreign mpg weight price
. mat A = (21, 3000, 6000, 1)
. mfx compute, at(A)
eqlist(eqnames) indirectly specifies the variables for which marginal effects (or elasticities) are to be calculated. Marginal effects (elasticities) will be calculated for all variables in the equations specified. The default is all equations, which is to say, all variables.

predict(predict_option) specifies which function is to be calculated for the marginal effects or the elasticities; i.e., the form of y. The default is the default predict option of the previous estimation command. For instance, since the default prediction for probit is the probability of a positive outcome, the predict() option is not required to calculate the marginal effects of the independent variables for the probability of a positive outcome.

. probit foreign mpg weight price
. mfx compute

To calculate the marginal effects for the linear prediction (xb), specify predict(xb).

. mfx compute, predict(xb)

To see which predict options are available, see help for the particular estimation command.

nonlinear specifies that y, the function to be calculated for the marginal effects or the elasticities, does not meet the linear-form restriction. For the definition of the linear-form restriction, please refer to the Methods and Formulas section. By default, mfx will assume that y meets the linear-form restriction, unless one or more independent variables are shared by multiple equations. For instance, predictions after

. heckman mpg price, sel(for=rep)

meet the linear-form restriction, but those after

. heckman mpg price, sel(for=rep price)

do not. If y meets the linear-form restriction, specifying nonlinear or not should produce the same results. However, the nonlinear method is generally more time-consuming. Most likely, users do not need to specify nonlinear after a Stata official estimation command. For user-written estimation commands, if you are not sure whether y is of linear form, specifying nonlinear is always a safe choice. Please refer to the Speed and accuracy section for further discussion.

nodiscrete treats dummy variables as continuous ones. If nodiscrete is not specified, the marginal effect of a dummy variable is calculated as the discrete change in the expected value of the dependent variable as the dummy variable changes from 0 to 1. This option is irrelevant to the computation of the elasticities, because all the dummy variables are treated as continuous in computing elasticities.

noesample only affects at(atlist). It specifies that when the means and medians are calculated, the whole dataset is to be considered instead of only the observations marked in the e(sample) defined by the previous estimation command.
nowght only affects at(atlist). It specifies that weights are to be ignored when calculating the means and medians for the atlist.

nose asks mfx to calculate the marginal effects or the elasticities without their standard errors. Calculating standard errors is very time-consuming; specifying nose will reduce the running time of mfx.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
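The default computation can be pictured as a numerical derivative of the prediction function evaluated at the means of the regressors. The following Python sketch (not Stata code; the logit probability and the coefficients are illustrative assumptions) shows the idea with a two-sided difference:

```python
import math

def logit_prob(xb):
    """A stand-in prediction function: Pr(positive outcome) after logit."""
    return 1.0 / (1.0 + math.exp(-xb))

def marginal_effects(beta, cons, xbar, h=1e-5):
    """Two-sided numerical derivative of the prediction with respect to
    each regressor, the other regressors held at their means xbar."""
    effects = []
    for k in range(len(xbar)):
        lo, hi = list(xbar), list(xbar)
        lo[k] -= h
        hi[k] += h
        f = lambda x: logit_prob(sum(b * v for b, v in zip(beta, x)) + cons)
        effects.append((f(hi) - f(lo)) / (2 * h))
    return effects

# Hypothetical two-regressor logit, evaluated at the means of x.
beta, cons, xbar = [0.23, 0.0003], -7.6, [21.3, 6165.0]
me = marginal_effects(beta, cons, xbar)
# For the logit, dP/dx_k = beta_k * P * (1 - P); the numerical values agree.
```

The analytic check in the last comment is what makes the logit a convenient test case; for predictions without a closed-form derivative, the numerical approach is all that is available.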
Remarks
Remarks are presented under the headings

    Obtaining marginal effects after single-equation (SE) estimation
    Obtaining marginal effects after multiple-equation (ME) estimation
    Obtaining three forms of elasticities
    Speed and accuracy

Obtaining marginal effects after single-equation (SE) estimation
Before running mfx, type help estimation_cmd to see what can be predicted after estimation and to see the default prediction.
> Example
We estimate a logit model using the auto dataset:

. logit foreign mpg price
Iteration 0:  log likelihood =  -45.03321
Iteration 1:  log likelihood = -36.694839
Iteration 2:  log likelihood = -36.463294
Iteration 3:  log likelihood =  -36.46219
Iteration 4:  log likelihood = -36.462189

Logit estimates                                   Number of obs   =        74
                                                  LR chi2(2)      =     17.14
                                                  Prob > chi2     =    0.0002
Log likelihood = -36.462189                       Pseudo R2       =    0.1903

     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
         mpg |   .2338353   .0671449     3.48   0.000     .1022338    .3654368
       price |    .000266   .0001166     2.28   0.022     .0000375    .0004945
       _cons |  -7.648111   2.043673    -3.74   0.000    -11.65364   -3.642586

To determine the marginal effects of mpg and price for the probability of a positive outcome at their mean values, issue the mfx command without options, because the default prediction after logit is the probability of a positive outcome and, by default, the calculation is made at the mean values.
. mfx compute
Marginal effects after logit
      y  = Pr(foreign) (predict)
         =  .26347633

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
     mpg |   .0453773      .0131     3.46   0.001   .019702  .071053  21.2973
   price |   .0000516      .00002    2.31   0.021   7.8e-06  .000095  6165.26

The first line of the output indicates that the marginal effects were calculated after a logit estimation. The second line of the output gives the form of y and the predict command that we would type to get y. The third line of the output gives the value of y given X; the values of X are displayed in the last column of the table.
To calculate the marginal effects at particular data points, say, mpg = 20 and price = 6000, specify the at() option:

. mfx compute, at(mpg=20, price=6000)
Marginal effects after logit
      y  = Pr(foreign) (predict)
         =  .20176601

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
     mpg |   .0376607      .00961    3.92   0.000   .018834  .056488  20.0000
   price |   .0000428      .00002    2.47   0.014   8.8e-06  .000077  6000.00
To calculate the marginal effects for the linear prediction (xb) instead of the probability, specify predict(xb). Note that the marginal effects for the linear prediction are the coefficients themselves.

. mfx compute, predict(xb)
Marginal effects after logit
      y  = Linear prediction (predict, xb)
         = -1.0279779

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
     mpg |   .2338353      .06714    3.48   0.000   .102234  .365437  21.2973
   price |    .000266      .00012    2.28   0.022   .000038  .000495  6165.26
If there is a dummy variable as an independent variable, mfx will calculate the discrete change as the dummy variable changes from 0 to 1.

. gen record = 0
. replace record = 1 if rep78 > 3
(34 real changes made)
. logit foreign mpg record, nolog

Logit estimates                                   Number of obs   =        74
                                                  LR chi2(2)      =     26.27
                                                  Prob > chi2     =    0.0000
Log likelihood = -31.898321                       Pseudo R2       =    0.2917

     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
         mpg |   .1079219   .0565077     1.91   0.056    -.0028311    .2186749
      record |   2.435068   .7128444     3.42   0.001     1.037918    3.832217
       _cons |  -4.689347   1.326547    -3.54   0.000     -7.28933   -2.089363

. mfx compute
Marginal effects after logit
      y  = Pr(foreign) (predict)
         =  .21890034

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
     mpg |   .0184528      .01017    1.81   0.070  -.001475  .038381  21.2973
 record* |   .4272707      .10432    4.09   0.000   .222712   .63163  .459459

(*) dy/dx is for discrete change of dummy variable from 0 to 1
If nodiscrete is specified, mfx will treat the dummy variable as continuous.

. mfx compute, nodiscrete
Marginal effects after logit
      y  = Pr(foreign) (predict)
         =  .21890034

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
     mpg |   .0184528      .01017    1.81   0.070  -.001475  .038381  21.2973
  record |   .4163552      .10733    3.88   0.000   .205994  .626716  .459459
Obtaining marginal effects after multiple-equation (ME) estimation
If you have not read the discussion above on using mfx after SE estimations, please do so. Except for the ability to select specific equations for the calculation of marginal effects, the use of mfx after ME models follows almost exactly the same form as for SE models.

The details of prediction statistics that are specific to particular ME models are documented with the estimation command. Users of mfx after ME commands should first read the documentation of predict for the estimation command. For a general introduction to the ME models, we will demonstrate mfx after heckman and mlogit.
> Example
. heckman mpg weight length, sel(foreign = displ) nolog

Heckman selection model                           Number of obs   =        74
(regression model with sample selection)          Censored obs    =        52
                                                  Uncensored obs  =        22
                                                  Wald chi2(2)    =      7.27
Log likelihood = -87.58426                        Prob > chi2     =    0.0264

             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
mpg          |
      weight |  -.0039923   .0071948    -0.55   0.579    -.0180939    .0101092
      length |  -.1202545   .2093074    -0.57   0.566    -.5304895    .2899805
       _cons |   56.72567   21.68463     2.62   0.009     14.22458    99.22676
foreign      |
displacement |  -.0250297   .0067241    -3.72   0.000    -.0382088   -.0118506
       _cons |   3.223625   .8757406     3.68   0.000     1.507205    4.940045
     /athrho |  -.9840858   .8112212    -1.21   0.225     -2.57405    .6058785
    /lnsigma |   1.724306   .2794524     6.17   0.000     1.176589    2.272022
         rho |  -.7548292    .349014                     -.9884463    .5412193
       sigma |   5.608626   1.567344                      3.243293    9.698997
      lambda |  -4.233555   3.022645                     -10.15783    1.690721
LR test of indep. eqns. (rho = 0):   chi2(1) =     1.37   Prob > chi2 = 0.2413

heckman estimated two equations, mpg and foreign; see [R] heckman. Two of the prediction statistics after heckman are the expected value of the dependent variable and the probability of being observed. To obtain the marginal effects of the independent variables of all the equations for the expected value of the dependent variable, specify predict(yexpected) with mfx.

. mfx compute, predict(yexpected)
Marginal effects after heckman
      y  = E(mpg*|Pr(foreign)) (predict, yexpected)
         =  .56522778

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
  weight |  -.0001725      .00041   -0.42   0.675  -.000979  .000634  3019.46
  length |  -.0051953      .01002   -0.52   0.604   -.02483   .01444  187.932
displa~t |  -.0340055      .02541   -1.34   0.181  -.083802  .015791  197.297

To calculate the marginal effects for the probability of being observed, since only the independent variables in equation foreign affect the probability of being observed, specify eqlist(foreign) to restrict the calculation.

. mfx compute, eqlist(foreign) predict(psel)
Marginal effects after heckman
      y  = Pr(foreign) (predict, psel)
         =  .04320292

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
displa~t |  -.0022958      .00153   -1.50   0.133  -.005287  .000696  197.297
> Example
predict after mlogit has a special feature that most other estimation commands do not: it can predict multiple new variables with a single predict command; see [R] mlogit. This feature cannot be adopted into mfx. To calculate the marginal effects for the probability of each outcome, run mfx separately for each outcome.
. mlogit rep78 mpg displ, nolog

Multinomial regression                            Number of obs   =        69
                                                  LR chi2(8)      =     22.83
                                                  Prob > chi2     =    0.0036
Log likelihood = -82.27874                        Pseudo R2       =    0.1218

       rep78 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
1            |
         mpg |  -.0021573   .2104309    -0.01   0.992    -.4145942    .4102796
displacement |  -.0052312   .0126927    -0.41   0.680    -.0301085    .0196461
       _cons |  -1.566574   6.429681    -0.24   0.808    -14.16852    11.03537
2            |
         mpg |   .0150954   .1235325     0.12   0.903    -.2270239    .2572147
displacement |   .0020254   .0063719     0.32   0.751    -.0104634    .0145142
       _cons |   -2.09099   3.664348    -0.57   0.568    -9.272981    5.091001
4            |
         mpg |   .0070871   .0883698     0.08   0.936    -.1661146    .1802888
displacement |  -.0066993   .0053435    -1.25   0.210    -.0171723    .0037737
       _cons |   .7047881   2.704785     0.26   0.794    -4.596492    6.006069
5            |
         mpg |   .0808327   .0983973     0.82   0.411    -.1120224    .2736878
displacement |  -.0231922   .0119692    -1.94   0.053    -.0466514    .0002671
       _cons |    .652801   3.545048     0.18   0.854    -6.295365    7.600967
(Outcome rep78==3 is the comparison group)

. mfx compute, predict(outcome(1))
Marginal effects after mlogit
      y  = Pr(rep78==1) (predict, outcome(1))
         =  .03438017

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
     mpg |  -.0003566      .00679   -0.05   0.958  -.013663   .01295  21.2899
displa~t |  -.0000703      .00041   -0.17   0.864  -.000873  .000732  198.000
. mfx compute, predict(outcome(2))
Marginal effects after mlogit
      y  = Pr(rep78==2) (predict, outcome(2))
         =  .12361544

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
     mpg |   .0008507      .01277    0.07   0.947  -.024183  .025885  21.2899
displa~t |   .0006444      .00067    0.96   0.336  -.000668  .001957  198.000
. mfx compute, predict(outcome(3))
Marginal effects after mlogit
      y  = Pr(rep78==3) (predict, outcome(3))
         =  .48578012

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
     mpg |  -.0039901      .01922   -0.21   0.836  -.041682  .033682  21.2899
displa~t |   .0015484      .00108    1.43   0.151  -.000567  .003664  198.000
. mfx compute, predict(outcome(4))
Marginal effects after mlogit
      y  = Pr(rep78==4) (predict, outcome(4))
         =  .30337619

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
     mpg |  -.0003418      .01707   -0.02   0.984  -.033805  .033122  21.2899
displa~t |  -.0010654      .00106   -1.01   0.313  -.003136  .001005  198.000
. mfx compute, predict(outcome(5))
Marginal effects after mlogit
      y  = Pr(rep78==5) (predict, outcome(5))
         =  .05284808

variable |      dy/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
     mpg |   .0038378      .00561    0.68   0.494  -.007167  .014843  21.2899
displa~t |  -.0010572      .00047   -2.24   0.025  -.001984 -.000131  198.000
Obtaining three forms of elasticities
mfx can also be used to obtain all three forms of elasticities.

    option    elasticity
    eyex      ∂ln y/∂ln x
    dyex      ∂y/∂ln x
    eydx      ∂ln y/∂x
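All three forms are rescalings of the marginal effect dy/dx by the values of x and y at the point of evaluation. A small Python sketch of the relationships (illustrative numbers only, not Stata code):

```python
# The three elasticity forms as rescalings of the marginal effect dy/dx:
#   eyex = (dy/dx) * x / y   percent change in y for a 1 percent change in x
#   dyex = (dy/dx) * x       unit change in y for a 1 percent change in x
#   eydx = (dy/dx) / y       percent change in y for a unit change in x
def elasticities(dydx, x, y):
    return {"eyex": dydx * x / y,
            "dyex": dydx * x,
            "eydx": dydx / y}

# Hypothetical point of evaluation: dy/dx = -0.00385 at x = 3019.46,
# y = 21.2973 (made-up values of regression-like magnitude).
e = elasticities(-0.00385, 3019.46, 21.2973)
print(e)
```

The scaling by x converts a unit change in x into a proportional change, and the division by y converts a unit change in y into a proportional change; combining them gives eyex.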
> Example
We estimate a regression model using the auto dataset. The marginal effects for the predicted value y after regress are the same as the coefficients. To get the elasticities of the form ∂ln y/∂ln x, specify the eyex option:

. regress mpg weight length

      Source |       SS       df       MS               Number of obs =     74
-------------+------------------------------           F(  2,    71) =  69.34
       Model |  1616.08062     2  808.040312           Prob > F      = 0.0000
    Residual |  827.378835    71  11.6532230           R-squared     = 0.6614
-------------+------------------------------           Adj R-squared = 0.6519
       Total |  2443.45946    73  33.4720474           Root MSE      = 3.4137

         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      weight |  -.0038515    .001586    -2.43   0.018    -.0070138   -.0006891
      length |  -.0795935   .0553577    -1.44   0.155    -.1899736    .0307867
       _cons |   47.88487    6.08787     7.87   0.000       35.746    60.02374

. mfx compute, eyex
Elasticities after regress
      y  = Fitted values (predict)
         =  21.297297

variable |      ey/ex    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
  weight |  -.5460497      .22509   -2.43   0.015  -.987208 -.104891  3019.46
  length |  -.7023518      .48867   -1.44   0.151  -1.66012  .255414  187.932
The first line of the output indicates that the elasticities were calculated after a regress estimation. The title of the second column of the table gives the form of the elasticities, ∂ln y/∂ln x: the change in y in percent for a 1 percent change in x.

If the independent variables have been log-transformed already, then we will want the elasticities of the form ∂ln y/∂x instead.

. gen lnweight = ln(weight)
. gen lnlength = ln(length)

. regress mpg lnweight lnlength

      Source |       SS       df       MS               Number of obs =     74
-------------+------------------------------           F(  2,    71) =  74.00
       Model |  1651.28916     2  825.644581           Prob > F      = 0.0000
    Residual |  792.170298    71  11.1573281           R-squared     = 0.6758
-------------+------------------------------           Adj R-squared = 0.6667
       Total |  2443.45946    73  33.4720474           Root MSE      = 3.3403

         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    lnweight |   -13.5974   4.692504    -2.90   0.005    -22.95398   -4.240811
    lnlength |  -9.816726   10.40316    -0.94   0.349    -30.56004    10.92659
       _cons |   181.1196   22.18429     8.16   0.000     136.8853    225.3538

. mfx compute, eydx
Elasticities after regress
      y  = Fitted values (predict)
         =  21.297297

variable |      ey/dx    Std. Err.     z    P>|z|  [    95% C.I.   ]        X
lnlength |  -.4609376      .48855   -0.94   0.345  -1.41847  .496594  5.22904
lnweight |  -.6384565      .22064   -2.89   0.004   -1.0709 -.206009  7.97875
Note that although the interpretation is the same, the results for eyex and eydx differ since we are estimating different models. If the dependent variable were log-transformed, we would specify dyex instead.
ml plot [eqname:]name [# [# [#]]] [, saving(filename[, replace]) ]

ml init { [eqname:]name=# | /eqname=# } [...]
ml init # [# ...], copy
ml init matname [, skip copy]

ml report

ml trace { on | off }

ml count [clear | on | off]

ml maximize [, difficult nolog trace gradient hessian showstep
        iterate(#) ltolerance(#) tolerance(#) nowarning novce
        score(newvarnames) nooutput level(#) eform(string) noclear ]

ml graph [#] [, saving(filename[, replace]) ]

ml display [, noheader eform(string) first neq(#) plus level(#) ]
where method is { lf | d0 | d1 | d1debug | d2 | d2debug }

and eq is the equation to be estimated, enclosed in parentheses, and optionally with a name to be given to the equation, preceded by a colon:

        ( [eqname:] [varnames =] [varnames] [, eq_options] )

or eq is the name of a parameter, such as sigma, with a slash in front:

        /eqname        which is equivalent to        (eqname:)

and eq_options are

        noconstant
        offset(varname)
        exposure(varname)

fweights, pweights, aweights, and iweights are allowed; see [U] 14.1.6 weight. With all but method lf, you must write your likelihood-evaluation program a certain way if pweights are to be specified, and pweights may not be specified with method d0.

ml shares features of all estimation commands; see [U] 23 Estimation and post-estimation commands. To redisplay results, type ml display.
Syntax of ml model in noninteractive mode
ml model method progname eq [eq ...] [weight] [if exp] [in range], maximize
        [ robust cluster(varname) title(string) nopreserve collinear
        missing lf0(#k #ll) continue waldtest(#) constraints(numlist)
        obs(#) noscvars init(ml_init_args) search({on | quietly | off})
        repeat(#) bounds(ml_search_bounds) difficult nolog trace
        gradient hessian showstep iterate(#) ltolerance(#) tolerance(#)
        nowarning novce score(newvarlist) ]

Noninteractive mode is invoked by specifying option maximize. Use maximize when ml is to be used as a subroutine of another ado-file or program and you want to carry forth the problem, from definition to posting of final results, in one command.
Syntax of subroutines for use by method d0, d1, and d2 evaluators
mleval newvarname = vecname [, eq(#)]
mleval scalarname = vecname, scalar [eq(#)]
mlsum scalarname_lnf = exp [if exp] [, noweight]
mlvecsum scalarname_lnf rowvecname = exp [if exp] [, eq(#)]
mlmatsum scalarname_lnf matrixname = exp [if exp] [, eq(#[,#])]
Syntax of user-written evaluator

Summary of notation
The log-likelihood function is ln L(θ1j, θ2j, ..., θEj), where θij = xij bi, and j = 1, ..., N indexes observations and i = 1, ..., E indexes the linear equations defined by ml model. If the likelihood satisfies the linear-form restrictions, it can be decomposed as

        ln L = Σ(j=1 to N) ln ℓ(θ1j, θ2j, ..., θEj)
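The linear-form decomposition can be made concrete with a small numerical illustration (Python, not Stata ado-code; a single-equation probit likelihood and made-up data are assumptions for the sketch): each observation contributes ln ℓ(θj) with θj = xj·b, and the overall ln L is the sum of those contributions.

```python
import math

def Phi(z):
    """Standard normal CDF, built from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lnL_linear_form(y, X, b):
    """Probit log likelihood in linear form: observation j contributes
    ln l(theta_j) with theta_j = x_j . b, and ln L is the sum."""
    total = 0.0
    for yj, xj in zip(y, X):
        theta = sum(bi * xi for bi, xi in zip(b, xj))    # theta_j = x_j b
        pj = Phi(theta)
        total += math.log(pj if yj == 1 else 1.0 - pj)   # ln l(theta_j)
    return total

# Made-up data: a constant-only probit (x_j = [1]), so every theta_j = b0
# and, at b0 = 0, each observation contributes ln 0.5.
y = [1, 1, 0, 1]
print(lnL_linear_form(y, [[1.0]] * 4, [0.0]))
```

This is exactly the structure a method lf evaluator exploits: it only ever sees the θij, never the full coefficient vector.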
Method lf evaluators:

program define progname
        version 7
        args lnf theta1 [theta2 ...]
        /* if you need to create any intermediate results: */
        tempvar tmp1 tmp2 ...
        quietly gen double `tmp1' = ...
        ...
        quietly replace `lnf' = ...
end

where
        `lnf'      variable to be filled in with observation-by-observation
                   values of ln ℓj
        `theta1'   variable containing evaluation of 1st equation θ1j = x1j b1
        `theta2'   variable containing evaluation of 2nd equation θ2j = x2j b2
Method d0 evaluators:

program define progname
        version 7
        args todo b lnf
        tempvar theta1 theta2 ...
        mleval `theta1' = `b', eq(1)
        mleval `theta2' = `b', eq(2)    /* if there is a θ2 */
        /* if you need to create any intermediate results: */
        tempvar tmp1 tmp2 ...
        gen double `tmp1' = ...
        ...
        mlsum `lnf' = ...
end

where
        `todo'   always contains 0 (may be ignored)
        `b'      full parameter row vector b=(b1,b2,...,bE)
        `lnf'    scalar to be filled in with overall ln L
Method d1 evaluators:

program define progname
        version 7
        args todo b lnf g negH [g1 [g2 ...]]
        tempvar theta1 theta2 ...
        mleval `theta1' = `b', eq(1)
        mleval `theta2' = `b', eq(2)    /* if there is a θ2 */
        /* if you need to create any intermediate results: */
        tempvar tmp1 tmp2 ...
        gen double `tmp1' = ...
        ...
        mlsum `lnf' = ...
        if `todo'==0 | `lnf'==. { exit }
        tempname d1 d2 ...
        mlvecsum `lnf' `d1' = formula for ∂ln ℓj/∂θ1j, eq(1)
        mlvecsum `lnf' `d2' = formula for ∂ln ℓj/∂θ2j, eq(2)
        matrix `g' = (`d1',`d2', ...)
end

where
        `todo'   contains 0 or 1; 0 => `lnf' to be filled in;
                 1 => `lnf' and `g' to be filled in
        `b'      full parameter row vector b=(b1,b2,...,bE)
        `lnf'    scalar to be filled in with overall ln L
        `g'      row vector to be filled in with overall g=∂ln L/∂b
        `negH'   argument to be ignored
        `g1'     variable optionally to be filled in with ∂ln ℓj/∂b1
        `g2'     variable optionally to be filled in with ∂ln ℓj/∂b2
Method d2 evaluators:

program define progname
        version 7
        args todo b lnf g negH [g1 [g2 ...]]
        tempvar theta1 theta2 ...
        mleval `theta1' = `b', eq(1)
        mleval `theta2' = `b', eq(2)    /* if there is a θ2 */
        /* if you need to create any intermediate results: */
        tempvar tmp1 tmp2 ...
        gen double `tmp1' = ...
        ...
        mlsum `lnf' = ...
        if `todo'==0 | `lnf'==. { exit }
        tempname d1 d2 ...
        mlvecsum `lnf' `d1' = formula for ∂ln ℓj/∂θ1j, eq(1)
        mlvecsum `lnf' `d2' = formula for ∂ln ℓj/∂θ2j, eq(2)
        matrix `g' = (`d1',`d2', ...)
        if `todo'==1 | `lnf'==. { exit }
        tempname d11 d12 d22 ...
        mlmatsum `lnf' `d11' = formula for -∂²ln ℓj/∂θ1j², eq(1)
        mlmatsum `lnf' `d12' = formula for -∂²ln ℓj/∂θ1j∂θ2j, eq(1,2)
        mlmatsum `lnf' `d22' = formula for -∂²ln ℓj/∂θ2j², eq(2)
        matrix `negH' = (`d11',`d12', ... \ `d12'',`d22', ...)
end

where
        `todo'   contains 0, 1, or 2; 0 => `lnf' to be filled in;
                 1 => `lnf' and `g' to be filled in; 2 => `lnf', `g',
                 and `negH' to be filled in
        `b'      full parameter row vector b=(b1,b2,...,bE)
        `lnf'    scalar to be filled in with overall ln L
        `g'      row vector to be filled in with overall g=∂ln L/∂b
        `negH'   matrix to be filled in with overall negative Hessian
                 -H = -∂²ln L/∂b∂b'
        `g1'     variable optionally to be filled in with ∂ln ℓj/∂b1
        `g2'     variable optionally to be filled in with ∂ln ℓj/∂b2
ml -- Maximum likelihood estimation

Global macros for use by all evaluators

        $ML_y1      name of first dependent variable
        $ML_y2      name of second dependent variable, if any
        $ML_samp    variable containing 1 if observation to be used; 0 otherwise
        $ML_w       variable containing weight associated with observation or 1
                    if no weights specified

Method lf evaluators can ignore $ML_samp, but restricting calculations to the $ML_samp==1 subsample will speed execution. Method lf evaluators must ignore $ML_w; application of weights is handled by the method itself. Method d0, d1, and d2 evaluators can ignore $ML_samp as long as ml model's nopreserve option is not specified. Methods d0, d1, and d2 will run more quickly if nopreserve is specified. Method d0, d1, and d2 evaluators can ignore $ML_w only if they use mlsum, mlvecsum, and mlmatsum to produce final results.
Description

ml clear clears the current problem definition. This command is rarely, if ever, used because, when you type ml model, any previous problem is automatically cleared.

ml model defines the current problem.

ml query displays a description of the current problem.

ml check verifies that the log-likelihood evaluator you have written seems to work. We strongly recommend using this command.

ml search searches for (better) initial values. We recommend using this command.

ml plot provides a graphical way of searching for (better) initial values.

ml init provides a way of setting initial values to user-specified values.

ml report reports the values of ln L, its gradient, and its negative Hessian at the initial values or current parameter estimates b0.

ml trace traces the execution of the user-defined log-likelihood evaluation program.

ml count counts the number of times the user-defined log-likelihood evaluation program is called. It was intended as a debugging tool for those developing ml, and it now serves little use besides entertainment. ml count clear clears the counter. ml count on turns on the counter. ml count without arguments reports the current values of the counters. ml count off stops counting calls.

ml maximize maximizes the likelihood function and reports final results. Once ml maximize has successfully completed, the previously mentioned ml commands may no longer be used -- ml graph and ml display may be used.

ml graph graphs the log-likelihood values against the iteration number.

ml display redisplays final results.

progname is the name of a program you write to evaluate the log-likelihood function. In this documentation, it is referred to as the user-written evaluator or sometimes simply as the evaluator. The program you write is written in the style required by the method you choose. The methods are lf, d0, d1, and d2. Thus, if you choose to use method lf, your program is called a method lf evaluator.

Method lf evaluators are required to evaluate the observation-by-observation log likelihood ln ℓj, j = 1, ..., N.

Method d0 evaluators are required to evaluate the overall log likelihood ln L.

Method d1 evaluators are required to evaluate the overall log likelihood and its gradient vector g = ∂ln L/∂b.

Method d2 evaluators are required to evaluate the overall log likelihood, its gradient, and its negative Hessian matrix -H = -∂²ln L/∂b∂b'.
mleval is a subroutine used by method d0, d1, and d2 evaluators to evaluate the coefficient vector b that they are passed.

mlsum is a subroutine used by method d0, d1, and d2 evaluators to define the value ln L that is to be returned.

mlvecsum is a subroutine used by method d1 and d2 evaluators to define the gradient vector g that is to be returned. It is suitable for use only when the likelihood function meets the linear-form restrictions.

mlmatsum is a subroutine used by method d2 evaluators to define the negative Hessian matrix -H that is to be returned. It is suitable for use only when the likelihood function meets the linear-form restrictions.
Options for use with ml model in interactive or noninteractive mode

robust and cluster(varname) specify the robust variance estimator, as does specifying pweights.
        If you have written a method lf evaluator, robust, cluster(), and pweights will work. There is nothing to do except specify the options.
        If you have written a method d0 evaluator, robust, cluster(), and pweights will not work. Specifying these options will result in an error message.
        If you have written a method d1 or d2 evaluator and the likelihood function satisfies the linear-form restrictions, robust, cluster(), and pweights will work only if you fill in the equation scores; otherwise, specifying these options will result in an error message.

title(string) specifies the title to be placed on the estimation output when results are complete.
nopreserve specifies that it is not necessary for ml to ensure that only the estimation subsample is in memory when the user-written likelihood evaluator is called. nopreserve is irrelevant when using method lf.
        For the other methods, if nopreserve is not specified, ml saves the data in a file (preserves the original dataset) and drops the irrelevant observations before calling the user-written evaluator. This way, even if the evaluator does not restrict its attention to the $ML_samp==1 subsample, results will still be correct. Later, ml automatically restores the original dataset.
        ml need not go through these machinations in the case of method lf because the user-written evaluator calculates observation-by-observation values and it is ml itself that sums the components.
        ml goes through these machinations if and only if the estimation sample is a subsample of the data in memory. If the estimation sample includes every observation in memory, ml does not preserve the original dataset. Thus, programmers must not damage the original dataset unless they preserve the data themselves.
        We recommend that interactive users of ml not specify nopreserve; the speed gain is not worth the chances of incorrect results. We recommend that programmers do specify nopreserve, but only after verifying that their evaluator really does restrict its attention solely to the $ML_samp==1 subsample.

collinear specifies that ml is not to remove the collinear variables within equations. There is no reason one would want to leave collinear variables in place, but this option is of interest to programmers who, in their code, have already removed collinear variables and thus do not want ml to waste computer time checking again.
missing specifies that observations containing variables with missing values are not to be eliminated from the estimation sample. There are two reasons one might want to specify missing:
        Programmers may wish to specify missing because, in other parts of their code, they have already eliminated observations with missing values and thus do not want ml to waste computer time looking again.
        All users may wish to specify missing if their model explicitly deals with missing values. Stata's heckman command is a good example of this. In such cases, there will be observations where missing values are allowed and other observations where they are not -- where their presence should cause the observation to be eliminated. If you specify missing, it is your responsibility to specify an if exp that eliminates the irrelevant observations.

lf0(#k #ll) is typically used by programmers. It specifies the number of parameters and the log-likelihood value of the constant-only model so that ml can report a likelihood-ratio test rather than a Wald test. These values may have been analytically determined, or they may have been determined by a previous estimation of the constant-only model on the estimation sample. Also see the continue option directly below. If you specify lf0(), it must be safe for you to specify the missing option, too, else how did you calculate the log likelihood for the constant-only model on the same sample?

waldtest(#) works like the above except that it forces a Wald test to be reported even if the information to perform the likelihood-ratio test is available and even if none of robust, cluster(), or pweights was specified. waldtest(k), k > 1, may not be specified with lf0().

constraints(numlist) specifies the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are defined using the constraint command and are numbered; see [R] constraint. See [R] reg3 for the use of constraints in multiple-equation contexts.

obs(#) is used mostly by programmers. It specifies that the number of observations reported, and ultimately stored in e(N), is to be #. Ordinarily, ml works that out for itself, and correctly. Programmers may want to specify this option when, in order for the likelihood evaluator to work for N observations, they first had to modify the dataset so that it contained a different number of observations.

noscvars is used mostly by programmers. It specifies that method d0, d1, or d2 is being used but that the likelihood evaluation program neither calculates nor uses arguments `g1', `g2', etc., which are the score vectors. Thus, ml can save a little time by not generating and passing those arguments.
Options for use with ml model in noninteractive mode

In addition to the above options, the following options are for use with ml model in noninteractive mode. Noninteractive mode is for programmers who use ml as a subroutine and want to issue a single command that will carry forth the estimation from start to finish.

maximize is not optional. It specifies noninteractive mode.

init(ml_init_args) sets the initial values b0. ml_init_args are whatever you would type after the ml init command.

search({on|quietly|off}) specifies whether ml search is to be used to improve the initial values. search(on) is the default and is equivalent to running separately ml search, repeat(0). search(quietly) is the same as search(on) except that it suppresses ml search's output. search(off) prevents the calling of ml search altogether.

repeat(#) is ml search's repeat() option and is relevant only if search(off) is not specified. repeat(0) is the default.

bounds(ml_search_bounds) is relevant only if search(off) is not specified. The command ml model issues is 'ml search ml_search_bounds, repeat(#)'. Specifying search bounds is optional.

difficult, nolog, trace, gradient, hessian, showstep, iterate(), ltolerance(), tolerance(), nowarning, novce, and score() are ml maximize's equivalent options.
Options for use when specifying equations

noconstant specifies that the equation is not to include an intercept.

offset(varname) specifies that the equation is to be xb + varname; that is, the equation is to include varname with coefficient constrained to be 1.

exposure(varname) is an alternative to offset(varname); it specifies that the equation is to be xb + ln(varname); that is, the equation is to include ln(varname) with coefficient constrained to be 1.
Options for use with ml search

repeat(#) specifies the number of random attempts that are to be made to find a better initial-value vector. The default is repeat(10).
        repeat(0) specifies that no random attempts are to be made. More correctly, repeat(0) specifies that no random attempts are to be made if the initial initial-value vector is a feasible starting point. If it is not, ml search will make random attempts even if you specify repeat(0) because it has no alternative. The repeat() option refers to the number of random attempts to be made to improve the initial values. When the initial starting-value vector is not feasible, ml search will make up to 1,000 random attempts to find starting values. It stops the instant it finds one set of values that works and then moves into its improve-initial-values logic.
        repeat(k), k > 0, specifies the number of random attempts to be made to improve the initial values.

nolog specifies that no output is to appear while ml search looks for better starting values. If you specify nolog and the initial starting-value vector is not feasible, ml search will ignore the fact that you specified the nolog option. If ml search must take drastic action to find starting values, it feels you should know about this even if you attempted to suppress its usual output.

trace specifies that you want more detailed output about ml search's actions than it would usually provide. This is more entertaining than useful. ml search prints a period each time it evaluates the likelihood function without obtaining a better result and a plus when it does.

restart specifies that random actions are to be taken to obtain starting values and that the resulting starting values are not to be a deterministic function of the current values. Users should not specify this option, mainly because, with restart, ml search intentionally does not produce as good a set of starting values as it could. restart is included for use by the optimizer when it gets into serious trouble. The random actions are to ensure that the actions of the optimizer and ml search, working together, do not result in a long, endless loop.
        restart implies norescale, which is why we recommend you do not specify restart. In testing, cases were discovered where rescale worked so well that, even after randomization, the rescaler would bring the starting values right back to where they had been the first time and so defeated the intended randomization.

norescale specifies that ml search is not to engage in its rescaling actions to improve the parameter vector. We do not recommend specifying this option because rescaling tends to work so well.
Options for use with ml plot

saving(filename[, replace]) specifies that the graph is to be saved in filename.gph.

Options for use with ml init

skip specifies that any parameters found in the specified initialization vector that are not also found in the model are to be ignored. The default action is to issue an error message.

copy specifies that the list of numbers or the initialization vector is to be copied into the initial-value vector by position rather than by name.
Options for use with ml maximize

difficult specifies that the likelihood function is likely to be difficult to maximize. In particular, difficult states that there may be regions where -H is not invertible and that, in those regions, ml's standard fixup may not work well. difficult specifies that a different fixup requiring substantially more computer time is to be used. For the majority of likelihood functions, difficult is likely to increase execution times unnecessarily. For other likelihood functions, specifying difficult is of great importance.

nolog, trace, gradient, hessian, and showstep control the display of the iteration log.
        nolog suppresses reporting of the iteration log.
        trace adds to the iteration log a report on the current parameter vector.
        gradient adds to the iteration log a report on the current gradient vector.
        hessian adds to the iteration log a report on the current negative Hessian matrix.
        showstep adds to the iteration log a report on the steps within iteration.

iterate(#), ltolerance(#), and tolerance(#) specify the definition of convergence. iterate(16000) tolerance(1e-6) ltolerance(1e-7) is the default. Convergence is declared when

        mreldif(b_i+1, b_i) <= tolerance()    or
        reldif{ln L(b_i+1), ln L(b_i)} <= ltolerance()
In addition, iteration stops when i == iterate(); in that case, results along with the message "convergence not achieved" are presented. The return code is still set to 0.

nowarning is allowed only with iterate(0). nowarning suppresses the "convergence not achieved" message. Programmers might specify iterate(0) nowarning when they have a vector b already containing the final estimates and want ml to calculate the variance matrix and post final estimation results. In that case, specify 'init(b) search(off) iterate(0) nowarning nolog'.

novce is allowed only with iterate(0). novce substitutes the zero matrix for the variance matrix, which in effect posts estimation results as fixed constants.

score(newvarlist) specifies that the equation scores are to be stored in the specified new variables. Either specify one new variable name per equation or specify a short name suffixed with a *. E.g., score(sc*) would be taken as specifying sc1 if there were one equation and sc1 and sc2 if there were two equations. In order to specify score(), either you must be using method lf, or the estimation subsample must be the entire dataset in memory, or you must have specified the nopreserve option.

nooutput suppresses display of the final results. This is different from prefixing ml maximize with quietly in that the iteration log is still displayed (assuming nolog is not specified).

level(#) is the standard confidence-level option. It specifies the confidence level, in percent, for confidence intervals of the coefficients. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

eform(string) is ml display's eform() option.

noclear specifies that after the model has converged, the ml problem definition is not to be cleared. Perhaps you are having convergence problems and intend to run the model to convergence. If so, use ml search to see if those values can be improved, and then start the estimation again.
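As a sketch of the posting recipe mentioned under nowarning, the noninteractive form would look like the following; myprog and the matrix b of already-computed final estimates are hypothetical, and the recipe itself is the one quoted above:

```stata
. ml model lf myprog (price = mpg weight), maximize init(b) search(off) iterate(0) nowarning nolog
```

ml then evaluates the likelihood once at b, computes the variance matrix, and posts the results exactly as if the maximization had been carried out from scratch.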
Options for use with ml graph

saving(filename[, replace]) specifies that the graph is to be saved in filename.gph.

Options for use with ml display

noheader suppresses the display of the header above the coefficient table that displays the final log-likelihood value, the number of observations, and the model significance test.

eform(string) displays the coefficient table in exponentiated form: for each coefficient, exp(b) rather than b is displayed, and standard errors and confidence intervals are transformed. Display of the intercept, if any, is suppressed. string is the table header that will be displayed above the transformed coefficients and must be 11 characters or fewer in length, for example, eform("Odds ratio").

first displays a coefficient table reporting results for the first equation only, and the report makes it appear that the first equation is the only equation. This is used by programmers who estimate ancillary parameters in the second and subsequent equations and will report the values of such parameters themselves.

neq(#) is an alternative to first. neq(#) displays a coefficient table reporting results for the first # equations. This is used by programmers who estimate ancillary parameters in the #+1st and subsequent equations and will report the values of such parameters themselves.

plus displays the coefficient table just as it would be ordinarily, but then, rather than ending the table in a line of dashes, ends it in dashes-plus-sign-dashes. This is so that programmers can write additional display code to add more results to the table and make it appear as if the combined result is one table. Programmers typically specify plus with options first or neq().

level(#) is the standard confidence-level option. It specifies the confidence level, in percent, for confidence intervals of the coefficients. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.
Options for use with mleval

eq(#) specifies the equation number i for which θij = xij*bi is to be evaluated. eq(1) is assumed if eq() is not specified.

scalar asserts that the ith equation is known to evaluate to a constant; the equation was specified as (), (name:), or /name on the ml model statement. If you specify this option, the new "variable" created is created as a scalar. If the ith equation does not evaluate to a scalar, an error message is issued.

Options for use with mlsum

noweight specifies that weights ($ML_w) are to be ignored when summing the likelihood function.

Options for use with mlvecsum

eq(#) specifies the equation for which a gradient vector ∂ln L/∂bi is to be constructed. The default is eq(1).
Options for use with mlmatsum

eq(#[,#]) specifies the equations for which the negative Hessian matrix is to be constructed. The default is eq(1), which means the same as eq(1,1), which means -∂²ln L/∂b1∂b1'. Specifying eq(i,j) results in -∂²ln L/∂bi∂bj'.
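As an illustration of the eq() option: in the Weibull model developed in the examples below, differentiating the gradient ∂ln ℓj/∂θ1j = pj(Mj - dj) once more gives -∂²ln ℓj/∂θ1j² = pj²Mj (a derivation not shown in the text itself), so the (1,1) block of a d2 evaluator could be sketched as follows, assuming the temporary variables `p' and `M' have already been generated as in those examples:

```stata
tempname d11
mlmatsum `lnf' `d11' = `p'^2*`M', eq(1)
```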
Remarks

For a thorough discussion of ml, see Maximum Likelihood Estimation with Stata (Gould and Sribney 1999). The book provides a tutorial introduction to ml, notes on advanced programming issues, and a discourse on maximum likelihood estimation from both theoretical and practical standpoints.

ml requires that you write a program that evaluates the log-likelihood function and, possibly, its first and second derivatives. The style of the program you write depends upon the method chosen; methods lf and d0 require that your program evaluate the log likelihood only; method d1 requires that your program evaluate the log likelihood and gradient; method d2 requires that your program evaluate the log likelihood, gradient, and negative Hessian. Methods lf and d0 differ from each other in that, with method lf, your program is required to produce observation-by-observation log-likelihood values ln ℓj and it is assumed that ln L = Σj ln ℓj; with method d0, your program is required to produce the overall value ln L.

Once you have written the program -- called an evaluator -- you define a model to be estimated using ml model and obtain estimates using ml maximize. You might type

        . ml model ...
        . ml maximize

but we recommend that you type

        . ml model ...
        . ml check
        . ml search
        . ml maximize

ml check will verify that your evaluator has no obvious errors, and ml search will find better initial values.

You fill in the ml model statement with (1) the method you are using, (2) the name of your program, and (3) the "equations". You write your evaluator in terms of θ1, θ2, ..., each of which has a linear equation associated with it. That linear equation might be as simple as θi = b0, or it might be θi = b1*mpg + b2*weight + b3, or it might omit the intercept b3. The equations are specified in parentheses on the ml model line.

Suppose you are using method lf and the name of your evaluator program is myprog. The following statement

        . ml model lf myprog (mpg weight)

would specify a single equation with θi = b1*mpg + b2*weight + b3. If you wanted to omit b3, you would type

        . ml model lf myprog (mpg weight, nocons)

and if all you wanted was θi = b0, you would type

        . ml model lf myprog ()

With multiple equations, you list the equations one after the other; so if you typed

        . ml model lf myprog (mpg weight) ()
you would be specifying θ1 = b1*mpg + b2*weight + b3 and θ2 = b4. You would write your likelihood in terms of θ1 and θ2. If the model were linear regression, θ1 might be the xb part and θ2 the variance of the residuals.

When you specify the equations, you also specify any dependent variables. If you type

        . ml model lf myprog (price = mpg weight) ()

price would be the one and only dependent variable, and that would be passed to your program in $ML_y1. If your model had two dependent variables, you could type

        . ml model lf myprog (price displ = mpg weight) ()

and then $ML_y1 would be price and $ML_y2 would be displ. You can specify however many dependent variables are necessary and specify them on any equation. It does not matter on which equation you specify them; the first one specified is placed in $ML_y1, the second in $ML_y2, and so on.

Example

Using method lf, we are to produce observation-by-observation values of the log likelihood. The probit log-likelihood function is

        ln ℓj = ln Φ(θ1j)       if yj = 1
        ln ℓj = ln Φ(-θ1j)      if yj = 0
        θ1j = xj*b
The following is the method lf evaluator for this likelihood function:

        program define myprobit
                version 7
                args lnf theta1
                quietly replace `lnf' = ln(norm(`theta1'))  if $ML_y1==1
                quietly replace `lnf' = ln(norm(-`theta1')) if $ML_y1==0
        end

If we wanted to estimate a model of foreign on mpg and weight, we would type

        . ml model lf myprobit (foreign = mpg weight)
        . ml maximize

The 'foreign =' part specifies that y is foreign. The 'mpg weight' part specifies that θ1j = b1*mpg_j + b2*weight_j + b3. The result of running this is
        . ml model lf myprobit (foreign = mpg weight)
        . ml maximize

        initial:       log likelihood = -51.292891
        alternative:   log likelihood = -45.055272
        rescale:       log likelihood = -45.055272
        Iteration 0:   log likelihood = -45.055272
        Iteration 1:   log likelihood =   -27.9041
        Iteration 2:   log likelihood =   -26.8578
        Iteration 3:   log likelihood = -26.844191
        Iteration 4:   log likelihood = -26.844189
        Iteration 5:   log likelihood = -26.844189

                                                     Number of obs   =         74
                                                     Wald chi2(2)    =      20.75
        Log likelihood = -26.844189                  Prob > chi2     =     0.0000

        ------------------------------------------------------------------------------
             foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
                 mpg |  -.1039503   .0515689    -2.02   0.044    -.2050235   -.0028772
              weight |  -.0023355   .0005661    -4.13   0.000     -.003445   -.0012261
               _cons |   8.275464   2.554142     3.24   0.001     3.269438    13.28149
        ------------------------------------------------------------------------------
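Since this is just the probit likelihood, the coefficients above can be checked against Stata's built-in command, which should reproduce them on the same data (assuming the automobile dataset is loaded):

```stata
. probit foreign mpg weight
```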
Example

A two-equation, two-dependent-variable model is little different. Rather than receiving one theta, our program will receive two. Rather than there being one dependent variable in $ML_y1, there will be dependent variables in $ML_y1 and $ML_y2. For instance, the Weibull regression log-likelihood function is

        ln ℓj = -(tj e^(-θ1j))^exp(θ2j) + dj{θ2j - θ1j + (e^θ2j - 1)(ln tj - θ1j)}
        θ1j = xj*b1
        θ2j = s

where tj is the time of failure or censoring and dj = 1 if failure and 0 if censored. We can make the log likelihood a little easier to program by introducing some extra variables:

        pj = exp(θ2j)
        Mj = {tj exp(-θ1j)}^pj
        Rj = ln tj - θ1j
        ln ℓj = -Mj + dj{θ2j - θ1j + (pj - 1)Rj}

The method lf evaluator for this is

        program define myweib
                version 7.0
                args lnf theta1 theta2
                tempvar p M R
                quietly gen double `p' = exp(`theta2')
                quietly gen double `M' = ($ML_y1*exp(-`theta1'))^`p'
                quietly gen double `R' = ln($ML_y1) - `theta1'
                quietly replace `lnf' = -`M' + $ML_y2*(`theta2'-`theta1' + (`p'-1)*`R')
        end
We can estimate a model by typing

        . ml model lf myweib (studytime died = drug2 drug3 age) ()
        . ml maximize

Note that we specified '()' for the second equation. The second equation corresponds to the Weibull shape parameter s, and the linear combination we want for s contains just an intercept. Alternatively, we could type

        . ml model lf myweib (studytime died = drug2 drug3 age) /s

Typing /s means the same thing as typing (s:), and both really mean the same thing as (). The s, either after a slash or in parentheses before a colon, labels the equation. It makes the output look prettier, and that is all:
        . ml model lf myweib (studytime died = drug2 drug3 age) /s
        . ml maximize

        initial:       log likelihood =       -744
        alternative:   log likelihood = -356.14276
        rescale:       log likelihood = -200.80201
        rescale eq:    log likelihood = -136.69234
        Iteration 0:   log likelihood = -136.69234  (not concave)
        Iteration 1:   log likelihood =   -124.117
        Iteration 2:   log likelihood = -113.88918
        Iteration 3:   log likelihood =   -110.303
        Iteration 4:   log likelihood = -110.26737
        Iteration 5:   log likelihood = -110.26736
        Iteration 6:   log likelihood = -110.26736

                                                     Number of obs   =         48
                                                     Wald chi2(3)    =      35.25
        Log likelihood = -110.26736                  Prob > chi2     =     0.0000

        ------------------------------------------------------------------------------
                     |      Coef.   Std. Err.       z     P>|z|    [95% Conf. Interval]
        -------------+----------------------------------------------------------------
        eq1          |
               drug2 |   1.012966   .2903917     3.488    0.000    .4438086    1.582123
               drug3 |    1.45917   .2821195     5.172    0.000    .9062261    2.012114
                 age |  -.0671728   .0205688    -3.266    0.001   -.1074868   -.0268587
               _cons |   6.060723   1.152845     5.257    0.000    3.801188    8.320269
        -------------+----------------------------------------------------------------
        s            |
               _cons |   .5573333   .1402154     3.975    0.000    .2825162    .8321504
        ------------------------------------------------------------------------------

Example

Method d0 evaluators receive b = (b1, b2, ..., bE), the coefficient vector, rather than the already evaluated θ1, θ2, ..., and they are required to evaluate the overall log likelihood ln L rather than ln ℓj.

Use mleval to produce the thetas from the coefficient vector. Use mlsum to sum the components that enter into ln L. In the case of Weibull, ln L = Σj ln ℓj, and our method d0 evaluator is

        program define weib0
                version 7.0
                args todo b lnf
                tempvar theta1 theta2
                mleval `theta1' = `b', eq(1)
                mleval `theta2' = `b', eq(2)
                local t "$ML_y1"        /* this is just for readability */
                local d "$ML_y2"
                tempvar p M R
                quietly gen double `p' = exp(`theta2')
                quietly gen double `M' = (`t'*exp(-`theta1'))^`p'
                quietly gen double `R' = ln(`t') - `theta1'
                mlsum `lnf' = -`M' + `d'*(`theta2'-`theta1' + (`p'-1)*`R')
        end

To estimate our model using this evaluator, we would type

        . ml model d0 weib0 (studytime died = drug2 drug3 age) /s
Technical Note

Method d0 does not require ln L = Σj ln ℓj, j = 1, ..., N, as method lf does. Your likelihood function might have independent components only for groups of observations. Panel data estimators have a log-likelihood value ln L = Σi ln Li, where i indexes the panels, each of which contains multiple observations. Conditional logistic regression has ln L = Σk ln Lk, where k indexes the risk pools. Cox regression has ln L = Σ(t) ln L(t), where (t) denotes the ordered failure times.

To evaluate such likelihood functions, first calculate the within-group log-likelihood contributions. This usually involves generate and replace statements prefixed with by, as in

        tempvar sumd
        by group: gen double `sumd' = sum($ML_y1)

Structure your code so that the log-likelihood contributions are recorded in the last observation of each group. Let's pretend that variable is named `cont'. To sum the contributions, code

        tempvar last
        quietly by group: gen byte `last' = (_n==_N)
        mlsum `lnf' = `cont' if `last'

It is of great importance that you inform mlsum as to the observations that contain log-likelihood values to be summed. First, you do not want to include intermediate results in the sum. Second, mlsum does not skip missing values. Rather, if mlsum sees a missing value among the contributions, it sets the overall result `lnf' to missing. That is how ml maximize is informed that the likelihood function could not be evaluated at the particular value of b. ml maximize will then take action to escape from what it thinks is an infeasible area of the likelihood function.

When the likelihood function violates the linear-form restriction ln L = Σj ln ℓj, j = 1, ..., N, with ln ℓj being a function solely of values within the jth observation, use method d0. In the following examples we will demonstrate methods d1 and d2 with likelihood functions that meet this linear-form restriction. The d1 and d2 methods themselves do not require the linear-form restriction, but the utility routines mlvecsum and mlmatsum do. Using method d1 or d2 when the restriction is violated is a difficult programming exercise.
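Pulling the pieces of this note together, a grouped-likelihood d0 evaluator might be laid out as follows. This is a schematic sketch only: group, the single-equation setup, and the `...' formula for the contributions stand in for model-specific code:

```stata
program define mygroup0
        version 7
        args todo b lnf
        tempvar theta1 cont last
        mleval `theta1' = `b', eq(1)
        sort group
        /* within-group contribution, recorded in the last observation */
        quietly by group: gen double `cont' = ...
        quietly by group: gen byte `last' = (_n==_N)
        mlsum `lnf' = `cont' if `last'
end
```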
Example

Method d1 evaluators are required to produce the gradient vector g = ∂ln L/∂b as well as the overall log-likelihood value. Using mlvecsum, we can obtain ∂ln L/∂b from ∂ln ℓj/∂θij, i = 1, ..., E. The derivatives of the Weibull log-likelihood function are

        ∂ln ℓj/∂θ1j = pj(Mj - dj)
        ∂ln ℓj/∂θ2j = dj - Rj pj(Mj - dj)

The method d1 evaluator for this is

        program define weib1
                version 7
                args todo b lnf g               /* g is new */
                tempvar t1 t2
                mleval `t1' = `b', eq(1)
                mleval `t2' = `b', eq(2)
                local t "$ML_y1"
                local d "$ML_y2"
                tempvar p M R
                quietly gen double `p' = exp(`t2')
                quietly gen double `M' = (`t'*exp(-`t1'))^`p'
                quietly gen double `R' = ln(`t') - `t1'
                mlsum `lnf' = -`M' + `d'*(`t2'-`t1' + (`p'-1)*`R')
                if `todo'==0 | `lnf'==. { exit }
                tempname d1 d2
                mlvecsum `lnf' `d1' = `p'*(`M'-`d'), eq(1)
                mlvecsum `lnf' `d2' = `d' - `R'*`p'*(`M'-`d'), eq(2)
                matrix `g' = (`d1',`d2')
        end
                                                  Number of obs   =        616
                                                  LR chi2(2)      =       9.62
                                                  Prob > chi2     =     0.0081
                                                  Pseudo R2       =     0.0086

------------------------------------------------------------------------------
      insure |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Indemnity    |
    nonwhite |  -.6608212   .2157321    -3.06   0.002    -1.083648   -.2379942
       _cons |   .1879149   .0937644     2.00   0.045     .0041401    .3716896
-------------+----------------------------------------------------------------
Uninsure     |
    nonwhite |  -.2828628   .3877302    -0.71   0.477      -1.0624    .4966741
       _cons |  -1.754019   .1805145    -9.72   0.000    -2.107821   -1.400217
------------------------------------------------------------------------------
(Outcome insure==Prepaid is the comparison group)
The basecategory() option requires that we specify the numeric value of the category, so we could not type basecategory(Prepaid).

Although the coefficients now appear to be different, note that the summary statistics reported at the top are identical. With this parameterization the probability of prepaid insurance for whites is
        Pr(insure = Prepaid) = 1/(1 + e^.188 + e^-1.754) = 0.420

This is the same answer we obtained previously.
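A quick check of this arithmetic (a Python sketch, not Stata output): with Prepaid as the base category, its probability for whites depends only on the two constants, since nonwhite = 0:

```python
import math

# rounded _cons values from the coefficient table above
b_indemnity = 0.188     # _cons in the Indemnity equation
b_uninsure = -1.754     # _cons in the Uninsure equation

# base-category probability in a multinomial logit
p_prepaid = 1 / (1 + math.exp(b_indemnity) + math.exp(b_uninsure))
```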
> Example

By specifying rrr, which we can do at estimation time or when we redisplay results, we see the model in terms of relative risk ratios:

. mlogit, rrr

Multinomial regression                            Number of obs   =        616
                                                  LR chi2(2)      =       9.62
                                                  Prob > chi2     =     0.0081
Log likelihood = -551.78348                       Pseudo R2       =     0.0086

------------------------------------------------------------------------------
      insure |        RRR   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Indemnity    |
    nonwhite |    .516427   .1114099    -3.06   0.002     .3383588    .7882073
-------------+----------------------------------------------------------------
Uninsure     |
    nonwhite |   .7536232   .2997387    -0.71   0.477     .3456254    1.643247
------------------------------------------------------------------------------
(Outcome insure==Prepaid is the comparison group)
Looked at this way, the relative risk of choosing an indemnity over a prepaid plan is 0.52 for nonwhites relative to whites.
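The rrr figures are simply the exponentiated coefficients from the earlier, coefficient-metric output; a Python sketch of the conversion (values transcribed from the output above):

```python
import math

# coefficients on nonwhite from the earlier coefficient-metric table
coef_indemnity_nonwhite = -0.6608212
coef_uninsure_nonwhite = -0.2828628

# relative risk ratios, as reported by mlogit, rrr
rrr_indemnity = math.exp(coef_indemnity_nonwhite)
rrr_uninsure = math.exp(coef_uninsure_nonwhite)
```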
> Example

One of the advantages of mlogit over tabulate is that continuous variables can be included in the model, as can multiple categorical variables. In examining the data on insurance choice, you decide you want to control for age, gender, and site of study (the study was conducted in three sites):

. mlogit insure age male nonwhite site2 site3

Iteration 0:   log likelihood = -555.85446
Iteration 1:   log likelihood = -534.72983
Iteration 2:   log likelihood = -534.36536
Iteration 3:   log likelihood = -534.36165
Iteration 4:   log likelihood = -534.36165

Multinomial regression                            Number of obs   =        615
                                                  LR chi2(10)     =      42.99
                                                  Prob > chi2     =     0.0000
Log likelihood = -534.36165                       Pseudo R2       =     0.0387

------------------------------------------------------------------------------
      insure |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Prepaid      |
         age |   -.011745   .0061946    -1.90   0.058    -.0238862    .0003962
        male |   .5616934   .2027465     2.77   0.006     .1643175    .9590693
    nonwhite |   .9747768   .2363213     4.12   0.000     .5115955    1.437958
       site2 |   .1130359   .2101903     0.54   0.591    -.2989296    .5250013
       site3 |  -.5879879   .2279351    -2.58   0.010    -1.034733   -.1412433
       _cons |   .2697127   .3284422     0.82   0.412    -.3740222    .9134476
-------------+----------------------------------------------------------------
Uninsure     |
         age |  -.0077961   .0114418    -0.68   0.496    -.0302217    .0146294
        male |   .4518496   .3674867     1.23   0.219     -.268411     1.17211
    nonwhite |   .2170589   .4256361     0.51   0.610    -.6171725     1.05129
       site2 |  -1.211563   .4705127    -2.57   0.010    -2.133751   -.2893747
       site3 |  -.2078123   .3662926    -0.57   0.570    -.9257327     .510108
       _cons |  -1.286943   .5923219    -2.17   0.030    -2.447872   -.1260135
------------------------------------------------------------------------------
(Outcome insure==Indemnity is the comparison group)
These results suggest that the inclination of nonwhites to choose prepaid care is even stronger than it was without controlling. We also see that subjects in site 2 are less likely to be uninsured.

Prediction can be used to aid interpretation. Continuing with our previously estimated insurance-choice model, we wish to describe the model's predictions by race. For this purpose, we can use the "method of recycled predictions", in which we vary characteristics of interest across the whole dataset and average the predictions. That is, we have data on both whites and nonwhites, and our individuals have other characteristics as well. We will first pretend that all the people in our data are white but hold their other characteristics constant. We then calculate the probabilities of each outcome. Next we will pretend that all the people in our data are nonwhite, still holding their other characteristics constant. Again we calculate the probabilities of each outcome. The difference in those two sets of calculated probabilities, then, is the difference due to race, holding other characteristics constant.
. gen byte nonwhold = nonwhite              /* save real race */
. replace nonwhite = 0                      /* make everyone white */
(126 real changes made)
. predict wpind, outcome(Indemnity)         /* predict probabilities */
(option p assumed; predicted probability)
(1 missing value generated)
. predict wpp, outcome(Prepaid)
(option p assumed; predicted probability)
(1 missing value generated)
. predict wpnoi, outcome(Uninsure)
(option p assumed; predicted probability)
(1 missing value generated)
. replace nonwhite = 1                      /* make everyone nonwhite */
(644 real changes made)
. predict nwpind, outcome(Indemnity)
(option p assumed; predicted probability)
(1 missing value generated)
. predict nwpp, outcome(Prepaid)
(option p assumed; predicted probability)
(1 missing value generated)
. predict nwpnoi, outcome(Uninsure)
(option p assumed; predicted probability)
(1 missing value generated)
. replace nonwhite = nonwhold               /* restore real race */
(518 real changes made)
. summarize wp* nwp*

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+-------------------------------------------------------
       wpind |     643    .5141673    .0872679   .3092903     .71939
         wpp |     643    .4082052    .0943286   .1964103   .6502247
       wpnoi |     643    .0776275    .0360283   .0273596   .1302816
      nwpind |     643    .3112809    .0817693   .1511329    .535021
        nwpp |     643     .630078    .0959976   .3871782   .8278881
      nwpnoi |     643    .0586411    .0287185   .0209648   .0933874
Earlier in this entry we presented a cross-tabulation of insurance type and race. Those values were unadjusted. The means reported above are the values adjusted for age, sex, and site. Combining the results gives

                   Unadjusted             Adjusted
                white   nonwhite      white   nonwhite
   Indemnity     .51       .36         .52       .31
   Prepaid       .42       .57         .41       .63
   Uninsured     .07       .07         .08       .06

We find, for instance, that while 57% of nonwhites in our data had prepaid plans, after adjusting for age, sex, and site, 63% of nonwhites choose prepaid plans.
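The recycled-predictions logic can be sketched outside Stata as well. Below is an illustrative Python fragment (not Stata code) that uses the estimated coefficients transcribed from the mlogit output above to compute outcome probabilities for one hypothetical covariate pattern — a 30-year-old female in site 1, an assumption made purely for illustration — under both race settings; the manual's example applies the same switch to every observation and then averages:

```python
import math

# coefficients transcribed from the mlogit output (base outcome: Indemnity)
prepaid = {"age": -0.011745, "male": 0.5616934, "nonwhite": 0.9747768,
           "site2": 0.1130359, "site3": -0.5879879, "_cons": 0.2697127}
uninsure = {"age": -0.0077961, "male": 0.4518496, "nonwhite": 0.2170589,
            "site2": -1.211563, "site3": -0.2078123, "_cons": -1.286943}

def probs(x):
    """Multinomial logit probabilities for one covariate pattern x."""
    xb = [0.0]  # Indemnity is the base outcome, so its linear index is 0
    for eq in (prepaid, uninsure):
        xb.append(sum(eq[k] * v for k, v in x.items()) + eq["_cons"])
    denom = sum(math.exp(v) for v in xb)
    return [math.exp(v) / denom for v in xb]  # [Indemnity, Prepaid, Uninsure]

# hypothetical person: age 30, female, site 1 (site2 = site3 = 0)
x = {"age": 30, "male": 0, "site2": 0, "site3": 0}
p_white = probs(dict(x, nonwhite=0))
p_nonwhite = probs(dict(x, nonwhite=1))
```

The difference p_nonwhite - p_white, averaged over the real covariate patterns in the data, is exactly the race effect the recycled-predictions exercise reports.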
Technical Note

Classification of predicted values, followed by comparison of the classifications with the observed outcomes, is a second way predicted values can help interpret a multinomial logit model. This is a variation on the notions of sensitivity and specificity for logistic regression. Here, we will adopt a three-part classification with respect to indemnity and prepaid: definitely predicting indemnity, definitely predicting prepaid, and ambiguous.
. predict indem, outcome(Indemnity) index            /* obtain indexes */
(1 missing value generated)
. predict prepaid, outcome(Prepaid) index
(1 missing value generated)
. gen diff = prepaid-indem                           /* obtain difference */
(1 missing value generated)
. predict sediff, outcome(Indemnity,Prepaid) stddp   /* & its standard error */
(1 missing value generated)
. gen type = 1 if diff/sediff < -1.96                /* definitely indemnity */
(504 missing values generated)
. replace type = 3 if diff/sediff > 1.96 & diff/sediff!=.   /* definitely prepaid */
(100 real changes made)
. replace type = 2 if type==. & diff/sediff!=.       /* ambiguous values */
(404 real changes made)
. label define type 1 "Def Ind" 2 "Ambiguous" 3 "Def Prep"  /* label results */
. label values type type
. tabulate insure type

              |               type
       insure |   Def Ind  Ambiguous   Def Prep |     Total
--------------+---------------------------------+----------
    Indemnity |        78        183         33 |       294
      Prepaid |        44        177         56 |       277
     Uninsure |        12         28          5 |        45
--------------+---------------------------------+----------
        Total |       134        388         94 |       616
One substantive point learned by this exercise is that the predictive power of this model is modest. There are a substantial number of misclassifications in both directions, though there are more correctly classified observations than misclassified observations. A second interesting point is that the uninsured look overwhelmingly as though they might have come from the indemnity system rather than the prepaid system.
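The three-way rule used above can be restated compactly. Here is an illustrative Python version of the classification (the 1.96 cutoff and the type labels mirror the Stata commands; diff/sediff plays the role of a z statistic):

```python
def classify(diff, sediff, cutoff=1.96):
    """Classify one observation from the index difference
    (prepaid - indemnity) and its standard error."""
    if diff is None or sediff is None or sediff == 0:
        return None               # cannot classify
    z = diff / sediff
    if z < -cutoff:
        return "Def Ind"          # definitely indemnity
    if z > cutoff:
        return "Def Prep"         # definitely prepaid
    return "Ambiguous"
```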
Testing hypotheses about coefficients

> Example

Hypotheses about the coefficients are tested with test just as they are after any estimation command; see [R] test. The only important point to note is test's syntax for dealing with multiple-equation models. You are warned that test bases its results on the estimated covariance matrix and that a likelihood-ratio test may be preferred; see Estimating constrained models below for an example of lrtest.
If one simply lists variables after the test command, one is testing that the corresponding coefficients are zero across all equations:

. test site2 site3

 ( 1)  [Prepaid]site2 = 0.0
 ( 2)  [Uninsure]site2 = 0.0
 ( 3)  [Prepaid]site3 = 0.0
 ( 4)  [Uninsure]site3 = 0.0

           chi2(  4) =    19.74
         Prob > chi2 =    0.0006

One can test that all the coefficients (except the constant) in a single equation are zero by simply typing the outcome in square brackets:
. test [Uninsure]

 ( 1)  [Uninsure]age = 0.0
 ( 2)  [Uninsure]male = 0.0
 ( 3)  [Uninsure]nonwhite = 0.0
 ( 4)  [Uninsure]site2 = 0.0
 ( 5)  [Uninsure]site3 = 0.0

           chi2(  5) =     9.31
         Prob > chi2 =    0.0973

Specification of the outcome is just as with predict; you can specify the label if the outcome variable is labeled, or you can specify the numeric value of the outcome. We would have obtained the same test as above had we typed test [3], since 3 is the value of insure for the outcome uninsured.
The two syntaxes can be combined. To test that the coefficients on the site variables are 0 in the equation corresponding to the outcome prepaid, we can type

. test [Prepaid]: site2 site3

 ( 1)  [Prepaid]site2 = 0.0
 ( 2)  [Prepaid]site3 = 0.0

           chi2(  2) =    10.78
         Prob > chi2 =    0.0046
We specified the outcome and then followed that with a colon and the variables we wanted to test.

We can also test that coefficients are equal across equations. To test that all coefficients except the constant are equal for the prepaid and uninsured outcomes:

. test [Prepaid=Uninsure]

 ( 1)  [Prepaid]age - [Uninsure]age = 0.0
 ( 2)  [Prepaid]male - [Uninsure]male = 0.0
 ( 3)  [Prepaid]nonwhite - [Uninsure]nonwhite = 0.0
 ( 4)  [Prepaid]site2 - [Uninsure]site2 = 0.0
 ( 5)  [Prepaid]site3 - [Uninsure]site3 = 0.0

           chi2(  5) =    13.80
         Prob > chi2 =    0.0169
To test that only the site variables are equal:

. test [Prepaid=Uninsure]: site2 site3

 ( 1)  [Prepaid]site2 - [Uninsure]site2 = 0.0
 ( 2)  [Prepaid]site3 - [Uninsure]site3 = 0.0

           chi2(  2) =    12.68
         Prob > chi2 =    0.0018
Finally, we can test any arbitrary constraint by simply entering the equation, specifying the coefficients as described in [U] 16.5 Accessing coefficients and standard errors. The following hypothesis is senseless but illustrates the point:

. test ([Prepaid]age+[Uninsure]site2)/2 = 2-[Uninsure]nonwhite

 ( 1)  .5 [Prepaid]age + [Uninsure]nonwhite + .5 [Uninsure]site2 = 2.0

           chi2(  1) =    22.45
         Prob > chi2 =    0.0000
Please see [R] test for more information on test. All that is said there about combining hypotheses across test commands (the accum option) is relevant after mlogit.
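For a single linear constraint, the Wald chi-squared that test reports is simply the square of the corresponding z statistic from the estimation output. A quick Python check using the [Prepaid]site3 estimate (values transcribed from the output, so agreement is to rounding):

```python
# coefficient and standard error on site3 in the Prepaid equation
b, se = -0.5879879, 0.2279351

z = b / se            # the z statistic mlogit reports (about -2.58)
wald_chi2 = z ** 2    # the chi2(1) statistic test would report for b = 0
```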
Estimating constrained models

mlogit can estimate models with subsets of coefficients constrained to be zero, with subsets of coefficients constrained to be equal both within and across equations, and with subsets of coefficients arbitrarily constrained to equal linear combinations of other estimated coefficients.

Before estimating a constrained model, you define the constraints using the constraint command; see [R] constraint. Constraints are numbered, and the syntax for specifying a constraint is exactly the same as the syntax for testing constraints; see Testing hypotheses about coefficients above. Once the constraints are defined, you estimate using mlogit, specifying the constraint() option. Typing constraint(4) would use the constraint you previously saved as 4. Typing constraint(1,4,6) would use the previously stored constraints 1, 4, and 6. Typing constraint(1-4,6) would use the previously stored constraints 1, 2, 3, 4, and 6.

Sometimes, you will not be able to specify the constraints without knowledge of the omitted group. In such cases, assume the omitted group is whatever group is convenient for you and include the basecategory() option when you type the mlogit command.
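The constraint() option accepts numeric lists like 1-4,6. The expansion rule can be sketched in Python (an illustrative helper, not part of Stata):

```python
def expand_constraint_list(spec):
    """Expand a Stata-style numeric list such as "1-4,6"
    into the constraint numbers it denotes."""
    numbers = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = (int(p) for p in part.split("-"))
            numbers.extend(range(lo, hi + 1))   # a-b means a, a+1, ..., b
        else:
            numbers.append(int(part))
    return numbers
```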
> Example

Among other things, constraints can be used as a means of hypothesis testing. In our insurance-choice model, we tested the hypothesis that there is no distinction between having indemnity insurance and being uninsured. We did this with the test command. Indemnity-style insurance was the omitted group, so we typed

. test [Uninsure]

 ( 1)  [Uninsure]age = 0.0
 ( 2)  [Uninsure]male = 0.0
 ( 3)  [Uninsure]nonwhite = 0.0
 ( 4)  [Uninsure]site2 = 0.0
 ( 5)  [Uninsure]site3 = 0.0

           chi2(  5) =     9.31
         Prob > chi2 =    0.0973
(Had indemnity not been the omitted group, we would have typed test [Uninsure=Indemnity].)

The results produced by test are based on the estimated covariance matrix of the coefficients, that is, an approximation. Since the probability of being uninsured is quite low, the log likelihood may be nonlinear for the uninsured. Conventional statistical wisdom is not to trust the asymptotic answer under these circumstances, but to perform a likelihood-ratio test instead. Stata has a likelihood-ratio test command, lrtest; to use it we must estimate both the unconstrained and the constrained models. The unconstrained model is what we have previously estimated. Following the instruction in [R] lrtest, we first save the unconstrained model results:

. lrtest, saving(0)
To estimate the constrained model, we must re-estimate our model with all the coefficients except the constant set to 0 in the Uninsure equation. We define the constraint and then re-estimate:

. constraint define 1 [Uninsure]
. mlogit insure age male nonwhite site2 site3, constr(1)
 ( 1)  [Uninsure]age = 0.0
 ( 2)  [Uninsure]male = 0.0
 ( 3)  [Uninsure]nonwhite = 0.0
 ( 4)  [Uninsure]site2 = 0.0
 ( 5)  [Uninsure]site3 = 0.0

Iteration 0:   log likelihood = -555.85446
Iteration 1:   log likelihood = -539.80523
Iteration 2:   log likelihood = -539.75644
Iteration 3:   log likelihood = -539.75643

Multinomial regression                            Number of obs   =        615
                                                  LR chi2(5)      =      32.20
                                                  Prob > chi2     =     0.0000
Log likelihood = -539.75643                       Pseudo R2       =     0.0290

------------------------------------------------------------------------------
      insure |      Coef.   Std. Err.       z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
Prepaid      |
         age |  -.0107025   .0060039    -1.78   0.075    -.0224699    .0010649
        male |   .4963616   .1939683     2.56   0.010     .1161908    .8765324
    nonwhite |    .942137   .2252094     4.18   0.000     .5007347    1.383539
       site2 |   .2530912   .2029465     1.25   0.212    -.1446767    .6508591
       site3 |  -.5521774   .2187237    -2.52   0.012    -.9808678   -.1234869
       _cons |   .1792752   .3171372     0.57   0.572    -.4423023    .8008527
-------------+----------------------------------------------------------------
Uninsure     |
         age |  (dropped)
        male |  (dropped)
    nonwhite |  (dropped)
       site2 |  (dropped)
       site3 |  (dropped)
       _cons |   -1.87351   .1601099   -11.70   0.000     -2.18732     -1.5597
------------------------------------------------------------------------------
(Outcome insure==Indemnity is the comparison group)
We can now perform the likelihood-ratio test:

. lrtest
Mlogit:  likelihood-ratio test                    chi2(5)      =      10.79
                                                  Prob > chi2  =     0.0557

The likelihood-ratio chi-squared is 10.79 with 5 degrees of freedom, just slightly greater than the magic .05 level. Thus, we should not call this difference significant.
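The likelihood-ratio statistic is twice the difference between the unconstrained and constrained log likelihoods reported above; the arithmetic is easily verified (a Python check on the transcribed values):

```python
ll_unconstrained = -534.36165   # from the unconstrained mlogit run
ll_constrained = -539.75643     # from the run with constraint 1 imposed

# LR chi-squared with 5 degrees of freedom (5 constrained coefficients)
lr_chi2 = 2 * (ll_unconstrained - ll_constrained)
```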
Technical Note

In certain circumstances, a multinomial logit model should be estimated with conditional logit; see [R] clogit. With substantial data manipulation, clogit is capable of handling the same class of models with some interesting additions. For example, if we had available the price and deductible of the most competitive insurance plan of each type, this information could not be used by mlogit but could be incorporated by clogit.
Saved Results

mlogit saves in e():

Scalars
    e(N)            number of observations
    e(k_cat)        number of categories
    e(df_m)         model degrees of freedom
    e(r2_p)         pseudo R-squared
    e(ll)           log likelihood
    e(ll_0)         log likelihood, constant-only model
    e(N_clust)      number of clusters
    e(chi2)         chi-squared
    e(ibasecat)     base category number
    e(basecat)      the value of depvar that is to be treated as the base category

Macros
    e(cmd)          mlogit
    e(depvar)       name of dependent variable
    e(wtype)        weight type
    e(wexp)         weight expression
    e(clustvar)     name of cluster variable
    e(vcetype)      covariance estimation method
    e(chi2type)     Wald or LR; type of model chi-squared test
    e(predict)      program used to implement predict

Matrices
    e(b)            coefficient vector
    e(V)            variance-covariance matrix of the estimators
    e(cat)          category values

Functions
    e(sample)       marks estimation sample
Methods and Formulas

The model for multinomial logit is

        Pr(Y_i = k) = exp(x_i b^(k)) / Σ_{j=1}^{m} exp(x_i b^(j))

where b^(j) is the coefficient vector for outcome j and b is set to 0 for the base category. This model is described in Greene (2000, chapter 19). Newton-Raphson maximum likelihood is used; see [R] maximize.
In the case of constrained equations, the set of constraints is orthogonalized and a subset of maximizable parameters is selected. For example, a parameter that is constrained to zero is not a maximizable parameter. If two parameters are constrained to be equal to each other, only one is a maximizable parameter.
Let r be the vector of maximizable parameters. Note that r is physically a subset of the solution parameters b. A matrix T and a vector m are defined

        b = T r + m

with the consequence that

        df/dr = (df/db) T,        d2f/dr2 = T' (d2f/db2) T

T consists of a block form in which one part is a permutation of the identity matrix and the other part describes how to calculate the constrained parameters from the maximizable parameters.
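As a small illustration of this reparameterization (a hypothetical three-parameter example, not taken from the manual): with the constraints b1 = 0 and b2 = b3, the only maximizable parameter is r1, and b = Tr + m with T = (0, 1, 1)' and m = 0:

```python
# b = T r + m for the hypothetical constraints b1 = 0 and b2 = b3
T = [[0.0],   # b1 does not depend on r1 (it is fixed at 0 via m)
     [1.0],   # b2 = r1
     [1.0]]   # b3 = r1
m = [0.0, 0.0, 0.0]

def to_b(r):
    """Map maximizable parameters r to solution parameters b."""
    return [sum(T[i][j] * r[j] for j in range(len(r))) + m[i]
            for i in range(len(T))]

b = to_b([2.5])   # one maximizable parameter determines all three b's
```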
References

Aldrich, J. H. and F. D. Nelson. 1984. Linear Probability, Logit, and Probit Models. Newbury Park, CA: Sage Publications.

Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.

Hamilton, L. C. 1993. sqv8: Interpreting multinomial logistic regression. Stata Technical Bulletin 13: 24-28. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 176-181.

Hendrickx, J. 2000. sbe37: Special restrictions in multinomial logistic regression. Stata Technical Bulletin 56: 18-26.

Hosmer, D. W., Jr., and S. Lemeshow. 1989. Applied Logistic Regression. New York: John Wiley & Sons. (Second edition forthcoming in 2001.)

Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.

Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.

Tarlov, A. R., J. E. Ware, Jr., S. Greenfield, E. C. Nelson, E. Perrin, and M. Zubkoff. 1989. The medical outcomes study. Journal of the American Medical Association 262: 925-930.

Wells, K. B., R. D. Hays, M. A. Burnam, W. H. Rogers, S. Greenfield, and J. E. Ware, Jr. 1989. Detection of depressive disorder for patients receiving prepaid or fee-for-service care. Journal of the American Medical Association 262: 3298-3302.
Also See

Complementary:   [R] adjust, [R] constraint, [R] lincom, [R] lrtest, [R] mfx, [R] predict,
                 [R] test, [R] testnl, [R] xi

Related:         [R] clogit, [R] logistic, [R] nlogit, [R] ologit, [R] svy estimators

Background:      [U] 16.5 Accessing coefficients and standard errors,
                 [U] 23 Estimation and post-estimation commands,
                 [U] 23.11 Obtaining robust variance estimates,
                 [U] 23.12 Obtaining scores,
                 [R] maximize
Title

more -- The --more-- message

Syntax

        set more { on | off }

        set pagesize #

Description

set more on, which is the default, tells Stata to wait until a key is pressed before continuing when a --more-- message is displayed.

set more off tells Stata not to pause or display the --more-- message.

set pagesize # sets the number of lines between --more-- messages.
Remarks

When you see --more-- at the bottom of the screen,

        Press                           and Stata ...
        ---------------------------------------------------------------
        letter l or Enter               displays the next line
        letter q                        acts as if you pressed Break
        space bar or any other key      displays the next screen

In addition, you can press the More button, or click on --more--, to display the next screen.

--more-- is Stata's way of telling you it has something more to show you, but that showing you that something more will cause the information on the screen to scroll off. If you type set more off, --more-- conditions will never arise and Stata's output will scroll by at full speed. If you type set more on, --more-- conditions will be restored at the appropriate places.

Programmers should see [P] more for information on the more programming command.

Also See

Complementary:   [R] query, [P] more

Background:      [U] 10 --more-- conditions
Title

mvencode -- Change missing to coded missing value and vice versa

Syntax

        mvencode varlist [if exp] [in range] , mv(#) [ override ]

        mvdecode varlist [if exp] [in range] , mv(#)

Description

mvencode changes all occurrences of missing to # in the specified varlist.

mvdecode changes all occurrences of # to missing in the specified varlist.

Options

mv(#) specifies the numeric value to which or from which missing is to be changed and is not optional.

override specifies that the protection provided by mvencode is to be overridden. Without this option, mvencode refuses to make the requested change if # is already used in the data.

Remarks

One occasionally reads data where missing (e.g., failed to answer a survey question, or the data were not collected, or whatever) is coded with a special numeric value. Popular codings are 9, 99, 999, -99, and the like. If missing were encoded as -99,

        . mvdecode _all, mv(-99)

would translate the special code to the Stata missing value. Use this command cautiously since, even if -99 were not a special code, all -99's in the data would be changed to missing.

Conversely, one occasionally needs to export data to software that does not understand that '.' is Stata's missing value, so one codes missing with a special numeric value. To change all missings to -99:

        . mvencode _all, mv(-99)

mvencode is smart: it will automatically recast variables upward if necessary, so even if a variable is stored as a byte, its missing values can be recoded to, say, 999. In addition, mvencode refuses to make the change if # (-99 in this case) is already used in the data, so you can be certain that your coding is unique. You can override this feature by including the override option.

> Example

Our automobile dataset (described in [U] 9 Stata's on-line tutorials and sample datasets) contains 74 observations and 12 variables. Let us first attempt to translate the missing values in the data to 1:
. mvencode _all, mv(1)
   make: string variable ignored
  rep78: already 1 in 2 observations
foreign: already 1 in 22 observations
no action taken
r(9);

Our attempt failed. mvencode first informed us that make is a string variable--this is not a problem but is reported merely for our information. String variables are ignored by mvencode. It next informed us that rep78 already was coded 1 in 2 observations and that foreign was already coded 1 in 22 observations. Thus, 1 would be a poor choice for encoding missing values because, after encoding, you could not tell a real 1 from a coded missing value 1. We could force mvencode to encode the data with 1 anyway by typing mvencode _all, mv(1) override, and that would be appropriate if the 1s in our data already represented missing data. They do not, however, and we will code missing as 999:

. mvencode _all, mv(999)
   make: string variable ignored
  rep78: 5 missing values

This worked, and we are informed that the only changes necessary were to 5 observations of rep78.
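The protection rule mvencode enforces — refuse when the requested code already occurs in the data, unless override is given — can be sketched in Python (an illustrative analogy, with None again standing in for Stata's missing value):

```python
def mvencode(values, mv, override=False):
    """Replace missing (None) with the code mv, refusing if mv
    already occurs in the data unless override is requested."""
    if not override and mv in values:
        raise ValueError(f"{mv} already used in the data; "
                         "specify override to encode anyway")
    return [mv if v is None else v for v in values]
```

The refusal is what guarantees that, after encoding, a coded missing value can always be distinguished from a real value.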
Example Let us now pretend that we just read in the automobile data from some raw dataset where all the missing values were coded 999. We can convert the 999's to real missings by typing • mvdecode
_all,
mv(999)
make : string variable ignored rep78:5 missing values
We are informed that make is a string variable and so was ignored and that rep78 observations with 999. Those observations have now been changed to contain missing.
contained
5 q
Methods and Formulas mvencode and mvdecode are implemented
Also See Related:
[R] generate, [R] recode
as ado-files.
Title

mvreg -- Multivariate regression

Syntax

        mvreg depvarlist = varlist [weight] [if exp] [in range] [, noconstant corr
                noheader notable level(#) ]

by ... : may be used with mvreg; see [R] by.

aweights and fweights are allowed; see [U] 14.1.6 weight.

mvreg shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

Syntax for predict

        predict [type] newvarname [if exp] [in range]
                [, { xb | stdp | residuals | difference | stddp } equation(eqno[,eqno]) ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.

Description

mvreg estimates multivariate regression models.

Options

noconstant omits the constant term from the estimation.

corr displays the correlation matrix of the residuals between the equations.

noheader suppresses display of the table reporting F statistics, R-squared, and root mean square error above the coefficient table.

notable suppresses display of the coefficient table.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

Options for predict

equation(eqno[,eqno]) specifies to which equation you are referring. equation() is filled in with one eqno for options xb, stdp, and residuals. equation(#1) would mean the calculation is to be made for the first equation, equation(#2) would mean the second, and so on. Alternatively, you could refer to the equations by their names: equation(income) would refer to the equation named income and equation(hours) to the equation named hours.

If you do not specify equation(), results are as if you specified equation(#1).

difference and stddp refer to between-equation concepts. To use these options, you must specify two equations; e.g., equation(#1,#2) or equation(income,hours). When two equations must be specified, equation() is not optional. With equation(#1,#2), difference computes the prediction of equation(#1) minus the prediction of equation(#2).

xb, the default, calculates the fitted values--the prediction of x_j b for the specified equation.

stdp calculates the standard error of the prediction for the specified equation. It can be thought of as the standard error of the predicted expected value or mean for the observation's covariate pattern. This is also referred to as the standard error of the fitted value.

residuals calculates the residuals.

difference calculates the difference between the linear predictions of two equations in the system.

stddp is allowed only after you have previously estimated a multiple-equation model. The standard error of the difference in linear predictions (x_1j b - x_2j b) between equations 1 and 2 is calculated.

For more information on using predict after multiple-equation estimation commands, see [R] predict.
Remarks

Multivariate regression differs from multiple regression in that several dependent variables are jointly regressed on the same independent variables. Multivariate regression is related to Zellner's seemingly unrelated regression (see [R] sureg) but, since the same set of independent variables is used for each dependent variable, the syntax is simpler and the calculations faster.

The individual coefficients and standard errors produced by mvreg are identical to those that would be produced by regress estimating each equation separately. The difference is that mvreg, being a joint estimator, also estimates the between-equation covariances, so you can test coefficients across equations and, in fact, test's syntax makes such tests more convenient.
> Example

Using the automobile data, we estimate a multivariate regression for "space" variables (headroom, trunk, and turn) in terms of a set of other variables including three "performance" variables (displacement, gear_ratio, and mpg):

. mvreg headroom trunk turn = price mpg displ gear_ratio length weight

Equation          Obs  Parms        RMSE    "R-sq"          F        P
----------------------------------------------------------------------
headroom           74      7    .7390205    0.2996   4.777213   0.0004
trunk              74      7    3.052314    0.5328    12.7265   0.0000
turn               74      7    2.132377    0.7844   40.62042   0.0000
------------------------------------------------------------------------------
             |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
headroom     |
       price |  -.0000528    .000038    -1.39   0.168    -.0001286    .0000229
         mpg |  -.0093774   .0260463    -0.36   0.720     -.061366    .0426112
displacement |   .0031025   .0024999     1.24   0.219    -.0018873    .0080922
  gear_ratio |   .2108071   .3539588     0.60   0.553    -.4956976    .9173118
      length |    .015886    .012944     1.23   0.224    -.0099504    .0417223
      weight |  -.0000868   .0004724    -0.18   0.855    -.0010296    .0008561
       _cons |  -.4525117   2.170073    -0.21   0.835    -4.783995    3.878972
-------------+----------------------------------------------------------------
trunk        |
       price |   .0000445   .0001567     0.28   0.778    -.0002684    .0003573
         mpg |  -.0220919   .1075767    -0.21   0.838    -.2368159    .1926322
displacement |   .0032118   .0103251     0.31   0.757    -.0173971    .0238207
  gear_ratio |  -.2271321   1.461926    -0.16   0.877    -3.145149    2.690885
      length |    .170811   .0534615     3.20   0.002     .0641014    .2775206
      weight |  -.0015944    .001951    -0.82   0.417    -.0054885    .0022997
       _cons |  -13.28253   8.962868    -1.48   0.143    -31.17249    4.607429
-------------+----------------------------------------------------------------
turn         |
       price |  -.0002647   .0001095    -2.42   0.018    -.0004833   -.0000462
         mpg |  -.0492948   .0751542    -0.66   0.514    -.1993031    .1007136
displacement |   .0036977   .0072132     0.51   0.610    -.0106999    .0180953
  gear_ratio |  -.1048432   1.021316    -0.10   0.919    -2.143399    1.933712
      length |    .072128   .0373487     1.93   0.058    -.0024204    .1466764
       _cons |   20.19157   6.261549     3.22   0.002     7.693467    32.68967
------------------------------------------------------------------------------
We should have specified the corr option so that we would also see the correlations between the residuals of the equations. We can correct our omission because mvreg--like all estimation commands--typed without arguments redisplays results. The noheader and notable (read "no-table") options suppress redisplaying the output we have already seen:

. mvreg, notable noheader corr

Correlation matrix of residuals:

            headroom     trunk      turn
  headroom    1.0000
     trunk    0.4986    1.0000
      turn   -0.1090   -0.0628    1.0000

Breusch-Pagan test of independence: chi2(3) = 19.566, Pr = 0.0002

The Breusch-Pagan test is significant, so the residuals of these three space variables are not independent of each other.

The three performance variables among our independent variables are mpg, displacement, and gear_ratio. We can jointly test the significance of these three variables in all the equations by typing
. test mpg displacement gear_ratio

 ( 1)  [headroom]mpg = 0.0
 ( 2)  [trunk]mpg = 0.0
 ( 3)  [turn]mpg = 0.0
 ( 4)  [headroom]displacement = 0.0
 ( 5)  [trunk]displacement = 0.0
 ( 6)  [turn]displacement = 0.0
 ( 7)  [headroom]gear_ratio = 0.0
 ( 8)  [trunk]gear_ratio = 0.0
 ( 9)  [turn]gear_ratio = 0.0

       F(  9,    67) =     0.33
            Prob > F =   0.9622
These three variables are not, as a group, significant. We might have suspected this from their individual significance in the individual regressions, but this multivariate test provides an overall assessment with a single p-value.

We can also perform a test for the joint significance of all three equations:

    . test [headroom]
     (output omitted)
    . test [trunk], accum
     (output omitted)
    . test [turn], accum

     ( 1)  [headroom]price = 0.0
     ( 2)  [headroom]mpg = 0.0
     ( 3)  [headroom]displacement = 0.0
     ( 4)  [headroom]gear_ratio = 0.0
     ( 5)  [headroom]length = 0.0
     ( 6)  [headroom]weight = 0.0
     ( 7)  [trunk]price = 0.0
     ( 8)  [trunk]mpg = 0.0
     ( 9)  [trunk]displacement = 0.0
     (10)  [trunk]gear_ratio = 0.0
     (11)  [trunk]length = 0.0
     (12)  [trunk]weight = 0.0
     (13)  [turn]price = 0.0
     (14)  [turn]mpg = 0.0
     (15)  [turn]displacement = 0.0
     (16)  [turn]gear_ratio = 0.0
     (17)  [turn]length = 0.0
     (18)  [turn]weight = 0.0

           F( 18,    67) =   19.34
                Prob > F =    0.0000
The set of variables as a whole is strongly significant. We might have suspected this, too, from the individual equations.
Technical Note
The mvreg command provides a good way to deal with multiple comparisons. If we wanted to assess the effect of length, we might be dissuaded from interpreting any of its coefficients except that in the trunk equation. [trunk]length, the coefficient on length in the trunk equation, has a p-value of .002, but in the remaining two equations it has p-values of only .224 and .058. A conservative statistician might argue that there are 18 tests of significance in mvreg's output (not counting those for the intercepts), so p-values above .05/18 = .0028 should be declared insignificant
at the 5% level.

A more aggressive but, in our opinion, reasonable approach would be first to note that the three equations are jointly significant, so we are justified in making some interpretation. Then we would work through the individual variables using test, possibly using .05/6 = .0083 (6 because there are 6 independent variables) for the 5% significance level. For instance, examining length:
    . test length

     ( 1)  [headroom]length = 0.0
     ( 2)  [trunk]length = 0.0
     ( 3)  [turn]length = 0.0

           F(  3,    67) =    4.94
                Prob > F =    0.0037
The reported significance level of .0037 is less than .0083, so we will declare this variable significant. [trunk]length is certainly significant with its p-value of .002, but what about the remaining two equations with p-values .224 and .058? Performing a joint test:

    . test [headroom]length [turn]length

     ( 1)  [headroom]length = 0.0
     ( 2)  [turn]length = 0.0

           F(  2,    67) =    2.91
                Prob > F =    0.0613

At this point, reasonable statisticians could disagree. The .06 significance value suggests no interpretation, but these were the two least-significant values out of three, so one would expect the p-value to be a little high. Perhaps an equivocal statement is warranted: there seems to be an effect, but chance cannot be excluded.
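The arithmetic behind the two multiple-comparison cutoffs used above is easy to verify directly. The short Python sketch below (not Stata code; the variable names are ours) recomputes them:

```python
# Bonferroni-style thresholds from the discussion above.
n_coef_tests = 18        # 6 covariates x 3 equations, intercepts not counted
n_joint_tests = 6        # one joint test per independent variable

conservative = 0.05 / n_coef_tests   # per-coefficient cutoff
aggressive = 0.05 / n_joint_tests    # per-variable cutoff

print(round(conservative, 4))   # 0.0028
print(round(aggressive, 4))     # 0.0083

# The joint F(3,67) test of length reported p = .0037:
assert 0.0037 < aggressive      # significant under the aggressive rule
assert 0.002 < conservative     # [trunk]length survives even the strict rule
```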
Saved Results

mvreg saves in e():

Scalars
    e(N)          number of observations
    e(k)          number of parameters (including constant)
    e(k_eq)       number of equations
    e(df_r)       residual degrees of freedom
    e(chi2)       Breusch-Pagan chi-squared (corr only)
    e(df_chi2)    degrees of freedom for Breusch-Pagan chi-squared (corr only)

Macros
    e(cmd)        mvreg
    e(eqnames)    names of equations
    e(r2)         R-squared for each equation
    e(rmse)       RMSE for each equation
    e(F)          F statistic for each equation
    e(p_F)        significance of F for each equation
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators
    e(Sigma)      Sigma-hat matrix

Functions
    e(sample)     marks estimation sample
Methods and Formulas

mvreg is implemented as an ado-file.

Given q equations and p independent variables (including the constant), the parameter estimates are given by the p x q matrix

    B = (X'WX)^(-1) X'WY

where Y is an n x q matrix of dependent variables and X is an n x p matrix of independent variables. W is a weighting matrix equal to I if no weights are specified. If weights are specified, let v: 1 x n be the specified weights. If fweight frequency weights are specified, W = diag(v). If aweight analytic weights are specified, W = diag{v(1'1)/(1'v)}, which is to say, the weights are normalized to sum to the number of observations.

The residual covariance matrix is

    R = {Y'WY - B'(X'WX)B} / (n - p)
The estimated covariance matrix of the estimates is R (Kronecker product) (X'WX)^(-1). These results are identical to those produced by sureg when the same list of independent variables is specified repeatedly; see [R] sureg.

The Breusch and Pagan (1980) chi-squared statistic, a Lagrange multiplier statistic, is given by

    lambda = n * sum_{i=2}^{q} sum_{j=1}^{i-1} r_ij^2

where r_ij is the estimated correlation between the residuals of the ith and jth equations and n is the number of observations. It is distributed as chi-squared with q(q - 1)/2 degrees of freedom.
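As a numerical check of this formula, the statistic reported earlier in this entry can be recomputed from the displayed residual correlations. This is a pure-Python sketch, not Stata code; the closed-form chi-squared tail for 3 degrees of freedom is our own addition, not part of mvreg:

```python
import math

r = [0.4986, -0.1090, -0.0628]   # r21, r31, r32 from the mvreg corr output
n, q = 74, 3                     # observations and equations (auto data)

lam = n * sum(rij ** 2 for rij in r)   # lambda = n * sum of squared corrs
df = q * (q - 1) // 2                  # q(q-1)/2 = 3 degrees of freedom

def chi2_sf_3df(x):
    # survival function of chi-squared with 3 df (closed form for odd df)
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

print(round(lam, 2), df)           # roughly 19.57 on 3 df, cf. chi2(3) = 19.566
print(round(chi2_sf_3df(lam), 4))  # cf. the reported Pr = 0.0002
```

The small discrepancy from 19.566 comes from using the correlations as rounded in the display.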
References

Breusch, T. and A. Pagan. 1980. The LM test and its applications to model specification in econometrics. Review of Economic Studies 47: 239-254.
Also See

Complementary:   [R] adjust, [R] lincom, [R] mfx, [R] predict, [R] test, [R] testnl, [R] vce

Related:         [R] reg3, [R] regress, [R] regression diagnostics, [R] sureg

Background:      [U] 16.5 Accessing coefficients and standard errors,
                 [U] 23 Estimation and post-estimation commands
Title

    nbreg -- Negative binomial regression
Syntax

    nbreg depvar [indepvars] [weight] [if exp] [in range] [,
        dispersion({mean | constant}) level(#) irr exposure(varname) offset(varname)
        robust cluster(varname) score(newvars) noconstant constraints(numlist)
        nolrtest nolog maximize_options ]

    gnbreg depvar [indepvars] [weight] [if exp] [in range] [, lnalpha(varlist) level(#)
        irr exposure(varname) offset(varname) robust cluster(varname) score(newvars)
        noconstant constraints(numlist) nolog maximize_options ]

by ... : may be used with nbreg; see [R] by.

fweights, iweights, and pweights are allowed; see [U] 14.1.6 weight.

These commands share the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

nbreg may be used with sw to perform stepwise estimation; see [R] sw.
Syntax for predict

    predict [type] newvarname [if exp] [in range] [, statistic nooffset ]

where statistic is

    n         predicted number of events (the default)
    ir        incidence rate (equivalent to predict ..., n nooffset)
    xb        linear prediction
    stdp      standard error of the prediction

In addition, relevant only after gnbreg are

    alpha     predicted values of alpha
    lnalpha   predicted values of ln(alpha)
    stdplna   standard error of predicted ln(alpha)

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

nbreg estimates a negative binomial maximum-likelihood regression of depvar on varlist, where depvar is a nonnegative count variable. In this model, the count variable is believed to be generated by a Poisson-like process, except that the variation is greater than that of a true Poisson. This extra variation is referred to as overdispersion. See [R] poisson before reading this entry.

gnbreg estimates a generalized negative binomial regression; the shape parameter alpha may also be parameterized. Persons who have panel data should see [R] xtnbreg.
Options

dispersion({mean | constant}) specifies the parameterization of the model. dispersion(mean), the default, yields a model with dispersion equal to 1 + alpha*exp(x_i*b + offset_i); that is, the dispersion is a function of the expected mean exp(x_i*b + offset_i). dispersion(constant) has dispersion equal to 1 + delta; that is, it is constant for all observations.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

irr reports estimated coefficients transformed to incidence rate ratios, i.e., e^b rather than b. Standard errors and confidence intervals are similarly transformed. This option affects how results are displayed, not how they are estimated or stored. irr may be specified at estimation or when replaying previously estimated results.
exposure(varname) and offset(varname) are different ways of specifying the same thing. exposure() specifies a variable that reflects the amount of exposure over which the depvar events were observed for each observation; ln(varname) with coefficient constrained to be 1 is entered into the log-link function. offset() specifies a variable that is to be entered directly into the log-link function with coefficient constrained to be 1, so the exposure is assumed to be e^varname.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.13 Weighted estimation.

cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients; see [U] 23.11 Obtaining robust variance estimates. cluster() implies robust; specifying robust cluster() is equivalent to typing cluster() by itself.
score(newvars) creates newvar containing u_j = dlnL_j/d(x_j*b) for each observation j in the sample. The score vector is sum_j dlnL_j/db = sum_j u_j*x_j; i.e., the product of newvar with each covariate summed over observations. If two newvars are specified, then the score from the ancillary parameter equation is also saved. See [U] 23.12 Obtaining scores.

noconstant suppresses the constant term (intercept) in the regression.
constraints(numlist) specifies by number the linear constraints to be applied during estimation. The default is to perform unconstrained estimation. Constraints are specified using the constraint command; see [R] constraint. See [R] reg3 for the use of constraints in multiple-equation contexts.

nolrtest suppresses fitting the Poisson model. Without this option, the Poisson model is fit, and its likelihood is used in a likelihood-ratio test of the alpha parameter. This option is valid only for nbreg; gnbreg has no likelihood-ratio comparison test (see the Technical Note in the section on gnbreg within this entry).

nolog suppresses the iteration log.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them, although we often recommend specifying trace.

lnalpha(varlist) is allowed only with gnbreg. If this option is not specified, gnbreg and nbreg will produce the same results because the shape parameter will be parameterized as a constant. lnalpha() allows specifying a linear equation for ln(alpha). Specifying lnalpha(male old) means ln(alpha) = a0 + a1*male + a2*old, where a0, a1, and a2 are parameters to be fitted along with the other model coefficients.
Options for predict

n, the default, calculates the predicted number of events, which is exp(x_j*b) if neither offset(varname) nor exposure(varname) was specified when the model was estimated; exp(x_j*b + offset_j) if offset(varname) was specified; or exp(x_j*b)*exposure_j if exposure(varname) was specified.

ir calculates the incidence rate exp(x_j*b), which is the predicted number of events when exposure is 1. This is equivalent to n when neither offset(varname) nor exposure(varname) was specified when the model was estimated.

xb calculates the linear prediction.

stdp calculates the standard error of the linear prediction.

alpha, lnalpha, and stdplna are relevant after gnbreg estimation only; they produce the predicted values of alpha or ln(alpha) and the standard error of the predicted ln(alpha), respectively.

nooffset is relevant only if you specified offset(varname) or exposure(varname) when you estimated the model. It modifies the calculations made by predict so that they ignore the offset or exposure variable; the linear prediction is treated as x_j*b rather than x_j*b + offset_j, and specifying predict ..., n nooffset is equivalent to specifying predict ..., ir.
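The relationship between the offset and exposure calculations can be seen with a few lines of arithmetic. This Python fragment (illustrative only, not Stata code; the numbers are invented) confirms that entering ln(exposure) into the log link with coefficient 1 is the same as multiplying the incidence rate by the exposure:

```python
import math

xb = 0.7           # a hypothetical linear prediction x_j*b
exposure = 12.5    # a hypothetical exposure for observation j

n_via_offset = math.exp(xb + math.log(exposure))   # offset(ln_exposure) route
n_via_exposure = math.exp(xb) * exposure           # exposure() route
ir = math.exp(xb)                                  # events per unit of exposure

assert math.isclose(n_via_offset, n_via_exposure)  # the two routes coincide
print(round(ir, 4))                                # ir equals n when exposure is 1
```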
Remarks

See Long (1997, chapter 8) for an introduction to the negative binomial regression model and for a discussion of other regression models for count data.

Negative binomial regression is used to estimate models of the number of occurrences (counts) of an event when the event has extra-Poisson variation; that is, it has overdispersion. The Poisson regression model is

    y_i ~ Poisson(mu_i)

where mu_i = exp(x_i*b + offset_i) for observed counts y_i with covariates x_i for the ith observation. One derivation of the negative binomial is that the individual units follow a Poisson regression model, but there is an omitted variable u_i such that e^(u_i) follows a gamma distribution with mean 1 and variance alpha:

    y_i ~ Poisson(mu_i*)

where mu_i* = exp(x_i*b + offset_i + u_i) and e^(u_i) ~ gamma(1/alpha, alpha). (Note that the scale, i.e., the second, parameter for the gamma(a, lambda) distribution is sometimes parameterized as 1/lambda; see the Methods and Formulas section for the explicit definition of the distribution.)

We refer to alpha as the overdispersion parameter. The larger alpha is, the greater the overdispersion. The Poisson model corresponds to alpha = 0. nbreg parameterizes alpha as ln(alpha). gnbreg allows ln(alpha) to be modeled as ln(alpha_i) = z_i*gamma, a linear combination of covariates z_i.

nbreg will estimate two different parameterizations of the negative binomial model. The default, described above and also given by the option dispersion(mean), has dispersion for the ith observation equal to 1 + alpha*exp(x_i*b + offset_i). The alternative parameterization, given by the option dispersion(constant), has dispersion equal to 1 + delta; i.e., it is constant for all observations. The Poisson model corresponds to delta = 0.
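The mean-dispersion claim can be checked numerically from the closed-form negative binomial probabilities given later under Methods and Formulas. In this Python sketch (our own illustration, not Stata code), a distribution with mean mu and overdispersion alpha indeed has variance mu*(1 + alpha*mu), i.e., dispersion 1 + alpha*mu:

```python
from math import lgamma, log, exp

def nb_logpmf(y, mu, alpha):
    # ln f(y) with m = 1/alpha and p = 1/(1 + alpha*mu)
    m = 1.0 / alpha
    p = 1.0 / (1.0 + alpha * mu)
    return (lgamma(m + y) - lgamma(y + 1) - lgamma(m)
            + m * log(p) + y * log(1.0 - p))

mu, alpha = 5.0, 0.5
probs = [exp(nb_logpmf(y, mu, alpha)) for y in range(2000)]  # tail is negligible
mean = sum(y * f for y, f in enumerate(probs))
var = sum((y - mean) ** 2 * f for y, f in enumerate(probs))

print(round(mean, 4), round(var, 4))   # mean = mu; variance = mu*(1 + alpha*mu)
```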
nbreg

It is not uncommon to pose a Poisson regression model and observe a lack of model fit. The following data appeared in Rodriguez (1993):

    . list

            cohort   age_mos   deaths   exposure
      1.         1       0.5      168      278.4
      2.         1       2.0       48      538.8
      3.         1       4.5       63      794.4
      4.         1       9.0       89    1,550.8
      5.         1      18.0      102    3,006.0
      6.         1      42.0       81    8,743.5
      7.         1      90.0       40   14,270.0
      8.         2       0.5      197      403.2
      9.         2       2.0       48      786.0
     10.         2       4.5       62    1,165.3
     11.         2       9.0       81    2,294.8
     12.         2      18.0       97    4,500.5
     13.         2      42.0      103   13,201.5
     14.         2      90.0       39   19,525.0
     15.         3       0.5      195      495.3
     16.         3       2.0       55      956.7
     17.         3       4.5       58    1,381.4
     18.         3       9.0       85    2,604.5
     19.         3      18.0       87    4,618.5
     20.         3      42.0       70    9,814.5
     21.         3      90.0       10    5,802.5
    . gen logexp = ln(exposure)
    . quietly tab cohort, gen(coh)
    . poisson deaths coh2 coh3, offset(logexp)

    Iteration 0:   log likelihood = -2160.0544
    Iteration 1:   log likelihood = -2159.5182
    Iteration 2:   log likelihood = -2159.5159
    Iteration 3:   log likelihood = -2159.5159

    Poisson regression                                Number of obs   =        21
                                                      LR chi2(2)      =     49.16
                                                      Prob > chi2     =    0.0000
    Log likelihood = -2159.5159                       Pseudo R2       =    0.0113
T
_
nbreg-
I
deaths
Coef.
Std. Err.
z
Negative binomial regression
P>Izl
[95_, Conf.
387
Interval]
, J
I
coh2
-. 3020405
.0573319
coh3
.0742143
.0589726
-3.899488 (offset)
.0411345
_cons logexp
-5.27 1,26 -94.80
0.000
-. 4144089
-. 1896721
O. 208
-. 0413698
.1897983
O. 000
-3.
98011
-3.818866
. _oisgof Goodness-of-fit Prob
chi2
> chi2(18)
=
4190.689
=
0.0000
The extreme significance of the goodness-of-fit chi-squared indicates that the Poisson regression model is inappropriate, suggesting to us that we should try a negative binomial model:
    . nbreg deaths coh2 coh3, offset(logexp) nolog

    Negative binomial regression                      Number of obs   =        21
                                                      LR chi2(2)      =      0.40
                                                      Prob > chi2     =    0.8171
    Log likelihood = -131.3799                        Pseudo R2       =    0.0015

    ------------------------------------------------------------------------------
          deaths |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            coh2 |  -.2676187   .7237203    -0.37   0.712    -1.686084    1.150847
            coh3 |  -.4575957   .7236651    -0.63   0.527    -1.875753    .9609618
           _cons |  -2.086731    .511856    -4.08   0.000     -3.08995   -1.083511
          logexp |   (offset)
    -------------+----------------------------------------------------------------
        /lnalpha |   .5939963   .2583615                      .0876171    1.100376
    -------------+----------------------------------------------------------------
           alpha |   1.811212   .4679475                       1.09157    3.005295
    ------------------------------------------------------------------------------
    Likelihood ratio test of alpha=0:  chibar2(01) = 4056.27  Prob>=chibar2 = 0.000
Our original Poisson model is a special case of the negative binomial; it corresponds to alpha = 0. nbreg, however, estimates alpha indirectly, estimating instead ln(alpha). In our model, ln(alpha) = 0.594, meaning that alpha = 1.81 (nbreg undoes the transformation for us at the bottom of the output).

In order to test alpha = 0, nbreg performs a likelihood-ratio test. The staggering chi-squared value of 4056.27 asserts that the probability that we would observe these data conditional on alpha = 0, i.e., conditional on the process being Poisson, is virtually zero. The data are not Poisson. It is not accidental that this chi-squared value is quite close to the goodness-of-fit statistic from the Poisson regression itself.
Technical Note

The usual Gaussian test of alpha = 0 is omitted since this test occurs on the boundary, invalidating the usual theory associated with such tests. However, the likelihood-ratio test of alpha = 0 has been modified to be valid on the boundary. In particular, the null distribution of the likelihood-ratio test statistic is not the usual chi-squared with 1 degree of freedom, but rather a 50:50 mixture of a chi-squared(0) (point mass at zero) and a chi-squared(1), denoted as chibar2(01). See Gutierrez et al. (2001) for more details.
[] Technical Note v,,,,
...,._
_ _egatwe olnomla! regression
The negative binomial model deals with cases where there is more variation than would be expected were the process Poisson. The negative binomial model is not helpful if there is less than Poisson
i
Poisson models arise because of independently generated events. Overdispersion comes about if some of the parameters (causes)of of Poisson areitsunknown. obtain underdispersiom the variation--if the variance the the count variableprocesses is less than mean. ButTounderdispersion is uncommon. sequence of events would have to somehow be regulated; that is, events would not be independent, but controlled based on past occurrences. []
gnbreg

gnbreg is a generalization of nbreg. Whereas in nbreg a single ln(alpha) is estimated, gnbreg allows ln(alpha) to vary observation by observation as a linear combination of another set of covariates: ln(alpha_i) = z_i*gamma.

We will assume that the number of deaths is a function of age, whereas the ln(alpha) parameter is a function of cohort. To estimate the model, we type
    . gnbreg deaths age_mos, lnalpha(coh2 coh3) offset(logexp)

    Fitting constant-only model:

    Iteration 0:   log likelihood =   -187.067
    Iteration 1:   log likelihood = -148.64462
    Iteration 2:   log likelihood = -132.49595
    Iteration 3:   log likelihood = -131.59338
    Iteration 4:   log likelihood = -131.57949
    Iteration 5:   log likelihood = -131.57948

    Fitting full model:

    Iteration 0:   log likelihood = -124.34327
    Iteration 1:   log likelihood = -117.72418
    Iteration 2:   log likelihood = -117.56349
    Iteration 3:   log likelihood = -117.56164
    Iteration 4:   log likelihood = -117.56164

    Generalized negative binomial regression          Number of obs   =        21
                                                      LR chi2(1)      =     28.04
                                                      Prob > chi2     =    0.0000
    Log likelihood = -117.56164                       Pseudo R2       =    0.1065

    ------------------------------------------------------------------------------
          deaths |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    deaths       |
         age_mos |  -.0516657   .0051747    -9.98   0.000     -.061808   -.0415233
           _cons |  -1.867225   .2227944    -8.38   0.000    -2.303894   -1.430556
          logexp |   (offset)
    -------------+----------------------------------------------------------------
    lnalpha      |
            coh2 |   .0939546   .7187747     0.13   0.896    -1.314818    1.502727
            coh3 |   .0815279   .7365477     0.11   0.912    -1.362079    1.525135
           _cons |  -.4759581   .5156502    -0.92   0.356    -1.486614    .5346978
    ------------------------------------------------------------------------------
We find that age is a significant determinant of the number of deaths. The standard errors for the variables in the ln(alpha) equation suggest that the overdispersion parameter does not vary across cohorts. We can test this by typing

    . test coh2 coh3

     ( 1)  [lnalpha]coh2 = 0.0
     ( 2)  [lnalpha]coh3 = 0.0

               chi2(  2) =    0.02
             Prob > chi2 =    0.9904

There is no evidence of variation by cohort in these data.
Technical Note

Note the intentional absence of a likelihood-ratio test for alpha = 0 in gnbreg. The test is affected by the same boundary condition that affects the comparison test in nbreg; however, when alpha is parameterized by more than a constant term, the null distribution becomes intractable. For this reason, we recommend using nbreg to test for overdispersion and, if overdispersion exists, only then modeling the overdispersion using gnbreg.
Predicted values

After nbreg and gnbreg, predict returns the predicted number of events:

    . nbreg deaths coh2 coh3, nolog

    Negative binomial regression                      Number of obs   =        21
                                                      LR chi2(2)      =      0.14
                                                      Prob > chi2     =    0.9307
    Log likelihood = -108.48841                       Pseudo R2       =    0.0007

    ------------------------------------------------------------------------------
          deaths |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            coh2 |   .0591305   .2978419     0.20   0.843    -.5246289      .64289
            coh3 |  -.0538792   .2981621    -0.18   0.857    -.6382662    .5305077
           _cons |   4.435906   .2107213    21.05   0.000       4.0229    4.848912
    -------------+----------------------------------------------------------------
        /lnalpha |  -1.207379   .3108622                     -1.816657   -.5980999
    -------------+----------------------------------------------------------------
           alpha |     .29898   .0929416                      .1625683    .5498555
    ------------------------------------------------------------------------------
    Likelihood ratio test of alpha=0:  chibar2(01) = 434.62  Prob>=chibar2 = 0.000

    . predict count
    (option n assumed; predicted number of events)

    . summarize deaths count

        Variable |     Obs        Mean    Std. Dev.       Min        Max
    -------------+------------------------------------------------------
          deaths |      21    84.66667    48.84192         10        197
           count |      21    84.66667     4.00773         80   89.57143
Saved Results

nbreg and gnbreg save in e():

Scalars
    e(N)          number of observations
    e(k)          number of variables
    e(k_eq)       number of equations
    e(k_dv)       number of dependent variables
    e(df_m)       model degrees of freedom
    e(r2_p)       pseudo R-squared
    e(ll)         log likelihood
    e(ll_0)       log likelihood, constant-only model
    e(ll_c)       log likelihood, comparison model
    e(N_clust)    number of clusters
    e(rc)         return code
    e(chi2)       chi-squared
    e(chi2_c)     chi-squared for comparison test
    e(p)          significance
    e(ic)         number of iterations
    e(rank)       rank of e(V)
    e(rank0)      rank of e(V) for constant-only model
    e(alpha)      the value of alpha

Macros
    e(cmd)        nbreg or gnbreg
    e(depvar)     name of dependent variable
    e(title)      title in estimation output
    e(wtype)      weight type
    e(wexp)       weight expression
    e(clustvar)   name of cluster variable
    e(offset#)    offset for equation #
    e(dispers)    mean or constant
    e(chi2type)   Wald or LR; type of model chi-squared test
    e(chi2_ct)    Wald or LR; type of model chi-squared test corresponding to e(chi2_c)
    e(opt)        type of optimization
    e(user)       name of likelihood-evaluator program
    e(vcetype)    covariance estimation method
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators
    e(ilog)       iteration log (up to 20 iterations)

Functions
    e(sample)     marks estimation sample
Methods and Formulas

nbreg and gnbreg are implemented as ado-files.

See [R] poisson and Feller (1968, 156-164) for an introduction to the Poisson distribution.

A negative binomial distribution can be regarded as a gamma mixture of Poisson random variables. The number of times something occurs, y_i, is distributed as Poisson(nu_i*mu_i). That is, its conditional likelihood is

    f(y_i | nu_i) = (nu_i*mu_i)^(y_i) e^(-nu_i*mu_i) / Gamma(y_i + 1)

where mu_i = exp(x_i*b + offset_i) and nu_i is an unobserved parameter with a gamma(1/alpha, 1/alpha) density:

    g(nu) = nu^((1-alpha)/alpha) e^(-nu/alpha) / {alpha^(1/alpha) Gamma(1/alpha)}

This gamma distribution has mean 1 and variance alpha, where alpha is our ancillary parameter. (Note that the scale, i.e., the second, parameter for the gamma(a, lambda) distribution is sometimes parameterized as 1/lambda; the above density defines how it has been parameterized here.)

The unconditional likelihood for the ith observation is therefore

    f(y_i) = integral from 0 to infinity of f(y_i | nu) g(nu) dnu
           = {Gamma(m + y_i) / (Gamma(y_i + 1) Gamma(m))} p_i^m (1 - p_i)^(y_i)

where p_i = 1/(1 + alpha*mu_i) and m = 1/alpha. Solutions for alpha are handled by searching for ln(alpha), since alpha is required to be greater than zero.

The scores and log likelihood (with weights w_i and offsets) are given by

    psi(z)  = digamma function evaluated at z
    psi'(z) = trigamma function evaluated at z
    alpha = exp(tau)     m = 1/alpha
    p_i = 1/(1 + alpha*mu_i)     mu_i = exp(x_i*b + offset_i)

    lnL = sum_{i=1}^{n} w_i { ln Gamma(m + y_i) - ln Gamma(y_i + 1) - ln Gamma(m)
                              + m*ln(p_i) + y_i*ln(1 - p_i) }

    score(b)_i   = p_i (y_i - mu_i)
    score(tau)_i = -m { alpha*(mu_i - y_i)/(1 + alpha*mu_i) - ln(1 + alpha*mu_i)
                        + psi(y_i + m) - psi(m) }

In the case of gnbreg, alpha is allowed to vary across the observations according to the parameterization ln(alpha_i) = z_i*gamma.

Maximization for gnbreg is via the linear-form method and for nbreg is via the d2 method described in [R] ml.
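The gamma-mixture integral above can be checked numerically. This Python sketch (ours, not part of Stata; the values of y, mu, and alpha are arbitrary test numbers) integrates f(y|nu)g(nu) on a grid and compares the result with the closed-form expression:

```python
from math import lgamma, log, exp

y, mu, alpha = 2, 3.0, 0.5
m = 1.0 / alpha                  # m = 1/alpha
p = 1.0 / (1.0 + alpha * mu)     # p = 1/(1 + alpha*mu)

def f_cond(nu):
    # Poisson(nu*mu) probability mass at y
    lam = nu * mu
    return exp(y * log(lam) - lam - lgamma(y + 1))

def g(nu):
    # gamma(1/alpha, 1/alpha) density from the text; mean 1, variance alpha
    return exp((m - 1) * log(nu) - nu / alpha - m * log(alpha) - lgamma(m))

# left-rectangle integration over nu (fine grid, truncated tail)
h, total, nu = 1e-4, 0.0, 1e-4
while nu < 40.0:
    total += f_cond(nu) * g(nu) * h
    nu += h

closed = exp(lgamma(m + y) - lgamma(y + 1) - lgamma(m)) * p ** m * (1 - p) ** y
print(round(total, 4), round(closed, 4))   # the two values agree
```

For these particular values the closed form works out exactly to 108/625 = 0.1728, which the numeric integral reproduces to grid accuracy.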
References

Cameron, A. C. and P. K. Trivedi. 1998. Regression Analysis of Count Data. Cambridge: Cambridge University Press.

Feller, W. 1968. An Introduction to Probability Theory and Its Applications, vol. 1. 3d ed. New York: John Wiley & Sons.

Gutierrez, R. G., S. L. Carter, and D. M. Drukker. 2001. On boundary-value likelihood ratio tests. Stata Technical Bulletin, forthcoming.

Hilbe, J. 1998. sg91: Robust variance estimators for MLE Poisson and negative binomial regression. Stata Technical Bulletin 45: 26-28. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 177-180.

Hilbe, J. 1999. sg102: Zero-truncated Poisson and negative binomial regression. Stata Technical Bulletin 47: 37-40. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 233-236.

Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.

Rodriguez, G. 1993. sbe10: An improvement to poisson. Stata Technical Bulletin 11: 11-14. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 94-98.

Rogers, W. H. 1991. sbe1: Poisson regression with rates. Stata Technical Bulletin 1: 11-12. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 62-64.

Rogers, W. H. 1993. sg16.4: Comparison of nbreg and glm for negative binomial. Stata Technical Bulletin 16: 7. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 82-84.
Also See

Complementary:   [R] adjust, [R] constraint, [R] lincom, [R] linktest, [R] lrtest, [R] mfx,
                 [R] predict, [R] sw, [R] test, [R] testnl, [R] vce, [R] xi

Related:         [R] glm, [R] poisson, [R] xtnbreg, [R] zip

Background:      [U] 16.5 Accessing coefficients and standard errors,
                 [U] 23 Estimation and post-estimation commands,
                 [U] 23.11 Obtaining robust variance estimates,
                 [U] 23.12 Obtaining scores,
                 [R] maximize
Title

    net -- Install and manage user-written additions from the net

Syntax

    net from directory_or_url
    net cd path_or_url
    net link linkname
    net search keywords                      (see [R] net search)
    net describe pkgname
    net set ado dirname
    net set other dirname
    net query
    net install pkgname [, all replace]
    net get pkgname [, all replace]

    ado [, find(string) from(dirname)]
    ado dir [, find(string) from(dirname)]
    ado describe [pkgid] [, find(string) from(dirname)]
    ado uninstall pkgid [, from(dirname)]

where
    pkgname is the name of a package
    pkgid   is the name of a package or a number in square brackets: [#]
    dirname is a directory name or PERSONAL or STBPLUS (default) or SITE
Description

net fetches and installs additions to Stata. The additions can be obtained from the Internet or from media. The additions can be ado-files (new commands), help files, or even datasets. Collections of files are bound together into packages. For instance, the package named zz49 might add the xyz command to Stata. At a minimum, such a package would contain xyz.ado, the code to implement the new command, and xyz.hlp, the on-line help to describe it. That the package contains two files is a detail: you use net to fetch the package zz49 regardless of how many files there are.

ado manages the packages you have installed using net. The ado command allows you to list packages you have previously installed and to uninstall them.

Users can also access the net and ado features by pulling down Help and selecting STB and User-written Programs.
Options

all is used with net install and net get. Typing it with either one makes the command equivalent to typing net install followed by net get.

replace is for use with net install and net get. It specifies that the fetched files are to replace existing files if any of the files already exist.
find(string) is for use with ado, ado dir, and ado describe. It specifies that the descriptions of the packages installed on your computer are to be searched and that the package descriptions containing string are to be listed.

from(dirname) is for use with ado. It specifies where the packages are installed. The default is from(STBPLUS). STBPLUS is a codeword that Stata understands to correspond to a particular directory on your computer that was set at installation time. On Windows computers, STBPLUS probably means the directory c:\ado\stbplus, but it might mean something else. You can find out what it means by typing sysdir, but this is irrelevant if you use the defaults.
Remarks

For an introduction to using net and ado, see [U] 32 Using the Internet to keep up to date. The purpose of this documentation is

1. To briefly but accurately describe net and ado and all their features.

2. To provide documentation to those who wish to set up their own sites to distribute additions to Stata.

Remarks are presented under the headings

    Definition of a package
    The purpose of the net and ado commands
    Content pages
    Package-description pages
    Where packages are installed
    A summary of the net command
    A summary of the ado command
    Relationship of net and ado to the point-and-click interface
    Creating your own site
    Format of content and package-description files
    Example 1
    Example 2
    Metacharacters in content and package-description files
    Error-free file delivery
Definition of a package

A package is a collection of files, typically .ado and .hlp files, that provides a new feature in Stata. Packages contain additions that you wish were part of Stata at the outset. We write such additions, and so do other users.

One source of these additions is the Stata Technical Bulletin (STB). The STB is a printed and electronic journal with corresponding software. If you want the journal, you must subscribe, but the software is available for free from our web site. If you do not have Internet access, you may purchase the STB media from StataCorp.
The purpose of the net and ado commands

The purpose of the net command is to make distribution and installation of packages easy. The goal is to get you quickly to a package-description page that summarizes the addition:

    . net describe rte_stat

    package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer/
    ---------------------------------------------------------------------------
    TITLE
          rte_stat.  The robust-to-everything statistic; update.

    DESCRIPTION/AUTHOR(S)
          S. Gazer, Dept. of Applied Theoretical Mathematics, WMIUAWG Univ.
          Aleph-0 100% confidence intervals proved too conservative for some
          applications; Aleph-1 confidence intervals have been substituted.
          The new robust-to-everything supplants the previous robust-to-
          everything-conceivable statistic.  See "Inference in the absence
          of data" (forthcoming).  After installation, see help rte.

    INSTALLATION FILES                             (type net install rte_stat)
          rte.ado
          rte.hlp
          nullset.ado
          random.ado
    ---------------------------------------------------------------------------

Should you decide the rte_stat addition might prove useful, net makes the installation easy:

    . net install rte_stat
    checking rte_stat consistency and verifying not already installed...
    installing into c:\ado\stbplus\ ...
    installation complete.

The purpose of the ado command is to help you manage packages installed with net. Perhaps you remember that you installed a package that calculates the robust-to-everything statistic but cannot remember the command's name. You could use ado to search what you have previously installed for the rte command:

    . ado

    [1]  package sg146 from http://www.stata.com/stb/stb56
         STB-56 sg146.  Scalar measures of fit for regression models.
      (output omitted)
    [15] package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer
         rte_stat.  The robust-to-everything statistic; update.
      (output omitted)
    [21] package sg121 from http://www.stata.com/stb/stb52
         STB-52 sg121.  Seemingly unrelated est. & cluster-adjusted sandwich est.

or you might type

    . ado, find("robust-to-everything")
    [15] package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer
         rte_stat.  The robust-to-everything statistic; update.

Perhaps you decide that rte, despite the author's claims, is not worth the disk space it occupies. You can use ado to erase it:

    . ado uninstall rte_stat
    package rte_stat from http://www.wemakeitupaswego.edu/faculty/sgazer
          rte_stat.  The robust-to-everything statistic; update.
    (package uninstalled)
Example

newey, lag(0) is equivalent to regress, robust:

. regress price weight displ, robust

Regression with robust standard errors              Number of obs =        74
                                                    F(  2,    71) =     14.44
                                                    Prob > F      =    0.0000
                                                    R-squared     =    0.2909
                                                    Root MSE      =    2518.4

------------------------------------------------------------------------------
             |               Robust
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   1.823366   .7808755     2.34   0.022     .2663445    3.380387
displacement |   2.087054   7.436967     0.28   0.780    -12.74184    16.91595
       _cons |    247.907   1129.602     0.22   0.827    -2004.455    2500.269
------------------------------------------------------------------------------
. newey price weight displ, lag(0)

Regression with Newey-West standard errors          Number of obs =        74
maximum lag : 0                                     F(  2,    71) =     14.44
                                                    Prob > F      =    0.0000

------------------------------------------------------------------------------
             |             Newey-West
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   1.823366   .7808755     2.34   0.022     .2663445    3.380387
displacement |   2.087054   7.436967     0.28   0.780    -12.74184    16.91595
       _cons |    247.907   1129.602     0.22   0.827    -2004.455    2500.269
------------------------------------------------------------------------------
Example

You have time-series measurements on variables usr and idle and now wish to estimate an OLS model, but obtain Newey-West standard errors allowing for a lag of up to 3:

. newey usr idle, lag(3) t(time)

Regression with Newey-West standard errors          Number of obs =        30
maximum lag : 3                                     F(  1,    28) =     10.90
                                                    Prob > F      =    0.0026

------------------------------------------------------------------------------
             |             Newey-West
         usr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        idle |  -.2281501   .0690927    -3.30   0.003    -.3696801     -.08662
       _cons |   23.13483   6.327031     3.66   0.001     10.17449    36.09516
------------------------------------------------------------------------------
Saved Results

newey saves in e():

Scalars
    e(N)          number of observations
    e(df_m)       model degrees of freedom
    e(df_r)       residual degrees of freedom
    e(F)          F statistic
    e(lag)        maximum lag

Macros
    e(cmd)        newey
    e(depvar)     name of dependent variable
    e(wtype)      weight type
    e(wexp)       weight expression
    e(vcetype)    covariance estimation method
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators

Functions
    e(sample)     marks estimation sample
Methods and Formulas

newey is implemented as an ado-file.

newey calculates the estimates

    β̂_OLS = (X'X)⁻¹X'y

    Var(β̂_OLS) = (X'X)⁻¹ X'Ω̂X (X'X)⁻¹

That is, the coefficient estimates are simply those of OLS linear regression. For the case of lag(0) (no autocorrelation), the variance estimates are calculated using the White formulation:

    X'Ω̂X = X'Ω̂₀X = n/(n−k) Σᵢ êᵢ² xᵢ'xᵢ

Here êᵢ = yᵢ − xᵢβ̂_OLS, where xᵢ is the ith row of the X matrix, n is the number of observations, and k is the number of predictors in the model, including the constant if there is one. Note that the above formula is exactly the same as that used by regress, robust with the regression-like formula (the default) for the multiplier q_c; see the Methods and Formulas section of [R] regress.
For the case of lag(#), # > 0, the variance estimates are calculated using the Newey-West (1987) formulation

    X'Ω̂X = X'Ω̂₀X + n/(n−k) Σ_{l=1}^{m} (1 − l/(m+1)) Σ_{t=l+1}^{n} ê_t ê_{t−l} (x_t'x_{t−l} + x_{t−l}'x_t)

where m = # is the maximum lag.
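These formulas can be checked numerically. The following Python sketch (an illustration with our own function names, not Stata's ado-code) builds the lag(m) Newey-West covariance matrix and reduces to the White/robust estimator when m = 0, mirroring the equivalence of newey, lag(0) and regress, robust shown in the first example:

```python
import numpy as np

def newey_west_vcov(X, y, m):
    """Newey-West VCE with maximum lag m; m = 0 gives the White (robust)
    estimator.  Includes the small-sample multiplier n/(n-k), as newey does."""
    n, k = X.shape
    beta = np.linalg.solve(X.T @ X, X.T @ y)        # OLS coefficients
    e = y - X @ beta                                # residuals e_t
    M = (X * (e**2)[:, None]).T @ X                 # lag-0 (White) middle term
    for l in range(1, m + 1):                       # weighted autocovariances
        w = 1 - l / (m + 1)
        for t in range(l, n):
            xt = X[t][:, None]
            xtl = X[t - l][:, None]
            M += w * e[t] * e[t - l] * (xt @ xtl.T + xtl @ xt.T)
    M *= n / (n - k)
    XtX_inv = np.linalg.inv(X.T @ X)
    return XtX_inv @ M @ XtX_inv                    # sandwich estimator
```

With m = 0 the loop never runs and the result is exactly the White formula above.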
References

Hardin, J. W. 1997. sg72: Newey-West standard errors for probit, logit, and poisson models. Stata Technical Bulletin 39: 32-35. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 182-186.

Newey, W. and K. West. 1987. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55: 703-708.

White, H. 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48: 817-838.
Also See

Complementary:  [R] adjust, [R] lincom, [R] linktest, [R] mfx, [R] test, [R] testnl, [R] vce
Related:        [R] regress, [R] svy estimators, [R] xtgls, [R] xtpcse
Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands
Title

news -- Report Stata news

Syntax

    news
Description

news displays a brief listing of recent news and information of interest to Stata users. It obtains this information directly from Stata's web site. The news command requires your computer to be connected to the Internet. An error message will be displayed if the connection to Stata's web site is unsuccessful. You may also execute the news command by selecting News from the Help menu.
Remarks

news provides an easy way of displaying a brief list of the latest Stata news. More details and other items of interest are available at Stata's web site; point your browser to http://www.stata.com. Here is an example of what news produces:

. news

---------------------------  StataCorp News  ---------------------------

* Intercooled Stata for Windows 2002 release
  (projected to be available Aug 1, 2002)

* STB-68 (July 23, 2001) is now available; use the net command to download

* NetCourse 151: "Introduction to Stata Programming" begins next month

* Proceedings of the 8th London Stata User Group Meeting will be
  available the first day of next month

For additional information on these topics point your web browser to:
    http://www.stata.com
-------------------------------------------------------------------------
In this case news indicates that there is a new STB available. Users can click on STB and User-written Programs from the Help menu to download STB files. Alternatively, the net command (see [R] net) can be used.
Also See

Related:     [R] net
Background:  [U] 32 Using the Internet to keep up to date
Title

nl -- Nonlinear least squares

Syntax

    nl fcn depvar [varlist] [weight] [if exp] [in range] [, level(#) init(...)
        lnlsq(#) leave eps(#) nolog trace iterate(#) delta(#) fcn_options ]

    nlinit # parameter_list

by ...: may be used with nl; see [R] by.

aweights and fweights are allowed; see [U] 14.1.6 weight.

nl shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.
Syntax for predict

    predict [type] newvarname [if exp] [in range] [, { yhat | residuals } ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.

Description

nl fits an arbitrary nonlinear function to the dependent variable depvar by least squares. You provide the function itself in a separate program with a name of your choosing, except that the first two letters of the name must be nl. fcn refers to the name of the function without the first two letters. For example, you type nl nexpgr ... to estimate with the function defined in the program nlnexpgr.

nlinit is useful when writing nlfcns.
Options

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

init(...) specifies initial values for parameters that are to be used to override the default initial values. Examples are provided below.

lnlsq(#) fits the model defined by nlfcn using "log least squares", defined as least squares with shifted lognormal errors. In other words, ln(depvar − #) is assumed normally distributed. Sums of squares and deviance are adjusted to the same scale as depvar.

leave leaves behind after estimation a set of new variables with the same names as the estimated parameters containing the derivative of E(y) with respect to the parameter.

eps(#) specifies the convergence criterion for successive parameter estimates and for the residual sum of squares. eps(1e-5) is the default.

nolog suppresses the iteration log.

trace expands the iteration log to provide more details, including values of the parameters at each step of the process.

iterate(#) specifies the maximum number of iterations before giving up and defaults to 100.

delta(#) specifies the relative change in a parameter to be used in computing the numeric derivatives. The derivative for parameter βᵢ is computed as {f(X, β₁, β₂, ..., βᵢ + d, βᵢ₊₁, ...) − f(X, β₁, β₂, ..., βᵢ, βᵢ₊₁, ...)}/d, where d = δ(βᵢ + δ). The default δ is 4e-7.

fcn_options refer to any options allowed by nlfcn.

Options for predict

yhat, the default, calculates the predicted value of depvar.

residuals calculates the residuals.
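For intuition, the forward-difference rule that delta() controls can be sketched in a few lines of Python (an illustration with made-up names, not nl's implementation):

```python
def numeric_deriv(f, beta, i, delta=4e-7):
    """Forward-difference derivative of f with respect to beta[i], using
    step d = delta*(beta[i] + delta), as described for the delta() option."""
    d = delta * (beta[i] + delta)
    bumped = list(beta)
    bumped[i] += d
    return (f(bumped) - f(beta)) / d
```

Because the step is proportional to the parameter's magnitude, parameters on very different scales each receive a sensibly sized perturbation.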
Remarks

Remarks are presented under the headings

    nlfcns
    Some common nlfcns
    Log-normal errors
    Weights
    Errors
    General comments on fitting nonlinear models
    More on nlfcns

nl fits an arbitrary nonlinear function to the dependent variable depvar by least squares. The specific function is specified by writing an nlfcn, described below. The values to be fitted in the function are called the parameters. The fitting process is iterative (modified Gauss-Newton). It starts with a set of initial values for the parameters (guesses as to what the values will be and which you also supply) and finds another set of values that fit the function even better. Those are then used as a starting point and another improvement is found, and the process continues until no further improvement is possible.
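As an aside, the Gauss-Newton idea can be illustrated outside Stata. The toy Python sketch below (our own simplified code, without nl's step-halving and convergence refinements) iterates Gauss-Newton steps for the negative-exponential growth curve used in the examples that follow:

```python
import numpy as np

def gauss_newton(x, y, b0, b1, iters=25):
    """Toy Gauss-Newton for y = b0*(1 - exp(-b1*x)); returns refined (b0, b1)."""
    b = np.array([b0, b1], dtype=float)
    for _ in range(iters):
        f = b[0] * (1 - np.exp(-b[1] * x))                  # current predictions
        J = np.column_stack([1 - np.exp(-b[1] * x),         # df/db0
                             b[0] * x * np.exp(-b[1] * x)]) # df/db1
        step, *_ = np.linalg.lstsq(J, y - f, rcond=None)    # least-squares update
        b += step
        if np.max(np.abs(step)) < 1e-12:                    # no further improvement
            break
    return b
```

Each iteration solves a linear least-squares problem in the Jacobian, exactly the "find another set of values that fit even better" step described above.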
nlfcns

nl uses the function defined by nlfcn. nlfcn has two purposes: to identify the parameters of the problem and set default initial values, and to evaluate the function for a given set of parameter estimates.
Example

You have variables y and x in your data and wish to fit a negative-exponential growth curve with parameters B0 and B1:

    Y = B0 (1 − e^{−B1 x})

First, you write a program to calculate the predicted values:

program define nlnexpgr
        version 7.0
        if "`1'" == "?" {                       /* if query call          */
                global S_1 "B0 B1"              /* declare parameters     */
                global B0=1                     /*   and initialize them  */
                global B1=.1
                exit
        }
        replace `1'=$B0*(1-exp(-$B1*x))         /* otherwise, calculate function */
end
To estimate the model, you type nl nexpgr y. nl's first argument specifies the name of the function, although you do not type the nl prefix. You type nexpgr, meaning the function is nlnexpgr. nl's second argument specifies the name of the dependent variable. Replicating the example in the SAS manual (1985, 588-590):

. use sasxmpl1
. nl nexpgr y
(obs = 20)
Iteration 0:  residual SS = .1999027
Iteration 1:  residual SS = .0026064
Iteration 2:  residual SS = .0005769
Iteration 3:  residual SS = .0005768

      Source |       SS       df       MS          Number of obs =         20
-------------+------------------------------      F(  2,    18) =  275732.74
       Model |  17.6717234     2  8.83586172      Prob > F      =     0.0000
    Residual |  .000576801    18  .000032045      R-squared     =     1.0000
-------------+------------------------------      Adj R-squared =     1.0000
       Total |  17.6723003    20  .883615013      Root MSE      =   .0056608
                                                  Res. dev.     =   -152.317

(nexpgr)
           y |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
          B0 |   .9961886   .0016138   617.29   0.000     .9927981     .9995791
          B1 |   .0419539   .0003982   105.35   0.000     .0411172     .0427905
-------------+-----------------------------------------------------------------
(SEs, P values, CIs, and correlations are asymptotic approximations)
Notice that the initial values of the parameters were provided in the nlnexpgr program. You can, however, override these initial values on the nl command line. To estimate the model using .5 for the initial value of B0 rather than 1, you can type nl nexpgr y, init(B0=.5). To also change the initial value of B1 from .1 to .2, you type nl nexpgr y, init(B0=.5, B1=.2).

The outline of all nlfcns is the same:

program define nlfcn
        version 7.0
        if "`1'" == "?" {
                global S_1 "parameter names"
                (initialize parameters)
                exit
        }
        replace `1' = ...
end
On a query call, indicated by `1' being "?", the nlfcn is to place the names of the parameters in the global macro S_1 and initialize the parameters. Parameters are stored as macros, so if nlfcn declares that the parameters are A, B, and C (via global S_1 "A B C"), it must then place initial values in the corresponding parameter macros A, B, and C (via global A=0, global B=1, etc.). After initializing the parameter macros, it is done.

On a calculation call, `1' does not contain "?"; it instead contains the name of a variable that is to be filled in with the predicted values. The current values of the parameters are stored in the macros previously declared on the query call (e.g., $A, $B, and $C).
Example

You wish to fit the CES production function defined by

    lnq = B0 + A ln{D l^R + (1 − D) k^R}

where the parameters to be estimated are B0, A, D, and R. q, l, and k refer to total output and labor and capital inputs. In your data, you have the variables lnq, labor, and capital. The nlfcn is

program define nlces
        version 7.0
        if "`1'" == "?" {
                global S_1 "B0 A D R"
                global B0 = 1
                global A = -1
                global D = .5
                global R = -1
                exit
        }
        replace `1'=$B0 + $A*ln($D*labor^$R + (1-$D)*capital^$R)
end
Again using data from the SAS manual (1985, 591-592):

. use sasxmpl2
. nl ces lnq
(obs = 30)
Iteration 0:   residual SS = 37.09651
Iteration 1:   residual SS = 35.48655
Iteration 2:   residual SS = 22.69058
Iteration 3:   residual SS = 1.845468
 (output omitted)
Iteration 20:  residual SS = 1.761039
Iteration 21:  residual SS = 1.761039

      Source |       SS       df       MS          Number of obs =         30
-------------+------------------------------      F(  3,    26) =     292.96
       Model |  59.5286148     3  19.8428718      Prob > F      =     0.0000
    Residual |  1.76103929    26   .06773228      R-squared     =     0.9713
-------------+------------------------------      Adj R-squared =     0.9680
       Total |  61.2896541    29  2.11343635      Root MSE      =   .2602543
                                                  Res. dev.     =   .0775147

(ces)
         lnq |      Coef.   Std. Err.       t    P>|t|     [95% Conf. Interval]
-------------+-----------------------------------------------------------------
         B0* |   .1244892   .0783443     1.59   0.124    -.0365497     .2855282
           A |  -.3362823   .2721671    -1.24   0.228    -.8957298     .2231652
           D |   .3366722   .1361172     2.47   0.020     .0568793     .6164652
           R |  -3.011121   2.323575    -1.30   0.206    -7.787297     1.765055
-------------+-----------------------------------------------------------------
* Parameter taken as constant term in model & ANOVA table
(SEs, P values, CIs, and correlations are asymptotic approximations)
If the nonlinear model contains a constant term, nl will find it and indicate its presence by placing an asterisk next to the parameter name when displaying results. In the output above, B0 is a constant. (nl determines that a parameter B0 is a constant term because the partial derivative f = ∂E(y)/∂B0 has a coefficient of variation (s.d./mean) less than eps(). Usually, f = 1 for a constant, as it does in this case.)

The output of nl closely mimics that of regress; see [R] regress. The model F test, R-squared, sums of squares, etc., are calculated as regress calculates them, which means that they are corrected for the mean. If no "constant" is present, as was the case in the negative-exponential growth example previously, the usual caveats apply to the interpretation of the F and R-squared statistics; see comments and references in Goldstein (1992).

When making its calculations, nl creates the partial derivative variables for all the parameters, giving each the same name as the corresponding parameter. Unless you specify leave, these are discarded when nl completes the estimation. Therefore, your data must not have data variables that have the same names as parameters. We recommend using uppercased names for parameters and lowercased names (as is common) for variables.

After estimating with nl, typing nl by itself will redisplay previous estimates. Typing correlate, _coef will show the asymptotic correlation matrix of the parameters, and typing predict myvar will create new variable myvar containing the predicted values. Typing predict res, resid will create res containing the residuals.

nlfcns have a number of additional features that are described in More on nlfcns below.
Some common nlfcns

An important feature of nl, in addition to estimating arbitrary nonlinear regressions, is the facility for adding prewritten common fcns.

Three fcns are provided for exponential regression with one asymptote:

    exp3     Y = b0 + b1 b2^X
    exp2     Y = b1 b2^X
    exp2a    Y = b1 (1 − b2^X)

For instance, typing nl exp3 ras dvl estimates the three-parameter exponential model (parameters b0, b1, and b2) using Y = ras and X = dvl.

Two fcns are provided for the logistic function (symmetric sigmoid shape; not to be confused with logistic regression):

    log4     Y = b0 + b1/[1 + exp{−b2(X − b3)}]
    log3     Y = b1/[1 + exp{−b2(X − b3)}]

Finally, two fcns are provided for the Gompertz function (asymmetric sigmoid shape):

    gom4     Y = b0 + b1 exp[−exp{−b2(X − b3)}]
    gom3     Y = b1 exp[−exp{−b2(X − b3)}]
Technical Note

You may find the functions above useful, but the important thing to note is that, if there is a nonlinear function you use often, you can package the function once and for all. Consider the function we packaged called exp2, which estimates the model Y = b1 b2^X. The code for the function is

program define nlexp2
        version 7.0
        if "`1'"=="?" {
                global S_2 "2-param. exp. growth curve, `e(depvar)'=b1*b2^`2'"
                global S_1 "b1 b2"
                * Approximate initial values by regression of log Y on X
                local exp "[`e(wtype)'`e(wexp)']"
                tempvar Y
                quietly {
                        gen `Y' = log(`e(depvar)') if e(sample)
                        regress `Y' `2' `exp' if e(sample)
                        global b1 = exp(_b[_cons])
                        global b2 = exp(_b[`2'])
                }
                exit
        }
        replace `1'=$b1*$b2^`2'
end
Because we were packaging this function for repeated use, we went to the trouble of obtaining good initial values, which in this case we could obtain by taking the log of both sides,

    Y = b1 b2^X
    ln(Y) = ln(b1 b2^X) = ln(b1) + ln(b2) X

and then using linear regression to estimate ln(b1) and ln(b2). If this had been a quick-and-dirty implementation, we probably would not have bothered (initializing b1 and b2 to 1, say) and so forced ourselves to specify better initial values with nl's init() option when they were not good enough.
The only other thing we did to complete the packaging was store nlexp2 as an ado-file called nlexp2.ado. The alternatives would have been to type the code into Stata interactively or to place the code in a do-file. Those approaches are adequate for occasional use, but we wanted to be able to type nl exp2 without having to worry whether the program nlexp2 was defined. When nl attempts to execute nlexp2, if the program is not in Stata's memory, Stata will search the disk(s) for an ado-file of the same name and, if found, automatically load it. All we had to do was name the file with the .ado suffix and then place it in a directory where Stata could find it. In our case, we put nlexp2.ado in Stata's system directory for StataCorp-written ado-files. In your case, you should put the file in the directory Stata reserves for user-written ado-files, which is to say, c:\ado\personal (Windows), ~/ado/personal (Unix), or ~:ado:personal (Macintosh). See [U] 20 Ado-files.
Log-normal errors

A nonlinear model with identically normally distributed errors may be written

    yᵢ = f(xᵢ, β) + uᵢ,        uᵢ ~ N(0, σ²)                               (1)

for i = 1, ..., n. If the yᵢ are thought to have a k-shifted lognormal instead of a normal distribution, that is, ln(yᵢ − k) ~ N(ζᵢ, τ²), and the systematic part f(xᵢ, β) of the original model is still thought appropriate, the model becomes

    ln(yᵢ − k) = ζᵢ + vᵢ = ln{f(xᵢ, β) − k} + vᵢ,        vᵢ ~ N(0, τ²)      (2)

This model is estimated if lnlsq(k) is specified.

If model (2) is correct, the variance of (yᵢ − k) is proportional to {f(xᵢ, β) − k}². Probably the most common case is k = 0, sometimes called "proportional errors" since the standard error of yᵢ is proportional to its expectation, f(xᵢ, β). Assuming the value of k is known, (2) is just another nonlinear model in β and it may be fitted as usual. However, we may wish to compare the fit of (1) with that of (2) using the residual sum of squares or the deviance D, D = −2 × log-likelihood, from each model. To do so, we must allow for the change in scale introduced by the log transformation.

Assuming, then, the yᵢ to be normally distributed, Atkinson (1985, 85-87, 184), by considering the Jacobian ∏|∂ln(yᵢ − k)/∂yᵢ|, showed that multiplying both sides of (2) by the geometric mean of yᵢ − k, ẏ, gives residuals on the same scale as those of yᵢ. The geometric mean is given by

    ẏ = exp{(1/n) Σᵢ ln(yᵢ − k)}

which is a constant for a given dataset. The residual deviance for (1) and for (2) may be expressed as

    D(β̂) = {1 + ln(2πσ̂²)} n                                               (3)

where β̂ is the maximum likelihood estimate (MLE) of β for each model and nσ̂² is the RSS from (1), or that from (2) multiplied by ẏ².

Since (1) and (2) are models with different error structures but the same functional form, the arithmetic difference in their RSS or deviances is not easily tested for statistical significance. However, if the deviance difference is large (> 4, say), one would naturally prefer the model with the smaller deviance. Of course, the residuals for each model should be examined for departures from assumptions (nonconstant variance, nonnormality, serial correlations, etc.) in the usual way.

Consider alternatively modeling

    E(yᵢ) = 1/(C + A e^{Bxᵢ})                                              (4)
    E(1/yᵢ) = E(yᵢ') = C + A e^{Bxᵢ}                                       (5)
where C, A, and B are parameters to be estimated. We will use the data (y, x) = (.04, 5), (.06, 12), (.08, 25), (.1, 35), (.15, 42), (.2, 48), (.25, 60), (.3, 75), and (.5, 120) (Danuso 1991).

    Model                      C        A          B        RSS    Deviance
    -----------------------------------------------------------------------
    (4)                     1.781    25.74    -.03926    .001640     -51.95
    (4) with lnlsq(0)       1.799    25.45    -.04051    .001431     -53.18
    (5)                     1.781    25.74    -.03926      8.197      24.70
    (5) with lnlsq(0)       1.799    27.45    -.04051      3.651      17.42

There is little to choose between the two versions of the logistic model (4), whereas for the exponential model (5) the fit using lnlsq(0) is much better (a deviance difference of 7.28). The reciprocal transformation has introduced heteroskedasticity into yᵢ', which is countered by the proportional errors property of the lognormal distribution implicit in lnlsq(0). The deviances are not comparable between the logistic and exponential models because the change of scale has not been allowed for, although in principle it could be.
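The deviance and geometric-mean computations used in these comparisons are easy to reproduce; the Python sketch below (illustrative only, with our own function names) implements D = {1 + ln(2πσ̂²)}n and the geometric mean of yᵢ − k:

```python
import math

def deviance(rss, n):
    """Residual deviance D = {1 + ln(2*pi*sigma2)}*n with sigma2 = RSS/n."""
    return (1 + math.log(2 * math.pi * rss / n)) * n

def geometric_mean(y, k=0.0):
    """Geometric mean of y_i - k: exp{(1/n) * sum ln(y_i - k)}."""
    return math.exp(sum(math.log(yi - k) for yi in y) / len(y))
```

To put the shifted-lognormal model on the scale of the untransformed model, multiply its RSS by the squared geometric mean before computing the deviance.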
Weights

Weights are specified the usual way--analytic and frequency weights are supported; see [U] 23.13 Weighted estimation. Use of analytic weights implies that the yᵢ have different variances. Therefore, model (1) may be rewritten as

    yᵢ = f(xᵢ, β) + uᵢ,        uᵢ ~ N(0, σ²/wᵢ)                            (1a)

where wᵢ are (positive) weights, assumed known and normalized such that their sum equals the number of observations. The residual deviance for (1a) is

    D(β̂) = {1 + ln(2πσ̂²)} n − Σᵢ ln(wᵢ)                                   (3a)

(compare with equation 3), where

    σ̂² = (1/n) Σᵢ wᵢ {yᵢ − f(xᵢ, β̂)}²

Defining and fitting a model equivalent to (2) when weights have been specified as in (1a) is not straightforward and has not been attempted. Thus, deviances using and not using the lnlsq() option may not be strictly comparable when analytic weights (other than 0 and 1) are used.
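As a numerical illustration of (3a) (a sketch with our own function name, not nl's code): with all weights equal to 1 the correction term vanishes and (3a) reduces to (3).

```python
import math

def weighted_deviance(rss, w):
    """Residual deviance (3a): {1 + ln(2*pi*sigma2)}*n - sum(ln w_i),
    with sigma2 = RSS/n and weights w normalized so that sum(w) = n."""
    n = len(w)
    return (1 + math.log(2 * math.pi * rss / n)) * n - sum(math.log(wi) for wi in w)
```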
Errors

nl will stop with error 196 if an error occurs in your nlfcn program, and it will report the error code raised by nlfcn.

nl is reasonably robust to the inability of nlfcn to calculate predicted values for certain parameter values. nl assumes that predicted values can be calculated at the initial value of the parameters. If this is not so, an error message is issued with return code 480. Thereafter, as nl changes the parameter values, it monitors nlfcn's returned predictions for unexpected missing values. If detected, nl backs up. That is, nl finds a linear combination of the previous, known-to-be-good parameter vector and the new, known-to-be-bad vector, a combination where the function can be evaluated, and continues its iterations from that point.

nl does require, however, that once a parameter vector is found where the predictions can be calculated, small changes to the parameter vector can be made in order to calculate numeric derivatives. If a boundary is encountered at this point, an error message is issued with return code 481.

When specifying lnlsq(), an attempt to take logarithms of yᵢ − k when yᵢ ≤ k results in an error message with return code 482.
    type        restaurant
    --------------------------------------------------
    type 1      Freebirds, MamasPizza
    type 2      CafeEccell, LosNortenos, WingsNmore
    type 3      Christophers, MadCows
Test of the independence of irrelevant alternatives (IIA)

The property of the multinomial logit model and conditional logit model where odds ratios are independent of the other alternatives is referred to as the independence of irrelevant alternatives (IIA). Hausman and McFadden (1984) suggest that if a subset of the choice set truly is irrelevant with respect to the other alternatives, omitting it from the model will not lead to inconsistent estimates. Therefore, Hausman's (1978) specification test can be used to test for IIA.
Example

Suppose we want to run clogit on our choice of restaurants dataset. We also want to test IIA between the alternatives of family restaurants and the alternatives of fast food places and fancy restaurants. To do so, we need to use Stata's hausman command; see [R] hausman. We first run the estimation on the full bottom alternative set; save the results using hausman, save; and then run the estimation on the bottom alternative set, excluding the alternatives of family restaurants. We then run the hausman test, with the less option indicating the order in which our models were fit.
. gen incFast = (type == 1) * income
. gen incFancy = (type == 3) * income
. gen kidFast = (type == 1) * kids
. gen kidFancy = (type == 3) * kids
. clogit chosen cost rating distance incFast incFancy kidFast kidFancy,
> group(family_id)

Iteration 0:  log likelihood = -564.57856
Iteration 1:  log likelihood = -496.41546
Iteration 2:  log likelihood = -489.35097
Iteration 3:  log likelihood = -488.91205
Iteration 4:  log likelihood = -488.90834
Iteration 5:  log likelihood = -488.90834

Conditional (fixed-effects) logistic regression     Number of obs =      2100
                                                    LR chi2(7)    =    189.73
                                                    Prob > chi2   =    0.0000
Log likelihood = -488.90834                         Pseudo R2     =    0.1625
      chosen |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        cost |  -.1367799   .0358479    -3.82   0.000    -.2070404   -.0665193
      rating |   .3066622   .1418291     2.16   0.031     .0286823     .584642
    distance |  -.1977505   .0471653    -4.19   0.000    -.2901927   -.1053082
     incFast |  -.0390183   .0094018    -4.15   0.000    -.0574455   -.0205911
    incFancy |   .0407053   .0080405     5.06   0.000     .0249462    .0564644
     kidFast |  -.2398757   .1063674    -2.26   0.024     -.448352   -.0313994
    kidFancy |  -.3893862   .1143797    -3.40   0.001    -.6135662   -.1652061
------------------------------------------------------------------------------
. hausman, save
. clogit chosen cost rating distance incFast incFancy kidFast kidFancy
> if type != 2, group(family_id)
note: 222 groups (888 obs) dropped due to all positive or all negative outcomes.

Iteration 0:  log likelihood = -104.85538
Iteration 1:  log likelihood = -88.077817
Iteration 2:  log likelihood = -86.094611
Iteration 3:  log likelihood = -85.956423
Iteration 4:  log likelihood = -85.955324
Iteration 5:  log likelihood = -85.955324

Conditional (fixed-effects) logistic regression     Number of obs =       312
                                                    LR chi2(5)    =     44.35
                                                    Prob > chi2   =    0.0000
Log likelihood = -85.955324                         Pseudo R2     =    0.2051

      chosen |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        cost |  -.0616621    .067852    -0.91   0.363    -.1946496    .0713254
      rating |   .1659001   .2832041     0.59   0.558    -.3891698      .72097
    distance |   -.244396   .0995056    -2.46   0.014    -.4394234   -.0493687
     incFast |  -.0737506   .0177444    -4.16   0.000     -.108529   -.0389721
     kidFast |   .4105386   .2137051     1.92   0.055    -.0083157    .8293928
------------------------------------------------------------------------------
. hausman, less

                 ---- Coefficients ----
             |      (b)          (B)          (b-B)     sqrt(diag(V_b-V_B))
             |    Current       Prior       Difference          S.E.
-------------+-------------------------------------------------------------
        cost |  -.0616621    -.1367799       .0751178        .0576092
    distance |   -.244396    -.1977505      -.0466456        .0876173
     incFast |  -.0737506    -.0390183      -.0347323         .015049
     kidFast |   .4105386    -.2398757       .6504143        .1853533

               b = less efficient estimates obtained from clogit
               B = fully efficient estimates obtained previously from clogit

    Test:  Ho:  difference in coefficients not systematic

             chi2(5) = (b-B)'[(V_b-V_B)^(-1)](b-B)
                     =   10.70
           Prob>chi2 =  0.0577
The small p-value indicates that the IIA assumption between the alternatives of family restaurants and the alternatives of other restaurants is weak, hinting that the more complex nested logit model should be utilized.
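The statistic hausman prints follows directly from the displayed formula; the following Python sketch (illustrative, with our own names, not Stata's implementation) evaluates the quadratic form (b-B)'[V_b-V_B]^(-1)(b-B):

```python
import numpy as np

def hausman_stat(b, B, Vb, VB):
    """Hausman chi-squared: (b-B)' [Vb - VB]^-1 (b-B), where b is the less
    efficient and B the fully efficient estimator."""
    d = np.asarray(b) - np.asarray(B)
    return float(d @ np.linalg.inv(np.asarray(Vb) - np.asarray(VB)) @ d)
```

The statistic is referred to a chi-squared distribution with degrees of freedom equal to the number of compared coefficients.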
nlogit -- Maximum-likelihood nested logit estimation
Model estimation

Example

In this example, we want to examine how alternative-specific attributes apply to the bottom alternative set (all seven of the specific restaurants), and how family-specific attributes apply to the alternative set at the first decision level (all three types of restaurants).

. nlogit chosen (restaurant = cost rating distance) (type = incFast incFancy
> kidFast kidFancy), group(family_id) nolog

tree structure specified for the nested logit model

    top --> bottom

    type        restaurant
    fast        Freebirds
                MamasPizza
    family      CafeEccell
                LosNortenos
                WingsNmore
    fancy       Christophers
                MadCows

Nested logit estimation                             Number of obs =      2100
Levels             =         2                      LR chi2(10)   =  199.6293
Dependent variable =    chosen                      Prob > chi2   =    0.0000
Log likelihood     = -483.9584
      chosen |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
restaurant   |
        cost |  -.0944352     .03402    -2.78   0.006    -.1611131   -.0277572
      rating |   .1793759    .126895     1.41   0.157    -.0693338    .4280855
    distance |  -.1745797   .0433352    -4.03   0.000    -.2595152   -.0896443
-------------+----------------------------------------------------------------
type         |
     incFast |  -.0287502   .0116242    -2.47   0.013    -.0515332   -.0059672
    incFancy |   .0458373   .0089109     5.14   0.000     .0283722    .0633024
     kidFast |  -.0704164   .1394359    -0.51   0.614    -.3437058    .2028729
    kidFancy |  -.3626381   .1171277    -3.10   0.002    -.5922041   -.1330721
-------------+----------------------------------------------------------------
(IV params)  |
type         |
       /fast |   5.715758   2.332871     2.45   0.014     1.143415     10.2881
     /family |   1.721222   1.152002     1.49   0.135    -.5366608    3.979105
      /fancy |   1.466588   .4169075     3.52   0.000     .6494642    2.283711
------------------------------------------------------------------------------
LR test of homoscedasticity (iv = 1): chi2(3) =  9.90   Prob > chi2 = 0.0194
In this model,

    Pr(restaurant | type) = Pr(β_cost cost + β_rating rating + β_dist distance)

    Pr(type) = Pr(α_incFast incFast + α_incFancy incFancy
                  + α_kidFast kidFast + α_kidFancy kidFancy
                  + τ_fast IV_fast + τ_family IV_family + τ_fancy IV_fancy)

The LR test against the constant-only model indicates that the model is significant (p-value = 0.0000). The inclusive value parameters for fast, family, and fancy are 5.715758, 1.721222, and 1.466588,
respectively.

The LR test reported at the bottom of the table is a test for the nesting (heteroskedasticity) against the null hypothesis of homoskedasticity. Computationally, it is the comparison of the likelihood of a nonnested clogit model against the nested logit model likelihood. The chi-squared value of 9.90 clearly supports the use of the nested logit model with these data.

Example

Continuing with the above example, we fix all three inclusive value parameters to be 1 to recover the model estimated by clogit.

. nlogit
chosen (restaurant = cost rating distance) (type = incFast incFancy
> kidFast kidFancy), group(family_id) ivc(fast=1, family=1, fancy=1) nolog
> notree

User defined constraint(s):
 1000:  [fast]_cons = 1
  999:  [family]_cons = 1
  998:  [fancy]_cons = 1

Nested logit estimation                             Number of obs =      2100
Levels             =         2                      LR chi2(7)    =  189.7294
Dependent variable =    chosen                      Prob > chi2   =    0.0000
Log likelihood     = -488.90834

      chosen |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
restaurant   |
        cost |  -.1367799   .0358479    -3.82   0.000    -.2070404   -.0665193
      rating |   .3066622   .1418291     2.16   0.031     .0286823     .584642
    distance |  -.1977505   .0471653    -4.19   0.000    -.2901927   -.1053082
-------------+----------------------------------------------------------------
type         |
     incFast |  -.0390183   .0094018    -4.15   0.000    -.0574455   -.0205911
    incFancy |   .0407053   .0080405     5.06   0.000     .0249462    .0564644
     kidFast |  -.2398757   .1063674    -2.26   0.024     -.448352   -.0313994
    kidFancy |  -.3893862   .1143797    -3.40   0.001    -.6135662   -.1652061
-------------+----------------------------------------------------------------
(IV params)  |
type         |
       /fast |          1
     /family |          1
      /fancy |          1
------------------------------------------------------------------------------
LR test of homoscedasticity (iv = 1): chi2(0) =  0.00   Prob > chi2 =      .

. clogit chosen cost rating distance incFast incFancy kidFast kidFancy,
> group(family_id)

Iteration 0:  log likelihood = -564.57856
Iteration 1:  log likelihood = -496.41546
Iteration 2:  log likelihood = -489.35097
Iteration 3:  log likelihood = -488.91205
Iteration 4:  log likelihood = -488.90834
Iteration 5:  log likelihood = -488.90834

Conditional (fixed-effects) logistic regression     Number of obs =      2100
                                                    LR chi2(7)    =    189.73
                                                    Prob > chi2   =    0.0000
Log likelihood = -488.90834                         Pseudo R2     =    0.1625

      chosen |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        cost |  -.1367799   .0358479    -3.82   0.000    -.2070404   -.0665193
      rating |   .3066622   .1418291     2.16   0.031     .0286823     .584642
    distance |  -.1977505   .0471653    -4.19   0.000    -.2901927   -.1053082
     incFast |  -.0390183   .0094018    -4.15   0.000    -.0574455   -.0205911
    incFancy |   .0407053   .0080405     5.06   0.000     .0249462    .0564644
     kidFast |  -.2398757   .1063674    -2.26   0.024     -.448352   -.0313994
    kidFancy |  -.3893862   .1143797    -3.40   0.001    -.6135662   -.1652061
------------------------------------------------------------------------------
Obtaining predicted values

predict may be used after nlogit to obtain the predicted values of the probabilities, the conditional probabilities, the linear predictions, and the inclusive values for each level of the nested logit model. Predicted probabilities for nlogit must be interpreted carefully. Probabilities are estimated for each group as a whole and not for individual observations.
Example

Continuing with our model with no constraints, we can predict pb = Pr(restaurant); pl = Pr(type); condpb = Pr(restaurant | type); xbb, the linear prediction for the bottom-level alternatives; xbl, the linear prediction for the first-level alternatives; and the inclusive values ivb for the bottom-level alternatives.

. quietly nlogit chosen (restaurant = cost rating distance) (type = incFast
> incFancy kidFast kidFancy), group(family_id) nolog
. predict pb
(option pb assumed; Pr(mode))
. predict pl, pl
. predict condpb, condpb
. predict xbb, xbb
. predict xbl, xbl
. predict ivb, ivb
. list id chosen pb pl condpb in 1/14
      id   chosen         pb         pl     condpb
  1.   1        1   .0831245   .1534534   .5416919
  2.   1        0    .070329   .1534534   .4583081
  3.   1        0   .2763391   .7266538   .3802899
  4.   1        0    .284375   .7266538   .3913486
  5.   1        0   .1659397   .7266538   .2283615
  6.   1        0   .0399215   .1198928   .3329766
  7.   1        0   .0799713   .1198928   .6670234
  8.   2        0     .01176   .0286579   .4103599
  9.   2        0   .0168978   .0286579   .5896401
 10.   2        0   .2942401   .7521651   .3911909
 11.   2        1   .2975767   .7521651   .3956268
 12.   2        0   .1603483   .7521651   .2131824
 13.   2        0   .1277234    .219177    .582741
 14.   2        0   .0914536    .219177    .417259
. list id chosen xbb xbl ivb in 1/14

      id   chosen        xbb         xbl         ivb
  1.   1        1   -.731619   -1.191674   -.1185611
  2.   1        0  -.8987747   -1.191674   -.1185611
  3.   1        0  -1.149417           0   -.1825957
  4.   1        0  -1.120752           0   -.1825957
  5.   1        0  -1.659421           0   -.1825957
  6.   1        0  -3.514237    1.425016   -2.414554
  7.   1        0  -2.819484    1.425016   -2.414554
  8.   2        0   -1.22427   -1.878761   -.3335493
  9.   2        0  -.8617923   -1.878761   -.3335493
 10.   2        0  -1.239346           0   -.3007865
 11.   2        1   -1.22807           0   -.3007865
 12.   2        0  -1.846394           0   -.3007865
 13.   2        0  -2.804756    1.570648   -2.264743
 14.   2        0  -3.138791    1.570648   -2.264743
Saved Results

nlogit saves in e():

Scalars
    e(N)          number of observations
    e(k_eq)       number of equations
    e(levels)     depth of the model
    e(rc)         return code
    e(N_g)        number of groups
    e(df_m)       model degrees of freedom
    e(df_mc)      model degrees of freedom for comparison test
    e(ll)         log likelihood
    e(ll_0)       log likelihood, constant-only model
    e(ll_c)       log likelihood, clogit model
    e(chi2)       chi-squared
    e(chi2_c)     chi-squared for comparison test
    e(p)          p-value for model chi-squared test
    e(p_c)        p-value for comparison test
    e(ic)         number of iterations
    e(rank)       rank of e(V)

Macros
    e(cmd)        nlogit
    e(depvar)     name of dependent variable
    e(title)      title in estimation output
    e(group)      name of group() variable
    e(wtype)      weight type
    e(wexp)       weight expression
    e(vcetype)    covariance estimation method
    e(level#)     alternative-set variable for level #
    e(user)       name of likelihood-evaluator program
    e(opt)        type of optimization
    e(chi2type)   LR; type of model chi-squared test
    e(cnslist)    constraint numbers
    e(iv_names)   names of parameters for inclusive values
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(ilog)       iteration log (up to 20 iterations)
    e(V)          variance-covariance matrix of the estimators

Functions
    e(sample)     marks estimation sample
Methods and Formulas

nlogit is implemented as an ado-file.

Greene (2000, 865-871) and Maddala (1983, 67-70) provide introductions to the nested logit model.

We will present the methods and formulas for a three-level nested logit model. The extension of this model to cases involving more levels of a tree is apparent, but is more complicated.

Using the same notation as Greene (2000), we index the first-level alternative as i, the second-level alternative as j, and the bottom-level alternative as k. Let X_ijk, Y_ij, and Z_i refer to the vectors of explanatory variables specific to categories (i,j,k), (i,j), and (i), respectively. We write

    Pr_ijk = Pr_{k|ij} Pr_{j|i} Pr_i

The conditional probability Pr_{k|ij} will involve only the parameters beta:

    Pr_{k|ij} = exp(beta'X_ijk) / sum_n exp(beta'X_ijn)

We define the inclusive value for category (i,j) as

    I_ij = ln{ sum_n exp(beta'X_ijn) }

and then

    Pr_{j|i} = exp(alpha'Y_ij + tau_ij I_ij) / sum_m exp(alpha'Y_im + tau_im I_im)

Define the inclusive value for category (i) as

    J_i = ln{ sum_m exp(alpha'Y_im + tau_im I_im) }

and then

    Pr_i = exp(gamma'Z_i + delta_i J_i) / sum_l exp(gamma'Z_l + delta_l J_l)

If we restrict all the tau_ij and delta_i to be 1, we then recover the conditional logit model of the following form:

    Pr_ijk = exp(V_ijk) / sum_{l,m,n} exp(V_lmn)    where    V_ijk = beta'X_ijk + alpha'Y_ij + gamma'Z_i

There are two ways to estimate the nested logit model: sequential estimation and full information maximum likelihood estimation. nlogit estimates the model using the full information maximum likelihood method. If g = 1, 2, ..., G denotes the groups, and Pr^g_ijk is the probability of category (i,j,k) being the positive outcome in group g, the log likelihood of the nested logit model is

    ln L = sum_g ln(Pr^g_ijk) = sum_g ( ln Pr^g_{k|ij} + ln Pr^g_{j|i} + ln Pr^g_i )
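The decomposition above can be checked numerically. The sketch below is Python rather than Stata and is not part of the original entry; the function name and toy utilities are hypothetical. It builds a two-level version of the same product Pr(k|nest) * Pr(nest) from inclusive values, and with every inclusive-value parameter set to 1 it collapses to ordinary conditional logit, as the text states.

```python
import math

def nested_logit_probs(v_bottom, tau=1.0):
    """v_bottom maps each nest to a list of bottom-level utilities
    (the beta'X terms).  Returns Pr(nest, alternative) computed as
    Pr(k | nest) * Pr(nest), where Pr(nest) is driven by tau times
    the inclusive value I_nest = ln(sum of exp(utilities))."""
    incl, cond = {}, {}
    for nest, vs in v_bottom.items():
        denom = sum(math.exp(v) for v in vs)
        incl[nest] = math.log(denom)                      # inclusive value
        cond[nest] = [math.exp(v) / denom for v in vs]    # Pr(k | nest)
    top_denom = sum(math.exp(tau * incl[n]) for n in v_bottom)
    top = {n: math.exp(tau * incl[n]) / top_denom for n in v_bottom}
    return {(n, k): top[n] * cond[n][k]
            for n in v_bottom for k in range(len(v_bottom[n]))}
```

With tau = 1, the probability of each alternative equals the plain multinomial logit probability over the pooled utilities, which is exactly the "recover the conditional logit model" restriction described above.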
References
Amemiya, T. 1985. Advanced Econometrics. Cambridge, MA: Harvard University Press.
Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.
Hausman, J. 1978. Specification tests in econometrics. Econometrica 46: 1251-1271.
Hausman, J. and D. McFadden. 1984. Specification tests for the multinomial logit model. Econometrica 52: 1219-1240.
Maddala, G. S. 1983. Limited-dependent and Qualitative Variables in Econometrics. Cambridge: Cambridge University Press.
McFadden, D. 1977. Quantitative methods for analyzing travel behavior of individuals: some recent developments. Cowles Foundation Discussion Paper no. 474.
McFadden, D. 1981. Econometric models of probabilistic choice. In Structural Analysis of Discrete Data with Econometric Applications, pp. 198-272. Cambridge, MA: MIT Press.
Also See
Complementary:  [R] lincom, [R] lrtest, [R] predict, [R] test, [R] testnl, [R] xi
Related:        [R] clogit, [R] logistic, [R] logit, [R] mlogit
Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands,
                [U] 23.11 Obtaining robust variance estimates,
                [R] maximize
Title
notes -- Place notes in data

Syntax
    notes [evarname] : text
    notes [evarlist] [in #[/#]]
    notes drop evarlist [in #[/#]]

where evarlist is a varlist but may also contain the word _dta, and # is a number or the letter l.

If text includes the letters TS surrounded by blanks, the TS is removed and a time stamp is substituted in its place.
Description
notes attaches notes to the dataset in memory. These notes become a part of the dataset; they are saved when the dataset is saved and retrieved when the dataset is used; see [R] save. notes can be attached generically to the dataset or specifically to a variable within the dataset.

Remarks
A note is nothing formal; it is merely a string of text--probably words in your native language--reminding you to do something, cautioning you against something, or anything else you might feel like jotting down. People who work with real data invariably end up with paper notes plastered around their terminal saying things like "Send the new sales data to Bob" or "Check the income variable in salary95; I don't believe it" or "The gender dummy was significant!" It would be better if these notes were attached to the dataset. Attached to the terminal, they tend to fall off and get lost.

Adding a note to your dataset requires typing note or notes (they are synonyms), a colon (:), and whatever you want to remember. The note is added to the dataset currently in memory.
. note: Send copy to Bob once verified.

You can display your notes by typing notes (or note) by itself.

. notes
_dta:
  1.  Send copy to Bob once verified.

Once you resave your data, you can replay the note in the future, too. You add more notes just as you did the first:

. note: Mary wants a copy, too.
. notes
_dta:
  1.  Send copy to Bob once verified.
  2.  Mary wants a copy, too.
You can place time stamps on your notes by placing the word TS (in capitals) in the text of your note:

. note: TS merged updates from JJ&F
. notes
_dta:
  1.  Send copy to Bob once verified.
  2.  Mary wants a copy, too.
  3.  19 Jul 2000 15:38 merged updates from JJ&F

The notes we have added so far are attached to the dataset generically, which is why Stata prefixes them with _dta when it lists them. You can attach notes to variables:

. note mpg: is the 44 a mistake? Ask Bob.
. note mpg: what about the two missing values?
. notes
_dta:
  1.  Send copy to Bob once verified.
  2.  Mary wants a copy, too.
  3.  19 Jul 2000 15:38 merged updates from JJ&F
mpg:
  1.  is the 44 a mistake? Ask Bob.
  2.  what about the two missing values?

Up to 9,999 generic notes can be attached to _dta and another 9,999 notes can be attached to each variable.
Selectively listing notes

notes by itself lists all the notes. In full syntax, notes is equivalent to typing notes _all in 1/l. Here are some variations:

    notes _dta           list all generic notes
    notes mpg            list all notes for variable mpg
    notes _dta mpg       list all generic notes and mpg notes
    notes _dta in 3      list generic note 3
    notes _dta in 3/5    list generic notes 3 through 5
    notes mpg in 3/5     list mpg notes 3 through 5
    notes _dta in 3/l    list generic notes 3 through last
Deleting notes

notes drop works much like listing notes, except that typing notes drop by itself does not delete all notes; type notes drop _all. Some variations:

    notes drop _dta         delete all generic notes
    notes drop _dta in 3    delete generic note 3
    notes drop _dta in 3/5  delete generic notes 3 through 5
    notes drop _dta in 3/l  delete generic notes 3 through last
    notes drop mpg in 4     delete mpg note 4
"
_
T
.......
!
............................
._ .......................................................
_
-_ .i ¸
i
_:
notes -- Place notes in data
" 445
Warnings

1. Notes are stored with the data and, as with other updates you make to the data, the additions and deletions are not permanent until you save the data; see [R] save.

2. The maximum length of a single note is 1,000 characters with Small Stata and 67,784 characters with Intercooled Stata.
Methods and Formulas

notes is implemented as an ado-file.
References
Gleason, J. R. 1998. dm57: A notes editor for Windows and Macintosh. Stata Technical Bulletin 43: 6-9. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 10-13.

Also See
Complementary:  [R] describe, [R] save
Related:        [R] codebook
Background:     [U] 15.8 Characteristics
Title
nptrend -- Test for trend across ordered groups

Syntax
    nptrend varname [if exp] [in range], by(groupvar) [ nodetail score(scorevar) ]

Description
nptrend performs a nonparametric test for trend across ordered groups.

Options
by(groupvar) is not optional; it specifies the group on which the data are to be ordered.

nodetail suppresses the listing of group rank sums.

score(scorevar) defines scores for groups. When not specified, the values of groupvar are used for the scores.
Remarks
nptrend performs the nonparametric test for trend across ordered groups developed by Cuzick (1985), which is an extension of the Wilcoxon rank-sum test (ranksum; see [R] signrank). A correction for ties is incorporated into the test. nptrend is a useful adjunct to the Kruskal-Wallis test; see [R] kwallis.
In addition to nptrend, for nongrouped data the signtest and spearman commands can be useful; see [R] signrank and [R] spearman. The Cox and Stuart test, for instance, applies the sign test to differences between equally spaced observations of varname. The Daniels test calculates Spearman's rank correlation of varname with a time index. Under appropriate conditions, the Daniels test is more powerful than the Cox and Stuart test. See Conover (1999) for a discussion of these tests and their asymptotic relative efficiency.

Example
The following data (Altman 1991, 217) show ocular exposure to ultraviolet radiation for 32 pairs of sunglasses classified into 3 groups according to the amount of visible light transmitted.

    Group   Transmission of    Ocular exposure to ultraviolet radiation
            visible light
      1     < 25%              1.4  1.4  1.4  1.6  2.3  2.3
      2     25 to 35%          0.9  1.0  1.1  1.1  1.2  1.2  1.5  1.9  2.2
                               2.6  2.6  2.6  2.8  2.8  3.2  3.5  4.3  5.1
      3     > 35%              0.8  1.7  1.7  1.7  3.4  7.1  8.9  13.5

Entering these data into Stata, we have
. list exposure group

        exposure   group
   1.        1.4       1
   2.        1.4       1
   3.        1.4       1
   4.        2.3       1
   5.        2.3       1
       (output omitted)
  31.        8.9       3
  32.       13.5       3

We use nptrend to test for a trend of (increasing) exposure across the 3 groups by typing

. nptrend exposure, by(group)

      group   score    obs   sum of ranks
          1       1      6             76
          2       2     18            290
          3       3      8            162

           z  =  1.52
   Prob > |z| =  0.13

When the groups are given any equally spaced scores (such as -1, 0, 1), we will obtain the same answer as above. To illustrate the effect of changing scores, an analysis of these data with scores 1, 2, and 5 (admittedly not very sensible in this case) produces

. gen mysc = cond(group==3,5,group)
. nptrend exposure, by(group) score(mysc)

      group   score    obs   sum of ranks
          1       1      6             76
          2       2     18            290
          3       5      8            162

           z  =  1.46
   Prob > |z| =  0.14

This example suggests that the analysis is not all that sensitive to the scores chosen.
Technical Note
The grouping variable may be either a string variable or a numeric variable. If it is a string variable and no score variable is specified, the natural numbers 1, 2, 3, ... are assigned to the groups in the sort order of the string variable. This may not always be what you expect. For example, the sort order of the strings one, two, three is one, three, two.
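This gotcha is easy to see outside Stata as well. The toy Python helper below (hypothetical, merely mimicking the assignment rule described in this note) shows that the string value two receives score 3 while three receives score 2.

```python
def assigned_scores(names):
    """Natural numbers 1, 2, 3, ... assigned in the sort order of the strings."""
    order = sorted(set(names))
    return {name: order.index(name) + 1 for name in order}
```

For example, assigned_scores(["one", "two", "three"]) assigns one -> 1, three -> 2, two -> 3, which is probably not the ordering you intended.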
Saved Results

nptrend saves in r():

Scalars
    r(N)    number of observations        r(z)    z statistic
    r(p)    two-sided p-value             r(T)    test statistic
Methods and Formulas

nptrend is implemented as an ado-file.

nptrend is based on a method in Cuzick (1985). The following description of the statistic is from Altman (1991, 215-217). We have k groups of sample sizes n_i (i = 1, ..., k). The groups are given scores, l_i, which reflect their ordering, such as 1, 2, and 3. The scores do not have to be equally spaced, but they usually are. The total set of N = sum n_i observations are ranked from 1 to N, and the sums of the ranks in each group, R_i, are obtained. L, the weighted sum of all the group scores, is

    L = sum_{i=1}^{k} l_i n_i

The statistic T is calculated as

    T = sum_{i=1}^{k} l_i R_i

Under the null hypothesis, the expected value of T is E(T) = .5(N+1)L, and its standard error is

    se(T) = sqrt{ (N+1)/12 ( N sum_{i=1}^{k} l_i^2 n_i - L^2 ) }

so that the test statistic, z, is given by z = {T - E(T)}/se(T), which has an approximately standard normal distribution when the null hypothesis of no trend is true.

The correction for ties affects the standard error of T. Let N' be the number of unique values of the variable being tested (N' <= N).
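As a check on these formulas, the sketch below (Python rather than Stata, with a hypothetical function name) computes z from the group scores, sizes, and rank sums, ignoring the tie correction. Plugging in the sunglasses example from earlier in this entry reproduces the reported statistics to two decimal places.

```python
import math

def cuzick_z(groups, N):
    """z statistic for Cuzick's trend test, without the tie correction.
    groups: list of (score l_i, size n_i, rank sum R_i) tuples."""
    L = sum(l * n for l, n, _ in groups)
    T = sum(l * R for l, _, R in groups)
    expected = 0.5 * (N + 1) * L
    var = (N + 1) / 12.0 * (N * sum(l * l * n for l, n, _ in groups) - L * L)
    return (T - expected) / math.sqrt(var)
```

With scores 1, 2, 3 this gives z = 1.52, and with scores 1, 2, 5 it gives z = 1.46, matching the nptrend output shown above (the tie correction is negligible for these data).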
ologit -- Maximum-likelihood ordered logit estimation

------------------------------------------------------------------------------
       rep77 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |   1.455878   .5308946     2.74   0.006     .4153436    2.496412
-------------+----------------------------------------------------------------
       _cut1 |  -2.765562   .5988207          (Ancillary parameters)
       _cut2 |  -.9963603   .3217704
       _cut3 |   .9426153   .3136396
       _cut4 |   3.123351   .5423237
------------------------------------------------------------------------------

The cut points partition the linear prediction into the observed categories:

    rep77         Probability
    Poor          Pr( xb+u < _cut1 )
    Fair          Pr( _cut1 < xb+u < _cut2 )
    Average       Pr( _cut2 < xb+u < _cut3 )
    Good          Pr( _cut3 < xb+u < _cut4 )
    Excellent     Pr( _cut4 < xb+u )
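These cut points can be turned into category probabilities directly. The following sketch (a hypothetical Python helper, not part of the entry) applies the logistic distribution of u to estimated cut points, computing Pr(cut_{i-1} < xb + u <= cut_i) for each category; the cut values in the test are those from the rep77 table above.

```python
import math

def logistic_cdf(t):
    """CDF of the standard logistic distribution; handles +/-inf safely."""
    return 1.0 / (1.0 + math.exp(-t))

def ordered_logit_probs(xb, cuts):
    """Category probabilities Pr(cut_{i-1} < xb + u <= cut_i)."""
    bounds = [-math.inf] + sorted(cuts) + [math.inf]
    cdf = [logistic_cdf(b - xb) for b in bounds]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]
```

Evaluating at xb = 1.455878 (foreign cars) versus xb = 0 (domestic cars) shows the probability mass shifting toward the higher repair-record categories, as the positive foreign coefficient implies.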
If we define odds(k) = Pr(xb + u > k) / Pr(xb + u <= k), then odds(k1) and odds(k2) have the same ratio for all independent variable combinations. The model is based on the principle that the only effect of combining adjoining categories in ordered categorical regression problems should be a loss of efficiency in the estimation of the regression parameters (McCullagh 1980). This model was also described by Zavoina and McKelvey (1975), and previously by Aitchison and Silvey (1957) in a different algebraic form. Brant (1990) offers a set of diagnostics for the model. Peterson and Harrell (1990) suggest a model that allows nonproportional odds for a subset of the explanatory variables. ologit does not allow this, but a similar model was implemented by Fu (1998).

The stereotype model rejects the principle on which the ordered logit model is based. Anderson (1984) argues that there are two distinct types of ordered categorical variables: "grouped continuous", like income, where the "type a" model applies; and "assessed", like extent of pain relief, where the stereotype model applies. Greenland (1985) independently developed the same model. The stereotype model starts with a multinomial logistic regression model and imposes constraints on this model.

Goodness of fit for ologit can be evaluated by comparing the likelihood value with that obtained by estimating the model with mlogit. Let L1 be the log-likelihood value reported by ologit and let L0 be the log-likelihood value reported by mlogit. If there are p independent variables (excluding the constant) and c categories, mlogit will fit p(c-2) additional parameters. One can then perform a "likelihood-ratio test", i.e., calculate -2(L1 - L0), and compare it to chi-squared with p(c-2) degrees of freedom. This test is only suggestive because the ordered logit model is not nested within the multinomial logit model. A large value of -2(L1 - L0) should, however, be taken as evidence of poorness of fit. Marginally large values, on the other hand, should not be taken too seriously.

The coefficients and cut points are estimated using maximum likelihood as described in [R] maximize. In our parameterization, no constant appears, as the effect is absorbed into the cut points.

ologit and oprobit begin by tabulating the dependent variable. Category i = 1 is defined as the minimum value of the variable, i = 2 as the next ordered value, and so on, for the empirically determined c categories. The probability of observing an observation with outcome i is

    Pr(outcome = i) = Pr( k_{i-1} < xb + u <= k_i )

where k_0 = -infinity, k_c = +infinity, and k_1, ..., k_{c-1} are the estimated cut points.
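The degrees of freedom for the suggestive comparison with mlogit come from simple parameter counting. The short sketch below (hypothetical helper names) makes the arithmetic explicit.

```python
def ologit_nparams(p, c):
    """p slope coefficients plus c-1 cut points."""
    return p + (c - 1)

def mlogit_nparams(p, c):
    """c-1 equations, each with p slopes and a constant."""
    return (c - 1) * (p + 1)

def lr_df(p, c):
    """Extra parameters fitted by mlogit relative to ologit: p*(c-2)."""
    return mlogit_nparams(p, c) - ologit_nparams(p, c)
```

For example, with p = 3 regressors and c = 5 categories, mlogit fits 16 parameters against ologit's 7, so the comparison uses 9 = 3(5-2) degrees of freedom.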
Example
Assume that we have one subject and are interested in determining the drug profile for that subject. A reasonable experiment would be to give the subject the drug and then measure the concentration of the drug in the subject's blood over a time period. For example, here is a dataset from Chow and Liu (2000):

. list

         time   concentration
  1.        0               0
  2.        1             2.8
  3.      1.5             4.4
  4.        2             4.4
  5.        3             4.7
  6.        4             4.1
  7.        8             3.6
  8.       12               3
  9.       16             2.5
 10.       24            1.62

Examining these data, we notice that the concentration quickly increases, plateaus for a short period, and then slowly decreases over time. pkexamine is used to calculate the pharmacokinetic measures of interest; pkexamine is explained in detail in [R] pkexamine. The output is
Title
pk -- Pharmacokinetic (biopharmaceutical) data
Description
The term pk refers to pharmacokinetic data and the commands, all of which begin with the letters pk, designed to do some of the analyses commonly performed in the pharmaceutical industry. The system is intended for the analysis of pharmacokinetic data, although some of the commands are of general use.

The pk commands are

    pkexamine    [R] pkexamine    Calculate pharmacokinetic measures
    pksumm       [R] pksumm       Summarize pharmacokinetic data
    pkshape      [R] pkshape      Reshape (pharmacokinetic) Latin-square data
    pkcross      [R] pkcross      Analyze crossover experiments
    pkequiv      [R] pkequiv      Perform bioequivalence tests
    pkcollapse   [R] pkcollapse   Generate pharmacokinetic measurement dataset
Remarks
Several types of clinical trials are commonly performed in the pharmaceutical industry. Examples include combination trials, multicenter trials, equivalence trials, and active control trials. For each type of trial, there is an optimal study design for estimating the effects of interest. Currently, the pk system can be used to analyze equivalence trials. These trials are usually conducted using a crossover design; however, it is possible to use a parallel design and still draw conclusions about equivalence.

The goal of an equivalence trial is the assessment of bioequivalence between two drugs. While it is impossible to prove that two drugs behave exactly the same, the United States Food and Drug Administration believes that if the absorption properties of two drugs are similar, then the two drugs will produce similar effects and have similar safety profiles. Generally, the goal of an equivalence trial is to assess the equivalence of a generic drug with an existing drug. This is commonly accomplished by comparing a confidence interval about the difference between a pharmacokinetic measurement of two drugs with a confidence limit constructed from U.S. federal regulations. If the confidence interval is entirely within the confidence limit, the drugs are declared bioequivalent. An alternative approach to the assessment of bioequivalence is to use the method of interval hypothesis testing. pkequiv is used to conduct these tests of bioequivalence.

There are several pharmacokinetic measures that can be used to ascertain how available a drug is for cellular absorption. The most common measure is the area under the time-versus-concentration curve (AUC). Another common measure of drug availability is the maximum concentration (Cmax) achieved by the drug during the follow-up period. Stata reports these and other less common measures of drug availability, including the time at which the maximum drug concentration was observed and the duration of the period during which the subject was being measured. Stata also reports the elimination rate, that is, the rate at which the drug is metabolized, and the drug's half-life, that is, the time it takes for the drug concentration to fall to one-half of its maximum concentration.
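In its simplest form, the AUC mentioned above is just a trapezoidal sum over the time-concentration profile. The sketch below (Python, hypothetical function name; pkexamine itself computes more refined variants, including extrapolation to infinity) illustrates AUC from the first to the last observation, along with Cmax and the time at which it occurs.

```python
def pk_measures(times, conc):
    """Trapezoidal AUC over the observed follow-up period, plus the
    maximum concentration (Cmax) and the time it occurred (Tmax)."""
    auc = sum((t1 - t0) * (c0 + c1) / 2.0
              for t0, t1, c0, c1 in zip(times, times[1:], conc, conc[1:]))
    cmax = max(conc)
    tmax = times[conc.index(cmax)]
    return auc, cmax, tmax
```

For a triangular profile rising from 0 to 2 at t = 1 and back to 0 at t = 2, the trapezoidal AUC is 2.0, Cmax is 2, and Tmax is 1.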
pergram -- Periodogram

Also See
Complementary:  [R] tsset
Related:        [R] corrgram, [R] cumsp, [R] wntestb
Background:     Stata Graphics Manual
Methods and Formulas

Let x_(1) <= x_(2) <= ... <= x_(n) be the ordered values, let w_(j) be the weight of the jth ordered value, let W_(i) = sum_{j<=i} w_(j) be the running sum of weights, and let P = (p/100) W_(n). The pth percentile is then

    x[p] = { ( x_(i-1) + x_(i) ) / 2    if W_(i-1) = P
           {   x_(i)                    otherwise

When the option altdef is specified, the following alternative definition is used. In this case, weights are not allowed. Let i be the integer floor of (n+1)p/100; i.e., i is the largest integer i <= (n+1)p/100. Let h be the remainder h = (n+1)p/100 - i. The pth percentile is then

    x[p] = (1 - h) x_(i) + h x_(i+1)

where x_(0) is taken to be x_(1) and x_(n+1) is taken to be x_(n).
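The altdef interpolation can be written out in a few lines. This is an illustrative Python transcription of the formula just given (not the ado-code), with the end conditions x_(0) = x_(1) and x_(n+1) = x_(n) handled by clamping the indices.

```python
def percentile_altdef(values, p):
    """altdef rule: i = floor((n+1)p/100), h = (n+1)p/100 - i,
    x[p] = (1-h)*x(i) + h*x(i+1), clamped at the sample extremes."""
    x = sorted(values)
    n = len(x)
    a = (n + 1) * p / 100.0
    i = int(a)                      # integer floor of (n+1)p/100
    h = a - i                       # fractional remainder
    lo = x[min(max(i, 1), n) - 1]   # x(i), clamped into 1..n
    hi = x[min(i + 1, n) - 1]       # x(i+1), clamped at x(n)
    return (1 - h) * lo + h * hi
```

For the data 1, 2, 3, 4, the altdef median is (1 - .5)*2 + .5*3 = 2.5, since (n+1)p/100 = 2.5 gives i = 2 and h = .5.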
xtile produces the categories

    (-infinity, x[p_1]], (x[p_1], x[p_2]], ..., (x[p_{m-1}], +infinity)

numbered 1, 2, ..., m.
pctile -- Create variable containing percentiles

_pctile

_pctile is a programmer's command. It computes percentiles and stores them in r(); see [U] 21.8 Accessing results calculated by other programs. You can use _pctile to compute quantiles just as you can with pctile:

. _pctile weight, nq(10)
. return list
scalars:
             r(r1) =  2020
             r(r2) =  2160
             r(r3) =  2520
             r(r4) =  2730
             r(r5) =  3190
             r(r6) =  3310
             r(r7) =  3420
             r(r8) =  3700
             r(r9) =  4060

_pctile is, however, limited to computing 21 quantiles since there are only 20 r()s to hold the results. The percentiles() option (abbreviation p()) can be used to compute any odd percentile you wish:

. _pctile weight, p(10, 33.333, 45, 50, 55, 66.667, 90)
. return list
scalars:
             r(r1) =  2020
             r(r2) =  2640
             r(r3) =  2830
             r(r4) =  3190
             r(r5) =  3250
             r(r6) =  3400
             r(r7) =  4060

_pctile, pctile, and xtile each have an option that uses an alternative definition of percentiles, based on an interpolation scheme; see Methods and Formulas below.

. _pctile weight, p(10, 33.333, 45, 50, 55, 66.667, 90) altdef
. return list
scalars:
             r(r1) =  2005
             r(r2) =  2639.985
             r(r3) =  2830
             r(r4) =  3190
             r(r5) =  3252.5
             r(r6) =  3400.005
             r(r7) =  4060

The default formula inverts the empirical distribution function. We believe that the default formula is more commonly used, although some consider the "alternative" formula to be the standard definition. One drawback of the alternative formula is that it does not have an obvious generalization to noninteger weights.
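For comparison, here is a minimal Python rendering of the default rule for the unweighted case (hypothetical function name): invert the empirical distribution function, averaging adjacent order statistics where it is flat.

```python
import math

def percentile_default(values, p):
    """Unweighted default rule: with P = n*p/100, return the average
    (x(P) + x(P+1))/2 when P is an integer, else x(ceil(P))."""
    x = sorted(values)
    n = len(x)
    P = n * p / 100.0
    if P == int(P) and 1 <= P < n:
        i = int(P)
        return (x[i - 1] + x[i]) / 2.0
    return x[min(max(math.ceil(P), 1), n) - 1]
```

For 1, 2, 3, 4 the default median averages the two middle values to give 2.5; for 1, 2, 3 it returns the middle value 2. Note the contrast with the altdef interpolation shown in Methods and Formulas, which can return values between order statistics even when the distribution function is not flat.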
  6.  120      1          3
  7.  120      1          3
  8.  120      1          3
  9.  125      1          4
 10.  130      1          4
 11.  132      1          4
 12.   93      0          1
(output omitted)
110.  136      0          4
Technical Note
In the last example, if we wanted to categorize only cases, we could have issued the command

. xtile category = bp if case==1, cut(pct)

Most Stata commands follow the logic that using an if exp is equivalent to dropping the observations that do not satisfy the expression and running the command. This is not true of xtile when the cutpoints() option is used. (When the cutpoints() option is not used, the standard logic is true.) This is because xtile will use all nonmissing values of the cutpoints() variable, regardless of whether those values belong to observations that satisfy the if expression.

If you do not want to use all the values in the cutpoints() variable as cutpoints, simply set the ones that you do not need to missing. xtile does not care about the order of the values or whether they are separated by missing values.
Technical Note
Note that quantiles are not always unique. If we categorize our blood pressure data on the basis of quintiles rather than quartiles, we get

. pctile pct = bp, nq(5) genp(percent)
. xtile quint = bp, nq(5)
. list bp quint pct percent

       bp   quint   pct   percent
  1.   98       1   104        20
  2.  100       1   120        40
  3.  104       1   120        60
  4.  110       2   125        80
  5.  120       2
  6.  120       2
  7.  120       2
  8.  120       2
  9.  125       4
 10.  130       5
 11.  132       5

The 40th and 60th percentiles are the same; they are both 120. When two (or more) percentiles are the same, they are given the lower category number.
. xtile category = bp, cut(class)
. list bp class category

       bp   class   category
  1.   98     100          1
  2.  100     110          1
  3.  104     120          2
  4.  110     130          2
  5.  120       .          3
  6.  120       .          3
  7.  120       .          3
  8.  120       .          3
  9.  125       .          4
 10.  130       .          4
 11.  132       .          5

The cutpoints can, of course, come from anywhere. They can be the quantiles of another variable or the quantiles of a subgroup of the variable. For example, suppose we had a variable case that indicated whether an observation represented a case (case = 1) or a control (case = 0).

. list

       bp   case
  1.   98      1
  2.  100      1
  3.  104      1
  4.  110      1
  5.  120      1
  6.  120      1
  7.  120      1
  8.  120      1
  9.  125      1
 10.  130      1
 11.  132      1
 12.  116      0
 13.   93      0
 14.  115      0
(output omitted)
110.  113      0

We can categorize the cases based on the quantiles of the controls. To do this, we first generate a variable pct containing the percentiles of the controls' blood pressure data:

. pctile pct = bp if case==0, nq(4)
. list pct in 1/4

      pct
  1.  104
  2.  117
  3.  124
  4.    .

and then use these percentiles as cutpoints to classify bp for all subjects.

. xtile category = bp, cut(pct)
. gsort -case bp
. list bp case category

       bp   case   category
  1.   98      1          1
  2.  100      1          1
  3.  104      1          1
  4.  110      1          2
  5.  120      1          3
xtile can be used to create a variable quart that indicates the quartiles of bp.

. xtile quart = bp, nq(4)
. list bp quart

       bp   quart
  1.   98       1
  2.  100       1
  3.  104       1
  4.  110       2
  5.  120       2
  6.  120       2
  7.  120       2
  8.  120       2
  9.  125       3
 10.  130       4
 11.  132       4

The categories created are

    (-infinity, x[25]], (x[25], x[50]], (x[50], x[75]], (x[75], +infinity)

where x[25], x[50], and x[75] are, respectively, the 25th, 50th (median), and 75th percentiles of bp. We could use the pctile command to generate these percentiles:

. pctile pct = bp, nq(4) genp(percent)
. list bp quart percent pct

       bp   quart   percent   pct
  1.   98       1        25   104
  2.  100       1        50   120
  3.  104       1        75   125
  4.  110       2         .     .
  5.  120       2         .     .
  6.  120       2         .     .
  7.  120       2         .     .
  8.  120       2         .     .
  9.  125       3         .     .
 10.  130       4         .     .
 11.  132       4         .     .

xtile can categorize a variable based on any set of cutpoints, not just percentiles. Suppose that we wish to create the following categories for blood pressure:

    (-infinity, 100], (100, 110], (110, 120], (120, 130], (130, +infinity)

To do this, we simply create a variable containing the cutpoints

. input class
        class
  1. 100
  2. 110
  3. 120
  4. 130
  5. end

and then use xtile with the cutpoints() option.
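The cutpoint logic xtile applies is simple: an observation falls in category i when cut_{i-1} < value <= cut_i. A hypothetical Python sketch of that rule reproduces the category assignments of the class example shown in this entry.

```python
import bisect

def xtile_cut(values, cutpoints):
    """Category = 1 + number of cutpoints strictly below the value,
    i.e., the intervals (-inf, c1], (c1, c2], ..., (ck, +inf)."""
    cuts = sorted(cutpoints)
    return [bisect.bisect_left(cuts, v) + 1 for v in values]
```

bisect_left places a value equal to a cutpoint before it, which is exactly what makes each interval closed on the right: bp = 100 falls in category 1, while bp = 104 falls in category 2.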
Note that summarize, detail calculates standard percentiles.

. summarize mpg, detail

                        Mileage (mpg)
-------------------------------------------------------------
      Percentiles      Smallest
 1%           12             12
 5%           14             12
10%           14             14       Obs             74
25%           18             14       Sum of Wgt.     74

50%           20                      Mean       21.2973
                        Largest      Std. Dev.  5.785503
75%           25             34
90%           29             35       Variance   33.47205
95%           34             35       Skewness   .9487176
99%           41             41       Kurtosis   3.975005

But summarize, detail can only calculate these particular percentiles. The pctile, _pctile, and xtile commands let you compute any percentile.

Weights can be used with pctile, _pctile, and xtile:

. pctile pct = mpg [w=weight], nq(10) genp(percent)
(analytic weights assumed)
. list percent pct in 1/10

      percent   pct
  1.       10    14
  2.       20    16
  3.       30    17
  4.       40    18
  5.       50    19
  6.       60    20
  7.       70    22
  8.       80    24
  9.       90    28
 10.        .     .

The result is the same no matter which weight type you specify--aweight, fweight, or pweight.

xtile

xtile will create a categorical variable that contains categories corresponding to quantiles. We illustrate this with a simple example. Suppose we have a variable bp containing blood pressure measurements:

. list

       bp
  1.   98
  2.  100
  3.  104
  4.  110
  5.  120
  6.  120
  7.  120
  8.  120
  9.  125
 10.  130
 11.  132
cutpoints(varname) requests that xtile use the values of varname, rather than quantiles, as cutpoints for the categories. Note that all values of varname are used, regardless of any if or in restriction; see the technical note in the xtile section below.

percentiles(numlist) requests percentiles corresponding to the specified percentages. Percentiles are placed in r(r1), r(r2), ..., etc. For example, percentiles(10(20)90) requests that the 10th, 30th, 50th, 70th, and 90th percentiles be computed and placed into r(r1), r(r2), r(r3), r(r4), and r(r5). Up to 20 (inclusive) percentiles can be requested. See [U] 14.1.8 numlist for details on how to specify a numlist.
Remarks

pctile

pctile creates a new variable containing percentiles. You specify the number of quantiles that you want, and pctile computes the corresponding percentiles. Here we use Stata's auto dataset and compute the deciles of mpg:

. use auto
. pctile pct = mpg, nq(10)
. list pct in 1/10

      pct
  1.   14
  2.   17
  3.   18
  4.   19
  5.   20
  6.   22
  7.   24
  8.   25
  9.   29
 10.    .

The genp() option will generate another variable with the corresponding percentages, making it easier to distinguish between the percentiles:

. pctile pct = mpg, nq(10) genp(percent)
. list percent pct in 1/10

      percent   pct
  1.       10    14
  2.       20    17
  3.       30    18
  4.       40    19
  5.       50    20
  6.       60    22
  7.       70    24
  8.       80    25
  9.       90    29
 10.        .     .
Title
pctile -- Create variable containing percentiles

Syntax
    pctile [type] newvar = exp [weight] [if exp] [in range]
        [, nquantiles(#) genp(newvarp) altdef ]

    xtile newvar = exp [weight] [if exp] [in range]
        [, { nquantiles(#) | cutpoints(varname) } altdef ]

    _pctile varname [weight] [if exp] [in range]
        [, { nquantiles(#) | percentiles(numlist) } altdef ]

aweights, fweights, and pweights are allowed (see [U] 14.1.6 weight) except when the altdef option is specified, in which case no weights are allowed.

Description
pctile creates a new variable containing the percentiles of exp, where the expression exp is typically just another variable.

xtile creates a new variable that categorizes exp by its quantiles. If the cutpoints(varname) option is specified, it categorizes exp using the values of varname as category cutpoints. For example, varname might contain percentiles, generated by pctile, of another variable.

_pctile is a programmer's command. It computes up to 20 percentiles and places the results in r(); see [U] 21.8 Accessing results calculated by other programs.

Note that summarize, detail will compute some percentiles (1, 5, 10, 25, 50, 75, 90, 95, and 99th); see [R] summarize.

Options
nquantiles(#) specifies the number of quantiles. The command computes percentiles corresponding to percentages 100k/m for k = 1, 2, ..., m-1, where m = #. For example, nquantiles(10) requests that the 10th, 20th, ..., 90th percentiles be computed. The default is nquantiles(2); i.e., the median is computed.

genp(newvarp) specifies a new variable to be generated containing the percentages corresponding to the percentiles.

altdef uses an alternative formula for calculating percentiles. The default method is to invert the empirical distribution function using averages ((x_i + x_{i+1})/2) where the function is flat (the default is the same method used by summarize; see [R] summarize). The alternative formula uses an interpolation method. See Methods and Formulas at the end of this entry. Weights cannot be used when altdef is specified.
pcorr -- Partial correlation coefficients

Technical Note
Some caution is in order when interpreting the above results. As we said at the outset, the partial correlation coefficient is an attempt to estimate the correlation that would be observed if the other variables were held constant. The emphasis is on attempt. pcorr makes it too easy to ignore the fact that you are fitting a model. In the above example, the model is

    price = b0 + b1*mpg + b2*weight + b3*foreign + e

which is, in all honesty, a rather silly model. Even if we accept the implied economic assumptions of the model--that consumers value mpg, weight, and foreign--do we really believe that consumers place equal value on every extra 1,000 pounds of weight? That is, have we correctly parameterized the model? If we have not, then the estimated partial correlation coefficients may not represent what they claim to represent. Partial correlation coefficients are a reasonable way to summarize data after one is convinced that the underlying model is reasonable. One should not, however, pretend that there is no underlying model and that the partial correlation coefficients are unaffected by the assumptions and parameterization.
Methods and Formulas

pcorr is implemented as an ado-file.

Results are obtained by estimating a linear regression of varname1 on varlist; see [R] regress. The partial correlation coefficient between varname1 and each variable in varlist is then defined as

    t / sqrt(t^2 + n - k)

(Theil 1971, 174), where t is the t statistic, n the number of observations, and k the number of independent variables, including the constant but excluding any dropped variables. The significance is given by 2*ttail(n-k, abs(t)).
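The formula above can be verified by hand (a hypothetical sketch; the regression and stored results are standard Stata, but the particular variables are an illustrative choice):

```stata
* Hedged sketch: reproduce pcorr's partial correlation of price
* with mpg, holding weight and foreign constant.
use auto, clear
regress price mpg weight foreign

* t statistic for mpg, with n and k taken from the stored results
local t = _b[mpg]/_se[mpg]
local n = e(N)
local k = e(df_m) + 1      // independent variables including the constant

* Partial correlation and its significance, per the formula above
display `t'/sqrt(`t'^2 + `n' - `k')
display 2*ttail(`n' - `k', abs(`t'))
```

The two displayed values should match the Corr. and Sig. columns that pcorr reports for mpg.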
References

Theil, H. 1971. Principles of Econometrics. New York: John Wiley & Sons.
Also See

Related: [R] correlate, [R] spearman
pcorr -- Partial correlation coefficients

Syntax

    pcorr varname1 varlist [weight] [if exp] [in range]

by ...: may be used with pcorr; see [R] by.
aweights and fweights are allowed; see [U] 14.1.6 weight.
Description

pcorr displays the partial correlation coefficient of varname1 with each variable in varlist, holding the other variables in varlist constant.

Remarks

Assume that y is determined by x1, x2, ..., xk. The partial correlation between y and x1 is an attempt to estimate the correlation that would be observed between y and x1 if the other x's did not vary.
> Example

Using our automobile dataset (described in [U] 9 Stata's on-line tutorials and sample datasets), the simple correlations between price, mpg, weight, and foreign can be obtained from correlate (see [R] correlate):

    . corr price mpg weight foreign
    (obs=74)

                 |    price      mpg   weight  foreign
    -------------+------------------------------------
           price |   1.0000
             mpg |  -0.4686   1.0000
          weight |   0.5386  -0.8072   1.0000
         foreign |   0.0487   0.3934  -0.5928   1.0000
Although correlate gave us the full correlation matrix, our interest is in just the first column. We find, for instance, that the higher the mpg, the lower the price. We obtain the partial correlation coefficients using pcorr:

    . pcorr price mpg weight foreign
    (obs=74)
    Partial correlation of price with

        Variable |    Corr.     Sig.
    -------------+------------------
             mpg |   0.0352    0.769
          weight |   0.5488    0.000
         foreign |   0.5402    0.000

We now find that, holding weight and foreign constant, the partial correlation of price with mpg is virtually zero. Similarly, in the simple correlations we found that price and foreign were virtually uncorrelated. In the partial correlations, holding mpg and weight constant, we find that price and foreign are positively correlated.
outsheet -- Write spreadsheet-style dataset
outsheet copies the data currently loaded in memory into the specified file. About all that can go wrong is that the file you specify already exists:

    . outsheet using tosend
    file tosend.out already exists
    r(602);

In that case, you can erase the file (see [R] erase), specify outsheet's replace option, or use a different filename. When all goes well, outsheet is silent:

    . outsheet using tosend, replace

If you are copying the data to a program other than a spreadsheet, remember to specify the nonames option:

    . outsheet using tosend, nonames replace
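A related sketch (hypothetical, not from the original entry): outsheet's comma option writes a comma-separated file, which most spreadsheet and statistics programs read directly. Assuming the auto dataset is loaded, and using auto.csv as an illustrative filename:

```stata
* Hedged sketch: write a comma-separated file with variable names
* in the first line (the default), then inspect it.
outsheet make mpg price using auto.csv, comma replace
type auto.csv
```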
Also See

Complementary: [R] insheet

Related: [R] outfile

> Example

Stata never writes over an existing file unless explicitly told to do so. For instance, if the file employee.raw already exists and you attempt to overwrite it by typing outfile using employee, here is what would happen:

    . outfile using employee
    file employee.raw already exists
    r(602);

You can tell Stata that it is okay to overwrite a file by specifying the replace option:

    . outfile using employee, replace
> Example

You have entered into Stata some data on seven employees in your firm. The data contain employee name, employee identification number, salary, and sex:

    . list

                name   empno   salary      sex
     1.   Carl Marks   57213   24,000     male
     2.  Irene Adler   47229   27,000   female
     3.   Adam Smith   57323   24,000     male
     4. David Wallis   57401   24,500     male
     5.  Mary Rogers   57802   27,000   female
     6. Carolyn Frank  57805   24,000   female
     7. Robert Lawson  57824   22,500     male

If you now wish to use a program other than Stata with these data, you must somehow get the data over to that other program. The standard Stata-format dataset created by save will not do the job--it is written in a special format that only Stata understands. Most programs, however, understand ASCII datasets--standard text datasets that are like those produced by a text editor. You can tell Stata to produce such a dataset using outfile. Typing outfile using employee creates a dataset called employee.raw that contains all the data. We can use the Stata type command to review the resulting file:

    . outfile using employee
    . type employee.raw
      "Carl Marks"      57213   24000     "male"
      "Irene Adler"     47229   27000   "female"
      "Adam Smith"      57323   24000     "male"
      "David Wallis"    57401   24500     "male"
      "Mary Rogers"     57802   27000   "female"
      "Carolyn Frank"   57805   24000   "female"
      "Robert Lawson"   57824   22500     "male"

We see that the file contains the four variables and that Stata has surrounded the string variables with double quotes.
Technical Note

outfile is careful to columnize the data in case you want to read it using formatted input. In the example above, the first string has a %-16s display format. Stata wrote two leading blanks and then placed the string in a 16-character field. outfile always right-justifies string variables, even when the display format requests left-justification.

The first number has a %9.0g format. The number is written as two blanks followed by the number, right-justified in a 9-character field. The second number has a %9.0gc format; outfile ignores the comma part of the format and also writes this number as two blanks followed by the number, right-justified in a 9-character field.

The last entry is really a numeric variable, but it has an associated value label. Its format is %-9.0g, so Stata wrote two blanks and then right-justified the value label in a 9-character field. Again, outfile right-justifies value labels even when the display format specifies left-justification.
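Because outfile columnizes the data, the file can be read back with free-format input. A hypothetical sketch (the infile variable list and string widths are illustrative assumptions that must match what outfile wrote):

```stata
* Hedged sketch: round-trip the employee data through an ASCII file.
* Assumes the employee data shown above are in memory.
outfile using employee, replace
infile str16 name empno salary str8 sex using employee.raw, clear
list
```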
outfile -- Write ASCII-format dataset

Syntax

    outfile [varlist] using filename [if exp] [in range] [, comma dictionary
        nolabel noquote replace wide ]

Description

outfile writes data to a disk file in ASCII format, a format that can be read by other programs. The new file is not in Stata format; see [R] save for instructions on saving data for subsequent use in Stata. The data saved by outfile can be read back by infile; see [R] infile.

If filename is specified without an extension, '.raw' is assumed unless the dictionary option is specified, in which case '.dct' is assumed.
Options

comma causes Stata to write the file in comma-separated-value format. In this format, values are separated by commas rather than blanks. Missing values are written as two consecutive commas.

dictionary writes the file in Stata's data dictionary format. See [R] infile (fixed format) for a description of dictionaries. Neither comma nor wide may be specified with dictionary.

nolabel causes Stata to write the numeric values of labeled variables. The default is to write the labels enclosed in double quotes.

noquote prevents Stata from placing double quotes around the contents of string variables.

replace permits outfile to overwrite an existing dataset. replace may not be abbreviated.

wide causes Stata to write the data with one observation per line. The default is to split observations into lines of 80 characters or fewer.
Remarks

outfile enables data to be sent to a disk file for processing by a non-Stata program. Each observation is written as one or more records that will not exceed 80 characters unless you specify the wide option. The values of the variables are written using their current display formats, and unless the comma option is specified, each is prefixed with two blanks. If you specify the dictionary option, the data are written in the same way, but in front of the data outfile writes a data dictionary describing the contents of the file.
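A hypothetical sketch of the dictionary option (filenames illustrative): the .dct file it produces describes the data and can be fed straight back to infile.

```stata
* Hedged sketch: write the data plus a data dictionary, then
* read the file back using the dictionary itself.
outfile using employee, dictionary replace
type employee.dct              // dictionary followed by the data
infile using employee, clear   // infile reads employee.dct
```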
orthpoly uses the Christoffel-Darboux recurrence formula (Abramowitz and Stegun 1968).

Both orthog and orthpoly normalize the orthogonal variables such that

    Q'WQ = MI

where W = diag(w1, w2, ..., wN) with w1, w2, ..., wN the weights (all 1 if weights are not specified), and M is the sum of the weights (the number of observations if weights are not specified).
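This normalization can be checked numerically (a hypothetical sketch; the variable and matrix names are illustrative). With no weights, Q'WQ reduces to Q'Q, and matrix accum, which appends a column of ones, computes exactly that cross product:

```stata
* Hedged sketch: with 74 unweighted observations, Q'Q should be
* (approximately) 74 times the identity matrix.
use auto, clear
orthog length weight, generate(q1 q2)
matrix accum QQ = q1 q2    // includes the constant column
matrix list QQ
```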
References

Abramowitz, M. and I. A. Stegun, ed. 1968. Handbook of Mathematical Functions, 7th printing. Washington, DC: National Bureau of Standards.

Golub, G. H. and C. F. Van Loan. 1996. Matrix Computations, 3d ed. Baltimore: Johns Hopkins University Press, pp. 218-219.

Sribney, W. M. 1995. sg37: Orthogonal polynomials. Stata Technical Bulletin 25: 17-18. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 96-98.

Also See

Related: [R] regress

Background: [U] 23 Estimation and post-estimation commands
Some of the correlations among the powers of weight are very large, but this does not create any problems for regress. However, we may wish to look at the quadratic trend with the constant removed, the cubic trend with the quadratic and constant removed, etc. orthpoly will generate polynomial terms with this property:

    . orthpoly weight, generate(pw*) deg(4) poly(P)
    . regress mpg pw1-pw4

          Source |       SS       df       MS              Number of obs =      74
    -------------+------------------------------           F(  4,    69) =   36.06
           Model |  1652.73666     4  413.184164           Prob > F      =  0.0000
        Residual |  790.722803    69  11.4597508           R-squared     =  0.6764
    -------------+------------------------------           Adj R-squared =  0.6576
           Total |  2443.45946    73  33.4720474           Root MSE      =  3.3852

    ------------------------------------------------------------------------------
             mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             pw1 |  -4.638252   .3935245   -11.79   0.000    -5.423312   -3.853192
             pw2 |   .8263545   .3935245     2.10   0.039     .0412947    1.611414
             pw3 |  -.3068616   .3935245    -0.78   0.438    -1.091921    .4781982
             pw4 |   -.209457   .3935245    -0.53   0.596    -.9945168    .5756028
           _cons |    21.2973   .3935245    54.12   0.000     20.51224    22.08236
    ------------------------------------------------------------------------------

Compare the p-values of the terms in the natural-polynomial regression with those in the orthogonal-polynomial regression. With orthogonal polynomials, it is easy to see that the pure cubic and quartic trends are nonsignificant and that the constant, linear, and quadratic terms each have p < 0.05.

The matrix P obtained with the poly() option can be used to transform coefficients for orthogonal polynomials to coefficients for natural polynomials:

    . orthpoly weight, poly(P) deg(4)
    . matrix b = e(b)*P
    . matrix list b

    b[1,5]
              deg1        deg2        deg3        deg4       _cons
    y1   .02893016  -.00002291   5.745e-09  -4.862e-13   23.944212
> Example

Again consider the auto.dta dataset. Suppose we wish to fit the model

    mpg = b0 + b1 weight + b2 weight^2 + b3 weight^3 + b4 weight^4 + e

We will first compute the regression with natural polynomials:

    . gen double w1 = weight
    . gen double w2 = w1*w1
    . gen double w3 = w2*w1
    . gen double w4 = w3*w1
    . correlate w1-w4
    (obs=74)

                 |       w1       w2       w3       w4
    -------------+------------------------------------
              w1 |   1.0000
              w2 |   0.9915   1.0000
              w3 |   0.9665   0.9916   1.0000
              w4 |   0.9279   0.9679   0.9922   1.0000

    . regress mpg w1-w4

          Source |       SS       df       MS              Number of obs =      74
    -------------+------------------------------           F(  4,    69) =   36.06
           Model |  1652.73666     4  413.184164           Prob > F      =  0.0000
        Residual |  790.722803    69  11.4597508           R-squared     =  0.6764
    -------------+------------------------------           Adj R-squared =  0.6576
           Total |  2443.45946    73  33.4720474           Root MSE      =  3.3852

    ------------------------------------------------------------------------------
             mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              w1 |   .0289302   .1161939     0.25   0.804    -.2028704    .2607307
              w2 |  -.0000229   .0000566    -0.40   0.687    -.0001359    .0000901
              w3 |   5.74e-09   1.19e-08     0.48   0.631    -1.80e-08    2.95e-08
              w4 |  -4.86e-13   9.14e-13    -0.53   0.596    -2.31e-12    1.34e-12
           _cons |   23.94421   86.60667     0.28   0.783    -148.8314    196.7198
    ------------------------------------------------------------------------------
    . regress price length weight headroom trunk

          Source |       SS       df       MS              Number of obs =      74
    -------------+------------------------------           F(  4,    69) =   10.20
           Model |   236016580     4  59004145.0           Prob > F      =  0.0000
        Residual |   399048816    69  5783316.17           R-squared     =  0.3716
    -------------+------------------------------           Adj R-squared =  0.3352
           Total |   635065396    73  8699525.97           Root MSE      =  2404.9

    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          length |  -101.7092   42.12534    -2.41   0.018     -185.747   -17.67148
          weight |   4.753066   1.120054     4.24   0.000     2.518619    6.987512
        headroom |  -711.5679   445.0204    -1.60   0.114    -1599.359    176.2236
           trunk |   114.0859   109.9488     1.04   0.303    -105.2559    333.4277
           _cons |   11488.47   4543.902     2.53   0.014     2423.638    20553.31
    ------------------------------------------------------------------------------
However, we may believe a priori that length is the most important predictor, followed by weight, followed by headroom, followed by trunk. Hence, we would like to remove the "effect" of length from all the other predictors; remove weight from headroom and trunk; and remove headroom from trunk. We can do this by running orthog, and then we estimate the model again using the orthogonal variables:

    . orthog length weight headroom trunk, gen(olength oweight oheadroom otrunk) matrix(R)
    . regress price olength oweight oheadroom otrunk

          Source |       SS       df       MS              Number of obs =      74
    -------------+------------------------------           F(  4,    69) =   10.20
           Model |   236016580     4  59004145.0           Prob > F      =  0.0000
        Residual |   399048816    69  5783316.17           R-squared     =  0.3716
    -------------+------------------------------           Adj R-squared =  0.3352
           Total |   635065396    73  8699525.97           Root MSE      =  2404.9

    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         olength |   1265.049   279.5584     4.53   0.000     707.3454    1822.753
         oweight |   1175.765   279.5584     4.21   0.000     618.0617    1733.469
       oheadroom |  -349.9916   279.5584    -1.25   0.215    -907.6955    207.7122
          otrunk |   290.0776   279.5584     1.04   0.303    -267.6262    847.7815
           _cons |   6165.257   279.5584    22.05   0.000     5607.553    6722.961
    ------------------------------------------------------------------------------

Using the matrix R, we can transform the results obtained using the orthogonal predictors back to the metric of the original predictors:

    . matrix b = e(b)*inv(R)'
    . matrix list b

    b[1,5]
             length      weight    headroom       trunk       _cons
    y1   -101.70924   4.7530659  -711.56789   114.08591   11488.475
Technical Note

The matrix R obtained using the matrix() option with orthog can also be used to recover X (the original varlist) from Q (the orthogonalized newvarlist), one variable at a time. Continuing with the previous example, we illustrate how to recover the trunk variable:

    . matrix C = R[1..., "trunk"]'
    . matrix score double rtrunk = C
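To confirm the recovery (a hypothetical follow-on sketch, not part of the original entry), the reconstructed variable can be compared with the original; up to rounding error the two should agree:

```stata
* Hedged sketch: rtrunk was built from the orthogonal variables
* and R, so it should match trunk to machine precision.
* Assumes rtrunk was created as in the Technical Note above.
gen double diff = abs(rtrunk - trunk)
summarize diff
```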
Note that the coefficients corresponding to the constant term are placed in the last column of the matrix. The last row of the matrix is all zero except for the last column, which corresponds to the constant term.

Remarks

Orthogonal variables are useful for two reasons. The first is numerical accuracy for highly collinear variables. Stata's regress and other estimation commands can face a large amount of collinearity and still produce accurate results. But, at some point, these commands will drop variables due to collinearity. If you know with certainty that the variables are not perfectly collinear, you may want to retain all of their effects in the model. By using orthog or orthpoly to produce a set of orthogonal variables, all variables will be present in the estimation results.

Users are more likely to find orthogonal variables useful for the second reason: ease of interpreting results. orthog and orthpoly create a set of variables such that the "effect" of all the preceding variables has been removed from each variable. For example, if one issues the command

    . orthog x1 x2 x3, generate(q1 q2 q3)

the constant is removed from x1 to produce q1; the constant and x1 are removed from x2 to produce q2; and finally the constant, x1, and x2 are removed from x3 to produce q3. Hence,

    q1 = r01 + r11 x1
    q2 = r02 + r12 x1 + r22 x2
    q3 = r03 + r13 x1 + r23 x2 + r33 x3

This can be generalized and written in matrix notation as

    X = QR

where X is the N x (d + 1) matrix representation of varlist plus a column of ones, and Q is the N x (d + 1) matrix representation of newvarlist plus a column of ones (d = number of variables in varlist and N = number of observations). The (d + 1) x (d + 1) matrix R is a permuted upper triangular matrix; i.e., R would be upper triangular if the constant were first, but the constant is last, so the first row/column has been permuted with the last row/column. Since Stata's estimation commands list the constant term last, this allows R, obtained via the matrix() option, to be used to transform estimation results.
> Example

Consider Stata's auto.dta dataset. Suppose we postulate a model in which price depends on the car's length, weight, headroom, and trunk size (trunk). These predictors are collinear, but not extremely so--the correlations are not that close to 1:

    . correlate length weight headroom trunk
    (obs=74)

                 |   length   weight headroom    trunk
    -------------+------------------------------------
          length |   1.0000
          weight |   0.9460   1.0000
        headroom |   0.5163   0.4835   1.0000
           trunk |   0.7266   0.6722   0.6620   1.0000

regress certainly has no trouble estimating this model:
"itle I °rth°g
-- Orth°g°nal ,
variables and °rth°g°nal
p°ly , n°mials
]
Syntax orthog
[varlis,]
tweightl
[matrix(matname)
orthpoly
varname
[if
expl
[in
range],
g_enerate(newvarlist)
]
[weight]
{ generate(newvarlist)
Iif
exp]
[in range],
[p_oly(matname)
} [ degree(#)
]
orthpoly requires that either generate() or poly(), or both. be specified, iweights, fweights, pweights, and aweights are allowed, see [U] 14.1.6 weight.
Description orthog orthogonalizes a set of variables, creating a new set of orthogonal variables (all of type double), using a modified Gram-Schmidt procedure (Golub and Van Loan 1996). Note that the order of the variables determines the orthogonalization: hence, the "most important" variables should be listed first. orthpoly computes orthogonal polynomials
for a single variable
Options generate(newvarlist) is not optional; it creates new orthogonat variables of type double. For orthog, newvarlist will contain the orthogonalized varlist. If varlist contains d variables, then so will newvarlist. For orthpoly, newvarlist will contain orthogonal polynomials of degree 1, 2, .... d evaluated at varname, where d is as specified by degree (d). newvarlist can be specified by giving a list of exactly d new variable names, or it can be abbreviated using the styles newvar 1newvard or newvar,. For these two styles of abbreviation, new variables newvar 1, newvar2, .... newvar d are generated. matrix(mamame) (orthog by X = QR, where X is and Q is the N × (d + 1) of variables in varlist and
only) creates a (d+ 1) × (d + l) matrix containing the matrix R defined the N × (d+ 1) matrix representation of vartist plus a column of ones, matrix representation of newvarlist plus a column of ones (d = number N := number of observations).
degree(#) (orthpoly only) specifies the highest degree polynomial to include. Orthogonal nomials of degree 1, 2.... , d - # are computed. The default is d = 1.
poly-
poly(mamame) (orthpoly only) creates a (d + 1) × (d 4- 1) matrix called matname containing the coefficients of the orthogonal polynomials. The orthogonal polynomial of degree i < d is matname[ i, d + I ] + matname[ i, 1 ] *varname + matname[ + " • + matname [ i, i ]*varname" 477
i, 2 ] *varname 2
I_
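The poly() coefficients can be applied by hand (a hypothetical sketch; the names P and op2 are illustrative choices). For degree 2, the formula above reduces to three terms:

```stata
* Hedged sketch: evaluate the degree-2 orthogonal polynomial of
* weight directly from the coefficient matrix P, and compare it
* with the variable orthpoly generates.
use auto, clear
orthpoly weight, generate(pw1 pw2) poly(P) degree(2)
gen double op2 = P[2,3] + P[2,1]*weight + P[2,2]*weight^2
summarize pw2 op2
```

The two variables should have identical means and standard deviations, since op2 merely re-evaluates what orthpoly computed.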
Technical Note

If your data contain variables named year1, year2, ..., year19, year20, aorder will order them correctly, even though, to most computer programs, year10 is alphabetically between year1 and year2.
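A minimal sketch of this behavior (hypothetical; not from the original entry):

```stata
* Hedged sketch: aorder places year2 before year10 despite
* plain alphabetical order.
clear
set obs 1
gen year1  = 1
gen year10 = 1
gen year2  = 1
aorder
describe    // variables now ordered year1, year2, year10
```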
Methods and Formulas

aorder is implemented as an ado-file.

References

Gleason, J. R. 1997. dm51: Defining and recording variable orderings. Stata Technical Bulletin 40: 10-12. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 49-52.

Weesie, J. 1999. Changing the order of variables in a dataset. Stata Technical Bulletin 52: 8-9. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 61-62.

Also See

Complementary: [R] describe

Related: [R] edit, [R] rename
Contains
data
from
obs: I
74
1978
6
vars: size:
7
2,368
(99.6%
storage I
order -- Reorder variables in dataset
auto.dta
variable
name
of memory
Automobile
Jul
2000
475
Data
13:51
free)
display
value
type
format
label
variable
label
!
i ;1
i
make
strl8
%-18s
Make
mpg
int
%8,0g
Mileage
price
int
%8.0gc
Price
weight
int
%8.0gc
Weight
length
int
%8.0g
Length
(in.)
rep78
int
%8.0g
Repair
Record
Sorted
and
Model (mpg)
(Ibs.) 1978
by:
Note:
dataset
has
changed
since
last
saved
[
If we now wanted length to be the last variable in our dataset, we could type order make mpg price weight rep78 length, but it would be easier to use move:

    . move length rep78
    . describe
Contains data from auto.dta
  obs:            74                          1978 Automobile Data
 vars:             6                          7 Jul 2000 13:51
 size:         2,368 (99.6% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
mpg             int    %8.0g                  Mileage (mpg)
price           int    %8.0gc                 Price
weight          int    %8.0gc                 Weight (lbs.)
rep78           int    %8.0g                  Repair Record 1978
length          int    %8.0g                  Length (in.)
-------------------------------------------------------------------------------
Sorted by:
Note:  dataset has changed since last saved
We now change our mind and decide that we would prefer that the variables be alphabetized:

    . aorder
    . describe

Contains data from auto.dta
  obs:            74                          1978 Automobile Data
 vars:             6                          7 Jul 2000 13:51
 size:         2,368 (99.4% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
length          int    %8.0g                  Length (in.)
make            str18  %-18s                  Make and Model
mpg             int    %8.0g                  Mileage (mpg)
price           int    %8.0gc                 Price
rep78           int    %8.0g                  Repair Record 1978
weight          int    %8.0gc                 Weight (lbs.)
-------------------------------------------------------------------------------
Sorted by:
Note:  dataset has changed since last saved
Title

order -- Reorder variables in dataset

Syntax

    order varlist

    move varname1 varname2

    aorder [varlist]

Description

order changes the order of the variables in the current dataset. The variables specified in varlist are moved, in order, to the front of the dataset.

move also reorders variables. move relocates varname1 to the position of varname2 and shifts the remaining variables, including varname2, to make room.

aorder alphabetizes the variables specified in varlist and moves them to the front of the dataset. If no varlist is specified, _all is assumed.

Remarks

> Example

When using order, you must specify a varlist, but it is not necessary to specify all the variables in the dataset. For example,

    . describe

Contains data from auto.dta
  obs:            74                          1978 Automobile Data
 vars:             6                          7 Jul 2000 13:51
 size:         2,368 (99.6% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
price           int    %8.0gc                 Price
weight          int    %8.0gc                 Weight (lbs.)
mpg             int    %8.0g                  Mileage (mpg)
make            str18  %-18s                  Make and Model
length          int    %8.0g                  Length (in.)
rep78           int    %8.0g                  Repair Record 1978
-------------------------------------------------------------------------------
Sorted by:
Note:  dataset has changed since last saved

    . order make mpg
    . describe
Saved Results

oprobit saves in e():

Scalars
    e(N)          number of observations
    e(k_cat)      number of categories
    e(df_m)       model degrees of freedom
    e(r2_p)       pseudo R-squared
    e(ll)         log likelihood
    e(ll_0)       log likelihood, constant-only model
    e(chi2)       chi-squared
    e(N_clust)    number of clusters

Macros
    e(cmd)        oprobit
    e(depvar)     name of dependent variable
    e(wtype)      weight type
    e(wexp)       weight expression
    e(clustvar)   name of cluster variable
    e(vcetype)    covariance estimation method
    e(chi2type)   Wald or LR; type of model chi-squared test
    e(offset)     offset
    e(predict)    program used to implement predict

Matrices
    e(b)          coefficient vector
    e(cat)        category values
    e(V)          variance-covariance matrix of the estimators

Functions
    e(sample)     marks estimation sample
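These saved results can be pulled into expressions directly after estimation (a hypothetical sketch; the model is the one used in this entry's examples):

```stata
* Hedged sketch: retrieve a few of oprobit's saved results.
oprobit rep77 foreign length mpg
display "N = " e(N) ", ll = " e(ll) ", pseudo-R2 = " e(r2_p)
matrix list e(cat)    // the category values of rep77
```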
Methods and Formulas

Please see the Methods and Formulas section of [R] ologit.

References

Aitchison, J. and S. D. Silvey. 1957. The generalization of probit analysis to the case of multiple responses. Biometrika 44: 131-140.

Goldstein, R. 1997. sg59: Index of ordinal variation and Neyman-Barton GOF. Stata Technical Bulletin 33: 10-12. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 145-147.

Greene, W. H. 2000. Econometric Analysis. 4th ed. Upper Saddle River, NJ: Prentice-Hall.

Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.

Wolfe, R. 1998. sg86: Continuation-ratio models for ordinal response data. Stata Technical Bulletin 44: 18-21. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 149-153.

Wolfe, R. and W. W. Gould. 1998. sg76: An approximate likelihood-ratio test for ordinal response models. Stata Technical Bulletin 42: 24-27. Reprinted in Stata Technical Bulletin Reprints, vol. 7, pp. 199-204.

Also See

Complementary: [R] adjust, [R] lincom, [R] linktest, [R] lrtest, [R] mfx, [R] predict, [R] sw, [R] test, [R] testnl, [R] vce, [R] xi

Related: [R] logistic, [R] mlogit, [R] ologit, [R] probit, [R] svy estimators

Background: [U] 16.5 Accessing coefficients and standard errors,
            [U] 23 Estimation and post-estimation commands,
            [U] 23.11 Obtaining robust variance estimates,
            [U] 23.12 Obtaining scores,
            [R] maximize
Hypothesis tests and predictions

See [U] 23 Estimation and post-estimation commands for instructions on obtaining the variance-covariance matrix of the estimators, predicted values, and hypothesis tests. Also see [R] lrtest for performing likelihood-ratio tests.
> Example

In the above example, we estimated the model oprobit rep77 foreign length mpg. The predict command can be used to obtain the predicted probabilities. You type predict followed by the names of the new variables to hold the predicted probabilities, ordering the names from low to high. In our data, the lowest outcome is poor and the highest excellent. We have five categories and so must type five names following predict; the choice of names is up to us:

    . predict poor fair avg good exc
    (option p assumed; predicted probabilities)
    . list make model exc good if rep77==.

              make      model        exc       good
    13.        AMC     Spirit   .0006044   .0351813
    41.       Ford     Fiesta   .0002927   .0222789
    43.      Buick       Opel   .0043803   .1133763
    44.      Merc.    Monarch   .0093209   .1700846
    53.    Peugeot        604   .0734199   .4202766
    56.      Plym.    Horizon    .001413   .0590294
    57.      Plym.    Sapporo   .0197543   .2466034
    63.      Pont.    Phoenix   .0234156    .266771
For ordered probit, predict, xb produces Sj = x1j*b1 + x2j*b2 + ... + xkj*bk. Ordered probit is identical to ordered logit, except that one uses a different distribution function for calculating probabilities. The ordered-probit predictions are then the probability that Sj + uj lies between a pair of cut points kappa_(i-1) and kappa_i. The formulas in the case of ordered probit are

    Pr(Sj + u < kappa) = Phi(kappa - Sj)
    Pr(Sj + u > kappa) = 1 - Phi(kappa - Sj) = Phi(Sj - kappa)

Rather than using predict directly, we could calculate the predicted probabilities by hand:

    . predict pscore, xb
    . gen probexc  = norm(pscore-_b[_cut4])
    . gen probgood = norm(_b[_cut4]-pscore) - norm(_b[_cut3]-pscore)
Remarks

An ordered probit model is used to estimate relationships between an ordinal dependent variable and a set of independent variables. An ordinal variable is a variable that is categorical and ordered, for instance, "poor", "good", and "excellent", which might be the answer to one's current health status or the repair record of one's car. If there are only two outcomes, see [R] logistic, [R] logit, and [R] probit. This entry is concerned only with more than two outcomes. If the outcomes cannot be ordered (e.g., residency in the north, east, south, and west), see [R] mlogit. This entry is concerned only with models in which the outcomes can be ordered.

In ordered probit, an underlying score is estimated as a linear function of the independent variables and a set of cut points. The probability of observing outcome i corresponds to the probability that the estimated linear function, plus random error, is within the range of the cut points estimated for the outcome:

    Pr(outcome_j = i) = Pr(kappa_(i-1) < b1*x1j + b2*x2j + ... + bk*xkj + uj <= kappa_i)

uj is assumed to be normally distributed. In either case, one estimates the coefficients b1, b2, ..., bk together with the cut points kappa_1, kappa_2, ..., kappa_(I-1), where I is the number of possible outcomes. kappa_0 is taken as -infinity and kappa_I is taken as +infinity. All of this is a direct generalization of the ordinary two-outcome probit model.
> Example

In [R] ologit, we use a variation of the automobile dataset (see [U] 9 Stata's on-line tutorials and sample datasets) to analyze the 1977 repair records of 66 foreign and domestic cars. We use ordered logit to explore the relationship of rep77 in terms of foreign (origin of manufacture), length (a proxy for size), and mpg. Here we estimate the same model using ordered probit rather than ordered logit:

    . oprobit rep77 foreign length mpg
    Iteration 0:   log likelihood = -89.895098
    Iteration 1:   log likelihood = -78.141221
    Iteration 2:   log likelihood = -78.020314
    Iteration 3:   log likelihood = -78.020025

    Ordered probit estimates                        Number of obs   =         66
                                                    LR chi2(3)      =      23.75
                                                    Prob > chi2     =     0.0000
    Log likelihood = -78.020025                     Pseudo R2       =     0.1321

    ------------------------------------------------------------------------------
           rep77 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         foreign |   1.704861   .4246786     4.01   0.000     .8725057    2.537215
          length |   .0468675    .012648     3.71   0.000      .022078    .0716571
             mpg |   .1304559   .0378627     3.45   0.001     .0562464    .2046654
    -------------+----------------------------------------------------------------
           _cut1 |    10.1589   3.076749          (Ancillary parameters)
           _cut2 |   11.21003   3.107522
           _cut3 |   12.54561   3.155228
           _cut4 |   13.98059   3.218786
    ------------------------------------------------------------------------------

We find that foreign cars have better repair records, as do larger cars and cars with better mileage ratings.
clus_er(varnamt
specifies that the observations are independent
across groups (clusters) but
;
n__ necessarily vithin groups, varname:specifies to which group each observation belongs; e.g,, catuster(pers mid) in data with repeated observations on individuals, cluster() affects •the estimated stand_trd errors and variance-covariance matrix of the estlmators (VCE), but nol the es_mated coeffi :ients; see [t2] 23,11 Obtaining robust variance estimates, cluster() can be us#d with pwe: ghts to produce estinmtes for unstratified cluster-sampled data. but see the sWoprobit colnmand in [R] svy estimators for a command designed especially for survey data.
i
cl_aster()
imp ies robust;
specifying robust
cluster()
is equivalent to typing cluster()
by iitself, }
scor_(newvarlist) creates k new variables, where k is the number of observed outcomes. The firs_ variable cot tains OlnLj/O(xjb); the second variable contains OlnLj/O(_cutlj); the third conhins OlnLj/, _(_cut2j); and so on. Note that if you were to specify the option score(sc*), Sta!a would creale the appropriate number of new variables and they would be named seO. scl, level #) specifies le confidence level, in percent, for confidence intervals. The default is level or _ set by set level: see [U] 23.5 Specifying the width of confidence intervals.
(95)
_,
offse_ (varname) s_cifies that varname is to be included in the model with coefficient constrained to be 1.
i l
maximi_e..options control the maximization process; see [R] maximize. You should never have to spedfy them.
i
Optionsior predicl
I
p.:the d_ault, calculat _s the predicted probabilities. If you do not also specify the out come () option. you must specify new variables, where kis the number of categories of the dependent variable. Say vbu estimatec _ model by typing oprobit result xl x2. and result takes on three values.
i
Then i,ou could tyl:e predict pl p2 p3. to obtain all three predicted probabilities. If you specie' the ot_tcome() opt on, then you specify one new variable. Say that result takes on values 1.2. and 3i Then typing predict pl outcome(I) would produce the same pl. xb. calculates the line_ • prediction. You specify one new variable; for example, predict linear, xb. Tt_e linear prod ction is defined ignoring the contribution of the estimated cut points. i
xb calcult_tes the line prediction. You specify one new variable: for example, predict linear, xb. Ttje linear pred_ fion is defined ignoring the contribution of the estimated cut points.
stdp calculates the standard error of the linear prediction. You specify one new variable; for example, predict se, stdp.

outcome(outcome) specifies for which outcome the predicted probabilities are to be calculated. outcome() should contain either a single value of the dependent variable, or one of #1, #2, ..., with #1 meaning the first category of the dependent variable, #2 the second category, etc.
nooffset is relevant only if you specified offset(varname) for oprobit. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as x_j b rather than x_j b + offset_j.
oprobit -- Maximum-likelihood ordered probit estimation
Syntax

oprobit depvar [varlist] [weight] [if exp] [in range] [, table robust
    cluster(varname) score(newvarlist) level(#) offset(varname)
    maximize_options ]
by ... : may be used with oprobit; see [R] by.

fweights, iweights, and pweights are allowed; see [U] 14.1.6 weight.

oprobit shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

oprobit may be used with sw to perform stepwise estimation; see [R] sw.
Syntax for predict

predict [type] newvarname(s) [if exp] [in range] [, { p | xb | stdp }
    outcome(outcome) nooffset ]
Note that with the p option, you specify either one or k new variables depending upon whether the outcome() option is also specified (where k is the number of categories of depvar). With xb and stdp, one new variable is specified. These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.
Description

oprobit estimates ordered probit models of ordinal variable depvar on the independent variables varlist. The actual values taken on by the dependent variable are irrelevant except that larger values are assumed to correspond to "higher" outcomes. Up to 50 outcomes are allowed in Intercooled Stata and up to 20 are allowed in Small Stata. See [R] logistic for a list of related estimation commands.
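The outcome probabilities in an ordered probit model are differences of standard normal CDF values evaluated at the estimated cut points. A minimal sketch of that arithmetic (illustrative only; the function names are ours, not Stata's):

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def oprobit_probs(xb, cuts):
    """Outcome probabilities for an ordered probit model, given the
    linear prediction xb (cut points excluded, as with predict, xb)
    and the increasing estimated cut points _cut1 < _cut2 < ...:
    P(y = j) = Phi(_cutj - xb) - Phi(_cut{j-1} - xb)."""
    edges = [float("-inf")] + list(cuts) + [float("inf")]
    return [norm_cdf(hi - xb) - norm_cdf(lo - xb)
            for lo, hi in zip(edges, edges[1:])]
```

With two cut points this yields the three category probabilities that predict's p option would report; they are positive and sum to 1 by construction.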
Options

table requests a table showing how the probabilities for the categories are computed from the fitted equation.
robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.13 Weighted estimation.
oneway -- One-way analysis of variance
The Scheffé test (Scheffé 1953, 1959; also see Winer, Brown, and Michels 1991, 191-195) differs in derivation, but it attacks the same problem. Let there be k means for which we want to make all the pairwise tests. Two means are declared significantly different if

    t >= sqrt{ (k - 1) F(alpha; k - 1, nu) }

where F(alpha; k - 1, nu) is the alpha-critical value of the F distribution with k - 1 numerator and nu denominator degrees of freedom. Scheffé's test has the nicety that it never declares a contrast significant if the overall F test is nonsignificant.

Turning the test around, Stata calculates a significance level
    e = F( t^2 / (k - 1); k - 1, nu )

For instance, you have a calculated t statistic of 4.0 with 50 degrees of freedom. The simple t test says the significance level is .00021. The F test equivalent, 16 with 1 and 50 degrees of freedom, says the same. If you are doing three comparisons, however, you calculate an F test of 8.0 with 2 and 50 degrees of freedom, which says the significance level is .0010.
References

Altman, D. G. 1991. Practical Statistics for Medical Research. London: Chapman & Hall.

Bartlett, M. S. 1937. Properties of sufficiency and statistical tests. Proceedings of the Royal Society, Series A 160: 268-282.

Hochberg, Y. and A. C. Tamhane. 1987. Multiple Comparison Procedures. New York: John Wiley & Sons.

Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.

Miller, R. G., Jr. 1981. Simultaneous Statistical Inference. 2d ed. New York: Springer-Verlag.

Scheffé, H. 1953. A method for judging all contrasts in the analysis of variance. Biometrika 40: 87-104.

——. 1959. The Analysis of Variance. New York: John Wiley & Sons.

Šidák, Z. 1967. Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association 62: 626-633.

Snedecor, G. W. and W. G. Cochran. 1989. Statistical Methods. 8th ed. Ames, IA: Iowa State University Press.

Winer, B. J., D. R. Brown, and K. M. Michels. 1991. Statistical Principles in Experimental Design. 3d ed. New York: McGraw-Hill.

Also See

Complementary:  [R] encode

Related:        [R] anova, [R] loneway, [R] table

Background:     [U] 21.8 Accessing results calculated by other programs
Multiple-comparison tests

Let's begin by reviewing the logic behind these adjustments. The "standard" t statistic for the comparison of two means is
    t = (ybar_i - ybar_j) / ( s * sqrt(1/n_i + 1/n_j) )

where s is the overall standard deviation, ybar_i is the measured average of y in group i, and n_i is the number of observations in the group. We perform hypothesis tests by calculating this t statistic. We simultaneously choose a critical level alpha and look up the t statistic corresponding to that level in a table. We reject the hypothesis if our calculated t exceeds the value we looked up. Alternatively, since we have a computer at our disposal, we calculate the significance level e corresponding to our calculated t statistic and, if e < alpha, we reject the hypothesis.

This logic works well when we are performing a single test. Now consider what happens when we perform a number of separate tests, say n of them. Let's assume, just for discussion, that we set alpha equal to 0.05 and that we will perform 6 tests. For each test we have a 0.05 probability of falsely rejecting the equality-of-means hypothesis. Overall, then, our chances of falsely rejecting at least one of the hypotheses is 1 - (1 - .05)^6 = .26 if the tests are independent.

The idea behind multiple-comparison tests is to control for the fact that we will perform multiple tests and to reduce our overall chances of falsely rejecting each hypothesis to alpha rather than letting it increase with each additional test. (See Miller 1981 and Hochberg and Tamhane 1987 for rather advanced texts on multiple-comparison procedures.)

The Bonferroni adjustment (see Miller 1981; also see Winer, Brown, and Michels 1991, 158-166) does this by (falsely but approximately) asserting that the critical level we should use, a, is the true critical level alpha divided by the number of tests n; that is, a = alpha/n. For instance, if we are going to perform 6 tests, each at the .05 significance level, we want to adopt a critical level of .05/6 = .00833.

We can just as easily apply this logic to e, the significance level associated with our t statistic, as to our critical level a. If a comparison has a calculated significance of e, then its "real" significance, adjusted for the fact of n comparisons, is n*e. If a comparison has a significance level of, say, .012, and we perform 6 tests, then its "real" significance is .072. If we adopt a critical level of .05, we cannot reject the hypothesis. If we adopt a critical level of .10, we can reject it.
Of course, this calculation can go above 1, but that just means that there is no alpha < 1 for which we could reject the hypothesis. (This situation arises due to the crude nature of the Bonferroni adjustment.) Stata handles this case by simply calling the significance level 1. Thus, the formula for the Bonferroni significance level is

    e_b = min(1, e*n)

where n = k(k - 1)/2 is the number of comparisons.

The Šidák adjustment (Šidák 1967; also see Winer, Brown, and Michels 1991, 165-166) is slightly different and provides a tighter bound. It starts with the assertion that

    a = 1 - (1 - alpha)^(1/n)

Turning this formula around and substituting calculated significance levels, we obtain

    e_s = min{ 1, 1 - (1 - e)^n }
For example, if the calculated significance is 0.012 and we perform 6 tests, the "'real" significance is approximately 0.07.
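The two adjustments can be sketched directly from the formulas above (hypothetical helper names, not Stata's internals):

```python
def bonferroni(e, n):
    """Bonferroni-adjusted significance: e_b = min(1, e*n)."""
    return min(1.0, e * n)

def sidak(e, n):
    """Sidak-adjusted significance: e_s = min(1, 1 - (1 - e)**n)."""
    return min(1.0, 1.0 - (1.0 - e) ** n)
```

For the running example, e = 0.012 with n = 6 tests gives a Bonferroni-adjusted 0.072 and a slightly smaller Šidák-adjusted value of about 0.070, illustrating the tighter bound.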
Methods and Formulas

The model of one-way analysis of variance is

    y_ij = mu + alpha_i + epsilon_ij

for levels i = 1, ..., k and observations j = 1, ..., n_i. Define ybar_i as the (weighted) mean of y_ij over j and ybar as the overall (weighted) mean of y_ij. Define w_ij as the weight associated with y_ij, which is 1 if the data are unweighted. w_ij is normalized to sum to n = sum_i n_i if aweights are used and is otherwise unnormalized. w_i refers to sum_j w_ij and w refers to sum_i w_i.

The between-group sum of squares is then

    S_1 = sum_i w_i (ybar_i - ybar)^2

The total sum of squares is

    S = sum_i sum_j w_ij (y_ij - ybar)^2

The within-group sum of squares is given by S_e = S - S_1.

The between-group mean square is s_1^2 = S_1/(k - 1) and the within-group mean square is s_e^2 = S_e/(w - k). The test statistic is F = s_1^2 / s_e^2. See, for instance, Snedecor and Cochran (1989).
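For unweighted data (all w_ij = 1, so w = n), the sums of squares above reduce to the following sketch (our own function, not Stata's implementation):

```python
def oneway_F(groups):
    """One-way ANOVA for unweighted data: returns (S1, Se, F), the
    between-group SS, within-group SS, and the F statistic."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n          # overall mean
    # between-group SS: group sizes times squared mean deviations
    S1 = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # total SS, from which the within-group SS follows by subtraction
    S = sum((y - grand) ** 2 for g in groups for y in g)
    Se = S - S1
    F = (S1 / (k - 1)) / (Se / (n - k))
    return S1, Se, F
```

On three made-up groups [1,2,3], [2,3,4], [5,6,7], the between- and within-group sums of squares are 26 and 6, giving F = 13 with 2 and 6 degrees of freedom.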
Bartlett's test

Bartlett's test assumes that you have m independent, normal random samples and tests the hypothesis sigma_1^2 = sigma_2^2 = ... = sigma_m^2. The test statistic, M, is defined as

    M = [ (T - m) ln(sigmahat^2) - sum_i (T_i - 1) ln(sigmahat_i^2) ]
        / [ 1 + (1/(3(m - 1))) { sum_i 1/(T_i - 1) - 1/(T - m) } ]

where there are T overall observations, T_i observations in the ith group, and

    sigmahat_i^2 = (1/(T_i - 1)) sum_j (y_ij - ybar_i)^2
    sigmahat^2   = (1/(T - m)) sum_i (T_i - 1) sigmahat_i^2

An approximate test of the homogeneity of variance is based on the statistic M with critical values obtained from the chi-squared distribution with m - 1 degrees of freedom. See Bartlett (1937) or Judge et al. (1985, 447-449).
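A direct transcription of the statistic for unweighted data, using only the standard library (a sketch, not Stata's code):

```python
import math

def bartlett_M(groups):
    """Bartlett's M statistic for homogeneity of variance, compared
    against chi-squared with m - 1 degrees of freedom."""
    m = len(groups)
    Ti = [len(g) for g in groups]
    T = sum(Ti)
    # per-group sample variances sigmahat_i^2
    var_i = [sum((y - sum(g) / len(g)) ** 2 for y in g) / (len(g) - 1)
             for g in groups]
    # pooled variance sigmahat^2
    pooled = sum((t - 1) * v for t, v in zip(Ti, var_i)) / (T - m)
    num = (T - m) * math.log(pooled) - sum(
        (t - 1) * math.log(v) for t, v in zip(Ti, var_i))
    den = 1.0 + (sum(1.0 / (t - 1) for t in Ti) - 1.0 / (T - m)) / (3 * (m - 1))
    return num / den
```

When every group has the same sample variance, the numerator vanishes and M = 0, as expected; unequal variances push M above zero.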
Weighted data

Example

oneway can work with both weighted and unweighted data. Let's assume that you wish to perform a one-way layout of the death rate on the four Census regions of the United States using state data. Your data contain three variables, drate (the death rate), region (the region), and pop (the population of the state).

To estimate the model, you type oneway drate region [weight=pop], although one typically abbreviates weight as w. We will also add the tabulate option to demonstrate how the table of summary statistics differs for weighted data:
. oneway drate region [w=pop], tabulate
(analytic weights assumed)

                      Summary of Death Rate
Census region        Mean   Std. Dev.       Freq.    Obs.
NE                  97.15        5.82    49135283       9
N Cntrl             88.10        5.58    58865670      12
South               87.05       10.40    74734029      16
West                75.65        8.23    43172490      13

Total               87.34       10.43   2.259e+08      50

                    Analysis of Variance
    Source             SS        df       MS         F     Prob > F
Between groups    2360.92281      3   786.974272   12.17     0.0000
Within groups     2974.09635     46    64.6542685

Total             5335.01916     49   108.877942

Bartlett's test for equal variances:  chi2(3) = 5.4971  Prob>chi2 = 0.139

When the data are weighted, the summary table has four rather than three columns. The column labeled "Freq." reports the sum of the weights. The overall frequency is 2.259e+08, meaning that there are approximately 226 million people in the U.S. The ANOVA table is appropriately weighted. Also see [U] 14.1.6 weight.
Saved Results

oneway saves in r():

Scalars
    r(N)           number of observations
    r(F)           F statistic
    r(df_m)        between-group degrees of freedom
    r(df_r)        within-group degrees of freedom
    r(mss)         between-group sum of squares
    r(rss)         within-group sum of squares
    r(chi2bart)    Bartlett's chi-squared
    r(df_bart)     Bartlett's degrees of freedom
Underneath that number is reported "0.001". This is the Bonferroni-adjusted significance of the difference. The difference is significant at the 0.1% level. Looking down the column, we see that concentration 3 is also worse than concentration 1 (4.2% level), as is concentration 4 (3.6% level).

Based on this evidence, we would use concentration 1 if we grew apple trees.
Example

We can just as easily obtain the Scheffé-adjusted significance levels. Rather than specifying the bonferroni option, we specify the scheffe option. We will also add the noanova option to prevent Stata from redisplaying the ANOVA table:

. oneway weight treatment, noanova scheffe

          Comparison of Average weight in grams by Fertilizer
                              (Scheffe)
Row Mean-|
Col Mean |          1          2          3
---------+---------------------------------
       2 |   -59.1667
         |      0.001
       3 |     -33.25    25.9167
         |      0.039      0.101
       4 |      -34.4    24.7667      -1.15
         |      0.034      0.118      0.999

The differences are the same as we obtained in the Bonferroni output, but the significance levels are not. According to the Bonferroni-adjusted numbers, the significance of the difference between fertilizer concentrations 1 and 3 is 4.2%. The Scheffé-adjusted significance level is 3.9%.

We will leave it to you to decide which results are more accurate.
Example

Let's conclude this example by obtaining the Šidák-adjusted multiple-comparison tests. We do this to illustrate Stata's capabilities to calculate these results. It is understood that searching across adjustment methods until you find the results you want is not a valid technique for obtaining significance levels.

. oneway weight treatment, noanova sidak

          Comparison of Average weight in grams by Fertilizer
                               (Sidak)
Row Mean-|
Col Mean |          1          2          3
---------+---------------------------------
       2 |   -59.1667
         |      0.001
       3 |     -33.25    25.9167
         |      0.041      0.116
       4 |      -34.4    24.7667      -1.15
         |      0.035      0.137      1.000

We find results that are similar to the Bonferroni-adjusted numbers.
  Sequence effect          285.82      3      95.27     1.01     0.4180
  Residuals               1221.49     13      93.98    59.96     0.0000
Intrasubjects
  Treatment effect          15.13      2       7.56     6.34     0.0048
  Period effect              8.48      1       8.48     8.86     0.0056
  Carryover effect           0.11      1       0.11     0.12     0.7366
  Residuals                 29.56     30       0.99

Total                    1560.59     50

Omnibus measure of separability of treatment and carryover = 64.6447%
In this example, the sequence specifier used dashes instead of zeros to indicate a baseline period during which no treatment was given. For pkcross to work, we need to encode the string sequence variable and then use the order option with pkshape. A word of caution: encode does not necessarily choose the first sequence to be sequence 1 as in this example. Always double-check the sequence numbering when using encode.
pkcross -- Analyze crossover experiments
To finish the analysis that was started in [R] pk, little additional work is needed. The data were reshaped with pkshape and are

       id   sequence    outcome   treat   carry   period
        1          1   150.9643       A       0        1
        2          1   146.7606       A       0        1
        3          1   160.6548       A       0        1
        4          1   157.8622       A       0        1
        5          1   133.6957       A       0        1
        7          1    160.639       A       0        1
        8          1   131.2604       A       0        1
        9          1   168.5186       A       0        1
       10          2   137.0627       B       0        1
       12          2   153.4038       B       0        1
       13          2   163.4593       B       0        1
       14          2   146.0462       B       0        1
       15          2   158.1457       B       0        1
       18          2   147.1977       B       0        1
       19          2   164.9988       B       0        1
       20          2   145.3823       B       0        1
        1          1   218.5551       B       A        2
        2          1   133.3201       B       A        2
        3          1   126.0635       B       A        2
        4          1   96.17461       B       A        2
        5          1   188.9038       B       A        2
        7          1   223.6922       B       A        2
        8          1   104.0139       B       A        2
        9          1   237.8962       B       A        2
       10          2   139.7382       A       B        2
       12          2   202.3942       A       B        2
       13          2   136.7848       A       B        2
       14          2   104.5191       A       B        2
       15          2   165.8654       A       B        2
       18          2    139.235       A       B        2
       19          2   166.2391       A       B        2
       20          2   158.5146       A       B        2
The model is fit using pkcross:

. pkcross outcome

  sequence variable = sequence
    period variable = period
 treatment variable = treat
 carryover variable = carry
        id variable = id

Analysis of variance (ANOVA) for a 2x2 crossover study

Source of Variation          SS        df        MS        F    Prob > F
Intersubjects
  Sequence effect          378.04       1     378.04     0.29     0.5961
  Residuals              17991.26      14    1285.09     1.40     0.2691
Intrasubjects
  Treatment effect         455.04       1     455.04     0.50     0.4931
  Period effect            419.47       1     419.47     0.46     0.5102
  Residuals              12860.78      14     918.63

Total                    32104.59      31

Omnibus measure of separability of treatment and carryover = 29.2893%
Example

Consider the case of a six-treatment crossover trial where the squares are not variance balanced. The following dataset is from a partially balanced crossover trial published by Ratkowsky et al. (1993):

. list

      cow    seq    period1   period2   period3   period4   block
 1.     1   adbe       38.7      37.4      34.3      31.3       1
 2.     2   baed       48.9      46.9        42      39.6       1
 3.     3   ebda       34.6      32.3      28.5      27.1       1
 4.     4   deab       35.2      33.5      28.4      25.1       1
 5.     1   dafc       32.9      33.1      27.5      25.1       2
 6.     2   fdca       30.4      29.4      26.7      23.1       2
 7.     3   cfda       30.8      29.3      26.4      23.4       2
 8.     4   acdf       25.7      26.1      23.4      18.7       2
 9.     1   efbc       25.4        26      23.9      19.9       3
10.     2   beef       21.8      23.9      21.7      17.6       3
11.     3   fceb       21.4        22      19.4      16.6       3
12.     4   cbfe       22.8        21      18.6      16.1       3
In cases where there is no variance balance in the design, a square or blocking variable is needed to indicate in which treatment cell a sequence was observed, but the mechanical steps are the same.

. pkshape cow seq period1 period2 period3 period4
. pkcross outcome, model(block cow|block period|block treat carry) se

Number of obs =      48        R-squared     = 0.9968
Root MSE      = .730751        Adj R-squared = 0.9906

Source             Seq. SS     df          MS          F     Prob > F
Model             2650.0419    30    88.3347302    165.42      0.0000

block            1607.17045     2    803.585226   1504.85      0.0000
cow|block        628.621899     9    69.8468777    130.80      0.0000
period|block     407.531876     9    45.2813195     84.80      0.0000
treat            2.48979215     5    .497958429      0.93      0.4846
carry            4.22788534     5    .845577068      1.58      0.2179

Residual         9.07794631    17    .533996842

Total            2659.11985    47    56.5770181
When the model statement is used and the omnibus measure of separability is desired, specify the variables in the treatment(), carryover(), and sequence() options to pkcross.
Methods and Formulas

pkcross is implemented as an ado-file.

pkcross uses ANOVA to fit models for crossover experiments; see [R] anova.

The omnibus measure of separability is

    S = 100(1 - V)%

where V is Cramér's V and is defined as

    V = sqrt{ (chi^2 / N) / min(r - 1, c - 1) }

The chi^2 is calculated as

    chi^2 = sum_i sum_j (O_ij - E_ij)^2 / E_ij

where O and E are the observed and expected counts in a table of the number of times each treatment is followed by the other treatments.
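Under these definitions, the separability measure can be sketched as follows (a hypothetical helper, not pkcross's internals):

```python
def separability(table):
    """Omnibus separability S = 100*(1 - V) percent, where V is
    Cramer's V of the treatment-followed-by-treatment count table."""
    r, c = len(table), len(table[0])
    N = sum(sum(row) for row in table)
    rowtot = [sum(row) for row in table]
    coltot = [sum(table[i][j] for i in range(r)) for j in range(c)]
    # Pearson chi-squared against independence expected counts
    chi2 = sum(
        (table[i][j] - rowtot[i] * coltot[j] / N) ** 2
        / (rowtot[i] * coltot[j] / N)
        for i in range(r) for j in range(c))
    V = ((chi2 / N) / min(r - 1, c - 1)) ** 0.5
    return 100.0 * (1.0 - V)
```

A perfectly balanced follow-on table gives chi^2 = 0, hence V = 0 and S = 100%; a table where each treatment is always followed by the same other treatment gives V = 1 and S = 0%.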
References

Chow, S. C. and J. P. Liu. 2000. Design and Analysis of Bioavailability and Bioequivalence Studies. 2d ed. New York: Marcel Dekker.

Neter, J., M. H. Kutner, C. J. Nachtsheim, and W. Wasserman. 1996. Applied Linear Statistical Models. 4th ed. Chicago: Irwin.

Ratkowsky, D. A., M. A. Evans, and J. R. Alldredge. 1993. Cross-over Experiments: Design, Analysis and Application. New York: Marcel Dekker.
Also See

Related:        [R] pkcollapse, [R] pkequiv, [R] pkexamine, [R] pkshape, [R] pksumm

Complementary:  [R] statsby

Background:     [R] pk
Title

pkequiv -- Perform bioequivalence tests

Syntax

pkequiv outcome treatment period sequence id [if exp] [in range]
    [, compare(string) limit(#) level(#) noboot fieller symmetric
    anderson tost ]
Description pkequiv this entry.
is one of the pk commands.
If you have not read [R] pk, please do so before reading
pkequiv performs bioequivalence testing for two treatments. By default, pkequiv calculates a standard confidence interval symmetric about the difference between the two treatment means. pkequiv also calculates confidence intervals symmetric about zero and intervals based on Fieller's theorem. Additionally, pkequiv can perform interval hypothesis tests for bioequivalence.
Options

compare(string) specifies the two treatments to be tested for equivalence. In some cases, there may be more than two treatments, but the equivalence can only be determined between any two treatments.

limit(#) specifies the equivalence limit. The default is 20%. The equivalence limit can only be changed symmetrically; that is, it is not possible to have a 15% lower limit and a 20% upper limit in the same test.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(90). Note that this is not controlled by the set level command.
noboot prevents the estimation of the probability that the confidence interval lies within the confidence limits. If this option is not specified, this probability is estimated by resampling the data.

fieller specifies that an equivalence interval based on Fieller's theorem is to be calculated.

symmetric specifies that a symmetric equivalence interval is to be calculated.

anderson specifies that the Anderson and Hauck hypothesis test for bioequivalence is to be computed. This option is ignored when calculating equivalence intervals based on Fieller's theorem or when calculating a confidence interval that is symmetric about zero.

tost specifies that the two one-sided hypothesis tests for bioequivalence are to be computed. This option is ignored when calculating equivalence intervals based on Fieller's theorem or when calculating a confidence interval that is symmetric about zero.
i
Remarks } :_
l
_
',
i
4
•
pkequiv i designed to +nduct tests for bioequivalence based on data from a crossover experiment. pkequiv requires that the User specify the outcome, uvatment, period, sequence, and id variables. The data mus I be in the sake format as produced lff pkshape;
see [R] pkshape.
Example

We will conduct equivalence testing on the data introduced in [R] pk. After shaping the data with pkshape, the data are

. list

        id   sequence    outcome   treat   carry   period
  1.     1          1   150.9643       A       0        1
  2.     1          1   218.5551       B       A        2
  3.     2          1   146.7606       A       0        1
  4.     2          1   133.3201       B       A        2
  5.     3          1   160.6548       A       0        1
  6.     3          1   126.0635       B       A        2
  7.     4          1   157.8622       A       0        1
  8.     4          1   96.17461       B       A        2
  9.     5          1   133.6957       A       0        1
 10.     5          1   188.9038       B       A        2
 11.     7          1    160.639       A       0        1
 12.     7          1   223.6922       B       A        2
 13.     8          1   131.2604       A       0        1
 14.     8          1   104.0139       B       A        2
 15.     9          1   168.5186       A       0        1
 16.     9          1   237.8962       B       A        2
 17.    10          2   137.0627       B       0        1
 18.    10          2   139.7382       A       B        2
 19.    12          2   153.4038       B       0        1
 20.    12          2   202.3942       A       B        2
 21.    13          2   163.4593       B       0        1
 22.    13          2   136.7848       A       B        2
 23.    14          2   146.0462       B       0        1
 24.    14          2   104.5191       A       B        2
 25.    15          2   158.1457       B       0        1
 26.    15          2   165.8654       A       B        2
 27.    18          2   147.1977       B       0        1
 28.    18          2    139.235       A       B        2
 29.    19          2   164.9988       B       0        1
 30.    19          2   166.2391       A       B        2
 31.    20          2   145.3823       B       0        1
 32.    20          2   158.5146       A       B        2
We can now conduct a bioequivalence test between treat = A and treat = B.

. pkequiv outcome treat period seq id
Classic confidence interval for bioequivalence

                 [equivalence limits]        [    test limits    ]
difference:      -30.296        30.296       -11.332       26.416
     ratio:          80%          120%       92.519%     117.439%

probability test limits are within equivalence limits =     0.6350

The default output for pkequiv shows a confidence interval for the difference of the means (test limits), the ratio of the means, and the federal equivalence limits. The classic confidence interval can
be constructed around the difference between the average measure of effect for the two drugs or around the ratio of the average measure of effect for the two drugs. pkequiv reports both the difference measure and the ratio measure. For these data, U.S. federal government regulations state that the confidence interval for the difference must be entirely contained within the range [-30.296, 30.296], and between 80% and 120% for the ratio. In this case, the test limits are within the equivalence limits. Although the test limits are inside the equivalence limits, there is only a 63% assurance that the observed confidence interval will be within the equivalence limits in the long run. This is an interesting case because although this sample shows bioequivalence, the evaluation of the long-run performance indicates that there may be problems. These fictitious data were generated with high intersubject variability, which causes poor long-run performance.
If we conduct a bioequivalence test with the data published in Chow and Liu (2000), which we introduced in [R] pk and fully describe in [R] pkshape, we observe that the probability that the test limits are within the equivalence limits is very high. The data from Chow and Liu (2000) can be seen in expanded form in [R] pkshape. The equivalence test is

. pkequiv outcome treat period seq id

Classic confidence interval for bioequivalence

                 [equivalence limits]        [    test limits    ]
difference:      -16.512        16.512        -8.698        4.123
     ratio:          80%          120%       89.464%     104.994%

probability test limits are within equivalence limits =     0.9970
For these data, the test limits are well within the equivalence limits, and the probability that the test limits are within the equivalence limits is 99.7%.

Example

Using the data published in Chow and Liu (2000), we compute a confidence interval that is symmetric about zero:

. pkequiv outcome treat period seq id, symmetric

Westlake's symmetric confidence interval for bioequivalence

                     [Equivalence limits]        [ Test mean ]
Test formulation:      75.145        89.974           80.272

The reported equivalence limit is constructed symmetrically about the reference mean, which is equivalent to constructing a confidence interval symmetric about zero for the difference in the two drugs. In the above output, we see that the test formulation mean of 80.272 is within the equivalence limits, indicating that the test drug is bioequivalent to the reference drug.

pkequiv will display interval hypothesis tests of bioequivalence if you specify the tost and/or the anderson options. For example,
. pkequiv outcome treat period seq id, tost anderson

Classic confidence interval for bioequivalence

                 [equivalence limits]        [    test limits    ]
difference:      -16.512        16.512        -8.698        4.123
     ratio:          80%          120%       89.464%     104.994%

probability test limits are within equivalence limits =     0.9980

Schuirmann's two one-sided tests
      upper test statistic =     -5.036      p-value =     0.000
      lower test statistic =      3.810      p-value =     0.001

Anderson and Hauck's test
   noncentrality parameter =      4.423
             test statistic =    -0.613      empirical p-value = 0.0005
Both of Schuirmann's one-sided tests are highly significant, suggesting that the two drugs are bioequivalent. A similar conclusion is drawn from the Anderson and Hauck test of bioequivalence.
Saved Results

pkexamine saves in r():

Scalars
    r(stddev)

. anova outcome pattern order id|pattern

Number of obs =      18        R-squared     = 0.9562
Root MSE      = 1.59426        Adj R-squared = 0.9069

Source           Partial SS    df           MS        F     Prob > F
Model            443.666667     9   49.2962963    19.40      0.0002

pattern          .333333333     2   .166666667     0.07      0.9370
order            233.333333     2   116.666667    45.90      0.0000
id|pattern            21.00     3         7.00     2.75      0.1120

Residual         20.3333333     8   2.54166667

Total                464.00    17   27.2941176

These are the same results reported by Neter et al. (1996).
pkshape -- Reshape (pharmacokinetic) Latin square data

Example

Returning to the example from the pk entry, the data are

       id   seq   auc_concA   auc_concB
  1.    1     1    150.9643    218.5551
  2.    2     1    146.7606    133.3201
  3.    3     1    160.6548    126.0635
  4.    4     1    157.8622    96.17461
  5.    5     1    133.6957    188.9038
  6.    7     1     160.639    223.6922
  7.    8     1    131.2604    104.0139
  8.    9     1    168.5186    237.8962
  9.   10     2    137.0627    139.7382
 10.   12     2    153.4038    202.3942
 11.   13     2    163.4593    136.7848
 12.   14     2    146.0462    104.5191
 13.   15     2    158.1457    165.8654
 14.   18     2    147.1977     139.235
 15.   19     2    164.9988    166.2391
 16.   20     2    145.3823    158.5146

. pkshape id seq auc_concA auc_concB, order(ab ba)
. sort id
. list

        id   sequence    outcome   treat   carry   period
  1.     1          1   150.9643       1       0        1
  2.     1          1   218.5551       2       1        2
  3.     2          1   146.7606       1       0        1
  4.     2          1   133.3201       2       1        2
  5.     3          1   126.0635       2       1        2
  6.     3          1   160.6548       1       0        1
  7.     4          1   96.17461       2       1        2
  8.     4          1   157.8622       1       0        1
  9.     5          1   188.9038       2       1        2
 10.     5          1   133.6957       1       0        1
 11.     7          1    160.639       1       0        1
 12.     7          1   223.6922       2       1        2
 13.     8          1   131.2604       1       0        1
 14.     8          1   104.0139       2       1        2
 15.     9          1   237.8962       2       1        2
 16.     9          1   168.5186       1       0        1
 17.    10          2   137.0627       2       0        1
 18.    10          2   139.7382       1       2        2
 19.    12          2   202.3942       1       2        2
 20.    12          2   153.4038       2       0        1
 21.    13          2   163.4593       2       0        1
 22.    13          2   136.7848       1       2        2
 23.    14          2   104.5191       1       2        2
 24.    14          2   146.0462       2       0        1
 25.    15          2   165.8654       1       2        2
 26.    15          2   158.1457       2       0        1
 27.    18          2    139.235       1       2        2
 28.    18          2   147.1977       2       0        1
 29.    19          2   164.9988       2       0        1
 30.    19          2   166.2391       1       2        2
 31.    20          2   158.5146       1       2        2
 32.    20          2   145.3823       2       0        1

These data can be analyzed with pkcross or anova.
Methods and Formulas

pkshape is implemented as an ado-file.
References

Chow, S. C. and J. P. Liu. 2000. Design and Analysis of Bioavailability and Bioequivalence Studies. 2d ed. New York: Marcel Dekker.

Neter, J., M. H. Kutner, C. J. Nachtsheim, and W. Wasserman. 1996. Applied Linear Statistical Models. 4th ed. Chicago: Irwin.
Also See

Related:     [R] pkcollapse, [R] pkcross, [R] pkequiv, [R] pkexamine, [R] pksumm; [R] anova

Background:  [R] pk
pksumm -- Summarize pharmacokinetic data

Syntax

pksumm id time concentration [if exp] [in range] [, fit(#) trapezoid
    stat(measure) nodots notimechk graph graph_options ]

where measure is one of

    auc       area under the concentration-time curve (AUC)
    aucline   area under the concentration-time curve from 0 to infinity
              using a linear extension
    aucexp    area under the concentration-time curve from 0 to infinity
              using an exponential extension
    auclog    area under the log-concentration-time curve extended with
              a linear fit
    half      half-life of the drug
    ke        elimination rate
    cmax      maximum concentration
    tomc      time of maximum concentration
    tmax      time at last concentration

Description

pksumm is one of the pk commands. If you have not read [R] pk, please do so before reading this entry.

pksumm obtains the first four moments from the empirical distribution of each pharmacokinetic measurement and tests the null hypothesis that the distribution of that measurement is normally distributed.

Options

fit(#) specifies the number of points, counting back from the last time measurement, to use in fitting the extension to estimate the AUC from 0 to infinity. The default is fit(3), the last 3 points. This should be viewed as a minimum; the appropriate number of points will depend on the data.

trapezoid specifies that the trapezoidal rule should be used to calculate the AUC. The default is cubic splines, which give better results for most situations. In cases where the curve is very irregular, the trapezoidal rule may give better results.

stat(measure) specifies the statistic that pksumm should graph. The default is stat(auc). If the graph option is not specified, this option is ignored.

nodots suppresses the progress dots during calculation. By default, a period is displayed for every call to calculate the pharmacokinetic measures.

notimechk suppresses the check that the follow-up time for all subjects is the same. By default, pksumm expects the maximum follow-up time to be equal for all subjects.

graph requests a graph of the distribution of the statistic specified with stat().

graph_options are any of the options allowed with graph, twoway; see [G] graph options.
_,,tt
o_z
pKsumm -- _ummanze pharmacokinetic data
-7
Remarks

pksumm will produce summary statistics for the distribution of nine common pharmacokinetic measurements. If there are more than eight subjects, pksumm will also compute a test for normality on each measurement. The nine measurements summarized by pksumm are listed above and are described in the Methods and Formulas sections of [R] pkexamine and [R] pk.
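The ke and half measures come from a straight-line fit to the tail of the log-concentration curve (the extension that fit(#) controls). A rough standard-library sketch under that assumption (not pk's exact algorithm):

```python
import math

def ke_from_tail(time, conc, points=3):
    """Slope of a least-squares line through ln(conc) vs. time over the
    last `points` measurements; -slope estimates the elimination rate."""
    t = time[-points:]
    y = [math.log(c) for c in conc[-points:]]
    n = len(t)
    tbar = sum(t) / n
    ybar = sum(y) / n
    slope = (sum((ti - tbar) * (yi - ybar) for ti, yi in zip(t, y))
             / sum((ti - tbar) ** 2 for ti in t))
    return -slope

def half_life(ke):
    """Half-life implied by a first-order elimination rate ke."""
    return math.log(2) / ke
```

On an exactly exponential decay conc(t) = exp(-0.1 t), the fitted elimination rate is 0.1 and the half-life is ln(2)/0.1.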
Example

We demonstrate the use of pksumm with the data described in [R] pk. We have drug concentration data on 15 subjects, each measured at 13 time points over a 32-hour period. A few of the records are

. list

        id   time       conc
  1.     1      0          0
  2.     1     .5   3.073403
  3.     1      1   5.188444
  4.     1    1.5   5.898577
  5.     1      2   5.096378
  6.     1      3   6.094085
(output omitted)
183.    15      0          0
184.    15     .5    3.86493
185.    15      1   6.432444
186.    15    1.5   6.969195
187.    15      2   6.307024
188.    15      3   6.509584
189.    15      4   6.555091
190.    15      6   7.318319
191.    15      8   5.329813
192.    15     12   5.411624
193.    15     16   3.891397
194.    15     24   5.167516
195.    15     32   2.649686
We can use pksumm to view the summary statistics for all the pharmacokinetic parameters.

. pksumm id time conc

Summary statistics for the pharmacokinetic measures
                                      Number of observations =   15

 stat.       Mean     Median     Variance   Skewness   Kurtosis   p-value
 auc       150.74     150.96       123.07      -0.26       2.10      0.69
 aucline   408.30     214.17    188856.87       2.57       8.93      0.00
 aucexp    691.68     297.08    762679.94       2.56       8.87      0.00
 auclog    688.98     297.67    797237.24       2.59       9.02      0.00
 half       94.84      29.39     18722.13       2.26       7.37      0.00
 ke          0.02       0.02         0.00       0.89       3.70      0.09
 cmax        7.36       7.42         0.42      -0.60       2.56      0.44
 tomc        3.47       3.00         7.62       2.17       7.18      0.00
 tmax       32.00      32.00         0.00

For the 15 subjects, the mean AUC (from time 0 to the last concentration) is 150.74 and sigma^2 = 123.07. The skewness of -0.26 indicates that the distribution is slightly skewed left. The p-value of 0.69 for the chi-squared test of normality indicates that we cannot reject the null hypothesis that the distribution is normal.
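With the trapezoid option, the AUC reported in the stat. column is the trapezoidal-rule area under the measured concentration-time points. A minimal sketch of that rule (our own helper, not pk's implementation, which defaults to cubic splines):

```python
def auc_trapezoid(time, conc):
    """Trapezoidal-rule area under the concentration-time curve:
    sum over intervals of width * average of the two endpoints."""
    return sum((t1 - t0) * (c0 + c1) / 2.0
               for t0, t1, c0, c1 in zip(time, time[1:], conc, conc[1:]))
```

For example, points (0, 0), (1, 2), (2, 2) give an area of 1 + 2 = 3.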
If we were to consider any of the three variants of the AUC_{0,∞}, we would see that there is huge variability and that the distribution is heavily skewed. A skewness different from 0 and a kurtosis different from 3 are expected because the distribution of the AUC_{0,∞} is not normal. We now graph the distribution of AUC_{0,tmax} and specify the graph option.
. pksumm id time conc, graph bin(20)

Summary statistics for the pharmacokinetic measures

                              Number of observations =   15

    stat.      Mean    Median    Variance  Skewness  Kurtosis   p-value
      auc    150.74    150.96      123.07     -0.26      2.10      0.69
  aucline    408.30    214.17   188856.87      2.57      8.93      0.00
   aucexp    691.68    297.08   762679.94      2.56      8.87      0.00
   auclog    688.98    297.67   797237.24      2.59      9.02      0.00
     half     94.84     29.39    18722.13      2.26      7.37      0.00
       ke      0.02      0.02        0.00      0.89      3.70      0.09
     cmax      7.36      7.42        0.42     -0.60      2.56      0.44
     tomc      3.47      3.00        7.62      2.17      7.18      0.00
     tmax     32.00     32.00        0.00

[histogram of the Area Under the Curve (AUC) omitted]

graph, by default, plots the distribution of AUC_{0,tmax}. To produce a graph of one of the other pharmacokinetic measurements, we need to specify the stat() option. For example, we can ask Stata to produce a plot of the AUC_{0,∞} using the log extension:
. pksumm id time conc, stat(auclog) graph bin(20)
Summary statistics for the pharmacokinetic measures

                              Number of observations =   15

    stat.      Mean    Median    Variance  Skewness  Kurtosis   p-value
      auc    150.74    150.96      123.07     -0.26      2.10      0.69
  aucline    408.30    214.17   188856.87      2.57      8.93      0.00
   aucexp    691.68    297.08   762679.94      2.56      8.87      0.00
   auclog    688.98    297.67   797237.24      2.59      9.02      0.00
     half     94.84     29.39    18722.13      2.26      7.37      0.00
       ke      0.02      0.02        0.00      0.89      3.70      0.09
     cmax      7.36      7.42        0.42     -0.60      2.56      0.44
     tomc      3.47      3.00        7.62      2.17      7.18      0.00
     tmax     32.00     32.00        0.00

[histogram of the linear fit to log concentration omitted]
years multiplied by 2,000 person-years means 40 events are expected; and so on.

3. Over very small exposures ε, the probability of finding more than one event is small compared with ε.

4. Nonoverlapping exposures are mutually independent.

With these assumptions, to find the probability of k events in an exposure of size E, divide E into n subintervals E₁, E₂, ..., Eₙ, and approximate the answer as the binomial probability of observing k successes in n trials. If you let n → ∞, you obtain the Poisson distribution.

In the Poisson regression model, the incidence rate for the jth observation is assumed to be given by
$$ r_j = e^{\beta_0 + \beta_1 x_{1,j} + \cdots + \beta_k x_{k,j}} $$

If $E_j$ is the exposure, the expected number of events $C_j$ will be

$$ C_j = E_j \, e^{\beta_0 + \beta_1 x_{1,j} + \cdots + \beta_k x_{k,j}}
       = e^{\ln(E_j) + \beta_0 + \beta_1 x_{1,j} + \cdots + \beta_k x_{k,j}} $$
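The limiting argument above is easy to check numerically: Binomial(n, λ/n) probabilities approach Poisson(λ) probabilities as n grows. A small illustration (not part of the manual's formulas):

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson probability of k events when the expected count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 4.0
# Dividing the exposure into many subintervals, the binomial probability
# of 3 events is already very close to the Poisson probability.
approx = binom_pmf(3, 100000, lam / 100000)
exact = poisson_pmf(3, lam)
```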
This model is estimated by poisson. Without the exposure() or offset() options, $E_j$ is assumed to be 1 (equivalent to assuming that exposure is unknown), and controlling for exposure, if necessary, is your responsibility.

One often wants to compare rates, and this is most easily done by calculating incidence rate ratios (IRR). For instance, what is the relative incidence rate of chromosome interchanges in cells as the intensity of radiation increases; the relative incidence rate of telephone connections to a wrong number as load increases; or the relative incidence rate of deaths due to cancer for females relative to males? That is, one wants to hold all the x's in the model constant except one, say the ith. The incidence rate ratio for a one-unit change in $x_i$ is

$$ \frac{e^{\ln(E) + \beta_0 + \beta_1 x_1 + \cdots + \beta_i (x_i + 1) + \cdots + \beta_k x_k}}
        {e^{\ln(E) + \beta_0 + \beta_1 x_1 + \cdots + \beta_i x_i + \cdots + \beta_k x_k}} = e^{\beta_i} $$

More generally, the incidence rate ratio for a $\Delta x_i$ change in $x_i$ is $e^{\beta_i \Delta x_i}$. The lincom command can be used after poisson to display incidence rate ratios for any group relative to another; see [R] lincom.
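The cancellation that yields $e^{\beta_i}$ can be verified directly; a small sketch with made-up coefficient values (beta0, beta_i, and the exposure are all hypothetical):

```python
import math

def rate(x_i, beta0, beta_i, log_exposure):
    """Expected event count for a one-covariate model with an offset."""
    return math.exp(log_exposure + beta0 + beta_i * x_i)

beta0, beta_i, log_e = -1.5, 0.38, math.log(2000.0)  # hypothetical values
# The ratio of rates for a one-unit change in x_i equals exp(beta_i),
# regardless of the exposure or the other terms, which cancel.
irr = rate(3.0, beta0, beta_i, log_e) / rate(2.0, beta0, beta_i, log_e)
```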
> Example

Chatterjee, Hadi, and Price (2000, 164) give the number of injury incidents and the proportion of total flights from New York for nine major U.S. airlines in a single year:

. list

       airline   injuries        n   XYZowned
  1.         1         11   0.0950          1
  2.         2          7   0.1920          0
  3.         3          7   0.0750          0
  4.         4         19   0.2078          0
  5.         5          9   0.1382          0
  6.         6          4   0.0540          1
  7.         7          3   0.1292          0
  8.         8          1   0.0503          0
  9.         9          3   0.0629          1

To their data we have added a fictional variable, XYZowned. We will imagine that an accusation is made that the airlines owned by XYZ Company have a higher injury rate.

. poisson injuries XYZowned, exposure(n) irr
Iteration 0:   log likelihood = -23.027197
Iteration 1:   log likelihood = -23.027177
Iteration 2:   log likelihood = -23.027177

Poisson regression                                Number of obs   =          9
                                                  LR chi2(1)      =       1.77
                                                  Prob > chi2     =     0.1836
Log likelihood = -23.027177                       Pseudo R2       =     0.0370

    injuries         IRR   Std. Err.      z     P>|z|     [95% Conf. Interval]
    XYZowned    1.463467     .406872    1.37    0.171     .8486578    2.523675
           n  (exposure)
We specified irr to see the incidence rate ratios rather than the underlying coefficients. We estimate that XYZ Airlines' injury rate is 1.46 times larger than that for other airlines, but the 95% confidence interval is .85 to 2.52; we cannot even reject the hypothesis that XYZ Airlines has a lower injury rate.
. gen lnN = ln(n)
. poisson injuries XYZowned lnN

Poisson regression                                Number of obs   =          9
                                                  LR chi2(2)      =      19.15
                                                  Prob > chi2     =     0.0001
Log likelihood = -22.332276                       Pseudo R2       =     0.3001

    injuries       Coef.   Std. Err.      z     P>|z|     [95% Conf. Interval]
    XYZowned    .6840667   .3895877     1.76    0.079    -.0795111    1.447645
         lnN    1.424169   .3725155     3.82    0.000     .6940517    2.154285
       _cons    4.863891   .7090501     6.86    0.000     3.474178    6.253603
In this case, rather than specifying the exposure() option, we explicitly included the variable that would normalize for exposure in the model. We did not specify the irr option, so we see coefficients rather than incidence rate ratios. We started with the model

$$ \text{rate} = e^{\beta_0 + \beta_1 \mathrm{XYZowned}} $$

The observed counts are therefore

$$ \text{count} = n \, e^{\beta_0 + \beta_1 \mathrm{XYZowned}} = e^{\ln(n) + \beta_0 + \beta_1 \mathrm{XYZowned}} $$

which amounts to constraining the coefficient on ln(n) to 1. This is what was estimated when we specified the exposure(n) option. In the model above, we included the normalizing exposure ourselves and, rather than constraining the coefficient to be 1, estimated the coefficient.

The estimated coefficient on ln(n) is 1.42, a respectable distance away from 1, and consistent with our speculation that larger airlines also use larger airplanes. With this small amount of data, however, we also have a wide confidence interval that includes 1.

Our estimated coefficient on XYZowned is now .684, and the implied incidence rate ratio is e^.684 ≈ 1.98 (which we could also see by typing poisson, irr). The 95% confidence interval for the coefficient still includes 0 (the interval for the incidence rate ratio includes 1), so while the point estimate is now larger, we still cannot be very certain of our results.

Our expert opinion would be that, while there is insufficient evidence to support the charge, there is enough evidence to justify collecting more data.
> Example

In a famous age-specific study of coronary disease deaths among male British doctors, Doll and Hill (1966) reported the following data (reprinted in Rothman and Greenland 1998, 259):

                        Smokers                  Nonsmokers
    Age         Deaths   Person-years      Deaths   Person-years
    35-44           32         52,407           2         18,790
    45-54          104         43,248          12         10,673
    55-64          206         28,612          28          5,710
    65-74          186         12,663          28          2,585
    75-84          102          5,317          31          1,462

The first step is to enter these data into Stata, which we have done:

. list

        agecat   smokes   deaths   pyears
   1.        1        1       32   52,407
   2.        2        1      104   43,248
   3.        3        1      206   28,612
   4.        4        1      186   12,663
   5.        5        1      102    5,317
   6.        1        0        2   18,790
   7.        2        0       12   10,673
   8.        3        0       28    5,710
   9.        4        0       28    2,585
  10.        5        0       31    1,462
agecat 1 corresponds to 35-44, agecat 2 to 45-54, and so on. The most "natural" analysis of these data would begin with introducing indicator variables for each age category and a single indicator for smoking:

. tab agecat, gen(a)

     agecat        Freq.     Percent        Cum.
          1            2       20.00       20.00
          2            2       20.00       40.00
          3            2       20.00       60.00
          4            2       20.00       80.00
          5            2       20.00      100.00
      Total           10      100.00

. poisson deaths smokes a2-a5, exposure(pyears) irr

Iteration 0:   log likelihood = -33.823284
Iteration 1:   log likelihood = -33.600471
Iteration 2:   log likelihood = -33.600153
Iteration 3:   log likelihood = -33.600153
Poisson regression                                Number of obs   =         10
                                                  LR chi2(5)      =     922.93
                                                  Prob > chi2     =     0.0000
Log likelihood = -33.600153                       Pseudo R2       =     0.9321

      deaths         IRR   Std. Err.      z     P>|z|     [95% Conf. Interval]
      smokes    1.425519   .1530838     3.30    0.001     1.154984    1.759421
          a2    4.410584   .8605197     7.61    0.000     3.009011    6.464997
          a3     13.8392   2.542638    14.30    0.000     9.654328    19.83809
          a4    28.51678   5.269878    18.13    0.000     19.85177    40.96395
          a5    40.45121   7.775511    19.25    0.000     27.75326    58.95885
      pyears  (exposure)

. poisgof

        Goodness-of-fit chi2  =  12.13244
        Prob > chi2(4)        =    0.0164

In the above, we began by using tabulate to create the indicator variables. tabulate created a1 equal to 1 when agecat = 1 and 0 otherwise; a2 equal to 1 when agecat = 2 and 0 otherwise; and so on. See [U] 28 Commands for dealing with categorical variables.

We then estimated our model, specifying irr to obtain incidence rate ratios. We estimate that smokers have 1.43 times the mortality rate of nonsmokers. We also notice, however, that the model does not appear to fit the data well; the goodness-of-fit χ² tells us that, given the model, we can reject the hypothesis that these data are Poisson distributed at the 1.64% significance level. So let us now back up and be more careful. We can most easily obtain the incidence rate ratios within age categories using ir; see [R] epitab:
. ir deaths smokes pyears, by(agecat) nocrude nohet

          agecat        IRR    [95% Conf. Interval]    M-H Weight
               1   5.736638   1.463519   49.39901        1.472169  (exact)
               2   2.138812   1.173668   4.272307        9.624747  (exact)
               3    1.46824   .9863626   2.264174        23.34176  (exact)
               4    1.35606   .9082155    2.09649        23.25315  (exact)
               5   .9047304   .6000946   1.399699        24.31435  (exact)
    M-H combined   1.424882   1.154704   1.757784

We find that the mortality incidence rate ratios are greatly different within age category, being highest for the youngest categories and actually dropping below 1 for the oldest. (In the last case, we might argue that those who smoke and who have not died by age 75 are self-selected to be particularly robust.) Seeing this, we will now parameterize the smoking effects separately for each age category, although we will begin by combining age categories 3 and 4:

. gen sa1 = smokes*(agecat==1)
. gen sa2 = smokes*(agecat==2)
. gen sa34 = smokes*(agecat==3 | agecat==4)
. gen sa5 = smokes*(agecat==5)

. poisson deaths sa1 sa2 sa34 sa5 a2-a5, exposure(pyears) irr

Iteration 0:   log likelihood = -31.635422
Iteration 1:   log likelihood = -27.788819
Iteration 2:   log likelihood = -27.573604
Iteration 3:   log likelihood = -27.572645
Iteration 4:   log likelihood = -27.572645

Poisson regression                                Number of obs   =         10
                                                  LR chi2(8)      =     934.99
                                                  Prob > chi2     =     0.0000
Log likelihood = -27.572645                       Pseudo R2       =     0.9443

      deaths         IRR   Std. Err.      z     P>|z|     [95% Conf. Interval]
         sa1    5.736638   4.181257     2.40    0.017     1.374811    23.93711
         sa2    2.138812   .6520701     2.49    0.013     1.176691    3.887609
        sa34    1.412229   .2017485     2.42    0.016     1.067343    1.868557
         sa5    .9047304   .1855513    -0.49    0.625     .6052658     1.35236
          a2     10.5631   8.067702     3.09    0.002     2.364153    47.19624
          a3      47.671    34.3741     5.36    0.000     11.60056    195.8978
          a4    98.22766   70.85013     6.36    0.000     23.89324    403.8245
          a5      199.21   145.3357     7.26    0.000     47.67694     832.365
      pyears  (exposure)

. poisgof

        Goodness-of-fit chi2  =  .0774185
        Prob > chi2(1)        =    0.7808
Note that the goodness-of-fit χ² is now small; we are no longer running roughshod over the data. Let us now consider simplifying the model. The point estimate of the incidence rate ratio for smoking in age category 1 is much larger than that for smoking in age category 2, but the confidence interval for sa1 is similarly wide. Is the difference real?

. test sa1=sa2

 ( 1)  [deaths]sa1 - [deaths]sa2 = 0.0

           chi2(  1) =    1.56
         Prob > chi2 =    0.2117

The point estimates may be far apart, but there is insufficient data and we may be observing random differences. With that success, might we also combine the smokers in age categories 3 and 4 with those in 1 and 2?

. test sa34=sa2, accum

 ( 1)  [deaths]sa1 - [deaths]sa2 = 0.0
 ( 2)  - [deaths]sa2 + [deaths]sa34 = 0.0

           chi2(  2) =    4.73
         Prob > chi2 =    0.0938

Combining age categories 1 through 4 may be overdoing it; the 9.38% significance level is enough to stop us, although others may disagree.
Thus, we now estimate our final model:

. gen sa12 = (sa1|sa2)
. poisson deaths sa12 sa34 sa5 a2-a5, exposure(pyears) irr

Iteration 0:   log likelihood = -31.967194
Iteration 1:   log likelihood = -28.524666
Iteration 2:   log likelihood = -28.514535
Iteration 3:   log likelihood = -28.514535

Poisson regression                                Number of obs   =         10
                                                  LR chi2(7)      =     933.11
                                                  Prob > chi2     =     0.0000
Log likelihood = -28.514535                       Pseudo R2       =     0.9424

      deaths         IRR   Std. Err.      z     P>|z|     [95% Conf. Interval]
        sa12    2.636259   .7408403     3.45    0.001     1.519791    4.572907
        sa34    1.412229   .2017485     2.42    0.016     1.067343    1.868557
         sa5    .9047304   .1855513    -0.49    0.625     .6052658     1.35236
          a2    4.294559   .8385329     7.46    0.000     2.928987    6.296797
          a3    23.42263   7.787716     9.49    0.000     12.20738    44.94164
          a4    48.26309   16.06939    11.64    0.000     25.13068    92.68856
          a5    97.87965   34.30881    13.08    0.000     49.24123     194.561
      pyears  (exposure)

The above strikes us as a fair representation of the data.
Saved Results

poisson saves in e():

Scalars
    e(N)          number of observations
    e(k)          number of variables
    e(k_eq)       number of equations
    e(k_dv)       number of dependent variables
    e(ll_0)       log likelihood, constant-only model
    e(N_clust)    number of clusters
    e(rc)         return code
    e(chi2)       chi-squared
    e(df_m)       model degrees of freedom
    e(p)          significance
    e(r2_p)       pseudo R-squared
    e(ll)         log likelihood
    e(ic)         number of iterations
    e(rank)       rank of e(V)

Macros
    e(cmd)        poisson
    e(user)       name of likelihood-evaluator program
    e(depvar)     name of dependent variable
    e(title)      title in estimation output
    e(opt)        type of optimization
    e(chi2type)   Wald or LR; type of model chi-squared test
    e(wtype)      weight type
    e(wexp)       weight expression
    e(offset)     offset
    e(predict)    program used to implement predict
    e(clustvar)   name of cluster variable
    e(vcetype)    covariance estimation method
    e(cnslist)    constraint numbers

Matrices
    e(b)          coefficient vector
    e(V)          variance-covariance matrix of the estimators
    e(ilog)       iteration log (up to 20 iterations)

Functions
    e(sample)     marks estimation sample
Methods and Formulas

poisson and poisgof are implemented as ado-files.

The probability of observing a count $y$ with mean $\lambda$ is

$$ \Pr(Y = y) = \frac{e^{-\lambda}\lambda^{y}}{y!} $$

With linear predictor

$$ \xi_i = \mathbf{x}_i\boldsymbol\beta + \text{offset}_i $$

the log likelihood (with weights $w_i$ and offsets) and scores are given by

$$ \ln L = \sum_{i=1}^{n} w_i \left\{ -e^{\xi_i} + \xi_i y_i - \ln(y_i!) \right\} $$

$$ \text{score}(\boldsymbol\beta)_i = y_i - e^{\xi_i} $$
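The log likelihood above is straightforward to evaluate directly; a minimal sketch for a single-covariate model (illustrative only, not the ml evaluator poisson actually uses):

```python
import math

def poisson_loglik(beta, xs, ys, offsets=None, weights=None):
    """Weighted Poisson log likelihood for a one-covariate model.

    xi_i = beta[0] + beta[1]*x_i + offset_i, matching
    lnL = sum_i w_i { -exp(xi_i) + xi_i*y_i - ln(y_i!) }.
    """
    n = len(xs)
    offsets = offsets or [0.0] * n
    weights = weights or [1.0] * n
    ll = 0.0
    for x, y, off, w in zip(xs, ys, offsets, weights):
        xi = beta[0] + beta[1] * x + off
        # lgamma(y + 1) = ln(y!), which also works for large counts.
        ll += w * (-math.exp(xi) + xi * y - math.lgamma(y + 1))
    return ll

# At beta = (0, 0) with no offset, every rate is exp(0) = 1.
ll = poisson_loglik((0.0, 0.0), xs=[1.0, 2.0], ys=[0, 1])
```

Passing ln(exposure) as the offset reproduces the exposure() behavior described in the Remarks.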
References

Bortkewitsch, L. von. 1898. Das Gesetz der kleinen Zahlen. Leipzig: Teubner.

Cameron, A. C. and P. K. Trivedi. 1998. Regression Analysis of Count Data. Cambridge: Cambridge University Press.

Chatterjee, S., A. S. Hadi, and B. Price. 2000. Regression Analysis by Example. 3d ed. New York: John Wiley & Sons.

Clarke, R. D. 1946. An application of the Poisson distribution. Journal of the Institute of Actuaries 72: 481.

Coleman, J. S. 1964. Introduction to Mathematical Sociology. New York: Free Press.

Doll, R. and A. B. Hill. 1966. Mortality of British doctors in relation to smoking: observations on coronary thrombosis. In Epidemiological Approaches to the Study of Cancer and Other Chronic Diseases, ed. W. Haenszel. National Cancer Institute Monograph 19: 204-268.

Feller, W. 1968. An Introduction to Probability Theory and Its Applications. vol. 1. 3d ed. New York: John Wiley & Sons.

Hilbe, J. 1998. sg91: Robust variance estimators for MLE Poisson and negative binomial regression. Stata Technical Bulletin 45: 26-28. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 177-180.

——. 1999. sg102: Zero-truncated Poisson and negative binomial regression. Stata Technical Bulletin 47: 37-40. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 233-236.

Hilbe, J. and D. H. Judson. 1998. sg94: Right, left, and uncensored Poisson regression. Stata Technical Bulletin 46: 18-20. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 186-189.

Long, J. S. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.

McNeil, D. 1996. Epidemiological Research Methods. Chichester, England: John Wiley & Sons.

Poisson, S. D. 1837. Recherches sur la probabilité des jugements en matière criminelle et en matière civile, précédées des règles générales du calcul des probabilités. Paris: Bachelier.

Rodríguez, G. 1993. sbe10: An improvement to poisson. Stata Technical Bulletin 11: 11-14. Reprinted in Stata Technical Bulletin Reprints, vol. 2, pp. 94-98.

Rogers, W. H. 1991. sbe1: Poisson regression with rates. Stata Technical Bulletin 1: 11-12. Reprinted in Stata Technical Bulletin Reprints, vol. 1, pp. 62-64.

Rothman, K. J. and S. Greenland. 1998. Modern Epidemiology. 2d ed. Philadelphia: Lippincott-Raven.

Rutherford, E., J. Chadwick, and C. D. Ellis. 1930. Radiations from Radioactive Substances. Cambridge: Cambridge University Press.

Selvin, S. 1995. Practical Biostatistical Methods. Belmont, CA: Duxbury Press.

——. 1996. Statistical Analysis of Epidemiologic Data. 2d ed. New York: Oxford University Press.

Thorndike, F. 1926. Applications of Poisson's probability summation. Bell System Technical Journal 5: 604-624.

Tobias, A. and M. J. Campbell. 1998. sts13: Time-series regression for counts allowing for autocorrelation. Stata Technical Bulletin 46: 33-37. Reprinted in Stata Technical Bulletin Reprints, vol. 8, pp. 291-296.
Also See

Complementary:  [R] adjust, [R] constraint, [R] lincom, [R] linktest, [R] lrtest, [R] mfx,
                [R] predict, [R] sw, [R] test, [R] testnl, [R] vce, [R] xi

Related:        [R] epitab, [R] glm, [R] nbreg, [R] svy estimators, [R] xtpois

Background:     [U] 16.5 Accessing coefficients and standard errors,
                [U] 23 Estimation and post-estimation commands,
                [U] 23.11 Obtaining robust variance estimates,
                [U] 23.12 Obtaining scores
Title

pperron -- Phillips-Perron test for unit roots

Syntax

pperron varname [if exp] [in range] [, noconstant lags(#) trend regress]

pperron is for use with time-series data; see [R] tsset. You must tsset your data before using pperron.

varname may contain time-series operators; see [U] 14.4.3 Time-series varlists.

Description

pperron performs the Phillips-Perron test for unit roots on a variable. The user may optionally exclude the constant, include a trend term, and/or include lagged values of the difference of the variable in the regression.
noconstant
st_ppressesthe constant term (intercep0 in the model.
tags (#) specit_esthe number of Newey-West lags io use in the calculation of the standard error. •
i
/
trend speclfie.,tthat a trend!term should be included in *..heassociated regression. This option may not be speci_ed if nocon_tant is specified. regress speci_es that the ",lssociatedregression table should appear in the output. By default, the re_ression t_ble is not pr_luced,
|
I ! f_
i l I
I i
Remarks
,i
Hamilton (I_94) and Fuller (1976) give excellent overviews of this topic; see especially chapter 17 of the forme_r.Phillips (1_86) and Phillips and Pe_on (1988) present statistics for testing whether a time series h_d a unit-roottautoregressive,component.
Example
:
ii
Here, we use the international airline passengers dataset (Box, Jenkins, and Reinsel 1994, Series G). This dataset has 144 observations on the monthly number of international airline passengers from 1949 through 1960.

. pperron air

Phillips-Perron test for unit root            Number of obs   =       143
                                              Newey-West lags =         4

                              Interpolated Dickey-Fuller
               Test        1% Critical    5% Critical    10% Critical
             Statistic        Value          Value           Value
  Z(rho)       -6.564        -19.943        -13.786         -11.057
  Z(t)         -1.844         -3.496         -2.887          -2.577

* MacKinnon approximate p-value for Z(t) = 0.3588

Note that we fail to reject the hypothesis that there is a unit root in this time series by looking either at the MacKinnon approximate asymptotic p-value or at the interpolated Dickey-Fuller critical values.
> Example

In this example, we examine the Canadian lynx data from Newton (1988, 587). Here we include a time trend in the calculation of the statistic.

. pperron lynx, trend

Phillips-Perron test for unit root            Number of obs   =       113
                                              Newey-West lags =         4

                              Interpolated Dickey-Fuller
               Test        1% Critical    5% Critical    10% Critical
             Statistic        Value          Value           Value
  Z(rho)      -38.365        -27.487        -20.752         -17.543
  Z(t)         -4.585         -4.036         -3.448          -3.148

* MacKinnon approximate p-value for Z(t) = 0.0011

We reject the hypothesis that there is a unit root in this time series.
Saved Results

pperron saves in r():

Scalars
    r(N)       number of observations
    r(lags)    number of lagged differences used
    r(pval)    MacKinnon approximate p-value (not included if noconstant specified)
    r(Zt)      Phillips-Perron τ test statistic
    r(Zrho)    Phillips-Perron ρ test statistic

Methods and Formulas

pperron is implemented as an ado-file.
In the OLS estimation of an AR(1) process with Gaussian errors,

$$ y_i = \rho y_{i-1} + \epsilon_i $$

where the $\epsilon_i$ are independent and identically distributed as $N(0, \sigma^2)$ and $y_0 = 0$, the OLS estimate (based on an n-observation time series) of the autocorrelation parameter $\rho$ is given by

$$ \hat\rho_n = \frac{\sum_{i=1}^{n} y_{i-1} y_i}{\sum_{i=1}^{n} y_{i-1}^2} $$

We know that if $|\rho| < 1$, then $\sqrt{n}(\hat\rho_n - \rho) \to N(0, 1 - \rho^2)$. If this result were valid for the case $\rho = 1$, the resulting distribution would collapse to a point mass (the variance would be zero).
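The OLS autocorrelation estimate above reduces to a ratio of two sums; a minimal sketch (illustrative only):

```python
def rho_hat(y):
    """OLS estimate of rho in y_i = rho*y_{i-1} + eps_i.

    Implements rho_hat_n = sum(y_{i-1}*y_i) / sum(y_{i-1}^2).
    """
    num = sum(a * b for a, b in zip(y[:-1], y[1:]))
    den = sum(a * a for a in y[:-1])
    return num / den

# For an exact AR(1) path with no noise, the estimate recovers rho.
path = [1.0, 0.5, 0.25, 0.125, 0.0625]  # rho = 0.5, eps = 0
r = rho_hat(path)
```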
It is this motivation that drives one to check for the possibility of a unit root in an autoregressive process. In order to compute the test statistics, we compute the Phillips-Perron regression

$$ y_i = \alpha + \rho y_{i-1} + \epsilon_i $$

where we may exclude the constant or include a trend term $i$. There are two statistics, $Z_\rho$ and $Z_\tau$, calculated as

$$ Z_\rho = n(\hat\rho_n - 1) - \frac{n^2 \hat\sigma^2}{2 s_n^2}\left(\hat\lambda_n^2 - \hat\gamma_{0,n}\right) $$

$$ Z_\tau = \sqrt{\frac{\hat\gamma_{0,n}}{\hat\lambda_n^2}}\,\frac{\hat\rho_n - 1}{\hat\sigma} - \frac{1}{2}\left(\hat\lambda_n^2 - \hat\gamma_{0,n}\right)\frac{n\hat\sigma}{\hat\lambda_n s_n} $$

$$ \hat\gamma_{j,n} = \frac{1}{n}\sum_{i=j+1}^{n} \hat u_i \hat u_{i-j} $$

$$ \hat\lambda_n^2 = \hat\gamma_{0,n} + 2\sum_{j=1}^{q}\left(1 - \frac{j}{q+1}\right)\hat\gamma_{j,n} $$

$$ s_n^2 = \frac{1}{n-k}\sum_{i=1}^{n} \hat u_i^2 $$

where $\hat u_i$ is the OLS residual, $k$ is the number of covariates in the regression, $q$ is the number of Newey-West lags to use in the calculation of $\hat\lambda_n$, and $\hat\sigma$ is the OLS standard error of $\hat\rho$.

The critical values (which have the same distribution as the Dickey-Fuller statistic; see Dickey and Fuller 1979) included in the output are linearly interpolated from the table of values that appear in Fuller (1976), and the MacKinnon approximate p-values use the regression surface published in MacKinnon (1994).
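The autocovariance and Newey-West long-run variance pieces above can be sketched directly (illustrative only, not pperron's ado code):

```python
def gamma_hat(u, j):
    """Sample autocovariance: gamma_hat_{j,n} = (1/n) * sum u_i * u_{i-j}."""
    n = len(u)
    return sum(u[i] * u[i - j] for i in range(j, n)) / n

def lambda2_hat(u, q):
    """Newey-West long-run variance with Bartlett weights (1 - j/(q+1))."""
    total = gamma_hat(u, 0)
    for j in range(1, q + 1):
        total += 2.0 * (1.0 - j / (q + 1)) * gamma_hat(u, j)
    return total

# With q = 0 lags, the estimate is just the sample second moment.
u = [1.0, -1.0, 1.0, -1.0]
lam2 = lambda2_hat(u, 0)
```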
References

Box, G. E. P., G. M. Jenkins, and G. C. Reinsel. 1994. Time Series Analysis: Forecasting and Control. 3d ed. Englewood Cliffs, NJ: Prentice-Hall.

Dickey, D. A. and W. A. Fuller. 1979. Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association 74: 427-431.

Fuller, W. A. 1976. Introduction to Statistical Time Series. New York: John Wiley & Sons.

Hakkio, C. S. 1994. sts6: Approximate p-values for unit root and cointegration tests. Stata Technical Bulletin 17: 25-28. Reprinted in Stata Technical Bulletin Reprints, vol. 3, pp. 219-224.

Hamilton, J. D. 1994. Time Series Analysis. Princeton: Princeton University Press.

MacKinnon, J. G. 1994. Approximate asymptotic distribution functions for unit-root and cointegration tests. Journal of Business and Economic Statistics 12: 167-176.

Newton, H. J. 1988. TIMESLAB: A Time Series Laboratory. Pacific Grove, CA: Wadsworth & Brooks/Cole.

Phillips, P. C. B. 1987. Time series regression with a unit root. Econometrica 55: 277-301.

Phillips, P. C. B. and P. Perron. 1988. Testing for a unit root in time series regression. Biometrika 75: 335-346.
Also See

Complementary:  [R] tsset

Related:        [R] dfuller

Title

prais -- Prais-Winsten regression and Cochrane-Orcutt regression
Syntax

prais depvar [varlist] [if exp] [in range] [, corc ssesearch rhotype(rhomethod)
    twostep robust cluster(varname) hc2 hc3 noconstant hascons savespace
    nodw level(#) nolog maximize_options]

prais is for use with time-series data; see [R] tsset. You must tsset your data before using prais.

depvar and varlist may contain time-series operators; see [U] 14.4.3 Time-series varlists.

prais shares the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

Syntax for predict

predict [type] newvarname [if exp] [in range] [, { xb | residuals | stdp } ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.

Description

prais estimates a linear regression of depvar on varlist that is corrected for first-order serially-correlated residuals using the Prais-Winsten (1954) transformed regression estimator, the Cochrane-Orcutt (1949) transformed regression estimator, or a version of the search method suggested by Hildreth and Lu (1960).

Options

corc specifies that the Cochrane-Orcutt transformation be used to estimate the equation. With this option, the Prais-Winsten transformation of the first observation is not performed, and the first observation is dropped when estimating the transformed equation; see Methods and Formulas below.

ssesearch specifies that a search be performed for the value of ρ that minimizes the sum of squared errors of the transformed equation (Cochrane-Orcutt or Prais-Winsten transformation). The search method employed is a combination of quadratic and modified bisection search using golden sections.
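The Cochrane-Orcutt and Prais-Winsten transformations differ only in how the first observation is handled. A minimal sketch of the transformed series, assuming ρ is already known (the actual prais command iterates to estimate it):

```python
import math

def transform(y, rho, prais_winsten=True):
    """Quasi-difference a series: y*_t = y_t - rho*y_{t-1} for t >= 2.

    Prais-Winsten keeps the first observation as sqrt(1 - rho^2)*y_1;
    Cochrane-Orcutt drops it, losing one observation.
    """
    out = [y[t] - rho * y[t - 1] for t in range(1, len(y))]
    if prais_winsten:
        out.insert(0, math.sqrt(1.0 - rho ** 2) * y[0])
    return out

y = [2.0, 3.0, 5.0]
pw = transform(y, 0.5)            # all 3 observations retained
co = transform(y, 0.5, False)     # first observation dropped
```

The retained first observation is what gives Prais-Winsten its small-sample advantage, as the Remarks below discuss.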
rhotype(rhomethod) selects a specific computation for the autocorrelation parameter ρ, where rhomethod can be

    regress    ρ_reg = β from the residual regression e_t = β e_{t-1}
    freg       ρ_freg = β from the residual regression e_t = β e_{t+1}
    tscorr     ρ_tscorr = e'e_{t-1}/e'e, where e is the vector of residuals
    dw         ρ_dw = 1 - dw/2, where dw is the Durbin-Watson d statistic
    theil      ρ_theil = ρ_tscorr (N - k)/N
    nagar      ρ_nagar = (ρ_dw N² + k²)/(N² - k²)

The prais estimator can use any consistent estimate of ρ to transform the equation, and each of these estimates meets that requirement. The default is regress, and it produces the minimum sum of squares solution (ssesearch option) for the Cochrane-Orcutt transformation; no computation will produce the minimum sum of squares solution for the full Prais-Winsten transformation. See Judge, Griffiths, Hill, Lütkepohl, and Lee (1985) for a discussion of each of the estimates of ρ.

twostep
specifies that prais will stop on the first iteration after the equation is transformed by ρ; this is the two-step efficient estimator. Although it is customary to iterate these estimators to convergence, they are efficient at each step.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation. robust combined with cluster() further allows observations which are not independent within cluster (although they must be independent between clusters). See [U] 23.11 Obtaining robust variance estimates.

Note that all estimates from prais are conditional on the estimated value of ρ. This means that robust variance estimates in this case are only robust to heteroskedasticity and are not generally robust to misspecification of the functional form or omitted variables. The estimation of the functional form is intertwined with the estimate of ρ, and all estimates are conditional on ρ. Thus, we cannot be robust to misspecification of functional form. For these reasons, it is probably best to interpret robust in the spirit of White's (1980) original paper on estimation of heteroskedastic-consistent covariance matrices.

cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients. Specifying cluster() implies robust.

hc2 and hc3 specify an alternative bias correction for the robust variance calculation; for more information, see [R] regress. hc2 and hc3 may not be specified with cluster(). Specifying hc2 or hc3 implies robust.

hascons indicates that a user-defined constant, or set of variables that in linear combination form a constant, has been included in the regression. For some computational concerns, see the discussion in [R] regress.

savespace specifies that prais attempt to save as much space as possible by retaining only those variables required for estimation. The original data is restored after estimation. This option is rarely used and should generally be used only if there is insufficient space to estimate a model without the option.

nodw suppresses reporting of the Durbin-Watson statistic.

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

nolog suppresses the iteration log.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them.
Options for predict

xb, the default, calculates the fitted values, that is, the prediction of x_j b for the specified equation. This is the linear predictor from the estimated regression model; it does not apply the estimate of ρ to prior residuals.

residuals calculates the residuals from the linear prediction.

stdp calculates the standard error of the prediction for the specified equation. It can be thought of as the standard error of the predicted expected value or mean for the observation's covariate pattern. This is also referred to as the standard error of the fitted value. As computed for prais, this is strictly the standard error from the variance in the estimates of the parameters of the linear model, under the assumption that ρ is estimated without error.
process. Under this
Yt --=xtfl + ut where the errors satisfy ?_t -'=- P U¢--I
and the et are independent and identically error term e may then be written as
if,
1 1
distributed
_
et
as N(O, 02). The covariance
1 p
p 1
f12 p
... . ..
pT-1 pT-2
p2
p
1
...
p T- 3
pT-3
...
1
matrix if" of the
p2 pT-1
pT-2
The Prais-Winsten estimator is a generalized least squares (GLS) estimator. ]'he Prais-Winsten method (as described in Judge et al. 1985) is derived from the AR(1) model Jbr the enor term described above. Whereas the Cochrane-Orcutt method uses a lag definition and loses the first observation in the iterative method, the Prais-Winsten method preserves that first observation. In small samples, this can be a significant advantage.
Q TechnicalNote To estimate a model with autocorrelated errors, you must specify your data as time series and have (or create) a variable denoting the time at which an observation was collected. The data for the regression should be equalty spaced in time. Q
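The banded structure of Ψ, with entries ρ^|i−j|, is easy to generate; a minimal sketch (illustrative only):

```python
def ar1_cov(rho, T):
    """AR(1) error covariance matrix, up to the sigma^2 factor:
    Psi[i][j] = rho**|i-j| / (1 - rho**2)."""
    scale = 1.0 / (1.0 - rho ** 2)
    return [[scale * rho ** abs(i - j) for j in range(T)] for i in range(T)]

psi = ar1_cov(0.5, 3)
# The diagonal entries all equal 1/(1 - 0.5**2) = 4/3.
```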
> Example

You wish to estimate a time-series model of usr on idle but are concerned that the residuals may be serially correlated. We will declare the variable t to represent time by typing

. tsset t

We can obtain Cochrane-Orcutt estimates by specifying the corc option:

. prais usr idle, corc

Iteration 0:  rho = 0.0000
Iteration 1:  rho = 0.3518
 (output omitted)
Iteration 13: rho = 0.5708

Cochrane-Orcutt AR(1) regression -- iterated estimates

      Source        SS       df        MS              Number of obs =      29
       Model   40.1309584     1   40.1309584           F(  1,    27) =    6.49
    Residual   166.898474    27   6.18142498           Prob > F      =  0.0168
                                                       R-squared     =  0.1938
       Total   207.029433    28   7.39390831           Adj R-squared =  0.1640
                                                       Root MSE      =  2.4862

         usr       Coef.   Std. Err.      t     P>|t|     [95% Conf. Interval]
        idle   -.1254511   .0492356    -2.55    0.017    -.2264742     -.024428
       _cons    14.54641   4.272299     3.40    0.002      5.78036     23.31245

Durbin-Watson statistic (original)     1.295766
Durbin-Watson statistic (transformed)  1.466222

The estimated model is

    usr_t = -.1254 idle_t + 14.55 + u_t    and    u_t = .5708 u_{t-1} + e_t
We can also estimate the model with the Prais-Winsten method:

    . prais usr idle

    Iteration 0:   rho = 0.0000
    Iteration 1:   rho = 0.3518
     (output omitted)
    Iteration 14:  rho = 0.5535

    Prais-Winsten AR(1) regression -- iterated estimates

          Source |       SS       df       MS           Number of obs =      30
    -------------+------------------------------        F(  1,    28) =    7.12
           Model |  43.0076941     1  43.0076941        Prob > F      =  0.0125
        Residual |  169.165739    28  6.04163354        R-squared     =  0.2027
    -------------+------------------------------        Adj R-squared =  0.1742
           Total |  212.173433    29  7.31632528        Root MSE      =   2.458

    ------------------------------------------------------------------------------
             usr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            idle |  -.1356522   .0472195    -2.87   0.008    -.2323769   -.0389275
           _cons |   15.20415   4.160391     3.65   0.001     6.681978    23.72633
    ------------------------------------------------------------------------------
    Durbin-Watson statistic (original)     1.295766
    Durbin-Watson statistic (transformed)  1.476004

where the Prais-Winsten estimated model is

    usr_t = -.1357 idle_t + 15.20 + u_t      and      u_t = .5535 u_{t-1} + e_t

As the results indicate, for these data there is little to choose between the Cochrane-Orcutt and Prais-Winsten estimators, whereas the OLS estimate of the slope parameter is substantially different.
Example

We have data on quarterly sales, in millions of dollars, for five years, and we would like to use this information to model sales for company X. First, we estimate a linear model by OLS and obtain the Durbin-Watson statistic using dwstat; see [R] regression diagnostics.

    . regress csales isales

          Source |       SS       df       MS           Number of obs =       20
    -------------+------------------------------        F(  1,    18) = 14888.15
           Model |  110.256901     1  110.256901        Prob > F      =   0.0000
        Residual |  .133302302    18  .007405683        R-squared     =   0.9988
    -------------+------------------------------        Adj R-squared =   0.9987
           Total |  110.390204    19  5.81001072        Root MSE      =   .08606

    ------------------------------------------------------------------------------
          csales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          isales |   .1762828   .0014447   122.02   0.000     .1732475    .1793181
           _cons |  -1.454753   .2141461    -6.79   0.000    -1.904657   -1.004849
    ------------------------------------------------------------------------------

    . dwstat

    Durbin-Watson d-statistic(  2,    20) = .7347276

Noting that the Durbin-Watson statistic is far from 2 (the expected value under the null hypothesis of no serial correlation) and well below the 5% lower limit of 1.2, we conclude that the disturbances are serially correlated. (Upper and lower bounds for the d statistic can be found in most econometrics texts, e.g., Harvey 1993. The bounds have been derived for only a limited combination of regressors and observations.) We correct for the autocorrelation using the ssesearch option of prais to search for the value of ρ that minimizes the sum of squared residuals of the Cochrane-Orcutt transformed equation. Normally the default Prais-Winsten transformations would be used with such a small dataset, but the less efficient Cochrane-Orcutt transformation will allow us to demonstrate an aspect of the estimator's convergence.
    . prais csales isales, corc ssesearch

    Iteration 1:   rho = 0.8944 , criterion = -.07298558
    Iteration 2:   rho = 0.8944 , criterion = -.07298558
     (output omitted)
    Iteration 15:  rho = 0.9588 , criterion = -.07167037

    Cochrane-Orcutt AR(1) regression -- SSE search estimates

          Source |       SS       df       MS           Number of obs =       19
    -------------+------------------------------        F(  1,    17) =   553.14
           Model |  2.33199178     1  2.33199178        Prob > F      =   0.0000
        Residual |  .071670369    17  .004215904        R-squared     =   0.9702
    -------------+------------------------------        Adj R-squared =   0.9684
           Total |  2.40366215    18  .133536786        Root MSE      =   .06493

    ------------------------------------------------------------------------------
          csales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          isales |   .1605233   .0068253    23.52   0.000     .1461233    .1749234
           _cons |   1.738946   1.432674     1.21   0.241    -1.283732    4.761624
    -------------+----------------------------------------------------------------
             rho |   .9588209
    ------------------------------------------------------------------------------
    Durbin-Watson statistic (original)     0.734728
    Durbin-Watson statistic (transformed)  1.724419
It was noted in the Options section that, with the default computation of ρ, the Cochrane-Orcutt method produces an estimate of ρ that minimizes the sum of squared residuals -- the same criterion as the ssesearch option. Given that the two methods produce the same results, why would the search method ever be preferred? It turns out that the back-and-forth iterations employed by Cochrane-Orcutt can often have difficulty converging if the value of ρ is large. Using the same data, the Cochrane-Orcutt iterative procedure requires over 350 iterations to converge, and a higher tolerance must be specified to prevent premature convergence:

    . prais csales isales, corc tol(1e-9) iterate(500)

    Iteration 0:    rho = 0.0000
    Iteration 1:    rho = 0.5312
    Iteration 2:    rho = 0.5866
    Iteration 3:    rho = 0.7161
    Iteration 4:    rho = 0.7373
    Iteration 5:    rho = 0.7550
     (output omitted)
    Iteration 377:  rho = 0.9588
    Iteration 378:  rho = 0.9588
    Iteration 379:  rho = 0.9588

    Cochrane-Orcutt AR(1) regression -- iterated estimates

          Source |       SS       df       MS           Number of obs =       19
    -------------+------------------------------        F(  1,    17) =   553.14
           Model |  2.33199178     1  2.33199178        Prob > F      =   0.0000
        Residual |  .071670369    17  .004215904        R-squared     =   0.9702
    -------------+------------------------------        Adj R-squared =   0.9684
           Total |  2.40366215    18  .133536786        Root MSE      =   .06493

    ------------------------------------------------------------------------------
          csales |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          isales |   .1605233   .0068253    23.52   0.000     .1461233    .1749234
           _cons |   1.738946   1.432674     1.21   0.241    -1.283732    4.761625
    -------------+----------------------------------------------------------------
             rho |   .9588209
    ------------------------------------------------------------------------------
    Durbin-Watson statistic (original)     0.734728
    Durbin-Watson statistic (transformed)  1.724419

Once convergence is achieved, the two methods produce identical results.
Saved Results

prais saves in e():

Scalars
    e(N)           number of observations
    e(mss)         model sum of squares
    e(df_m)        model degrees of freedom
    e(rss)         residual sum of squares
    e(df_r)        residual degrees of freedom
    e(r2)          R-squared
    e(r2_a)        adjusted R-squared
    e(F)           F statistic
    e(rmse)        root mean square error
    e(ll)          log likelihood
    e(N_clust)     number of clusters
    e(rho)         autocorrelation parameter ρ
    e(dw)          Durbin-Watson d statistic of transformed regression
    e(dw_0)        Durbin-Watson d statistic for untransformed regression
    e(tol)         target tolerance
    e(max_ic)      maximum number of iterations
    e(ic)          number of iterations
    e(N_gaps)      number of gaps

Macros
    e(cmd)         prais
    e(depvar)      name of dependent variable
    e(clustvar)    name of cluster variable
    e(rhotype)     method specified in rhotype option
    e(method)      twostep, iterated, or SSE search
    e(vcetype)     covariance estimation method
    e(tranmeth)    corc or prais
    e(cons)        noconstant or not reported
    e(predict)     program used to implement predict

Matrices
    e(b)           coefficient vector
    e(V)           variance-covariance matrix of the estimators

Functions
    e(sample)      marks estimation sample
Methods and Formulas

prais is implemented as an ado-file.

Consider the command 'prais y x z'. The 0-th iteration is obtained by estimating a, b, and c from the standard linear regression:

    y_t = a x_t + b z_t + c + u_t

An estimate of the correlation in the residuals is then obtained. By default, prais uses the auxiliary regression:

    u_t = ρ u_{t-1} + e_t

This can be changed to any of the computations noted in the rhotype() option.

Next we apply a Cochrane-Orcutt transformation (1) for observations t = 2, ..., n

    y_t - ρ y_{t-1} = a(x_t - ρ x_{t-1}) + b(z_t - ρ z_{t-1}) + c(1 - ρ) + e_t        (1)

and the transformation (1') for t = 1

    sqrt(1 - ρ²) y_1 = a sqrt(1 - ρ²) x_1 + b sqrt(1 - ρ²) z_1 + c sqrt(1 - ρ²) + sqrt(1 - ρ²) u_1        (1')

Thus, the differences between the Cochrane-Orcutt and the Prais-Winsten methods are that the latter uses equation (1') in addition to equation (1), whereas the former uses only equation (1) and necessarily decreases the sample size by one.

Equations (1) and (1') are used to transform the data and obtain new estimates of a, b, and c. When the twostep option is specified, the estimation process is halted at this point, and these are the estimates reported. Under the default behavior of iterating to convergence, this process is repeated until the change in the estimate of ρ is within a specified tolerance.

The new estimates are used to produce fitted values

    yhat_t = a x_t + b z_t + c

and then ρ is re-estimated, by default using the regression defined by

    y_t - yhat_t = ρ(y_{t-1} - yhat_{t-1}) + u_t        (2)

We then re-estimate equation (1) using the new estimate of ρ, and continue to iterate between (1) and (2) until the estimate of ρ converges. Convergence is declared after iterate() iterations or when the absolute difference in the estimated correlation between two iterations is less than tol(); see [R] maximize. Sargan (1964) has shown that this process will always converge.

Under the ssesearch option, a combined quadratic and bisection search using golden sections is used to search for the value of ρ that minimizes the sum of squared residuals from the transformed equation. The transformation may be either the Cochrane-Orcutt (1 only) or the Prais-Winsten (1 and 1').

All reported statistics are based on the ρ-transformed variables, and there is an assumption that ρ is estimated without error. See Judge et al. (1985) for details.

The Durbin-Watson d statistic reported by prais and dwstat is

        Σ_{j=1}^{n-1} (u_{j+1} - u_j)²
    d = ------------------------------
        Σ_{j=1}^{n} u_j²

where u_j represents the residual of the jth observation.
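The transformations (1) and (1') and the d statistic are easy to state in code. The Python sketch below is illustrative only -- it is not Stata's ado-code, and the function names are ours. It applies one pass of the transformation for a given ρ and computes the Durbin-Watson d.

```python
import numpy as np

# Illustrative only -- not Stata's implementation.
def pw_transform(y, X, rho, corc=False):
    """Transform (y, X) for GLS under AR(1) errors.

    Rows 2..n use equation (1): z_t - rho*z_{t-1}.
    Row 1 uses equation (1'): sqrt(1 - rho^2)*z_1, and is dropped
    when corc=True, mimicking Cochrane-Orcutt."""
    y = np.asarray(y, float)
    X = np.asarray(X, float)
    ys = y[1:] - rho * y[:-1]
    Xs = X[1:] - rho * X[:-1]
    if corc:
        return ys, Xs                      # n - 1 transformed observations
    w = np.sqrt(1.0 - rho ** 2)
    return np.concatenate([[w * y[0]], ys]), np.vstack([w * X[0], Xs])

def durbin_watson(u):
    """d = sum of squared first differences of residuals over sum of squares."""
    u = np.asarray(u, float)
    return np.sum(np.diff(u) ** 2) / np.sum(u ** 2)
```

Iterating between this transformation and re-estimation of ρ reproduces the back-and-forth scheme described above; the only difference between the two methods is whether the first row is kept.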
Acknowledgment

We thank Richard Dickens of the Centre for Economic Performance at the London School of Economics and Political Science for testing and assistance with an early version of this command.
References

Chatterjee, S., A. S. Hadi, and B. Price. 2000. Regression Analysis by Example. 3d ed. New York: John Wiley & Sons.

Cochrane, D. and G. H. Orcutt. 1949. Application of least-squares regression to relationships containing autocorrelated error terms. Journal of the American Statistical Association 44: 32-61.

Durbin, J. and G. S. Watson. 1950 and 1951. Testing for serial correlation in least-squares regression. Biometrika 37: 409-428 and 38: 159-178.

Hardin, J. W. 1995. sts10: Prais-Winsten regression. Stata Technical Bulletin 25: 26-29. Reprinted in Stata Technical Bulletin Reprints, vol. 5, pp. 234-237.

Harvey, A. C. 1993. The Econometric Analysis of Time Series. Cambridge, MA: MIT Press.

Hildreth, C. and J. Y. Lu. 1960. Demand relations with autocorrelated disturbances. Agricultural Experiment Station Technical Bulletin 276. East Lansing, MI: Michigan State University.

Johnston, J. and J. DiNardo. 1997. Econometric Methods. 4th ed. New York: McGraw-Hill.

Judge, G. G., W. E. Griffiths, R. C. Hill, H. Lütkepohl, and T.-C. Lee. 1985. The Theory and Practice of Econometrics. 2d ed. New York: John Wiley & Sons.

Kmenta, J. 1997. Elements of Econometrics. 2d ed. Ann Arbor: University of Michigan Press.

Prais, S. J. and C. B. Winsten. 1954. Trend Estimators and Serial Correlation. Cowles Commission Discussion Paper No. 383. Chicago.

Sargan, J. D. 1964. Wages and prices in the United Kingdom: a study in econometric methodology. In Econometric Analysis for National Economic Planning, ed. P. E. Hart, G. Mills, and J. K. Whitaker, 25-64. London: Butterworths.

Theil, H. 1971. Principles of Econometrics. New York: John Wiley & Sons.

White, H. 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48: 817-838.
Also See

Complementary:   [R] adjust, [R] lincom, [R] mfx, [R] predict, [R] test, [R] testnl,
                 [R] vce, [R] xi

Related:         [R] regress, [R] regression diagnostics

Background:      [U] 16.5 Accessing coefficients and standard errors,
                 [U] 23 Estimation and post-estimation commands,
                 [U] 23.11 Obtaining robust variance estimates
Title

predict -- Obtain predictions, residuals, etc., after estimation

Syntax

After single-equation (SE) estimators

    predict [type] newvarname [if exp] [in range] [, xb stdp nooffset
        other_options]

After multiple-equation (ME) estimators

    predict [type] newvarname [if exp] [in range] [, equation(eqno[,eqno]) xb
        stdp stddp nooffset other_options]
Description

predict calculates predictions, residuals, influence statistics, and the like after estimation. Exactly what predict can do is determined by the previous estimation command; command-specific options are documented with each estimation command. Regardless of command-specific options, the actions of predict share certain similarities across estimation commands:

1) Typing predict newvarname creates newvarname containing "predicted values" -- numbers related to the E(y_j|x_j). For instance, after linear regression, predict newvarname creates x_j b and, after probit, creates the probability Φ(x_j b).

2) predict newvarname, xb creates newvarname containing x_j b. This may be the same result as (1) (e.g., linear regression) or different (e.g., probit), but regardless, option xb is allowed.

3) predict newvarname, stdp creates newvarname containing the standard error of the linear prediction x_j b.

4) predict newvarname, other_options may create newvarname containing other useful quantities; see help or the reference manual entry for the particular estimation command to find out about other available options.

5) Adding the nooffset option to any of the above requests that the calculation ignore any offset or exposure variable specified by including the offset(varname) or exposure(varname) options when you estimated the model.

predict can be used to make in-sample or out-of-sample predictions:

6) In general, predict calculates the requested statistic for all possible observations, whether they were used in estimating the model or not. predict does this for standard options (1) through (3), and generally does this for estimator-specific options (4).

7) To restrict the prediction to the estimation subsample, type

    . predict newvarname if e(sample) ...

8) Some statistics make sense only with respect to the estimation subsample. In such cases, the calculation is automatically restricted to the estimation subsample, and the documentation for the specific option states this. Even so, you can still specify if e(sample) if you are uncertain.
9) predict's ability to make out-of-sample predictions even extends to other datasets. In particular, you can

    . use ds1
    (estimate a model)
    . use two             /* another dataset */
    . predict yhat, ...   /* fill in the predictions */
Were we to type predict pmpg now, we would obtain the linear predictions for all 74 observations. To obtain predictions just for the sample on which we estimated the model, we could type

    . predict pmpg if e(sample)
    (option xb assumed; fitted values)
    (52 missing values generated)

In this example, e(sample) is true only for foreign cars because we typed if foreign when we estimated the model, and there are no missing values among the relevant variables. Had there been missing values, e(sample) would also account for those.
By the way, the if e(sample) restriction can be used with any Stata command, so to obtain summary statistics on the estimation sample, we could type

    . summarize if e(sample)
     (output omitted)
Example

Using the same auto dataset, assume that you wish to estimate the model

    mpg = β₁ weight + β₂ weight² + β₃ foreign + β₄

We first create the weight2 variable and then type the regress command:

    . use auto
    (1978 Automobile Data)
    . generate weight2=weight^2
    . regress mpg weight weight2 foreign

          Source |       SS       df       MS           Number of obs =       74
    -------------+------------------------------        F(  3,    70) =    52.25
           Model |  1689.15372     3   563.05124        Prob > F      =   0.0000
        Residual |   754.30574    70  10.7757963        R-squared     =   0.6913
    -------------+------------------------------        Adj R-squared =   0.6781
           Total |  2443.45946    73  33.4720474        Root MSE      =   3.2827

    ------------------------------------------------------------------------------
             mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          weight |  -.0165729   .0039692    -4.18   0.000    -.0244892   -.0086567
         weight2 |   1.59e-06   6.25e-07     2.55   0.013     3.45e-07    2.84e-06
         foreign |    -2.2035   1.059246    -2.08   0.041      -4.3161   -.0909002
           _cons |   56.53884   6.197383     9.12   0.000     44.17855    68.89913
    ------------------------------------------------------------------------------

Were we to type predict pmpg now, we would obtain predictions for all 74 cars in the current data. Instead, we are going to use a new dataset.

The dataset newautos.dta contains the make, weight, and place of manufacture of two cars, the Pontiac Sunbird and the Volvo 260. Let's use the dataset and create the predictions:

    . use newautos
    (New Automobile Models)
    . list

             make            weight    foreign
      1.     Pont. Sunbird     2690    Domestic
      2.     Volvo 260         3170    Foreign

    . predict mpg
    (option xb assumed; fitted values)
    weight2 not found
    r(111);
Things did not work. We typed predict mpg, and Stata responded with the message "weight2 not found". predict can calculate predicted values on a different dataset only if that dataset contains the variables that went into the model. In this case, our data do not contain a variable called weight2. weight2 is just the square of weight, so we can create it and try again:

    . generate weight2=weight^2
    . predict mpg
    (option xb assumed; fitted values)
    . list

             make            weight    foreign     weight2        mpg
      1.     Pont. Sunbird     2690    Domestic    7236100   23.47137
      2.     Volvo 260         3170    Foreign    1.00e+07   17.78846

We obtained our predicted values. The Pontiac Sunbird has a predicted mileage rating of 23.5 mpg, whereas the Volvo 260 has a predicted rating of 17.8 mpg. By way of comparison, the actual mileage ratings are 24 for the Pontiac and 17 for the Volvo.
Residuals

Example

With many estimators, predict can calculate more than predicted values. With most regression-type estimators, we can, for instance, obtain residuals. Using our regression example, we return to our original data and obtain residuals by typing

    . use auto, clear
    (1978 Automobile Data)
    . predict double resid, residuals
    . summarize resid

    Variable |     Obs        Mean    Std. Dev.        Min        Max
    ---------+--------------------------------------------------------
       resid |      74   -1.78e-15    3.214491   -5.636126   13.85172

Notice that we did this without re-estimating the model. Stata always remembers the last set of estimates, even as we use new datasets.

It was not necessary to type the double in predict double resid, residuals; but we wanted to remind you that you can specify the type of a variable in front of the variable's name; see [U] 14.4.2 Lists of new variables. We made the new variable resid a double rather than the default float.

If you want your residuals to have a mean as close to zero as possible, remember to request the extra precision of double. If we had not specified double, the mean of resid would have been roughly 10^-8 rather than 10^-14. Although 10^-14 sounds more precise than 10^-8, the difference really does not matter.

For linear regression, predict can also calculate standardized residuals and studentized residuals with the options rstandard and rstudent; for examples, see [R] regression diagnostics.
Single-equation (SE) estimation

If you have not read the discussion above on using predict after linear regression, please do so. Also note that predict's default calculation almost always produces a statistic in the same metric as the dependent variable of the estimated model, e.g., predicted counts for Poisson regression. In any case, xb can always be specified to obtain the linear prediction.

predict is also willing to calculate the standard error of the prediction, which is obtained by using the inverse-matrix-of-second-derivatives estimate for the covariance matrix of the estimators.
Example

After most binary outcome models (e.g., logistic, logit, probit, cloglog, scobit), predict calculates the probability of a positive outcome if you do not tell it otherwise. You can specify the xb option if you want the linear prediction (also known as the logit or probit index). The odd abbreviation xb is meant to suggest Xβ. In logit and probit models, for example, the predicted probability is p = F(Xβ), where F() is the logistic or normal cumulative distribution function, respectively.

    . logistic foreign mpg weight
     (output omitted)
    . predict phat
    (option p assumed; Pr(foreign))
    . predict idxhat, xb
    . summarize foreign phat idxhat

    Variable |     Obs        Mean    Std. Dev.        Min        Max
    ---------+--------------------------------------------------------
     foreign |      74    .2972973    .4601885          0          1
        phat |      74    .2972973    .3052979    .000729   .8980594
      idxhat |      74   -1.678202    2.321509  -7.223107   2.175845

Since this is a logit model, we could obtain the predicted probabilities ourselves from the predicted index

    . gen phat2 = exp(idxhat)/(1+exp(idxhat))

but using predict without options is easier.
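The identity used by phat2 is worth seeing in isolation. A small Python sketch (illustrative only; not Stata code, and the function name is ours) of the inverse-logit transform maps the extremes of idxhat back to the extremes of phat shown above.

```python
import math

# Illustrative only: invert the logit index, p = exp(xb)/(1 + exp(xb)).
def invlogit(xb):
    return math.exp(xb) / (1.0 + math.exp(xb))

# The min and max of idxhat map to the min and max of phat.
print(invlogit(-7.223107), invlogit(2.175845))
```

For a probit model, the same role is played by the normal cumulative distribution function in place of the inverse logit.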
Example

For all models, predict attempts to produce a predicted value in the same metric as the dependent variable of the model. We have seen that for dichotomous outcome models, the default statistic produced by predict is the probability of a success. Similarly, for Poisson regression, the default statistic produced by predict is the predicted count for the dependent variable. You can always specify the xb option to obtain the linear combination of the coefficients with an observation's x values (the inner product of the coefficients and x values). For poisson (without an explicit exposure), this is the natural log of the count.

    . poisson injuries XYZowned
     (output omitted)
    . predict injhat
    (option n assumed; predicted number of events)
    . predict idx, xb
    . gen exp_idx = exp(idx)
    . summarize injuries injhat exp_idx idx

    Variable |     Obs        Mean    Std. Dev.        Min        Max
    ---------+--------------------------------------------------------
    injuries |       9    7.111111    5.487359          1         19
      injhat |       9    7.111111    .8333333          6   7.666667
     exp_idx |       9    7.111111    .8333333          6   7.666667
         idx |       9    1.955174     .122561   1.791759   2.036882
We note that our "hand-computed" prediction of the count (exp_idx) exactly matches what was produced by the default operation of predict.

If our model has an exposure-time variable, we can use predict to obtain the linear prediction with or without the exposure. Let's verify what we are getting by obtaining the linear prediction with and without exposure, transforming these predictions to count predictions, and comparing them with the default count prediction from predict. We must remember to multiply by the exposure time when using predict ..., nooffset.

    . poisson injuries XYZowned, exposure(n)
     (output omitted)
    . predict double injhat
    (option n assumed; predicted number of events)
    . predict double idx, xb
    . gen double exp_idx = exp(idx)
    . predict double idxn, xb nooffset
    . gen double exp_idxn = exp(idxn)*n
    . summarize injuries injhat exp_idx exp_idxn idx idxn

    Variable |     Obs        Mean    Std. Dev.        Min        Max
    ---------+--------------------------------------------------------
    injuries |       9    7.111111    5.487359          1         19
      injhat |       9    7.111111     3.10936   2.919621   12.06158
     exp_idx |       9    7.111111     3.10936   2.919621   12.06158
    exp_idxn |       9    7.111111     3.10936   2.919621   12.06158
         idx |       9    1.869722    .4671044   1.071454   2.490025
        idxn |       9     4.18814     .190404   4.061204   4.442013

Looking at the identical means and standard deviations for injhat, exp_idx, and exp_idxn, we see that it is possible to reproduce the default computations of predict for poisson estimations. We have also demonstrated the relationship between the count predictions and the linear predictions with and without exposure.
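The bookkeeping in this example reduces to a single identity: the exposure enters the linear prediction as ln(n). The Python sketch below is illustrative only (not Stata internals; function names are ours) and shows why exp(idx) and exp(idxn)*n must agree.

```python
import math

# Illustrative only.  With exposure n, the linear prediction is
# idx = xb + ln(n); with nooffset it is idxn = xb.
def count_with_offset(xb, n):
    idx = xb + math.log(n)     # linear prediction including the offset
    return math.exp(idx)       # predicted count

def count_rescaled(xb, n):
    return math.exp(xb) * n    # exp(idxn)*n, as computed in the example

print(count_with_offset(0.7, 4.0), count_rescaled(0.7, 4.0))
```

Because exp(xb + ln n) = exp(xb)·n, the two computations are algebraically identical, which is exactly what the matching summarize rows above demonstrate.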
Multiple-equation (ME) estimation

If you have not read the above discussion on using predict after SE estimation, please do so. With the exception of the ability to select specific equations to predict from, the use of predict after ME models follows almost exactly the same form as it does for SE models.

Example

The details of prediction statistics that are specific to particular ME models are documented with the estimation command. Users of ME commands that do not have separate discussions on obtaining predictions would also be well-advised to read the predict section in [R] mlogit, even if their interest is not in multinomial logistic regression. As a general introduction to the ME models, we will demonstrate predict after sureg:
    . sureg (price foreign displ) (weight foreign length)

    Seemingly unrelated regression
    ----------------------------------------------------------------------
    Equation             Obs  Parms        RMSE    "R-sq"      chi2      P
    ----------------------------------------------------------------------
    price                 74      2    2202.447    0.4348  45.20554  0.0000
    weight                74      2    245.5238    0.8988  658.8548  0.0000

    ------------------------------------------------------------------------------
                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    price        |
         foreign |   3137.894   697.3805     4.50   0.000     1771.054   4504.735
    displacement |   23.06938   3.443212     6.70   0.000     16.32081   29.81795
           _cons |   680.8438   859.8142     0.79   0.428    -1004.361   2366.049
    -------------+----------------------------------------------------------------
    weight       |
         foreign |   -154.883    75.3204    -2.06   0.040    -302.5082  -7.257674
          length |   30.67594   1.531981    20.02   0.000     27.67331   33.67856
           _cons |  -2699.498   302.3912    -8.93   0.000    -3292.173  -2106.822
    ------------------------------------------------------------------------------

sureg estimated two equations, one called price and the other weight; see [R] sureg.

    . predict pred_p, equation(price)
    (option xb assumed; fitted values)
    . predict pred_w, equation(weight)
    (option xb assumed; fitted values)
    . summarize price pred_p weight pred_w

    Variable |     Obs        Mean    Std. Dev.        Min        Max
    ---------+--------------------------------------------------------
       price |      74    6165.257    2949.496       3291      15906
      pred_p |      74    6165.257    1678.805    2664.81   10485.33
      weight |      74    3019.459    777.1936       1760       4840
      pred_w |      74    3019.459    726.0468   1501.602   4447.996

You may specify the equation by name, as we did above, or by number: equation(#1) means the same thing as equation(price) in this case.
Methods and Formulas

Denote the previously estimated coefficient vector by b and its estimated variance matrix by V. predict works by recalling various aspects of the model, such as b, and combining that information with the data currently in memory. Let us write x_j for the jth observation currently in memory.

The predicted value (xb option) is defined

    yhat_j = x_j b + offset_j

The standard error of the prediction (stdp) is defined

    s_{p_j} = sqrt( x_j V x_j' )

The standard error of the difference in linear predictions between equations 1 and 2 (stddp) is defined

    s_{dp_j} = sqrt( (x_{1j}, -x_{2j}, 0, ..., 0) V (x_{1j}, -x_{2j}, 0, ..., 0)' )

See the individual estimation commands for the computation of command-specific predict statistics.
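Given b and V, these formulas are a few lines of linear algebra. A Python sketch follows, illustrative only (this is not how predict is implemented internally, and the helper names are ours):

```python
import numpy as np

# Illustrative only: the xb and stdp formulas above.
def xb(x, b, offset=0.0):
    return float(np.dot(x, b)) + offset                # x_j b + offset_j

def stdp(x, V):
    x = np.asarray(x, dtype=float)
    V = np.asarray(V, dtype=float)
    return float(np.sqrt(x @ V @ x))                   # sqrt(x_j V x_j')

print(xb([1.0, 2.0], [3.0, 4.0], offset=1.0))          # 3 + 8 + 1
print(stdp([3.0, 4.0], np.eye(2)))                     # sqrt(9 + 16)
```

The stddp formula is the same quadratic form applied to the stacked vector (x_{1j}, -x_{2j}, 0, ..., 0) over the full multiple-equation V.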
Also See

Related:      [R] regress, [R] regression diagnostics, [P] predict

Background:   [U] 23 Estimation and post-estimation commands
Title

probit -- Maximum-likelihood probit estimation

Syntax

    probit depvar [indepvars] [weight] [if exp] [in range] [, level(#) nocoef
        noconstant robust cluster(varname) score(newvarname) asis
        offset(varname) maximize_options ]

    dprobit [depvar indepvars [weight] [if exp] [in range]] [, at(matname)
        classic probit_options ]

by ... : may be used with probit and dprobit; see [R] by.

fweights, iweights, and pweights are allowed; see [U] 14.1.6 weight.

These commands share the features of all estimation commands; see [U] 23 Estimation and post-estimation commands.

probit may be used with sw to perform stepwise estimation; see [R] sw.
Syntax for predict

    predict [type] newvarname [if exp] [in range] [, { p | xb | stdp }
        rules asif nooffset ]

These statistics are available both in and out of sample; type predict ... if e(sample) ... if wanted only for the estimation sample.

Description

probit estimates a maximum-likelihood probit model.

dprobit estimates maximum-likelihood probit models and is an alternative to probit. Rather than reporting the coefficients, dprobit reports the change in the probability for an infinitesimal change in each independent, continuous variable and, by default, reports the discrete change in the probability for dummy variables. You may not specify the noconstant option with dprobit. probit may be typed without arguments after dprobit estimation to see the model in coefficient form.

If estimating on grouped data, see the bprobit command described in [R] glogit.

A number of auxiliary commands may be run after probit, logit, or logistic; see [R] logistic for a description of these commands. See [R] logistic for a list of related estimation commands.
Options

Options for probit

level(#) specifies the confidence level, in percent, for confidence intervals. The default is level(95) or as set by set level; see [U] 23.5 Specifying the width of confidence intervals.

nocoef specifies that the coefficient table is not to be displayed. This option is sometimes used by programmers but is of no use interactively.

noconstant suppresses the constant term (intercept) in the probit model. This option is not available for dprobit.

robust specifies that the Huber/White/sandwich estimator of variance is to be used in place of the traditional calculation; see [U] 23.11 Obtaining robust variance estimates. robust combined with cluster() allows observations which are not independent within cluster (although they must be independent between clusters). If you specify pweights, robust is implied; see [U] 23.13 Weighted estimation.

cluster(varname) specifies that the observations are independent across groups (clusters) but not necessarily within groups. varname specifies to which group each observation belongs; e.g., cluster(personid) in data with repeated observations on individuals. cluster() affects the estimated standard errors and variance-covariance matrix of the estimators (VCE), but not the estimated coefficients; see [U] 23.11 Obtaining robust variance estimates. cluster() can be used with pweights to produce estimates for unstratified cluster-sampled data, but see the svyprobit command in [R] svy estimators for a command designed especially for survey data. cluster() implies robust; specifying robust cluster() is equivalent to typing cluster() by itself.

score(newvarname) creates newvar containing u_j = ∂lnL_j/∂(x_j b) for each observation j in the sample. The score vector is Σ ∂lnL_j/∂b = Σ u_j x_j; i.e., the product of newvar with each covariate summed over observations. See [U] 23.12 Obtaining scores.

asis requests that all specified variables and observations be retained in the maximization process. This option is typically not specified and may introduce numerical instability. Normally probit drops variables that perfectly predict success or failure in the dependent variable. The associated observations are also dropped. In those cases, the effective coefficient on the dropped variables is infinity (negative infinity) for variables that completely determine a success (failure). Dropping the variables and perfectly predicted observations has no effect on the likelihood or estimates of the remaining coefficients and increases the numerical stability of the optimization process. Specifying this option forces retention of perfect predictor variables and their associated perfectly predicted observations.

offset(varname) specifies that varname is to be included in the model with the coefficient constrained to be 1.

maximize_options control the maximization process; see [R] maximize. You should never have to specify them.
Options for dprobit

at(matname) specifies the point around which the transformation of results is to be made. The default is to perform the transformation around x̄, the mean of the independent variables. If there are k independent variables, matname may be 1 × k or 1 × (k + 1); that is, it may optionally include a final element 1 reflecting the constant. at() may be specified when the model is estimated or when results are redisplayed.

classic requests that the mean effects be calculated using the formula f(x̄b)b_i in all cases. If classic is not specified, f(x̄b)b_i is used for continuous variables, but the mean effects for dummy variables are calculated as Φ(x̄₁b) - Φ(x̄₀b). Here x̄₁ = x̄ but with element i set to 1, x̄₀ = x̄ but with element i set to 0, and x̄ is the mean of the independent variables or the vector specified by at(). classic may be specified at estimation time or when the results are redisplayed. Results calculated without classic may be redisplayed with classic, and vice versa.

probit_options are any of the options allowed by probit; see Options for probit, above.
Options for predict

p, the default, calculates the probability of a positive outcome.

xb calculates the linear prediction.

stdp calculates the standard error of the linear prediction.

rules requests that Stata use any "rules" that were used to identify the model when making the prediction. By default, Stata calculates missing for excluded observations.

asif requests that Stata ignore the rules and the exclusion criteria and calculate predictions for all observations possible using the estimated parameters from the model.

nooffset is relevant only if you specified offset(varname) for probit. It modifies the calculations made by predict so that they ignore the offset variable; the linear prediction is treated as x_j b rather than x_j b + offset_j.
Remarks

Remarks are presented under the headings

    Robust standard errors
    dprobit
    Model identification
    Obtaining predicted values
    Performing hypothesis tests

probit performs maximum likelihood estimation of models with dichotomous dependent (left-hand-side) variables coded as 0/1 (or, more precisely, coded as 0 and not 0).
Example

You have data on the make, weight, and mileage rating of 22 foreign and 52 domestic automobiles. You wish to estimate a probit model explaining whether a car is foreign based on its weight and mileage. Here is an overview of your data:

. describe

Contains data from auto.dta
  obs:            74                          1978 Automobile Data
 vars:             4                          7 Jul 2000 13:51
 size:         1,998  (99.7% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
mpg             int    %8.0g                  Mileage (mpg)
weight          int    %8.0gc                 Weight (lbs.)
foreign         byte   %8.0g       origin     Car type
-------------------------------------------------------------------------------
Sorted by:  foreign
     Note:  dataset has changed since last saved

. inspect foreign

foreign:  Car type                        Number of Observations
                                      Total      Integers   Nonintegers
                   Negative               -             -             -
                   Zero                  52            52             -
                   Positive              22            22             -
                                      -----         -----         -----
                   Total                 74            74             -
                   Missing                -
                                      -----
                                         74
   (2 unique values)

foreign is labeled and all values are documented in the label.

The variable foreign takes on two unique values, 0 and 1. The value 0 denotes a domestic car, and 1 denotes a foreign car.
The model you wish to estimate is

    Pr(foreign = 1) = Φ(β₀ + β₁ weight + β₂ mpg)

where Φ is the cumulative normal distribution. To estimate this model, you type

. probit foreign weight mpg
Iteration 0:   log likelihood =  -45.03321
Iteration 1:   log likelihood = -29.244141
 (output omitted)
Iteration 5:   log likelihood = -26.844189

Probit estimates                                  Number of obs   =         74
                                                  LR chi2(2)      =      36.38
                                                  Prob > chi2     =     0.0000
Log likelihood = -26.844189                       Pseudo R2       =     0.4039

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0023355   .0005254    -4.45   0.000    -.0033653   -.0013057
         mpg |  -.1039503   .0515689    -2.02   0.044    -.2050235   -.0028772
       _cons |   8.275464   2.554142     3.24   0.001     3.269438    13.28149
------------------------------------------------------------------------------
You find that heavier cars are less likely to be foreign and that cars yielding better gas mileage are also less likely to be foreign, at least holding the weight of the car constant.
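As an arithmetic check, the fitted probability for any car can be computed directly from the reported coefficients as Φ(xb). The following is a Python sketch of that calculation, not Stata; the 3,000-lb., 20-mpg car is invented purely for illustration:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Estimated probit coefficients from the output above
b_weight, b_mpg, b_cons = -0.0023355, -0.1039503, 8.275464

# Hypothetical car: 3,000 lbs., 20 mpg (made up for illustration)
xb = b_weight * 3000 + b_mpg * 20 + b_cons   # the probit index
p_foreign = norm_cdf(xb)                     # predicted Pr(foreign)
print(round(xb, 3), round(p_foreign, 3))     # -> -0.81 0.209
```

That is, such a car would have roughly a 21% predicted probability of being foreign.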
See [R] maximize for an explanation of the output.

Robust standard errors

If you specify the robust option, probit reports robust standard errors instead:

. probit foreign weight mpg, robust

Probit estimates                                  Number of obs   =         74
                                                  Wald chi2(2)    =      30.26
                                                  Prob > chi2     =     0.0000
Log likelihood = -26.844189                       Pseudo R2       =     0.4039

------------------------------------------------------------------------------
             |               Robust
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0023355   .0004934    -4.73   0.000    -.0033025   -.0013686
         mpg |  -.1039503   .0593548    -1.75   0.080    -.2202836    .0123829
       _cons |   8.275464   2.539176     3.26   0.001      3.29877    13.25216
------------------------------------------------------------------------------
Without robust, the standard error for the coefficient on mpg was reported to be .052 with a resulting confidence interval of [-.21, -.00].

robust with the cluster() option has the ability to relax the independence assumption required by the probit estimator to being just independence between clusters. To demonstrate this, we will switch to a different dataset. You are studying unionization of women in the United States and have a dataset with 26,200 observations on 4,434 women between 1970 and 1988. For our purposes, we will use the variables age (the women were 14-26 in 1968, and your data thus span the age range of 16-46), grade (years of schooling completed, ranging from 0 to 18), not_smsa (28% of the person-time was spent living outside an SMSA, a standard metropolitan statistical area), south (41% of the person-time was in the South), and southXt (south interacted with year, treating 1970 as year 0). You also have the variable union. Overall, 22% of the person-time is marked as time under union membership, and 44% of these women have belonged to a union.
You estimate the following model, ignoring that the women are observed an average of 5.9 times each in these data:

. probit union age grade not_smsa south southXt
Iteration 0:   log likelihood =  -13864.23
Iteration 1:   log likelihood = -13548.436
Iteration 2:   log likelihood = -13547.308
Iteration 3:   log likelihood = -13547.308

Probit estimates                                  Number of obs   =      26200
                                                  LR chi2(5)      =     633.84
                                                  Prob > chi2     =     0.0000
Log likelihood = -13547.308                       Pseudo R2       =     0.0229

------------------------------------------------------------------------------
       union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0059461   .0015798     3.76   0.000     .0028496    .0090425
       grade |   .0263901   .0036651     7.20   0.000     .0192066    .0335735
    not_smsa |  -.1303911   .0202523    -6.44   0.000    -.1700848   -.0906975
       south |  -.4027254    .033989   -11.85   0.000    -.4693426   -.3361081
     southXt |   .0033088   .0029253     1.13   0.258    -.0024247    .0090423
       _cons |  -1.113091   .0657808   -16.92   0.000    -1.242019   -.9841628
------------------------------------------------------------------------------
The reported standard errors in this model are probably meaningless. Women are observed repeatedly, and so the observations are not independent. Looking at the coefficients, you find a large southern effect against unionization and little time trend. The robust and cluster() options provide a way to estimate this model and obtain correct standard errors:

. probit union age grade not_smsa south southXt, robust cluster(id)
Iteration 0:   log likelihood =  -13864.23
Iteration 1:   log likelihood = -13548.436
Iteration 2:   log likelihood = -13547.308
Iteration 3:   log likelihood = -13547.308

Probit estimates                                  Number of obs   =      26200
                                                  Wald chi2(5)    =     165.75
                                                  Prob > chi2     =     0.0000
Log likelihood = -13547.308                       Pseudo R2       =     0.0229

                            (standard errors adjusted for clustering on idcode)
------------------------------------------------------------------------------
             |               Robust
       union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0059461   .0023567     2.52   0.012      .001327    .0105651
       grade |   .0263901   .0078378     3.37   0.001     .0110282    .0417518
    not_smsa |  -.1303911   .0404109    -3.23   0.001     -.209595   -.0511873
       south |  -.4027254   .0514458    -7.83   0.000    -.5035573   -.3018935
     southXt |   .0033088   .0039793     0.83   0.406    -.0044904    .0111081
       _cons |  -1.113091   .1188478    -9.37   0.000    -1.346028   -.8801534
------------------------------------------------------------------------------
These standard errors are roughly 50% larger than those reported by the inappropriate conventional calculation. By comparison, another model we could estimate is an equal-correlation population-averaged probit model:
. xtprobit union age grade not_smsa south southXt, i(id) pa
Iteration 1: tolerance = .04796083
Iteration 2: tolerance = .00352657
Iteration 3: tolerance = .00017886
Iteration 4: tolerance = 8.654e-06
Iteration 5: tolerance = 4.150e-07

GEE population-averaged model                     Number of obs      =   26200
Group variable:              idcode               Number of groups   =    4434
Link:                        probit               Obs per group: min =       1
Family:                    binomial                              avg =     5.9
Correlation:           exchangeable                              max =      12
                                                  Wald chi2(5)       =  241.66
Scale parameter:                  1               Prob > chi2        =  0.0000

------------------------------------------------------------------------------
       union |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0031597   .0014678     2.15   0.031     .0002829    .0060366
       grade |   .0329992   .0062334     5.29   0.000      .020782    .0452163
    not_smsa |  -.0721799   .0275189    -2.62   0.009    -.1261159   -.0182439
       south |   -.409029   .0372213   -10.99   0.000    -.4819815   -.3360765
     southXt |   .0081828    .002545     3.22   0.001     .0031946    .0131709
       _cons |  -1.184799   .0890117   -13.31   0.000    -1.359259    -1.01034
------------------------------------------------------------------------------
The coefficient estimates are similar, but these standard errors are smaller than those produced by probit, robust cluster(). This is as we would expect. If the equal-correlation assumption is valid, the population-averaged probit estimator above should be more efficient.

Is the assumption valid? That is a difficult question to answer. The population-averaged estimates correspond to an assumption of exchangeable correlation within person. It would not be unreasonable to assume an AR(1) correlation within person or to assume that the observations are correlated, but that we do not wish to impose any structure. See [R] xtgee for full details.

What is important to understand is that probit, robust cluster() is robust to assumptions about within-cluster correlation. That is, it inefficiently sums within cluster for the standard error calculation rather than attempting to exploit what might be assumed about the within-cluster correlation.
dprobit

A probit model is defined

    Pr(yⱼ ≠ 0 | xⱼ) = Φ(xⱼb)

where Φ is the standard cumulative normal distribution and xⱼb is called the probit score or index.

Since xⱼb has a normal distribution, interpreting probit coefficients requires thinking in the Z (normal quantile) metric. For instance, pretend we estimated the probit equation

    Pr(yⱼ ≠ 0) = Φ(.08233x₁ + 1.529x₂ − 3.139)

The interpretation of the x₁ coefficient is that each one-unit increase in x₁ leads to increasing the probit index by .08233 standard deviations. Learning to think in the Z metric takes practice and, even if you do, communicating results to others who have not learned to think this way is difficult.

A transformation of the results helps some people think about them. The change in the probability somehow feels more natural, but how big that change is depends on where we start. Why not choose as a starting point the mean of the data? If x̄₁ = 21.29 and x̄₂ = .42, then we would report something like .0257, meaning the change in the probability calculated at the mean. We could make the calculation as follows. The mean normal index is .08233 × 21.29 + 1.529 × .42 − 3.139 = −.7440, and the corresponding probability is Φ(−.7440) = .2284. Adding our x₁ coefficient of .08233 to the index and recalculating the probability, we obtain Φ(−.7440 + .08233) = .2541. Thus, the change in the probability is .2541 − .2284 = .0257.
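Both the exact one-unit change (.0257) and the density-based slope (.0249) are easy to verify. This is a Python sketch of the arithmetic, not Stata:

```python
from math import erf, exp, pi, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def norm_pdf(z):
    """Standard normal density."""
    return exp(-z * z / 2.0) / sqrt(2.0 * pi)

# Coefficients and means from the illustration above
b1, b2, b0 = 0.08233, 1.529, -3.139
x1bar, x2bar = 21.29, 0.42

index = b1 * x1bar + b2 * x2bar + b0            # mean probit index, about -.7440
exact = norm_cdf(index + b1) - norm_cdf(index)  # exact one-unit change: about .0257
slope = norm_pdf(index) * b1                    # infinitesimal version: about .0249
```

The slope calculation is the one dprobit's classic option reports for all variables.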
In practice, people make this calculation somewhat differently and produce a slightly different number. Rather than making the calculation for a one-unit change in x₁, they calculate the slope of the probability function. Doing a little calculus, they derive that the change in the probability for a (small) change in x₁ is the height of the normal density multiplied by the x₁ coefficient; that is,

    ∂p/∂x₁ = f(x̄b)b₁

where f is the normal density. Going through this calculation, they obtain .0249. The difference between .0257 and .0249 is not much. They differ because .0257 is the exact answer for a one-unit increase in x₁, whereas .0249 is the answer for an infinitesimal change extrapolated out.

dprobit with the classic option transforms results as an infinitesimal change extrapolated out.

Example

Consider the automobile data again:
. use auto, clear
(1978 Automobile Data)

. gen goodplus = rep78>=4 if rep78~=.
(5 missing values generated)

. dprobit foreign mpg goodplus, classic
Iteration 0:   log likelihood = -42.400729
Iteration 1:   log likelihood = -27.643138
Iteration 2:   log likelihood = -26.953126
Iteration 3:   log likelihood = -26.942119
Iteration 4:   log likelihood = -26.942114

Probit estimates                                  Number of obs   =         69
                                                  LR chi2(2)      =      30.92
                                                  Prob > chi2     =     0.0000
Log likelihood = -26.942114                       Pseudo R2       =     0.3646

------------------------------------------------------------------------------
 foreign |      dF/dx   Std. Err.      z    P>|z|     x-bar   [    95% C.I.   ]
---------+--------------------------------------------------------------------
     mpg |   .0249187   .0110853     2.30   0.022   21.2899    .003192  .046646
goodplus |     .46276   .1187437     3.81   0.000    .42029    .230027  .695493
   _cons |  -.9499603   .2281006    -3.82   0.000         1   -1.39703 -.502891
---------+--------------------------------------------------------------------
  obs. P |   .3043478
 pred. P |   .2286624  (at x-bar)
------------------------------------------------------------------------------
z and P>|z| are the test of the underlying coefficient being 0
After estimation with dprobit, the untransformed coefficient results can be seen by typing probit without options:

. probit

Probit estimates                                  Number of obs   =         69
                                                  LR chi2(2)      =      30.92
                                                  Prob > chi2     =     0.0000
Log likelihood = -26.942114                       Pseudo R2       =     0.3646

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |     .08233   .0358292     2.30   0.022     .0121091     .152557
    goodplus |   1.528992   .4010866     3.81   0.000     .7428771    2.315108
       _cons |  -3.138737   .8209689    -3.82   0.000    -4.747807   -1.529668
------------------------------------------------------------------------------
There is one case in which one can argue that the classic, infinitesimal-change-based adjustment could be improved on, and that is in the case of a dummy variable. A dummy variable is a variable that takes on the values 0 and 1 only; 1 indicates that something is true and 0 that it is not. goodplus is such a variable. It is natural to summarize its effect by asking how much goodplus being true changes the outcome probability over that of goodplus being false. That is, "at the means", the predicted probability of foreign for a car with goodplus = 0 is Φ(.08233x̄₁ − 3.139) = .0829. For the same car with goodplus = 1, the probability is Φ(.08233x̄₁ + 1.529 − 3.139) = .5569. The difference is thus .5569 − .0829 = .4740. When we do not specify the classic option, dprobit makes the calculation for dummy variables in this way. Even though we estimated the model with the classic option, we can redisplay results with it omitted:
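The discrete-change calculation for the dummy variable can be verified the same way. This is a Python sketch of the arithmetic (not Stata output), using the rounded coefficients from the text:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Rounded coefficients and the sample mean of mpg from the text
b_mpg, b_good, b_cons = 0.08233, 1.529, -3.139
mpg_bar = 21.2899

p_good0 = norm_cdf(b_mpg * mpg_bar + b_cons)           # goodplus = 0: about .0829
p_good1 = norm_cdf(b_mpg * mpg_bar + b_good + b_cons)  # goodplus = 1: about .5569
effect = p_good1 - p_good0                             # discrete change: about .4740
```

This .4740 is exactly the goodplus entry dprobit reports when classic is omitted.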
. dprobit

Probit estimates                                  Number of obs   =         69
                                                  LR chi2(2)      =      30.92
                                                  Prob > chi2     =     0.0000
Log likelihood = -26.942114                       Pseudo R2       =     0.3646

------------------------------------------------------------------------------
  foreign |      dF/dx   Std. Err.      z    P>|z|     x-bar  [    95% C.I.   ]
----------+--------------------------------------------------------------------
      mpg |   .0249187   .0110853     2.30   0.022   21.2899   .003192  .046646
goodplus* |   .4740077   .1114816     3.81   0.000    .42029   .255508  .692508
----------+--------------------------------------------------------------------
   obs. P |   .3043478
  pred. P |   .2286624  (at x-bar)
------------------------------------------------------------------------------
(*) dF/dx is for discrete change of dummy variable from 0 to 1
    z and P>|z| are the test of the underlying coefficient being 0
Technical Note

at(matname) allows you to evaluate effects at points other than the means. Let's obtain the effects for the above model at mpg = 20 and goodplus = 1:

. matrix myx = (20,1)

. dprobit, at(myx)

Probit estimates                                  Number of obs   =         69
                                                  LR chi2(2)      =      30.92
                                                  Prob > chi2     =     0.0000
Log likelihood = -26.942114                       Pseudo R2       =     0.3646

------------------------------------------------------------------------------
  foreign |      dF/dx   Std. Err.      z    P>|z|       x    [    95% C.I.   ]
----------+--------------------------------------------------------------------
      mpg |   .0328237   .0144157     2.30   0.022        20   .004569  .061078
goodplus* |   .4468843   .1130835     3.81   0.000         1   .225245  .668524
----------+--------------------------------------------------------------------
   obs. P |   .3043478
  pred. P |   .2286624  (at x-bar)
  pred. P |   .5147238  (at x)
------------------------------------------------------------------------------
(*) dF/dx is for discrete change of dummy variable from 0 to 1
    z and P>|z| are the test of the underlying coefficient being 0
Model identification

The probit command has one more feature, and it is probably the most useful. It will automatically check the model for identification and, if it is underidentified, drop whatever variables and observations are necessary for estimation to proceed.

Example

Have you ever estimated a probit model where one or more of your independent variables perfectly predicted one or the other outcome? For instance, consider the following small amount of data:

    Outcome y    Independent variable x
        0                  1
        0                  1
        0                  0
        1                  0

Let's imagine we wish to predict the outcome on the basis of the independent variable. Notice that the outcome is always zero whenever the independent variable is one. In our data, Pr(y = 0 | x = 1) = 1, which in turn means that the probit coefficient on x must be minus infinity with a corresponding infinite standard error. At this point, you may suspect we have a problem.

Unfortunately, not all such problems are so easily detected, especially if you have a lot of independent variables in your model. If you have ever had such difficulties, then you have experienced one of the more unpleasant aspects of computer optimization. The computer has no idea that it is trying to solve for an infinite coefficient as it begins its iterative process. All it knows is that, at each step, making the coefficient a little bigger, or a little smaller, works wonders. It continues on its merry way until either (1) the whole thing comes crashing to the ground when a numerical overflow error occurs or (2) it reaches some predetermined cutoff that stops the process. Meanwhile, you have been waiting. In addition, the estimates that you finally receive, if any, may be nothing more than numerical roundoff.

Stata watches for these sorts of problems, alerts you, fixes them, and then properly estimates the model.
Let's return to our automobile data. Among the variables we have in the data is one called repair that takes on three values. A value of 1 indicates that the car has a poor repair record, 2 indicates an average record, and 3 indicates a better-than-average record. Here is a tabulation of our data:

     Car |            repair
    type |       1        2        3 |    Total
---------+----------------------------+--------
Domestic |      10       27        9 |      46
 Foreign |       0        3        9 |      12
---------+----------------------------+--------
   Total |      10       30       18 |      58

Notice that all the cars with poor repair records (repair==1) are domestic. If we were to attempt to predict foreign on the basis of the repair records, the predicted probability for the repair==1 category would have to be zero. This in turn means that the probit coefficient must be minus infinity, and that would set most computer programs buzzing.

Let's try Stata on this problem. First, we make up two new variables, rep_is_1 and rep_is_2, that indicate the repair category.
. generate rep_is_1 = repair==1

. generate rep_is_2 = repair==2

The statement generate rep_is_1 = repair==1 creates a new variable, rep_is_1, that takes on the value 1 when repair is 1 and zero otherwise. Similarly, the next generate statement creates rep_is_2, which takes on the value 1 when repair is 2 and zero otherwise. We are now ready to estimate our model:

. probit foreign rep_is_1 rep_is_2
note: rep_is_1~=0 predicts failure perfectly
      rep_is_1 dropped and 10 obs not used
Iteration 0:   log likelihood = -26.992087
Iteration 1:   log likelihood = -22.276479
Iteration 2:   log likelihood = -22.229184
Iteration 3:   log likelihood = -22.229138

Probit estimates                                  Number of obs   =         48
                                                  LR chi2(1)      =       9.53
                                                  Prob > chi2     =     0.0020
Log likelihood = -22.229138                       Pseudo R2       =     0.1765

------------------------------------------------------------------------------
     foreign |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rep_is_2 |  -1.281552   .4297324    -2.98   0.003    -2.123812   -.4392916
       _cons |   1.21e-16    .295409     0.00   1.000     -.578991     .578991
------------------------------------------------------------------------------
Remember that all the cars with poor repair records (rep_is_1) are domestic, so the model cannot be estimated, or at least it cannot be estimated if we restrict ourselves to finite coefficients. Stata noted that fact. It said, "note: rep_is_1~=0 predicts failure perfectly". This is Stata's mathematically precise way of saying what we said in English. When rep_is_1 is not equal to 0, the car is domestic. Stata then went on to say, "rep_is_1 dropped and 10 obs not used". This is Stata eliminating the problem. First, the variable rep_is_1 had to be removed from the model because it would have an infinite coefficient. Then, the 10 observations that led to the problem had to be eliminated as well so as not to bias the remaining coefficients in the model. The 10 observations that are not used are the 10 domestic cars that have poor repair records. Finally, Stata estimated what was left of the model, which is all that can be estimated.
Technical Note

Stata is pretty smart about catching these problems. It will catch "one-way causation by a dummy variable", as we demonstrated above.

Stata also watches for "two-way causation"; that is, a variable that perfectly determines the outcome, both successes and failures. In this case Stata says, "so-and-so predicts outcome perfectly" and stops. Statistics dictates that no model can be estimated.

Stata also checks your data for collinear variables; it will say "so-and-so dropped due to collinearity". No observations need to be eliminated in this case, and model estimation will proceed without the offending variable.

It will also catch a subtle problem that can arise with continuous data. For instance, if we were estimating the chances of surviving the first year after an operation, and if we included age in our model, and if all the persons over 65 died within the year, Stata will say, "age > 65 predicts failure perfectly". It will then inform us about the fixup it takes and estimate what can be estimated of our model.
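For the dummy-variable case, the idea behind the check can be mimicked in a few lines. The following Python sketch (the function name and toy data are invented for illustration, and Stata's actual algorithm is more general) flags a 0/1 regressor whose nonzero values all share one outcome:

```python
def find_perfect_predictors(y, X, names):
    """Flag dummy regressors where x != 0 perfectly predicts one outcome."""
    flagged = []
    for j, name in enumerate(names):
        # Outcomes observed whenever this regressor is nonzero
        outcomes = {y[i] for i in range(len(y)) if X[i][j] != 0}
        if outcomes == {0}:
            flagged.append((name, "predicts failure perfectly"))
        elif outcomes == {1}:
            flagged.append((name, "predicts success perfectly"))
    return flagged

# Toy data mirroring the repair-record example:
# every car with rep_is_1 == 1 is domestic (y == 0)
y = [0, 0, 0, 1, 1, 0]
X = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0], [0, 0]]
print(find_perfect_predictors(y, X, ["rep_is_1", "rep_is_2"]))
# -> [('rep_is_1', 'predicts failure perfectly')]
```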
probit (and logit and logistic) will also occasionally display messages such as

note: 4 failures and 0 successes completely determined.

The cause of this message and what to do if you get it are described in [R] logit.
Obtaining predicted values

Once you have estimated a probit model, you can obtain the predicted probabilities using the predict command for both the estimation sample and other samples; see [U] 23 Estimation and post-estimation commands and [R] predict. Here we will make only a few additional comments.

predict without arguments calculates the predicted probability of a positive outcome. With the xb option, it calculates the linear combination xⱼb, where xⱼ are the independent variables in the jth observation and b is the estimated parameter vector. This is known as the index function since the cumulative density indexed at this value is the probability of a positive outcome.

In both cases, Stata remembers any "rules" used to identify the model and calculates missing for excluded observations unless rules or asif is specified. This is covered in the following example.

With the stdp option, predict calculates the standard error of the prediction, which is not adjusted for replicated covariate patterns in the data. One can calculate the unadjusted-for-replicated-covariate-patterns diagonal elements of the hat matrix, or leverage, by typing

. predict pred

. predict stdp, stdp

. generate hat = stdp^2*pred*(1-pred)
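The leverage recipe is plain arithmetic on predict's outputs. This hypothetical Python sketch (the coefficient vector and covariance matrix are made up for illustration; stdp is computed as the square root of x V x') shows the same calculation for a single observation:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def quad_form(x, V):
    """x V x' for a row vector x and a square matrix V."""
    n = len(x)
    return sum(x[r] * V[r][c] * x[c] for r in range(n) for c in range(n))

# Hypothetical fitted model: coefficient vector b and its covariance matrix V
b = [0.8, -1.2]
V = [[0.04, -0.01],
     [-0.01, 0.09]]

def leverage(x):
    xb = sum(xi * bi for xi, bi in zip(x, b))  # linear prediction
    stdp = sqrt(quad_form(x, V))               # standard error of the prediction
    pred = norm_cdf(xb)                        # predicted probability
    return stdp ** 2 * pred * (1 - pred)       # hat = stdp^2 * pred * (1 - pred)

h = leverage([1.0, 1.0])
```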
Example

In the previous example, we estimated the probit model probit foreign rep_is_1 rep_is_2. To obtain predicted probabilities,

. predict p
(option p assumed; Pr(foreign))
(10 missing values generated)

. summarize foreign p

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+------------------------------------------------------
     foreign |      58    .2068966   .4086186          0          1
           p |      48         .25   .1956984         .1         .5

Stata remembers any "rules" used to identify the model and sets predictions to missing for any excluded observations. In the previous example, probit dropped the variable rep_is_1 from our model and excluded 10 observations. Thus, when we typed predict p, those same 10 observations were again excluded and their predictions set to missing.

predict's rules option will use the rules in the prediction. During estimation, we were told "rep_is_1~=0 predicts failure perfectly", so the rule is that when rep_is_1 is not zero, one should predict 0 probability of success or a positive outcome:

. predict p2, rules

. summarize foreign p p2

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+------------------------------------------------------
     foreign |      58    .2068966   .4086186          0          1
           p |      48         .25   .1956984         .1         .5
          p2 |      58    .2068966   .2016268          0         .5
predict's asif option will ignore the rules and the exclusion criteria, and calculate predictions for all observations possible using the estimated parameters from the model:

. predict p3, asif

. summarize foreign p p2 p3

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+------------------------------------------------------
     foreign |      58    .2068966   .4086186          0          1
           p |      48         .25   .1956984         .1         .5
          p2 |      58    .2068966   .2016268          0         .5
          p3 |      58    .2931034   .2016268         .1         .5

Which is right? By default, predict uses the most conservative approach. If a large number of observations had been excluded due to a simple rule, one could be reasonably certain that the rules prediction is correct. The asif prediction is only correct if the exclusion is a fluke and you would be willing to exclude the variable from the analysis anyway. In that case, however, you should re-estimate the model to include the excluded observations.
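The three policies differ only in how they treat observations excluded by a rule. This hypothetical Python function (an invented sketch, not Stata's implementation) mirrors the default, rules, and asif behaviors for a rule that predicts failure:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def predict_probs(xb, excluded, mode="default"):
    """Predicted probabilities under predict's three policies.

    excluded[j] is True where the identification rule applied
    (here: a regressor value that predicts failure perfectly).
    """
    out = []
    for xb_j, ex in zip(xb, excluded):
        if ex and mode == "default":
            out.append(None)          # Stata's missing
        elif ex and mode == "rules":
            out.append(0.0)           # the rule says Pr(success) = 0
        else:                         # asif, or observation not excluded
            out.append(norm_cdf(xb_j))
    return out

# Made-up indexes; the third observation triggered the rule
xb = [-0.5, 0.2, -1.3]
excluded = [False, False, True]
```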
Performing hypothesis tests

After estimation with probit, you can perform hypothesis tests using the test or testnl commands; see [U] 23 Estimation and post-estimation commands.
Saved Results

probit saves in e():

Scalars
    e(N)           number of observations
    e(df_m)        model degrees of freedom
    e(r2_p)        pseudo R-squared
    e(ll)          log likelihood
    e(ll_0)        log likelihood, constant-only model
    e(N_clust)     number of clusters
    e(chi2)        chi-squared

Macros
    e(cmd)         probit
    e(depvar)      name of dependent variable
    e(wtype)       weight type
    e(wexp)        weight expression
    e(clustvar)    name of cluster variable
    e(vcetype)     covariance estimation method
    e(chi2type)    Wald or LR; type of model chi-squared test
    e(predict)     program used to implement predict

Matrices
    e(b)           coefficient vector
    e(V)           variance-covariance matrix of the estimators

Functions
    e(sample)      marks estimation sample
dprobit saves in e():

Scalars
    e(N)           number of observations
    e(df_m)        model degrees of freedom
    e(r2_p)        pseudo R-squared
    e(ll)          log likelihood
    e(ll_0)        log likelihood, constant-only model
    e(N_clust)     number of clusters
    e(chi2)        chi-squared
    e(pbar)        fraction of successes observed in data
    e(xbar)        average probit score
    e(offbar)      average offset

Macros
    e(cmd)         dprobit
    e(depvar)      name of dependent variable
    e(wtype)       weight type
    e(wexp)        weight expression
    e(clustvar)    name of cluster variable
    e(vcetype)     covariance estimation method
    e(chi2type)    Wald or LR; type of model chi-squared test
    e(predict)     program used to implement predict
    e(dummy)       string of blank-separated 0s and 1s; 0 means the corresponding
                   independent variable is not a dummy, 1 means that it is

Matrices
    e(b)           coefficient vector
    e(V)           variance-covariance matrix of the estimators
    e(dfdx)        marginal effects
    e(se_dfdx)     standard errors of the marginal effects

Functions
    e(sample)      marks estimation sample
Methods and Formulas

Probit analysis originated in connection with bioassay, and the word probit, a contraction of "probability unit", was suggested by Bliss (1934). For an introduction to probit, see, for example, Aldrich and Nelson (1984), Hamilton (1992), Johnston and DiNardo (1997), or Powers and Xie (2000).

prtest -- One- and two-sample tests of proportions

Remarks

The prtest output follows the output of ttest in providing a lot of information. Each proportion is presented along with a confidence interval. The appropriate one- or two-sample test is performed, and the two-sided and both one-sided results are included at the bottom of the output. In the case of a two-sample test, the calculated difference is also presented with its confidence interval. This command may be used for both large-sample testing and large-sample interval estimation.
Example

In the first form, prtest tests whether the proportion in the sample is equal to a known constant. Assume you have a sample of 74 automobiles. You wish to test whether the proportion of automobiles that are foreign is different from 40 percent.

. prtest foreign=.4

One-sample test of proportion            foreign: Number of obs =          74
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z     P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |   .2972973   .0531331    5.59533  0.0000    .1931583    .4014363
------------------------------------------------------------------------------
          Ho: proportion(foreign) = .4

 Ha: foreign < .4           Ha: foreign ~= .4           Ha: foreign > .4
     z = -1.803                 z = -1.803                  z = -1.803
     P < z = 0.0357             P > |z| = 0.0713            P > z = 0.9643

The test indicates that we cannot reject the hypothesis that the proportion of foreign automobiles is .40 at the 5% significance level.
Izl
- proportion(cure2) Ha:
z = -2.060 P < z =
of obs of obs
z
proportion(curel)
diff
Izl
diff
~= 0
= -2.060 =
0.0394
are statistically
=diff
Interval]
= 0
Ha:
diff
> 0
z = -2.060 P > z =
0.9803
different from each other at any level greater than
_j
prtest -- One-;and two-sampletests of proportions
i
597
Immediate form

Example

prtesti is like prtest, except that you specify summary statistics rather than variables as arguments. For instance, you are reading an article which reports the proportion of registered voters among 50 randomly selected eligible voters as .52. You wish to test whether the proportion is .7:

. prtesti 50 .52 .70

One-sample test of proportion                  x: Number of obs =          50
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z     P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |        .52   .0706541     7.3598  0.0000    .3815205    .6584795
------------------------------------------------------------------------------
          Ho: proportion(x) = .7

 Ha: x < .7                 Ha: x ~= .7                 Ha: x > .7
     z = -2.777                 z = -2.777                  z = -2.777
     P < z = 0.0027             P > |z| = 0.0055            P > z = 0.9973
Example

In order to judge teacher effectiveness, we wish to test whether the same proportion of people from two classes will answer an advanced question correctly. In the first classroom of 30 students, 40% answered the question correctly, whereas in the second classroom of 45 students, 67% answered the question correctly.

. prtesti 30 .4 45 .67

Two-sample test of proportion                  x: Number of obs =          30
                                               y: Number of obs =          45
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z     P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |         .4   .0894427                       .2246955    .5753045
           y |        .67   .0700952                       .5326160    .8073840
-------------+----------------------------------------------------------------
        diff |       -.27   .1136368                      -.4927242   -.0472758
------------------------------------------------------------------------------
          Ho: proportion(x) - proportion(y) = diff = 0

 Ha: diff < 0               Ha: diff ~= 0               Ha: diff > 0
     z = -2.309                 z = -2.309                  z = -2.309
     P < z = 0.0105             P > |z| = 0.0210            P > z = 0.9895
Saved Results

prtest saves in r():

Scalars
    r(z)       z statistic
    r(P_#)     proportion for variable #
    r(N_#)     number of observations for variable #
Methods and Formulas

prtest and prtesti are implemented as ado-files.

A large-sample (1 − α)100% confidence interval for a proportion p is

    p̂ ± z₁₋α/₂ √(p̂q̂/n)

and a (1 − α)100% confidence interval for the difference of two proportions is given by

    (p̂₁ − p̂₂) ± z₁₋α/₂ √(p̂₁q̂₁/n₁ + p̂₂q̂₂/n₂)

where q̂ = 1 − p̂ and z is calculated from the inverse normal distribution.

The one-tailed and two-tailed tests of a population proportion use a normally distributed test statistic calculated as

    z = (p̂ − p₀) / √(p₀q₀/n)

where p₀ is the hypothesized proportion. A test of the difference of two proportions also uses a normally distributed test statistic calculated as

    z = (p̂₁ − p̂₂) / √( p̂q̂ (1/n₁ + 1/n₂) )

where

    p̂ = (x₁ + x₂) / (n₁ + n₂)

and x₁ and x₂ are the total numbers of successes in the two samples.
Also See

Related:      [R] bitest, [R] ci, [R] hotel, [R] oneway, [R] sdtest, [R] signrank, [R] ttest

Background:   [U] 22 Immediate commands